Abstract
PURPOSE: Many studies in medical research are now based on large-scale health surveys. Data in these surveys are usually collected under complex sampling designs, which involve techniques such as stratification and clustering. Such data require special care, because traditional statistical techniques are usually not valid in this context. In this study, we focus on estimating the discrimination ability of logistic regression models by means of the area under the receiver operating characteristic (ROC) curve (AUC). An AUC estimator that accounts for complex sampling designs has recently been proposed. The purpose of this study is to compare the performance of the traditional and the new design-based AUC estimators when estimating the AUC of logistic regression models fitted to health data from complex sampling designs.
METHODS: A simulation study was carried out to compare the performance of the traditional and design-based AUC estimators when working with complex survey data. For this purpose, the population of COVID-19 patients in the Basque Country was considered. This population was sampled repeatedly under different sampling designs; a logistic regression model was fitted to each sample, and the AUC was estimated with both the traditional and the design-based estimators. These estimates were then compared with the true population AUC.
RESULTS: The design-based AUC estimator yields unbiased results, whereas the traditional AUC estimator may be biased depending on the scenario. Both the sampling design and the variables used in that design affect the performance of the estimators. In particular, the type of design affects the variability of both estimators, which is larger when clustering is involved. In addition, the stronger the relationship between the design variables and the outcome, the more biased the traditional AUC estimator is.
CONCLUSION: The use of the design-based AUC estimator is recommended over the traditional one when working with complex survey data in order to avoid biased AUC estimates.
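As an illustration of the contrast between the two estimators, the following minimal Python sketch computes a pairwise AUC with and without sampling weights. It is not the implementation used in the study: the function name `weighted_auc`, the use of the product of the two units' sampling weights as the weight of each case/non-case pair, and the scoring of ties as one half are all assumptions made for illustration. With all weights equal, the function reduces to the traditional unweighted estimate.

```python
import numpy as np

def weighted_auc(scores, labels, weights=None):
    """Pairwise AUC estimate; hypothetical helper, not the study's code.

    scores  : predicted probabilities from a fitted model
    labels  : 1 for units with the event, 0 for units without it
    weights : sampling weights (assumed here to enter as pairwise products);
              None gives the traditional unweighted estimate
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    if weights is None:
        weights = np.ones_like(scores)
    weights = np.asarray(weights, dtype=float)

    pos, neg = labels == 1, labels == 0
    s1, w1 = scores[pos], weights[pos]
    s0, w0 = scores[neg], weights[neg]

    # Compare every unit with the event against every unit without it.
    diff = s1[:, None] - s0[None, :]
    # Concordant pairs count 1; ties count 1/2 (an assumed convention).
    concordant = (diff > 0) + 0.5 * (diff == 0)
    # Each pair is weighted by the product of its two sampling weights.
    pair_w = w1[:, None] * w0[None, :]
    return float((pair_w * concordant).sum() / pair_w.sum())

# Toy data (made up for illustration only).
scores = [0.9, 0.8, 0.3, 0.2]
labels = [1, 0, 1, 0]
print(weighted_auc(scores, labels))                        # 0.75
print(weighted_auc(scores, labels, weights=[1, 1, 3, 1]))  # 0.625
```

In this toy example, the low-scoring case represents three population units, so ignoring the weights overstates the discrimination ability. This is the kind of discrepancy that arises when the design variables are related to the outcome.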