Abstract
INTRODUCTION: Smoking status is an important confounder in many epidemiologic studies, yet it is not well documented in common sources of real-world data, including administrative claims. Probabilistic models can be used to create a proxy for smoking status, but most published models have been trained using self-reported data. The objective of this study was to train a smoking status probabilistic model using cotinine values available in a large claims database.
METHODS: Beneficiaries were included if they had at least one cotinine measurement and were categorized as a 'current smoker' if their serum or plasma cotinine value was ≥5 ng/mL or their urine cotinine value was ≥30 ng/mL. Predictors were collected across the year prior to the cotinine assessment date. The model was fit using logistic regression with stepwise forward selection. Model performance was assessed using discrimination and calibration.
RESULTS: The final model yielded an area under the receiver operating characteristic curve of 0.77 (95% CI: 0.75-0.78) and was well calibrated across most prediction deciles. The strongest predictors included diagnosis codes for smoking and drug abuse, and the number of medications. The model was highly specific, yet not sensitive, at probability cutoffs ≥0.2.
CONCLUSIONS: A smoking status model was developed and internally validated for application in claims data, using available cotinine values to define smoking status, and was found to have acceptable discrimination and calibration. The model is based on 26 predictors, fewer than other similar published smoking status models. External validation should be the next step toward using the model in epidemiological research.
IMPLICATIONS: This study tests the utility of cotinine values to validate a smoking status probabilistic model, which has not been done in the literature to date. The results were robust to the various cotinine levels used to define smoking status, per current guidance. The final model uses only 26 factors to predict smoking status, simplifying its application in other claims databases.
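The labeling rule and the cutoff-based evaluation described above can be illustrated with a minimal Python sketch. It is not the authors' code. The thresholds come from the abstract, but the function names, the specimen labels, and the example records are hypothetical and shown only for illustration.

```python
from typing import Iterable, Tuple

# Thresholds as stated in the abstract (ng/mL).
SERUM_PLASMA_CUTOFF = 5.0
URINE_CUTOFF = 30.0


def is_current_smoker(specimen: str, cotinine_ng_ml: float) -> bool:
    """Apply the abstract's cotinine rule; specimen names are assumed labels."""
    specimen = specimen.lower()
    if specimen in ("serum", "plasma"):
        return cotinine_ng_ml >= SERUM_PLASMA_CUTOFF
    if specimen == "urine":
        return cotinine_ng_ml >= URINE_CUTOFF
    raise ValueError(f"unsupported specimen type: {specimen!r}")


def sensitivity_specificity(
    labels: Iterable[bool], probs: Iterable[float], cutoff: float
) -> Tuple[float, float]:
    """Classify as smoker when predicted probability >= cutoff, then
    compare against the cotinine-based labels."""
    tp = fn = tn = fp = 0
    for y, p in zip(labels, probs):
        predicted = p >= cutoff
        if y and predicted:
            tp += 1
        elif y and not predicted:
            fn += 1
        elif not y and predicted:
            fp += 1
        else:
            tn += 1
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


# Hypothetical records: (specimen, cotinine in ng/mL, model probability).
records = [
    ("serum", 12.0, 0.60),
    ("urine", 45.0, 0.10),
    ("plasma", 1.0, 0.15),
    ("urine", 10.0, 0.05),
    ("serum", 0.5, 0.25),
]
labels = [is_current_smoker(s, c) for s, c, _ in records]
probs = [p for _, _, p in records]
sens, spec = sensitivity_specificity(labels, probs, cutoff=0.2)
```

Raising the cutoff in such a scheme generally trades sensitivity for specificity, which is consistent with the pattern reported in the results for cutoffs of 0.2 and above.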