Abstract
BACKGROUND: Missing survey data can threaten the validity and generalizability of findings from longitudinal cohort studies. Respondent characteristics and survey attributes may contribute to patterns of survey non-completion, a form of missing data in which respondents begin but do not finish a survey. Such patterns can lead to biased conclusions. The objectives of the present research are to demonstrate how machine learning can identify survey non-completion and to characterize individual and methodological factors that are associated with this form of data missingness.
METHODS: The present study developed a novel machine learning algorithm to characterize survey non-completion in the Millennium Cohort Study during the 2019-2021 data collection cycle, which included a 30- to 45-min paper or web-based follow-up survey for previously enrolled panels (Panels 1-4, n = 80,986) and a 30- to 45-min web-based baseline survey for new enrollees (Panel 5, n = 58,609). We then examined the effect of individual characteristics and survey attributes on survey non-completion.
RESULTS: This algorithm achieved 99% accuracy and showed that 0.29% of follow-up respondents and 15.43% of new enrollees were survey non-completers. Our findings suggest that certain military and sociodemographic characteristics (e.g., enlisted pay grades) were associated with increased survey non-completion in the 2019-2021 cycle. Survey attributes explained a large proportion of the variability in survey non-completion, with our analyses indicating a higher likelihood of survey non-completion in sections (1) located toward the beginning of the survey, (2) with sensitive questions, and (3) with fewer questions.
CONCLUSION: This research highlights the importance of accounting for potential respondent bias due to survey non-completion and identifies factors associated with this type of missing data.