Abstract
INTRODUCTION: The UK Biobank (UKB) dataset contains a large number of spouse/partner pairs (hereafter, "couples"), making it a valuable resource for research on human mating. UKB data does not report relationships between participants, and UKB's preliminary participant colocation data (based on participants sharing the same street address) is not available for new researchers to request. This has led to different criteria for identifying spouse/partner pairs and highly discrepant sample sizes across papers, potentially contributing to heterogeneity in results. To address this, we developed and validated a standardized method for identifying couples that maximizes sample size while minimizing selection bias.
METHODS: We evaluated combinations of geographically derived variables for identifying colocated UKB participants and selected six variables that performed well when compared with the original UKB colocation data. These variables, such as "distance to coast," do not reveal participants' precise locations but are unlikely to match for non-colocated individuals. We then established additional criteria to confirm that colocated participants shared a household and were likely to be couples. These criteria were designed to be compatible with either geographic colocation or the original UKB colocation data.
RESULTS: The geographically derived variables identified 92,510 putative couples, compared to 89,278 detected using the UKB colocation data. Further analyses suggested that the additional pairs identified in our geographic sample were likely valid pairs not captured by the UKB colocation data. We also assessed certain criteria used in previous studies (such as age, income, and duration of residence) and demonstrated how they could result in biased or non-representative samples.
DISCUSSION: Our approach produced a large, robust sample of couples while minimizing false positives. The criteria are flexible and can be applied using either geographic or UKB colocation data. To facilitate further research, we have made the R code for implementing these criteria publicly available at https://github.com/kkellysci/UKBCouples/.