Abstract
Drug development is an expensive endeavor, with development costs averaging $879.3 million and only 14.3% of drug candidates ultimately securing regulatory approval. One fundamental challenge is ensuring that the patient population enrolled in a clinical trial accurately reflects the baseline characteristics (BCx) of the broader target population. This concern bears directly on the generalizability of study findings beyond the trial setting: if the trial population differs substantially from real-world patients in BCx (e.g., age, comorbidities, disease severity), the results may have limited applicability to routine clinical practice. Both regulatory agencies and health technology assessment (HTA) bodies may therefore question the generalizability and transportability of trial results to real-world settings. HTA bodies may further require robust comparative effectiveness estimates, which in turn require alignment with the patient populations of previous studies. Real-world data (RWD) can help assess and calibrate the representativeness of clinical trial populations, although limitations persist, including underrepresentation of key subgroups and missing biomarker data. Historical clinical trial data can offer valuable insights into BCx distributions and recruitment patterns and can enable more reliable estimates of comparative effectiveness. Machine learning (ML) can enhance this process by using clustering methods to identify optimal BCx distributions, thereby improving trial generalizability. Moreover, Bayesian clustering frameworks that incorporate RWD can further refine these estimates, bringing them into closer alignment with real-world epidemiology. This paper proposes that, by combining insights from prior trials with RWD using ML techniques, researchers can more effectively capture population heterogeneity and design more generalizable, patient-centric recruitment strategies.