Abstract
High-quality, widely accessible international longitudinal cohort data for people living with HIV (PWH) have long been needed to advance open science and data-driven innovation, yet stringent and incongruent privacy regulations have made data sharing difficult. Synthetic data generation offers a promising privacy-preserving alternative, but producing realistic synthetic cohorts of PWH remains challenging because of the complex temporal dynamics, interdependent clinical variables, long follow-up periods, and high missingness inherent in such data. Here, we introduce Medical Longitudinal latent Diffusion (MeLD), a generative model designed to synthesize variable-length, decades-spanning, mixed-type clinical trajectories with missingness. Using the Caribbean, Central, and South America Network for HIV Epidemiology (CCASAnet) cohort, one of the world's largest international HIV datasets with over 30 years of follow-up on nearly 50,000 PWH, we show that MeLD consistently outperforms state-of-the-art methods across data utility, fidelity, and privacy. Notably, MeLD excels at longitudinal inference utility, accurately reproducing time-to-death estimates and risk factor effects while maintaining strong privacy protection. This work delivers the first in-depth, large-scale, and openly accessible synthetic longitudinal cohort of PWH that faithfully preserves the distributional patterns and clinical associations observed in real data, offering an immediately deployable resource for hypothesis generation, methods innovation, medical training, and reproducible HIV research.