Abstract
Multi-omics data are instrumental in obtaining a comprehensive picture of complex biological systems. This is particularly useful for women's health conditions such as endometriosis, which has been historically understudied despite its high prevalence (around 10% of women of reproductive age). Consequently, endometriosis has limited genetic characterization: current genome-wide association studies explain only 11% of its 47% total estimated heritability, underscoring the need for integrative approaches. Graph representations provide an intuitive and meaningful way to harmonize biological data, using nodes to represent biological concepts (e.g., genes, single nucleotide polymorphisms (SNPs), proteins, and phenotypes) and edges to represent their relationships. We present DRIVE-KG (Disease Risk Inference and Variant Exploration Knowledge Graph), which uses a heterogeneous graph representation to integrate data from diverse multi-omics datasets. We trained two distinct models using DRIVE-KG: a link prediction model to suggest associations between SNPs and two pilot phenotypes (endometriosis and obesity), and a graph convolutional network (GCN) for patient-level classification of endometriosis/adenomyosis as a combined phenotype. We conducted patient-level classification using data from 1,441 Penn Medicine BioBank participants with gold-standard, chart-reviewed endometriosis/adenomyosis status. The link prediction model uncovered 66 high-confidence (model score ≥ 0.95) candidate SNP-endometriosis associations, representing largely distinct genetic signals (R² < 0.1). These variants were enriched for obesity/body mass index traits (24.2%), lipid metabolism (6%), and depressive disorders (4.5%), in agreement with emerging hypotheses about endometriosis etiology.
In contrast, of the high-confidence candidate SNP-obesity associations that could be evaluated using LDlink, 38.22% were in high linkage disequilibrium (R² ≥ 0.8) with known obesity or comorbidity associations. The GCN for classifying patient endometriosis/adenomyosis status achieved an F1 score of 0.752, compared with 0.698 for a genetic risk score. Although this improvement is moderate, we found that the GCN learned meaningful stratification of underlying adenomyosis signal and severe endometriosis grades. Together, these results demonstrate that heterogeneous integration of multi-omics data is valuable for diverse downstream tasks, including discovery and clinical prediction, particularly for understudied diseases where traditional genomic approaches are insufficient.
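To make the graph representation and the high-confidence threshold concrete, the following is a minimal sketch in plain Python. It is not the DRIVE-KG implementation: the `HeteroGraph` class, the relation labels, the node names, and every score are hypothetical placeholders chosen for illustration. It assumes only what is stated above, namely typed nodes joined by typed edges and a cutoff of model score ≥ 0.95 for candidate links.

```python
from collections import defaultdict
from typing import NamedTuple


class Node(NamedTuple):
    kind: str  # node type, e.g. "snp", "gene", "phenotype"
    name: str  # placeholder identifier, not a real variant or gene


class HeteroGraph:
    """Toy heterogeneous graph: typed nodes joined by typed (named) edges."""

    def __init__(self):
        # relation label -> set of (source node, target node) pairs
        self.edges = defaultdict(set)

    def add_edge(self, source, relation, target):
        self.edges[relation].add((source, target))

    def has_edge(self, source, relation, target):
        return (source, target) in self.edges.get(relation, set())


def high_confidence_candidates(graph, scores, relation, threshold=0.95):
    """Return scored pairs at or above the threshold that are not
    already edges of the given relation, highest score first."""
    novel = [
        pair
        for pair, score in scores.items()
        if score >= threshold and not graph.has_edge(pair[0], relation, pair[1])
    ]
    return sorted(novel, key=lambda pair: -scores[pair])


# Hypothetical nodes; names are placeholders.
endo = Node("phenotype", "endometriosis")
snp_a, snp_b = Node("snp", "snp_A"), Node("snp", "snp_B")
snp_c, snp_d = Node("snp", "snp_C"), Node("snp", "snp_D")
gene_x = Node("gene", "gene_X")

graph = HeteroGraph()
graph.add_edge(snp_a, "associated_with", endo)  # an already-known association
graph.add_edge(snp_b, "maps_to", gene_x)        # a different edge type

# Invented link prediction scores for SNP-phenotype pairs.
scores = {
    (snp_a, endo): 0.99,  # known edge, so not a new candidate
    (snp_b, endo): 0.97,
    (snp_c, endo): 0.95,
    (snp_d, endo): 0.60,  # below the threshold
}

candidates = high_confidence_candidates(graph, scores, "associated_with")
for source, target in candidates:
    print(source.name, "->", target.name, scores[(source, target)])
```

In this sketch a known association is excluded even when its score is high, so the output contains only unobserved pairs that pass the cutoff. Whether and how the actual pipeline filters known edges is an assumption here.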