Abstract
In population-scale studies of genomic variation based on shallow whole-genome sequencing, pangenomes have become an effective tool for identifying population-specific single-nucleotide polymorphisms and indels. Extending these advantages to copy number variation (CNV), however, remains challenging because of two unresolved issues. First, current pangenome frameworks exhibit pronounced population-representation bias arising from uneven sampling across populations. As the number of samples increases, the pangenome tends to capture variation primarily from majority populations while suppressing signals from minority populations. Second, in population-scale analyses, common but benign population-specific copy number polymorphisms (CNPs) frequently obscure pathogenic CNVs. Existing pangenome frameworks lack dedicated mechanisms for representing CNPs and CNVs, which limits their ability to distinguish pathogenic CNVs from benign, population-specific CNPs. In this study, we present PangenomeX, a graph-convolutional pangenome framework tailored for low-coverage, population-scale CNV analysis. To address CNP representation, we embed known CNPs as prior knowledge into the pangenome graph and construct a CNV relationship network guided by a phylogenetic tree. A graph convolutional network (GCN) then learns the interactions between CNV and CNP nodes. To mitigate population-representation bias, the GCN aggregates information from one- and two-hop neighborhoods only, preserving local population context while preventing majority-group signals from dominating. Evaluation on simulated cohorts and 561 real samples shows that PangenomeX distinguishes pathogenic CNVs from common population-specific CNPs markedly better than existing methods. Overall, PangenomeX offers a methodological blueprint for large-cohort variant screening and provides a practical path for bringing graph-based genomics into clinical practice.
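The restriction to one- and two-hop neighborhoods can be illustrated with a minimal sketch. This is not the PangenomeX implementation: the chain graph, the function names, and the use of the standard symmetric normalization of the adjacency matrix are all assumptions made for illustration, and learned weights and nonlinearities are omitted. It only shows that two rounds of neighbor aggregation limit each node's receptive field to nodes at most two hops away.

```python
import numpy as np

def normalized_adjacency(adj):
    """Symmetrically normalize A + I (the common GCN convention; assumed here)."""
    a_hat = adj + np.eye(adj.shape[0])      # add self-loops
    deg = a_hat.sum(axis=1)                 # node degrees including self-loop
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def two_hop_aggregate(adj, features):
    """Two rounds of neighbor aggregation; weights and activation are omitted."""
    a_norm = normalized_adjacency(adj)
    one_hop = a_norm @ features             # information from direct neighbors
    two_hop = a_norm @ one_hop              # information from neighbors of neighbors
    return one_hop, two_hop

# Hypothetical toy graph: a chain of five nodes, 0 - 1 - 2 - 3 - 4.
n = 5
adj = np.zeros((n, n))
for i in range(n - 1):
    adj[i, i + 1] = adj[i + 1, i] = 1.0

# One-hot features, so column j of the output traces the influence of node j.
features = np.eye(n)
one_hop, two_hop = two_hop_aggregate(adj, features)

# Nodes that contribute to node 0 after one and two rounds of aggregation.
reach_one = np.flatnonzero(one_hop[0] > 0)
reach_two = np.flatnonzero(two_hop[0] > 0)
print("one-hop receptive field of node 0:", reach_one.tolist())
print("two-hop receptive field of node 0:", reach_two.tolist())
```

In this toy chain, nodes 3 and 4 contribute nothing to node 0, however large their features are. This is the sense in which a shallow aggregation depth keeps distant signals, such as those from a heavily sampled group elsewhere in the graph, from reaching a node directly.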