Abstract
The relentless emergence of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) variants continues to challenge global health, as high mutation rates and complex pathogenicity obscure molecular mechanisms and impede clinical progress. Despite extensive research across viral evolution, structural biology, immunology, diagnostics, and therapeutics, the resulting literature is vast and quickly becomes outdated, which has widened the gap between fundamental discovery and medical application. Here, we systematically mined 439,724 coronavirus disease 2019 (COVID-19) publications using fine-tuned large language models to extract and distill knowledge across nine domains: antibodies, vaccines, serology, biochemistry, therapeutics, clinical presentation, risk factors, biomarkers, and diagnostics. These insights were integrated into a unified knowledge graph, CoVAR-KG, comprising 1,427,596 triples. Covering 90% of known spike-protein variant sites, CoVAR-KG forges molecular-to-clinical links that reveal how specific mutations influence antigenicity, transmissibility, and treatment response. By resolving data fragmentation, this resource accelerates target identification and streamlines hypothesis generation. Building on CoVAR-KG, we developed the COVID-19 variant risk watcher (CVRW), an early-warning framework that quantifies the threat of emerging variants for real-time surveillance. Coupling the graph with retrieval-augmented GPT-4o enables rapid and in-depth comparisons of variant functionality and immune escape potential. These integrative tools furnish timely insights for vaccine design, therapeutic optimization, and pandemic preparedness, establishing a versatile platform for combating current and future viral threats.
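As a reading aid, the following minimal Python sketch illustrates the general idea of storing literature-derived knowledge as triples and retrieving those that mention a given mutation to form context for a retrieval-augmented prompt. It is not the CoVAR-KG implementation: the names `Triple`, `TripleStore`, and `build_context` are hypothetical, the schema is an assumption, and the two example triples are illustrative placeholders rather than records taken from the graph.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional


class Triple(NamedTuple):
    """One (subject, predicate, object) statement; the schema is assumed."""
    subject: str
    predicate: str
    obj: str
    domain: str  # e.g. one of the nine knowledge domains, such as "antibodies"


class TripleStore:
    """Hypothetical in-memory store that indexes triples by subject."""

    def __init__(self) -> None:
        self._by_subject: Dict[str, List[Triple]] = defaultdict(list)

    def add(self, triple: Triple) -> None:
        self._by_subject[triple.subject].append(triple)

    def about(self, subject: str, domain: Optional[str] = None) -> List[Triple]:
        """Return triples whose subject matches, optionally filtered by domain."""
        hits = self._by_subject.get(subject, [])
        return [t for t in hits if domain is None or t.domain == domain]


def build_context(store: TripleStore, mutations: List[str]) -> str:
    """Flatten retrieved triples into plain-text lines for a prompt."""
    lines = []
    for mutation in mutations:
        for t in store.about(mutation):
            lines.append(f"{t.subject} {t.predicate} {t.obj} [{t.domain}]")
    return "\n".join(lines)


# Illustrative placeholder triples (not drawn from CoVAR-KG).
store = TripleStore()
store.add(Triple("N501Y", "increases", "ACE2 binding affinity", "biochemistry"))
store.add(Triple("E484K", "reduces", "antibody neutralization", "antibodies"))

context = build_context(store, ["N501Y", "E484K"])
```

In a retrieval-augmented setting, a string such as `context` would be prepended to the question sent to the language model, so that the answer is grounded in retrieved triples rather than in the model's parametric memory alone.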