Abstract
Chikungunya virus (CHIKV) poses a significant public health threat, and its continuous evolution necessitates high-resolution genomic surveillance. Current methods lack the speed and resolution to discriminate sub-lineages efficiently. To address this, we developed CHIKVGenotyper, an interpretable machine learning framework for high-resolution CHIKV lineage classification. From a comprehensive dataset of 6886 CHIKV genome sequences, we established a high-quality set of 3014 sequences for model development. A hierarchical assignment pipeline integrating a probability-based sequence matching model, machine learning refinement, and phylogenetic validation assigned high-confidence labels across eight CHIKV lineages, yielding a reliable dataset for subsequent analysis. Multiple machine learning models were trained and evaluated; the best-performing Random Forest model achieved near-perfect accuracy (F1-score: 99.53%) on high-coverage whole-genome test data and maintained robust performance (F1-score: 96.50%) on an independent low-coverage set. The E2 glycoprotein sequence alone yielded comparable accuracy (F1-score: 99.52%), highlighting the discriminative power of this region. SHapley Additive exPlanations (SHAP) analysis identified key lineage-defining amino acid mutations, such as E1-K211E and E2-V264A for the Indian Ocean Lineage, which are consistent with established biological knowledge. This work provides an accurate, scalable, and interpretable tool for CHIKV molecular epidemiology, offering insights into viral evolution and aiding outbreak response.
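For readers less familiar with the reported metric, the following is a minimal sketch of how a multi-class F1-score can be computed over lineage labels. The abstract does not state which averaging scheme was used, so macro-averaging here is an assumption, and the function names (`per_class_f1`, `macro_f1`) and the toy labels are hypothetical illustrations rather than part of CHIKVGenotyper.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return a dict mapping each label to its F1-score."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy labels for illustration only; not data from the study.
y_true = ["IOL", "IOL", "IOL", "Asian", "Asian", "WA"]
y_pred = ["IOL", "IOL", "Asian", "Asian", "Asian", "WA"]

print(per_class_f1(y_true, y_pred))
print(round(macro_f1(y_true, y_pred), 4))
```

Macro-averaging weights every lineage equally, so a rare lineage with poor recall lowers the score as much as a common one. A support-weighted average would instead track overall accuracy more closely on imbalanced datasets.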