Abstract
OBJECTIVES: Macrolide resistance in Bordetella pertussis cannot be fully explained by 23S rRNA mutations, underscoring the need for comprehensive methods to detect resistant isolates and clarify resistance mechanisms. METHODS: Whole-genome sequencing data from 556 isolates with macrolide resistance information, comprising 398 resistant and 158 sensitive strains, were retrieved from the National Center for Biotechnology Information (NCBI). A k-mer-based genome-wide association study using Pyseer identified 1322 resistance-associated k-mers. Refinement with Scoary2, least absolute shrinkage and selection operator (LASSO), and variable selection using random forests (VSURF) yielded six key k-mers, enabling the construction of a simplified model for predicting resistance. RESULTS: A total of 1322 distinct k-mers were associated with resistance. The simplified model included only six k-mers, of which a DHCW motif cupin fold protein (odds ratio [OR]: 25.84, 95% confidence interval [CI]: 15.58-44.03) and an IS481 insertion sequence located near infA (OR: 28.96, 95% CI: 7.64-243.86) showed strong associations with resistance. Despite the reduced feature set, the simplified model achieved classification performance comparable to that of the initial model in sensitivity (97.7% versus 98.7%), specificity (92.1% versus 99.5%), and accuracy (93.4% versus 99.3%). Notably, it maintained a high area under the receiver operating characteristic curve (AUC = 0.98), indicating strong predictive capability. CONCLUSIONS: This study developed a simplified k-mer-based model for accurately identifying macrolide-resistant B. pertussis isolates and uncovered novel resistance features.
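
The quantities reported above (odds ratios with 95% confidence intervals, sensitivity, specificity, and accuracy) can be illustrated with a minimal sketch. It assumes a simple 2x2 presence/absence table for a single k-mer and a Wald-type interval on the log odds ratio; the abstract does not state how the intervals were derived, so this is not necessarily the authors' method. The function names and all counts are hypothetical and are not taken from the study.

```python
import math


def odds_ratio_wald_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI from a 2x2 table (illustrative only).

    a: resistant isolates carrying the k-mer
    b: sensitive isolates carrying the k-mer
    c: resistant isolates lacking the k-mer
    d: sensitive isolates lacking the k-mer
    """
    # Assumption: apply a 0.5 continuity correction if any cell is zero,
    # so that the ratio and its standard error stay finite.
    if min(a, b, c, d) == 0:
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


def classification_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the functions.
    print(odds_ratio_wald_ci(80, 20, 20, 80))
    print(classification_metrics(tp=95, fn=5, tn=90, fp=10))
```

Because the interval is symmetric on the log scale, a small count in any one cell of the table widens it sharply, which is the kind of situation in which an interval as wide as the one reported for the IS481 feature would be expected.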