Abstract
BACKGROUND: Bacterial infections rank as the second leading cause of death globally, and virulence factors (VFs) are crucial to bacterial pathogenicity. Accurate VF prediction can help uncover the mechanisms of bacterial diseases and suggest new treatments. Current machine learning (ML) methods have several limitations: outdated feature extraction, simplistic prediction frameworks, and no differentiation between gram-positive (G+) and gram-negative (G-) bacteria.

RESULTS: In this study, we introduced pLM4VF, a predictive framework that uses ESM protein language models to extract VF characteristics of G+ and G- bacteria separately and then integrates the resulting models with a stacking strategy. Extensive benchmarking on an independent test set showed that pLM4VF outperformed state-of-the-art methods, improving accuracy by 0.088-0.320 for VF prediction in G+ bacteria and by 0.063-0.307 in G- bacteria. Biological validation through cytotoxicity and acute toxicity assays further corroborated the reliability of pLM4VF. Additionally, we developed an online tool (https://compbiolab.hainanu.edu.cn) that enables researchers without ML experience to identify VFs of various bacteria at the whole-genome scale.

CONCLUSIONS: We believe that pLM4VF will offer substantial support in uncovering pathogenic mechanisms and in developing novel antibacterial treatments and vaccines, thereby aiding the prevention and management of bacterial diseases.