Abstract
BACKGROUND: Lactic acid bacteria (LAB) play vital roles in food production and clinical applications. Accurate classification of LAB strains facilitates their functional development and targeted use. Although machine learning and deep learning methods have been widely applied to genome sequence classification, challenges remain in capturing comprehensive feature representations and in improving model generalizability.

RESULTS: We present HKDE-LACM, a hybrid classification model that integrates high-dimensional k-mer frequency features with contextual embeddings derived from DNABERT-2. To optimize model hyperparameters, we introduce a Cyclic Differential Evolution and Bayesian Optimization with Failure Avoidance (C-DBFA) framework. We evaluated the model with 10-fold cross-validation on three LAB datasets. The results show that HKDE-LACM outperforms existing methods in both classification accuracy and robustness.

CONCLUSIONS: HKDE-LACM overcomes the limitations of traditional k-mer features by incorporating semantic embeddings, thereby enriching the representation of genomic sequences. In addition, the C-DBFA optimization framework allows the model to automatically identify optimal combinations of feature extractors and classifiers. These advantages improve the model's generalization ability, making it a promising tool for genome-based LAB classification and related tasks.
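
As a rough illustration of the k-mer frequency features referred to above, the following minimal Python sketch computes a normalized k-mer frequency vector for a DNA sequence and joins it with an embedding vector. It is not the HKDE-LACM implementation: the function names `kmer_frequencies` and `hybrid_features`, the choice of k = 3, and plain concatenation as the integration step are all assumptions made for illustration, and the embedding is a placeholder list rather than actual DNABERT-2 output.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=3):
    """Return normalized k-mer frequencies over the ACGT alphabet.

    The vector has 4**k entries in lexicographic k-mer order.
    k-mers containing other symbols (e.g. N) are ignored.
    """
    seq = seq.upper()
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[kmer] for kmer in vocab)
    return [counts[kmer] / total if total else 0.0 for kmer in vocab]


def hybrid_features(seq, embedding, k=3):
    """Concatenate k-mer frequencies with an embedding vector.

    Assumption: concatenation stands in for the integration step;
    `embedding` is a hypothetical placeholder for a contextual embedding.
    """
    return kmer_frequencies(seq, k) + list(embedding)
```

Under these assumptions, a call such as `hybrid_features(genome_fragment, embedding_vector, k=3)` would yield a vector of length 64 plus the embedding dimension, which could then be passed to any downstream classifier.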