Abstract
BACKGROUND: Gene expression in plants is governed by complex interactions between cis-regulatory elements and epigenetic modifications such as histone marks. Although deep learning models can predict regulatory features from DNA sequence, how well they generalize across plant species remains largely unexplored.

RESULTS: We systematically evaluate the ability of deep learning models to predict histone modifications across plant species using a multi-stage framework based on the Sei architecture. We train species-specific models for Arabidopsis (Arabidopsis thaliana), rice (Oryza sativa), and maize (Zea mays), achieving high within-species predictive performance and strong agreement between predictions and experimental ChIP-seq profiles. Cross-species performance, however, declines as phylogenetic distance increases, indicating limited model transferability between monocots and dicots. To improve generalization, we construct a Poaceae family-level model by jointly training on rice and maize, alongside a model trained solely on Arabidopsis. Both models retain robust predictive power in unprofiled species that were not included in the training set, indicating that they can adapt to novel plant genomes based solely on conserved regulatory syntax. In contrast, cross-family models produce less consistent results, performing reliably only in species that share conserved regulatory features. We also develop an easy-to-use pipeline that predicts genome-wide chromatin signals directly from DNA sequence.

CONCLUSIONS: Our findings demonstrate that phylogenetically informed model training significantly improves cross-species epigenomic prediction, offering a scalable computational strategy for functional annotation in non-model and agriculturally important plants.