Abstract
Early warning of overgrowth in strawberry seedlings is essential for balancing vegetative and reproductive growth. However, existing monitoring methods face major challenges, including subtle visual symptoms and a scarcity of abnormal samples. To address these challenges, we propose MM-CAPNet, a multimodal fusion framework for early detection of seedling overgrowth. We first built a representative sample collection of strawberry seedlings through a systematic induction experiment, pairing historical environmental time-series data with contemporaneous plant images. MM-CAPNet uses a dual-stream design to process these inputs, with a Transformer encoder for the environmental sequences and a MobileNetV2 encoder for the images. A key component of the framework is an image-guided cross-attention mechanism, which treats the current phenotype as an active query to adaptively retrieve and aggregate the most diagnostically relevant segments of the past environmental data. Experiments show that MM-CAPNet outperforms the baselines, reaching 87.6% accuracy and an AUC of 0.901, with strong discriminative ability for the early overgrowth categories. Ablation studies confirm its interpretability by linking visual phenotypes to key environmental drivers. This work provides growers with a proof-of-concept framework for regulating fertilization, irrigation, and light management during the nursery stage, thereby reducing the risk of excessive vegetative growth. The framework supports precision cultivation strategies that enhance resource efficiency and crop resilience.
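The image-guided cross-attention idea can be illustrated with a minimal sketch. It is not the MM-CAPNet implementation: it assumes a single attention head, omits the learned query/key/value projections, and stands in hand-written toy vectors for the outputs of the MobileNetV2 and Transformer encoders. The function name `image_guided_cross_attention` and all numeric values are hypothetical. The sketch shows only the core step, in which one image embedding acts as the query and each past environmental time step supplies a key and a value.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def image_guided_cross_attention(image_query, env_keys, env_values):
    """Single-head scaled dot-product attention (illustrative assumption).

    image_query: d-dimensional embedding of the current plant image.
    env_keys:    T vectors of dimension d, one per past time step.
    env_values:  T vectors of dimension d_v, one per past time step.
    Returns (context, weights): the attention-weighted sum of the values
    and the weight assigned to each time step.
    """
    d = len(image_query)
    # Similarity between the image query and each environmental time step.
    scores = [dot(image_query, key) / math.sqrt(d) for key in env_keys]
    weights = softmax(scores)
    d_v = len(env_values[0])
    # Aggregate the environmental values according to the weights.
    context = [
        sum(w * value[j] for w, value in zip(weights, env_values))
        for j in range(d_v)
    ]
    return context, weights


# Toy inputs (hypothetical): one image embedding and three past time steps.
image_query = [1.0, 0.0]
env_keys = [[0.0, 1.0], [2.0, 0.0], [0.5, 0.5]]
env_values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

context, weights = image_guided_cross_attention(image_query, env_keys, env_values)
print("attention weights per time step:", weights)
print("aggregated environmental context:", context)
```

In this toy case the second time step receives the largest weight because its key is most aligned with the image query. Inspecting such weights is one way attention over an environmental history can be related to the observed phenotype.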