Abstract
To address the complexity of crispness evolution in Lin'an mountain walnuts under eight distinct thermal processing methods, this study proposes a Multi-modal Cross-Attention Fusion Network (MCAFNet) for intelligent, non-destructive evaluation of the physical quality of nut-based foods. First, the Markov Transition Field (MTF) is introduced to transform one-dimensional electronic-nose time-series signals into multi-channel images, which are combined with multi-scale parallel convolutions to capture implicit temporal dynamics across different time scales. Second, spectral indices and a reconstruction-stacking strategy are adopted to extract local and global features through multi-stage convolutions, significantly enhancing the representation of spectral information. Third, to bridge the semantic gap between modalities, MCAFNet incorporates a Dual-Branch Feature Fusion (DBFF) module that performs feature-level fusion of the one-dimensional image and spectral features, while a bidirectional cross-attention mechanism enables deep interaction and model-level fusion between the spatial features of pseudo-spectral images and the temporal features of MTF images. Finally, predictions are produced by a hybrid stacking ensemble strategy. Experiments demonstrate the superior performance of MCAFNet, which achieves a coefficient of determination (R²) of 0.968 and a residual predictive deviation (RPD) of 5.578 for crispness prediction. MCAFNet thus offers an efficient solution for non-destructive crispness evaluation of nut-based products and establishes a solid theoretical and practical foundation for intelligent quality monitoring in food engineering.
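The MTF step described above can be illustrated with a minimal sketch: the series is quantile-binned into discrete states, a first-order Markov transition matrix is estimated from adjacent states, and the image pixel (i, j) is the transition probability between the states at times i and j. The function name, bin count, and the synthetic signal below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def markov_transition_field(x, n_bins=8):
    """Sketch of an MTF encoding of a 1-D series (illustrative, not the paper's code)."""
    # Quantile-bin the series into n_bins discrete states
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    q = np.digitize(x, edges)  # state index of each time step, in [0, n_bins)
    # First-order Markov transition matrix W[a, b] ~ P(state b follows state a)
    W = np.zeros((n_bins, n_bins))
    for a, b in zip(q[:-1], q[1:]):
        W[a, b] += 1
    W /= np.maximum(W.sum(axis=1, keepdims=True), 1)  # row-normalize (avoid 0-division)
    # MTF image: M[i, j] = W[q_i, q_j]
    return W[np.ix_(q, q)]

# Example: a 128-step synthetic decaying-oscillation curve standing in for an e-nose response
t = np.linspace(0, 1, 128)
signal = np.exp(-3 * t) * np.sin(12 * t)
mtf = markov_transition_field(signal)
print(mtf.shape)  # (128, 128)
```

Stacking MTF images from several sensor channels then yields the multi-channel image input that the multi-scale parallel convolutions operate on.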