Abstract
Accurate estimation of chlorophyll content from spectral reflectance is essential for monitoring plant physiological status and for supporting precision agriculture. This study evaluates how four preprocessing techniques, Original Reflectance (OR), Continuum Removal (CR), De-trending (DT), and Standard Normal Variate (SNV), affect chlorophyll content prediction by four machine learning models: a 1D Convolutional Neural Network (1D-CNN), Self-Supervised Learning (SSL), a Vision Transformer (ViT), and a Conformer. Reflectance data were collected from tea leaves (Camellia sinensis) and analysed using ten-fold cross-validation. Correlation analysis revealed that SNV and DT enhanced spectral sensitivity to chlorophyll content, particularly around the chlorophyll absorption regions (450-500 nm and 650-700 nm), whereas CR emphasized negative correlations in the visible spectrum. The SSL model combined with SNV preprocessing achieved the highest accuracy (R² = 0.82, RPD = 2.37), outperforming all other model-preprocessing combinations. The 1D-CNN model performed best with DT, leveraging local spectral features, whereas the ViT and Conformer models benefited most from CR, which emphasizes absorption depth and spectral shape. These results show that the optimal preprocessing method depends on the model architecture and that pairing preprocessing with a suitable model is critical for maximizing prediction performance. They underscore the importance of tailored preprocessing strategies for hyperspectral analysis and provide practical insights for improving biochemical trait estimation in plant phenotyping.
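For readers unfamiliar with the preprocessing and accuracy metrics named above, the following is a minimal sketch of SNV and of the R² and RPD statistics as they are commonly defined in chemometrics. It is not the study's implementation: the function names, the use of the sample standard deviation (n - 1), and the toy values are assumptions made purely for illustration.

```python
import math
import statistics


def snv(spectrum):
    """Standard Normal Variate: centre one spectrum on its own mean and
    scale it by its own standard deviation (sample SD assumed here)."""
    mean = statistics.fmean(spectrum)
    sd = statistics.stdev(spectrum)
    return [(value - mean) / sd for value in spectrum]


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = statistics.fmean(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rpd(observed, predicted):
    """Ratio of performance to deviation: SD of the observed values
    divided by the root-mean-square error of the predictions."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    rmse = math.sqrt(sum(squared_errors) / len(observed))
    return statistics.stdev(observed) / rmse


# Hypothetical reflectance values for a single leaf spectrum.
toy_spectrum = [0.05, 0.08, 0.04, 0.45, 0.50]
scaled = snv(toy_spectrum)

# Hypothetical observed and predicted chlorophyll values.
toy_observed = [1.0, 2.0, 3.0, 4.0]
toy_predicted = [1.5, 2.5, 3.5, 4.5]
print(r_squared(toy_observed, toy_predicted), rpd(toy_observed, toy_predicted))
```

Because SNV rescales each spectrum independently, it removes per-sample offsets and scaling differences without reference to the other samples. RPD relates prediction error to the natural spread of the reference values, so it grows as the error shrinks relative to that spread.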