Abstract
Soil arsenic (As) contamination poses serious threats to ecosystems and human health, making accurate and efficient monitoring techniques necessary. This study introduces a novel multi-source data fusion approach to improve the hyperspectral inversion of soil arsenic concentrations by integrating dimensionality-reduced spectral data with soil components significantly correlated with arsenic (e.g., Cd, Cr, Cu, Ni, Pb, Zn, S, and total Fe₂O₃ (T-Fe₂O₃)). Principal Component Analysis (PCA) was used to reduce the dimensionality of the hyperspectral data, addressing collinearity and redundancy while preserving critical spectral information. The performance of three models, namely Partial Least Squares Regression (PLSR), Artificial Neural Networks (ANN), and Random Forest (RF), was assessed under four input variable combinations: (1) original spectral data, (2) original spectral data with soil components, (3) PCA dimensionality-reduced spectral data, and (4) PCA dimensionality-reduced spectral data combined with soil components. The results showed that the RF model, when applied to the multi-source data of PCA-reduced spectra and soil components, achieved the highest inversion accuracy (R² = 0.86), significantly outperforming the PLSR model (R² = 0.75). This study underscores the effectiveness of multi-source data fusion in enhancing model performance and highlights the superior capability of the RF model in handling complex, high-dimensional datasets. The findings provide a theoretical foundation for optimizing hyperspectral remote sensing technology for monitoring soil heavy metal contamination and establish a robust framework for future research and practical applications in environmental science.
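
The four input combinations compared in the abstract can be illustrated with a minimal sketch, shown here for the RF model only and using scikit-learn. Everything specific in it is an assumption made for illustration and is not taken from the study: the data are synthetic, and the sample count, number of spectral bands, number of retained principal components, train/test split, and RF hyperparameters are hypothetical choices. The helper names `build_features` and `evaluate` are likewise invented for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the study's data; all sizes are assumptions).
N_SAMPLES, N_BANDS, N_SOIL, N_PCS = 120, 200, 8, 10
latent = rng.normal(size=(N_SAMPLES, 3))
# Highly collinear "spectra": many bands driven by a few latent factors.
spectra = latent @ rng.normal(size=(3, N_BANDS))
spectra += 0.05 * rng.normal(size=(N_SAMPLES, N_BANDS))
# Eight soil components (stand-ins for Cd, Cr, Cu, Ni, Pb, Zn, S, T-Fe2O3).
soil = rng.normal(size=(N_SAMPLES, N_SOIL))
arsenic = latent[:, 0] + 0.8 * soil[:, 0] + 0.1 * rng.normal(size=N_SAMPLES)

# One fixed split so all four input combinations see the same samples.
train_idx, test_idx = train_test_split(
    np.arange(N_SAMPLES), test_size=0.3, random_state=0
)


def build_features(use_pca, use_soil):
    """Return (X_train, X_test) for one of the four input combinations."""
    x_train, x_test = spectra[train_idx], spectra[test_idx]
    if use_pca:
        # Fit PCA on the training spectra only to avoid information leakage.
        pca = PCA(n_components=N_PCS).fit(x_train)
        x_train, x_test = pca.transform(x_train), pca.transform(x_test)
    if use_soil:
        # Data fusion by simple column-wise concatenation.
        x_train = np.hstack([x_train, soil[train_idx]])
        x_test = np.hstack([x_test, soil[test_idx]])
    return x_train, x_test


def evaluate(use_pca, use_soil):
    """Fit a random forest and return R^2 on the held-out samples."""
    x_train, x_test = build_features(use_pca, use_soil)
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(x_train, arsenic[train_idx])
    return r2_score(arsenic[test_idx], model.predict(x_test))


combinations = {
    "(1) original spectra": (False, False),
    "(2) original spectra + soil": (False, True),
    "(3) PCA spectra": (True, False),
    "(4) PCA spectra + soil": (True, True),
}
scores = {name: evaluate(*flags) for name, flags in combinations.items()}
for name, r2 in scores.items():
    print(f"{name}: R^2 = {r2:.2f}")
```

Because the data are synthetic, the printed scores say nothing about the accuracies reported above; the sketch only shows how the four feature sets differ and how the same regressor and the same held-out samples can be used to compare them.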