Abstract
Fine particulate matter (PM2.5) is a major air pollutant in the Indo-Gangetic Basin (IGB), where concentrations frequently exceed national and WHO air quality standards. This study used ground observations from 183 CPCB automatic stations, together with MERRA-2 reanalysis products and meteorological variables, to analyse PM2.5 characteristics over the decade 2014–2023. A machine learning (ML) framework was developed using Random Forest, Extra Trees, LightGBM, and a stacking ensemble model to improve surface PM2.5 estimation in four major IGB cities: Delhi, Kanpur, Lucknow, and Patna. The raw MERRA-2 estimates systematically underestimated PM2.5, with R² values of only 0.28–0.42 and RMSE as high as 82 µg m⁻³. By contrast, the stacking ensemble achieved R² values of 0.79–0.82, FAC2 above 0.94, RMSE of 27–31 µg m⁻³, and low bias (1.7–2.3 µg m⁻³). The model reproduced both extreme winter pollution episodes and monsoon conditions, highlighting the critical role of meteorological parameters such as boundary layer height, wind speed, and precipitation in regulating PM2.5 variability. Trajectory clustering and concentration-weighted trajectory (CWT) analysis showed that north-westerly transport contributes 55–65% of wintertime PM2.5 in Delhi, Kanpur, and Lucknow, while Patna is affected by both regional inflows and local sources. Major contributing regions include Punjab, Haryana, Rajasthan, and the Nepal plains, associated with crop residue burning and dust transport. By integrating ground observations, reanalysis data, meteorological predictors, and atmospheric transport analysis, this study provides a robust framework for improving PM2.5 prediction and identifying dominant pollution sources in the IGB.
The results provide scientific evidence for designing both regional and city-specific mitigation strategies to reduce exposure in one of the world’s most polluted and densely populated regions.

SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1038/s41598-026-37934-9.
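
The evaluation statistics quoted in the abstract (R², RMSE, bias, and FAC2) can be illustrated with a minimal sketch. It is not the authors' code, and it rests on several assumptions: R² is taken as the coefficient of determination (the abstract does not say whether it is this or the squared correlation), bias is taken as mean(predicted − observed), and FAC2 is the fraction of predictions within a factor of two of the observation. The function name `evaluate` and the sample values are hypothetical.

```python
import math


def evaluate(observed, predicted):
    """Return R2, RMSE, mean bias, and FAC2 for paired PM2.5 values.

    Assumed definitions (not confirmed by the source):
      R2   = 1 - SS_res / SS_tot (coefficient of determination)
      bias = mean(predicted - observed)
      FAC2 = fraction of pairs with 0.5 <= predicted / observed <= 2
    """
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equal length")

    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]

    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    bias = sum(errors) / n

    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    # FAC2 is only defined where the observation is positive.
    valid = [(o, p) for o, p in zip(observed, predicted) if o > 0]
    fac2 = (
        sum(1 for o, p in valid if 0.5 <= p / o <= 2.0) / len(valid)
        if valid
        else float("nan")
    )

    return {"r2": r2, "rmse": rmse, "bias": bias, "fac2": fac2}


# Hypothetical daily PM2.5 values in µg m^-3, for illustration only.
obs = [40.0, 80.0, 120.0, 200.0]
pred = [50.0, 70.0, 130.0, 90.0]
metrics = evaluate(obs, pred)
print(metrics)
```

In this toy data the last prediction is less than half the observed value, so it falls outside the factor-of-two band and also dominates the RMSE and bias. This is the kind of underestimation during high-pollution episodes that the abstract attributes to the raw MERRA-2 product.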