Abstract
BACKGROUND: Tuberculosis (TB) remains a critical global public health issue, particularly in rapidly urbanizing regions where population density and migration facilitate transmission. Whole-genome sequencing (WGS) has revolutionized our understanding of TB transmission dynamics, but its high cost and technical complexity limit widespread application in resource-limited settings. To address these limitations, we developed a machine learning (ML) framework that integrates routinely collected surveillance data (demographic, clinical, and spatial variables) to predict recent TB transmission hotspots without relying solely on WGS.
METHODS: We trained six ML models on sequenced TB cases (n = 1,442) from Songjiang District, Shanghai, to classify cases as belonging to recent transmission clusters (≤ 12 SNPs) or as non-clustered. The models incorporated individual-level data (e.g., age, sex, treatment history) and contextual variables (e.g., population density, land use). Model performance was evaluated using 10-fold cross-validation and an independent test set. Spatial analysis, including Getis-Ord Gi* statistics, was used to identify notification rate hotspots and compare them with predicted transmission hotspots.
RESULTS: Among the six ML models tested, CatBoost achieved the highest predictive performance (AUC of 0.83 in cross-validation) and remained robust on the independent test set. Spatial analysis revealed significant disparities: only 12% of high-notification areas overlapped with recent transmission hotspots, highlighting the limitations of traditional surveillance strategies. Key predictors of recent transmission included population density, industrial land use, and migrant proportion. Notably, our approach identified three recent transmission hotspots that would have been missed by relying solely on sequenced cases.
CONCLUSIONS: Our framework, validated in Songjiang District, provides a less resource-intensive alternative to WGS-dependent approaches for identifying TB transmission hotspots. By leveraging routinely collected surveillance data, the model enables targeted screening and optimized resource allocation. Its flexible design allows adaptation to other urban settings and retraining as new data become available, supporting its potential application in resource-limited contexts.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12879-026-13018-x.
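ILLUSTRATIVE SKETCH: The outcome label described in METHODS (recent transmission cluster at ≤ 12 SNPs versus non-clustered) can be illustrated with a minimal Python sketch. It assumes that a case counts as clustered when it lies within the threshold of at least one other sequenced case; the abstract does not state the exact clustering rule, so this is an assumption rather than the authors' method. The function name `label_recent_transmission`, the `snp_distance` callback, and the toy distances are hypothetical.

```python
from itertools import combinations

SNP_THRESHOLD = 12  # threshold stated in the abstract (<= 12 SNPs)


def label_recent_transmission(case_ids, snp_distance, threshold=SNP_THRESHOLD):
    """Return {case_id: bool}, True if the case is within `threshold` SNPs
    of at least one other case (assumed definition of "clustered")."""
    clustered = {cid: False for cid in case_ids}
    for a, b in combinations(case_ids, 2):
        if snp_distance(a, b) <= threshold:
            clustered[a] = True
            clustered[b] = True
    return clustered


# Hypothetical pairwise SNP distances for four cases (not data from the study).
toy = {
    ("A", "B"): 3,
    ("A", "C"): 40,
    ("A", "D"): 55,
    ("B", "C"): 38,
    ("B", "D"): 60,
    ("C", "D"): 13,
}


def toy_distance(a, b):
    # Distances are symmetric, so look up the pair in either order.
    return toy[(a, b)] if (a, b) in toy else toy[(b, a)]


labels = label_recent_transmission(["A", "B", "C", "D"], toy_distance)
```

Here cases A and B are labeled clustered (3 SNPs apart), while C and D are not, because their closest pair is 13 SNPs apart, just above the threshold. Labels of this kind would serve as the binary target for the classifiers.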