Identification of tuberculosis transmission hotspots in urban China using surveillance data: a machine learning approach based on genomic and spatial analysis

Abstract

BACKGROUND: Tuberculosis (TB) remains a critical global public health issue, particularly in rapidly urbanizing regions where population density and migration facilitate transmission. Whole-genome sequencing (WGS) has revolutionized our understanding of TB transmission dynamics, but its high cost and technical complexity limit widespread application in resource-limited settings. To address these limitations, we developed a machine learning (ML) framework that integrates routinely collected surveillance data (demographic, clinical, and spatial variables) to predict recent TB transmission hotspots without relying solely on WGS.
METHODS: We trained six ML models on sequenced TB cases (n = 1,442) from Songjiang District, Shanghai, to classify cases as belonging to recent transmission clusters (≤ 12 SNPs) or as non-clustered. The models incorporated individual-level data (e.g., age, sex, treatment history) and contextual variables (e.g., population density, land use). Model performance was evaluated using 10-fold cross-validation and an independent test set. Spatial analysis, including Getis-Ord Gi* statistics, was used to identify notification rate hotspots and compare them with predicted transmission hotspots.
RESULTS: Among the six ML models tested, CatBoost achieved the highest predictive performance (AUC of 0.83 in cross-validation) and remained robust on the independent test set. Spatial analysis revealed significant disparities: only 12% of high-notification areas overlapped with recent transmission hotspots, highlighting the limitations of traditional surveillance strategies. Key predictors of recent transmission included population density, industrial land use, and migrant proportion. Notably, our approach identified three recent transmission hotspots that would have been missed by relying solely on sequenced cases.
CONCLUSIONS: Our framework provides a less resource-intensive alternative to WGS-dependent approaches for identifying TB transmission hotspots, validated in Songjiang District. By leveraging routinely collected surveillance data, the model enables targeted screening and optimized resource allocation. Its flexible design allows adaptation to other urban settings and retraining as new data become available, supporting its potential application in resource-limited contexts.
SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12879-026-13018-x.
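
The abstract names Getis-Ord Gi* as the hotspot statistic but does not describe how it was computed. The following is a minimal NumPy sketch of the standard Gi* z-score, not the authors' implementation. The function name `getis_ord_gi_star`, the six-area chain layout, the binary contiguity weights (each area counted as its own neighbour), and the rate values are all illustrative assumptions and are not taken from the study.

```python
import numpy as np


def getis_ord_gi_star(values, weights):
    """Return the Getis-Ord Gi* z-score for each area.

    values  : length-n array, e.g. notification rates per area (hypothetical).
    weights : n x n spatial weights matrix with a non-zero diagonal,
              because Gi* includes the focal area itself.
    """
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    n = x.size

    mean = x.mean()
    # Population standard deviation of the values across all areas.
    s = np.sqrt((x ** 2).mean() - mean ** 2)

    w_sum = w.sum(axis=1)            # sum_j w_ij
    w_sq_sum = (w ** 2).sum(axis=1)  # sum_j w_ij^2

    numerator = w @ x - mean * w_sum
    denominator = s * np.sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
    return numerator / denominator


# Hypothetical example: six areas in a row, high rates on the left.
rates = np.array([10.0, 9.0, 8.0, 1.0, 1.0, 1.0])

# Assumed binary contiguity weights: each area neighbours itself
# and the areas immediately to its left and right.
n_areas = rates.size
w = np.zeros((n_areas, n_areas))
for i in range(n_areas):
    for j in range(max(0, i - 1), min(n_areas, i + 2)):
        w[i, j] = 1.0

z = getis_ord_gi_star(rates, w)
hot = z > 1.96    # candidate hotspots at roughly the 5% level
cold = z < -1.96  # candidate coldspots
print(np.round(z, 2), hot, cold)
```

Computing the same statistic once on notification rates and once on predicted recent-transmission rates, then comparing the two sets of flagged areas, is one way an overlap figure like the one reported in RESULTS could be derived. In practice a correction for multiple testing would usually be applied before declaring hotspots.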