Identification of city specific important bacterial signature for the MetaSUB CAMDA challenge microbiome data

Abstract

BACKGROUND: Whole genome sequencing (WGS) metagenomic data from samples collected in several cities around the globe may reveal city-specific microbial signatures. Illumina MiSeq sequencing data from 12 cities in 7 countries were provided as part of the 2018 CAMDA "MetaSUB Forensic Challenge", along with samples from three mystery sets. We applied machine learning techniques to this large dataset to identify the geographical provenance of the "mystery" samples. Additionally, we pursued compositional data analysis to develop accurate inferential techniques for such microbiome data. Because the current data are of higher quality and greater sequencing depth than the CAMDA 2017 MetaSUB challenge data, we expect that, combined with improved analytical techniques, they will yield more robust and useful results that can benefit forensic analysis.

RESULTS: A preliminary quality screening revealed a much better dataset in terms of Phred quality score (hereafter Phred score), with larger paired-end MiSeq reads and a more balanced experimental design, although the number of samples still differed across cities. Principal component analysis (PCA) showed interesting clusters of samples, and the first three components explained a large share of the variability in the data (~70%). The classification analysis was consistent across both testing mystery sets, with a similar percentage of samples correctly predicted (up to 90%). Analysis of the relative abundance of bacterial "species" showed that some "species" are specific to certain regions and can play an important role in prediction. These results were corroborated by the variable importance assigned to the "species" during the internal cross-validation (CV) run with Random Forest (RF).
CONCLUSIONS: The unsupervised analysis (PCA and two-way heatmaps) of the log2-cpm normalized data and the differential analysis of relative abundance suggested that the bacterial signature of common "species" was distinctive across cities, which was also supported by the variable importance results. The city predictions for mystery sets 1 and 3 were convincing, with high classification accuracy and consistency. This work on the current MetaSUB data, and the analytical tools used here, can help forensic science, metagenomics, and related fields predict the city of provenance of metagenomic samples. Additionally, the pairwise analysis of relative abundance yielded "species" that were consistent and comparable with the variables ranked as important by the classifier.

REVIEWERS: This article was reviewed by Manuela Oliveira, Dimitar Vassilev, and Patrick Lee.
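
The abstract refers to log2-cpm normalized data and to relative abundance without spelling out either transform. The following is a minimal sketch of both for a single sample, not the authors' pipeline. The function names, the taxon labels, the toy counts, and the pseudo-count of 1 added before taking the logarithm are all assumptions made for illustration.

```python
import math


def relative_abundance(counts):
    """Convert raw per-taxon counts for one sample into proportions summing to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {taxon: n / total for taxon, n in counts.items()}


def log2_cpm(counts, pseudo=1.0):
    """Return log2(counts per million + pseudo) for each taxon in one sample.

    The pseudo-count (an assumed value) keeps zero counts finite.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no reads")
    return {
        taxon: math.log2(n / total * 1e6 + pseudo)
        for taxon, n in counts.items()
    }


# Hypothetical read counts for one sample (illustrative only).
sample = {"taxon_a": 500, "taxon_b": 250, "taxon_c": 250, "taxon_d": 0}

rel = relative_abundance(sample)
lcpm = log2_cpm(sample)
```

Scaling to counts per million removes differences in sequencing depth between samples, and the log transform compresses the range of abundances before PCA or heatmap clustering.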
