Abstract
BACKGROUND: Microbiome sequencing data are often collected from several body sites, and the measurements from different sites are typically dependent. Our objective is to develop a model that enables joint analysis of data from different sites by capturing the underlying cross-site dependencies. The proposed model incorporates (i) latent factors shared across sites, which explain common subject effects and serve as the source of correlation between the sites, and (ii) mixtures of latent factors, which allow the cross-site associations to vary among subjects.
RESULTS: Our simulation studies demonstrate that the stronger the association between two sites, the greater the efficiency loss in regression analysis when that dependence is ignored in modeling. In a case study using samples from a study of the female urogenital microbiome and aging, our model detects covariate associations in the vaginal and urine microbiomes that are not statistically significant when a similar regression model is applied to the two sites separately.
CONCLUSIONS: We propose a latent factor model for microbiome sequencing data collected from multiple sites. It captures the presumed underlying cross-site associations without compromising estimation accuracy or inference efficiency when such associations are absent. In addition, the proposed model improves predictive performance by enabling the prediction of microbial abundance at one site from observations at another. We also provide an extended framework that allows for clustering of subjects (samples) and cluster-specific levels of paired association. Under this extended framework, clusters can be classified according to their association strengths.