Abstract
Environmental health studies increasingly measure endogenous omics data ($ \boldsymbol{M} $) to study the intermediary biological pathways by which an exogenous exposure ($ \boldsymbol{A} $) affects a health outcome ($ \boldsymbol{Y} $), given confounders ($ \boldsymbol{C} $). Mediation analysis is frequently performed to understand such mechanisms. When intermediary pathways are of interest, the literature has likely already established the statistical and biological significance of the total effect, defined as the effect of $ \boldsymbol{A} $ on $ \boldsymbol{Y} $ given $ \boldsymbol{C} $. For mediation models with continuous outcomes and mediators, we show that leveraging external summary-level information on the total effect can improve the estimation efficiency of the direct and indirect effects. Moreover, the efficiency gain depends on the asymptotic partial $ R^{2} $ between the outcome ($ \boldsymbol{Y}\mid\boldsymbol{M},\boldsymbol{A},\boldsymbol{C} $) and total effect ($ \boldsymbol{Y}\mid\boldsymbol{A},\boldsymbol{C} $) models: smaller values benefit direct effect estimation, whereas larger values benefit indirect effect estimation. We propose a robust data-adaptive estimation procedure, Mediation with External Summary Statistic Information, that improves estimation efficiency when the external information is congenial while protecting against bias when it is incongenial. In congenial simulation scenarios, we observe relative efficiency gains for mediation effect estimation of up to 40%. We illustrate our methodology using data from the Puerto Rico Testsite for Exploring Contamination Threats, where cytochrome P450 metabolites are hypothesized to mediate the effect of phthalate exposure on gestational age at delivery. External summary information on the total effect comes from a recently published pooled analysis of 16 studies. The proposed framework blends mediation analysis with emerging data integration techniques.
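
To make the relationship between the total, direct, and indirect effects concrete, the decomposition can be sketched under one common set of assumptions: linear outcome and mediator models with no exposure-mediator interaction. The coefficient symbols $ \theta $, $ \alpha $, and $ \beta $ and the error terms $ \epsilon $ are illustrative notation introduced here, not taken from the text above.

$$
\begin{aligned}
\boldsymbol{Y} &= \theta_{0} + \theta_{A}\boldsymbol{A} + \theta_{M}^{\top}\boldsymbol{M} + \theta_{C}^{\top}\boldsymbol{C} + \epsilon_{Y}, \\
\boldsymbol{M} &= \alpha_{0} + \alpha_{A}\boldsymbol{A} + \alpha_{C}\boldsymbol{C} + \epsilon_{M}, \\
\boldsymbol{Y} &= \beta_{0} + \beta_{A}\boldsymbol{A} + \beta_{C}^{\top}\boldsymbol{C} + \epsilon_{T}, \qquad \beta_{A} = \theta_{A} + \alpha_{A}^{\top}\theta_{M}.
\end{aligned}
$$

Under these assumed models, $ \theta_{A} $ plays the role of the direct effect, $ \alpha_{A}^{\top}\theta_{M} $ the indirect effect, and $ \beta_{A} $ the total effect. An external estimate of $ \beta_{A} $ therefore constrains the sum of the two mediation effects, which is one way to see why such information could sharpen their estimation.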