Abstract
BACKGROUND: Large language models (LLMs) have significantly advanced natural language processing in biomedical research; however, their reliance on implicit, statistical representations often leads to factual inaccuracies or hallucinations, a serious concern in high-stakes biomedical contexts.

RESULTS: To overcome these limitations, we developed BioThings Explorer Retrieval-Augmented Generation (BTE-RAG), a framework that integrates the reasoning capabilities of advanced language models with explicit mechanistic evidence retrieved from BioThings Explorer (BTE), an API federation of more than sixty authoritative biomedical knowledge sources. We systematically evaluated BTE-RAG against LLM-only baselines on three benchmark datasets that we derived from DrugMechDB, targeting gene-centric mechanisms (798 questions), metabolite effects (201 questions), and drug-biological process relationships (842 questions). On the gene-centric task, BTE-RAG increased accuracy from 51.0% to 75.8% for GPT-4o mini and from 69.8% to 78.6% for GPT-4o. On metabolite-focused questions, the proportion of responses with cosine similarity scores of at least 0.90 rose by 82% for GPT-4o mini and 77% for GPT-4o. Although overall accuracy was comparable on the drug-biological process benchmark, retrieval improved response concordance, yielding a greater than 10% increase in high-agreement answers (from 129 to 144) with GPT-4o. We additionally compared BTE-RAG with GeneGPT-based models on the GeneTuring gene-disease association benchmark and on our mechanistic gene benchmark, showing that the BTE-RAG layer consistently improves accuracy over alternative approaches.

CONCLUSION: Federated knowledge retrieval delivers transparent accuracy gains for LLMs, establishing BTE-RAG as a practical tool for mechanistic exploration and translational biomedical research.