Abstract
The rapid expansion of biomedical literature has made comprehensive manual synthesis increasingly difficult, creating a pressing need for AI systems that can reason across verified evidence rather than merely retrieve it. However, existing retrieval-augmented generation (RAG) methods often fall short on complex biomedical questions that require iterative reasoning and multi-step synthesis. Here, we developed Queryome, a deep research system consisting of specialized large language model (LLM) agents that adapt their orchestration dynamically to a wide range of queries. Using a hybrid semantic-lexical retrieval engine spanning 28.3 million PubMed abstracts, it performs iterative, evidence-grounded synthesis. On the MIRAGE benchmark, Queryome achieved 88.98% accuracy, surpassing prior systems by up to 14 percentage points, and improved reasoning accuracy on the biomedical subset of Humanity's Last Exam (HLE) from 15.8% to 19.3%. Moreover, in a review-article construction task, it earned the highest composite score when compared with the Deep Research tools from OpenAI, Google, Perplexity, and Scite.AI, reflecting its strong literature retrieval and synthesis capabilities.