Abstract
BACKGROUND: Chronic gastritis is highly prevalent and, without timely intervention, may eventually progress to gastric cancer. Managing chronic gastritis requires comprehensive lifestyle changes. However, the current health care environment does not support continuous follow-up by professional health care providers, making self-management a key component of postdiagnosis care. Researchers are increasingly exploring the use of large language models (LLMs) for patient management, but LLMs have limitations, including hallucinations, limited knowledge scope, and a lack of up-to-date information. Artificial intelligence (AI) agents may provide a more effective solution. Nevertheless, it remains uncertain whether AI agents can effectively support postdiagnosis self-management for patients with chronic gastritis.
OBJECTIVE: This study aimed to explore, from multiple perspectives, the effectiveness of AI agents in the postdiagnosis management of patients with chronic gastritis.
METHODS: We developed an LLM-based agent framework for the health management of patients with chronic gastritis that combines retrieval-augmented generation with a search engine tool. We collected real questions from patients with chronic gastritis in clinical settings and tested the framework's performance across different difficulty levels and scenarios. We analyzed its safety and robustness and compared it with state-of-the-art models to comprehensively evaluate its effectiveness.
RESULTS: Under a dual-evaluation framework comprising automated metrics and expert manual assessments, the AI agents substantially outperformed LLMs on high-complexity questions (embedding average score: 82.849 for AI agents vs 77.825 for LLMs) and were particularly effective in clinical consultation tasks. Physicians rated the safety of the agents on a 5-point Likert scale, giving a mean score of 4.98 (SD 0.15; 95% CI 4.96-4.99). Across 30 repeated experiments, the mean absolute deviations of the AI agents' embedding average score and BERTScore were 0.0167 and 0.0387, respectively. The safety and robustness analyses thus confirmed that the AI agents produce safe, stable responses with minimal variability. In addition, comparisons with advanced medical-domain LLMs (Baichuan-14B-M1 and MedGemma-27B) and a general-domain LLM (Qwen3-32B) showed that the AI agents in this study achieved outstanding performance in the field of chronic gastritis. Our findings underscore the superior reliability, interpretability, and practical applicability of AI agents over conventional LLMs in chronic gastritis management, offering a robust foundation for their broader adoption in health care settings.
CONCLUSIONS: AI agents based on LLMs have high application value in the management of chronic gastritis. They can effectively guide patients with chronic diseases in addressing common issues, which may reduce the workload of physicians and improve the quality of patient home care.
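The two automated quantities reported in RESULTS can be illustrated with a minimal sketch. This is not the study's evaluation code. It assumes that the embedding average score is the cosine similarity between the mean word vectors of a reference answer and a generated answer, and that the mean absolute deviation is taken around the mean of the repeated runs; any rescaling of the scores is not specified in the abstract and is not applied. The function names and the toy data are hypothetical.

```python
import math
from statistics import fmean


def mean_vector(vectors):
    """Column-wise mean of a list of equal-length word vectors."""
    return [fmean(column) for column in zip(*vectors)]


def embedding_average_score(reference_vectors, candidate_vectors):
    """Cosine similarity between the mean word vectors of two texts.

    Assumption: this mirrors the usual 'embedding average' metric;
    the abstract does not spell out the exact definition used.
    """
    ref = mean_vector(reference_vectors)
    cand = mean_vector(candidate_vectors)
    dot = sum(r * c for r, c in zip(ref, cand))
    norm = math.sqrt(sum(r * r for r in ref)) * math.sqrt(sum(c * c for c in cand))
    # Guard against all-zero vectors, for which cosine similarity is undefined.
    return dot / norm if norm else 0.0


def mean_absolute_deviation(scores):
    """Average absolute distance of each score from the mean of all scores.

    Assumption: deviation is measured around the mean, not the median.
    """
    center = fmean(scores)
    return fmean([abs(score - center) for score in scores])


# Toy word vectors standing in for a reference answer and a generated answer.
reference = [[0.2, 0.7, 0.1], [0.4, 0.5, 0.3]]
candidate = [[0.3, 0.6, 0.2], [0.5, 0.4, 0.2]]
similarity = embedding_average_score(reference, candidate)

# Toy scores standing in for the same question answered in repeated runs.
repeated_scores = [0.828, 0.829, 0.827, 0.830, 0.828]
spread = mean_absolute_deviation(repeated_scores)
```

A small `spread` across repeated runs corresponds to the low run-to-run variability that the robustness analysis describes.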