Abstract
BACKGROUND: Feasibility assessments based on real-world data enhance clinical trial design, but automating the conversion of eligibility criteria into database queries is hindered by the need for high accuracy and clear, usable outputs. OBJECTIVE: This study aims to develop an automated system that converts free-text eligibility criteria from ClinicalTrials.gov into Observational Medical Outcomes Partnership Common Data Model (OMOP CDM)-compatible Structured Query Language (SQL) queries and to systematically evaluate hallucination patterns across multiple large language models (LLMs) to identify optimal deployment strategies. METHODS: Our system employs a 3-stage preprocessing pipeline (segmentation, filtering, and simplification) that achieves a 58.2% token reduction while preserving clinical semantics. We compared the concept mapping performance of GPT-4 against USAGI using 357 clinical terms from 30 trials. For comprehensive evaluation, we analyzed 760 SQL generation attempts (19 trials × 8 LLMs × 5 prompting strategies) using the Synthetic Public Use Files (SynPUF) dataset and validated selected queries against National COVID Cohort Collaborative reference concept sets using Asan Medical Center's OMOP CDM database. RESULTS: GPT-4 achieved 48.5% concept mapping accuracy versus 32.0% for USAGI (P<.001), with domain-specific performance ranging from 72.7% (drug) to 38.3% (measurement). Surprisingly, the open-source Llama 3 8B model achieved the highest effective SQL rate (75.8%), compared with 45.3% for GPT-4, which we attribute to its lower hallucination rate (21.1% vs 33.7%). The overall hallucination rate was 32.7%, with wrong domain assignments (34.2%) and placeholder insertions (28.7%) being the most common error types.
Clinical validation revealed mixed performance: high concordance for type 1 diabetes (Jaccard index=0.81), complete failure for pregnancy (Jaccard index=0.00), and minimal overlap for type 2 diabetes (Jaccard index=0.03) despite perfect overlap coefficients in both diabetes cases; moderate performance was observed for uncontrolled hypertension (Jaccard index=0.18). CONCLUSIONS: Although LLMs can accelerate the transformation of eligibility criteria, hallucination rates of 21%-50% necessitate careful model selection and validation strategies. Our findings challenge assumptions about model superiority, demonstrating that smaller, cost-effective models can outperform larger commercial alternatives. Future work should focus on hybrid approaches that combine LLM capabilities with rule-based methods for handling complex clinical concepts.
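The divergence between a near-zero Jaccard index and a perfect overlap coefficient arises when one concept set is fully nested inside a much larger one. A minimal illustrative sketch of the two metrics (the concept ID sets below are hypothetical, not the study's actual concept sets):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

def overlap_coefficient(a: set, b: set) -> float:
    """Overlap coefficient: |A ∩ B| / min(|A|, |B|)."""
    return len(a & b) / min(len(a), len(b))

# Hypothetical concept IDs: a small reference set fully contained
# in a generated set roughly 33x its size.
reference = {1, 2, 3}
generated = set(range(1, 101))  # 100 concepts, includes all of reference

print(jaccard(reference, generated))              # 0.03
print(overlap_coefficient(reference, generated))  # 1.0
```

Because the overlap coefficient normalizes by the smaller set, it cannot detect over-broad query results the way the Jaccard index does.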