Abstract
BACKGROUND: Feasibility assessments based on real-world data enhance clinical trial design, but automating the conversion of eligibility criteria into database queries is hindered by the need for high accuracy and clear, usable outputs. OBJECTIVE: This study aims to develop an automated system that converts free-text eligibility criteria from ClinicalTrials.gov into Observational Medical Outcomes Partnership Common Data Model (OMOP CDM)-compatible Structured Query Language (SQL) queries and to systematically evaluate hallucination patterns across multiple large language models (LLMs) to identify optimal deployment strategies. METHODS: Our system employs a 3-stage preprocessing pipeline (segmentation, filtering, and simplification) that achieves a 58.2% token reduction while preserving clinical semantics. We compared the concept mapping performance of GPT-4 against USAGI using 357 clinical terms from 30 trials. For comprehensive evaluation, we analyzed 760 SQL generation attempts (19 trials × 8 LLMs × 5 prompting strategies) using the Synthetic Public Use Files (SynPUF) dataset and validated selected queries against National COVID Cohort Collaborative reference concept sets using Asan Medical Center's OMOP CDM database. RESULTS: GPT-4 achieved 48.5% concept mapping accuracy versus 32.0% for USAGI (P<.001), with domain-specific performance ranging from 72.7% (drug) to 38.3% (measurement). Surprisingly, the open-source Llama 3 8B model achieved the highest effective SQL rate (75.8%), compared with 45.3% for GPT-4, which we attribute to its lower hallucination rate (21.1% vs 33.7%). The overall hallucination rate was 32.7%, with wrong domain assignments (34.2%) and placeholder insertions (28.7%) being the most common error types.
Clinical validation revealed mixed performance: high concordance for type 1 diabetes (Jaccard index=0.81), complete failure for pregnancy (Jaccard index=0.00), and minimal overlap for type 2 diabetes (Jaccard index=0.03) despite perfect overlap coefficients in both diabetes cases; moderate performance was observed for uncontrolled hypertension (Jaccard index=0.18). CONCLUSIONS: Although LLMs can accelerate the transformation of eligibility criteria, hallucination rates of 21%-50% necessitate careful model selection and validation strategies. Our findings challenge assumptions about model superiority, demonstrating that smaller, cost-effective models can outperform larger commercial alternatives. Future work should focus on hybrid approaches that combine LLM capabilities with rule-based methods for handling complex clinical concepts.
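The divergence between a near-zero Jaccard index and a perfect overlap coefficient arises when one concept set is fully nested inside a much larger one. A minimal illustrative sketch of the two metrics (the concept ID sets below are hypothetical, not the study's actual concept sets):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard index: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

def overlap_coefficient(a: set, b: set) -> float:
    """Overlap coefficient: |A ∩ B| / min(|A|, |B|)."""
    return len(a & b) / min(len(a), len(b))

# Hypothetical concept IDs: a small reference set fully contained
# in a generated set roughly 33x its size.
reference = {1, 2, 3}
generated = set(range(1, 101))  # 100 concepts, includes all of reference

print(jaccard(reference, generated))              # 0.03
print(overlap_coefficient(reference, generated))  # 1.0
```

Because the overlap coefficient normalizes by the smaller set, it cannot detect over-broad query results the way the Jaccard index does.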