Abstract
BACKGROUND: Large language models (LLMs) continue to see enterprise-wide adoption in health care while evolving in number, size, complexity, cost, and, most importantly, performance. Performance benchmarks play a critical role in ranking LLMs on community leaderboards and, subsequently, in their adoption.

OBJECTIVE: Given the small operating margins of health care organizations and the growing interest in LLMs and conversational artificial intelligence (AI), there is an urgent need for objective approaches that can help identify viable LLMs without compromising performance. The objective of the present study is to generate taxonomy portraits of medical LLMs (n=33) whose domain-specific and domain-nonspecific multivariate performance benchmarks were available from the Open Medical-LLM and Open LLM leaderboards on Hugging Face.

METHODS: Hierarchical clustering of multivariate performance benchmarks is used to generate taxonomy portraits revealing the inherent partitioning of the medical LLMs across diverse tasks. The domain-specific taxonomy is generated using nine medicine-related performance benchmarks from the Hugging Face Open Medical-LLM initiative; a domain-nonspecific taxonomy is presented in tandem to assess performance on a set of six generic benchmarks and tasks from the Hugging Face Open LLM initiative. Subsequently, the nonparametric Wilcoxon rank-sum test and linear correlation are used to assess differential changes in the performance benchmarks between two broad groups of LLMs and potential redundancies among the benchmarks.

RESULTS: Two broad families of LLMs with statistically significant differences (α=.05) in performance benchmarks are identified for each of the taxonomies. Consensus in their performance on the domain-specific and domain-nonspecific tasks revealed the robustness of these LLMs across diverse tasks.
Subsequently, statistically significant correlations between performance benchmarks revealed redundancies, indicating that a subset of these benchmarks may suffice for assessing the domain-specific performance of medical LLMs.

CONCLUSIONS: Understanding medical LLM taxonomies is an important step toward identifying LLMs with similar performance that align with the needs, economics, and other demands of health care organizations. While the present study focuses on a subset of medical LLMs from the Hugging Face initiative, enhanced transparency of performance benchmarks and economics across a larger family of medical LLMs is needed to generate more comprehensive taxonomy portraits and accelerate their strategic and equitable adoption in health care.
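The analysis pipeline described in METHODS can be sketched on synthetic data as follows. This is a minimal illustration, not the study's actual 33-model benchmark matrix: the model counts, benchmark counts, and score distributions below are all hypothetical, and the scipy routines stand in for whatever implementation the authors used.

```python
# Illustrative sketch of the METHODS pipeline on synthetic data:
# hierarchical clustering of per-model benchmark profiles, a Wilcoxon
# rank-sum test between the two resulting groups, and a benchmark-to-
# benchmark correlation to flag redundancy. All numbers are hypothetical.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import ranksums, pearsonr

rng = np.random.default_rng(0)
# Synthetic scores: 12 models x 4 benchmarks, drawn as two families
scores = np.vstack([
    rng.normal(0.55, 0.03, size=(6, 4)),  # lower-performing family
    rng.normal(0.75, 0.03, size=(6, 4)),  # higher-performing family
])

# Ward-linkage hierarchical clustering on the multivariate profiles,
# cut into two broad groups (the "taxonomy portrait" partition)
Z = linkage(scores, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")

# Nonparametric Wilcoxon rank-sum test per benchmark between the groups
pvals = []
for j in range(scores.shape[1]):
    _, p = ranksums(scores[labels == 1, j], scores[labels == 2, j])
    pvals.append(p)
print("per-benchmark p-values:", [round(p, 4) for p in pvals])

# Linear correlation between a pair of benchmarks (redundancy check):
# a strong correlation suggests one benchmark may be informative of the other
r, p_corr = pearsonr(scores[:, 0], scores[:, 1])
print(f"corr(benchmark 0, benchmark 1): r = {r:.2f}")
```

Ward linkage is used here because it groups models by overall similarity of their benchmark profiles; with real leaderboard data, the dendrogram cut height (rather than a fixed two-cluster cut) would determine how many broad families emerge.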