Abstract
BACKGROUND: Machine learning, a subset of artificial intelligence, enables large language models (LLMs) such as ChatGPT and Gemini to analyze data, learn from context, and generate responses without explicit programming. Although machine learning and LLMs show promise in orthopedic image analysis and patient education, it remains unclear whether LLMs provide accurate responses to patient queries.

METHODS: Our purpose was to evaluate the alignment of ChatGPT's and Gemini's recommendations for managing glenohumeral arthritis with the American Academy of Orthopaedic Surgeons (AAOS) Evidence-Based Clinical Practice Guidelines (CPGs). Both ChatGPT and Gemini were asked questions based on strong- or moderate-strength recommendations from the AAOS CPG on the management of glenohumeral arthritis. Responses were classified by 2 independent reviewers as "agree," "neutral," or "disagree" according to their alignment with the CPG recommendations. Cohen's kappa coefficient was used to assess interobserver reliability, and Fisher's exact test was used to compare the accuracy of responses between LLMs.

RESULTS: Of the 12 strong- or moderate-strength recommendations, ChatGPT and Gemini provided responses in agreement with the CPG for 6 (50%) and 9 (75%), respectively. There was no significant difference in performance between ChatGPT and Gemini. ChatGPT provided 6 study references, of which 4 were nonexistent or inappropriately cited; Gemini provided 32 study references, of which 5 were nonexistent or inappropriately cited.

CONCLUSION: Neither ChatGPT nor Gemini consistently provides responses that align with the AAOS CPG for the management of glenohumeral arthritis, and responses frequently included confabulated details or inappropriately referenced articles. Consequently, physicians and patients should use these artificial intelligence platforms with caution when seeking advice on the management of glenohumeral arthritis.
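The statistical analysis described in the METHODS can be illustrated with a minimal sketch; the reviewer labels below are hypothetical placeholders (only the 6/12 and 9/12 agreement counts come from the abstract), and this is not the authors' analysis code.

```python
# Minimal sketch, assuming hypothetical reviewer classifications; not the study's data or code.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import fisher_exact

# Hypothetical labels assigned by two independent reviewers to 12 LLM responses
reviewer_1 = ["agree"] * 6 + ["neutral"] * 3 + ["disagree"] * 3
reviewer_2 = ["agree"] * 6 + ["neutral"] * 2 + ["disagree"] * 4

# Cohen's kappa: interobserver reliability beyond chance agreement
kappa = cohen_kappa_score(reviewer_1, reviewer_2)

# 2x2 contingency table (rows: ChatGPT, Gemini; columns: agree, not agree),
# built from the reported agreement counts of 6/12 and 9/12
table = [[6, 6], [9, 3]]
odds_ratio, p_value = fisher_exact(table)

print(f"Cohen's kappa: {kappa:.2f}")
print(f"Fisher's exact test p-value: {p_value:.3f}")
```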