Abstract
BACKGROUND: Artificial intelligence (AI) and large language models (LLMs) are increasing in sophistication and are being integrated into many disciplines. The potential for LLMs to augment clinical decision-making is an evolving area of research. OBJECTIVE: This study compared the responses of over 1000 kidney specialist physicians (nephrologists) with the outputs of commonly used LLMs on a questionnaire determining when a kidney biopsy should be performed. METHODS: The research group conducted a large online questionnaire of nephrologists to determine when a kidney biopsy should be performed. The questionnaire was co-designed with patient input, refined through multiple iterations, and piloted locally before international dissemination. It was the largest international study in the field and demonstrated variation among human clinicians in biopsy propensity relating to human factors, such as sex and age, as well as systemic factors, such as country, job seniority, and technical proficiency. The same questions were put to both human doctors and LLMs in an identical order in a single session. Eight commonly used LLMs were interrogated: ChatGPT-3.5, Mistral Hugging Face, Perplexity, Microsoft Copilot, Llama 2, GPT-4, MedLM, and Claude 3. The most common response given by clinicians (the human mode) for each question was taken as the baseline for comparison. Questionnaire responses on the indications and contraindications for biopsy generated a score (range 0-44) reflecting biopsy propensity, with a higher score serving as a surrogate marker for greater tolerance of the potential associated risks. RESULTS: The ability of LLMs to reproduce human expert consensus varied widely: some models demonstrated a balanced approach to risk similar to that of human clinicians, while others produced outputs at either extreme of the risk-tolerance spectrum.
In terms of agreement with the human mode, ChatGPT-3.5 and GPT-4 (OpenAI) had the highest levels of alignment, each agreeing with the human mode on 6 out of 11 questions. The total biopsy propensity score generated from the human mode was 23 out of 44, and both OpenAI models produced similar propensity scores of 22-24. However, Llama 2 and Microsoft Copilot also scored within this range despite poorer alignment with the human consensus, agreeing on only 2 out of 11 questions. The most risk-averse model in this study was MedLM, with a propensity score of 11, and the least risk-averse was Claude 3, with a score of 34. CONCLUSIONS: The outputs of LLMs demonstrated a modest ability to replicate human clinical decision-making in this study; however, performance varied widely between models. Questions with more uniform human responses produced LLM outputs with higher alignment, whereas questions with lower human consensus showed poorer alignment. This may limit the practical use of LLMs in real-world clinical practice.
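The comparison described above (taking the per-question human mode as the baseline, counting agreement, and summing a propensity score) can be sketched in a few lines. This is a minimal illustrative sketch, not the study's analysis code; the function names and all response data below are invented for illustration.

```python
from statistics import mode

def human_mode(responses_per_question):
    """Most common clinician response for each question (the human mode)."""
    return [mode(responses) for responses in responses_per_question]

def alignment(llm_answers, baseline):
    """Number of questions on which the LLM's answer matches the human mode."""
    return sum(a == b for a, b in zip(llm_answers, baseline))

def propensity_score(scored_answers):
    """Sum of per-question risk-tolerance scores (the study's range was 0-44)."""
    return sum(scored_answers)

# Toy example: 3 questions, 5 clinicians each (hypothetical coded responses)
clinicians = [[1, 1, 0, 1, 0], [0, 0, 0, 1, 0], [2, 2, 1, 2, 2]]
baseline = human_mode(clinicians)   # -> [1, 0, 2]
llm = [1, 1, 2]
print(alignment(llm, baseline))     # -> 2 (agrees on 2 of 3 questions)
```

A model can thus match the human total propensity score while disagreeing question by question, which is the pattern the study reports for Llama 2 and Microsoft Copilot.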