Abstract
Large language model (LLM)-based chatbots are increasingly used for behavioral health support, yet few studies have rigorously evaluated their advice on alcohol misuse. We evaluated seven publicly available chatbots, including general-purpose and behavioral health-focused tools, in responding to questions related to alcohol misuse. Using a fictional case, we simulated longitudinal chatbot interactions over seven days with 25 prompts derived from real-world Reddit posts. Applying an evaluation framework specific to chatbots, four clinicians independently rated each chatbot's transcript along five domains: empathy, quality of information, usefulness, responsiveness, and scope awareness. Clinicians also assessed secondary dimensions, including stigmatizing language and challenging the user (versus only validating feelings). We generated descriptive statistics on performance and identified examples of problematic output. Across all chatbots, empathy was the highest-rated domain (mean score 4.6/5), while quality of information was the lowest (mean 2.7/5). Overall mean performance scores varied considerably across chatbots, ranging from 2.1 (SD 1.1) to 4.5 (SD 0.8). There were no significant differences in performance between behavioral health-focused and general-purpose chatbots. Every chatbot produced one or more examples of guidance deemed inappropriate, overstated, or inaccurate. All avoided stigmatizing or judgmental language and supported self-efficacy. Chatbots varied widely in their perceived ability to support individuals with alcohol misuse. While generally strong in empathy, they have room for improvement in response quality. As chatbot use expands, users and clinicians should be aware of the strengths and weaknesses of chatbots in providing advice on alcohol misuse.