Publicly Available Large Language Models for Trichoscopy: A Head-to-Head Comparison with Dermatologists

Abstract

Background/Objectives: Trichoscopy is an important diagnostic tool for hair and scalp disorders, but it requires significant expertise. Publicly available large language models (LLMs) are increasingly used by both physicians and patients, yet their usefulness in trichology is unknown. We aimed to evaluate the diagnostic accuracy of four publicly available LLMs in interpreting trichoscopic images and to compare their performance with that of dermatology residents, board-certified dermatologists, and trichology experts.
Methods: In this prospective comparative study, a preprocessed set of trichoscopic images was assessed in an online image-based survey. To reduce recognition bias from public image repositories, all images were structurally transformed while preserving diagnostic features. Fifteen dermatologists (five residents, four board-certified dermatologists, six trichology experts) provided a suspected diagnosis (SD) and up to three differential diagnoses (DD). Four LLMs (ChatGPT-4o, Claude Sonnet 4, Gemini 2.5 Flash, and Grok-3) evaluated the images under the same conditions.
Results: The overall diagnostic accuracy among the 15 dermatologists was 58.1% (95% CI, 53.0-63.0) for SD and 68.3% (95% CI, 63.4-72.8) for SD + DD. Experts significantly outperformed residents and board-certified dermatologists. The AI models achieved an accuracy of 18.2% (95% CI, 11.8-26.9) for SD and 44.4% (95% CI, 35.0-54.3) for SD + DD. Gemini 2.5 Flash performed best, with an accuracy of 62.5% for SD + DD. Agreement among dermatologists increased with experience (AC1 up to 0.65 for experts), while agreement among AI models was moderate to good (AC1 up to 0.70). Agreement between AI models and dermatologists was only slight to fair (AC1 = 0.06 for SD and 0.21 for SD + DD). All human-AI differences were statistically significant (p < 0.001).
Conclusions: In trichology, publicly available LLMs currently underperform compared to human experts, especially in providing a single correct diagnosis. These models require further development and specialized training before they can reliably assist with trichological diagnoses in routine care.
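
The accuracies above are reported as proportions with 95% confidence intervals. As a rough illustration of how such an interval can be obtained, the following Python sketch computes a Wilson score interval for a binomial proportion. The abstract does not state which interval method was used or the underlying counts, so both the choice of the Wilson method and the example counts (18 correct out of 99) are assumptions made purely for illustration, and the function name `wilson_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, total: int, confidence: float = 0.95) -> tuple[float, float]:
    """Return the (lower, upper) Wilson score interval for a binomial proportion."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must lie between 0 and total")

    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / total

    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


# Hypothetical counts, NOT taken from the study: 18 correct answers out of 99.
low, high = wilson_interval(18, 99)
print(f"accuracy {18 / 99:.1%}, 95% CI {low:.1%} to {high:.1%}")
```

The Wilson interval is often preferred over the simple normal approximation when proportions are far from 50% or samples are small, because its bounds never leave the range from 0 to 1.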
