Abstract
INTRODUCTION: This study aimed to evaluate the readability, quality, reliability, similarity, and length of texts generated by ChatGPT on common rheumatic diseases and to compare their content with American College of Rheumatology (ACR) patient education fact sheets.

MATERIAL AND METHODS: Fifteen common rheumatic diseases were included based on the ACR fact sheets. Questions about disease characteristics, symptoms, treatments, and lifestyle recommendations were generated from the ACR content and entered into ChatGPT-4 for comparison. Readability was assessed using the Flesch-Kincaid Grade Level (FKGL), Flesch Reading Ease (FRE), and Simple Measure of Gobbledygook (SMOG) indices. Quality and reliability were evaluated using the DISCERN questionnaire and the Ensuring Quality Information for Patients (EQIP) tool. Text similarity was measured using cosine similarity, and word counts were obtained with Microsoft Word.

RESULTS: ChatGPT-generated texts had significantly higher FKGL scores (14.3 vs. 12.7; p = 0.007) and SMOG scores (p < 0.001), indicating greater linguistic complexity, and significantly lower FRE scores (35.8 vs. 43.7; p < 0.001). The mean DISCERN score for ChatGPT was significantly lower than for the ACR fact sheets (46 vs. 52; p < 0.001), suggesting reduced reliability, whereas EQIP quality scores did not differ significantly (p = 0.744). Cosine similarity between ChatGPT and ACR texts averaged 0.69 (range: 0.57-0.76), indicating moderate content overlap. ChatGPT texts were more than twice as long, with a median word count of 1,109 versus 450 for the ACR materials (p < 0.001).

CONCLUSIONS: Despite the moderate similarity, ChatGPT-generated texts on rheumatic diseases were more complex, less reliable, and longer than the ACR fact sheets. These findings highlight the need to improve the readability, accuracy, and reliability of artificial intelligence-driven healthcare tools so that they align more closely with expert-reviewed resources.
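The automated metrics named above (FKGL, FRE, SMOG, cosine similarity, word count) can be reproduced programmatically. The sketch below is a minimal illustration only: it assumes Python's textstat and scikit-learn libraries and TF-IDF vectorization for the cosine similarity, none of which are specified in the study; the DISCERN and EQIP instruments are rater-scored and are therefore not included.

```python
# Minimal sketch (assumed tools, not the study's actual pipeline):
# readability and similarity metrics for a ChatGPT-generated text
# versus the matching ACR fact sheet.
import textstat
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def compare_texts(chatgpt_text: str, acr_text: str) -> dict:
    scores = {}
    for label, text in (("ChatGPT", chatgpt_text), ("ACR", acr_text)):
        scores[label] = {
            "FKGL": textstat.flesch_kincaid_grade(text),  # U.S. grade level
            "FRE": textstat.flesch_reading_ease(text),    # 0-100, higher = easier
            "SMOG": textstat.smog_index(text),
            "words": len(text.split()),
        }
    # Cosine similarity on TF-IDF vectors (the vectorization choice is an assumption)
    tfidf = TfidfVectorizer().fit_transform([chatgpt_text, acr_text])
    scores["cosine_similarity"] = float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])
    return scores
```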