Abstract
BACKGROUND: OpenAI developed ChatGPT as an advanced artificial intelligence (AI)-driven natural language processing system. ChatGPT generates responses using statistical patterns learned during pretraining. OBJECTIVE: To ascertain whether ChatGPT could respond to patients with breast cancer in a manner consistent with evidence-based medical practice and a breast cancer clinical guideline, a practical pocket book based on the latest evidence and national data, and to evaluate the ability of AI to provide accurate, up-to-date information to patients, potentially serving as a supplementary resource for medical professionals. METHODS: The research team designed a series of tests to assess the responses of ChatGPT to specific questions related to breast cancer diagnosis, treatment options, and post-treatment care. Thirty clinically validated breast cancer questions spanning diagnosis, prognosis, treatment, and pharmacotherapy were administered in three iterative trials to (1) GPT-3.5 and GPT-4.0 (5-minute interval between trials) and (2) three breast surgeons stratified by expertise (high, medium, and low). Each response was scored dichotomously against the standard answer (1 = guideline-consistent; 0 = inconsistent), and summing the three scores gave a total of 0 to 3 per question. Data analysis included mean score comparisons (analysis of variance with post hoc Tukey tests), subgroup analyses by question category, and inter-rater reliability assessment. RESULTS: Performance comparison between GPT-3.5 and GPT-4.0 across breast surgery subspecialties and question types revealed that GPT-4.0 generally outperformed GPT-3.5, although the difference in mean scores was not significant for most items.
We found that GPT-3.5 has the same medical response ability as lower-qualified breast surgeons, while GPT-4.0 has the same ability as higher-qualified breast surgeons.
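The scoring scheme described in the Methods can be illustrated with a short sketch. It assumes each question receives three dichotomous marks that are summed to a 0 to 3 total and then averaged across questions. The function names and the example marks are hypothetical and are not taken from the study; the analysis of variance and Tukey tests are not reproduced here.

```python
from statistics import mean


def question_score(trial_marks):
    """Sum three dichotomous marks (1 = guideline-consistent, 0 = inconsistent)."""
    if len(trial_marks) != 3 or any(m not in (0, 1) for m in trial_marks):
        raise ValueError("expected exactly three marks, each 0 or 1")
    return sum(trial_marks)


def mean_score(marks_by_question):
    """Average the 0-to-3 question totals for one responder."""
    return mean(question_score(marks) for marks in marks_by_question)


# Made-up marks for four questions, for illustration only (not study data).
example_marks = [(1, 1, 1), (1, 0, 1), (0, 0, 0), (1, 1, 0)]
example_mean = mean_score(example_marks)
```

Under this assumed scheme, a responder's mean score lies between 0 and 3, which is the quantity that would then be compared across GPT-3.5, GPT-4.0, and the three surgeons.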