Abstract
BACKGROUND: With the growing use of generative AI in scientific writing, distinguishing AI-generated from human-authored content has become a pressing challenge. It remains unclear whether ChatGPT (OpenAI, San Francisco, CA) can accurately and consistently recognize its own output.

METHODS: We randomly selected 100 research articles published in 2000, before the advent of generative AI, from 10 high-impact internal medicine journals. For each article, a structured abstract was generated with ChatGPT-4.0 from the full PDF. The original and AI-generated abstracts (n = 200) were then evaluated twice by ChatGPT-4.0, which was asked to rate the likelihood that each abstract was AI-generated on a 0-10 scale (0 = definitely human, 10 = definitely ChatGPT, 5 = undetermined). Scores of 0-4 were classified as human-authored and scores of 6-10 as AI-generated.

RESULTS: Misclassification rates were high in both evaluation rounds (49% and 47.5%). No abstract received a score of 5. Score distributions overlapped substantially between the two groups, with no statistically significant difference (Wilcoxon p = 0.93 and 0.21 in rounds one and two, respectively). Between the two rounds, Cohen's kappa for the binary classification was 0.33 (95% CI: 0.19-0.46) and the weighted kappa on the 0-10 scale was 0.24 (95% CI: 0.15-0.34), both reflecting poor agreement.

CONCLUSION: ChatGPT-4.0 cannot reliably identify whether a scientific abstract was written by itself or by human authors. More robust external tools are needed to ensure transparency in academic authorship.
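
As a minimal illustration of the analysis described above, the following Python sketch reproduces the scoring rule, the per-round misclassification rates, the Wilcoxon comparison, and the between-round kappas on synthetic placeholder scores. The paired signed-rank variant of the Wilcoxon test and the linear weighting of the kappa are assumptions, since the abstract specifies neither; all data here are hypothetical and are not the study's results.

import numpy as np
from scipy.stats import wilcoxon
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Hypothetical placeholder scores: 0-10 likelihood ratings from two
# evaluation rounds for 100 original and 100 AI-generated abstracts.
# The value 5 is excluded, mirroring the observation that no abstract
# received an "undetermined" score.
values = np.array([0, 1, 2, 3, 4, 6, 7, 8, 9, 10])
round1_original = rng.choice(values, 100)
round1_generated = rng.choice(values, 100)
round2_original = rng.choice(values, 100)
round2_generated = rng.choice(values, 100)

def to_binary(scores):
    # Scores of 0-4 -> human-authored (0); scores of 6-10 -> AI-generated (1).
    return (scores >= 6).astype(int)

rounds = [(round1_original, round1_generated),
          (round2_original, round2_generated)]

for r, (orig, gen) in enumerate(rounds, start=1):
    # Misclassification: an original abstract scored as AI-generated,
    # or a generated abstract scored as human-authored (n = 200 per round).
    errors = np.sum(to_binary(orig) == 1) + np.sum(to_binary(gen) == 0)
    print(f"Round {r} misclassification rate: {errors / 200:.1%}")

    # Paired Wilcoxon signed-rank test comparing score distributions of
    # original vs. generated abstracts (each article contributes one pair).
    stat, p = wilcoxon(orig, gen)
    print(f"Round {r} Wilcoxon p = {p:.2f}")

# Test-retest agreement between the two rounds across all 200 abstracts.
r1 = np.concatenate([round1_original, round1_generated])
r2 = np.concatenate([round2_original, round2_generated])
kappa_binary = cohen_kappa_score(to_binary(r1), to_binary(r2))
kappa_weighted = cohen_kappa_score(r1, r2, weights="linear")
print(f"Binary kappa = {kappa_binary:.2f}, weighted kappa = {kappa_weighted:.2f}")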