Abstract
BACKGROUND: As artificial intelligence continues to play an expanding role in healthcare, ensuring its compliance with medical ethics is essential. However, the ethical performance of artificial intelligence in medical contexts remains insufficiently studied. This study aimed to evaluate the ability of ChatGPT to address questions related to medical ethics and to compare its performance with that of human experts.

METHODS: A Medical Ethics Evaluation dataset was developed, consisting of 465 single-choice questions derived from a range of medical ethics standards. These questions were used to assess two artificial intelligence models, GPT-3.5 and GPT-4. Model responses were compared with those provided by two medical ethics experts. Each test was conducted independently twice to ensure consistency. Accuracy was calculated for each model and expert, and chi-square tests were used to compare differences in performance.

RESULTS: GPT-3.5 achieved an overall accuracy of 38.92%, while GPT-4 achieved 27.10%. In comparison, the two medical ethics experts achieved substantially higher accuracies of 86.23% and 78.32%, respectively. Both experts performed significantly better than GPT-3.5 and GPT-4. These findings indicate a substantial gap between artificial intelligence models and human experts in understanding and applying medical ethics principles. The relatively low performance of the models, compared with their reported strengths in diagnostic tasks, may reflect the complexity and nuance of ethical reasoning in medicine. Nevertheless, the large language models showed some ability to align with core medical ethics principles, particularly in ethical dilemma scenarios, and were also able to generate responses that addressed psychological needs.

CONCLUSIONS: Artificial intelligence models currently show limited accuracy in medical ethics decision-making compared with human experts.
Although these models demonstrate some alignment with fundamental ethical principles, their performance is not yet sufficient for reliable use in ethically sensitive medical contexts. Further optimization is needed to improve their ability to meet the ethical demands of medical practice.