Abstract
OBJECTIVE: Large language models (LLMs) can process text for various applications, including surgical pathology reports, but studies have focused primarily on English. Their performance has not been systematically studied in a low-resource language. To analyze the performance of various LLMs, we selected 759 Turkish pathology reports from 5 different procedures. METHODS: We used 10 examples from each procedure to optimize prompts for OpenAI's GPT-3.5 Turbo, GPT-4o mini, and GPT-4o. The remaining reports were used to test generalizability. RESULTS: GPT-4o performed best on the Turkish reports, outperforming GPT-3.5 Turbo by 12%-25% and GPT-4o mini by 3%-16%. Translating the reports into English improved accuracy, especially for GPT-3.5 Turbo and GPT-4o mini, whereas GPT-4o showed comparable results for Turkish and English. A 12% to 22% performance gap was observed between GPT-4o and GPT-3.5 Turbo on the English-translated reports. Domain-related tips in prompts increased accuracy. For all models, results on the larger test sets were consistent with those on the validation set. GPT-4o yielded the most accurate results, GPT-4o mini showed intermediate performance, and GPT-3.5 Turbo was the least accurate. CONCLUSIONS: To our knowledge, this is the first study in the literature to evaluate the performance of GPT models on Turkish surgical pathology reports. The results indicate that data extracted by GPT-4o are almost ready for direct application.