Abstract
OBJECTIVE: This study aimed to improve transcription accuracy for Korean hospital telephone consultations by fine-tuning the Whisper large-v3-turbo model. The goal was to assess whether domain-specific adaptation enhances automatic speech recognition (ASR) performance across speaker types in telemedicine.

METHODS: I used a publicly available speech corpus comprising 1,272,630 Korean-language audio files (∼1,300 h) from telemedicine interactions involving doctors, nurses, and patients. Audio signals were standardized (16 kHz, 16-bit) and paired with normalized transcripts. The Whisper model was fine-tuned using supervised learning with data augmentation (SpecAugment, speed perturbation, and noise injection) and speaker normalization. Performance was evaluated using word error rate (WER) and character error rate (CER), with statistical tests (Wilcoxon signed-rank test and sign test) applied across speaker groups.

RESULTS: The fine-tuned model consistently outperformed the baseline. In the patient group, WER improved from 22.92% to 22.42% and CER from 5.32% to 4.98%. Statistically significant improvements were observed for doctors and patients (p < .001), whereas changes in the nurse data were not significant because baseline error was already low. CER better reflected transcription fidelity in Korean, as it was less affected by the morphological variation and word-segmentation errors typical of agglutinative languages. Loss monitoring confirmed stable convergence without overfitting.

CONCLUSION: Domain-specific fine-tuning of Whisper improves ASR performance in Korean telemedicine, especially for spontaneous patient speech, and CER is more appropriate than WER for evaluating Korean ASR systems. These findings support the use of optimized ASR models for more accurate and reliable clinical documentation in digital health environments, with the potential to reduce clinician documentation burden, support continuity of care, and improve patient safety.
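As a pointer for readers who want to reproduce the setup, the sketch below shows how Whisper large-v3-turbo can be loaded for Korean transcription with SpecAugment-style masking enabled through the model configuration. It assumes the Hugging Face Transformers stack, which the abstract does not name, and the masking probabilities are illustrative placeholders rather than the study's values.

```python
# Minimal sketch, assuming the Hugging Face Transformers stack (the abstract does not
# name a training toolkit). Loads Whisper large-v3-turbo for Korean transcription and
# enables SpecAugment-style masking via the model config; the masking probabilities
# below are illustrative, not the study's values.
from transformers import WhisperForConditionalGeneration, WhisperProcessor

MODEL_ID = "openai/whisper-large-v3-turbo"

processor = WhisperProcessor.from_pretrained(MODEL_ID, language="korean", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID)

# SpecAugment time/feature masking is applied to the log-Mel input features during training.
model.config.apply_spec_augment = True
model.config.mask_time_prob = 0.05      # illustrative value
model.config.mask_feature_prob = 0.05   # illustrative value

# Supervised fine-tuning then proceeds over (input_features, labels) pairs built with
# `processor`, e.g. via Seq2SeqTrainer; data preparation and training arguments are omitted here.
```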
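For reference, WER and CER as reported above follow the standard edit-distance definitions, where S, D, and I are the substitution, deletion, and insertion counts against the reference transcript and N is the reference length in words or characters, respectively:

```latex
\mathrm{WER} = \frac{S_w + D_w + I_w}{N_w}, \qquad
\mathrm{CER} = \frac{S_c + D_c + I_c}{N_c}
```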
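The CER-versus-WER point can be illustrated with a single Korean spacing error on a hypothetical sentence pair; the snippet below uses the jiwer package as an assumed scoring tool (the abstract does not name one).

```python
# Illustrative only: one word-segmentation (spacing) error in an otherwise
# character-perfect hypothesis. Word-level scoring counts it as two word errors
# out of four reference words, while character-level scoring counts a single
# inserted character, so CER stays low.
import jiwer

reference  = "환자가 어지럼증을 호소하고 있습니다"    # "The patient is complaining of dizziness."
hypothesis = "환자가 어지럼 증을 호소하고 있습니다"   # same characters, one extra space

print(f"WER = {jiwer.wer(reference, hypothesis):.3f}")  # ~0.50 (substitution + insertion over 4 words)
print(f"CER = {jiwer.cer(reference, hypothesis):.3f}")  # ~0.05 (one inserted space over ~19 characters)
```

With this single spacing error, word-level scoring penalizes two of four reference words while character-level scoring counts one inserted character, which mirrors the abstract's argument that CER is the more faithful metric for agglutinative Korean.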