Abstract
BACKGROUND AND OBJECTIVE: Accurate diagnosis of thyroid nodules from ultrasound images depends heavily on the clinical expertise of radiologists. This reliance poses significant challenges in underdeveloped countries and regions where access to specialized medical resources is limited. Recently, multi-modal large language models (M-LLMs) have shown promise in handling heterogeneous data, such as images and text, making them attractive candidates for automating labor-intensive diagnostic tasks. However, M-LLMs often struggle with ultrasound diagnosis of thyroid nodules for two main reasons: (1) without domain-specific fine-tuning, they are prone to generating hallucinated content, especially in classification tasks that demand expert-level decision-making; and (2) the multi-modal ultrasound datasets of thyroid nodules that are essential for fine-tuning M-LLMs are prohibitively costly and laborious to build. METHODS: We propose a novel multi-modal prompt-tuning method based on ultrasound images and textual descriptions to assist radiologists in diagnosing the etiology of thyroid nodules. Our approach uses an image encoder and a prompt-tuning framework to learn effective representations from both modalities without expensive full-model fine-tuning. The resulting features are then fed into a multi-layer perceptron (MLP), which fuses the multi-modal relationships to complement the image features and support the diagnosis of thyroid nodules. RESULTS: Extensive experiments on publicly available and privately enrolled datasets demonstrate that our method achieved state-of-the-art performance. Our method significantly outperformed traditional single-modality methods, with accuracy improvements of up to 40.62% over ResNet and 28.51% over AlexNet on the publicly available dataset.
Compared with other multi-modal models, our method improved accuracy and F1 score by up to 23.12% and 25.21%, respectively. CONCLUSIONS: Our method also surpasses all participating radiologists in accuracy, highlighting its strong potential to assist expert-level diagnostic decision-making and to provide scalable support for resource-limited clinical environments. In practice, it enables faster and more consistent thyroid nodule screening, thereby improving diagnostic efficiency.