Abstract
Visual Language Models (VLMs) have shown significant potential on multimodal tasks across a wide range of domains, such as medical image understanding and comprehensive diagnosis. In Traditional Chinese Medicine (TCM), VLMs have also achieved promising performance on various tasks, including symptom differentiation and constitution diagnosis. However, these TCM-related tasks face two key challenges: the lack of TCM-specific multimodal datasets and the weak association between examination results and TCM treatment strategies. To address these problems, we propose ConsTCM, a model trained via a two-stage fine-tuning pipeline that proceeds from basic SFT to specific SFT. The basic SFT stage uses general ophthalmoscopy datasets, encouraging the model to capture general features of fundus images. In the specific SFT stage, we further fine-tune the model on self-labeled real-world cases collected from hospitals, enhancing its performance on TCM constitution tasks. In addition, we construct the fine-tuning dataset by combining existing classification tasks with the real-world hospital data. Training with this pipeline yields ConsTCM. A series of experiments validates the effectiveness of our pipeline: ConsTCM outperforms baseline models by 5.6%. Moreover, an ablation study shows that the proposed two-stage pipeline significantly improves model performance, as evidenced by ConsTCM's 47% improvement over the base model in objective evaluation.