Abstract
Artificial intelligence has revolutionized medical image analysis by enabling automated, accurate, and efficient diagnostic solutions. This study introduces a novel self-boosting multimodal alignment framework for automated medical image report generation, built on Vision Transformer and BERT models. The proposed architecture adopts a dual-branch design comprising report generation and image-text matching modules. By enabling cooperative interactions between these branches, the framework achieves iterative performance gains and generates coherent, clinically relevant diagnostic reports. Evaluated on multiple benchmark datasets, the framework consistently outperforms state-of-the-art methods. On the IU-Xray dataset, the model achieves a BLEU-4 score of 0.316 and a CIDEr score of 0.441, improving on baseline models. On the MIMIC-CXR dataset, it attains a BLEU-4 score of 0.172 and a ROUGE-L score of 0.321, surpassing recent methods such as AM-MRG and MetTransformer by 26.5% and 10.3%, respectively. Generalization tests on specialized datasets such as DDSM and GTEx yield BLEU-4 scores of 0.291 and 0.170 and CIDEr scores of 0.541 and 0.397, respectively, underscoring the framework's adaptability across diverse medical imaging modalities. Ablation studies confirm the importance of the self-boosting module: removing it causes consistent performance declines across all metrics. Qualitative analyses further validate the clinical relevance of the generated reports, which align closely with the ground truth while remaining clear and interpretable. The framework streamlines diagnostic workflows and offers a scalable, robust solution for automated medical reporting, paving the way for greater efficiency and consistency in real-world clinical settings.
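To make the dual-branch design concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: it pairs a generation branch (a Transformer decoder over image features, standing in for the ViT-conditioned report generator) with a matching branch (a BERT-style text encoder plus a similarity head). All module names, dimensions, and pooling choices here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualBranchReportModel(nn.Module):
    """Hypothetical sketch of the dual-branch architecture described in the
    abstract: one branch generates report tokens from image features, the
    other scores image-report alignment. Names and sizes are assumptions."""

    def __init__(self, vocab_size=10000, d_model=768, n_heads=8, n_layers=3):
        super().__init__()
        # Stand-in for a ViT encoder: patch features are assumed precomputed
        # (e.g., 1024-dim) and projected to the shared model width.
        self.img_proj = nn.Linear(1024, d_model)
        # Branch 1: report generation via a Transformer decoder.
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)
        # Branch 2: image-text matching with a BERT-style text encoder
        # stand-in and a bilinear similarity head.
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.match_head = nn.Bilinear(d_model, d_model, 1)

    def forward(self, img_feats, report_ids):
        # img_feats: (B, P, 1024) patch features; report_ids: (B, T) token ids.
        mem = self.img_proj(img_feats)                       # (B, P, d_model)
        tok = self.tok_embed(report_ids)                     # (B, T, d_model)
        # Generation branch: causal decoding over the image memory.
        mask = nn.Transformer.generate_square_subsequent_mask(report_ids.size(1))
        dec = self.decoder(tok, mem, tgt_mask=mask)
        logits = self.lm_head(dec)                           # (B, T, vocab)
        # Matching branch: pooled image and text features -> alignment score.
        txt = self.text_encoder(tok).mean(dim=1)             # (B, d_model)
        img = mem.mean(dim=1)                                # (B, d_model)
        match_score = self.match_head(img, txt).squeeze(-1)  # (B,)
        return logits, match_score

# Usage sketch: in a self-boosting loop, the matching score could serve as a
# reward for re-ranking or fine-tuning generated reports (one plausible
# reading of the cooperative interaction the abstract describes).
model = DualBranchReportModel()
feats = torch.randn(2, 49, 1024)              # dummy batch of patch features
ids = torch.randint(0, 10000, (2, 32))        # dummy report token ids
logits, score = model(feats, ids)
```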