Abstract
BACKGROUND: Medical image segmentation is crucial for improving healthcare outcomes. Convolutional neural networks (CNNs) have been widely applied in medical image analysis; however, their inherent inductive biases limit their ability to capture global contextual information. Vision transformer (ViT) architectures address this limitation by leveraging attention mechanisms to model global relationships, but they typically require large-scale datasets for effective training, which is challenging in medical imaging due to limited data availability. This study aimed to combine the advantages of CNN and ViT architectures to improve segmentation performance on small-scale medical image datasets.

METHODS: We established a U-shaped network architecture based on a Transformer-assisted convolutional neural network (TAC-UNet). TAC-UNet is built around a hybrid structure that integrates CNN and Transformer components. Specifically, the hybrid architecture follows a dual-path design in which the Transformer branch continuously conveys global contextual information to the CNN backbone. This allows the CNN backbone to enhance its global perception while building on the local features it extracts, thereby improving its ability to comprehend complex image structures. A channel cross-attention (CCA) module is also incorporated as a bridge between the encoder and decoder to better reconcile the semantic discrepancies between them.

RESULTS: We conducted detailed experiments on three public datasets. Specifically, our model was trained on 30 images from the Multi-organ Nucleus Segmentation (MoNuSeg) training dataset, 85 images from the Gland Segmentation (GlaS) training dataset, and 551 images from the Computer Vision Center Colorectal Cancer-Clinic Database (CVC-ClinicDB) dataset, and evaluated on the corresponding test sets. TAC-UNet achieved the best Dice scores among all compared models: 80.36%, 90.70%, and 91.81% on MoNuSeg, GlaS, and CVC-ClinicDB, respectively. Compared with other CNN-based, Transformer-based, and hybrid methods, TAC-UNet demonstrated significantly superior segmentation performance.

CONCLUSIONS: TAC-UNet achieved advanced segmentation performance on small-scale medical image datasets, and the detailed experimental results demonstrated the effectiveness of the method. Our model's code is available at: https://github.com/hejlhello/TAC-UNet.
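The dual-path idea described in the METHODS section can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names, the identity attention projections, and the pluggable `conv_fn` local branch are all assumptions made for clarity; the actual TAC-UNet layers are defined in the linked repository.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections
    (illustrative only; a real Transformer branch learns projections)."""
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)   # global pairwise token affinities
    return softmax(scores, axis=-1) @ tokens  # attention-weighted mixture

def dual_path_block(feat, conv_fn):
    """One hypothetical dual-path step on an (H, W, C) feature map:
    a local CNN branch plus a global attention branch whose output is
    added back into the CNN path, injecting global context."""
    local = conv_fn(feat)                     # CNN branch: local features
    h, w, c = feat.shape
    tokens = feat.reshape(h * w, c)           # flatten spatial dims to tokens
    global_ctx = self_attention(tokens).reshape(h, w, c)  # Transformer branch
    return local + global_ctx                 # fuse global context into CNN path
```

A usage example with an identity stand-in for the convolution: `dual_path_block(np.random.rand(8, 8, 16), lambda x: x)` returns a fused `(8, 8, 16)` feature map.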
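The encoder-decoder bridge can likewise be sketched as a channel-wise gating step. The sketch below assumes a squeeze-and-excitation-style gate in which pooled encoder descriptors reweight decoder channels; the paper's exact CCA formulation may differ, and all names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_cross_attention(enc, dec):
    """Hypothetical channel cross-attention over (H, W, C) feature maps:
    encoder features produce per-channel weights that gate the decoder
    features, narrowing the semantic gap across the skip connection."""
    enc_desc = enc.mean(axis=(0, 1))                 # (C,) pooled encoder descriptor
    dec_desc = dec.mean(axis=(0, 1))                 # (C,) pooled decoder descriptor
    affinity = np.outer(dec_desc, enc_desc)          # (C, C) channel affinity
    weights = softmax(affinity, axis=-1) @ enc_desc  # (C,) attended encoder stats
    gate = 1.0 / (1.0 + np.exp(-weights))            # sigmoid gate in (0, 1)
    return dec * gate                                # reweight decoder channels
```

The output keeps the decoder's shape; each channel is scaled by a gate in (0, 1) derived from encoder statistics, which is the general mechanism a CCA-style bridge implements.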