Abstract
Deep learning has become a central tool for 3D medical image segmentation, yet learning effective representations from limited labeled data remains a key obstacle to practical deployment. Here, we present ResTRANS3D, a data-efficient self-supervised hybrid framework that couples a 3D-ResNet encoder with a multi-scale Transformer through a residual interaction mechanism, jointly modeling local spatial structure and long-range contextual dependencies. A dynamic position learning module generates adaptive positional representations conditioned on multi-scale features, while selective self-attention reduces the computational cost of global attention. The model is pretrained with a dual self-supervised strategy that integrates contrastive learning and image reconstruction. Experiments on multiple public 3D medical image benchmarks show that ResTRANS3D enables effective downstream segmentation, particularly when labeled data are scarce. These results highlight the potential of hybrid representation learning for data-efficient 3D medical image analysis.
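As a minimal sketch of the dual pretraining objective, one could assume an InfoNCE-style contrastive term and a voxel-wise reconstruction term balanced by a hypothetical weight $\lambda$; the abstract does not specify the exact formulation, so the forms below are standard illustrative choices rather than the paper's confirmed losses:

$$
\mathcal{L}_{\text{pretrain}} = \mathcal{L}_{\text{con}} + \lambda\,\mathcal{L}_{\text{rec}},
\qquad
\mathcal{L}_{\text{con}} = -\log \frac{\exp\!\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k \neq i} \exp\!\big(\mathrm{sim}(z_i, z_k)/\tau\big)},
\qquad
\mathcal{L}_{\text{rec}} = \big\lVert \hat{x} - x \big\rVert_2^2,
$$

where $z_i$ and $z_j$ are embeddings of two augmented views of the same volume, $\tau$ is a temperature parameter, and $\hat{x}$ is the reconstruction of the input volume $x$.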