Abstract
Portable imaging devices are improving the efficiency of medical image acquisition in resource-limited regions, but a shortage of medical personnel still delays timely diagnosis. We propose Embed-MedSAM, a fully automatic segmentation model with a low deployment cost. Built on MedSAM, it integrates a lightweight RepViT encoder to reduce computation and applies two-stage distillation on over one million multimodal medical images to preserve the original model's visual representations. We also introduce a self-prompting mechanism in which the model generates pseudo labels to guide fine-grained segmentation. Training jointly optimizes a KL-divergence distillation loss and segmentation losses to improve accuracy under prompt-free conditions. Embed-MedSAM performs strongly on 17 benchmark datasets spanning 7 imaging modalities: without external prompts, it improves the average Dice score by nearly 16% over the second-best model, and it runs at nearly 30 FPS on an iPhone 14, demonstrating strong potential for real-world deployment.
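The joint objective mentioned above can be illustrated with a minimal sketch: a KL-divergence term aligns the student's predicted distribution with the teacher's, while a soft Dice loss supervises the segmentation mask. The function names and the weighting factor `lambda_kl` below are illustrative assumptions, not the paper's actual implementation.

```python
import math

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two discrete probability distributions.

    Here p would be the teacher's output distribution and q the student's;
    this form of distillation is an assumption based on the abstract.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def dice_loss(pred, target, eps=1e-8):
    """Soft Dice loss between a predicted soft mask and a binary target mask."""
    intersection = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)

def joint_loss(teacher_probs, student_probs, pred_mask, target_mask, lambda_kl=0.5):
    """Weighted sum of the distillation and segmentation terms (hypothetical weighting)."""
    return (lambda_kl * kl_divergence(teacher_probs, student_probs)
            + dice_loss(pred_mask, target_mask))

# Toy example: a near-matching student distribution and a reasonable mask prediction.
loss = joint_loss([0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.9, 0.8, 0.1], [1, 1, 0])
```

In practice these terms would be computed per pixel over dense feature maps and logits; the list-based version above only shows how the two losses combine into one training objective.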