Abstract
Open-vocabulary object detection (OVD) is a critical research area in computer vision, with applications in autonomous driving and robotics. Many existing OVD methods adopt transformer architectures for image-text fusion, using self-attention to model complex cross-modal dependencies. However, self-attention scales quadratically with sequence length, making transformer-based approaches computationally demanding and limiting their practical deployment. To address this issue, we propose MambaOVD, a novel open-vocabulary object detection method built on the Mamba state-space architecture. MambaOVD comprises four key modules: an image encoder, a text encoder, a Mamba-based image-text fusion module, and a detection head. The image encoder extracts visual features, the text encoder produces text embeddings, the fusion module integrates multimodal information through Mamba layers, and the detection head performs object localization and classification. To evaluate the effectiveness of MambaOVD, we train the model on the Objects365 (V1) and GoldG datasets and evaluate it on the LVIS minival and AutoMine benchmarks. Experimental results show that MambaOVD outperforms state-of-the-art (SOTA) models, including YOLO-World-S, GLIPv2_T, and DetCLIP_T, in both qualitative and quantitative evaluations.