MambaOVD: a Mamba-based open-vocabulary object detection method

Abstract

Open-vocabulary object detection (OVD) is a critical research area in computer vision, particularly for applications in autonomous driving and robotics. Many existing OVD methods adopt transformer architectures for image-text fusion, using self-attention to model complex dependencies. However, the cost of self-attention grows quadratically with the number of tokens, so transformer-based approaches are often computationally demanding, which limits their practical deployment. To address this issue, we propose MambaOVD, a novel open-vocabulary object detection method based on the Mamba architecture, a selective state-space model whose cost grows linearly with sequence length. MambaOVD consists of four key modules: an image encoder, a text encoder, a Mamba-based image-text fusion module, and a detection head. The image encoder extracts visual features, the text encoder generates text embeddings, the fusion module integrates the multimodal information using Mamba layers, and the detection head performs object localization and classification. To evaluate MambaOVD, we trained the model on the Objects365 (V1) and GoldG datasets and tested it on the LVIS minival and AutoMine datasets. Experimental results show that MambaOVD outperforms state-of-the-art (SOTA) models, including YOLO-World-S, GLIPv2_T, and DetCLIP_T, in both qualitative and quantitative evaluations.
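The abstract does not describe how the Mamba-based fusion module is implemented. Purely as an illustration, the sketch below shows how a Mamba-style selective state-space scan could fuse text and image tokens in one linear-time pass and feed a dot-product classification head. Everything here is an assumption rather than the authors' method: one scalar state per channel, B = C = 1, text tokens placed before image tokens, toy hand-written vectors in place of the image and text encoders, and no box regression. The function names `selective_scan`, `fuse`, and `classify` are hypothetical.

```python
import math

def softplus(x):
    # Numerically stable log(1 + e^x); keeps the step size positive.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def selective_scan(seq, A, w, bias=0.0):
    """Toy Mamba-style selective scan with one scalar state per channel (an assumption).

    seq:  list of L token vectors, each of length dim
    A:    dim negative decay rates (diagonal continuous-time state matrix)
    w:    dim weights that make the step size depend on the current token
    One left-to-right pass: O(L * dim), versus O(L^2) pairwise scores in self-attention.
    """
    dim = len(A)
    h = [0.0] * dim
    out = []
    for x in seq:
        # "Selective" part: the step size is a function of the input token.
        delta = softplus(sum(wi * xi for wi, xi in zip(w, x)) + bias)
        y = []
        for d in range(dim):
            a_bar = math.exp(delta * A[d])      # zero-order-hold discretization of A
            h[d] = a_bar * h[d] + delta * x[d]  # simplified discretization of B (B = 1)
            y.append(h[d] + x[d])               # readout with C = 1 plus a skip connection
        out.append(y)
    return out

def fuse(text_tokens, image_tokens, A, w):
    # Hypothetical ordering: text first, so each image token's state has already absorbed the prompt.
    fused = selective_scan(text_tokens + image_tokens, A, w)
    return fused[len(text_tokens):]

def classify(region_feats, class_embeds):
    # Open-vocabulary logits: dot product between each fused region feature and each class-name embedding.
    return [[sum(r * c for r, c in zip(rf, ce)) for ce in class_embeds] for rf in region_feats]

if __name__ == "__main__":
    # Toy stand-ins for encoder outputs (a real model would use learned image/text encoders).
    text = [[1.0, 0.0], [0.0, 1.0]]               # two class-name embeddings
    image = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]  # three region/patch features
    regions = fuse(text, image, A=[-1.0, -1.0], w=[0.5, 0.5])
    print(classify(regions, text))                # 3 x 2 region-to-class logits
```

Because this toy scan is causal, an image token can only be influenced by tokens that precede it, which is why the sketch places the text prompt first; the per-token loop is what keeps the cost linear in sequence length, in contrast to the quadratic pairwise scores of self-attention.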