CA-VLN: Collaborative Agents in MLLM-Powered Visual-Language Navigation


Abstract

Generalization to unseen environments remains a fundamental challenge in Vision-Language Navigation. To tackle this issue, we propose a novel framework that leverages world knowledge embedded within Multimodal Large Language Models. We introduce Collaborative Agents in Visual-Language Navigation (CA-VLN), a framework based on a dual-agent architecture. This architecture comprises a Knowledge Agent, which infuses the action prediction process with semantic context and commonsense reasoning, and a Hierarchical History Agent, which constructs a detailed episodic memory to enable long-horizon planning. The collaboration between these agents facilitates a dynamic interplay between high-level semantic understanding and grounded episodic experience. Extensive experiments on the R2R, REVERIE and SOON datasets demonstrate that our model achieves state-of-the-art performance, significantly improving generalization and navigation success in previously unobserved environments.
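The dual-agent collaboration described above can be illustrated with a minimal sketch. The paper does not specify these interfaces, so all class names, scoring functions, and the combination rule below are hypothetical stand-ins: a Knowledge Agent supplies a semantic prior over candidate viewpoints, a History Agent keeps episodic memory that discourages revisits, and action selection combines the two signals.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeAgent:
    """Hypothetical stand-in for the MLLM-backed Knowledge Agent:
    scores a candidate viewpoint by crude lexical overlap with the
    instruction (a real system would query an MLLM here)."""

    def score(self, instruction: str, candidate: str) -> float:
        inst = set(instruction.lower().split())
        cand = set(candidate.lower().split())
        return len(inst & cand) / max(len(cand), 1)


@dataclass
class HistoryAgent:
    """Hypothetical stand-in for the Hierarchical History Agent:
    keeps a flat episodic memory of visited viewpoints and penalizes
    revisiting them (the paper's memory is hierarchical)."""

    memory: list = field(default_factory=list)

    def record(self, viewpoint: str) -> None:
        self.memory.append(viewpoint)

    def penalty(self, candidate: str) -> float:
        return 0.5 if candidate in self.memory else 0.0


def select_action(instruction: str, candidates: list,
                  ka: KnowledgeAgent, ha: HistoryAgent) -> str:
    """Pick the candidate maximizing semantic prior minus revisit penalty."""
    return max(candidates, key=lambda c: ka.score(instruction, c) - ha.penalty(c))
```

For example, with the instruction "go to the kitchen" and candidates ["bedroom", "kitchen doorway", "hallway"], the lexical prior favors "kitchen doorway"; once it is recorded in memory, its revisit penalty lowers its score on the next step. This is only a toy instance of the interplay between semantic understanding and episodic experience that the abstract describes, not the paper's method.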
