Abstract
Generalization to unseen environments remains a fundamental challenge in Vision-Language Navigation (VLN). To address this challenge, we propose a novel framework that leverages the world knowledge embedded in Multimodal Large Language Models. We introduce Collaborative Agents in Vision-Language Navigation (CA-VLN), a framework built on a dual-agent architecture: a Knowledge Agent, which infuses action prediction with semantic context and commonsense reasoning, and a Hierarchical History Agent, which constructs a detailed episodic memory to support long-horizon planning. The collaboration between these agents enables a dynamic interplay between high-level semantic understanding and grounded episodic experience. Extensive experiments on the R2R, REVERIE, and SOON datasets demonstrate that our model achieves state-of-the-art performance, substantially improving generalization and navigation success in previously unseen environments.