Abstract
Older adults often need guidance when visiting unfamiliar buildings. Indoor navigation, however, remains challenging because Global Positioning System (GPS) signals are unavailable, corridors are visually repetitive, and localisation failures are frequent. This article presents a multimodal indoor navigation assistant that combines graph-based route planning with visual landmark verification to provide step-by-step guidance. The environment is modelled as a directed graph whose nodes are annotated with semantic landmarks; the graph is built primarily from a walkthrough video of the building, reducing the need for 3D scanners, beacons, or other specialised instruments. Routes are computed with Dijkstra's shortest-path algorithm over the semantic graph. During navigation, camera frames are analysed with a restricted vision-language recognition strategy that considers only candidate landmarks from the current and next nodes, which reduces false detections and improves interpretability. To increase robustness, the system adds a temporal voting mechanism that confirms node transitions, together with a hierarchical redirection strategy offering local and global recovery. The system operates in two modes: a handheld mode that provides visual cues (augmented-reality arrows and a mini-map) alongside voice instructions, and a hands-free mode that uses the front camera with voice instructions and spoken keywords. Evaluation comprised preliminary technical testing in the United Kingdom followed by formal user validation in Spain. In these trials, participants reported high usability, strong confidence and perceived safety, and increased perceived independence.
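The routing and transition-confirmation ideas summarised above can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: the graph, landmark labels, and voting window sizes below are hypothetical.

```python
import heapq
from collections import Counter, deque


def dijkstra(graph, start, goal):
    """Shortest path over a weighted directed graph given as
    {node: [(neighbour, cost), ...]}. Returns the node sequence, or None."""
    dist = {start: 0.0}
    prev = {}
    pq = [(0.0, start)]
    visited = set()
    while pq:
        d, u = heapq.heappop(pq)
        if u in visited:
            continue
        visited.add(u)
        if u == goal:
            break
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(pq, (nd, v))
    if goal not in dist:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


class TemporalVote:
    """Confirm a node transition only when the same landmark label is
    recognised in at least `threshold` of the last `window` frames,
    which suppresses spurious single-frame detections."""

    def __init__(self, window=5, threshold=3):
        self.frames = deque(maxlen=window)
        self.threshold = threshold

    def update(self, detected_landmark):
        # detected_landmark is a label (or None) drawn from the restricted
        # candidate set of the current and next nodes.
        self.frames.append(detected_landmark)
        label, count = Counter(self.frames).most_common(1)[0]
        return label if label is not None and count >= self.threshold else None
```

For example, with a hypothetical corridor graph `{"lobby": [("stairs", 1.0), ("lift", 5.0)], "stairs": [("lift", 1.0)], "lift": []}`, `dijkstra(graph, "lobby", "lift")` returns the route via the stairs, and a `TemporalVote` instance only reports a transition after three consistent frame-level detections.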