Abstract
With the rapid development of next-generation wireless communication systems, the increasing density of heterogeneous base stations and the dynamic nature of channel conditions pose significant challenges to accurate and timely base station selection. Traditional single-modal approaches that rely solely on partial channel or location information often fail to capture the complex semantics of real-world communication scenarios, leading to suboptimal decisions. To address these limitations, this paper proposes the Multimodal Optimal Base Station Selection Network (MOBS-Net), which integrates multimodal spatial and temporal information to support both optimal base station judgment and proactive prediction. The judgment module employs convolutional neural networks to extract image semantics and a Transformer-based fusion mechanism to combine image, location, and channel features for real-time decision-making. The prediction module leverages multimodal sequential data and a large-scale multimodal model to extract temporal semantics, enabling proactive base station switching under dynamic channel conditions. Extensive experiments demonstrate that MOBS-Net significantly outperforms single-modal baselines, achieving an accuracy of 92.20% on the optimal base station judgment task and 91.5% on the prediction task. These results highlight the reliability and effectiveness of MOBS-Net for intelligent base station decision-making in dynamic wireless environments.