Abstract
Driven by the concepts of the digital twin and the metaverse, constructing a high-fidelity, semantically rich, and interactive digital copy of the physical world has become a key issue in surveying, mapping, and geographic information science. However, in typical complex landforms such as urban canyons and mountainous forest areas, single-sensor data acquisition methods (e.g., UAV oblique imagery and LiDAR) have inherent information blind spots and accuracy bottlenecks. Traditional data fusion approaches focus predominantly on shallow alignment and stitching at the geometric level while ignoring the semantic heterogeneity of the different data sources, which leads to common problems such as geometric distortion, detail loss, and semantic inconsistency in the fused products. To overcome this dilemma, this paper proposes an adaptive fusion framework for multi-source data of complex landforms with deeply coupled semantic information (SAAF-Net). Centered on computer vision, the framework constructs a full-pipeline workflow from raw data to high-precision semantic 3D models: (1) Two-stream parallel semantic parsing: a two-stream deep semantic segmentation network for images and point clouds (based on SegFormer and PointNeXt) achieves fine-grained classification of scene features (mean intersection over union, mIoU, above 90%), providing high-dimensional semantic priors for fusion. (2) Semantic-guided cross-source registration: a semantic-weighted iterative closest point algorithm (SW-ICP) constrains the corresponding-point search space through cross-source semantic consistency and, combined with saliency weighting of local geometric structures, solves the robustness problem of heterogeneous data registration.
(3) Neural adaptive fusion modeling: a multi-factor-driven neural network model dynamically evaluates the confidence of each data source under different semantic categories and observation conditions, achieving optimal pixel-level fusion of elevation and texture. Experiments in a city center and in mountainous forest areas show that, compared with mainstream methods, SAAF-Net reduces the root mean square error (RMSE) by 35%-48% and raises completeness to over 99%; reconstruction quality is markedly improved at building edges, in vegetation-covered areas, and in areas with complex light and shadow, with a substantial enhancement in visual realism. This research provides theoretical and technical support for constructing a high-precision 3D base for digital twin cities.
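The abstract does not detail the SW-ICP update. As an illustration only, the sketch below shows one plausible iteration under simplifying assumptions: correspondences are restricted to points sharing a semantic label (the cross-source semantic consistency constraint) and each pair carries a per-point weight standing in for geometric saliency. The function name, signature, and the brute-force nearest-neighbour search are all hypothetical, not the paper's implementation.

```python
import numpy as np

def sw_icp_step(src, dst, src_labels, dst_labels, weights):
    """One semantic-weighted ICP iteration (illustrative sketch).

    src, dst: (N, 3) and (M, 3) point clouds from two sources.
    src_labels, dst_labels: integer semantic labels per point.
    weights: per-source-point weights (e.g. geometric saliency).
    Returns a rigid transform (R, t) aligning src toward dst.
    """
    pairs_s, pairs_d, pair_w = [], [], []
    for lbl in np.intersect1d(src_labels, dst_labels):
        s = src[src_labels == lbl]
        d = dst[dst_labels == lbl]
        ws = weights[src_labels == lbl]
        # Nearest neighbour restricted to the same semantic class.
        dists = np.linalg.norm(s[:, None, :] - d[None, :, :], axis=2)
        nn = dists.argmin(axis=1)
        pairs_s.append(s)
        pairs_d.append(d[nn])
        pair_w.append(ws)
    P = np.vstack(pairs_s)
    Q = np.vstack(pairs_d)
    w = np.concatenate(pair_w)
    w = w / w.sum()
    # Weighted Kabsch solution: centre, weighted covariance, SVD.
    mu_p, mu_q = w @ P, w @ Q
    H = (P - mu_p).T @ np.diag(w) @ (Q - mu_q)
    U, _, Vt = np.linalg.svd(H)
    # Reflection guard keeps R a proper rotation (det = +1).
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    t = mu_q - R @ mu_p
    return R, t
```

In a full pipeline this step would be repeated until convergence, re-matching correspondences after each transform update; the semantic gating shrinks the search space and rejects cross-class mismatches that plague standard ICP on heterogeneous data.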
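The final fusion stage can be reduced, at its simplest, to a per-pixel weighted average in which learned confidences act as weights. The sketch below is a minimal illustration of that reduction only: the confidence rasters stand in for the output of the paper's multi-factor neural model, and the function name and array layout are assumptions, not the described architecture.

```python
import numpy as np

def fuse_elevation(elevations, confidences):
    """Confidence-weighted per-pixel elevation fusion (illustrative sketch).

    elevations:  (S, H, W) elevation rasters from S data sources.
    confidences: (S, H, W) non-negative per-pixel confidence scores,
                 here standing in for a learned reliability model's output.
    Returns the (H, W) fused raster as a per-pixel weighted average.
    """
    # Normalise confidences across sources so weights sum to 1 per pixel.
    w = confidences / confidences.sum(axis=0, keepdims=True)
    return (w * elevations).sum(axis=0)
```

With equal confidences this degenerates to a plain mean; where one source dominates (e.g. LiDAR under vegetation, imagery on textured facades), its elevation is carried through nearly unchanged, which is the behaviour the adaptive model is meant to learn per semantic class and observation condition.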