Abstract
Owing to its ability to enable precise perception of dynamic and complex environments, point cloud semantic segmentation has become a critical task for autonomous vehicles in recent years. However, in complex, dynamic scenes, cumulative errors and the "many-to-one" mapping problem challenge existing semantic segmentation methods, limiting their accuracy and efficiency. To address these challenges, this paper introduces a new framework that balances accuracy and computational efficiency through temporal alignment (TA), projection multi-scale convolution (PMC), and priority point retention (PPR). First, by combining TA and PMC, the framework effectively captures inter-frame correlations, enriching local detail, reducing error accumulation, and preserving fine-grained scene features. Second, the PPR mechanism ensures that critical three-dimensional information is retained, resolving the information loss caused by the "many-to-one" mapping problem. Finally, by fusing LiDAR and camera data through multimodal fusion, the framework gains complementary perspectives that further enhance segmentation performance. Our method achieves state-of-the-art performance on the SemanticKITTI and nuScenes benchmarks. Notably, the proposed framework excels at detecting occluded objects and dynamic entities.