Abstract
In autonomous driving and intelligent robotics, the semantic information in LiDAR (Light Detection and Ranging) sensor data is crucial for understanding the surrounding environment. However, operating directly on point clouds is computationally expensive. To address this, some researchers have projected three-dimensional LiDAR data onto a two-dimensional spherical range view and used two-dimensional convolutional neural networks to segment the projected images. While the results are promising, many of these models are structurally complex and have high time and space complexity, which makes them unsuitable for real-time applications. To address these issues, this paper proposes MSCNet, a multi-scale semantic segmentation method for LiDAR data with fewer parameters and higher segmentation accuracy. In the encoding phase, a single-channel multi-scale feature fusion block is introduced to alleviate the distribution differences between input channels. To obtain more stable local features, multi-scale dilated convolution residual blocks are designed to encode information from different receptive fields. To capture global features quickly, a pyramid pooling module is introduced. Experimental results on the SemanticKITTI, SemanticPOSS, and Pandaset datasets show that MSCNet achieves a good balance among parameter count, accuracy, and running time. In particular, MSCNet achieves the best performance on the SemanticPOSS and Pandaset datasets. With a comparable number of parameters, the proposed method outperforms existing point-based and projection-based methods.
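As a rough illustration of the multi-scale dilated-convolution idea mentioned above, the following PyTorch sketch fuses parallel 3x3 convolutions with different dilation rates and adds a residual connection; the specific dilation rates, channel counts, and layer layout are assumptions for illustration, not the exact block used by MSCNet.

```python
import torch
import torch.nn as nn

class MultiScaleDilatedResBlock(nn.Module):
    """Illustrative sketch: parallel dilated convolutions covering several
    receptive fields, fused and combined with a residual connection.
    Dilation rates and layout are assumed, not taken from the paper."""
    def __init__(self, channels: int, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3,
                          padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        ])
        # 1x1 convolution to fuse the concatenated multi-scale features
        self.fuse = nn.Conv2d(channels * len(dilations), channels, kernel_size=1)

    def forward(self, x):
        multi_scale = torch.cat([branch(x) for branch in self.branches], dim=1)
        return x + self.fuse(multi_scale)  # residual connection

# Example: a feature map from a projected range image (batch, channels, H, W)
feat = torch.randn(1, 32, 64, 512)
out = MultiScaleDilatedResBlock(32)(feat)
print(out.shape)  # torch.Size([1, 32, 64, 512])
```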