Abstract
This paper addresses perception degradation caused by adverse weather, occlusion, and asynchronous sampling by proposing UW-MTL, an uncertainty-weighted multi-task learning framework for robust semantic understanding of traffic scenes. The method performs differentiable multi-source spatiotemporal alignment to unify camera, LiDAR, radar, and IMU data into a bird's-eye-view (BEV) sequence, and adopts a hybrid backbone that combines a Mixture-of-Experts Transformer with a spatiotemporal graph neural network to balance global semantics and local topology. Each task employs an evidential prediction head that explicitly outputs confidence and uncertainty. During training, soft-temperature weighting and a sigma-aware gradient conflict resolver enable stable joint optimization. On the nuScenes benchmark, UW-MTL consistently surpasses BEVFusion and UniAD on 3D object detection, BEV semantic segmentation, and short-horizon trajectory prediction, with especially pronounced gains at long range, under heavy occlusion, and in low-visibility conditions.
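
As a rough illustration of what uncertainty-based task weighting can look like, the sketch below uses the widely used homoscedastic formulation, in which each task loss L_i is scaled by exp(-s_i) for a learned log-variance s_i = log(sigma_i^2) and regularized by adding s_i. This is an assumed stand-in and not the soft-temperature weighting or sigma-aware resolver of UW-MTL, whose exact form is not given here; the function name `uncertainty_weighted_loss` and the loss values are hypothetical.

```python
import math


def uncertainty_weighted_loss(task_losses, log_vars):
    """Combine per-task losses with log-variances s_i = log(sigma_i^2).

    Assumed form (not taken from the paper):
        total = sum_i( exp(-s_i) * L_i + s_i )
    A larger s_i (more uncertain task) shrinks that task's contribution,
    while the additive s_i term keeps s_i from growing without bound.
    """
    if len(task_losses) != len(log_vars):
        raise ValueError("need exactly one log-variance per task loss")
    total = 0.0
    for loss, s in zip(task_losses, log_vars):
        total += math.exp(-s) * loss + s
    return total


# Hypothetical losses for detection, segmentation, and trajectory prediction.
losses = [2.0, 1.0, 0.5]

# With all s_i = 0 every task has weight 1, so the total is the plain sum.
equal = uncertainty_weighted_loss(losses, [0.0, 0.0, 0.0])

# Setting sigma^2 = 2 for the first task halves its weight and adds log(2).
down_weighted = uncertainty_weighted_loss(losses, [math.log(2.0), 0.0, 0.0])
```

In an actual training loop the log-variances would be learnable parameters updated together with the network weights; here they are fixed numbers so the weighting effect is easy to inspect.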