Abstract
Traditional calibration methods rely on precise targets and frequent manual intervention, making them time-consuming and unsuitable for large-scale deployment. Existing learning-based approaches, while automating the process, are typically limited to a single LiDAR-camera pair, resulting in poor scalability and high computational overhead. To address these limitations, we propose a lightweight calibration network that is flexible in the number of sensor pairs and can therefore jointly calibrate multiple cameras and LiDARs in a single forward pass. Our method employs a frozen pre-trained Swin Transformer as a shared backbone to extract unified features from both RGB images and the corresponding depth maps. Additionally, we introduce a cross-modal channel-wise attention module to enhance the alignment of key features and suppress irrelevant noise. To handle variations in viewpoint, we design a modular calibration head that independently estimates the extrinsics of each LiDAR-camera pair. Through large-scale experiments on the nuScenes dataset, we show that our model, with only 78.79 M parameters, attains a mean translation error of 2.651 cm and a rotation error of 0.246°, achieving performance comparable to existing methods while significantly reducing the computational cost.