Abstract
Monocular depth estimation (MDE) is a cornerstone task in 2D/3D scene reconstruction and recognition, with widespread applications in autonomous driving, robotics, and augmented reality. However, existing state-of-the-art methods face a fundamental trade-off between computational efficiency and estimation accuracy, which limits their deployment in resource-constrained real-world scenarios. Designing lightweight yet effective models is therefore essential for deployment on resource-constrained mobile devices. To this end, we present RepACNet, a novel lightweight CNN-based network that combines reparameterized asymmetric convolutions with MLP-Mixer components. First, we propose the Reparameterized Token Mixer with Asymmetric Convolution (RepTMAC), an efficient block that captures long-range dependencies while maintaining linear computational complexity. Unlike Transformer-based methods, whose self-attention cost grows quadratically with the number of tokens, our approach achieves global feature interaction with minimal overhead. Second, we introduce Squeeze-and-Excitation Consecutive Dilated Convolutions (SECDCs), which integrate adaptive channel attention with dilated convolutions to capture depth-specific features across multiple scales. We validate our approach through extensive experiments on two widely used benchmarks, NYU Depth v2 and the KITTI Eigen split. The experimental results show that our model achieves competitive performance with significantly fewer parameters than state-of-the-art models.
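To illustrate why a reparameterized asymmetric convolution adds little inference overhead, the sketch below shows the general structural-reparameterization idea in the style of ACNet. It rests on several assumptions. The abstract does not specify RepTMAC's branch layout, so the code assumes three parallel training-time branches with 3×3, 1×3, and 3×1 kernels. The example is single-channel, uses NumPy, and omits biases and batch normalization, which are assumed to be folded into the kernels beforehand. The helper names `conv2d_same` and `fuse_asymmetric` are hypothetical. Because convolution is linear and a 1×3 (or 3×1) kernel with matching padding is equivalent to a 3×3 kernel that is zero everywhere except its middle row (or column), the three branches can be summed into a single 3×3 kernel after training.

```python
import numpy as np


def conv2d_same(x, k):
    """Single-channel 2D cross-correlation with zero padding that preserves H x W.

    Kernel dimensions must be odd; padding is k_h // 2 rows and k_w // 2 columns.
    """
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))  # zero padding (np.pad default mode)
    h, w = x.shape
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out


def fuse_asymmetric(k33, k13, k31):
    """Hypothetical helper: fold 1x3 and 3x1 branch kernels into the 3x3 kernel.

    Assumes biases/BN are already folded into each kernel (omitted here).
    """
    fused = k33.astype(float).copy()
    fused[1:2, :] += k13  # the 1x3 kernel occupies the middle row
    fused[:, 1:2] += k31  # the 3x1 kernel occupies the middle column
    return fused


rng = np.random.default_rng(0)
x = rng.standard_normal((6, 7))
k33 = rng.standard_normal((3, 3))
k13 = rng.standard_normal((1, 3))
k31 = rng.standard_normal((3, 1))

# Training time: three parallel branches whose outputs are summed.
y_train = conv2d_same(x, k33) + conv2d_same(x, k13) + conv2d_same(x, k31)
# Inference time: one 3x3 convolution with the fused kernel.
y_infer = conv2d_same(x, fuse_asymmetric(k33, k13, k31))
print(np.allclose(y_train, y_infer))  # True
```

The asymmetric branches enrich the kernel's central cross during training, but the deployed model runs only one 3×3 convolution per layer. This is consistent with the abstract's emphasis on efficiency, though the actual RepTMAC design may differ.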