Abstract
In learning-based stereo matching, a cost volume that is both information-rich and concise is crucial for accurate and efficient matching. Existing cost volumes often lack global geometric information, which leads to confusion between foreground and background disparities and to blurring at edges and fine details. To address this problem, this paper proposes a stereo matching network that fuses multi-scale geometric features with frequency-domain decomposition. First, a multi-scale geometric extraction module processes the initial cost volume, converting local correlation into a global understanding of scene geometry and enhancing the perception of scene boundaries and occluded regions. In the cost aggregation stage, we introduce an adaptive guidance mechanism based on channel attention, which improves aggregation efficiency and reduces time overhead. In the disparity refinement stage, we combine iterative disparity updates based on a multi-scale GRU with a disparity reconstruction network that separates high- and low-frequency components; by decomposing the error into its high- and low-frequency parts, this network reconstructs a finer full-resolution disparity map. Our method achieves state-of-the-art performance on multiple benchmarks, including Scene Flow, KITTI2012, KITTI2015, ETH3D, and Middlebury. On the KITTI2015 test set, it attains error rates of 1.39% in background regions (D1-bg) and 2.54% in foreground regions (D1-fg) while maintaining real-time inference.
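For readers unfamiliar with the D1 figures quoted above, the following is a minimal sketch of how such an outlier rate is commonly computed. It assumes the usual KITTI 2015 convention, in which a pixel counts as an outlier when its disparity error exceeds both 3 px and 5% of the ground-truth disparity. The function name `d1_error`, its parameters, and the flat-list input layout are illustrative assumptions and are not taken from the paper.

```python
def d1_error(pred, gt, mask=None, abs_thresh=3.0, rel_thresh=0.05):
    """Return the percentage of evaluated pixels that are disparity outliers.

    pred, gt : flat sequences of predicted and ground-truth disparities.
    mask     : optional sequence of booleans selecting the region to score
               (e.g. a background mask for D1-bg, a foreground mask for D1-fg).
    Thresholds follow the assumed KITTI 2015 convention (3 px and 5%).
    """
    outliers = 0
    total = 0
    for i, (p, g) in enumerate(zip(pred, gt)):
        if mask is not None and not mask[i]:
            continue  # pixel lies outside the region being scored
        if g <= 0:
            continue  # assumption: non-positive ground truth means "no label"
        err = abs(p - g)
        total += 1
        # Outlier only if BOTH the absolute and the relative threshold are exceeded.
        if err > abs_thresh and err > rel_thresh * g:
            outliers += 1
    return 100.0 * outliers / total if total else 0.0


if __name__ == "__main__":
    pred = [10.0, 50.0, 100.0, 20.0]
    gt = [10.5, 54.0, 104.0, 0.0]  # last pixel has no ground truth
    print(f"D1 = {d1_error(pred, gt):.2f}%")
```

Passing a region mask restricts the score to background or foreground pixels, which is how separate D1-bg and D1-fg values would be obtained under this assumption.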