Abstract
This study presents LSE-CVCNet, a novel stereo matching network designed to address challenges in dynamic scenes, including dynamic feature misalignment caused by texture variability and contextual ambiguity arising from occlusions. LSE-CVCNet integrates three key innovations: local structural entropy (LSE), which quantifies structural uncertainty in disparity maps and guides adaptive attention; a cross-image attention mechanism (CIAM-T), which asymmetrically extracts features from the left and right images to improve feature alignment; and multi-resolution cost volume fusion (MRCV-F), which preserves fine-grained details through multi-scale fusion. Together, these modules enhance disparity estimation accuracy and cross-domain generalization. Experimental results demonstrate robustness under varying lighting, occlusions, and complex geometries, with LSE-CVCNet outperforming state-of-the-art methods across multiple datasets. Ablation studies validate each module's contribution, and cross-domain tests confirm generalization to unseen scenarios. This work establishes a new paradigm for adaptive stereo matching in dynamic environments.