Abstract
To address interference from complex construction environments (e.g., lighting variations, occlusion, and marker contamination) and the demand for high-precision alignment during hoisting by bridge-erecting machines, this paper presents a two-stage marker detection-localization network tailored to hoisting alignment. The proposed network adopts a phased "coarse detection-fine estimation" framework. The first stage employs a lightweight detection module that integrates a dynamic hybrid backbone (DHB) with a dynamic switching mechanism to efficiently filter background noise and generate coarse localization boxes for marker regions. Specifically, the DHB dynamically switches between convolutional and Transformer branches to handle features of varying complexity, using depthwise separable convolutions from MobileNetV3 for low-level geometric features and lightweight Transformer blocks for high-level semantic features. The second stage constructs a Transformer-based homography estimation module that leverages multi-head self-attention to capture long-range dependencies between marker keypoints and the scene context. By integrating enhanced multi-scale feature interaction with position encoding that combines absolute position and marker geometric priors, this module learns, end to end, precise homography matrices between markers and hoisting equipment from the coarse localization boxes. To address data scarcity in construction scenes, a multi-dimensional data augmentation strategy is developed, comprising random homography transformation (simulating viewpoint changes), photometric augmentation (adjusting brightness, saturation, and contrast), and background blending with bounding box extraction. Experiments on a real bridge-erecting machine dataset show that the network achieves a detection accuracy (mAP) of 97.8%, a homography estimation reprojection error of less than 1.2 mm, and a processing frame rate of 32 FPS.
Compared with traditional single-stage CNN-based methods, the proposed network significantly improves alignment precision and robustness in complex environments, offering reliable technical support for the precise control of automated hoisting in bridge-erecting machines.
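
As a rough illustration of the random homography augmentation and the reprojection error mentioned above, the following minimal sketch perturbs the four image corners, fits the resulting homography, warps a marker's corners, and extracts their bounding box. It is not the paper's implementation: the function names, the corner-jitter scheme, and the `max_shift` parameter are all assumptions made for this example, and photometric augmentation and background blending are omitted.

```python
import math
import random


def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography_from_points(src, dst):
    """Fit a 3x3 homography (with h33 fixed to 1) mapping four src points to dst."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        b.append(u)
        a.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def warp_point(h, p):
    """Apply homography h to a 2D point, including the projective division."""
    x, y = p
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


def random_homography(width, height, max_shift=0.15, rng=random):
    """Sample a homography by jittering the image corners (assumed scheme)."""
    corners = [(0.0, 0.0), (width, 0.0), (width, height), (0.0, height)]
    jittered = [(x + rng.uniform(-max_shift, max_shift) * width,
                 y + rng.uniform(-max_shift, max_shift) * height)
                for x, y in corners]
    return homography_from_points(corners, jittered)


def bounding_box(points):
    """Axis-aligned box (x_min, y_min, x_max, y_max) enclosing the points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def reprojection_error(h, src, dst):
    """Mean Euclidean distance between warped src points and dst points."""
    dists = [math.dist(warp_point(h, s), d) for s, d in zip(src, dst)]
    return sum(dists) / len(dists)


if __name__ == "__main__":
    rng = random.Random(0)
    h = random_homography(640.0, 480.0, rng=rng)
    marker = [(200.0, 150.0), (300.0, 150.0), (300.0, 250.0), (200.0, 250.0)]
    warped = [warp_point(h, p) for p in marker]
    print("coarse box label:", bounding_box(warped))
    print("reprojection error:", reprojection_error(h, marker, warped))
```

In this sketch the reprojection error is measured in the same units as the input coordinates (pixels here); reporting it in millimetres, as the abstract does, would additionally require a known physical scale for the marker.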