Abstract
The sizes of objects within an image can vary significantly. In fields such as construction, manufacturing, and healthcare, where image analysis directly affects human life and safety, accurately detecting even small objects is crucial. To address this, the present study proposes a multiscale object detection model that employs the Pyramid Vision Transformer (PVT) as the backbone network of the YOLO model. This approach compensates for the limitations of the Spatial Pyramid Pooling - Fast (SPPF) module used in conventional YOLO models and improves detection accuracy for small objects. The proposed transformer-based multiscale object detection model aims to capture long-range dependencies effectively while simplifying complex pre-processing and post-processing procedures. Furthermore, it generates feature maps at multiple resolutions, enabling multiscale feature representation and the detection of objects across a wide range of sizes. In particular, by applying global self-attention, the model exploits contextual information from the entire image, thereby enhancing its understanding of object relationships and improving overall scene comprehension.
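To make the described architecture concrete, the following is a minimal PyTorch sketch of PVT-style spatial-reduction attention producing feature maps at three resolutions, as a YOLO-style neck and head would consume. It is illustrative only: the module names, channel dimensions, and hyperparameters are assumptions for this sketch, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """Global self-attention whose keys/values are spatially downsampled (PVT's SRA)."""
    def __init__(self, dim, num_heads=4, sr_ratio=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Strided conv shrinks the key/value token grid by sr_ratio per side,
        # cutting attention cost while keeping a global receptive field.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):                      # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = x.flatten(2).transpose(1, 2)       # (B, H*W, C) query tokens
        kv = self.sr(x).flatten(2).transpose(1, 2)
        kv = self.norm(kv)
        out, _ = self.attn(q, kv, kv)          # every query attends over the whole image
        return out.transpose(1, 2).reshape(B, C, H, W)

class TinyPVTBackbone(nn.Module):
    """Three stages at strides 8/16/32; each stage = downsampling conv + SRA block."""
    def __init__(self, dims=(64, 128, 256)):
        super().__init__()
        self.stem = nn.Conv2d(3, dims[0], kernel_size=7, stride=4, padding=3)
        self.stages = nn.ModuleList()
        in_dim = dims[0]
        for d in dims:
            self.stages.append(nn.Sequential(
                nn.Conv2d(in_dim, d, kernel_size=3, stride=2, padding=1),
                SpatialReductionAttention(d),
            ))
            in_dim = d

    def forward(self, x):
        x = self.stem(x)
        feats = []                             # multiscale maps for the detection neck/head
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats                           # strides 8, 16, 32

feats = TinyPVTBackbone()(torch.randn(1, 3, 256, 256))
print([tuple(f.shape) for f in feats])
# [(1, 64, 32, 32), (1, 128, 16, 16), (1, 256, 8, 8)]
```

The strided-convolution reduction of keys and values is what keeps global self-attention tractable at detection resolutions: every query still attends over the entire image, which is the property the abstract credits with improved understanding of object relationships, while the highest-resolution output map preserves the spatial detail needed for small objects.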