Abstract
The effective receptive field (ERF) is a crucial concept in object detection, as it captures rich semantic information about the target, including its position and class. Existing methods typically tie the ERF statically to the depth, size, and nonlinear operations of the convolutional network, so that the feature map at each layer corresponds to a fixed ERF size. In real images, however, multiple objects with varying scales, shapes, and other characteristics can influence the ERF, and the ERF often follows a Gaussian distribution. In this paper, we propose a dynamic, real-time, region-oriented ERF computation method named GERF (Gaussian-based Effective Receptive Fields). We apply GERF to the BRA (Bi-Level Routing Attention) module of BiFormer and refer to the resulting method as GERF-BRA. Our approach predicts the ERF for each window in the feature map and captures the weighted features of adjacent windows using a Gaussian distribution. We integrate GERF-BRA into the detection heads of YOLOv8n; experimental results on the COCO 2017 dataset demonstrate its effectiveness, with an improvement of 2.5 AP. Our method is also effective on proprietary agricultural and medical datasets.
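The idea of weighting adjacent windows by a Gaussian can be illustrated with a minimal NumPy sketch. This is not the GERF-BRA implementation: the function name `gaussian_window_aggregate`, the `radius` cutoff, and the use of a supplied per-window `sigma` (standing in for a predicted ERF width) are all assumptions made for illustration only.

```python
import numpy as np

def gaussian_window_aggregate(window_feats, sigma, radius=2):
    """Aggregate each window's neighbours with Gaussian weights (illustrative only).

    window_feats: (H, W, C) array, one pooled feature vector per window.
    sigma: (H, W) array, hypothetical per-window ERF width in window units;
           here it is given as input rather than predicted by a network.
    radius: neighbours farther than this many windows are ignored (assumption).
    """
    H, W, C = window_feats.shape
    out = np.zeros_like(window_feats, dtype=np.float64)
    for i in range(H):
        for j in range(W):
            # Clip the neighbourhood to the grid of windows.
            i0, i1 = max(0, i - radius), min(H, i + radius + 1)
            j0, j1 = max(0, j - radius), min(W, j + radius + 1)
            ii, jj = np.mgrid[i0:i1, j0:j1]
            # Squared distance from the centre window, in window units.
            d2 = (ii - i) ** 2 + (jj - j) ** 2
            # Isotropic Gaussian weights, normalised to sum to 1.
            w = np.exp(-d2 / (2.0 * sigma[i, j] ** 2))
            w /= w.sum()
            # Weighted sum of the neighbouring window features.
            out[i, j] = np.tensordot(
                w, window_feats[i0:i1, j0:j1], axes=([0, 1], [0, 1])
            )
    return out
```

In this sketch, a small `sigma` keeps a window's feature essentially unchanged, while a large `sigma` approaches a uniform average over the neighbourhood, which is one way a per-window ERF width could adapt to object scale.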