Abstract
Current weakly supervised salient object detection (SOD) methods for RGB-D images rely mostly on image-level labels and sparse annotations, making it difficult to fully delineate object boundaries in complex scenes, especially for objects with filamentary structures. To address these issues, we propose a novel cross-modal weakly supervised SOD framework that exploits the complementary strengths of cross-modal weak labels to generate high-quality pseudo-labels and fully couples the multi-scale features of RGB and depth images for precise saliency prediction. The framework consists of a cross-modal pseudo-label generation network (CPGN) and an asymmetric salient-region prediction network (ASPN). The CPGN leverages the precise pixel-level guidance provided by point labels and the enhanced semantic supervision provided by text labels to generate high-quality pseudo-labels, which then supervise the training of the ASPN. To better capture contextual information and geometric features from RGB and depth images, the ASPN, an asymmetric progressive network, gradually extracts multi-scale features from the two modalities using a Swin-Transformer encoder for RGB and a CNN encoder for depth, which significantly enhances the model's ability to perceive detailed structures. Additionally, an edge constraint module (ECM) is designed to sharpen the edges of the predicted salient regions. Experimental results demonstrate that the proposed method outperforms other weakly supervised SOD methods in delineating salient objects, especially those with filamentary structures.
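To make the asymmetric dual-encoder idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the Swin-Transformer stage is replaced here by a plain global self-attention block for self-containedness, and all module names (TransformerStage, CNNStage, ASPNSketch), channel widths, and the fusion and decoding scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TransformerStage(nn.Module):
    """Downsample by 2 (patch merging), then apply global self-attention.
    A simplified stand-in for a Swin-Transformer stage."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.down = nn.Conv2d(c_in, c_out, kernel_size=2, stride=2)
        self.attn = nn.TransformerEncoderLayer(
            d_model=c_out, nhead=4, dim_feedforward=2 * c_out, batch_first=True)

    def forward(self, x):
        x = self.down(x)
        b, c, h, w = x.shape
        tokens = self.attn(x.flatten(2).transpose(1, 2))  # (B, H*W, C)
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class CNNStage(nn.Module):
    """Lightweight convolutional stage for the depth branch."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
            nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1),
            nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.body(x)


class ASPNSketch(nn.Module):
    """Asymmetric encoders (transformer for RGB, CNN for depth); their
    multi-scale features are fused per scale and decoded top-down."""
    def __init__(self, widths=(32, 64, 128)):
        super().__init__()
        self.rgb_stem = nn.Conv2d(3, widths[0], 3, stride=2, padding=1)
        self.dep_stem = nn.Conv2d(1, widths[0], 3, stride=2, padding=1)
        self.rgb_stages = nn.ModuleList(
            TransformerStage(widths[i], widths[i + 1]) for i in range(2))
        self.dep_stages = nn.ModuleList(
            CNNStage(widths[i], widths[i + 1]) for i in range(2))
        # Cross-modal fusion at each scale: concatenate, then 1x1 conv.
        self.fuse = nn.ModuleList(nn.Conv2d(2 * w, w, 1) for w in widths[1:])
        self.lateral = nn.Conv2d(widths[2], widths[1], 1)
        self.head = nn.Conv2d(widths[1], 1, 1)

    def forward(self, rgb, depth):
        r, d = self.rgb_stem(rgb), self.dep_stem(depth)
        fused = []
        for rgb_stage, dep_stage, fuse in zip(
                self.rgb_stages, self.dep_stages, self.fuse):
            r, d = rgb_stage(r), dep_stage(d)
            fused.append(fuse(torch.cat([r, d], dim=1)))
        # Progressive top-down decoding: lift the coarse fused features to
        # the finer resolution, merge, and predict a saliency map.
        coarse = self.lateral(F.interpolate(
            fused[1], size=fused[0].shape[-2:], mode='bilinear',
            align_corners=False))
        logits = self.head(fused[0] + coarse)
        return torch.sigmoid(F.interpolate(
            logits, scale_factor=4, mode='bilinear', align_corners=False))


if __name__ == "__main__":
    model = ASPNSketch()
    rgb = torch.randn(1, 3, 224, 224)
    depth = torch.randn(1, 1, 224, 224)  # single-channel depth map
    print(model(rgb, depth).shape)       # torch.Size([1, 1, 224, 224])
```

The sketch only illustrates how two structurally different encoders can feed a shared per-scale fusion and top-down decoder; in the paper, the RGB branch is an actual Swin-Transformer, and the decoding pipeline is more elaborate, including the edge constraint module (ECM).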