Abstract
In camouflaged object detection (COD), overcoming the intrinsic similarity between objects and their backgrounds is critical to improving detection performance. Existing approaches typically leverage boundary constraints to provide auxiliary information during training. To capture more discriminative detail cues, we introduce texture labels as supervisory signals and propose a context- and texture-aware hierarchical interaction network (CTHINet) for COD. In the encoding phase, the network is divided into two separate branches: a context encoder and a texture encoder. The context encoder generates contextual information, and the features at different scales are then refined by a Multi-head Feature Aggregation Module (MFAM). By exploiting interactions among distinct receptive fields, this module enriches feature diversity, facilitating the matching of candidate regions to camouflaged objects of varying sizes and shapes. The enhanced features are then combined with texture features produced by the texture encoder through Hierarchical Mixed-scale Interaction Modules (HMIM), which fully exploit imperceptible cues within candidate objects. Each HMIM continuously integrates texture cues with contextual information within a single feature scale for more accurate detection. Extensive experiments on three challenging benchmark datasets, i.e., CAMO, COD10K, and NC4K, demonstrate that our model outperforms state-of-the-art methods. Furthermore, evaluation on a polyp segmentation dataset underscores the promising potential of CTHINet for downstream applications.