TASA: Text-Anchored State-Space Alignment for Long-Tailed Image Classification

Abstract

Long-tailed image classification remains challenging for vision-language models. Head classes dominate training while tail classes are underrepresented and noisy, and short prompts with weak text supervision further amplify head bias. This paper presents TASA, an end-to-end framework that stabilizes textual supervision and enhances cross-modal fusion. A Semantic Distribution Modulation (SDM) module constructs class-specific text prototypes by cosine-weighted fusion of multiple LLM-generated descriptions with a canonical template, providing stable and diverse semantic anchors without training text parameters. A Dual-Space Cross-Modal Fusion (DCF) module incorporates selective-scan state-space blocks into both the image and text branches, enabling bidirectional conditioning and efficient feature fusion through a lightweight multilayer perceptron. Together with a margin-aware alignment loss, TASA aligns images with class prototypes for classification without requiring paired image-text data or per-class prompt tuning. Experiments on CIFAR-10/100-LT, ImageNet-LT, and Places-LT demonstrate consistent improvements across the many-, medium-, and few-shot groups. Ablation studies confirm that DCF yields the largest single-module gain, while SDM and DCF combined provide the most robust and balanced performance. These results highlight the effectiveness of integrating text-driven prototypes with state-space fusion for long-tailed classification.
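The abstract does not give the SDM formula, so the following is only a minimal sketch of what "cosine-weighted fusion" of description embeddings with a template embedding could look like. The function names (`fuse_prototype`, `classify`), the softmax weighting, the `temperature` parameter, and the additive combination with the template are all assumptions made for illustration, not the authors' method. Embeddings are plain Python lists standing in for the outputs of a frozen text encoder.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fuse_prototype(template_emb, description_embs, temperature=0.1):
    """Build one class prototype from a template and several descriptions.

    Assumed rule (hypothetical): each description is weighted by a softmax
    over its cosine similarity to the template, and the weighted sum is
    added to the template before re-normalizing. No parameters are trained.
    """
    template = l2_normalize(template_emb)
    descs = [l2_normalize(d) for d in description_embs]
    if not descs:
        return template, []

    # Cosine similarity of unit vectors is just their dot product.
    sims = [dot(template, d) for d in descs]

    # Numerically stable softmax over the similarities.
    top = max(sims)
    exps = [math.exp((s - top) / temperature) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]

    fused = list(template)
    for w, d in zip(weights, descs):
        for i, x in enumerate(d):
            fused[i] += w * x
    return l2_normalize(fused), weights


def classify(image_emb, prototypes):
    """Assign the class whose prototype has the highest cosine similarity."""
    image = l2_normalize(image_emb)
    scores = {name: dot(image, l2_normalize(p)) for name, p in prototypes.items()}
    return max(scores, key=scores.get), scores


# Toy 2-D embeddings; real ones would come from a frozen text/image encoder.
proto, weights = fuse_prototype([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
label, scores = classify([0.9, 0.1], {"cat": proto, "dog": [0.0, 1.0]})
```

In this toy run the description that agrees with the template receives almost all of the weight, so an off-topic description barely shifts the prototype. That is one plausible reading of how such fusion could give stable anchors while still drawing on multiple descriptions.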
