Abstract
Zero-shot learning enables the recognition of images from unseen classes by leveraging auxiliary semantic information. Traditional methods typically learn either the relationship between visual features and semantic vectors or that between the seen and unseen semantic vectors. However, their zero-shot recognition performance on scene images is limited by large intra-class variations. To address this challenge, we propose a novel approach, termed SAEVRT, that combines semantic autoencoders (SAEs) with visual relation transfer (VRT). Specifically, we learn two semantic autoencoders, one for the seen and one for the unseen scene classes, which alleviate the domain shift between the visual and semantic spaces. Because no attribute vectors are available for scene classes, their semantic vectors are less discriminative than visual features; we therefore propose an interpretable method that transfers the visual relations between seen and unseen classes to learn more effective unseen semantic vectors. By combining SAEs and VRT in a unified learning framework, we exploit both the visual-semantic and the seen-unseen relationships. Extensive experiments on four scene datasets demonstrate the superior performance of SAEVRT, which achieves recognition accuracies of 63.77%, 67.75%, 58.68%, and 53.26% on Scene15, MIT67, UCM21, and NWPU45, respectively.