Abstract
Cultural heritage preservation has garnered global attention. Museum artifact classification, a core task in this field, faces challenges from insufficient multimodal information collaboration and a scarcity of high-quality annotated data, and traditional methods and single-modality deep learning models struggle to achieve both efficiency and accuracy. To address these challenges, this paper proposes a museum artifact classification model (VBG Model) based on cross-modal attention fusion and generative data augmentation. The model constructs an integrated multimodal framework through task-oriented refactoring of the Vision Transformer (ViT), BERT, and a Generative Adversarial Network (GAN): ViT extracts global visual features from artifact images, while BERT mines the historical and cultural semantics of the accompanying text. A bidirectional interactive attention fusion layer achieves precise feature alignment between the two modalities, and the GAN generates diverse samples, forming a closed "generation-feedback-optimization" loop that alleviates data scarcity. Experiments on the MET and MS COCO datasets demonstrate strong performance: the VBG Model achieves 92% classification accuracy, 0.85 mAP, and an 88% F1 score on the former, and 90% accuracy, 0.83 mAP, and an 86% F1 score on the latter, outperforming competing models such as ResNet and DenseNet. Ablation experiments confirm that the cross-modal fusion and generative data augmentation modules are both essential: removing either results in a 5%-9% drop in accuracy. The current model still has room for improvement in training time and generated image quality. Future work will focus on lightweight design and multi-scale fusion to optimize performance, enhancing the ability to distinguish similar artifacts and providing technical support for digital artifact management and cultural heritage preservation.
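To make the bidirectional interactive attention fusion concrete, the PyTorch sketch below shows one minimal way such a layer could be realized: image tokens attend over text tokens and vice versa, and the two attended streams are pooled and concatenated. The class name, 768-dimensional embeddings, 8 attention heads, and mean pooling are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class BidirectionalAttentionFusion(nn.Module):
    """Illustrative bidirectional cross-modal attention fusion (a sketch,
    not the paper's specification). Image tokens attend to text tokens
    and vice versa; both streams are pooled and concatenated."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.img_to_txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_to_img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_img = nn.LayerNorm(dim)
        self.norm_txt = nn.LayerNorm(dim)

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
        # Image queries attend over text keys/values (image -> text direction).
        img_attended, _ = self.img_to_txt(img_feats, txt_feats, txt_feats)
        # Text queries attend over image keys/values (text -> image direction).
        txt_attended, _ = self.txt_to_img(txt_feats, img_feats, img_feats)
        # Residual connection + layer norm, then mean-pool each token stream.
        img_out = self.norm_img(img_feats + img_attended).mean(dim=1)
        txt_out = self.norm_txt(txt_feats + txt_attended).mean(dim=1)
        # Concatenate the pooled modalities into one fused representation.
        return torch.cat([img_out, txt_out], dim=-1)

# Example: ViT patch tokens (batch 2, 197 tokens) fused with BERT tokens (batch 2, 64 tokens).
fusion = BidirectionalAttentionFusion()
fused = fusion(torch.randn(2, 197, 768), torch.randn(2, 64, 768))
print(fused.shape)  # torch.Size([2, 1536])
```

The fused vector would then feed a classification head, with the GAN-generated samples entering the same pipeline as additional training data under the generation-feedback-optimization loop.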