Efficient knowledge distillation and alignment for improved KB-VQA


Abstract

Knowledge-based visual question answering (KB-VQA) often requires external knowledge to answer natural language questions about image content. Recent research has emphasized the importance of such knowledge by implicitly leveraging Large Language Models (LLMs). However, these methods suffer from two issues: (1) They primarily focus on aligning image-text descriptions while neglecting alignment between image features and knowledge. Relying solely on knowledge retrieved from databases or LLMs may introduce irrelevant information, whereas knowledge relevant to the visual content improves answer accuracy. (2) They often require long inference times and significant computational resources, with some heavily relying on access to the GPT-3 API. We therefore propose EKDA (Efficient Knowledge Distillation and Alignment), an efficient approach that, unlike other LLM-based methods, requires neither extensive computational resources nor complex pipelines. EKDA extracts knowledge via knowledge distillation, using the LLaMA model as the teacher. Additionally, we employ a Graph Neural Network (GNN) to align visual information with knowledge, effectively capturing image-related knowledge and enhancing the model's semantic understanding. Our approach achieves state-of-the-art accuracy on the OK-VQA dataset, surpassing baseline methods by 6.63%.
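The abstract does not specify EKDA's distillation objective, but teacher-student setups like the one described (LLaMA as teacher) commonly use a temperature-scaled Kullback-Leibler loss between softened output distributions. A minimal sketch of that standard objective, with illustrative function names and made-up logits, might look like:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL divergence from softened teacher targets to student
    predictions, scaled by T^2 (Hinton-style distillation)."""
    p = softmax(teacher_logits, T)   # teacher soft targets
    q = softmax(student_logits, T)   # student predictions
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return (T ** 2) * kl.mean()
```

The temperature `T > 1` flattens both distributions so the student also learns from the teacher's relative ranking of wrong answers, not just its top prediction; whether EKDA uses this exact form is an assumption.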
