Abstract
OBJECTIVE: The reconstruction of visual stimuli and captions from brain activity offers a distinctive viewpoint on how perception reconstructs the external world within neural dynamics. Despite considerable advances in deep generative models in recent years, simultaneously generating images and captions with both fine-grained accuracy and semantic consistency remains a significant challenge.

METHODS: We introduce panoptic segmentation and generative semantics into brain decoding for the first time, providing enhanced multi-level data support and a novel perspective on the field. Using multi-scale fusion techniques, we integrate pixel features from natural images with structural features from panoptic segmentation to create a state-of-the-art "initial guess." Building on a neural paradigm that we identified, we propose a semantic connection strategy to guide image reconstruction. Additionally, we fine-tune visual semantics within the compressed encoding space of a language model and combine our retrieval module with the comprehension capabilities of large language models (LLMs) to generate high-quality brain captions.

RESULTS: Experimental results demonstrate that our method surpasses current approaches on visual decoding and brain captioning tasks. We provide a webpage showcasing the results: www.neuai4science.cn:5001/brain_visual_decode.

CONCLUSION: Our proposed Brain-Imager framework, which incorporates multi-level data and semantic guidance, sets a new standard in the field.

SIGNIFICANCE: This work provides a novel perspective on the relationship between text and image semantics and the visual pathways of the human brain, with potential applications in downstream tasks such as brain-computer interfaces. Our code is publicly available at https://github.com/songqianyi01/Brain-Imager.
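To make the fusion step in METHODS concrete, the sketch below shows one plausible realization of multi-scale fusion of pixel features and panoptic-segmentation structural features. It is a minimal PyTorch sketch, not the paper's implementation: the module name `MultiScaleFusion`, the channel widths, and the concatenate-then-project (1x1 convolution) design are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleFusion(nn.Module):
    """Illustrative sketch: fuse pixel features from a natural image with
    structural features from its panoptic segmentation at several scales."""

    def __init__(self, channels=(64, 128, 256)):
        super().__init__()
        # One 1x1 projection per scale; the channel widths are assumptions.
        self.fuse = nn.ModuleList(
            nn.Conv2d(2 * c, c, kernel_size=1) for c in channels
        )

    def forward(self, pixel_feats, seg_feats):
        # Both arguments: lists of feature maps ordered fine-to-coarse,
        # with matching shapes per scale, e.g. (B, 64, 64, 64), ...
        fused = []
        for proj, (p, s) in zip(self.fuse, zip(pixel_feats, seg_feats)):
            fused.append(proj(torch.cat([p, s], dim=1)))  # concat, then project
        return fused  # multi-scale features forming the "initial guess"


# Minimal usage example with random tensors standing in for real features.
shapes = [(64, 64), (128, 32), (256, 16)]
pixel_feats = [torch.randn(1, c, s, s) for c, s in shapes]
seg_feats = [torch.randn(1, c, s, s) for c, s in shapes]
out = MultiScaleFusion()(pixel_feats, seg_feats)
print([f.shape for f in out])
```

A downstream decoder would then map such fused features to the reconstructed image; how Brain-Imager performs that mapping is described in the body of the paper, not in this sketch.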