Abstract
Converting 2D images into accurate 3D models is a core task in computer vision and graphics. However, existing multi-view generation methods still suffer from poor geometric consistency, insufficient detail recovery, and inaccurate texture mapping; these problems are especially pronounced for complex objects, where the generated 3D models often fail to remain consistent across views. To address these challenges, this paper proposes NeuroDiff3D, a model that combines 3D diffusion modeling with multimodal information fusion. NeuroDiff3D integrates structural, texture, and semantic information and consists of two main components: a 3D Prior Pipeline and a Model Training Pipeline. In the 3D Prior Pipeline, a 3D diffusion model generates a coarse 3D representation of the object, progressively recovering its geometric shape, texture details, and semantic information. In the Model Training Pipeline, these representations are further refined through the T2I-Adapter module to produce a fine-grained 3D model. Experimental results show that NeuroDiff3D outperforms existing Text-to-3D and Image-to-3D methods on the OmniObject3D and Pix3D datasets, excelling in particular at geometric consistency, detail recovery, and semantic consistency, and demonstrating strong potential in complex scenarios.