Abstract
In recent years, diffusion models have been widely used in 3D scene-related work. However, existing diffusion models focus primarily on global structure and are constrained by predefined dataset categories, so they cannot accurately resolve the detailed structure of complex 3D scenes. This study therefore fuses Denoising Diffusion Probabilistic Models (DDPM) with the 3D U-Net architecture (Learning Dense Volumetric Segmentation from Sparse Annotation) and proposes a novel approach to generation-driven understanding of local 3D scenes: a customized 3D diffusion model for local cubes (3D-UDDPM). In contrast to conventional global or local single-structure analysis techniques, the 3D-UDDPM framework prioritizes the capture and recovery of local details when generating localized 3D scenes. In addition to accurately predicting the distribution of the noise tensor, the framework significantly enhances the understanding of localized scenes by effectively integrating spatial context information. Specifically, 3D-UDDPM combines Markov chain Monte Carlo (MCMC) sampling with variational inference to reconstruct clear structural details through stepwise backward inference, internalizing geometric features as prior knowledge to drive the generation and understanding of local 3D scenes. This diffusion process enables the model to recover fine local details while maintaining global structural coherence during gradual denoising. Combined with the spatial convolutional properties of the 3D U-Net architecture, it further improves the modelling accuracy and generation quality of complex 3D shapes, supporting robust performance in complex environments. Experiments on two benchmark datasets show that 3D-UDDPM outperforms existing methods.
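
The noise prediction and stepwise backward inference described above can be illustrated with the standard DDPM update rules. The following is a minimal sketch under stated assumptions, not the 3D-UDDPM implementation: the function names, the linear noise schedule, and the choice of reverse variance equal to beta_t are illustrative; a local cube is represented as a flattened list of voxel values; and the true noise stands in for the output of a trained 3D U-Net noise predictor.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return betas and cumulative alpha products for steps 0..num_steps-1.

    The linear schedule and its endpoints are assumptions for illustration.
    """
    betas = [beta_start + (beta_end - beta_start) * t / (num_steps - 1)
             for t in range(num_steps)]
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return betas, alpha_bars


def forward_noise(x0, t, alpha_bars, rng):
    """Sample x_t from q(x_t | x_0) in closed form; also return the noise used."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    xt = [signal * v + noise * e for v, e in zip(x0, eps)]
    return xt, eps


def reverse_step(xt, t, eps_hat, betas, alpha_bars, rng):
    """One ancestral sampling step from x_t towards x_{t-1}.

    eps_hat is the predicted noise; in a full model it would come from the
    3D U-Net evaluated on (x_t, t).
    """
    beta = betas[t]
    coef = beta / math.sqrt(1.0 - alpha_bars[t])
    scale = 1.0 / math.sqrt(1.0 - beta)
    mean = [scale * (x - coef * e) for x, e in zip(xt, eps_hat)]
    if t == 0:
        return mean  # no noise is added at the final denoising step
    sigma = math.sqrt(beta)  # assumed variance choice: sigma_t^2 = beta_t
    return [m + sigma * rng.gauss(0.0, 1.0) for m in mean]


# Usage: noise a flattened 4x4x4 cube by one step, then denoise it with the
# true noise as an oracle prediction.
rng = random.Random(0)
betas, alpha_bars = linear_beta_schedule(1000)
cube = [rng.uniform(-1.0, 1.0) for _ in range(4 * 4 * 4)]
xt, eps = forward_noise(cube, 0, alpha_bars, rng)
recovered = reverse_step(xt, 0, eps, betas, alpha_bars, rng)
```

With a perfect noise prediction, the final reverse step recovers the original cube up to floating-point error. In practice the prediction is only approximate, which is why the quality of the noise predictor governs how well local details are restored.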