Abstract
3D Gaussian Splatting (3DGS) has become a significant research focus in recent years, particularly for 3D reconstruction and novel view synthesis under non-ideal conditions. Within this line of work, tasks involving sparse input data are further subdivided, with the most challenging scenario being the reconstruction of 3D structure and synthesis of novel views from a single input image. In this paper, we introduce SVG3D, a 3DGS-based method for generalizable 3D reconstruction from a single view. We use a state-of-the-art monocular depth estimator to obtain a depth map of the scene. The depth map, together with the original scene image, is fed into a U-Net that predicts the parameters of a 3D Gaussian ellipsoid for each pixel. Unlike previous work, we do not stratify the predicted 3D Gaussian ellipsoids but instead allow the network to learn their positions autonomously. This design yields an accurate geometric representation when rendered from the target camera view, significantly improving novel view synthesis accuracy. We trained our model on the RealEstate10K dataset and performed both quantitative and qualitative analysis on its test set. We compared against single-view novel view synthesis methods built on different 3D representations, including methods based on Multi-Plane Images (MPI), hybrid MPI and Neural Radiance Field representations, and current state-of-the-art 3DGS-based methods for single-view reconstruction; these comparisons substantiate the effectiveness and accuracy of our method. Additionally, to assess the generalizability of our network, we validated it on the NYU and KITTI datasets, and the results confirm its robust cross-dataset generalization capability.
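To make the described pipeline concrete, the following is a minimal, illustrative sketch of per-pixel Gaussian parameter prediction from an RGB image and a monocular depth map. It is not the authors' implementation: the class name `GaussianParamNet`, the toy encoder-decoder standing in for the U-Net, and the exact parameter split (position offset, scale, rotation, opacity, color) are assumptions for illustration only.

```python
# Hypothetical sketch (not the SVG3D implementation): a small encoder-decoder
# stands in for the paper's U-Net. Input is RGB concatenated with a depth map;
# output is one set of 3D Gaussian parameters per pixel. The 14-channel split
# (3 offsets, 3 scales, 4 quaternion, 1 opacity, 3 RGB) is an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GaussianParamNet(nn.Module):  # hypothetical name
    def __init__(self, in_ch: int = 4, feat: int = 32, out_ch: int = 14):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat * 2, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(feat * 2, feat, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat, out_ch, 3, padding=1),
        )

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> dict:
        x = torch.cat([rgb, depth], dim=1)      # (B, 4, H, W) RGB-D input
        raw = self.decoder(self.encoder(x))     # (B, 14, H, W) raw parameters
        offset, log_scale, rot, opacity, color = torch.split(
            raw, [3, 3, 4, 1, 3], dim=1
        )
        return {
            "xyz_offset": offset,                    # unconstrained position offsets
            "scale": torch.exp(log_scale),           # positive per-axis scales
            "rotation": F.normalize(rot, dim=1),     # unit quaternion per pixel
            "opacity": torch.sigmoid(opacity),       # opacity in (0, 1)
            "color": torch.sigmoid(color),           # RGB in (0, 1)
        }


if __name__ == "__main__":
    net = GaussianParamNet()
    rgb = torch.rand(1, 3, 64, 64)
    depth = torch.rand(1, 1, 64, 64)   # e.g. output of a monocular depth estimator
    params = net(rgb, depth)
    print({k: tuple(v.shape) for k, v in params.items()})
```

Under these assumptions, each pixel yields one Gaussian whose position is derived from the predicted offset (e.g. relative to the unprojected depth), so the network is free to place primitives without a fixed stratification, in line with the abstract's description.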