Abstract
Raman spectroscopy and infrared (IR) spectroscopy are two important tools for determining the structure and bonding properties of molecules. With the development of deep learning methods in materials science, there is a growing demand for larger and more diverse quantum chemistry data, including spectral information. However, many spectra are still missing from current datasets. To address this gap, we used Gaussian09 to construct a dataset of Raman and IR spectra. The dataset currently covers a total of 220,000 molecules extracted from ChEMBL. The number of molecules continues to grow, and new data are uploaded regularly. The dataset comprises optimized geometries, vibrational frequencies, IR and Raman intensities, and energies, expanding both the breadth and depth of existing quantum chemistry collections. By providing high-fidelity, multidimensional feature sets, this resource enables the training and benchmarking of next-generation models for tasks including inferring substructures from spectroscopic fingerprints, assembling molecular structures from spectra, and predicting Raman or IR spectra for novel molecules.