Abstract
Accurate and transparent estimation of greenhouse gas emissions is essential for corporate sustainability reporting and machine learning applications. Existing emission-factor datasets have restrictive licenses, insufficient spatiotemporal granularity, or outdated information, which limits reproducibility and their utility across disciplines. We present ExioML, an open-source dataset derived from Exiobase 3.8.2. It combines environmentally extended multi-regional input-output tables with a graphics processing unit (GPU)-accelerated computational toolkit, making it easier to use alongside, and extend to, other datasets. ExioML contains sector-level emission-factor data for 49 regions over the 28 years from 1995 to 2022, organized into two aggregation schemes: a product-by-product format covering 200 categories and an industry-by-industry format covering 163 categories. To validate the usability of the dataset and establish a reproducible baseline, we define a regression task for predicting sectoral greenhouse gas emissions. We evaluate tree-based and neural-network-based models on this task, using mean squared error as the evaluation metric. ExioML provides openly accessible emission-factor tables and a reproducible baseline intended to support reuse and benchmarking across sustainability and machine-learning studies.
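To illustrate the evaluation metric named above, the following is a minimal sketch of how mean squared error could be computed for a regression baseline. It does not reproduce the paper's models or data: the target values are synthetic placeholders, the helper `mean_squared_error` is defined here for illustration only, and the train-mean predictor is a stand-in for the tree-based and neural-network-based models.

```python
def mean_squared_error(y_true, y_pred):
    """Return the mean of squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must be non-empty")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Synthetic placeholder targets (assumption: these are NOT ExioML values).
train_targets = [1.0, 2.0, 3.0, 6.0]
test_targets = [2.0, 4.0]

# Stand-in baseline (assumption): always predict the mean of the training targets.
baseline_prediction = sum(train_targets) / len(train_targets)
test_predictions = [baseline_prediction] * len(test_targets)

baseline_mse = mean_squared_error(test_targets, test_predictions)
print(f"baseline MSE: {baseline_mse}")
```

A constant predictor of this kind gives a simple reference point: a learned model is informative only if its mean squared error on held-out data falls below this value.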