ENTRANT: A Large Financial Dataset for Table Understanding

Abstract

Tabular data is a convenient and effective way to structure, organize, and present information. Real-world tables present data in two dimensions, arranging cells in matrices that summarize information and facilitate side-by-side comparisons. Recent research aims to train large models to understand structured tables, which enables knowledge transfer to various downstream tasks. Model pre-training, though, requires large datasets that are formatted to reflect cell and table characteristics. This paper presents ENTRANT, a financial dataset comprising millions of tables that have been transformed to reflect cell attributes as well as positional and hierarchical information. The tables therefore facilitate, among other things, pre-training tasks for table understanding with deep learning methods. The dataset provides table and cell information along with the corresponding metadata in a machine-readable format. We have automated all data processing and curation, and we have technically validated the dataset through unit tests with high code coverage. Finally, we demonstrate the use of the dataset by pre-training a state-of-the-art model, which we then use for downstream cell classification.
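To make the idea of a machine-readable table with cell attributes, positional information, and hierarchical information more concrete, the following is a minimal sketch in Python. It is not the ENTRANT schema: the class names `Cell` and `Table`, the fields `is_header` and `header_path`, and the sample values are all hypothetical and chosen only for illustration.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Cell:
    """One table cell. All field names here are illustrative assumptions."""
    text: str
    row: int                      # positional information: zero-based row index
    col: int                      # positional information: zero-based column index
    is_header: bool = False       # a simple cell attribute
    # Hierarchical information: the headers that govern this cell (hypothetical encoding).
    header_path: list = field(default_factory=list)


@dataclass
class Table:
    """A table as metadata plus a flat list of cells."""
    metadata: dict
    n_rows: int
    n_cols: int
    cells: list = field(default_factory=list)

    def to_grid(self):
        # Rebuild the two-dimensional matrix of cell texts from the flat cell list.
        grid = [["" for _ in range(self.n_cols)] for _ in range(self.n_rows)]
        for cell in self.cells:
            grid[cell.row][cell.col] = cell.text
        return grid

    def to_json(self):
        # Serialize the table, its cells, and its metadata to a JSON string.
        return json.dumps(asdict(self))


def build_example():
    # A made-up 2x2 table; the values do not come from the dataset.
    cells = [
        Cell("Item", 0, 0, is_header=True),
        Cell("2020", 0, 1, is_header=True),
        Cell("Revenue", 1, 0, is_header=True),
        Cell("100", 1, 1, header_path=["Revenue", "2020"]),
    ]
    return Table({"source": "hypothetical filing"}, 2, 2, cells)
```

Calling `build_example().to_json()` yields a single JSON string in which each cell carries its own position and header context, which is the kind of per-cell record a pre-training pipeline could consume without re-parsing the original layout.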
