Subset selection for domain adaptive pre-training of language model

Abstract

Pre-trained language models have brought significant performance improvements in many natural language understanding tasks. Domain-adaptive language models, which are trained on a domain-specific corpus, achieve high performance in their target domains. However, pre-training these models on large amounts of domain-specific data requires a substantial computational budget and resources, which motivates the development of efficient pre-training methods. In this paper, we propose a novel subset selection method called AlignSet, which extracts an informative subset from a given domain dataset for efficient pre-training. Our goal is to extract an informative subset that enables the language model to learn faster than it would from the entire dataset. Through experiments across multiple domains, we demonstrate that AlignSet produces better subsets than other methods.
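The abstract does not describe AlignSet's selection criterion, so the sketch below is only a generic illustration of the overall setup it addresses: score each candidate document for domain relevance and keep the top-scoring fraction for continued pre-training. The toy hashed bag-of-words embedding, the similarity-to-centroid score, and names such as `select_subset` and `keep_fraction` are illustrative assumptions, not the authors' method.

```python
import numpy as np

def embed(text: str, dim: int = 256) -> np.ndarray:
    """Toy hashed bag-of-words embedding (stand-in for a real sentence encoder)."""
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def select_subset(candidates, domain_seed_docs, keep_fraction=0.3):
    """Keep the candidates most similar to the centroid of a small in-domain seed set.

    This is a generic score-and-keep-top-k baseline, not AlignSet itself.
    """
    centroid = np.mean([embed(d) for d in domain_seed_docs], axis=0)
    scores = np.array([float(embed(d) @ centroid) for d in candidates])
    k = max(1, int(len(candidates) * keep_fraction))
    top = np.argsort(-scores)[:k]
    return [candidates[i] for i in top]

if __name__ == "__main__":
    seed = ["the patient was treated with antibiotics",
            "clinical trial results for the new drug"]
    pool = ["stock prices fell sharply today",
            "the drug reduced symptoms in most patients",
            "the team won the championship game",
            "antibiotic resistance is a growing clinical concern"]
    # Keep the half of the pool that looks most like the biomedical seed set.
    print(select_subset(pool, seed, keep_fraction=0.5))
```

In practice the toy embedding would be replaced by a proper encoder (or another domain-relevance signal such as perplexity under a general-domain model), and the selected subset would feed the continued pre-training run in place of the full corpus.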
