Improving the state-of-the-art in Thai semantic similarity using distributional semantics and ontological information

利用分布语义和本体信息改进泰语语义相似度计算的现状

阅读:2

Abstract

Research into semantic similarity has a long history in lexical semantics, and it has applications in many natural language processing (NLP) tasks like word sense disambiguation or machine translation. The task of calculating semantic similarity is usually presented in the form of datasets which contain word pairs and a human-assigned similarity score. Algorithms are then evaluated by their ability to approximate the gold standard similarity scores. Many such datasets, with different characteristics, have been created for English language. Recently, four of those were transformed to Thai language versions, namely WordSim-353, SimLex-999, SemEval-2017-500, and R&G-65. Given those four datasets, in this work we aim to improve the previous baseline evaluations for Thai semantic similarity and solve challenges of unsegmented Asian languages (particularly the high fraction of out-of-vocabulary (OOV) dataset terms). To this end we apply and integrate different strategies to compute similarity, including traditional word-level embeddings, subword-unit embeddings, and ontological or hybrid sources like WordNet and ConceptNet. With our best model, which combines self-trained fastText subword embeddings with ConceptNet Numberbatch, we managed to raise the state-of-the-art, measured with the harmonic mean of Pearson on Spearman ρ, by a large margin from 0.356 to 0.688 for TH-WordSim-353, from 0.286 to 0.769 for TH-SemEval-500, from 0.397 to 0.717 for TH-SimLex-999, and from 0.505 to 0.901 for TWS-65.

特别声明

1、本页面内容包含部分的内容是基于公开信息的合理引用;引用内容仅为补充信息,不代表本站立场。

2、若认为本页面引用内容涉及侵权,请及时与本站联系,我们将第一时间处理。

3、其他媒体/个人如需使用本页面原创内容,需注明“来源:[生知库]”并获得授权;使用引用内容的,需自行联系原作者获得许可。

4、投稿及合作请联系:info@biocloudy.com。