An efficient learning based approach for automatic record deduplication with benchmark datasets

一种基于学习的高效自动记录去重方法,适用于基准数据集

阅读:1

Abstract

With technological innovations, enterprises in the real world are managing every iota of data as it can be mined to derive business intelligence (BI). However, when data comes from multiple sources, it may result in duplicate records. As data is given paramount importance, it is also significant to eliminate duplicate entities towards data integration, performance and resource optimization. To realize reliable systems for record deduplication, late, deep learning could offer exciting provisions with a learning-based approach. Deep ER is one of the deep learning-based methods used recently for dealing with the elimination of duplicates in structured data. Using it as a reference model, in this paper, we propose a framework known as Enhanced Deep Learning-based Record Deduplication (EDL-RD) for improving performance further. Towards this end, we exploited a variant of Long Short Term Memory (LSTM) along with various attribute compositions, similarity metrics, and numerical and null value resolution. We proposed an algorithm known as Efficient Learning based Record Deduplication (ELbRD). The algorithm extends the reference model with the aforementioned enhancements. An empirical study has revealed that the proposed framework with extensions outperforms existing methods.

特别声明

1、本页面内容包含部分的内容是基于公开信息的合理引用;引用内容仅为补充信息,不代表本站立场。

2、若认为本页面引用内容涉及侵权,请及时与本站联系,我们将第一时间处理。

3、其他媒体/个人如需使用本页面原创内容,需注明“来源:[生知库]”并获得授权;使用引用内容的,需自行联系原作者获得许可。

4、投稿及合作请联系:info@biocloudy.com。