Evaluating the utility of data integration with synthetic data and statistical matching

Abstract

Data integration enhances dataset utility but raises privacy concerns because it increases disclosure risk. Synthetic data offers a potential solution, though its role in data integration has not been thoroughly investigated. This study assesses data integration with synthetic data by evaluating the impact of varying the common variables used in statistical matching and by exploring combinations of synthetic and real datasets in donor-recipient settings. We used data from the Korean Genome and Epidemiology Study (KoGES) cohort, with the full dataset as the donor and one-quarter of the subjects as the recipient. Multiple synthetic datasets were generated from both datasets, with varying sets of common variables. Statistical matching was conducted using the nearest-neighbor hot-deck method. Data utility was evaluated using confidence interval overlap measures for hazard ratio estimates under clinical scenarios predicting diabetes onset. When both donor and recipient data were synthetic, data matched on all available common variables generally outperformed other matching conditions. However, matching on clinically relevant variables occasionally showed equivalent performance. The synthetic data showed model accuracy comparable to that of real data, although further investigation is warranted to understand the performance differences. Statistically matched synthetic data offers utility comparable to real data, providing a potential approach for reducing privacy risks while maintaining data utility.
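
To make the two technical steps named in the abstract concrete, the following is a minimal sketch in Python, not the study's implementation. It assumes Euclidean distance on standardized common variables for the nearest-neighbor hot-deck step, and the commonly used averaged-overlap definition for the confidence interval overlap; the abstract specifies neither. The function names (`nn_hotdeck`, `ci_overlap`) and the toy age/BMI/glucose values are hypothetical and chosen only for illustration.

```python
import numpy as np

def nn_hotdeck(recipient_x, donor_x, donor_z):
    """Nearest-neighbor hot-deck matching (illustrative sketch).

    For each recipient row, find the closest donor row on the common
    variables and copy that donor's value of the variable that the
    recipient lacks. Distance metric is an assumption: Euclidean on
    variables standardized with the donor mean and standard deviation.
    """
    recipient_x = np.asarray(recipient_x, dtype=float)
    donor_x = np.asarray(donor_x, dtype=float)
    donor_z = np.asarray(donor_z)

    mu = donor_x.mean(axis=0)
    sd = donor_x.std(axis=0)
    sd[sd == 0] = 1.0  # avoid division by zero for constant columns

    r = (recipient_x - mu) / sd
    d = (donor_x - mu) / sd

    # Pairwise squared distances, shape (n_recipient, n_donor)
    dist = ((r[:, None, :] - d[None, :, :]) ** 2).sum(axis=2)
    idx = dist.argmin(axis=1)  # index of the nearest donor per recipient
    return donor_z[idx], idx

def ci_overlap(ci_a, ci_b):
    """Average proportion of each interval covered by their intersection.

    Returns 1.0 for identical intervals; the value becomes negative
    when the intervals are disjoint (no clipping is applied here).
    """
    lo_a, hi_a = ci_a
    lo_b, hi_b = ci_b
    inter = min(hi_a, hi_b) - max(lo_a, lo_b)
    return 0.5 * (inter / (hi_a - lo_a) + inter / (hi_b - lo_b))

# Toy data (hypothetical): common variables are age and BMI;
# the donor-only variable is a fasting glucose value.
donor_x = [[50, 24.0], [60, 30.0], [40, 22.0]]
donor_z = [5.1, 6.4, 4.8]
recipient_x = [[41, 22.5], [59, 29.0]]

imputed_z, matched_idx = nn_hotdeck(recipient_x, donor_x, donor_z)

# Compare two hypothetical hazard-ratio confidence intervals
overlap = ci_overlap((1.0, 2.0), (1.5, 2.5))
```

In this toy case each recipient receives the glucose value of its closest donor, and the two intervals share half of their width. In a setting like the one described, the overlap would be computed between the hazard-ratio interval estimated on the matched data and the one estimated on the reference data, with values closer to 1 indicating higher utility.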