Privacy-hardened and hallucination-resistant synthetic data generation with logic-solvers

Abstract

MOTIVATION: Machine-generated, or synthetic, data is a valuable resource for training artificial intelligence algorithms, evaluating rare workflows, and sharing data under stricter data legislation. However, current statistical and deep learning methods struggle with large data volumes, are prone to hallucinating scenarios incompatible with reality, and seldom quantify privacy meaningfully.

RESULTS: Here, we introduce Genomator, a logic-solving (SAT solving) approach that efficiently produces private and realistic representations of the original data. We demonstrate the method on genomic data, which is arguably the most complex and private kind of information. We benchmark Genomator against state-of-the-art methodologies (Markov generation, Wasserstein Generative Adversarial Network and Conditional Restricted Boltzmann Machines), demonstrating a 40%-530% accuracy improvement and 57%-172% higher privacy. Genomator is also 3-100 times more efficient, making it the only tested method that scales to whole genomes. We show the universal trade-off between privacy and accuracy, and use Genomator's tuning capability to cater to all applications along the spectrum, from provably private representations of sensitive cohorts to datasets with indistinguishable pharmacogenomic profiles. Demonstrating the production-scale generation of tuneable synthetic genomes holds great potential for balancing underrepresented populations in medical research and advancing global data exchange.

AVAILABILITY AND IMPLEMENTATION: Genomator is available at https://github.com/csiro/genomator.
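
The abstract does not spell out the constraints Genomator hands to its SAT solver, so the following is only a toy sketch of the general idea of constraint-based generation, not the authors' method. It assumes two illustrative rules: every pair of values in the synthetic record must co-occur in at least one source record (a stand-in for "no hallucination"), and the synthetic record must differ from every source record in at least `min_dist` positions (a stand-in for "privacy"). The names `pair_seen`, `hamming`, `generate`, `cohort` and `min_dist` are hypothetical, and a plain backtracking search takes the place of a real SAT solver.

```python
# Toy sketch only: both rules below are assumptions made for illustration.

def pair_seen(data, i, a, j, b):
    """True if some source record has value a at position i and b at position j."""
    return any(row[i] == a and row[j] == b for row in data)


def hamming(u, v):
    """Number of positions at which two equal-length records differ."""
    return sum(x != y for x, y in zip(u, v))


def generate(data, min_dist):
    """Return one binary record satisfying both assumed rules, or None."""
    n = len(data[0])

    def extend(prefix):
        k = len(prefix)
        if k == n:
            # Assumed privacy rule: stay at least min_dist away from every source.
            if all(hamming(prefix, row) >= min_dist for row in data):
                return list(prefix)
            return None
        for value in (0, 1):
            # Assumed realism rule: the value, and each pair it forms with an
            # earlier position, must already occur in some source record.
            if not any(row[k] == value for row in data):
                continue
            if all(pair_seen(data, i, prefix[i], k, value) for i in range(k)):
                result = extend(prefix + [value])
                if result is not None:
                    return result
        return None  # dead end: backtrack

    return extend([])


# Hypothetical four-marker cohort, one list per individual.
cohort = [
    [0, 0, 1, 1],
    [0, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
]
synthetic = generate(cohort, min_dist=1)
print(synthetic)
```

Raising `min_dist` pushes the output further from every source record and eventually makes the problem unsatisfiable, which mirrors, in miniature, the privacy-accuracy trade-off the abstract describes.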
