Abstract
MOTIVATION: Machine-generated or synthetic data is a valuable resource for training artificial intelligence algorithms, evaluating rare workflows, and sharing data under stricter data legislation. However, current statistical and deep learning methods struggle with large data volumes, are prone to hallucinating scenarios incompatible with reality, and seldom quantify privacy meaningfully. RESULTS: Here, we introduce Genomator, a logic-solving (SAT solving) approach that efficiently produces private and realistic representations of the original data. We demonstrate the method on genomic data, which is arguably the most complex and private type of information. We benchmark Genomator against state-of-the-art methodologies (Markov generation, Wasserstein Generative Adversarial Network and Conditional Restricted Boltzmann Machines), demonstrating a 40%-530% accuracy improvement and 57%-172% higher privacy. Genomator is also 3-100 times more efficient, making it the only tested method that scales to whole genomes. We show the universal trade-off between privacy and accuracy, and use Genomator's tuning capability to cater to all applications along the spectrum, from provably private representations of sensitive cohorts to datasets with indistinguishable pharmacogenomic profiles. Demonstrating the production-scale generation of tuneable synthetic genomes holds great potential for balancing underrepresented populations in medical research and advancing global data exchange. AVAILABILITY AND IMPLEMENTATION: Genomator is available at https://github.com/csiro/genomator.