Abstract
Synthetic data are a popular means of publishing useful datasets in a privacy-aware manner, making them valuable across a range of scientific domains involving human subjects. They are typically generated by sampling from models that mimic the probability distribution of a real dataset, thereby maximizing statistical similarity to the real data. However, we argue and demonstrate that synthetic data need to be similar only in the ways relevant to their intended use and may neglect irrelevant information, which in turn can improve privacy protection. We therefore propose a data synthesis method that we call fidelity-agnostic synthetic data. The method first extracts features relevant to the dataset's intended use with a neural network, then generates synthetic versions of the extracted features, and finally decodes them to mimic the real dataset. We show that our synthetic data improve performance on prediction tasks while retaining privacy protection comparable to other state-of-the-art methods.
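The three-stage pipeline sketched in the abstract (extract task-relevant features, synthesize new feature vectors, decode back to data space) might look roughly as follows. This is a hedged illustration only: the paper uses a neural network for feature extraction, whereas here a linear (PCA-style) projection stands in for it, and a Gaussian fit stands in for the feature generator; all function names (`fit_encoder`, `synthesize`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_encoder(X, k):
    # Stand-in "feature extractor": the top-k principal directions.
    # (The actual method uses a neural network trained for the task.)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k]                       # (k, d) projection matrix

def synthesize(X, k, n_samples):
    W = fit_encoder(X, k)
    mu = X.mean(axis=0)
    # (1) Encode: keep only the task-relevant features.
    Z = (X - mu) @ W.T
    # (2) Synthesize features from a Gaussian fitted to the real features.
    z_new = rng.multivariate_normal(Z.mean(axis=0),
                                    np.cov(Z, rowvar=False),
                                    size=n_samples)
    # (3) Decode synthetic features back to the original data space.
    return z_new @ W + mu

X_real = rng.normal(size=(200, 5))
X_syn = synthesize(X_real, k=2, n_samples=100)
print(X_syn.shape)                      # (100, 5)
```

Because step (1) discards directions irrelevant to the intended use, the synthetic records carry less information about the originals than full-fidelity generators would, which is the intuition behind the claimed privacy gain.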