Hot on the heels of the Privacy Hub, we’re introducing more privacy-protecting features to Tonic! Today we’re excited to announce that differential privacy has come to Tonic. Differential privacy gives you confidence and visibility into the safety of the data Tonic produces. For the uninitiated, differential privacy is a mathematical guarantee about the privacy of a process. In this first release only certain data types will support differential privacy, but expect more in coming releases. Okay, that’s your TL;DR. Now let’s dig a bit deeper.
What is Differential Privacy?
Differential privacy is a property of a process. In our case the process is generating data without private info in it, i.e., what Tonic does. A process that is differentially private is guaranteed to never reveal anything attributable to a specific member of the dataset. Instead, differentially private processes can only reveal information that is broadly knowable about a dataset. For example, a differentially private process could never reveal the weight of patient #18372, but it can reveal the average weight of all patients. A more subtle example: a differentially private process can reveal the average weight of all patients within a given zip code, adjusting its output appropriately given the number of patients in the zip code. For zip codes with many patients, little adjustment may be necessary; for zip codes with few patients, it may give a less accurate result in order to protect its members.
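To make the zip code example concrete, here is a minimal sketch of one standard way to get this behavior, the Laplace mechanism. This is an illustration, not a description of how Tonic works internally. The function names, the weight bounds, and the choice of epsilon are all assumptions made for the example, and it assumes the number of patients in each zip code is public and the bounds are chosen without looking at the data.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential draws with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noise_scale(n, lower, upper, epsilon):
    # Changing one patient's (clamped) value moves the mean by at most
    # (upper - lower) / n, so that is the sensitivity of the mean.
    return (upper - lower) / (n * epsilon)


def dp_mean(values, lower, upper, epsilon, rng):
    """Noisy mean of `values` clamped to [lower, upper].

    Hypothetical helper for illustration. Assumes `values` is non-empty
    and that the group size is not itself sensitive.
    """
    n = len(values)
    clamped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clamped) / n
    return true_mean + laplace_noise(noise_scale(n, lower, upper, epsilon), rng)


if __name__ == "__main__":
    rng = random.Random()
    small_zip = [72.0, 95.5, 81.2, 64.8, 110.3]   # made-up weights in kg
    large_zip = [80.0] * 5000                      # made-up weights in kg
    print(dp_mean(small_zip, 50.0, 150.0, 1.0, rng))
    print(dp_mean(large_zip, 50.0, 150.0, 1.0, rng))
```

With these made-up bounds and epsilon = 1, the noise scale is 20 kg for the five-patient zip code and 0.02 kg for the five-thousand-patient one, which is exactly the "little adjustment for many patients, more for few" behavior described above.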
Differentially private processes have some pretty great properties. The first, and perhaps most important, is that no amount of post-processing or additional knowledge can break the guarantees of differential privacy. This isn’t true for other data anonymization techniques, e.g., k-anonymity. Additionally, differentially private data can be combined with other differentially private data without losing its protection. In practice this means data protected by a process with differential privacy cannot be reverse engineered, re-identified, or otherwise compromised, no matter the adversary.
So how does that translate to Tonic? One of the strengths of Tonic is its ability to use your own data to generate fake data. Filling a test database with random noise isn’t nearly as valuable as filling it with random data that looks like your real data, after all. With differential privacy, Tonic can guarantee the protection of the source data while still creating data based on the statistical properties of your input database. And this is important. There’s a common misconception that synthetic data derived from the statistics of source data is automatically safe, but it’s not! Not without differential privacy.
If you want to learn more about differential privacy, including more examples and mathematical definitions, check out these links:
- This blog has a nice high level review of differential privacy.
- A book by the original author of differential privacy, Cynthia Dwork.
How it works in Tonic
In the newest release of Tonic, the Categorical generator is differentially private.
The generator’s differential privacy toggle is on by default, and for most datasets this will be the best choice. With differential privacy enabled for a given column (or set of columns), you get mathematical guarantees about the protections Tonic provides: the data in columns protected this way cannot be reverse engineered or re-identified.
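For a feel of what a differentially private categorical generator can look like, here is a minimal sketch of a textbook approach: count each category, add Laplace noise to the counts, and sample new values from the noisy distribution. This is not Tonic’s implementation. The function name and parameters are hypothetical, and the sketch assumes the list of possible categories is known in advance rather than discovered from the private data.

```python
import random
from collections import Counter


def dp_categorical_sample(values, categories, epsilon, n_out, rng):
    """Sample `n_out` synthetic values from a noisy histogram of `values`.

    Hypothetical helper for illustration. `categories` is assumed to be
    a public list of allowed values, fixed without looking at the data.
    """
    counts = Counter(values)
    noisy_counts = []
    for category in categories:
        # Adding or removing one row changes one count by 1, so Laplace
        # noise with scale 1/epsilon is enough for the whole histogram.
        noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
        # Clamping at zero is post-processing and keeps the guarantee.
        noisy_counts.append(max(counts.get(category, 0) + noise, 0.0))
    if sum(noisy_counts) == 0:
        # Every noisy count was clamped to zero: fall back to uniform.
        noisy_counts = [1.0] * len(categories)
    return rng.choices(categories, weights=noisy_counts, k=n_out)


if __name__ == "__main__":
    rng = random.Random()
    real_column = ["basic"] * 700 + ["pro"] * 250 + ["enterprise"] * 50
    plans = ["basic", "pro", "enterprise"]
    fake_column = dp_categorical_sample(real_column, plans, 1.0, 1000, rng)
    print(Counter(fake_column))
```

The output column follows roughly the same proportions as the input, but because sampling happens only from the noisy counts, no single input row can noticeably change what comes out.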
Disabling differential privacy is not recommended, but there is one situation where it may be necessary: when the data is unique (or nearly unique). As discussed above, differential privacy’s guarantee is that the output isn’t attributable to any given individual in the input. When individuals are nearly unique, the system can’t reveal any data without potentially revealing something attributable to a specific individual. Tonic will warn you when a column isn’t suitable for differential privacy, and you must disable it in those cases.
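If you want a rough sense of whether one of your own columns falls into this category, a quick check like the sketch below can help. To be clear, this is not the check Tonic runs when it warns you. The function names and the 0.9 threshold are assumptions chosen purely for illustration.

```python
def uniqueness_ratio(values):
    # Fraction of rows holding a distinct value: 1.0 means every row is unique.
    if not values:
        return 0.0
    return len(set(values)) / len(values)


def looks_nearly_unique(values, threshold=0.9):
    # Hypothetical heuristic: flag columns where almost every value is distinct.
    return uniqueness_ratio(values) >= threshold


if __name__ == "__main__":
    emails = ["a@example.com", "b@example.com", "c@example.com", "d@example.com"]
    states = ["CA", "CA", "NY", "NY"]
    print(looks_nearly_unique(emails))
    print(looks_nearly_unique(states))
```

A column of email addresses will usually score close to 1.0, while a column such as US state will score close to 0 on any reasonably sized table.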
We’re all very excited for this release, and I hope after reading this post you can see why. With differential privacy Tonic can guarantee the protections it’s providing your sensitive data. Over the next couple of months we’re going to add differential privacy to our continuous data generator, as well as our free text masking generator. Cheers!