How Prop 24 Puts Anonymized Data at Greater Risk of Re-identification

Ian Coe
November 6, 2020

    Californians voted Yes on Prop 24, approving extensive amendments to the California Consumer Privacy Act with the aim of strengthening the U.S.’s most comprehensive data privacy regulation to date. But one of its provisions undermines that goal in a way that puts both individual privacy and corporate security at heightened risk.

    Specifically, a collection of amendments grants service providers expanded powers to combine consumer datasets obtained from different sources (see Secs. 1798.140(ag)(1) and 1798.185(e)(10)). In the world of data anonymization, the risks of combining datasets are well known. When paired with additional data, individuals in many anonymized datasets are surprisingly easy to re-identify. This can stem from inconsistent anonymization approaches across sources, as well as from the sheer number of added data points. Real-world instances include anonymized medical records being re-identified when pooled with voter databases; the infamous Netflix user re-identification achieved by combining anonymized Netflix data with IMDB user movie ratings; and anonymized Swiss Supreme Court cases involving pharmaceutical companies being re-identified using publicly available databases.
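    To make the mechanism concrete, here is a minimal sketch of a linkage attack. The records, field names, and quasi-identifiers (ZIP code, birth year, sex) are all invented for illustration; real attacks work the same way, just at much larger scale.

```python
# A minimal sketch of a linkage attack: joining an "anonymized" dataset
# with a public one on shared quasi-identifiers. All records are invented.

# "Anonymized" medical records: names removed, quasi-identifiers kept.
medical = [
    {"zip": "94107", "birth_year": 1984, "sex": "F", "diagnosis": "asthma"},
    {"zip": "94110", "birth_year": 1991, "sex": "M", "diagnosis": "diabetes"},
]

# Public voter roll: names alongside the same quasi-identifiers.
voters = [
    {"name": "Alice Smith", "zip": "94107", "birth_year": 1984, "sex": "F"},
    {"name": "Bob Jones", "zip": "94110", "bir_year".replace("bir", "birth"): 1991, "sex": "M"},
    {"name": "Carol Lee", "zip": "94110", "birth_year": 1978, "sex": "F"},
]

def reidentify(medical, voters):
    """Link each medical record to any voter whose quasi-identifiers
    match it exactly; a unique match re-identifies the record."""
    results = []
    for rec in medical:
        key = (rec["zip"], rec["birth_year"], rec["sex"])
        matches = [v for v in voters
                   if (v["zip"], v["birth_year"], v["sex"]) == key]
        if len(matches) == 1:
            results.append((matches[0]["name"], rec["diagnosis"]))
    return results

print(reidentify(medical, voters))
# → [('Alice Smith', 'asthma'), ('Bob Jones', 'diabetes')]
```

    Neither dataset names the patients on its own, but the overlap in ordinary-looking fields is enough to link diagnoses to names.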

    We’ve talked previously about how the CCPA impacts dev teams thanks to its extremely broad definition of the types of data requiring protection. Today, we’ll look at how Prop 24 throws that data under the bus by loosening its restrictions on data pooling.

    A practical example

    Imagine a handful of companies in the same industry, say EdTech, all working with the same third-party service provider to debug software, perform testing, or develop apps. Each company anonymizes its users’ data before sharing it with the service provider. Thanks to Prop 24, the service provider pools those datasets to get the most out of its work. Then the service provider experiences a breach. Individually, those anonymized user datasets may have been safe from re-identification. But combined, they make it far easier to connect the dots and re-identify users, with up to 99.9% accuracy.

    The end result? Multiple companies grappling with crisis management thanks to a single data leak of “anonymized” data. The inclusion of this provision in the newly-passed Prop 24 significantly weakens the individual privacy protections that the CCPA otherwise aims to safeguard. What’s more, it weakens a company’s ability to ensure the privacy of their data when working with third parties.

    What to do about it

    More and more companies are relying on third-party providers to expand their development teams, outsource QA testing, or engage in other ways that rely on shared datasets. Where the law opens the door to data pooling, concerned companies should specify in their contracts if they want that door kept firmly shut.

    But maybe a company stands to benefit from the provider having a larger pool of data to work with. In that case, here are two approaches that allow for reaping the benefits of working with third parties without upping the risks to data security.

    Ironclad data de-identification

    Simple redaction, pseudonymization, and sampling aren’t going to cut it. Advanced subsetting paired with full database obfuscation? Now we’re talking. Data beyond what the law defines as “sensitive personal information” and “unique identifiers” could be used to re-identify individuals when combined with other datasets. Taking a more comprehensive approach to data de-identification provides the safeguards needed against re-identification due to data pooling. And with tools like those available in Tonic to link columns and preserve relationships in the protected output dataset, companies can strengthen their data security without sacrificing their data’s utility.
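    As a rough illustration of the idea (a sketch, not Tonic’s actual implementation), the snippet below uses a keyed hash to pseudonymize a user ID consistently across two tables, so the foreign-key relationship survives de-identification while the original identifier does not. The table layout, field names, and secret are all hypothetical.

```python
# A minimal sketch of relationship-preserving de-identification: the same
# real user ID always maps to the same pseudonym, so foreign keys still
# join across tables. All names, values, and the key are hypothetical.
import hashlib
import hmac

SECRET = b"rotate-me"  # hypothetical per-run secret; never ship a real key

def pseudonymize(value: str) -> str:
    """Keyed hash: consistent within a run, but not reversible without
    the secret, unlike a plain lookup table of real-to-fake IDs."""
    return hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()[:12]

users = [{"user_id": "u-1001", "email": "alice@example.com"}]
orders = [{"order_id": 1, "user_id": "u-1001", "total": 42.50}]

# De-identify both tables with the same mapping; drop direct identifiers.
safe_users = [{"user_id": pseudonymize(u["user_id"]), "email": "redacted"}
              for u in users]
safe_orders = [{**o, "user_id": pseudonymize(o["user_id"])} for o in orders]

# The pseudonymized foreign key still links the two tables.
assert safe_orders[0]["user_id"] == safe_users[0]["user_id"]
```

    The output dataset keeps its referential integrity, so developers can still join, test, and debug against it, without the original identifiers ever leaving the source.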

    Data synthesis with differential privacy 

    For the strongest data security, data synthesis performed with differential privacy at its core not only provides mathematical guarantees of data protection, it also allows for scaling datasets up to any size, equipping developers with the amount of realistic, yet wholly fictitious, data they need to do their best work. Here, too, Tonic is leading the charge.
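    At its simplest, differential privacy works by adding carefully calibrated noise to query results. The sketch below shows the classic Laplace mechanism applied to a counting query. It is a toy illustration of the core idea, not how a full synthesis pipeline is built, and the records and epsilon values are arbitrary examples.

```python
# A minimal sketch of the Laplace mechanism, the basic building block of
# differential privacy. Records and epsilon values are invented examples.
import random

def dp_count(records, predicate, epsilon=1.0):
    """Differentially private count: the true count plus Laplace noise.

    A counting query has sensitivity 1 (adding or removing one person
    changes the count by at most 1), so noise drawn from
    Laplace(scale=1/epsilon) satisfies epsilon-differential privacy.
    """
    true_count = sum(1 for r in records if predicate(r))
    # The difference of two Exp(1) draws, scaled, is a Laplace sample.
    scale = 1.0 / epsilon
    noise = scale * (random.expovariate(1.0) - random.expovariate(1.0))
    return true_count + noise

# Hypothetical records: 60 of 100 users are minors.
records = [{"age": 15 if i < 60 else 35} for i in range(100)]
noisy = dp_count(records, lambda r: r["age"] < 18, epsilon=0.5)
# `noisy` hovers around 60, but no single user's presence is revealed.
```

    A smaller epsilon means more noise and stronger privacy; a larger epsilon means more accurate answers. That tunable, mathematically provable trade-off is what sets differential privacy apart from ad hoc anonymization.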

    Why companies should act

    Two additional components of Prop 24 are worth highlighting. First, Prop 24 establishes the California Privacy Protection Agency, expanding the resources available to implement and enforce the CCPA (see Sec. 1798.199.10). This strengthens the state’s ability to take legal action against companies that violate privacy rights.

    Second, Prop 24 missed an opportunity to bring the CCPA in line with the GDPR by requiring that companies have users opt into data collection, as opposed to leaving the burden on users to opt out. In practice, this means that under the CCPA, privacy is not the default. And most consumers won’t take the time to click through pop-ups, on/off forms, or privacy settings to opt out.

    This leaves the onus on companies to do right by the data they're collecting by default in order to avoid the legal ramifications that are now easier to enforce.

    The CCPA and Prop 24 are designed to protect consumers. In the best of worlds, protecting consumers also means protecting businesses from the fallout of poor data practices. Despite its best intentions, Prop 24 stumbles in achieving these goals. Enabling service providers to combine datasets makes companies that work with third parties more vulnerable to data leaks, despite the protections those companies may have put in place through data anonymization.

    Where Prop 24 does succeed is in paving a clear path for taking legal action against companies that misuse or fail to protect their users’ data. It’s up to companies to implement strong data protection practices regardless of the law’s lax requirements, to avoid costly fines and damaging leaks.

    Ian Coe
    Co-Founder & CEO
    Ian Coe is the Co-founder and CEO of Tonic. For over a decade, Ian has worked to advance data-driven solutions by removing barriers impeding teams from answering their most important questions. As an early member of the commercial division at Palantir Technologies, he led teams solving data integration challenges in industries ranging from financial services to the media. At Tableau, he continued to focus on analytics, driving the vision around statistics and the calculation language. As a founder of Tonic, Ian is directing his energies toward synthetic data generation to maximize data utility, protect customer privacy, and drive developer efficiency.