Creating Realistic, Secure Test Data for Databricks

May 11, 2021

TL;DR: Databricks is a data analytics platform built to handle the scale and complexity of today’s data. Tonic integrates seamlessly with Databricks to generate synthetic test data based on production data that is both complex enough to be valuable and secure enough to protect user privacy.  

Essentials of Databricks 

Databricks and Apache Spark were born together at UC Berkeley’s AMP (Algorithms, Machines and People) Lab. Both are about eight years old, but very mature for their age.

Databricks is a cloud-based unified analytics platform with built-in support to let you do your best work, while Apache Spark is an open-source unified analytics engine for processing your data. They are pre-integrated so that you don’t have to configure the Spark cluster in Databricks. VM creation, configuration, network, security, storage, and the like are ready to go.

You can find Databricks in more than 5,000 orgs, including big names like CVS Health, Shell, Starbucks and T-Mobile. It is particularly popular among data scientists for its shared, interactive notebooks (ideal for experimenting) and the freedom to switch back and forth between common languages like Scala, Python, R, Java and SQL as necessary.

Over the past year, Databricks has self-identified more as a “lakehouse” than a typical data warehouse or data lake. By that, they mean that the platform features the kinds of data structures you would expect in a data warehouse along with the low-cost functions and utilities of a typical data lake. This makes it easier for developers of various skill levels to work with the 3 V’s of big data - volume, velocity, and variety - necessary for applications in business intelligence (BI) and machine learning (ML).

But when it comes to working securely with that data for the purposes of software testing and development, those 3 V’s translate into significant challenges.

The Problems of Generating Mimicked Data from Databricks 

The close kinship between Databricks and Spark means that Databricks can easily ingest streaming data from many different sources simultaneously and transform it in real time. The flip side is that anonymizing data at that speed, and with that much heterogeneity, is extremely resource intensive.

Databricks lends itself to hundreds of tables, billions of rows, and petabytes of data. At that scale and complexity, legacy masking solutions and in-house anonymization scripts simply won’t keep up.

Add to this challenge the growing urgency around guaranteeing that your data is sufficiently anonymized to satisfy today’s rapidly shifting legal landscape. Data privacy laws vary greatly from state to state and country to country, and your organization is responsible for complying with the most restrictive guidelines applicable wherever it operates, even as those laws change.

What’s more, every organization has to deal with the fluidity of access management: people come and go, employees change roles, and contractors gain temporary access to environments. Meanwhile, data leaks can go undetected for months.

The most responsible course of action is to implement a scalable, fully integrated system that keeps production data completely cut off from your testing environments. That system should generate synthetic test data that looks and feels just like the original, so that your testing is effective and your data’s privacy is protected. To do this, you’ll need a system that sits between your production and test databases and can connect directly to where your data lives in Databricks.

How Tonic Supports Databricks

Tonic supports running Spark jobs via Databricks on AWS. To run Tonic with Databricks on Azure, contact our experts at support@tonic.ai.

As data sources, Tonic supports Parquet and JSON, and writes the output for both as Parquet. Avro is also supported, as both a data source and an output format.

For more details and updates, refer to the docs.

Connecting Tonic to Databricks

Databricks supports both MANAGED and EXTERNAL tables. MANAGED tables store all of their data within Databricks, whereas EXTERNAL tables store their data on a separate file system (often S3). Tonic can read from both table types, but it writes output data only to EXTERNAL tables.
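To make the difference between the two table types concrete, here is a minimal sketch in Python that builds Spark SQL `CREATE TABLE` statements. In Spark SQL, a table created with an explicit `LOCATION` is external, while one created without it is managed. The helper `create_table_ddl`, the table names, and the S3 path are all hypothetical and for illustration only; this is not how Tonic itself creates tables.

```python
def create_table_ddl(table, columns, location=None, file_format="PARQUET"):
    """Build a Spark SQL CREATE TABLE statement (illustrative helper).

    Without a location the table is MANAGED: the metastore owns the data.
    With a location the table is EXTERNAL: the data lives at that path.
    """
    cols = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    ddl = f"CREATE TABLE {table} ({cols}) USING {file_format}"
    if location is not None:
        ddl += f" LOCATION '{location}'"
    return ddl


# Hypothetical schema and names, chosen only for this example.
columns = [("id", "BIGINT"), ("email", "STRING")]

managed_ddl = create_table_ddl("prod.users", columns)
external_ddl = create_table_ddl(
    "test.users", columns, location="s3://example-bucket/test/users"
)

print(managed_ddl)
print(external_ddl)
```

On a Databricks cluster, statements like these could be passed to `spark.sql(...)`. Because the output tables are EXTERNAL, you would need a storage location (such as an S3 path) available for the de-identified data before running a job.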

Tools for Mimicking Data in Databricks with Tonic

Creating valuable test data means threading the needle between what’s safe and what’s useful. It requires a set of tools that combines the best aspects of data anonymization and synthesis to optimize the balance between these priorities. With Tonic, we’re building a toolbox for the specific goal of generating the most realistic test data possible that fully preserves data privacy. And as all the players in the datasphere adapt and evolve, our tools adapt right along with them.

For Databricks users, here are just a few of our most valuable tools:

  • Advanced subsetting - Given the scale of today’s data, creating quality test data often starts with subsetting. Whether it’s to minimize your data footprint for de-identification or to home in on a bug in production, subsetting your data using percentages or custom WHERE clauses while preserving referential integrity across multiple tables can significantly streamline CI/CD. For more information, see our docs.
  • Consistency across tables - This ensures that the same input always maps to the same output across an entire data ecosystem, which can be essential when you need to fully anonymize a field and still use it in a join, preserve the cardinality of a column, or match duplicate data across databases.
  • Dependency linking and partitioning - Easily mirror complex relationships in your data by linking and partitioning as many columns as you need.
  • Add foreign key relationships - If there are foreign keys that aren’t yet assigned in your database but are essential for preserving referential integrity when generating your test data, you can use Tonic’s foreign key tool to add as many as needed. This can be particularly useful in generating a quality subset of your data.
  • Differential privacy - Applying differential privacy to your data de-identification process provides mathematical guarantees of data privacy. Tonic allows you to apply differential privacy to a growing number of its generators.
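To illustrate the idea behind consistency across tables, here is a minimal sketch in Python. It uses a keyed hash (HMAC-SHA256) so that the same input always maps to the same masked output, which keeps joins and column cardinality intact after masking. This is one common way to get deterministic pseudonyms and is not a description of Tonic’s internal implementation; the key, the `pseudonymize` helper, and the sample rows are all assumptions made for the example.

```python
import hashlib
import hmac

# Hypothetical secret; a real deployment would load this from a secrets manager.
SECRET_KEY = b"example-key-not-for-production"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Deterministically map an email to a fake one using a keyed hash."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    # 12 hex characters keep the example readable; use more in practice
    # to make collisions between distinct inputs even less likely.
    return f"user_{digest[:12]}@example.com"


def totals_by_plan(users, orders):
    """Join orders to users on email and sum order totals per plan."""
    plan_of = {u["email"]: u["plan"] for u in users}
    totals = {}
    for order in orders:
        plan = plan_of[order["email"]]
        totals[plan] = totals.get(plan, 0) + order["total"]
    return totals


# Made-up rows standing in for two tables that share an email column.
users = [
    {"email": "ada@corp.com", "plan": "pro"},
    {"email": "bob@corp.com", "plan": "free"},
]
orders = [
    {"email": "ada@corp.com", "total": 40},
    {"email": "ada@corp.com", "total": 15},
    {"email": "bob@corp.com", "total": 7},
]

# Mask each table independently; consistency keeps the join key aligned.
masked_users = [{**u, "email": pseudonymize(u["email"])} for u in users]
masked_orders = [{**o, "email": pseudonymize(o["email"])} for o in orders]

print(totals_by_plan(users, orders))                # {'pro': 55, 'free': 7}
print(totals_by_plan(masked_users, masked_orders))  # {'pro': 55, 'free': 7}
```

The two tables are masked separately, yet the join produces the same aggregates because identical inputs yield identical outputs. The number of distinct values in the email column is also unchanged, so cardinality is preserved.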

Getting Started

Ready to start safely generating realistic test data for Databricks? We’re here to equip you with the tools you need to tackle the challenge. Let’s chat.

