TL;DR: Databricks is a data analytics platform built to handle the scale and complexity of today’s data. Tonic integrates seamlessly with Databricks to generate synthetic test data based on production data that is both complex enough to be valuable and secure enough to protect user privacy.
Databricks and Apache Spark were born together at UC Berkeley’s AMP (Algorithms, Machines and People) Lab. Both are about eight years old, but very mature for their age.
Databricks is a cloud-based unified analytics platform with built-in support to let you do your best work, while Apache Spark is an open-source unified analytics engine for processing your data. They are pre-integrated so that you don’t have to configure the Spark cluster in Databricks. VM creation, configuration, network, security, storage, and the like are ready to go.
You can find Databricks in more than 5,000 orgs, including big names like CVS Health, Shell, Starbucks and T-Mobile. It is particularly popular among data scientists for its shared, interactive notebooks (ideal for experimenting) and the freedom to switch back and forth between common languages like Scala, Python, R, Java and SQL as necessary.
Over the past year, Databricks has self-identified more as a “lakehouse” than a typical data warehouse or data lake. By that, they mean that the platform features the kinds of data structures you would expect in a data warehouse along with the low-cost functions and utilities of a typical data lake. This makes it easier for developers of various skill levels to work with the 3 V’s of big data (volume, velocity, and variety) necessary for applications in business intelligence (BI) and machine learning (ML).
But when it comes to working securely with that data for the purposes of software testing and development, those 3 V’s translate into significant challenges.
The close kinship between Databricks and Spark means that Databricks can easily ingest streaming data from many different sources simultaneously and perform transformations in real time. On the flip side of that coin, performing data anonymization at that speed and with data that heterogeneous is extremely resource-intensive.
Databricks lends itself to hundreds of tables, thousands of rows, and petabytes of data. For a schema that complex, legacy masking solutions and in-house anonymization scripts simply won’t work.
Add to this challenge the growing urgency around guaranteeing that your data is sufficiently anonymized to satisfy today’s rapidly shifting legal landscape. Data privacy laws vary greatly from state to state and country to country, and your organization is responsible for complying with the most restrictive guidelines applicable wherever it operates, even as those laws change.
What’s more, all organizations have to deal with the fluidity of access management: people come and go, employees change roles, and contractors gain access to environments temporarily. Meanwhile, data leaks can go undetected for months.
The most responsible course of action is to implement a scalable, fully integrated system that keeps production data completely cut off from your testing environments, while generating synthetic test data that looks and feels just like the original so that your testing is effective and your data’s privacy is guaranteed. To do this, you’ll need a system that sits between your production and test databases. In other words, you’ll need a system that can connect directly to where your data lives in Databricks.
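To make the idea of test data that "looks and feels" like the original a little more concrete, here is a minimal Python sketch of one common building block: deterministic pseudonymization of a single column. It is an illustration only, not Tonic's implementation, and on its own it does not guarantee privacy. The key, the function names, and the sample rows are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be stored outside the test environment.
SECRET_KEY = b"example-key-not-for-production"


def pseudonymize_email(email: str, key: bytes = SECRET_KEY) -> str:
    """Map an email to a stable, realistic-looking stand-in.

    The same input always yields the same output, so values that match
    across tables in production still match in the test copy.
    """
    normalized = email.lower().encode("utf-8")
    digest = hmac.new(key, normalized, hashlib.sha256).hexdigest()
    return f"user_{digest[:12]}@example.com"


def mask_rows(rows, column="email"):
    """Return copies of the rows with the given column pseudonymized."""
    return [{**row, column: pseudonymize_email(row[column])} for row in rows]


# Made-up sample rows standing in for a production table.
production_rows = [
    {"id": 1, "email": "alice@corp.example", "plan": "pro"},
    {"id": 2, "email": "bob@corp.example", "plan": "free"},
    {"id": 3, "email": "Alice@corp.example", "plan": "pro"},
]
test_rows = mask_rows(production_rows)
```

Because the mapping is keyed and deterministic, rows 1 and 3 receive the same stand-in address while the original value never reaches the test copy. A real pipeline would need to treat every sensitive column, and would typically combine masking like this with synthesis.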
Tonic supports the running of Spark jobs via Databricks on AWS. To run Tonic with Databricks on Azure, contact our experts at firstname.lastname@example.org.
As data source providers, Tonic supports Parquet and JSON, and writes the output for both as Parquet. Avro is also supported, as both a data source and an output format.
For more details and updates, refer to the docs.
Databricks supports both MANAGED and EXTERNAL tables. MANAGED tables store all of their data within Databricks, whereas EXTERNAL tables store their data on a separate file system (often S3). Tonic can read from both table types, but it writes output data only to EXTERNAL tables.
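In Spark SQL, the practical difference between the two table types comes down to whether the table is created with a LOCATION clause. The Python sketch below builds both kinds of CREATE TABLE statement to show that difference. It is not part of Tonic, and the helper name, table names, columns, and S3 path are all hypothetical.

```python
from typing import Dict, Optional


def create_table_ddl(table: str, columns: Dict[str, str],
                     location: Optional[str] = None) -> str:
    """Build a Spark SQL CREATE TABLE statement for a Parquet-backed table.

    Without a LOCATION clause the table is managed; with one, the table is
    external and its files live at the given path.
    """
    column_list = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    ddl = f"CREATE TABLE {table} ({column_list}) USING PARQUET"
    if location is not None:
        ddl += f" LOCATION '{location}'"
    return ddl


# Hypothetical schema and storage path, for illustration only.
columns = {"id": "BIGINT", "email": "STRING"}
managed_ddl = create_table_ddl("prod.users", columns)
external_ddl = create_table_ddl(
    "test.users", columns, location="s3://example-bucket/test/users"
)
```

Here `managed_ddl` describes a table whose files are managed for you, while `external_ddl` points at files in a bucket you control, which is the kind of table an output destination would need to be.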
Creating valuable test data means threading the needle between what’s safe and what’s useful. It requires a set of tools that combines the best aspects of data anonymization and synthesis to optimize the balance between these priorities. With Tonic, we’re building a toolbox for the specific goal of generating the most realistic test data possible that fully preserves data privacy. And as all the players in the datasphere adapt and evolve, our tools adapt right along with them.
For Databricks users, here are just a few of our most valuable tools:
Ready to start safely generating realistic test data for Databricks? We’re here to equip you with the tools you need to tackle the challenge. Let’s chat.
Enable your developers, unblock your data scientists, and respect data privacy as a human right.