A long time ago (2013, to be exact), a team from AMPLab at the University of California, Berkeley, set out to change how we process what we call “big data.” Having already created Apache Spark, an open-source unified analytics engine, they founded Databricks, a platform to help manage Spark as well as data lakes. They came together for one simple mission: to build the best computing platform for extracting value from big data.
Over the years, Databricks has become a powerhouse in the big data world alongside Spark. Rather than offering just a data lake or a data warehouse, it introduced something new: the data lakehouse. This centralized platform combines the best parts of data lakes and warehouses, breaking down barriers for data scientists and allowing maximum flexibility.
Yet, managing data at that size can become incredibly difficult, especially when it comes to sensitive data like personally identifiable information (PII). Data scientists, business analysts, and developers need access to realistic data—but opening the doors and freely exposing such a massive data footprint can pose a huge liability for your organization.
There is another way.
Synthesizing realistic data based on production is an incredibly useful way to use the data you have without putting it at risk or compromising data privacy. Let’s see how it can serve your team with this quick look at how to create realistic test data for Databricks.
Challenges of Sourcing Test Data for Databricks
Data Moves Fast… Too Fast
Databricks and Spark are fast. Heck, in 2014, a year after Databricks was founded, Spark even set a world record for sorting 100 TB of data. However, there is such a thing as going almost too fast. You can stream data from various sources and mutate it in real time, but doing so is resource-intensive.
There’s A Lot Of Data To Process And De-Identify
When organizations move from traditional databases to data lakehouses (and similar solutions), it’s typically for one reason: they have a lot of data and need to put it somewhere.
However, developers and data analysts need access to that data to do their jobs. Giving them full access to your data lakehouse is not only inappropriate, it can also put your company’s or your customers’ data at risk. Staying compliant with regulations like HIPAA and GDPR is essential, and it can be incredibly challenging when working with large datasets.
Creating Test Data In-house
While Databricks may be an excellent solution for big data needs, sometimes the amount of data can just be too much. To create realistic test data from your existing Databricks lakehouse, you need to keep up with the speed that Databricks and Spark offer while securely de-identifying your sensitive data to stay within regulations.
To do this in-house, you might begin by working with smaller datasets and masking the information somehow. That said, this approach is time-consuming and resource-heavy. Plus, working with smaller datasets may not always give you the perfect picture of the data you need.
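As a rough illustration of that in-house approach, here is a minimal sketch in plain Python that samples a fraction of rows and masks a couple of columns with a salted hash. The column names, the `sample_and_mask` helper, and the salt are all hypothetical, and a real pipeline would more likely run on Spark than on Python lists.

```python
import hashlib
import random

# Hypothetical PII column names; adjust to your own schema.
PII_COLUMNS = {"name", "email"}


def mask_value(value: str, salt: str) -> str:
    """Replace a value with a short salted SHA-256 digest."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def sample_and_mask(rows, fraction, salt, seed=0):
    """Keep roughly `fraction` of the rows and mask their PII columns."""
    rng = random.Random(seed)
    sampled = [row for row in rows if rng.random() < fraction]
    return [
        {
            col: mask_value(str(val), salt) if col in PII_COLUMNS else val
            for col, val in row.items()
        }
        for row in sampled
    ]


# Made-up example rows standing in for a production table.
rows = [
    {"id": i, "name": f"User {i}", "email": f"user{i}@example.com"}
    for i in range(1000)
]
subset = sample_and_mask(rows, fraction=0.1, salt="example-salt")
```

Even in this toy form, the trade-offs show up: the masked values no longer look like names or emails, and the sample may miss rare cases that only appear in the full dataset.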
How Tonic Solves for De-identifying and Synthesizing Test Data
With Tonic, we’ve created the leading all-in-one platform to generate test data safely that looks, acts, and feels like your production data in Databricks. With Tonic, there’s no such thing as too much data, or data that moves too quickly. Tonic enables you to swiftly (and realistically!) de-identify your production data so your teams can work efficiently while complying with regulations.
The Tonic + Databricks integration enables:
- App development: Realistically de-identify your data on Databricks to equip developers and QA with secure, useful, de-identified data instead of production data.
- Data lakehouse tokenization: Tonic enables long-term data storage by tokenizing your data to eliminate all PII/PHI/sensitive data while preserving the utility of the data for analytics.
- Regulatory compliance: All that tokenized data we just talked about? It’s compliant, so you can safely enable data analytics and off-shore development.
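To make the idea of tokenization that preserves analytic utility more concrete, here is a small sketch of deterministic tokenization using a keyed hash (HMAC). This is not a description of how Tonic implements tokenization; the `tokenize` helper, the key, and the sample orders are assumptions made up for illustration.

```python
import hashlib
import hmac


def tokenize(value: str, key: bytes) -> str:
    """Map a sensitive value to a stable token using HMAC-SHA256."""
    mac = hmac.new(key, value.encode("utf-8"), hashlib.sha256)
    return "tok_" + mac.hexdigest()[:16]


# Assumption: in practice the key would come from a secrets manager.
key = b"example-secret-key"

# Made-up order records containing an email address.
orders = [
    {"customer_email": "ana@example.com", "total": 40},
    {"customer_email": "bo@example.com", "total": 15},
    {"customer_email": "ana@example.com", "total": 25},
]

# Swap the email for a token; the same email always yields the same token.
tokenized = [
    {"customer": tokenize(o["customer_email"], key), "total": o["total"]}
    for o in orders
]

# Analytics still work: total spend per (tokenized) customer.
spend = {}
for row in tokenized:
    spend[row["customer"]] = spend.get(row["customer"], 0) + row["total"]
```

Because the mapping is deterministic, grouping and joining on the token behave the same as on the original value, while the email itself never appears in the output.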
Why does this matter for Databricks users, specifically? First and foremost, we’re proud to be pioneering native support for Databricks as the only data masking and synthesis platform to offer a native streamlined integration. ✨
What’s more, we’ve solved the challenges outlined above of de-identifying massive-scale data that needs to move very fast. Tonic is specifically architected to match the scale of any Databricks DB and to work unbounded on Databricks instances. The platform can move data as fast as your hardware allows; it will never be a bottleneck to how fast your data moves.
Last but not least, Tonic is equipped to handle the complex data situations that result from the way data lakehouses and warehouses process your data. Our library of 50+ data generators equips you with all the tools you need to create the realistic synthetic data you’ve been dreaming of.
Let Tonic Create Realistic Test Data for Databricks
So how does our integration with Databricks work? Tonic works with Databricks implementations on AWS or Azure and supports all versions of Spark except Spark 2.4.2. (We don’t talk about that version of Spark.)
Let’s go over the steps to success:
To start, we need a good working environment for Tonic. Put on some light jazz, light a candle, and set up a cluster. Tonic requires a standard or single-node all-purpose cluster, which means no High Concurrency clusters. You’ll also want to set up some specific environment variables, based on whether you’re using S3, ADLSv2, or another destination.
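For a sense of what such a cluster definition can look like, here is a sketch that builds a single-node cluster spec as a Python dictionary. The field names follow the Databricks Clusters API as we understand it, so check them against the current Databricks documentation before relying on them. The cluster name, runtime version, node type, and especially the environment variable are placeholders; the actual variables Tonic needs are listed in our documentation.

```python
import json


def single_node_cluster_spec(name, spark_version, node_type_id, env_vars):
    """Build a request body for a single-node all-purpose cluster."""
    return {
        "cluster_name": name,
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        # A single-node cluster has a driver and no workers.
        "num_workers": 0,
        "spark_conf": {
            "spark.databricks.cluster.profile": "singleNode",
            "spark.master": "local[*]",
        },
        "custom_tags": {"ResourceClass": "SingleNode"},
        # Environment variables made available on the cluster.
        "spark_env_vars": dict(env_vars),
    }


spec = single_node_cluster_spec(
    name="tonic-test-data",            # placeholder name
    spark_version="13.3.x-scala2.12",  # example runtime version
    node_type_id="i3.xlarge",          # example AWS node type
    # Hypothetical variable name, for illustration only.
    env_vars={"EXAMPLE_DESTINATION_SETTING": "placeholder"},
)
print(json.dumps(spec, indent=2))
```

Keeping the spec in code makes it easy to review and reuse, whichever way you end up creating the cluster.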
Configure Your Workspace Connections
Once your base workspace is set up, you’ll need to make the connection to Databricks. This is pretty simple. You’ll want to set up the Source Server, the Databricks Cluster, and your Destination Server. This is mostly just plugging values into Tonic for the corresponding fields, which you can see in our documentation.
After that, you’re all set to start faking it!