Technical deep dive

The case of two data virtualizations

Author

Andrew Colombi, PhD

January 16, 2024

What is data virtualization? Depending on who you ask, you might get two very different answers. On the one hand, you have Delphix, which has been marketing one vision of data virtualization, and on the other hand, you have everyone else. Yup, it’s Delphix vs The World when it comes to data virtualization.

Normally an incongruity like this wouldn’t bother me, but at Tonic.ai we sometimes encounter Delphix’s “data virtualization” in our conversations with engineering teams looking to improve their test data workflows. I’m writing this blog post as a resource for those that Delphix has, for lack of a better word, brainwashed. People need to understand that not all data virtualizations solve the same problem so they can make an informed buying decision when they compare vendors of developer data solutions.

The definition of data virtualization agreed upon by everyone except Delphix

Data virtualization is a technology that enables data consumers to retrieve and manipulate data from heterogeneous data systems, without knowledge of the underlying systems that store and process the data. For example, suppose your organization stores data in a bunch of different systems: Snowflake contains website analytics, Postgres contains sales information, and S3 contains log data. Your analysts may not be experts in all these technologies, and with data virtualization they don’t need to be! With a data virtualization platform, such as Trino, the analysts only learn Trino’s SQL, and they can query all these sources without understanding where the data lives or how it’s processed.

Graphic representation of a simple data virtualization workflow, showing a SQL query passing through a data virtualization proxy to retrieve data from multiple data sources.
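To make the idea concrete, here is a minimal Python sketch of a virtualization layer. It is a toy, not how Trino or any real platform is implemented: the `VirtualCatalog` class and the `fetch_sales` and `fetch_logs` helpers are hypothetical names invented for illustration, an in-memory sqlite3 database stands in for Postgres, and a CSV string stands in for log files in S3. The point is only that the caller asks for a table by name and never touches the backend directly.

```python
import csv
import io
import sqlite3


class VirtualCatalog:
    """Toy virtualization layer: one query interface over several backends."""

    def __init__(self):
        # table name -> zero-argument callable returning a list of dict rows
        self._sources = {}

    def register(self, table, fetch_rows):
        self._sources[table] = fetch_rows

    def select(self, table, where=lambda row: True):
        # The caller names a table; the catalog decides which backend to hit.
        return [row for row in self._sources[table]() if where(row)]


# Backend 1: a relational database (sqlite3 standing in for Postgres).
sales_db = sqlite3.connect(":memory:")
sales_db.execute("CREATE TABLE sales (customer TEXT, amount REAL)")
sales_db.executemany(
    "INSERT INTO sales VALUES (?, ?)", [("acme", 120.0), ("globex", 75.5)]
)


def fetch_sales():
    cur = sales_db.execute("SELECT customer, amount FROM sales")
    cols = [d[0] for d in cur.description]
    return [dict(zip(cols, r)) for r in cur.fetchall()]


# Backend 2: CSV text (standing in for log files in object storage).
LOG_CSV = "customer,status\nacme,200\nglobex,500\nacme,404\n"


def fetch_logs():
    return list(csv.DictReader(io.StringIO(LOG_CSV)))


catalog = VirtualCatalog()
catalog.register("sales", fetch_sales)
catalog.register("logs", fetch_logs)

# The analyst's view: same call shape, regardless of where the data lives.
acme_sales = catalog.select("sales", lambda r: r["customer"] == "acme")
acme_logs = catalog.select("logs", lambda r: r["customer"] == "acme")
print(acme_sales)
print(acme_logs)
```

A real platform would push filters down to each source and expose full SQL rather than a Python predicate, but the separation is the same: consumers see tables, and the layer hides the systems behind them.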

Don’t take my word for it. Check out these definitions.

Wikipedia's definition: "Data virtualization is an approach to data management that allows an application to retrieve and manipulate data without requiring technical details about the data, such as how it is formatted at source, or where it is physically located, and can provide a single customer view (or single view of any other entity) of the overall data."
Gartner's definition: "Data virtualization technology is based on the execution of distributed data management processing, primarily for queries, against multiple heterogeneous data sources, and federation of query results into virtual views."

Lastly, here’s the definition from another player in the test data space, K2View: "In short, data virtualization is a unified, logical data layer that integrates enterprise data across multiple systems, while allowing for centralized governance and security."

It’s worth pointing out that K2View is a competitor of Delphix. Both of these companies offer “data virtualization,” but they are not solving the same problem or addressing the same needs with their respective solutions. So, although K2View and Delphix compete at a broad level, they do not compete in data virtualization, at least. (Notably, Gartner lists K2View as a data virtualization vendor, but they do not list Delphix.)

Data virtualization is a great technology for analysts. But I don’t really see a broad use for it by engineers developing and testing software. As software developers, it’s usually our job to know exactly where and how the data is represented. Data virtualization would just get in our way.

Delphix’s definition of data virtualization

Compared to the definitions above, Delphix has other ideas. By their definition, data virtualization focuses on creating an efficient clone of a database. According to their site, "Data virtualization provides the ability to securely manage and distribute policy-governed virtual copies of production-quality datasets. No matter the underlying database management system or source database location, data virtualization technology creates block-mapped virtual copies of the database for rapid and controlled distribution all while leaving a minimal storage footprint no matter how many copies are used."

Pay attention to “data virtualization technology creates block-mapped virtual copies.” What Delphix has done here is come up with a new marketing term for what we plain-spoken engineers have called Copy-on-Write for decades.

What does Copy-on-Write mean? It means what it sounds like it means: Copy something when you make a write to it. Let me explain. Imagine a document on a shared cloud drive. Now suppose I want to give you a copy of that document. As long as you and I don’t want to make any changes to the document, the cloud drive can share the same underlying file between us and give us the illusion of having our own independent “copies.”

Graphic representation of copying a document on a shared cloud drive.

However, by sharing the underlying file, we need to be careful when we want to make changes to it, i.e. write to it. If we allow writes to occur on the underlying document, the illusion of having independent copies will be broken. A Copy-on-Write system preserves the independence by duplicating the underlying file just before we write to it, thereby maintaining the independence of the document copies.

Delphix’s data virtualization is quite simply Copy-on-Write by another name.

Graphic representation of a copy-on-write workflow, duplicating a shared document when edits are made to one of the "copies"
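For the code-minded, the document example can be sketched in a few lines of Python. This is a generic illustration of Copy-on-Write over a block map, not Delphix's implementation (which is proprietary); `BlockStore` and `VirtualCopy` are hypothetical names made up for this sketch.

```python
class BlockStore:
    """Shared pool of data blocks, addressed by integer id."""

    def __init__(self):
        self._blocks = {}
        self._next_id = 0

    def put(self, data: bytes) -> int:
        block_id = self._next_id
        self._blocks[block_id] = data
        self._next_id += 1
        return block_id

    def get(self, block_id: int) -> bytes:
        return self._blocks[block_id]

    def __len__(self) -> int:
        return len(self._blocks)


class VirtualCopy:
    """A 'copy' that is only a block map: logical index -> block id."""

    def __init__(self, store: BlockStore, block_map):
        self._store = store
        self._map = list(block_map)

    @classmethod
    def from_chunks(cls, store: BlockStore, chunks):
        return cls(store, [store.put(chunk) for chunk in chunks])

    def clone(self) -> "VirtualCopy":
        # Cloning duplicates the small map, never the data blocks.
        return VirtualCopy(self._store, self._map)

    def read(self) -> bytes:
        return b"".join(self._store.get(b) for b in self._map)

    def write(self, index: int, offset: int, data: bytes) -> None:
        # Copy on write: build a patched copy of the shared block,
        # store it as a new block, and remap only this copy's entry.
        old = self._store.get(self._map[index])
        new = old[:offset] + data + old[offset + len(data):]
        self._map[index] = self._store.put(new)


store = BlockStore()
original = VirtualCopy.from_chunks(store, [b"Hello, ", b"world", b"!"])
mine = original.clone()
yours = original.clone()
blocks_after_cloning = len(store)  # two clones, no new blocks

yours.write(1, 0, b"W")  # you edit your "copy"
blocks_after_write = len(store)  # exactly one block was duplicated

print(mine.read())
print(yours.read())
```

Three "copies" of the document cost three blocks of storage until someone writes, and the write costs one more block rather than a whole new document. A production system would add details this sketch skips, such as reference counting so unshared blocks can be updated in place and freed when no copy points at them.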

Developer data

One thing Delphix does get right is the value of cheap, quickly available, independent test databases for developers. Engineers frequently find themselves wanting a quick way to provision a database filled with the test data they need that won’t get interrupted by other engineers using the same database. For example:

CI/CD: You may want to include a database as part of your CI/CD. Spinning up a database can add significant time to your CI if it’s not quick, and sharing a database can introduce all kinds of nondeterministic behaviors if multiple CI/CD builds occur concurrently. Quick and independent are key!
Ad-hoc testing: Sometimes you need a temporary database filled with test data for ad-hoc testing. Provisioning a database using your cloud provider is slow and expensive. Anyone who has left an RDS instance running accidentally knows how quickly those cloud DB charges add up! (I’ve never made that mistake, of course… No, no—not even last month.)
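Here is a small Python sketch of why "independent" matters for the CI/CD case. It makes some loud assumptions: an in-memory sqlite3 database stands in for whatever database you actually run, and `provision_test_db`, `count_users`, and the `users` seed data are hypothetical names invented for the example. Each simulated build gets its own freshly seeded database, so a destructive test in one build cannot affect the other.

```python
import sqlite3

# Hypothetical seed data; in practice this would be your test dataset.
SEED_SQL = """
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO users (name) VALUES ('alice'), ('bob');
"""


def provision_test_db() -> sqlite3.Connection:
    """Return a fresh, private database seeded with test data."""
    # Every ":memory:" connection is its own separate database.
    conn = sqlite3.connect(":memory:")
    conn.executescript(SEED_SQL)
    return conn


def count_users(conn: sqlite3.Connection) -> int:
    return conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]


# Two CI builds running concurrently, each with its own database.
build_a = provision_test_db()
build_b = provision_test_db()

# Build A runs a destructive test.
build_a.execute("DELETE FROM users")

print(count_users(build_a))
print(count_users(build_b))
```

With a single shared database, build B would have seen an empty table or not depending on timing, which is exactly the nondeterminism described above. The hard part in real life is making `provision_test_db` fast and cheap when the seed is gigabytes rather than two rows.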

Delphix was founded in 2008, and when they developed their proprietary Copy-on-Write technology, it was a relatively new idea for file systems. Back then, cloud computing, such as AWS, was still in its infancy. Now it’s 2024, and a lot has changed. Today everyone is in the cloud, and that means:

Storage is effectively unlimited and cheap. Copy-on-Write saves money on storage, but when storage is cheap that doesn’t matter as much anymore.
Data copies are extremely fast. The likes of AWS and Azure have optimized data transfers.
Compute is effectively unlimited but very expensive. Organizations are fearful of letting developers provision cloud resources because they’re afraid (and rightly so!) their cloud expenses will balloon.

With this change in the technology landscape, the problems that need solving have also changed. What Delphix was looking to solve with its “data virtualization” is not as relevant to today’s workflows. Data and storage are no longer bottlenecks; compute is the bottleneck. And that’s where Tonic.ai comes in.

Tonic Ephemeral

Tonic Ephemeral is neither a data virtualization nor a Copy-on-Write technology. Tonic Ephemeral is a modern approach to solving what matters for developers (fast access to test data in an isolated environment) while keeping budgets in check by minimizing compute costs. We built Ephemeral with today’s cloud environments as the primary target. Ephemeral deploys with Kubernetes, so you can host Ephemeral yourself, or let Tonic do the heavy lifting by using our cloud. Eliminate lengthy database provisioning processes and accelerate your engineers with quick access to high-quality test data. It’s the value you were hoping to find in yesterday’s Copy-on-Write solutions, built and delivered specifically for today’s data management needs.

To learn more, sign up for Tonic Ephemeral.

Andrew Colombi, PhD

Co-Founder & CTO

Andrew is the Co-founder and CTO of Tonic. An early employee at Palantir, he led the team of engineers that launched the company into the commercial sector and later began Palantir’s Foundry product. Today, he is building Tonic’s platform to engineer data that parallels the complex realities of modern data ecosystems and supply development teams with the resource-saving tools they most need.