Meet the company that’s faking the world a better place. What is Tonic.ai? How do we view the challenges of data synthesis? Who are we enabling with our solutions?
Join our co-founder and CTO Andrew Colombi, Head of Product Kasey Alderete, and VP of Marketing Omed Habib for a live session as we introduce the platform that developers, data engineers, and data scientists around the world rely on for safe, useful fake data.
If you’ve been curious to learn more about faking your data with Tonic.ai or to get your questions answered directly by those leading our product’s development, this is the event for you.
Chiara Colombi (00:04):
Hello. Welcome. Thank you for joining us today for an introduction to Tonic.ai, The Fake Data Company. My name is Chiara. I'm the Product Marketing Manager here at Tonic. I'm really excited to introduce you to several of my co-workers from across the company who are going to be headlining today's session.
Chiara Colombi (00:20):
We have Omed Habib, our VP of Marketing, who will be presenting the problem that we're solving for, and the results we're helping teams achieve. Kasey Alderete is our head of product. She's going to detail our solution spotlight, our latest feature releases, and provide a demo of the platform as well. Andrew Colombi is our CTO and he is a co-founder of Tonic. He's here to answer all of your most technical questions from the challenges of subsetting to the guarantees of differential privacy.
Chiara Colombi (00:46):
We clearly have lots to cover, lots of opportunities for questions to come up. Please feel free to ask your questions at any time. You can put them in the chat. You can put them in the Q&A function of Zoom. I will be looking out for them in both places. I will also jump in to ask these questions as they come up. Don't be shy. Without further ado, I will pass the mic over to Omed.
Omed Habib (01:09):
Thank you. Thank you, Chiara. All right, jumping in. Okay. Super excited to chat about Tonic. I am going to go over the problem section. Then I'm going to hand the mic over to Kasey to chat about the product, as well as a product demo. I'll try to keep the slides to a minimum. I know slides can be a little bit boring. I'd also love to keep this conversational; we on the panel will answer your questions as they come in. Feel free to jump in.
Omed Habib (01:36):
Without further ado, let's chat about Tonic. Now, before we go into Tonic or the actual product itself, I do want to set the stage here a little bit. As a software developer, or someone who's involved in the software development process, you probably know the challenge of building software. It all starts from a local sandbox or dev environment, probably your local laptop, some kind of virtualized environment.
Omed Habib (02:00):
The goal is for you to build features, check in the code, and that code advances to different stages in the pipeline before it finally gets to production. The challenge with each of these stages is that you try to mimic your production environment as much as possible. Now, why do you do that? Obviously, it's because you want to make sure that whatever stage of the CI/CD process you're in, whether it's feature development, QA, or security, is as efficient as possible.
Omed Habib (02:28):
In order to be as efficient as possible, you have to mimic and replicate an environment, whether it's infrastructure or data. The challenge here, of course, is that when it comes to the data portion of it, you might have the application infrastructure and the settings and the configuration as an exact match to production. But the behavior of your application in a pre-prod environment is only as good as the test data that you're using.
Omed Habib (02:59):
That's where Tonic steps in. What we do, which is a bit different than how you're probably doing it today, is take data from your production environment and synthesize for you a test dataset that looks, acts, and feels like production, because it's actually made from production. In real time, with that data, we will saturate your pre-prod environments.
Omed Habib (03:28):
Tonic will connect to a production environment and, using a few different configuration settings which Kasey will show you here in a bit, we will then synthesize for you a data set that you can use in your local sandbox, your QA, security, your staging. It doesn't have to be the entire production database. It can be a subset of that in some cases. If you're running petabytes of data in production, you can generate a one-gigabyte file that maintains the structural fidelity and behavior of your production environment with data that is de-identified. No personal information.
Omed Habib (04:02):
But again, it looks, acts, and feels like production. With that, introducing Tonic: like I said before, Tonic is test data that looks, feels, and behaves like production, because it's synthesized from production. Kind of a funny question, but does anybody here prefer to stay away from real sugar and use Splenda instead? I see some of you guys smiling.
Omed Habib (04:30):
If you use Splenda, you'll notice on the Splenda packet it'll say, "Tastes like sugar because it's made from sugar." It's an analogy that I use to describe Tonic. Anyway, zooming out here for a sec: Tonic was founded a few years ago by four brilliant people who came across this problem in their past lives, including Andrew, who's on the call right now, our CEO Ian Coe, Karl Hanson, and Adam Kamor.
Omed Habib (04:58):
They came across this problem in trying to fix bugs in pre-production environments while trying to stay compliant with all sorts of regulations, especially in the financial space. You might be in healthcare. You might be in any industry across the world, and you have the same challenge. You can't use production data in a test pre-production environment. How do you get quality testing done without compromising customer security and privacy?
Omed Habib (05:32):
We have offices in San Francisco, Atlanta, and New York. I think we're probably over 70 employees now, growing pretty quickly, with $45 million total raised. We announced our Series B late last year, a $35 million round led by Insight Partners. We do have some of the largest software deployments in the world, very proud of that. We deal with extremely large scale.
Omed Habib (05:58):
One of our largest customers is eBay. We do specialize in some of the most regulated spaces, specifically finance and healthcare. However, we have customers from across the entire spectrum of verticals. Every company today is a software company. If you have software engineers, whether they're local or remotely distributed with a CI/CD pipeline, they probably have a very strong need for test data.
Omed Habib (06:28):
The higher the quality of the test data, the more productive software developers can be. On eBay, we actually have quite a few case studies. I won't go into too much detail on the case studies. They're all on the website at Tonic.ai. But if we have time later, I might even touch upon eBay here in a bit. Excellent key results. I won't go into too much detail on this slide. Again, all the case studies are on the website.
Omed Habib (06:51):
But I do want to point out a couple of very interesting key results. One is Everlywell. They were able to reach three to five times more releases per day. I mentioned earlier, some companies have petabytes of data. You'll notice that eBay has a total of eight petabytes of prod data. I'm sure that number has actually increased by now. They're able to synthesize a subset of that prod data to a one-gig file that they use in their pre-prod testing environments.
Chiara Colombi (07:19):
Omed a question for you, just before we go on. I think a lot of people might be asking what ... Before Tonic, what have teams been doing historically to solve this problem since it is such a huge problem? Why is that previous solution no longer working?
Omed Habib (07:36):
Yeah. Fantastic question. Let's back up a little bit to our second slide here, because this is probably where I should have talked about this. What are they probably doing today? Andrew, you can probably jump in here as well, because you've seen a lot from the field, as has Kasey. There are a few ways that they solve for it.
Omed Habib (07:52):
The first one is they use a third-party product to generate random, arbitrary data. The problem with that is that in larger environments, at larger companies, with larger datasets, you break the integrity of the data. In other words, your test environment doesn't actually match the behavior of your production environment. When the data is arbitrary, there's a major loss of signal in exchange for noise.
Omed Habib (08:24):
The other option, which I really hope you, the customer, the people listening to this, are not doing, is using production data. I unfortunately have seen more admissions than I wish were a reality, where VPs and CTOs of engineering admit that they're actually using production data in pre-prod environments. They admit it's an anti-pattern.
Omed Habib (08:46):
Some of our customers have since deprecated that practice and are using Tonic now to follow best practice. There are more ways in which you can do this. We have a really good book, coincidentally co-authored by Chiara, called Test Data 101, which I strongly recommend you download. We outline, I think it was, five different ways in which people are creating test data today and why it's a bad idea.
Omed Habib (09:17):
Did I cover all of it, Kasey? Andrew?
Kasey Alderete (09:21):
Yes. I'm going to build on it, too, depending on which angle you're coming from, because there's a couple of different ends of the spectrum that people are approaching Tonic from.
Omed Habib (09:34):
Super curious if anybody in the audience has a method that they're using that we didn't mention, feel free to mention in the chat, and I'd love to talk about it. All right. Let's keep going. I think you guys are all pretty excited to see the product. I don't want to waste too much time on some boring slides here.
Omed Habib (09:50):
All right. Supported databases: we do cover some of the largest, most popular databases that exist today. All the major SQL databases, and, recently introduced last year, our first of many NoSQL databases, MongoDB. Now, that's the source database. But once you bring Tonic into your CI/CD pipelines, then there's also the ability to extend its functionality and integrate it into your automations.
Omed Habib (10:22):
We have customers today that trigger Tonic automatically, very neatly integrated into their pipelines using tools like GitLab. Everlywell, for example, one of our case studies, has automation in place to trigger Tonic during the pipeline to saturate a test database as an artifact that advances through the stages, eventually into production.
Omed Habib (10:49):
As well as Jenkins, for example. And it's not limited to what's on the screen; you'll see why here in a bit, when Kasey covers how we extend Tonic in the product demo.
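As a rough illustration of the kind of pipeline automation being described, here is a minimal sketch of a CI step building a request to kick off a generation run. The endpoint path, payload shape, and header names below are illustrative assumptions, not Tonic's documented API; see docs.tonic.ai for the real integration details.

```python
import json
import urllib.request

# Hypothetical sketch of triggering a data-generation run from a CI job
# (GitLab, Jenkins, ...). Endpoint, payload, and headers are assumptions
# for illustration, NOT Tonic's documented API.
def build_generation_request(base_url: str, workspace_id: str, api_key: str):
    return urllib.request.Request(
        f"{base_url}/api/generate",  # assumed endpoint
        data=json.dumps({"workspace": workspace_id}).encode(),
        headers={"Authorization": f"Apikey {api_key}",
                 "Content-Type": "application/json"},
        method="POST",
    )

req = build_generation_request("https://tonic.example.com", "ws-123", "SECRET")
# A pipeline job would then call urllib.request.urlopen(req) and fail
# the stage if the request raises.
```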
Kasey Alderete (11:08):
I think I get to, yeah, start sharing the product. I'm going to start giving a framework to how we structure the product and our approach to solving this problem. The first one is talking about the usefulness of the data. We use a process we call Mimicking, which is really taking your existing production data, looking at the characteristics and mimicking that in a destination database.
Kasey Alderete (11:33):
We do that with what we call our generators; those are transformations that you apply based on the data type and the statistical properties, looking more deeply at the relationships in your data. The second thing is that we are really shielding your data, and we're providing this separation between your production and staging environments.
Kasey Alderete (11:55):
The way we do that architecturally is that we're integrating at that database layer. We natively integrate with these databases, and are able to sit deeply in the stack so that privacy isn't a separate step you do. It's built right into your workflow. The developers are continuing to use those same environments for developing and testing. They don't have to do anything differently in order to get safe and useful data.
Kasey Alderete (12:21):
The third thing I wanted to highlight is our subsetting capability. Not only does Tonic make data useful by making it accessible and realistic, but a lot of times we've found that in development and testing environments, you need a subset of the data. eBay doesn't need all eight petabytes in order to build a new app. They need a representative subset in order to troubleshoot, do local development, have something that fits on a developer laptop.
Kasey Alderete (12:50):
Our subsetting capability is very powerful, it will identify the relationships across tables. It will traverse those foreign keys to bring back a relevant subset that's referentially intact. That's more useful and productive for you. The next slide, I wanted to just give a taste for the building blocks for privacy that we use. These are the generators that are available on Tonic. This is a fun graphic, but just explaining the features.
Kasey Alderete (13:20):
You can think of it progressively: some of the generators that we have are based on the data type that you have. Some are based on statistical properties that are present within a column, so that we can make sure the same distribution is present in the output as in the source. There are also ones that will let you do consistency.
Kasey Alderete (13:42):
When you want predictable outputs, or you want to link columns across, you can think of them as composable: you're applying the right generator based on that data type, its sensitivity, the privacy level, taking all that into account. Then, moving on from our base generators, I wanted to introduce our smart linking generator. This is our first machine learning-based generator that's available in the product.
Kasey Alderete (14:12):
What it does is it automatically detects and builds a model based on your source data. It trains a neural network to identify all of those relationships, maybe that you don't even know that are present in your data, correlations across columns. After that model has been trained, we use that model to generate entirely new synthetic data. There's a clean break between your source and your output data.
Kasey Alderete (14:41):
This is a proprietary piece of technology. It's available for several different kinds of data. As you can see here, the different generators that we have are listed. It's something we've put a lot of work into, including some of the architectures that you can see on the page here.
Chiara Colombi (14:58):
Quick question for you, Kasey. Someone had asked, "What if I need a generator that you don't currently have for data specific to my database?"
Kasey Alderete (15:05):
Yeah. The good thing is that, with these core building blocks, we have breadth across nearly all data types. Now, you'll find some edge cases specific to databases, and sometimes we can add those. But what's really interesting is that this core set actually meets most of the needs regardless of what industry you're in. It's all a similar type of data.
Kasey Alderete (15:26):
Once we support SQL Server and Postgres, we know how to treat a lot of those data types. We can also write custom generators, if that's something that you need. We can do that really well, because we can make sure they're performant and truly private. That's something we look at with our customers to find any edge cases that are there.
Omed Habib (15:53):
All right. Product demo time?
Kasey Alderete (15:54):
Yes. I can share my screen. Okay. This is Tonic. I'm showing our hosted version today; Tonic is also available on-premises. The unit I want to start with is the workspace. The workspace is the technical implementation of what Omed showed on that first slide. Omed was showing how Tonic will link production all the way back to the lower environments.
Kasey Alderete (16:23):
This is where I was talking about, it depends on where you're coming from as a starting point, how you view the workspace, a workspace is a connection between a source and a destination database. If you are coming from an environment that is more open, where data is flowing freely from production down to staging, you can think of Tonic as that gate, as that separation of concerns.
Kasey Alderete (16:47):
This is where we're protecting that data so that it's not leaking through freely. If you are coming from somewhere that's very locked down and you can't access production data, then you can think of a workspace as that path for giving dev and test teams access to realistic data. The workspace is that connection, whether you see that as a gate or as a new path, that's going to depend on your point of view where you're coming from.
Kasey Alderete (17:13):
Within a workspace, I'm going to dive into the privacy hub. This is within a particular connection. I'm actually showing a Postgres database here. We go like to like; this is from Postgres to Postgres. We are like an ETL tool. We transform the data in flight. We don't store your data. We take it, de-identify it, and write it into the destination database. We take your source schema.
Kasey Alderete (17:43):
Within the privacy hub, we can get an overview of the current state. There's a little bit of a workflow here. There's a lot going on. But there's a little bit of a workflow to de-identifying your data. The first step is to scan for sensitive information. Then you're protecting data with generators. Then when that's all complete, you're generating data. Generating is when we're actually populating that destination database.
Kasey Alderete (18:11):
This is a great place to get started and to visualize how Tonic actually works. Tonic is detecting different fields. You can manually enable or disable the sensitivity flag. Tonic is looking at the column names and looking at data patterns to detect what might be sensitive or at-risk. Anything that hasn't yet been protected is going to show up as an "at-risk field," and Tonic will surface all of those right here in the privacy hub.
Kasey Alderete (18:38):
I can take action right here. You'll see something like the email. I can confirm with the Preview button: yes, in fact, those do look like email addresses. Tonic is suggesting the email generator. I can click right here and apply that right from the privacy hub. Once I've got my connection, the only thing that I've done so far is just click that and start protecting fields.
Kasey Alderete (19:02):
As I apply these generators, the number of at-risk fields is decreasing and the number of protected fields is increasing. You can see the audit trail populating here as I apply those. This is a great place to just get to know that data set. We have customers with really large and complex deployments. This is just a helpful way to get started with where my most at-risk fields may be surfacing across the database.
Kasey Alderete (19:30):
But once you're a little bit more comfortable, you're going to want to operate at scale as you think about large databases. That's where we're going to use the database view. The database view will let you see your whole schema. This is really about power tools, so that you can do things in bulk.
Kasey Alderete (19:49):
Thinking about narrowing that focus ... actually, I'm going to pop back to the privacy hub. You see this number. This is the one we're really focused on. We really want to think about decreasing that number, 40, making sure we have addressed all of our at-risk fields. In the database view, I can start by maybe ruling out particular tables that I know I don't need, and I can just truncate right here, so that we drop the rows and don't write them to the destination database.
Kasey Alderete (20:18):
I could also maybe select some tables, like stores and vendors ... let's see here ... yeah, stores and vendors, and I could select them.
Omed Habib (20:34):
Maybe a dumb question, but you're not going to truncate on the production side; what you're saying is you're going to truncate on the QA side, the destination ...
Kasey Alderete (20:41):
Right. Truncate is just dropping the rows. We'll still copy over the schema; we're just not going to populate new rows in there. I can bulk edit the sensitivity there. I can also apply filters here to figure out what I need to narrow in on, maybe the sensitive and not-yet-protected fields. I can also just type ahead. Something like name: I want to see how I'm protecting name across the database, for example.
Kasey Alderete (21:12):
If I want to apply the name generator, I can see here in the preview what my original data was and a mocked-up sample of what the output will look like. This all looks right: Alexandria becomes Thelma; John becomes Kyle. That will help my application look and feel as it does in production. But one thing I'm noticing is that for Melissa, there are two entries here on the source side, and it's mapping out to two different values on the output side.
Kasey Alderete (21:47):
Now, this might be fine for names. But you can imagine there are times when you need output values that are predictable, that are consistent; maybe there's some implicit relationship that your application's expecting, or you want to replicate the frequency with which a value occurs. You actually need the same inputs to have the same outputs. Tonic has a feature called Consistency.
Kasey Alderete (22:09):
When that's on, I can see that Melissa is always going to map to Rosella. It's still de-identified, but I can start to replicate the same realism, maybe messy data, maybe that person signed up twice. I want to actually replicate that in my test environment. That's a helpful edge case; that's what consistency is for. Any questions here, I guess, before I move on from the database view?
Chiara Colombi (22:37):
Yes. Just about consistency: it works across tables; does it also work across database types, from Postgres to, I don't know, Redshift?
Kasey Alderete (22:48):
Yes. We don't store a mapping anywhere. We're not storing that Melissa is Rosella; that's not something that can be compromised. Instead, we're using a seed. That seed can be used across this database. It can be used across my Postgres and Mongo databases if I want to keep them in sync, so that those email IDs or unique identifiers always map to the same output.
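The seed-based consistency described here can be sketched in a few lines. This is a simplified illustration of the general technique (deterministic hashing with a secret seed), not Tonic's actual implementation; the name list and function below are hypothetical.

```python
import hashlib

# Hypothetical sketch of seed-based consistency (not Tonic's actual
# implementation): the same input plus the same secret seed always
# yields the same fake value, with no stored mapping to compromise.
FAKE_NAMES = ["Rosella", "Thelma", "Kyle", "Leonida", "Marcus"]

def consistent_name(value: str, seed: str) -> str:
    # Hash the seed together with the input; the digest is stable, so
    # "Melissa" lands on the same replacement every time, even when the
    # same seed is shared across Postgres and Mongo workspaces.
    digest = hashlib.sha256(f"{seed}:{value}".encode()).hexdigest()
    return FAKE_NAMES[int(digest, 16) % len(FAKE_NAMES)]

assert consistent_name("Melissa", "s1") == consistent_name("Melissa", "s1")
```

Because the output is a pure function of (seed, value), two separate databases masked with the same seed stay in sync without ever exchanging a lookup table.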
Chiara Colombi (23:11):
That's great. Thank you. We also have a question that has come through that this will probably speak to. I'm not sure if you were going to show the smart linking generator again, but there's a question about it: have we conducted any testing to understand if the smart linking ML generator creates synthetic datasets that preserve the correlations between attributes and targets? They also asked for a whitepaper.
Chiara Colombi (23:30):
I'm going to post a link in the chat for everybody that details a comparison of the original source data to the output generated with smart linking. But maybe somebody on the panel wants to speak to that question as well. Specifically, the person asking is interested in understanding how machine learning metrics such as F1 and RMSE compare between the original dataset and the synthetic dataset.
Andrew Colombi (23:56):
There is a lot of ... Go ahead, Kasey.
Kasey Alderete (23:58):
Well, I was going to say I'm going to show the very basic building blocks of what that's leading up to, and if you want to speak more to the advanced side, Andrew, we can circle back. What I'm going to show is how you can manually replicate what smart linking does automatically using a model. I'm going to show linking between columns. But I wasn't going to speak to any of the higher-level calculations, unless you wanted to, Andrew.
Andrew Colombi (24:26):
Yeah. I'm sorry. I just want to read the question, because there were a lot of acronyms in there as well. The short answer to your question is, yes, we have done specific testing, comparing different machine learning models on synthetic data that we've created using machine learning models. It's machine learning evaluating other machine learning.
Andrew Colombi (24:47):
I believe we even have a blog post about it, but we can ... Yeah, is this the one, Chiara, that you just linked?
Andrew Colombi (24:55):
Yeah. Yeah, exactly. I'm just going to peruse this to make sure it's exactly what I think it is, but I'm pretty confident, yes. In this blog post, I'll just summarize it quickly: we compare training several machine learning models on real data, as well as on our synthetic data, for a common machine learning benchmark dataset. It's the King County Housing dataset. It's out there; people use it in research, et cetera.
Andrew Colombi (25:29):
But you can take a look at that blog post to get more detailed, granular information about what things we tested. Yeah. Happy to answer more questions if you have more.
Kasey Alderete (25:43):
Okay. I'm going to go ahead and dive into the table view. This is a little bit more granular view of the generators. You can see every generator I've applied I'm building a model for how Tonic is going to de-identify the source data. With this live preview, I can actually see the mocked output data right here as I'm working.
Kasey Alderete (26:05):
I spoke about generators earlier in the slides as the building blocks. They are; a generator is a unit of privacy. It's where I take into consideration: what is the use case here, what's the need for how private this data needs to be, and what are the characteristics that I need to preserve or protect? I wanted to show one, the Categorical Generator here, that will actually take the source data and shuffle the values while maintaining the same ratios that are present in the source.
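A categorical generator of the kind described here can be sketched as a frequency-preserving shuffle. This is a minimal illustration of the idea, not Tonic's implementation:

```python
import random
from collections import Counter

# Sketch of a categorical generator (not Tonic's implementation): emit a
# column with exactly the same value frequencies as the source, with the
# positions shuffled so rows lose their original values.
def categorical_shuffle(column, seed=42):
    shuffled = list(column)           # copy so the source is untouched
    random.Random(seed).shuffle(shuffled)
    return shuffled

source = ["gold"] * 6 + ["silver"] * 3 + ["none"]
output = categorical_shuffle(source)
# Distribution preserved: still 6 gold, 3 silver, 1 none.
assert Counter(output) == Counter(source)
```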
Kasey Alderete (26:39):
It's looking at the statistical properties, like I said, within a column. You can also see here within the categorical generator, there's the option to have differential privacy. This is where I'm going to get rid of outliers in my data. You could imagine that being really useful, particularly in something like income, where even if we're shuffling the values, or averaging the values, the mere presence of a really large outlier would reveal something about my information set.
Kasey Alderete (27:08):
If you have mostly $50,000 salaries and you have one $1 million salary, that's going to tell me something about the data that I wouldn't have otherwise known. With differential privacy on, we actually apply mathematical standards to make sure the data can't be reverse engineered, removing those outliers while preserving as much of the fidelity as possible. We do use some more advanced algorithms and mathematical calculations there to make sure that that's true.
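As a toy illustration of the outlier problem (not Tonic's algorithm), one classic approach is to clip extreme values so no single record can dominate, then add calibrated Laplace noise to any released statistic:

```python
import random

# Toy illustration (not Tonic's algorithm): clip extreme salaries so one
# record can't dominate, then add Laplace noise scaled to the clipping
# bound before releasing the mean.
def clipped_noisy_mean(salaries, clip=100_000, epsilon=1.0, seed=0):
    clipped = [min(s, clip) for s in salaries]
    true_mean = sum(clipped) / len(clipped)
    sensitivity = clip / len(clipped)  # max influence of any one record
    rng = random.Random(seed)
    # A Laplace sample is the difference of two exponential samples.
    noise = (rng.expovariate(epsilon / sensitivity)
             - rng.expovariate(epsilon / sensitivity))
    return true_mean + noise

# One $1M outlier among $50k salaries would otherwise leak its presence.
salaries = [50_000] * 99 + [1_000_000]
```

With the clip in place, the $1M record contributes no more to the mean than any other record could, which is what stops it from being reverse engineered out of the output.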
Kasey Alderete (27:41):
The next part is going to get more complex. We've talked really basically about data types, and about the relationships within a column. But now I want to start to look at the relationships across columns at a deeper level. This is a retail data set. It exists in the real world; it has real meaning. I know things: if I have a customer that has a higher annual income, they're typically going to be spending more at my store; maybe largest bill amount is related.
Kasey Alderete (28:13):
I need to maintain that relationship. I can apply the continuous generator to annual income; that's going to maintain the variance within the distribution of incomes. I'm going to do the same thing with largest bill amount and apply the continuous generator. But now I'm going to link largest bill amount to annual income. Now, largest bill amount is a function of annual income. It's looking at the data and understanding what that relationship is.
Kasey Alderete (28:45):
Maybe I also know that store membership is another determinant of how much they spend; a member maybe would spend more than a non-member. I can add a categorical there to the store membership, and I can partition my largest bill amount by store membership card. I'm creating all of these relationships across the data, where one column is a function of another, so that as one changes, it impacts the scenarios I want to replicate in my destination database.
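The column linking being described can be sketched as follows. This is an illustrative assumption of how one column can be generated as a noisy function of another so the relationship survives de-identification; it is not Tonic's actual model, and the 2% spend rate is invented for the example:

```python
import random

# Illustrative sketch (not Tonic's model): generate fresh income values
# within the source range, then derive "largest bill amount" as a noisy
# function of income so the income -> spend relationship is preserved.
def generate_linked(source_incomes, seed=1):
    rng = random.Random(seed)
    lo, hi = min(source_incomes), max(source_incomes)
    new_incomes = sorted(rng.uniform(lo, hi) for _ in source_incomes)
    # Bill is ~2% of income plus a little noise: higher income, higher bill.
    bills = [income * 0.02 + rng.uniform(0, 10) for income in new_incomes]
    return new_incomes, bills

incomes, bills = generate_linked([40_000, 55_000, 90_000, 120_000])
# Synthetic incomes stay in the source range, and bills track incomes.
```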
Kasey Alderete (29:22):
This is where the smart linking comes into play. I can do all of this manually, where I identify these links and these relationships that are present in my data. What smart linking will do is, when I apply it to the columns, it will train a model, a neural network, and then use that derived model to generate fresh data to populate the destination database.
Kasey Alderete (29:50):
We have done a lot of correlograms and different analyses to make sure that the fidelity of that data and that those models are statistically representative. That's where ... Andrew, if you have anything else you wanted to add, this is where that would come into play. But I can apply smart linking generators, just as I've built up some of these other generators.
Kasey Alderete (30:12):
The smart linking, you can think of as having a higher fidelity representation to the source that's going to maintain more of that ...
Chiara Colombi (30:24):
I think another thing worth pointing out is that you just apply smart linking to that continuous and categorical. You don't have to choose a different generator for each and you don't have to specify the linking or the partitioning, it just simplifies the process for the end user while making it more nuanced on the back end.
Kasey Alderete (30:42):
Right. Right. The smart linking is right here. There's a trade-off in terms of time: it takes some time to train the model and to do it that way. It's just something to think about for your use case, depending on how high the fidelity of the data needs to be.
Omed Habib (31:01):
That feature is live, right? It's technically GA?
Kasey Alderete (31:04):
Yep. Just showing that right there. It's at the smart linking.
Omed Habib (31:10):
You can always access our docs. I think you may have already talked about this docs.tonic.ai?
Kasey Alderete (31:16):
Yep. One of the last generators I want to show ... there's a whole host of generators available depending on the data type, different date generators to model time, for example. The last one I was going to show starts to get into semi-structured data. This is helpful for thinking about our support for MongoDB and documents.
Kasey Alderete (31:41):
Even in relational databases, it's pretty common to have JSON and XML blobs, and I just wanted to show how data can really leak through if you're not also masking those. Let's look back over here. This first customer, if I turn off preview, is Alexandria Sanford. When I have the name generators on, or at least the first name generator, Alexandria turns into Leonida.
Kasey Alderete (32:10):
If I scroll over to this JSON blob, so this might be a customer profile, or a form that they filled out online, Alexandria is coming through. I've mapped Alexandria to Leonida, but Alexandria is coming through here. That's where I want to use a JSON mask generator. I can actually traverse using JSON path, traverse that blob, and then apply a generator.
Kasey Alderete (32:36):
Using consistency, I can make sure it's the same name that I've masked it to in the other part of the row. Now you can see that it's changed: Alexandria turns to Leonida in that column, as well as in this column. Now I'm going to move on from the masking portion, thinking about protecting my data, and talk a little bit about Subsetting.
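The JSON masking step can be sketched like this. It is a minimal illustration with a hypothetical helper and a simplified key-path walk rather than full JSONPath support:

```python
import json

# Minimal sketch of masking a field inside a JSON blob stored in a
# relational column (simplified key-path walk, not full JSONPath), so a
# name masked elsewhere in the row stays consistent inside the blob too.
def mask_json_field(blob: str, path: list, replacement: str) -> str:
    doc = json.loads(blob)
    node = doc
    for key in path[:-1]:         # walk down to the parent object
        node = node[key]
    node[path[-1]] = replacement  # overwrite the sensitive leaf value
    return json.dumps(doc)

row_blob = '{"profile": {"first_name": "Alexandria", "tier": "gold"}}'
masked = mask_json_field(row_blob, ["profile", "first_name"], "Leonida")
# The blob now carries the same masked name used in the name column.
```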
Kasey Alderete (33:07):
This is a really powerful part of Tonic. There's a lot of embedded technology going on once we turn Subsetting on. All I have to do is turn Subsetting on and decide what my target is. Am I trying to go for a particular percentage, a smaller data set by 10% or by half, or is it a particular subset, maybe based on a characteristic?
Kasey Alderete (33:34):
Here, for example: marital status equals married. Tonic is going to use all of the logic present in my database, looking at foreign keys. If you don't have foreign keys in your source database, you can use Tonic to identify what those relationships are. Then I'm going to get a preview of how many rows are going to be returned, how much data I'm going to get back when I'm using that WHERE clause.
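The foreign-key-aware subsetting being described can be sketched minimally like this. This is not Tonic's implementation; the tables, the foreign-key list, and the single-level traversal are illustrative assumptions to show why referential integrity matters when you filter.

```python
# Toy schema: customers, plus orders that reference customers.
tables = {
    "customers": [
        {"id": 1, "marital_status": "married"},
        {"id": 2, "marital_status": "single"},
    ],
    "orders": [
        {"id": 10, "customer_id": 1},
        {"id": 11, "customer_id": 2},
    ],
}
# (child table, child column, parent table, parent column)
foreign_keys = [("orders", "customer_id", "customers", "id")]

def subset(target_table, predicate):
    # Start from the rows matching the target condition (the WHERE
    # clause), then keep only child rows whose foreign key points at
    # a kept parent row, so the subset stays referentially intact.
    keep = {t: [] for t in tables}
    keep[target_table] = [r for r in tables[target_table] if predicate(r)]
    for child, child_col, parent, parent_col in foreign_keys:
        parent_keys = {r[parent_col] for r in keep[parent]}
        keep[child] = [r for r in tables[child] if r[child_col] in parent_keys]
    return keep

result = subset("customers", lambda r: r["marital_status"] == "married")
# Only customer 1 survives, along with the orders that reference it.
```

Real subsetting has to handle multi-level foreign-key chains, cycles, and traversal in both directions, which is where the "embedded technology" Kasey mentions comes in.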
Kasey Alderete (34:00):
I can actually dive really deep into the algorithms here, too. I won't do that now. But common use cases are local development; remote work, so that you're not VPNing into a staging or development environment; offshoring; and customer troubleshooting, where maybe you just need that one customer scenario, traced all the way through the database, to replicate and find where that bug is happening.
Kasey Alderete (34:27):
It's all around shift-left testing, and how much we can discover earlier by getting better, more representative datasets to developers and testing teams.
Omed Habib (34:38):
Kasey, this feature is only as good as how well your foreign keys are defined, right?
Kasey Alderete (34:44):
Yeah. It's going to be based on how good your data is. We do find that a lot of our customers discover things about their data they didn't know once they start digging into this. They didn't know that they didn't have foreign keys present in their data. That's why we've got a foreign key tool here, to actually declare what those relationships are within Tonic. But it is an iterative process.
Kasey Alderete (35:10):
The last thing I wanted to highlight before moving on is thinking about how Tonic really exists in the real world. It's a full platform. We exist in customer environments where it's not the only tool, and there are lots of teams using it. I wanted to highlight a few things that help Tonic fit into your world.
Kasey Alderete (35:31):
One is in terms of automation: everything is scriptable, which helps you plug it into the CI/CD pipeline, as Omed talked about. We can run SQL scripts after you run a Tonic job. We can call out to webhooks so that you can link us up to other tools. In terms of collaboration, Tonic's got built-in SSO, role-based sharing of workspaces, and commenting. We're seeing a lot of our larger, more complex deployments of Tonic make use of a bunch of these features, so that Tonic is just a part of their overall workflow.
Kasey Alderete (36:08):
Finally, I wanted to talk a little bit about how Tonic thinks about privacy. You can see the number has gone down; we were up at 40 earlier, but now we're down to 25. We really think about privacy as the intersection of how sensitive your data is and how well you're protecting it. We also want to think about the use case. There are going to be different requirements for how private the data needs to be if you have internal users versus if you're posting something publicly.
Kasey Alderete (36:38):
It also depends on the utility requirements you have, if you do need very high-fidelity data. You're making those tradeoffs as you go through Tonic. One thing we've added for visibility into where you're at, a snapshot of your protection level in Tonic, is the privacy report. Every time I generate data, I've got a job and a job report associated with it. We provide some summary-level statistics so you know that when I generated this data set on this date, this was the privacy level at that time.
Kasey Alderete (37:15):
I can see which fields Tonic thought were at risk and which were protected. These are just the counts, but I can also get a detailed CSV export. This is really helpful when you're communicating what's been done in Tonic to external stakeholders. Think about centralized InfoSec teams who want visibility into what you're doing in this tool: what was protected, and when and how was it protected?
Kasey Alderete (37:40):
I'm going to show a stylized version of that CSV here. This just helps you understand what's going on in Tonic, making it transparent. The privacy report has four sections. We've got the schema that we're reading from your source database. We've got the sensitivity level, which might be what Tonic auto-detected or what you've manually flagged as sensitive. It details the actual protection that was in place at the time of generation.
Kasey Alderete (38:12):
We've also got some more advanced reporting on the level of privacy protection that's in place. This is available for all of the data that Tonic has de-identified. This is just a snapshot, obviously, but we provide all that raw data to you. That's something you can share and communicate to help others understand.
Chiara Colombi (38:33):
A great question just came in on the subject of privacy, about differential privacy and balancing that tradeoff between data utility and data privacy. The question is: is our differential privacy option on the columns just removing outliers, as was shown? Or is the tradeoff parameter, the epsilon, something that can be customized to control how noisy the synthetic data is?
Andrew Colombi (38:55):
I can take that one. It's easier to explain to a general audience how differential privacy works when you couch it in terms of outliers. But it sounds from your question like you know a little bit more about how differential privacy works under the hood. It's not as simple as just removing outliers.
Andrew Colombi (39:15):
It's a whole process of making sure that your algorithm isn't unduly impacted by a single row in the database: that if a given entity were removed from your data set, you can be sure it wouldn't drastically change the output your algorithm creates. So no, it doesn't just remove outliers. It does the real differential privacy work, which is difficult to explain in even 15 minutes, let alone 30 minutes or an hour.
Andrew Colombi (39:50):
To answer your other question, do we allow you to change the epsilon? In short, no, right now we don't; most of our customers haven't really asked for that. In more detail, it is a parameter in our code, so we could easily expose it. It's just not one that we surface in the UI or anything like that.
Andrew Colombi (40:12):
But for what it's worth, the default we provide is an epsilon of one. That's based on our reading of the literature and what seems to be common practice.
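The idea Andrew describes, that no single row can dominate the output, can be illustrated with the textbook Laplace mechanism. This is a toy sketch of standard differential privacy, not Tonic's actual algorithm; the data and the counting query are made up for illustration.

```python
import random

def laplace_noise(scale: float) -> float:
    # The difference of two exponential samples is Laplace-distributed.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(rows, predicate, epsilon: float = 1.0) -> float:
    """Release a count with epsilon-differential privacy."""
    true_count = sum(1 for r in rows if predicate(r))
    # A count has sensitivity 1: adding or removing one person
    # changes the answer by at most 1.
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon)

people = [{"age": a} for a in (23, 35, 47, 61, 29, 52)]
noisy = private_count(people, lambda r: r["age"] > 40, epsilon=1.0)
# The released value is near the true count of 3, but randomized;
# a smaller epsilon adds more noise (more privacy, less utility).
```

This makes the epsilon tradeoff concrete: noise is scaled to sensitivity divided by epsilon, so epsilon = 1 (the default Andrew mentions) fixes how much any one individual's presence can shift the released answer.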
Kasey Alderete (40:23):
I'll add one conceptual nuance to that, as presented here in the privacy report. When we talk about data being protected, that means we've taken some action to obscure the original value. There are different levels of protection that you can apply in Tonic. The two levels we think about, which we're summarizing here, are Anonymized and Masked.
Kasey Alderete (40:53):
When data is masked, it means we've maybe shuffled the values. We've removed the obvious data that was present there. But you can imagine that if I were wearing a mask, you could still probably tell it's me. Maybe I have a mask over my eyes, so you can't tell what color my eyes are, but you can probably still tell that I'm short, or that I've got brown hair, or some other characteristic about me.
Kasey Alderete (41:17):
Masked is a lower level of privacy protection than anonymized. With anonymized, there are different ways you can achieve it. One is by making use of the differential privacy generators. Another is by virtue of being data-free. When we talk about using a first name dictionary, you're telling us that it's a first name, and we go into our dictionary and provide a first name. That has nothing to do with your source data.
Kasey Alderete (41:45):
That is completely anonymous; we can't learn anything about your original data set. The smart linking generators would also result in anonymized data, because we're training a model that we then use to generate data. In each case, we've really broken that link completely. Obviously, there are tradeoffs in terms of utility.
Kasey Alderete (42:07):
There are different use cases that require fully anonymized data, or particular fields that need to be anonymized versus just masked. We see it all as protected. But you'll have to bring your own policies and requirements for how you think about that tradeoff.
Chiara Colombi (42:22):
Thanks for bringing that up, because it's true: there's a lot of variability in how people interpret the words masking versus anonymization versus synthesis. It's important for us to define the way we're thinking about them as well. Thank you, Kasey.
Omed Habib (42:35):
If I can jump in real quick, I just got a question in the private chat. The question, I suppose, is for any one of us: can multiple teams use the same source database with different configs, for example, my data science teams versus the QA teams?
Kasey Alderete (42:50):
Yes. Each of those workspaces is independently managed. We can create different workspaces using the same source data, so we can have our prod replica, and then we can create an environment for staging that's maybe different from what a local developer gets.
Kasey Alderete (43:13):
We can also think about different subsets. Data scientists might need more of the models and higher-fidelity data than the rough approximation a developer environment needs. So each of those workspaces can be managed independently, or we can keep them in sync using our APIs, copying over those configurations.
Omed Habib (43:40):
Thank you for that. A follow-up to that same question: who then usually owns the administration of Tonic?
Kasey Alderete (43:46):
That's a really good question.
Omed Habib (43:47):
... it probably varies by company.
Kasey Alderete (43:49):
Yeah. I think there's a couple of those ...
Omed Habib (43:49):
I mean, I personally have seen.
Kasey Alderete (43:51):
There are maybe three models I can think of. The most common, or I think where we've seen the most success, is where we have a DevOps team or an SRE team that's centrally managing Tonic: hosting Tonic, setting up the users, managing the availability of Tonic, and then publishing Tonic to development teams so they can start making use of it, creating different environments for each new app they're working on or test suite they're running.
Kasey Alderete (44:23):
In that case, you get a lot of different Tonic users. You have maybe three or four people managing Tonic, but you might have 20, 30, 40 developers across the entire organization all using Tonic based on whatever needs they have. Another model is a central team that does all of the Tonic work and all of the de-identification. That's usually when you have a smaller company and a couple of people have enough knowledge of the data and the use cases.
Kasey Alderete (44:49):
It depends on how centralized the knowledge of the data and downstream requirements is, and that's going to affect who you need to give access to Tonic. Okay, I don't know if you want to hop back over to the slides, Omed. We have a couple more if we want to dive into customers.
Omed Habib (45:12):
Yeah, totally. Okay. I just want to end, at least as far as the slides are concerned, and we have some time. Interestingly enough, before I start presenting my screen: we talked about a few case studies, which are all on the website, so you can check them out. But one case study in particular is eBay. They published a blog, I think it was in December, maybe. Their VP of engineering, Senthil, a customer of ours, published a blog doing an internal case study on how Tonic actually fit in.
Omed Habib (45:44):
Let me just share my screen real quick. Matter of fact, I'm just going to show the blog itself. I have it open here on a second tab.
Chiara Colombi (45:54):
I'll post a link in the chat.
Omed Habib (45:57):
Awesome. Thank you very much. Okay, Chiara is going to share the link; check it out when you get a chance. It's a fantastic blog. I'm going to give you a very quick 30-second high level. eBay's challenge: you can probably imagine the sheer size of not just the data volume, but the company itself and the number of software engineers they have; it's in the thousands.
Omed Habib (46:20):
The challenge they had internally was: hey, we have a staging environment that's broken, it's not actually being used, what do we do? The blog goes into how they dissected the problem. One of the options actually was, "Let's not have a staging environment at all. Let's just do a canary test in production. In other words, use production as our testing environment," which is not unheard of.
Omed Habib (46:42):
A lot of companies do this, especially during deployments. Blue-green or canary is actually a very popular deployment methodology today, and consequently, so is testing in production. But the blog outlines why that was not a feasible option for them. One of the primary arguments against it was that they couldn't test against production data, because some of the test suites and integration tests manipulate the data.
Omed Habib (47:10):
Anyway, long story short, they did end up restructuring their staging environment. They were able to re-architect it, make it usable, and bring utility to it. One of the core ingredients in the recipe of success for the staging environment was having a tool like Tonic to provide data that looks, acts, and feels like production, has high utility, but at the same time preserves the privacy of production data. Check that out.
Omed Habib (47:43):
I could keep going with more case studies. But I don't want to bore everybody. Plus, I think we're almost at time. Do we have any more questions on the line?
Chiara Colombi (47:54):
We do have a follow-up to the earlier question around differential privacy. With respect to data utility, is there any reporting, quantitative measures, or other descriptions that indicate how representative the generated data is? In short, beyond the Tonic guarantee, are we able to see any quantifiable evidence for how close we are to the original data?
Andrew Colombi (48:13):
Kasey, you want to do that or you want me to cover it? Oh, no, sorry.
Kasey Alderete (48:18):
I'll start because you're going to have much more detail.
Andrew Colombi (48:20):
Right. Okay. Sure.
Kasey Alderete (48:20):
I'll start by saying we have it in the works. We are looking to add the corollary to the privacy report, which talks about how well your data is protected: actually providing some more utility reporting in the product. What's the data quality like? How near is it to that source data?
Andrew Colombi (48:39):
Yeah. The only thing I would add is that if you look at the blog post that was previously posted, you'll see a preview of things we're looking at. It's a tough question, because there are different things that people care about. There's a lot of generic statistics you can do, like correlograms and regressions, that kind of thing.
Andrew Colombi (49:00):
But in the end, the thing that probably matters to you most is: does it work for the most demanding use case you have? Does it work for the model you're trying to train, or the BI that you want to do? It's hard to answer that question generically. One of the things we've considered is a really one-click way of connecting Tonic to a Jupyter Notebook, so you can very quickly try a hypothesis with the data that Tonic generates.
Andrew Colombi (49:29):
We're not there yet; that's still more distant. We're going to start with simpler things like the correlograms. But yeah, it's a tough question to answer, because what utility means to one person is not the same as to another, and it's difficult to generically capture all utility in a few charts.
Chiara Colombi (49:52):
Great, thank you for that. Another question, around frequency. Omed, you mentioned Everlywell releasing three to five times faster, on a daily basis. How often are Tonic users running generations? How dependent are Tonic users on Tonic data?
Omed Habib (50:13):
Probably every day. We have a customer, whose name I'm not sure I can share publicly, so I'll hold off on that. But they said that if Tonic didn't work, their software development process would come to a halt, which shows you how dependent organizations are on good, high-quality test data.
Omed Habib (50:32):
I would say, at the least, you definitely want to regenerate data anytime there is a schema change, absolutely. Tonic does have a built-in schema change alerting tool, so that if there is a schema change, you can apply the correct generator and then generate data. Everlywell in particular, since you mentioned them, has an automation in place using our API, which is also in the docs at docs.tonic.ai. You can check out the API and build your own custom integrations with it.
Omed Habib (51:04):
GitLab triggers a generation every time a pipeline is triggered. There's a post-commit hook: code gets checked in, GitLab starts a CI/CD pipeline, and the artifact moves forward. Once it hits the QA stage, GitLab hits the API and triggers a data generation. Now your QA database is up to date. Of course, your production environment is always changing.
Omed Habib (51:30):
Every time you trigger a generation, you're getting the latest and greatest. The artifact goes through a QA process, probably even a security process, and then eventually advances into production. That's an example of the other extreme, which is completely real-time: a generation on every single deployment.
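The CI/CD pattern Omed describes, kicking off a generation at the QA stage and waiting for it before tests run, can be sketched like this. The client here is a stub, and the method names and job states are assumptions, not Tonic's documented API (see docs.tonic.ai for the real one); only the start-then-poll shape of the integration is the point.

```python
import itertools
import time

class FakeTonicClient:
    """Stands in for an HTTP client talking to a data-generation API."""

    def __init__(self):
        self._polls = itertools.count()

    def start_generation(self, workspace_id: str) -> str:
        # A real client would POST to the API and get back a job id.
        return f"job-{workspace_id}-1"

    def job_status(self, job_id: str) -> str:
        # Pretend the job finishes after a couple of polls.
        return "Completed" if next(self._polls) >= 2 else "Running"

def refresh_qa_data(client, workspace_id: str, poll_seconds: float = 0.0) -> str:
    # Start a generation, then poll until it reaches a terminal state.
    job_id = client.start_generation(workspace_id)
    while True:
        status = client.job_status(job_id)
        if status in ("Completed", "Failed"):
            return status
        time.sleep(poll_seconds)

status = refresh_qa_data(FakeTonicClient(), "qa-workspace")
# A CI pipeline would gate the test stage on status == "Completed".
```

In a real pipeline the stage would run this script, fail the build on a "Failed" status, and only then let the test suite execute against the freshly generated QA database.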
Chiara Colombi (51:50):
Yep. Yeah, I was thinking about that one quote you mentioned: "If Tonic stops, development stops." Let's see what other questions came through. This next one is actually a philosophical question from me: the approach of connecting directly to a user's database, as opposed to, say, uploading files or something like that. I think the answer there is interesting as well.
Andrew Colombi (52:15):
What do you mean by uploading files, exactly?
Chiara Colombi (52:18):
Well, as opposed to uploading a table. Here's a table of data.
Andrew Colombi (52:25):
You mean uploading arbitrary data, rather than data that's based on their data?
Chiara Colombi (52:27):
Or no, I was thinking a data ... a table ...
Kasey Alderete (52:29):
A single CSV or something?
Kasey Alderete (52:31):
A single flat file.
Andrew Colombi (52:32):
Like replicating ...
Omed Habib (52:33):
Or a SQL dev file.
Andrew Colombi (52:35):
... replicating their environment?
Chiara Colombi (52:38):
What was it that led us to our approach of ...
Andrew Colombi (52:41):
Yeah. Yeah. Yeah. I've got you now. Really, what it came down to is that we wanted to support the developer use case first and foremost. We've since gone beyond that; there are definitely data science teams working with Tonic, sales teams working with Tonic, support teams working with Tonic. But we started with the developer use case. Developers have databases that they need to connect their applications to.
Andrew Colombi (53:01):
That's what we needed; I wanted to make sure it would work for the developer. If you start by building a product that does a really great job with CSVs, there's actually a huge chasm to cross between that product and a product that works with databases. It's not a stepping stone along the way. You need to start with databases if that's your plan; otherwise, you're going to struggle to make the rest of it work. Yeah. Does that answer ...
Kasey Alderete (53:31):
I would add, and it's what I alluded to earlier, that being at the database level means it's something built in. It's not a separate step of, "Oh, I have my data, now I need to go de-identify it before I use it." It's not a process change; it's fundamental to how data is made available internally. It's not reliant on a human to remember, or to follow the right process of where they're supposed to go.
Kasey Alderete (53:54):
We can actually lay down the controls: nobody else can access production, and access to that data is limited. Making it available internally is built in.
Chiara Colombi (54:10):
Yep. We're at the top of the hour. I do have one last question, and it's probably a huge question that could be a full hour in and of itself. But if you could just speak to it in a last snippet: what are the challenges involved if you want to build a similar solution in-house?
Andrew Colombi (54:30):
Oh, boy. I think we have some writing on that on our website that we could probably plug. But I'll just say briefly: I think there are two kinds of engineers in this world. Engineers that haven't done this before, who are like, "Yeah, that's probably pretty easy," and engineers that have tried to do it, who are like, "No, it's impossible. You can't do it."
Omed Habib (54:50):
That's a really good answer.
Andrew Colombi (54:54):
The truth is, if you haven't done it before, it is extremely, extremely, extremely difficult. It's much harder than you think it is. Having a whole company do it is probably the right choice, because it's very hard.
Chiara Colombi (55:08):
Thanks to everyone, all our panelists and speakers, and thanks for all the great questions you had for us today. We welcome them at any time. I did have one last slide; Omed, you can pull it up, but I can also just rattle off how to get in touch with us. Feel free to reach out. You can reach us at email@example.com. You can find us on social at Tonic Fake Data, and on our website, www.tonic.ai.
Chiara Colombi (55:32):
We'd love to hear from you. You can book a demo on our website. You can also sign up for our free trial of Tonic: we've got a sandboxed two-week free trial, you get support along the way, and you can connect to your own data and really see Tonic in action. It's a great way to test out the product. Thanks, everybody.