Please welcome one of Tonic’s most powerful integrations yet… SNOWFLAKE. Integrate realistic data de-identification into your workflows at data warehouse scale with the only data synthesis platform that’s pioneering Snowflake support. Whether your goal is to get better test data for your lower environments or to protect sensitive data for compliance, Tonic has you covered.
On June 8th, at 11:00 AM PT, join Co-founder and Head of Engineering Adam Kamor and Product Marketing Manager Chiara Colombi to get the crash course on data synthesis for Snowflake.
Learn how Tonic + Snowflake:
Chiara Colombi (00:06):
Hi everyone, and thank you for joining us today for our Snowflake launch event, featuring Tonic's native integration with the powerful cloud data platform, Snowflake. My name is Chiara Colombi. I am the product marketing manager here at Tonic.ai, and I'm happy to be joined by our co-founder and head of engineering, Adam Kamor. Today, we're going to cover the whats and whys of our Snowflake integration, as well as demo the product in action. Like I mentioned, we'll also leave plenty of time for questions, so please do feel free to ask questions in the Q&A. We always love hearing what's on your mind. In case you're new to Tonic.ai, by way of a very brief introduction, we are the fake data company. Our focus is on generating highly realistic fake data that is safe to use in software development and testing, in data analytics, and in data science use cases as well, like ML model training.
Chiara Colombi (00:56):
And the way we do that is by generating data based on your existing data. With Tonic, your test data looks, feels, and behaves like production data because it's synthesized from your production data. Our platform deploys on premises, and it connects natively to your production databases to enable you to apply transformations to that data and generate new de-identified data sets that are sized and shaped to your developers' and data analysts' needs. And as you can see, we offer native support for a broad range of databases, from popular relational databases to data warehouses to NoSQL offerings like MongoDB. We also easily integrate into your CI/CD workflows. But today we're going to focus on Snowflake. We're really excited to be launching this integration because of the unique challenge that it represents, and also the opportunities it opens for Snowflake users within the synthetic data space.
Chiara Colombi (01:48):
We're currently the only provider that offers native Snowflake support, and I think this is because at Tonic, we love a challenge. We're pioneering a secure, effective test data solution for data warehouses and the data cloud. But what is this challenge I keep referencing? If you're a Snowflake user, honestly, if you are a data user generally speaking, these details will come as no surprise. The challenge breaks down into three parts. Number one is scale. Snowflake databases are very, very large. Snowflake obviously is built to work at this scale; it's one of their core value props. But what it means is that any solution you build to integrate and work with Snowflake data must be equipped to work at that scale. When you're talking about de-identifying a data cloud, it becomes a very weighty task. It's not easy to accomplish in-house, and it requires specific architecting on the part of vendors to build a successful solution.
Chiara Colombi (02:40):
And number two is complexity. Arguably, all data today is complex, partly because of the size of today's data, but also because of the way data is processed by our data stores. Your data in Snowflake might have situations in which multiple fields are combined within a single column, or maybe the data has been mutated in some way between its collection point and the database. All this complexity requires a solution capable of handling all data types and data scenarios. And the last part of the challenge is sensitivity. If you're using Snowflake to store customer data or proprietary business data, you're going to have sensitive information in there that is almost undoubtedly subject to one of the recent data privacy regulations. And if you're planning to use that data in development or analytics, the fact is you can't use it as-is; you need to make sure that PII is purged from the data before it finds its way into your lower environments.
Chiara Colombi (03:31):
So that combination of scale, complexity, and sensitivity means that securely de-identifying your data in Snowflake while preserving that data's utility is not easy. So if it's so hard, why do it at all? Well, let's look at some use cases. Oh, sorry, jumped ahead. First and most obvious, at least from where we stand, since this use case has been our focus since day one, is software development and testing. A lot of Snowflake users today are building their apps on their data in Snowflake, and that's great. It's an incredibly powerful resource for them. What we're happy to enable Snowflake users to do is to build their apps safely in Snowflake by first de-identifying that data. When done well, your developers and your QA teams all get the data they need, equally as effective as production data, but purged of sensitive information and safe to use from your sandbox to your staging environments.
Chiara Colombi (04:25):
The second and potentially less obvious use case is data warehouse tokenization. This is the act of replacing sensitive data with non-sensitive tokens that offer no exploitable value to bad actors. But tokenized data does still retain value for data analytics, and it enables organizations to store the data long term and not be subject to the shorter timeframes that are imposed by privacy regulations. So really, both of these use cases speak to that broader use case, which is compliance. By de-identifying your data in Snowflake, you are ensuring the compliance of your software development and data teams. You're also enabling offshore development; this is a component of compliance that we see at a lot of our customers. We even have customers running HITRUST environments that are made possible by de-identifying their Snowflake data with Tonic. So how are we doing this?
Chiara Colombi (05:12):
We're gonna jump into Adam's demo in just a minute, but first I'd like to quickly touch on the key capabilities. We've built Tonic to meet the needs of Snowflake users, and you'll see that these capabilities really align with the challenges we outlined earlier. Number one is scale. We've specifically architected Tonic to match the scale of any Snowflake database. This means that Tonic works unbounded on Snowflake; we move data as fast as your hardware allows, so Tonic will never really be a bottleneck to that speed. The second point is complexity. Tonic is equipped to handle the complex data situations that result from how your data is processed, those mutations we talked about earlier, those combined fields within columns. How your database processes your data can affect how you need to de-identify it, and we've built targeted data generators to handle these mutations and, along the way, to maintain consistency in your data and ensure meaningful data in your synthesized output.
Chiara Colombi (06:03):
The last part is sensitivity. Tonic offers features like privacy scanning, privacy reports, and a plethora of generators that are designed for all of your sensitive data types. We give you the tools you need to ensure that your data is protected to meet the most stringent privacy standards and certifications. All this said, don't take my word for it. Let's see it in action. A quick reminder, like I mentioned earlier, we do love your questions. Feel free to drop them in the Q&A function of Zoom or in the chat. I'll look out for them there as well, and I'll pass them to Adam as they come up. With that, I'm very happy to pass the mic over to the man behind the magic, Adam Kamor, our head of engineering.
Adam Kamor (06:42):
Hi, Chiara. Thanks so much. So like Chiara said, my name's Adam. I'm on the engineering team at Tonic, and I'm also one of its co-founders. I'm gonna give you guys a Snowflake demo today, and I'm gonna cover a few of the use cases that Chiara spoke to in her presentation. Let me get my screen shared so we can get started. Excellent. Chiara, are you able to see my screen?
Chiara Colombi (07:07):
Yes I am.
Adam Kamor (07:08):
Okay. All right. So I'll assume everyone else can see it too. I'm gonna go over two use cases today. The first use case is when customers treat Snowflake as an application database, meaning their application itself is pointed at Snowflake as the store of data for the application. Typically the use cases here are for development and testing. So you have a development or testing team that wants to create and test new features for their application, but they need a database that they can point their environment at. Typically with Tonic, you would take your application database and de-identify it, that is to say, create a copy of it that's devoid of sensitive information. And then your test and development teams can use this PII- and sensitive-data-free database in their development and staging environments. That's the first use case.
Adam Kamor (07:56):
The second use case is when customers use Snowflake as a data warehouse. So in this situation, perhaps you have data in disparate data sources all over your organization, but through some ETL process, you're feeding it all into Snowflake. And then your analytics teams and your data science teams are all pointed at Snowflake for doing their jobs and analyzing data. In that situation, we find our customers really enjoy Tonic's tokenization features for tokenizing data in the data warehouse, so that you can still do all of your analysis and all of your analytics, but it'll be on the tokenized data. And as we'll show, you'll actually be able to get the same statistical and analytical results from a tokenized data set as you would from a non-tokenized data set, therefore preserving the privacy of your users and of your data warehouse and ensuring that you're adhering to various regulations.
Adam Kamor (08:47):
All right, so let's get started with the first use case, which is test data de-identification. To get started, the first thing you do in Tonic is create a workspace. A workspace is essentially a pair of a source database and an output database. The source database has the real data in it. The output database is where we're gonna send that de-identified data once it's gone through Tonic. And so in this use case, typically you would then take that de-identified output database, put it into your staging or development environments, and then let your test and development teams have at it. The most successful companies that use Tonic for this use case move all of their engineering teams over to de-identified data generated by Tonic. We hear stories, honestly pretty frequently, of companies that, prior to using Tonic, were, for example, using production data, and then overnight they would switch to using Tonic-generated data. And come morning time, when people got to work, the data is of such high quality that you don't even know you're using de-identified data.
Adam Kamor (09:48):
And that's really the goal with Tonic. We're gonna give you an output database that's devoid of sensitive information, but it's gonna look and feel real. So to create a workspace, you simply go up here and click Create New Workspace. Tonic supports a variety of data sources, but today we're gonna focus on our new Snowflake offering. It's actually pretty simple. You select Snowflake here, and then all you do is provide the connection information to the source and output Snowflake databases. This is the standard information one would provide when connecting to a database; it's the information that goes into your connection string, if you're familiar with that. So I've already done that today, and I've connected to an application database in Snowflake. If I edit my workspace, we can kind of get an idea of what the connection information looks like. Cool.
Adam Kamor (10:34):
All right, great. So let's get started. When you first connect Tonic to a database, we're gonna run our privacy scanner. The privacy scanner is gonna help you identify where your sensitive data is. This is really important for a lot of our customers, because typically our customers come to us with databases with hundreds or thousands of tables and thousands or tens of thousands of columns. Not many people can keep track of where all the sensitive data is. So they typically will run the privacy scanner, so Tonic can help them identify, hey, this is what's sensitive, and then keep track of it. And in addition to finding it and keeping track of it, Tonic will suggest to you ways to deal with that sensitive data, and by deal with, I mean remove the sensitive data and replace it with something that looks realistic but is actually fake. So for this data set, we've identified columns that are clearly pretty sensitive information. We have some first names; let's take a look at it.
Adam Kamor (11:31):
Yep, definitely first names in there. We have last names, gender, email, birthdate, and the list goes on. And in fact, if I click here, I can see additional columns that were marked as sensitive. Everything I do in my demo today is also gonna be tracked via this audit trail. It's gonna tell us who made what changes to what columns and when they made those changes. So as I'm making changes here, you're gonna see it get logged into our audit trail, which sticks around forever. Great. So let's go in and apply some transformations that Tonic is suggesting. We already looked at the name columns, so yeah, let's apply name generators. Cool. That makes sense. So now those columns are protected, and I'll go into detail on what that means in a second. Well, gender, that's interesting. You know, I'm actually not sure if, for the purpose of what I'm doing today, gender is sensitive or not.
Adam Kamor (12:18):
So what I'm gonna do is actually ask someone else in my organization, whose name is Craig: is this really sensitive for the data set we are making? Craig, for example, could be in our security organization, and he kind of owns deciding what's sensitive and what isn't sensitive. So when Craig gets back to me, I can then make a decision as to what I'm gonna do. And when he does respond, I'll get various notifications over email and through the application as well. Let's pretend that he's already responded and said, hey, no, it's not sensitive. Well, okay, that makes my life easy. Let's remove it from our list. Great. All right, let's go a level deeper into the product now. Let's go to the database view. In the database view, I can see basically a 10,000-foot view of my database.
Adam Kamor (13:02):
On the left-hand side, I have the schemas and all of the tables within each schema. And on the right-hand side, I have all of my columns. Okay, that's all pretty standard. Today we're working with 12 or so tables, but like I said, many of our customers have hundreds or thousands of tables in each individual database that they work with. So this view is really meant to let you traverse a database very quickly and identify what's important to look at and what's not. And we have a lot of features built into the product to handle that type of scale. So for example, I know I've already looked at the marketing table, and yeah, okay, on the previous page I applied transformations on first and last name. Let's get a sense of what those look like.
Adam Kamor (13:41):
Like, what's Tonic doing? Okay, I see. So there are names on the left, and it's creating fake names on the right. Oh, that seems simple enough. Let's go back to looking at all of my columns. So I kind of want to start doing some exploration as to what's going on. So I type in "name." Yeah, that gives me my first names, my last names, and then a few other things that aren't actual human names. So let me change that. Let's actually make it "first name." That gives me all of the first name columns in my database. That's great. They're all red because they were all marked as being sensitive. Well, let's deal with them all at once. I'm gonna apply a name generator, a name transformation, to all of those columns, and I'm gonna set it to first names.
Adam Kamor (14:21):
All right, well, that was straightforward. Let's see what that does on the customers table, 'cause there's something I'm gonna point out. Okay. So this looks good, but I wanna point out one thing. In the original data, we have Melissa appearing twice, but in the output, what Tonic's gonna generate when we actually run it, it's not Melissa; Melissa's getting mapped to two different names, to Linda and Rosalind. Well, maybe that's what you want, but oftentimes you actually wanna make sure that a given input value always gets mapped to the same fake output value. That'll be really important later on when we talk about tokenization, but even for application development, let's say you have a natural key across two different tables, and you need to be able to preserve joins that are being run on your application server against the database.
Adam Kamor (15:05):
Well, if you don't have consistency checked, right now those joins will not be preserved, because Melissa is not going to the same thing each time. So if you had Melissa appearing in two different columns that you were joining on, this lack of consistency would break that join, which is not good. So what you can do to fix that, and I'll just do it for one column right now, is you actually just enable consistency. When you enable consistency, Tonic is gonna guarantee that a given input always goes to the same fake output value. So in this case, Melissa's gonna get mapped to Luann, and as long as you apply consistency on your name generators, anywhere that you see Melissa, it's gonna get mapped to Luann. That's gonna be true of columns within a single table, of columns in different tables, and even of different columns in different data sources.
Adam Kamor (15:52):
So if you have some data in Snowflake and you have some other data sitting in Postgres, we can guarantee that Melissa will always get mapped to Luann. We can also configure Tonic not to do a consistent mapping if you don't need that or if you don't believe it's important. Great. Let's move on. Well, I mean, obviously I probably just very quickly wanna do the same thing with last name, so let me just do that really quick, and then we'll move on from there. So again, we're gonna select the name generator, we're gonna set it to last names, and for the demo today, let's make it consistent. Cool. All right. Very good. So I could, oh, go ahead, Chiara.

Chiara Colombi (16:26):
Jumping in with a quick question around consistency: Ryan is asking, is there a way to ensure consistency across workspaces?
Adam Kamor (16:34):
Yes, yes. Consistency works across workspaces, across databases, across tables and columns. And it's actually configurable at all of those levels. So you might want there to be consistency within a given database or a workspace, but you want each database or workspace to only be consistent with itself and not with others. So you can basically fine-tune consistency at every level within the product.
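[The consistency behavior described here can be sketched in a few lines of Python. This is an illustrative sketch only, not Tonic's implementation: the fake-name list and the per-workspace secret key are hypothetical, and a keyed hash stands in for whatever deterministic mapping the product actually uses.]

```python
# Sketch of value-consistent masking: a keyed hash of the real value picks
# the fake value, so the same input always maps to the same output, across
# tables, databases, and even data sources that share the key.
import hashlib
import hmac

FAKE_FIRST_NAMES = ["Linda", "Rosalind", "Luann", "Isaac", "Priya", "Diego"]
SECRET_KEY = b"workspace-secret"  # hypothetical per-workspace key

def consistent_fake_name(real_name: str) -> str:
    """Map a real value to the same fake value every time it appears."""
    digest = hmac.new(SECRET_KEY, real_name.encode(), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(FAKE_FIRST_NAMES)
    return FAKE_FIRST_NAMES[index]

# The same input always yields the same output:
assert consistent_fake_name("Melissa") == consistent_fake_name("Melissa")
```

[Scoping consistency to a workspace, as Adam describes, would then amount to using a different key per workspace.]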
Chiara Colombi (16:56):
Awesome. Thank you.
Adam Kamor (16:57):
Sure. Great. Okay, so let's move on. Let's actually look now at an individual table; let's go to the customers table. Great. So we can see that I've already applied a few transformations. Those were done on the previous pages, but there's other sensitive data in this table. So for example, all the columns that are red are sensitive, or rather, those were flagged as sensitive by our PII scanner. You know, I might argue that, hey, marital status, that's actually sensitive. So let's mark it as sensitive here. That'll tell Tonic to treat it as a sensitive column, as if the scanner itself had identified it as sensitive. So let's go through this table quickly and make sure that we've de-identified everything appropriately, so that we can actually run a data generation. I'm gonna start off with gender. Okay, well, I see these values are typically null, M, and F. Okay. Tonic doesn't have a gender generator, but what it does have is a categorical generator. A categorical generator looks at the values in the column and essentially shuffles the values around but preserves the frequency counts of each value. So for example, if the value M appears 35% of the time in the source column, it's going to appear approximately 35% of the time in the output column as well.
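[The simplest version of the categorical behavior just described, a frequency-preserving shuffle, can be sketched as follows. This is illustrative only; Tonic's actual generator may sample rather than permute.]

```python
# Sketch of a frequency-preserving categorical shuffle: the output column
# contains exactly the same values with the same frequencies, but the values
# are reassigned to rows at random, decoupling them from the source rows.
import random

def categorical_shuffle(column, seed=None):
    """Return a permutation of the column; value frequencies are unchanged."""
    rng = random.Random(seed)
    shuffled = list(column)
    rng.shuffle(shuffled)
    return shuffled

genders = ["M", "M", "F", None, "F", "M"]
out = categorical_shuffle(genders, seed=42)
# Same multiset of values, different row assignment.
assert sorted(out, key=str) == sorted(genders, key=str)
```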
Adam Kamor (18:10):
Additionally, this is one of our generators that supports differential privacy. Think of differential privacy as a mathematical framework that gives you certain guarantees around what's the worst thing that can happen if data gets leaked or an attacker tries to re-identify a data set. So with differential privacy in this case, what Tonic is doing is essentially removing outliers from the column. Let's say there are certain values in the gender column, for example, that appear very, very infrequently. Tonic will not sample them and place them in the output data set. That might be desirable or it might not, but outliers are a typical attack avenue for re-identification. So by enabling differential privacy, we can remove those outliers and get rid of that attack vector.
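[The outlier-suppression idea behind this can be sketched as below. To be clear, this is only the suppression step, not a full differentially private mechanism (which would add calibrated noise to the counts); the threshold parameter is hypothetical.]

```python
# Sketch of rare-value suppression before sampling a categorical column:
# values below a rarity threshold are dropped, then the output is sampled
# from the remaining values in proportion to their counts.
import random
from collections import Counter

def dp_style_categorical(column, min_count=2, seed=None):
    rng = random.Random(seed)
    counts = Counter(column)
    # Suppress rare values: they are the classic re-identification vector.
    kept = {v: c for v, c in counts.items() if c >= min_count}
    values, weights = zip(*kept.items())
    return rng.choices(values, weights=weights, k=len(column))

col = ["M"] * 40 + ["F"] * 35 + ["X"]  # "X" appears once: an outlier
out = dp_style_categorical(col, min_count=2, seed=1)
assert "X" not in out  # the rare value never reaches the output
```

[Raising the threshold buys privacy at the cost of fidelity, which is exactly the control-knob trade-off described next.]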
Adam Kamor (18:57):
Of course, that comes at the cost of having an output data set that isn't as realistic as the source. So you need to think of differential privacy here as a balancing act. On the one hand, you can increase the privacy of your output data set, but in doing so you might lower the data utility; or you can increase the data utility, but you're gonna lower the privacy. So really, at the end of the day, think of it as a control knob. You can turn it one way to get more privacy, you can turn it the other way to get more utility, and it's up to you to determine where that right balance is for your use case, for your threat models, et cetera. Very good. So let's move on. So for an email column, we have a few options.
Adam Kamor (19:37):
One thing I could do is keep it simple. I could apply a character scramble and make it consistent. Cool. That's gonna look at every character in the string and replace it with a random character from the same class, meaning lowercase to lowercase, uppercase to uppercase; whitespace and punctuation are preserved, though. So yeah, these kind of look like email addresses, right? They have something to the left of the @ sign, something to the right, they have a period, and then they have three letters after the period, and that's the same structure as the original data. So that kind of makes sense. We also have email-specific generators; let's apply one of those. An email-specific generator basically takes into account how customers typically use emails in databases and how they wanna test them in their applications.
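[The character scramble applied a moment ago can be sketched like this. Illustrative only; the class rules shown (lowercase, uppercase, digits, everything else passed through) are the behavior described above, not Tonic's source.]

```python
# Sketch of a class-preserving character scramble: lowercase maps to random
# lowercase, uppercase to uppercase, digits to digits; whitespace and
# punctuation pass through, so an email address keeps its shape.
import random
import string

def char_scramble(text: str, seed=None) -> str:
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch.islower():
            out.append(rng.choice(string.ascii_lowercase))
        elif ch.isupper():
            out.append(rng.choice(string.ascii_uppercase))
        elif ch.isdigit():
            out.append(rng.choice(string.digits))
        else:
            out.append(ch)  # '@', '.', spaces, etc. are preserved
    return "".join(out)

scrambled = char_scramble("melissa.jones@example.com", seed=7)
# Structure survives: one '@', two periods, same length.
assert "@" in scrambled and scrambled.count(".") == 2
```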
Adam Kamor (20:19):
So for example, I might wanna specify my own mail domain, and I might do that so that when my application in staging sends emails, I send them to a domain I can control, so that I can actually test for the receipt of those emails, that they were properly being sent in the first place, and that the content of the email is correct. So I can do that now, and now all my fake email addresses end in my mail-capture domain. Additionally, oftentimes customers actually create test accounts in production environments, and they'll use very specific domains for those test accounts. So for example, if I tell Tonic to exclude all test accounts ending in gmail.com, you'll notice that, yeah, okay, there's a lot of fake email addresses here, but all of the ones that end in gmail.com are actually passed through as is. That might be something you need to do.
Adam Kamor (21:10):
All right, let's go to marital status now. Actually, there are two columns I wanna deal with at once; I'm gonna treat occupation as sensitive as well. So marital status, occupation, and gender, those three columns are likely related to each other. And to add a little more complexity, income and bill amount are also likely correlated with each other, but also with these categorical columns, right? Your gender, your marital status, your occupation likely affect your income. I think we can all understand that. So let's build a model of these five columns such that we'll actually get output data that resembles the source data to a very high statistical degree of accuracy. So what I'm gonna do is apply that same categorical generator, and I'm gonna link it back to gender. What that does is, instead of shuffling the gender and marital status columns independently of each other.
Adam Kamor (22:00):
When I tell Tonic to link the columns together, it's saying, hey, Tonic, these columns are actually related to each other; figure out that relationship and preserve it. So instead of shuffling the columns independently of each other, so, you know, M goes here, F goes there, et cetera, I'm instead gonna shuffle pairs of values. I'm gonna take combinations of valid gender and marital status and shuffle the two pools of values around as opposed to each value independently. So it'll preserve any relationship between those categorical variables. Then I decide, actually, I also wanna throw occupation into the mix, so I'll make it categorical and add it into the linkage as well. So now, instead of shuffling pairs of values, it's gonna shuffle triplets of values, which is pretty cool.
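[The linked shuffle just described can be sketched as follows, an illustrative stand-in for the linked categorical generator: instead of shuffling cells, it shuffles whole tuples, so only combinations that exist in the source can appear in the output.]

```python
# Sketch of linked categorical columns: shuffle whole
# (gender, marital_status, occupation) tuples together, so relationships
# between the columns are preserved row by row.
import random

def linked_shuffle(*columns, seed=None):
    rng = random.Random(seed)
    rows = list(zip(*columns))  # keep each row's values together
    rng.shuffle(rows)           # permute the tuples, not the cells
    return [list(c) for c in zip(*rows)]

gender = ["M", "F", "F"]
marital = ["single", "married", "single"]
occupation = ["farmer", "doctor", "teacher"]
g, m, o = linked_shuffle(gender, marital, occupation, seed=3)
# Every output row is some source row's exact triplet.
assert set(zip(g, m, o)) <= set(zip(gender, marital, occupation))
```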
Adam Kamor (22:43):
And, you know, actually, I'm gonna turn off differential privacy for a second. Great. So now I wanna throw income and largest bill amount into the mix. So let's do that. Well, for income, what I want to do is generate a new distribution of income values that resembles the source distribution of income values. By applying a continuous generator, I can achieve that; that's another one of our generators that supports differential privacy. Actually, I'm gonna apply the same continuous generator to largest bill amount, and then I'm gonna link it to annual income. So instead of creating independent distributions of these numerical values, I'm actually going to create a two-dimensional, or bivariate, distribution of values. This will guarantee that any correlations that exist between income and bill amount will be preserved between the source and output data sets. I could go one step further and say, you know what, let's partition this numerical distribution by occupation, for example. So this is gonna generate an income distribution per occupation. This will preserve, for example, the fact that doctors on average make more than farmers, and if that relationship exists in the source data set, it'll now exist in the output data set as well. Great. Chiara, are there any questions?
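[The partitioned, linked numeric idea can be sketched with a simple bootstrap resample. To be clear, Tonic's continuous generator fits a distribution and samples new values; resampling joint pairs per occupation is only an illustrative stand-in that shows why linking and partitioning preserve correlations.]

```python
# Sketch of preserving correlation between two numeric columns: sample joint
# (income, bill) pairs, grouped per occupation, instead of sampling each
# column independently.
import random
from collections import defaultdict

def partitioned_joint_sample(occupations, incomes, bills, seed=None):
    rng = random.Random(seed)
    pools = defaultdict(list)
    for occ, inc, bill in zip(occupations, incomes, bills):
        pools[occ].append((inc, bill))  # joint pairs, grouped by occupation
    drawn = [rng.choice(pools[occ]) for occ in occupations]
    return [p[0] for p in drawn], [p[1] for p in drawn]

occ = ["doctor", "farmer", "doctor", "farmer"]
inc = [250_000, 60_000, 240_000, 55_000]
bill = [900, 220, 870, 200]
new_inc, new_bill = partitioned_joint_sample(occ, inc, bill, seed=5)
# Each output row draws from its own occupation's pool, so doctors still
# out-earn farmers in the synthesized data.
assert min(new_inc[0], new_inc[2]) > max(new_inc[1], new_inc[3])
```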
Chiara Colombi (23:55):
Yes, I do have a question for you around correlations. How do you correlate gender with first name?
Adam Kamor (24:01):
Oh, that's interesting. So let's say, for example, that you want to generate fake or de-identified first names, and you wanna make sure that they kind of match up with the gender. So for example, Claire, that typically would correspond to an F, right? Maybe not always, but in general it will. The way I've set this up right now, it's generating names independently of any gender column. So if you actually want to generate valid first name/gender pairings, what I would suggest you do is apply a categorical generator to first name and then link it into this combination here. That'll only generate first name/gender combinations that exist in your source data set. Now, that is a little less private than what we had before, so there's a concern there, but you are relatively safe here, because for last names, you're still only creating fake last names.
Adam Kamor (24:56):
So it's kind of up to you where your bar is. There are also some other options you have here. We have a generator that's similar to the categorical generator, called the custom categorical generator, that you could likely use here, but I'm not gonna get into that today. Excellent. All right, let's move on. And I'm actually going to put first name back on the name generator like I had before; I just took advantage of our undo and redo buttons, for those that weren't watching. Cool. All right, let's move on on this table very quickly. We have a few date columns here, and there are a few options if you wanna deal with dates. I'm gonna give you one example. The complexity with these columns is that obviously the birth date for a given row needs to occur before the customer-since date, and the customer-since date needs to occur before the last transaction date.
Adam Kamor (25:52):
And not only that; typically you're not gonna see people that are three years old becoming customers. Maybe in this data set the average age of a customer when they join is 28 years old, with a standard deviation of five years, et cetera. You might wanna preserve those distributions here. If you wanted to do that, you have a few options. One option would be to use Tonic's event generator. With an event generator, we're actually gonna preserve the distribution of date differences between columns and generate new, randomized date pairs. That's one option. The other option is to use a timestamp shift generator, and I'll show you what that looks like. With a timestamp shift generator, you can tell Tonic to generate basically a random date. Or rather, take a given date.
Adam Kamor (26:36):
You can have it shifted by a random amount. So I might wanna shift it, you know, a year forward or a year backward in time. I could do the same thing, for example, on this column: shift it a year forward or a year backward in time. The problem, though, is it's gonna shift those two columns independently of each other, and then you might end up with a situation where you actually get a birth date after someone became a customer, which is nonsensical. Maybe your application would crash if it encountered data like that. So in order to guarantee that, for a given row, dates are always shifted by the same amount, we can use consistency, making it consistent on the primary key of the table. This will guarantee that each date gets shifted by the exact same amount for a given row, so that you'll preserve all of those distributional properties that you care about.
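[The key-consistent shift can be sketched as below. This is illustrative: deriving the offset from a hash of the primary key is an assumed mechanism, but it shows why every date in a row moves by the same amount, so orderings like birth date before customer-since date survive.]

```python
# Sketch of a consistent timestamp shift: every date in a row is shifted by
# the same pseudo-random offset, derived from the row's primary key, so
# relative orderings within the row are preserved.
import hashlib
from datetime import date, timedelta

MAX_SHIFT_DAYS = 365  # shift up to a year forward or backward

def shift_for_key(primary_key: str) -> timedelta:
    """Deterministic offset in [-MAX_SHIFT_DAYS, +MAX_SHIFT_DAYS]."""
    digest = hashlib.sha256(primary_key.encode()).digest()
    days = int.from_bytes(digest[:4], "big") % (2 * MAX_SHIFT_DAYS + 1)
    return timedelta(days=days - MAX_SHIFT_DAYS)

row_id = "customer-1042"  # hypothetical primary key
birth = date(1990, 5, 1) + shift_for_key(row_id)
since = date(2018, 3, 9) + shift_for_key(row_id)
assert birth < since  # ordering within the row is preserved
```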
Adam Kamor (27:22):
The last thing I wanna show is what we do with semi-structured data in Snowflake. So right here, we have some JSON data sitting in a column. That's a pretty common thing to see. It could be JSON or XML; sometimes people even embed CSV into a relational column, which is funky, but it happens. So how do we deal with this? 'Cause clearly there's PHI and PII here. All right, so let's apply Tonic's JSON mask generator. All right, well, let's see what this does. With the JSON mask generator, I can actually select individual fields within the JSON and apply transformations to those individual fields. So for example, I definitely wanna deal with the first name, so let's click that. Well, when I click it, it actually generates the JSON path expression for me. That'll give me name.first. I can then supply a generator.
Adam Kamor (28:12):
Let's supply the name generator and let's make it consistent on first. Okay, well, let's see what that does. Okay, well, the first name is now Isaac; previously it was Alexandra. Okay, let's go over here. It's also Isaac over here, and previously, you guessed it, it was Alexandra. That's an example of consistency in action: a given input always goes to the same output, and that'll be true whether it's a value in a column or a value in a JSON blob, whether it's in Snowflake or sitting on S3 or in Postgres, et cetera. Let's do one more thing for JSON.
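A minimal sketch of that consistency behavior, assuming a hash-based mapping into a hypothetical list of fake names (Tonic's actual name generator draws from its own data):

```python
import hashlib
import json

FAKE_NAMES = ["Isaac", "Olivia", "Noah", "Maya", "Liam", "Ava"]  # hypothetical

def consistent_name(original: str) -> str:
    """Map an input name to a fake name; the same input always yields
    the same output, mimicking the consistency option."""
    h = int.from_bytes(hashlib.sha256(original.encode()).digest()[:4], "big")
    return FAKE_NAMES[h % len(FAKE_NAMES)]

# The name.first field targeted by the JSON path expression in the demo.
record = json.loads('{"name": {"first": "Alexandra", "last": "Smith"}}')
record["name"]["first"] = consistent_name(record["name"]["first"])
# "Alexandra" maps to the same fake name wherever it appears, whether in
# a plain column or inside a JSON blob.
assert consistent_name("Alexandra") == record["name"]["first"]
```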
Adam Kamor (28:48):
You know, it might be, and this is really common, that you don't know where all of your sensitive data is in the JSON blob. So in that case, I'm gonna use a JSON path expression that identifies every value within the JSON. It's a recursive expression that kind of goes through the JSON document and finds all the values, and I'm just going to apply a simple character scramble that's actually gonna scramble every value in the JSON. So you can see what it looks like: your keys are still preserved, so the structure's still there, but everything is now scrambled, and it's scrambled consistently. Then you might say, well, hey, you know what? I still wanted to keep that name there. And that's really easy to do: just move it down in the order. And now the name is still de-identified as a name, and everything else gets scrambled. So are there any questions about that, Chiara, or are we good to continue?
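The recursive scramble can be sketched like so. (A toy version; the letters-only character class and the deterministic seeding are assumptions made for the sketch.)

```python
import hashlib
import json
import random
import string

def scramble(value: str) -> str:
    """Character-scramble a string deterministically: the same input
    always scrambles the same way (consistency)."""
    rng = random.Random(hashlib.sha256(value.encode()).digest())
    return "".join(rng.choice(string.ascii_letters) for _ in value)

def scramble_all(node):
    """Walk a JSON document, scrambling every leaf value while keeping
    keys and structure intact, like the recursive path in the demo."""
    if isinstance(node, dict):
        return {k: scramble_all(v) for k, v in node.items()}
    if isinstance(node, list):
        return [scramble_all(v) for v in node]
    return scramble(str(node))

doc = json.loads('{"name": {"first": "Alexandra"}, "ssn": "123-45-6789"}')
out = scramble_all(doc)
# Keys and structure survive; values are replaced and lengths preserved.
assert set(out) == set(doc)
assert out["ssn"] != doc["ssn"] and len(out["ssn"]) == len(doc["ssn"])
```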
Chiara Colombi (29:35):
Nope. No, that's great. Thank you.
Adam Kamor (29:36):
Great. Okay. So we'd wanna continue this process until we've actually gone through the entire data set and de-identified everything. The way that I like to do it typically is to use the advanced search here and filter on columns that have been marked as sensitive but are not yet protected, right? These are the columns that we know contain PHI or sensitive data or PII, et cetera, and we have not yet added transformations to them. So I typically treat that as my to-do list. I get through all of my other columns here, sometimes using the bulk editing features, sometimes not. And then once I'm done, I can actually go generate data. And when I generate, Tonic is gonna take my source database, apply its schema to the output database, and then copy all of the data over. And at the end of the run, you will have an output database that is structurally and schematically identical to your source database.
Adam Kamor (30:23):
Most of the data is in fact unchanged, but all of the sensitive data has been replaced with fake but realistic-looking data generated by Tonic. I've actually already done a database run; I did one earlier today, and I have it here in my jobs history page. You can get a sense: it was a small data set, and it only took a few minutes to run. You can kind of get an idea of what's happening here. There are some logs, and you can download additional logs; you can download your CloudWatch logs if you need to look at those, if you're on AWS. Additionally, you get a synthesis report. I only ran the job on the customers table, so let me show you what it looks like. It actually goes through all of your categorical and continuous columns, and it will give you an idea of how the output data resembles the input data.
Adam Kamor (31:08):
So you can see here, our categorical generators did a pretty decent job of preserving frequency counts among all of these columns. And remember, these are the three columns that I kind of linked together with a categorical generator. Additionally, it'll show you how your numerical distributions hold up against the original data set. You have your real data in purple, your fake data in blue. You can get a sense of it here: there's a high degree of overlap, same for largest bill amounts. Finally, we show you how your numerical columns are actually correlated with each other. So in the real data set there is actually a really strong correlation between your income and your largest bill amount. That strong correlation is preserved in the output as well. So if, for example, you did a run and you didn't link your continuous variables together, you would actually get values close to zero here and here, meaning that Tonic wasn't aware of a relationship,
Adam Kamor (32:02):
didn't try to preserve one, and the two columns would actually be independent of each other in the output. But that's not what happened, 'cause Tonic actually did preserve the relationship. Great. In addition to everything I've shown today, Tonic has a lot of other features that are built with how companies like to work in mind. So for example, we already saw the commenting feature I showed earlier. We also have other features related to sharing and permissions. So for example, I can share this workspace with other people within my organization. I'm not in an org right now, but if I was, you'd get a list of all of your users here. Tonic supports SSO through a variety of different providers, all of the popular ones: Google, Okta, Duo, et cetera. When I share, I can actually assign different roles to people.
Adam Kamor (32:47):
You can have auditors, viewers, owners, and editors, and each role gives you different levels of permission on the workspace. So for example, I might have security folks in my organization in an auditor role so they can kind of approve when a workspace is ready to be generated. I might have a manager as a viewer, so he can kind of, you know, see what I'm doing, et cetera. Tonic also supports schema changes: it'll tell you when your database schema has changed. It can notify you, for example, when a new column's been added, and then you can configure Tonic not to generate data until you've acknowledged the schema change, to help ensure that you don't leak sensitive data into your lower environments. Finally, we have post-job actions. This is a way for you to (a) run scripts on your output database after a job finishes successfully, and (b) get a webhook API notification when a job is completed to trigger other jobs. This is really popular, for example, with customers that are orchestrating Tonic through their CI/CD pipelines or through Airflow or things of that nature.
Chiara Colombi (33:49):
Adam, I have questions for you. Number one, since you're talking about everything else we offer as well: does Tonic have an API? That was one question that came in first. Mm-hmm <affirmative>.
Adam Kamor (33:56):
Yeah, so we have a REST API. Essentially everything I've done in the UI today, you can do via the API as well. You simply go here, click on API documentation, and that'll bring you to the Swagger doc.
Chiara Colombi (34:10):
Awesome. And then a couple more questions that I have for you. If two data columns are linked, and this is in reference to the synthesis report, should we be looking at scattergrams instead of histograms?
Adam Kamor (34:22):
Yeah, the synthesis report is still a work in progress; we're adding new visualizations to it pretty frequently. But if you wanna look at how two synthetic columns are related to each other, you know, pre- and post-Tonic, then yeah, that would probably be appropriate, and we're looking into adding things like that now.
Chiara Colombi (34:45):
Awesome. And then another question for you: is there a way to export/import all the workspace configurations, so they can be added to source control or automated within CI/CD pipelines?
Adam Kamor (34:54):
Yeah, that's a really common feature, or a really common request actually, and yeah, that is something we can do. Everything I've done today can actually be exported from Tonic as a simple JSON file; it can be checked into source control, can be versioned, et cetera. What I think is a really good idea is typically to export these JSON files and check them into your source control alongside your database schemas, so that when your developers are actually making changes to your database, they can be making changes to the Tonic workspace configuration in tandem. That way everything stays in sync.
Adam Kamor (35:30):
Great. Okay. Let's go and do a quick demo of Tonic's tokenization features, which are really popular with our customers on Snowflake when they're using Snowflake for the more traditional use case of a data warehouse. So let's switch to the second workspace. This workspace again is connected to a Snowflake database, but it's a different database than we were using previously. Let's actually skip straight to the database view. The database that we're working with today is called the TPC-H database. It's actually one that ships with Snowflake; when you create a new Snowflake account, you get that sample database when you first log in. This is some of the data that comes in that database. It is a data set of customers and orders and parts, suppliers, and regions, just representing some organization selling something to customers.
Adam Kamor (36:24):
I've actually already gone through and applied some generators here, just for the sake of time, since we've already seen how to apply transformations, and I'm just gonna walk you through some of the transformations that we've selected. So let's go straight to the customer table; this is different than the customer table we dealt with earlier. I'm gonna start by turning the preview off. This is what the data looks like originally: we have a customer key, we have a customer name, an address, a foreign key to another table called the nation table, the customer's phone number, account balance, et cetera. So let's say we wanna tokenize this data set, the goal being to tokenize it but not to really affect or hurt any of the analytical queries that we're running against our data warehouse. So I wanna de-identify or tokenize this database, but I don't wanna affect any of the queries that my analysts or data scientists need to run to kind of, you know, come up with business decisions and insights.
Adam Kamor (37:18):
Okay. So let's go through this. Let's start by tokenizing the customer key, and let me quickly define what tokenization means. For every value in the column, we're gonna replace it with a token. That token represents the original value, and you can actually use the token to retrieve the original value if you need. The token is such that a given unique value will always get a unique token; two different original values will never get the same token, so there are never gonna be collisions. It's a perfectly unique transformation, which is really important, for example, on a primary key like we're dealing with here, right? Primary keys must be unique or they don't really make any sense. When you're dealing with relational databases, tokenization can actually get pretty tricky, and Snowflake's a good example of this.
Adam Kamor (38:09):
Your columns are typically fixed-width, right? Like, I'm storing an eight-byte integer in this column, or I'm storing a seven-character string in this other column. Your tokens need to be able to fit into the width of the column or you can't use them. So all of Tonic's tokenization transformations are built with that in mind. If you have a four-byte integer column, Tonic can generate for you four-byte integer tokens. If you have a seven-character column, like a varchar(7) column, Tonic will generate for you varchar(7) tokens as well, so that you can put them into the database column. Alright, let's move on. So I've already applied some tokenization to the customer key column. Originally this was a column of integers and it remains a column of integers, but each integer now is essentially generated at random, but it's generated in a way that a given input integer value will only ever get the same output value,
Adam Kamor (39:02):
and there won't be collisions. I'm gonna do the same thing on customer name. Let's see what I have so far. So for customer key, you can see that, okay, we go from here to here, but customer name is actually not getting de-identified right now. Let's see here what I did wrong. So ideally what I wanna do here: I have this customer prefix with a hashtag, then I have, you know, a couple zeros, and then I have an identifier, 60,001. That 60,001 matches up to the primary key to the left. So we might actually wanna de-identify this column in the same way that we did the customer key column, right? That might be useful for our analytics. For example, let's say you're doing a join where you take just the last five digits in this column, and then you join it against something else.
Adam Kamor (39:51):
You need to preserve that mapping. So let's do that. Let's go to our regex generator. Okay, I'm gonna provide a regex expression that's going to grab for me the last five characters. So the way I'm gonna do this is... very good. I'm gonna then apply my integer key generator, I'm gonna make it consistent, and I'm gonna save it. Very good. So now the tokenized value here matches up with the tokenized value there, right? So certain SQL queries that you would write, that might assume that there's this relationship between the last five digits on the right and the customer key on the left, those queries can still work. For the address, I've actually decided to null it out today. In the original data set, they were just kind of, you know, fitting in random characters for the address.
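The regex trick Adam just applied, grabbing the trailing digits of the customer name and running them through the same consistent integer mapping as the customer key, can be sketched like this. (The `tok` function is a hypothetical stand-in for Tonic's integer key generator.)

```python
import re

# A toy consistent tokenizer standing in for the integer key generator:
# repeated inputs reuse their token.
table = {}
def tok(n: int) -> int:
    return table.setdefault(n, (n * 48271) % 100000)

def tokenize_suffix(name: str) -> str:
    """Replace the last five digits of a customer name with a token,
    keeping the 'Customer#' prefix and padding intact."""
    return re.sub(r"(\d{5})$", lambda m: f"{tok(int(m.group(1))):05d}", name)

masked = tokenize_suffix("Customer#000060001")
# The suffix token equals the token the same generator produces for the
# customer key 60001, so joins on that derived relationship still work.
assert masked == "Customer#0000" + f"{tok(60001):05d}"
```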
Adam Kamor (40:43):
So assuming there was a real address here, we would likely null it out, assuming we don't need it for any of our analytics; that's the safest thing you can do when you're not using it. For the nation key, we're actually gonna leave it unchanged; knowing the country in which someone lives is typically not so sensitive. Phone number is an interesting one. Let's take a look at what it looks like in the original. So here we have the phone number. What I want to do is preserve the country code and area code, but then tokenize the actual, like, sensitive part of the phone number, the last, in this case, seven digits. So let's accomplish that. Now again, I'm gonna use my regex generator. I'm gonna add a regex, and what I'm gonna do is, let me think of the best way to do this: backslash d two, dash, two... hmm, let me think of a better way to do this.
Adam Kamor (41:39):
Three, dash, four. Very good. Okay. So that gives me a regex of just the last seven digits that I care about. I'm gonna apply another one of our tokenization generators, our ASCII key generator, make it consistent, and save. So now my country code and area code are unchanged. So if I'm doing an analysis of, you know, total sales by country, or looking at what's the most profitable area code in the United States, all of that's gonna be preserved, but the actual sensitive part of the phone number's been tokenized. So you'll be able to use it in analytics, but you're not gonna lose any privacy. Very good. The last two things I do here are: I null out my customer comments, and for the market segment, I tokenize that as well. You can see here, for example, that it goes household, building, building, and in the tokenization, you get one random token here, and then the same random token repeated twice, because building is repeated twice and building's always gonna get the same token. Very good. Chiara, were you gonna say something? I'm sorry.
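Roughly, in a sketch (the hash-based masking here is an assumption standing in for Tonic's ASCII key generator):

```python
import hashlib
import re

def tokenize_phone(phone: str) -> str:
    """Keep the country and area code, tokenize the last seven digits
    (formatted ddd-dddd), roughly the \\d{3}-\\d{4} regex from the demo."""
    def mask(m: re.Match) -> str:
        # Derive seven digits deterministically from the matched suffix,
        # so the same input suffix always gets the same token.
        digest = hashlib.sha256(m.group(0).encode()).digest()
        digits = "".join(str(b % 10) for b in digest[:7])
        return f"{digits[:3]}-{digits[3:]}"
    return re.sub(r"\d{3}-\d{4}$", mask, phone)

original = "25-989-741-2988"  # hypothetical country-area-number format
masked = tokenize_phone(original)
assert masked.startswith("25-989-")            # country + area code intact
assert tokenize_phone(original) == masked      # consistent mapping
```

Aggregations by country or area code still work on the masked column, since only the trailing seven digits change.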
Chiara Colombi (42:42):
Well, actually I do have a question to ask. I was curious about the area codes, but this question is not quite related to that. There's a question around encryption: how do you deal with data that's already been encrypted?
Adam Kamor (42:55):
Oh, so like, if you have encrypted data sitting in a column, that's actually pretty common; customers will encrypt data in the application layer before writing it to the database. Tonic has a notion of what we call in the product pre- and post-value processors. So in that situation, we will write a pre- and post-value processor for the customer that will essentially decrypt the data, allow the customer to apply the necessary generator, and then re-encrypt the data so that it's written back to the database in an encrypted form. And we do that on a per-customer basis, after we kind of talk with them about what their encryption scheme looks like, so that we can support it in the product for them.
Chiara Colombi (43:36):
Awesome. Thank you.
Adam Kamor (43:37):
Yeah, absolutely. Great. All right. So I applied similar transformations on one other table today. I wanna point out one thing before we get into something that's pretty interesting. So I've applied my tokenization on order key. I did a date truncation on the order date, meaning I truncated it down to the month, to preserve privacy a bit but also hopefully preserve your analytics. And then I tokenized order priority. Now, this right here is a foreign key back to the customers table. And if everyone remembers, I applied a tokenization generator on the customer key column in the other table. So Tonic knows that there's a foreign key relationship here. What it does is it actually disables you from modifying this column, and when you go to generate data, Tonic will automatically apply the same tokenization scheme to all of the foreign keys that reference the primary key that I modified.
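The guarantees Adam laid out earlier (each distinct value gets a distinct token, repeats get the same token, and tokens are reversible) are what make this foreign-key handling safe. A lookup-table sketch of those guarantees follows; a production tokenizer would more likely use something like format-preserving encryption, so treat this as illustration only.

```python
import random

class IntTokenizer:
    """Consistent, collision-free integer tokenization: each distinct
    input gets a distinct random token, and repeated inputs reuse it."""
    def __init__(self, bits: int = 32, seed: int = 7):
        self.rng = random.Random(seed)
        self.bits = bits            # token must fit the column width
        self.forward = {}           # original -> token
        self.reverse = {}           # token -> original (de-tokenization)

    def token(self, value: int) -> int:
        if value in self.forward:
            return self.forward[value]
        t = self.rng.getrandbits(self.bits)
        while t in self.reverse:    # guarantee uniqueness (no collisions)
            t = self.rng.getrandbits(self.bits)
        self.forward[value] = t
        self.reverse[t] = value
        return t

tok = IntTokenizer()
a, b = tok.token(60001), tok.token(60002)
assert a != b                  # distinct inputs, distinct tokens
assert tok.token(60001) == a   # same input, same token: applying the same
                               # scheme to a foreign key preserves the join
assert tok.reverse[a] == 60001 # the token can be reversed if needed
```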
Adam Kamor (44:27):
This will guarantee referential integrity in your output database. Okay, let's move on. Oh, and by the way, you know, in Snowflake, you can provide foreign keys when you define your table definitions, but Snowflake doesn't actually enforce them; I think it's for informational purposes only. So oftentimes our customers with Snowflake won't have all of their foreign keys specified in their DDL. In that situation, you can actually tell Tonic where the foreign keys are, and then Tonic will treat them as if they were real foreign keys. You can do that through this UI experience here, and this experience also exists in our application demo that we did earlier; I just didn't show it at the time. All right, very good. So let's say I wanna go generate data and kind of do some analysis on my output data set, to make sure that I'm still, you know, getting all of the useful analytical results out of it.
Adam Kamor (45:18):
Let's do that now. So earlier today I actually ran a job, and I'm gonna go over the results of that job with you now. The first thing I've done is I've gone to the TPC benchmark, which is a set of queries that kind of ship with the data set that we're using today. They're meant to be realistic BI queries that one might run when investigating this data set, and I've taken a few of those queries, and I'm gonna show you how they behave on the real data set and on the tokenized data set. So let's do that now. The first query that we're gonna go over is a distribution of customers with N orders. Think of this as a histogram: it's essentially gonna tell you how many customers made five orders,
Adam Kamor (46:01):
how many customers made six orders, how many customers made seven orders, et cetera. So it gives you a customer count by order number, and the query to generate that is right here. This query is running against the source data set, and over here I have the same query that runs against the output data set. Cool. The query itself is gonna work on the customers table and on the orders table, which are the two tables that we've tokenized thus far. It's gonna do a join on the customer key; that's the column that we tokenized in the customers table and that Tonic then automatically tokenized in the orders table, where that foreign key was. That was the last thing I showed. And then once it does that join, it's gonna do a group by on customer key.
Adam Kamor (46:44):
The group by will be preserved because our tokenization generators guarantee uniqueness. And then finally it's gonna do a count. I ran these queries earlier, and I actually put the results into an Excel sheet. On the right here, you have the order count, then you have the real customer count, and then the fake customer count, meaning against the real source and against the tokenized source. And when you do that, you see that your histograms are identical for the two data sources. So Tonic is preserving your joins, and because of the uniqueness guarantees that we have, it's gonna preserve group bys as well, thereby preserving counts. And you can see that here.
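That effect can be reproduced in miniature: a toy SQLite version of the customers-per-order-count query, with a hand-made token map standing in for Tonic's tokenization.

```python
import sqlite3

def histogram(keys):
    """Count orders per customer key, then count customers per order
    count, a toy version of the TPC-H style query from the demo."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders (custkey INTEGER)")
    con.executemany("INSERT INTO orders VALUES (?)", [(k,) for k in keys])
    return sorted(con.execute(
        "SELECT n, COUNT(*) FROM ("
        "SELECT custkey, COUNT(*) AS n FROM orders GROUP BY custkey"
        ") AS per_customer GROUP BY n").fetchall())

real = [1, 1, 1, 2, 2, 3]            # customer keys on the orders rows
token = {1: 901, 2: 455, 3: 112}     # consistent, collision-free token map
fake = [token[k] for k in real]      # the tokenized foreign-key column
# Uniqueness plus consistency mean group-bys, joins, and counts survive
# tokenization: the two histograms are identical.
assert histogram(real) == histogram(fake)
```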
Adam Kamor (47:22):
Very good. Let's look at a second query. We're gonna do it first on the real data set. Very good. So here, we wanna list the top 10 customers by order size. So I wanna look at my top 10 customers that have ordered the most from me in terms of dollars, right? You can see here, I'm looking at the customer name, I'm doing a sum of their order total price, and I'm doing this by selecting from orders, but then again, I need to join on customer key. And then I'm gonna group, this time, on the customer name. The customer name, if everyone recalls, was the column that we had to apply the regex generator to, where we preserved the customer hashtag, but then we wanted to de-identify, or tokenize rather, everything to the right of the hashtag.
Adam Kamor (48:11):
Okay. So when you do this on the real data source, you get a list of customers and then the total order price. That looks pretty reasonable to me; okay, I believe it. So our largest customer ordered a little over $7 million worth of goods. That's great. When we go to the tokenized data source, the totals are the same, right? There's a customer that again ordered a little over $7 million worth of goods, but the name of this customer is now different; it's been tokenized. So instead of looking at individuals, we're now looking at tokens, but we're getting the same numerical results out. Excellent. Okay. Everything that I showed in the application demo earlier, in terms of, like, you know, role-based access, being able to share, webhooks and notifications, post-job scripts, et cetera, all of that also exists in this demo, but since I've already demonstrated it, I'm not gonna demo it again. And I think with that, we can kind of conclude the demo. So Chiara, I think we can open the floor for questions now, if there are any more.
Chiara Colombi (49:15):
Yes. Yes, we do have questions. Thank you, Adam. That was awesome. I love how we actually jumped into Snowflake at the end to really show it in action. The first question that I have for you is if you could talk somewhat around automation. How are our customers automating these processes, since they're dealing with so much data? What ways are they able to just keep things running smoothly and automated in their CI/CD?
Adam Kamor (49:38):
Sure. So typically our customers automate just by using our REST API to start jobs. I think a healthy minority of our customers that are using our APIs are doing so through tools like Apache Airflow, where they can just, you know, run jobs as needed as part of their DAG. I think that's what I see probably the most often. Maybe second most is where customers are kind of triggering these jobs through CI/CD pipelines; that's more common on the application database side. It's like, you know, you're spinning up your environment, so you need to go spin up a de-identified database as well. On the tokenization front, which is common for data warehouses, things are a little different. Customers typically talk about the big bang event that happens initially.
Adam Kamor (50:24):
Like, you know, day one, you get Tonic, you need to tokenize everything in your data warehouse, but then on day two, you don't wanna re-tokenize everything. Instead, what you want is to tokenize just the data that came in in the past 24 hours, right? And then the next day, you wanna tokenize just the data that came in the previous 24 hours. So it's like, you know, you do a big bang on day one, and then you're doing incremental deltas on subsequent days, just to get the previous day's data. And folks typically orchestrate that through Airflow or similar tools.
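The incremental pattern Adam describes amounts to a watermark filter: process only the rows that arrived since the last run. A minimal sketch (the field names are hypothetical):

```python
from datetime import datetime, timedelta

def rows_to_tokenize(rows, last_run: datetime):
    """Incremental pass: pick up only rows loaded since the last run,
    rather than re-tokenizing the whole warehouse. The day-one 'big
    bang' is just this with a very old watermark."""
    return [r for r in rows if r["loaded_at"] > last_run]

now = datetime(2021, 6, 8, 11, 0)
rows = [
    {"id": 1, "loaded_at": now - timedelta(days=3)},
    {"id": 2, "loaded_at": now - timedelta(hours=5)},
]
# Day one: watermark at the epoch, so everything is processed.
assert len(rows_to_tokenize(rows, datetime(1970, 1, 1))) == 2
# Subsequent days: only the last 24 hours' data.
assert [r["id"] for r in rows_to_tokenize(rows, now - timedelta(days=1))] == [2]
```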
Chiara Colombi (50:57):
Okay. Yeah, I think that was kind of part of my question: explaining that incremental mode that we offer, so that you're not doing it on all your data over and over and over again.
Adam Kamor (51:05):
Yeah, that's right. For warehouses that's really important, especially for a database like Snowflake, where, you know, you wanna run the smallest queries you can, just for the timing and cost issues, and you're typically dealing with very, very large amounts of data.
Chiara Colombi (51:24):
Yeah. Also on that note, how often are we finding Tonic users running generations?
Adam Kamor (51:31):
Oh, it entirely depends. On the application database side, typically how often you're running Tonic is proportional to how often your database is changing, right? And by changing, I could mean the schema is changing, in which case you definitely need to rerun Tonic, or just lots of new, interesting data is frequently coming in, in which case you also may need to run Tonic more frequently. Those are typically the two considerations. On the data warehouse, tokenization side, it's a bit different; it's typically done on a more regimented schedule, like I said earlier: that big bang and then daily, for example. And it just depends on how often data's coming in. I think our customer with, I'd say, the most impressive data warehouse probably has about 50 terabytes a day of incoming data. So they need to, you know, stay on top of that before it balloons in size.
Chiara Colombi (52:25):
Yeah. There's also a follow-up question that I have for you, around dealing with encrypted data. Do users need to share their decryption/encryption keys with Tonic in order to work with that data?
Adam Kamor (52:41):
So typically, on the decryption side, if the data is being encrypted with, like, a production key by the application, you have a few options for decrypting it in Tonic. One would be to share a key, which is, like, less than ideal, obviously. Tonic can also integrate with key management systems, like KMS in AWS, or, oh shoot, there's another one; I think it's called Vault, I'm forgetting the name of it though. And Tonic can integrate with those so that we don't actually have to see the key material oftentimes.
Adam Kamor (53:16):
Okay. And, oh, and then on the encryption side, when we go to re-encrypt afterwards, typically in those situations, you're encrypting the data back with, like, a staging or dev key, as opposed to the prod key, and it's much less sensitive, and that's typically okay.
Chiara Colombi (53:27):
All right. And is all this happening on premises, since Tonic is deployed on-prem?
Adam Kamor (53:32):
Oh, you know, I did not mention that, actually. So there are two ways that our customers run Tonic. One is we can host it for you; we have a SaaS offering. The other option is that you actually deploy Tonic on your own premises, and when it's deployed on your own premises, obviously everything stays on your prem and we never see or touch your data. In fact, we have many customers that run Tonic on completely air-gapped machines. So yeah, thank you, Chiara; I definitely forgot to mention that earlier.
Chiara Colombi (53:57):
Oh yeah, no, no. Just wanna make sure we clarify all those things. Yeah.
Adam Kamor (54:00):
No, it's very important.
Chiara Colombi (54:02):
I think this is the last question that I have, 'cause we are coming up on the hour, but feel free, folks, to drop those last questions in the Q&A: what support do we provide?
Adam Kamor (54:12):
Sure. So we create Slack channels or Teams channels, or, you know, whatever your chat app of choice is, with all of our customers, so that all of our customers have direct access to solutions architects, support engineers, Tonic engineers. I'm constantly in the customer channels, chatting with folks. We have a very healthy and active and responsive support team to, you know, get our customers using Tonic and being happy.
Chiara Colombi (54:39):
Awesome. That's all the questions I currently have, and I think we are coming up on the end of the hour. I'll just go ahead and screen-share one last slide, Adam, if I can grab that screen share.
Adam Kamor (54:49):
Absolutely. Let me stop sharing. Go ahead.
Chiara Colombi (54:52):
Thanks. And this is just how to get in touch, because we always have that question at the end as well. We would love to hear from you. If you'd like a demo for your larger team, we're happy to get that scheduled. You can go to tonic.ai and request a demo, and book something right there on the website. You can also reach out to us by email, firstname.lastname@example.org, or, you know, we're on social at tonic fake data. We'd love to hear from you, love to get more questions. Feel free to, you know, follow up by email, also directly to me; my email is C-H-I-A-R-A at tonic.ai. It's Chiara at tonic.ai. And thanks so much for joining us today. We have more webinars coming up; we're probably gonna be having something booked soon for early July, so look out for that, and I hope to see you at the next one. Thanks so much, Adam. This has been awesome.
Adam Kamor (55:41):
Thank you. Have a nice day, everyone.