November 16, 2021 4:00 PM
Data Warehouse Tokenization: Preserving Privacy and Analytics
Join Tonic Co-founder and Head of Engineering Adam Kamor as he spotlights data warehouses and how data tokenization at scale is enabling teams to protect both data privacy and their capacity for data analytics.
Adam Kamor, PhD
Co-Founder & Head of Engineering
Chiara Colombi (00:00):
All right. What do you think, Adam? I feel like we've got a good turnout.
Adam Kamor (00:03):
Yeah. Let's rock and roll.
Chiara Colombi (00:05):
Awesome. Hi everyone, and thank you for joining us today. Our session today is on data warehouse tokenization. I'm Chiara Colombi. I manage Tonic's online events, and I'm happy to be your host today and to introduce you to our speaker as well. Adam Kamor is a co-founder of Tonic.ai. He is our head of engineering and he has been leading the expansion of our platform into the realm of data warehouses. And for these purposes, we've rolled out a number of new database integrations this year, which he'll talk further about. Today, he's going to explore this particular use case and why de-identification and more specifically, tokenization is such a fundamental approach when it comes to managing and making use of data stored in data warehouses. As always, we'd like this session to be more of a conversation, not so much just a presentation. So to that end, please ask us your questions at any time.
Chiara Colombi (00:52):
Like I mentioned, you can put them in the Q&A. You can also drop them in the chat. I'll look out for them there as well and as they come in, I will ask them of Adam. All right. I think that's all from me for now. Adam, the stage is yours.
Adam Kamor (01:06):
All righty. Thank you very much, Chiara. Hi everyone. My name's Adam. As Chiara said, I'm on the engineering team at Tonic and I also helped start the company. I guess it's been three and a half years now, actually. So let me get started with a quick couple slides, but don't worry, there's only a few and then we'll jump right into the product. Share. Alrighty. Perfect. Excellent. Chiara, are you able to see my screen?
Chiara Colombi (01:32):
Yep.
Adam Kamor (01:32):
Great. Okay, cool. All right. Well, we can skip the intro slide because we've already talked about it. So many of you have probably been to other webinars or you've already heard of Tonic in some way, but let me give you a quick introduction to the company, what we do, how we do it and basically kind of the precursors before we get into data warehouse stuff because to me, that's not Tonic 101, that's Tonic 201. So we're Tonic and we're the fake data company. We take production data and we de-identify it. We give you an output data set that looks and feels just like your real data, but it's devoid of any sensitive information. So practically speaking, you could have a Postgres database. It backs your application. We will take that Postgres database, we will make a copy of it that is identical in every way to the production database. The schema is the same. Most of your columns are the same, meaning the columns that aren't sensitive are unchanged.
Adam Kamor (02:23):
But that handful of columns that actually contains PII or PHI or otherwise sensitive information is going to get transformed by Tonic into something that looks and feels real. Statistically and structurally, it's going to be the same as the source data, but it's entirely fake. So you end up with a new output database, in this case Postgres, that you can then use in any way that you would use the production data, but you're not at risk of misusing or abusing customer information that is sensitive, right? Ultimately with Tonic, the goal is that your output database, you really shouldn't even be able to tell it's a Tonic generated output. You should just really think of it as and treat it as production data. Historically, we've mostly focused on application databases, that is databases that back applications. Think of that as your Postgres, your MySQL, your SQL Server, your Oracle, even your Db2 or your Mongo, right? It's not only SQL databases, we also support NoSQL databases.
Adam Kamor (03:17):
But recently, excuse me, we've also moved in, or really not recently, over a year now we've started moving into data warehouses. So these are typically databases that store very large amounts of data. Sometimes they're used by the application for specific flows, but rarely are they the backing store of your application, as opposed to that application database. These databases are more used for just the storing of data for long term and for analytical use cases and perhaps for machine learning or data science use cases as well. These are actually at the moment, the data warehouses that we support. Redshift, Databricks, Google BigQuery, Amazon EMR, which is Amazon's managed Spark service, Snowflake, and also just generic forms of kind of bring your own Spark cluster. And I would actually be really curious from folks in the audience, and you're welcome to put it into the chat, what data warehouses are you guys using today?
Adam Kamor (04:16):
It's my opinion that we support the really popular, big data warehouses that are out there, the ones that most of our customers are using, but I would be curious to hear if A, there's any data warehouses that you need that we aren't supporting, and B, if we support one of your data warehouses, I'd still like to hear it. It kind of gives me a sense of like, okay, which of these is the most popular for our future customers or maybe some of our current customers that are on the call as well.
Chiara Colombi (04:42):
Looks like a lot of BigQuery coming through in the chat. Yep. Mostly BigQuery. Databricks and Redshift are also mentioned, but BigQuery is the winner.
Adam Kamor (04:50):
Interesting. Okay, good. Well, today's demo is actually going to focus... We had to choose one of these data warehouses to give a presentation with today. I'm going to be doing this off Databricks, but this would apply equally well to Spark on Amazon EMR. And it applies reasonably well to our other data warehouses like Redshift, Snowflake and BigQuery. Excellent. So let's keep going. All right. So first of all, why tokenize or why even deal with the data in your data warehouse? And let's not even assume we're tokenizing for a second, right? You have sensitive data in your warehouse. You need to restrict access to it in some way, right? You want people to be able to do their jobs without having access to super sensitive information that they oftentimes don't need in order to do whatever it is they're doing, right? One approach to de-identifying and protecting your data warehouse is through tokenization.
Adam Kamor (05:45):
Tokenization is especially useful for various types of compliance that are out there. GDPR and pseudonymization is one thing that comes to mind that's been obviously in the news for the past few years, and it's something that we've helped many customers with. So compliance is one obvious reason for protecting data in your warehouse. A second is risk mitigation, right? Maybe there isn't a specific compliance or regulation that applies directly to you, even though GDPR and CCPA cover a large portion of the globe, but you still want to mitigate risk, right? The users of this data, your employees oftentimes don't need to know someone's social security number or their home address or even their name, right? And finally, there's testing flows. I said this at the beginning, Tonic started off as a company that generated de-identified application databases for testing and development purposes, right? Data warehouses are oftentimes actually used in production workflows related back to the application.
Adam Kamor (06:44):
They're not just used for the data science or the analytics use cases, right? Not everyone is just hooking Tableau up to their data warehouse and then moving on with their lives. They're doing other things with it that affect user experience. And I want to call out something specific with testing flows. I recognize a lot of names in the call. I think a lot of people in this call are already familiar with Tonic, but for those that aren't, Tonic has a notion of consistency. The same input value should always get de-identified to the same fake output value, and consistency in Tonic works across databases. So if you have data in Postgres as your application database, you also have data in BigQuery as your data warehouse, you want to make sure that both data sets get consistently de-identified. So if Adam Kamor appears in both databases, you want him mapped to the same fake token or the same fake name in both situations so that the data still links up. And that's especially important when you're doing test flows. Excellent. Oh yes. Go ahead.
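Editor's note: the consistency idea described above can be illustrated with a minimal sketch. It assumes a keyed hash (HMAC) picks the replacement from a small pool of fake names, so that two independent jobs sharing one secret agree on the mapping. The name pool, the secret, and the function name are all hypothetical, and this is not necessarily how Tonic implements consistency.

```python
import hashlib
import hmac

# Hypothetical pool of replacement names; a real tool would use a far larger list.
FAKE_FIRST_NAMES = ["Rosella", "Dmitri", "Anaya", "Tobias", "Mirela", "Kwame"]

# Assumption: every de-identification job is configured with the same secret.
SECRET = b"demo-secret"


def consistent_fake_name(real_name: str, secret: bytes) -> str:
    """Map a real name to a fake one; the same (name, secret) pair always gives the same result."""
    digest = hmac.new(secret, real_name.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(FAKE_FIRST_NAMES)
    return FAKE_FIRST_NAMES[index]


# Two separate "databases" that both mention the same people, in a different order.
postgres_rows = [{"first_name": "Adam"}, {"first_name": "Melissa"}]
bigquery_rows = [{"first_name": "Melissa"}, {"first_name": "Adam"}]

postgres_out = [consistent_fake_name(r["first_name"], SECRET) for r in postgres_rows]
bigquery_out = [consistent_fake_name(r["first_name"], SECRET) for r in bigquery_rows]

# "Adam" receives the same replacement in both outputs, so joins across them still line up.
print(postgres_out[0] == bigquery_out[1])
```

Because the pool is small, two different real names can land on the same fake name; the sketch only guarantees that one input always maps to one output.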
Chiara Colombi (07:45):
We have a question already and I think it goes back to when we're talking about what data warehouses we support, the question is do we support file systems or raw files?
Adam Kamor (07:54):
So we do support file systems and raw files. We do that though via Spark. So, like, a really common use case for us is customers have data sitting in S3, and then we can use Databricks, EMR or bring your own Spark to process that data sitting in S3. Hopefully that answers the question. If there's any follow ups there, feel free to put them in the chat. Great. Excellent. All right. So let's go into Tonic and get going. Okay. So I've connected today to a Databricks workspace. You can see here I'm in Databricks. This is the name of the database in Databricks. I've connected to my Databricks cluster, here's the connection information, and I've specified where I want the output data to go. So once I've transformed my data, I'm going to dump it in S3.
Adam Kamor (08:45):
Additionally, when I'm connecting to a Databricks cluster, for those of you that are already using Databricks, you can also specify additional information such as if you want to use the Databricks jobs cluster instead of your actual underlying cluster for the running of jobs. For those that don't know, these are ephemeral clusters that'll be stood up just for the purpose of running a job and then will be automatically deprovisioned by Databricks. Excellent. But in addition to Databricks, like I said, we support a variety of other databases. Here's EMR Spark, here's BigQuery, here's Redshift, here's Snowflake, and of course, here are some of our application databases as well. Excellent. Let's close that. So I've already run my privacy scan. When you first connect your database to Tonic, whatever the database is, Tonic is going to run a privacy scan. It's going to identify the columns in your database that are sensitive. Think of this as a way to bootstrap yourself, right? It kind of shows you where you should first start paying attention.
Adam Kamor (09:39):
The database we're working with today is very simple. It's just two tables: a customers table and then a retail sales table. The customers table has one row per customer and the retail sales table is one row per transaction where each transaction is associated to a customer. So we didn't find a lot of sensitive information, but that's just because there isn't a lot in this relatively small schema that we're working with. Of course, most of our customers come to us with hundreds or thousands of tables in a given database. We're only using two tables today because this is meant to be a short lightweight demo. So I could start off by applying the transformations to these columns that Tonic is suggesting. So for example, here's a column of first names in the customers table. Tonic recognizes it as a first name and suggests using the name generator. So let's do that. In fact, I'm going to do that for most of these. I'll do it for email as well. And for the birth date, let's do something different.
Adam Kamor (10:31):
I'm actually going to say, you know what? In this use case that we're working with today, birth dates aren't actually sensitive, so let's tell Tonic not to treat that column as sensitive. We can go a level deeper into the database view. So here we are in the customers table. I've applied some transformations, but what I also want to do now is kind of look at other columns and see if we've missed anything. So in my mind, the marital status is actually kind of sensitive. I'm going to apply a categorical generator and I'm going to do the same to occupation. And then I'm going to briefly note that occupation, marital status and gender are likely related to each other, right? And I'll show you what I mean by that, jumping into the table view for a second where we can get a quick preview of what the data looks like. Right. So here's your gender, here's your marital status, here's your occupation.
Adam Kamor (11:22):
These three categorical values are typically related to each other, right? For example, in this data set, we can pretend if you're engaged, you're more likely to be an actor than if you're not engaged, as an example. So I'm going to tell Tonic to actually link these columns together. I'll first link gender and marital status. I'll then go over here and I'll link occupation as well. So now all three columns are going to be generated at once by Tonic when we're generating the output and we'll preserve the underlying original relationship. For our first name and last name, I've actually just applied the appropriate generators so we're going to be generating fake names in this data set. Okay. Let's go back to the database view. There's also this retail sales table, right? So for example, there's things like... Oh, you know what? I didn't mean to click that. There's things like this POS transaction number, there's a customer key, a sales quantity.
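Editor's note: to make the effect of linking columns concrete, here is a small sketch that contrasts sampling whole (gender, marital status, occupation) combinations with sampling each column on its own. The rows are made up for illustration, and resampling observed tuples is only one simple way to preserve relationships; it is an assumption, not a description of Tonic's categorical generator.

```python
import random

# Made-up source rows: (gender, marital_status, occupation).
rows = [
    ("F", "Engaged", "Actor"),
    ("F", "Engaged", "Actor"),
    ("M", "Married", "Engineer"),
    ("M", "Single", "Teacher"),
    ("F", "Married", "Engineer"),
]


def sample_linked(source, n, rng):
    """Draw whole tuples, so every generated combination also occurs in the source."""
    return [rng.choice(source) for _ in range(n)]


def sample_independent(source, n, rng):
    """Draw each column separately; per-column frequencies survive, relationships do not."""
    columns = list(zip(*source))
    return [tuple(rng.choice(col) for col in columns) for _ in range(n)]


rng = random.Random(0)
linked = sample_linked(rows, 1000, rng)
independent = sample_independent(rows, 1000, rng)

observed = set(rows)
linked_unseen = sum(1 for r in linked if r not in observed)
independent_unseen = sum(1 for r in independent if r not in observed)

# Linked sampling never invents a combination; independent sampling does.
print(linked_unseen, independent_unseen > 0)
```

With independent sampling, combinations such as a single actor appear even though the source never contained one, which is the kind of distortion that linking is meant to avoid.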
Adam Kamor (12:12):
This information, maybe it's sensitive, but my goal today with everyone here is not to really go through all of the different types of transformations that we have in the product. In fact, you can go look at our past webinars if you're interested in that, or just reach out to Tonic and we can give you a demo. I'm more trying to show the things that we have that are specific to data warehouses. So I'm actually not going to apply a lot of transformations and I'm going to kind of gloss over most of the retail sales table and just pretend like it's either not sensitive or we are transforming it. And instead I'll point out a few other features. So for example, it's very common with our data warehouse customers that their data warehouse has a lot of data in it when they start working with Tonic, but every day, more data is added, right? If you're using a Spark style database, maybe you'd expect to see partitioning where within each table, there are partitioned folders by date.
Adam Kamor (13:00):
So every folder is like a day, like 2021/11/17, and then there'll be that day's data. And then the next day will be 2021/11/18 for tomorrow's data. Right? So typically what users will do is they'll use Tonic to de-identify all of the data that's sitting in their warehouse at once. And then every additional day, they'll want to run a job just on the previous day's data so that they can get that dealt up into their de-identified warehouse every day, right? That's a pretty common use case. So in order to do that, if the customer is using some type of partition filtering, we can do that. So for retail sales, I actually happen to know there's a date column. So I'm going to filter this down to 2021/02/16, which is February 16th of this year. We're going to validate that the partition filter is valid or we're going to confirm the partition filter is valid. And then when we go to generate data, Tonic is actually only going to operate on the data from that date.
Adam Kamor (14:00):
It's going to assume that all previous days have already been de-identified and it's just going to add that day's worth of data to the warehouse. This is a way of not having to rerun your entire data every time you go to generate data. And for example, this is a good example of a feature that we support in our data warehouses and not typically in our application databases because that scenario isn't super common in those style of databases. Great. So let's say I've transformed this data in a way and I'm happy with it, okay? I can go generate data and I'll do that now.
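Editor's note: the incremental, partition-filtered run described above can be sketched as follows. The in-memory rows, the `deidentify` placeholder, and the `deidentified_warehouse` dictionary are hypothetical stand-ins for a date-partitioned table and its de-identified copy; a real job would read and write partition folders through the warehouse engine.

```python
from datetime import date

# Made-up retail sales rows spanning two date partitions.
retail_sales = [
    {"date": date(2021, 2, 15), "customer_key": 1, "sales_quantity": 3},
    {"date": date(2021, 2, 16), "customer_key": 2, "sales_quantity": 1},
    {"date": date(2021, 2, 16), "customer_key": 1, "sales_quantity": 5},
]

# Stand-in for the de-identified warehouse: partition date -> processed rows.
deidentified_warehouse = {}


def deidentify(row):
    """Placeholder transformation: replace the customer key with an opaque label."""
    out = dict(row)
    out["customer_key"] = f"cust_{row['customer_key']:04d}"
    return out


def run_incremental(rows, partition_date):
    """Process one date partition only; earlier partitions are assumed to be done already."""
    if partition_date in deidentified_warehouse:
        return 0  # this partition was processed by a previous run
    batch = [deidentify(r) for r in rows if r["date"] == partition_date]
    deidentified_warehouse[partition_date] = batch
    return len(batch)


processed = run_incremental(retail_sales, date(2021, 2, 16))
print(processed, sorted(deidentified_warehouse))
```

Only the rows dated 2021/02/16 are touched, so the cost of a daily run tracks the size of one day's data rather than the whole table.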
Chiara Colombi (14:33):
Can I jump in with a quick question?
Adam Kamor (14:35):
Yeah. Great timing actually because we got to let the job finish.
Chiara Colombi (14:38):
Yeah. And I think that this is just a matter of jumping up to privacy hub. Does Tonic have logs of all the steps taken within transformations for audit if required?
Adam Kamor (14:47):
Yes, it does. Actually from the privacy hub, this gives you an explicit record of all changes made to transformations on columns. And this trail can be exported via our API, so you can get a full list of everything that's happened. And actually that's a good time to say everything I'm doing in the UI today can be done programmatically as well. Tonic ships with its own API documentation, it's self-documented. And you can use that to do basically everything that I'm doing today, all the clicks that I'm making, you can do via Postman or Python or Node or any way of firing off API requests. All right. Let's go back to jobs. All right. The job finished. This was a fast job. Obviously, it only took a minute. How long it takes your jobs to run is going to be a function of the size of your cluster or the size of your compute and the size of the data, right? One of the nice things about all of the data warehouses we support is that you can make jobs run as fast as you want. You can always throw more compute at them.
Adam Kamor (15:46):
Snowflake, Redshift, Spark and BigQuery are very good at scaling in that way. So we have customers that their data warehouse data goes anywhere from the hundreds of gigs to terabytes, even to the petabyte range, and Tonic is processing all of that data. So let's check out this job. All right. So the job completed in a minute. If I click here, it would take me to the Databricks log so I could see each of the steps that happened, but that's not super interesting. Instead, I'm going to copy this job ID and I actually want to go analyze my output real quick. So I've copied the job ID and I'm going to use this little Scala notebook that I have set up. I'm going to copy that job ID here. And I've already obviously set this up for you guys and I'm going to run this. So this is going to basically read in the table that we just created. This is the de-identified table and then we're going to show a few of the columns to the screen.
Adam Kamor (16:36):
These are some of the columns that we de-identified. So here it is. This is data that looks and feels real, but it's been de-identified. The way in which we de-identify it can vary. And I'm actually going to go into some ways that we can preserve various types of analytics in a moment. But right now I'm just trying to give you a high level overview. We also have retail sales. We only operated on 02/16. So for example, if I was to take this data frame... Okay. Hopefully I can do this on the fly. Select date, dot, I think it's distinct. Let's see what that gives us. Oh, sorry. Yep. Oh, there's actually two dates. Oh, that's because I didn't update the job ID. Sorry, I didn't update the job ID below, but if I had done this for the job that we just ran, you'd see just 2021/02/16 as opposed to the 15th and the 16th. Excellent.
Adam Kamor (17:39):
One cool thing about our data warehouses, and this doesn't apply to all of them, but it applies to some, is that in addition to using the Tonic UI to run every facet of the job, we actually also release a Java SDK that can be used if you already have an existing data pipeline that's being used to get data into your warehouse or into your lake. So for example, let's say every night you guys take a snapshot of your production database or specific tables or rows in your production database and you want to put that into S3. And then the goal is that it goes from S3 either into Databricks or into EMR or into Snowflake or Redshift, or maybe it's going to Google cloud storage and it's going into BigQuery, right? But you already have an existing pipeline. If you want Tonic just to integrate with that pipeline, you can use our Java SDK. And I'll give you a quick example of how that works. So here we are. I've actually already imported a few Java libraries that I need, including these two right here which are the two that we publish.
Adam Kamor (18:37):
The rest of these are just, I think, fairly standard Java libraries. And what I've done is I've taken my configuration here, all of the changes that I've made and I've exported them. And I can do this programmatically. When I export this, what it's doing is it's actually giving me a JSON blob that represents all of the transformations I've selected in the product thus far. It's just JSON. I've taken that JSON and I've already uploaded it to my Databricks cluster, okay? From this Databricks cluster then, I can use that configuration and operate on data any way that I see fit as long as that data has the same column names and schema as what I was using in the UI, which is typically not a problem. So let's do this. So I've just created my Tonic workspace, initializing it with that JSON file that I just showed you, and all you now need to do is pass in the table name and a data frame representing your table into Tonic's method process data frame, and that returns to you a new data frame.
Adam Kamor (19:37):
This new data frame, you can do anything you want with it, right? You can pass it onto the next step in your pipeline or you can immediately save it to file, to disk, to GCS or S3, or you could immediately load it into Snowflake or Redshift, right? You really can do anything you want with it. Let me show you what it looks like. Oh, okay. It just ran. Okay. So the configuration that I uploaded is a bit different than the one that I was just using in the UI. The one that we're using is actually using what's called a character scramble on all of these columns here besides the customer key. So here you can see that character scramble. I'll point out, notice how lines 20 and 21 got scrambled to the same value, right? On the surface, that seems like a bug, but actually if you go here and you go to the customers table and you look at rows 20 and 21, notice the original data actually has Melissa twice. So what I'm doing here is I'm consistently mapping Melissa to this scrambled value, right?
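Editor's note: the pattern described above (load an exported JSON configuration, then pass a table name and its rows through a processing step that returns new rows) can be sketched in plain Python. Nothing here is Tonic's SDK: the JSON layout, `process_rows`, and `character_scramble` are hypothetical names chosen for illustration, lists of dictionaries stand in for data frames, and the last names are made up.

```python
import hashlib
import hmac
import json
import random
import string

# Hypothetical exported configuration: table -> column -> transformation name.
CONFIG_JSON = """
{"customers": {"first_name": "character_scramble", "last_name": "character_scramble"}}
"""


def character_scramble(value, secret=b"demo-secret"):
    """Replace letters and digits with random ones of the same kind, consistently per input."""
    seed = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).digest()
    rng = random.Random(seed)  # same input and secret give the same random stream
    out = []
    for ch in value:
        if ch.isupper():
            out.append(rng.choice(string.ascii_uppercase))
        elif ch.islower():
            out.append(rng.choice(string.ascii_lowercase))
        elif ch.isdigit():
            out.append(rng.choice(string.digits))
        else:
            out.append(ch)  # keep punctuation and whitespace as-is
    return "".join(out)


TRANSFORMS = {"character_scramble": character_scramble}


def process_rows(table_name, rows, config):
    """Return new rows with the configured columns transformed; input rows are left untouched."""
    column_rules = config.get(table_name, {})
    result = []
    for row in rows:
        new_row = dict(row)
        for column, rule in column_rules.items():
            if new_row.get(column) is not None:
                new_row[column] = TRANSFORMS[rule](new_row[column])
        result.append(new_row)
    return result


config = json.loads(CONFIG_JSON)
customers = [
    {"customer_key": 20, "first_name": "Melissa", "last_name": "Howard"},
    {"customer_key": 21, "first_name": "Melissa", "last_name": "Nguyen"},
]
scrambled = process_rows("customers", customers, config)

# Both "Melissa" rows receive the same scrambled first name; customer_key is unchanged.
print(scrambled[0]["first_name"] == scrambled[1]["first_name"])
```

Keeping the configuration as data means the same file can drive both an interactive tool and a scheduled pipeline, provided the column names match.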
Adam Kamor (20:42):
And we're going to talk a bit about that in a second, actually, because there's a lot of benefits to using transformations of that nature, right? But anyways, what I'm trying to illustrate here is that you don't have to actually use the Tonic UI to run these jobs if you don't want. You can use our Java SDK and integrate Tonic's transformation functionality into any data pipeline you already have that uses Java or a JVM language, I suppose. We can do the same thing with the retail sales data. I can apply my transformations, but I can also filter it to a certain date partition, right? Just like I did in the UI. It's really no different. Excellent. We can run that quickly just so you get a sense of it. Great. So here are some de-identified retail sales data, and I'm only showing you a handful of the columns for now. Great. Okay. So what I want to do now is actually, I want to show you guys a different set of transformations that I think are especially useful for our data warehouse customers.
Adam Kamor (21:39):
A lot of these transformations that I'm using now, they're used for data warehouses, but they're used more commonly perhaps for our application style databases. Oh, go ahead.
Chiara Colombi (21:51):
There's a question that came in a little bit earlier. How does Tonic measure risk level like re-identification risk?
Adam Kamor (21:59):
Oh, that's interesting. So that is a very hard thing to compute without knowing more about the data and knowing more about what the threat models are, how the data's being used, who has access to it. It's hard to answer that question without knowing a lot more than just what the data looks like. With that being said, we do have the ability to generate what we call generation detail reports, or they're also called privacy reports, which give you a report of what columns have been de-identified, how strong we believe the transformations are that you've applied to the columns that are sensitive, and also it calls out things like, oh, these columns that we believe are sensitive haven't been de-identified and things of that nature. And you can use those reports to make better decisions as to what transformations you should apply, and you're going to see this in a moment.
Adam Kamor (22:54):
We have generators that run the gamut from useful, but not super secure to incredibly secure, and you're going to see kind of the whole breadth of that today. So let's keep going. Chiara, are there any other questions actually before we go into this next step?
Chiara Colombi (23:11):
No, go ahead.
Adam Kamor (23:13):
Excellent. Okay, cool. All right. So let's go now to a different workspace. This is the same Databricks cluster, but I just kind of cleared my configuration in it so we can kind of start from scratch, and I'll show you another way of operating. So earlier we were using name generators on these first name and last name columns, and that generates new names, but that might not be useful. For example, what if you're trying to run some analytics and you want to get a sense of, for example, like frequency counts based on like how many men versus how many women, for example, are making purchases. Going to the retail sales table, how many purchases can I attribute to men and how many to women, right? That's something that you might want to do or you might want to do other things like customer key is sensitive for you, but you still need to be able to run statistics to understand like, okay, well, how many users make more than 10 purchases a year, for example, right?
Adam Kamor (24:10):
Typically, you'd use either your customer key for that or you'd assume first name, last name combinations are unique or maybe first name, last name and email are unique. So you need some kind of unique identifier that can define a user, right? The transformations that I was using earlier, they can do that in a way, but they're not going to give you as good statistical quality. So let me show you some other options that you have. So just to kind of put a fine point on it, let's actually go into the table. I think it'll be more helpful if we can kind of look at the data as we're doing this, and we'll give it a second to load. So what I did earlier is apply that first name generator, and I can make it consistent. What consistency's going to do is it's going to ensure the same input goes to the same output. So you can see here that Melissa goes to Rosella both times on lines 20 and 21, right?
Adam Kamor (25:01):
That was that same value that we saw in our workbook a few minutes ago, right? But unfortunately what can happen in this situation is that other people that aren't named Melissa can also get mapped to Rosella, right? This first name generator, it'll guarantee that the same input goes to the same output, but it won't guarantee that two different names don't get the same fake name, right? So because of that, if you were trying to use a combination of first, last name and email as a unique identifier for a user, consistency might not work for you, right? Because it is going to generate collisions, so your counts on the real data versus your counts on the fake data actually could be a little different. Now, for the chance of collisions, you can actually come up with back-of-the-envelope computations for how likely collisions are. And the likelihood of collisions does go down the more characters you have in the string and the more variety you have in the string; uppercase, lowercase, punctuation, white space, et cetera.
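Editor's note: one way to do the back-of-the-envelope computation mentioned above is the standard birthday approximation. The sketch assumes each distinct input is mapped uniformly at random into a space of `alphabet_size ** length` possible outputs, which is an idealization and not a statement about any particular generator.

```python
import math


def output_space(length, alphabet_size=26):
    """Number of possible outputs for a string of the given length and character variety."""
    return alphabet_size ** length


def collision_probability(n_inputs, n_outputs):
    """Birthday approximation: chance that at least two distinct inputs share an output."""
    return 1.0 - math.exp(-n_inputs * (n_inputs - 1) / (2.0 * n_outputs))


def expected_distinct_outputs(n_inputs, n_outputs):
    """Expected number of different outputs produced by n_inputs distinct inputs."""
    return n_outputs * (1.0 - (1.0 - 1.0 / n_outputs) ** n_inputs)


distinct_names = 10_000
for length in (4, 6, 8, 10):
    space = output_space(length)
    p = collision_probability(distinct_names, space)
    lost = distinct_names - expected_distinct_outputs(distinct_names, space)
    print(f"length={length:2d} P(any collision)={p:.6f} expected merged values={lost:.3f}")

# More character variety (say upper, lower, digits: 62 symbols) enlarges the space further.
print(collision_probability(distinct_names, output_space(6, alphabet_size=62)))
```

The number of merged values is what distorts counts on the output data, and it shrinks quickly as strings get longer or draw on a larger character set.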
Adam Kamor (26:00):
But there is still that chance of these collisions. And depending on your threshold for having slightly different statistics in your source and output data sets, that might be okay or it might not be okay. Historically with our customers that use these types of transformations, we've seen that the collision rates are very, very small and in fact, it's normally okay for customers, but we do have a class of generators that will guarantee no collisions, which are much better for certain use cases. Let's use those transformations now. Let's start off with our alphanumeric string key generator. Any of our generators that have the word key in them are going to be ideal for tokenization. So our alphanumeric string key generator is going to take every name and it's going to generate a unique and consistent output that's going to be composed of alphanumeric characters. So for example, Melissa gets the same token each time and in that case, the token is 90I3DOG, right?
Adam Kamor (27:02):
So whenever you see a Melissa, it's going to be this 90 string, but no other name will ever generate the 90 token. So in that case, you do get your consistency and a query such as when you're trying to find out how many sales on average is each user responsible for, that query is going to give the same results in the source and the output if you're using these key generators. We can do the same thing on last name. We'll use alphanumeric again. For gender, we can do this on gender. Gender's a bit interesting because there's only three options, M, F and null. Notice there's only going to be three tokens, null, I and nine, right? So in that situation, okay, yeah, it's tokenized, but how reversible is that, right? There's only two options. If you happen to know in the source that men occur 51% of the time and women, 49% of the time, then you can likely reverse what's going on in that situation, right? So you need to be careful still in some situations.
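Editor's note: the two points above (collision-free tokens keep per-user counts intact, while low-cardinality columns stay exposed to frequency analysis) can be shown with a small sketch. It uses a lookup-table "vault" that hands out unique random tokens, which is one common tokenization technique and an assumption here, not a description of how Tonic's key generators work. The class name, token length, and sample data are hypothetical.

```python
import secrets
import string
from collections import Counter

ALPHABET = string.ascii_uppercase + string.digits


class TokenVault:
    """Assigns each distinct value its own random token and remembers the mapping."""

    def __init__(self, token_length=7):
        self._token_length = token_length
        self._forward = {}   # real value -> token
        self._used = set()   # tokens already handed out

    def tokenize(self, value):
        if value is None:
            return None  # nulls stay null
        if value not in self._forward:
            self._forward[value] = self._new_token()
        return self._forward[value]

    def _new_token(self):
        # Retry until the token is unused, so two values can never share one.
        while True:
            token = "".join(secrets.choice(ALPHABET) for _ in range(self._token_length))
            if token not in self._used:
                self._used.add(token)
                return token


# Purchases per customer: the distribution of counts survives tokenization.
purchases = ["Melissa", "Adam", "Melissa", "Chiara", "Adam", "Melissa"]
vault = TokenVault()
tokens = [vault.tokenize(name) for name in purchases]
real_counts = sorted(Counter(purchases).values())
token_counts = sorted(Counter(tokens).values())
print(real_counts == token_counts)

# Frequency caveat: with two values and a known 51/49 split, counts reveal the mapping.
genders = ["M"] * 51 + ["F"] * 49
gender_tokens = [vault.tokenize(g) for g in genders]
most_common_token = Counter(gender_tokens).most_common(1)[0][0]
print(most_common_token == vault.tokenize("M"))
```

A vault like this needs shared state to stay consistent across jobs and databases; a keyed, reversible scheme avoids that at the cost of more involved cryptography.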
Adam Kamor (28:03):
I could do the same thing for email, but this is going to cause an error. There we go. The following exception was thrown. Most likely the string contains an invalid character. Yep. That's because the alphanumeric generator requires it to be only the characters A to Z and zero to nine. So let's instead use our ASCII key generator, which will work because these email addresses are all ASCII. But again, you'll get unique tokens and you can tokenize as many of your sensitive columns as you need to. Are there any questions?
Chiara Colombi (28:36):
We have a question here for you. Yes. Can we use custom hashing on columns if required?
Adam Kamor (28:43):
So whoever asked that, they might need to clarify a bit what type of custom hashing they mean. Do they mean they have their own hashing algorithm they would like to apply in Tonic?
Chiara Colombi (28:54):
I will let you know if I get a follow up.
Adam Kamor (28:58):
Sure. I'm going to assume that's what they mean just because I think that's a reasonable assumption, but if I'm wrong, whoever asked, please correct me. You have a few options in that regard. If you're using our Java SDK, then you don't need to tell us about your custom hashing function, right? You can apply the Tonic transformations in the columns you want and you can apply your own hashing function on the other columns and everything will be good, right? If you want to use Tonic via the UI and you want, for example, there to be a dropdown item for my custom hashing function and you want it to do whatever it is your hashing function does, that is something that we normally can support. We would have to work with you on what that hashing function is and figure out how to get an implementation of it and get it into the product, but that's something that we typically will work with customers on, and a typical turnaround on things like that is normally pretty quick.
Chiara Colombi (29:51):
I didn't get a follow up yet so I think we can keep going.
Adam Kamor (29:54):
Okay. I'm just going to assume that we've answered their question. Excellent. So in this situation now, we can go generate our data. Your output's going to be tokenized as you want. All new data that comes in every day can be tokenized in the same manner and the statistical queries that you care about typically are going to result in the same answers as they would've on the source database, and you're meeting various compliances out there if you're properly tokenizing. With that being said, I think we're at a pretty good natural stopping point. Chiara, this leaves about 10 minutes for questions?
Chiara Colombi (30:27):
Yeah. Yeah. Definitely. And I do have another one. No. Go ahead.
Adam Kamor (30:31):
Yeah. I think let's just open it up for questions now, and I'm happy to show more if the need arises.
Chiara Colombi (30:37):
Yep. Sounds great. One that just came in is, do we support HIPAA 18 identifiers?
Adam Kamor (30:43):
Right. So for those that aren't aware, HIPAA has a notion of what's called safe harbor guidelines, and under their safe harbor guidelines... Well, actually, let me take a step back. HIPAA is a regulation related to healthcare data. Under HIPAA, there is this notion of safe harbor. Safe harbor lists 18 different fields that, if properly de-identified, means the data is safe harbored and you can use it as you need to without really any fear of risk. Tonic has generators for those 18 fields, if that's the question being asked, and I think it is. We don't have specific biometric transformations that you can use because I think there are certain biometric fields that are listed under that 18, but some of our other generators you can use in lieu of the biometric data types.
Chiara Colombi (31:36):
And the next question for you is a bit of a higher level question, which is how have teams been solving for this problem to date? Prior to having a tool like Tonic to step in, what have they been doing or has it just not been so much a problem so that they haven't had to come up with solutions until GDPR came along?
Adam Kamor (31:54):
Right. I mean, I think things like GDPR and CCPA have definitely increased the need for tools like Tonic. What were folks doing before or what were they doing leading up to that, that now needs to change? Well, I mean, there were a few options, running the gamut from good to not good. On the proper side, they were implementing their own de-identification. That's very challenging to do. You have to keep it up to date. You have to have a very strong head for privacy. You have to basically adapt it every time the schema or data model changes. And it becomes a fairly big burden, especially for complex data sets which most of our customers possess. On the other extreme end of that, they were doing nothing. They were giving unfettered access to production data and that was really the end of it. And then there were options in the middle where, yeah, the production data, it wasn't de-identified, but maybe a couple columns were encrypted and they did their best to limit access to it and ran the gamut between those three.
Adam Kamor (32:57):
We've certainly seen all of those and everything in between just working with our customers.
Chiara Colombi (33:03):
Another question, since you mentioned working with our customers, you mentioned that teams are using Tonic for their data warehouses as kind of an expansion of the existing application database use case. How common is that? Is it more common for a team to only be doing one use case or the other?
Adam Kamor (33:20):
Huh. That's interesting. I believe most of our customers that are using our data warehouse functionality are also using it on application databases. Most, but not all. I can think of a few prominent exceptions, but on the flip side, we have many customers that are using it for application databases and not for data warehouses. And I'm not sure why that is. It might be that they're not really heavily invested in data warehouses right now. I'm unsure. That's a good question for the business folks at Tonic.
Chiara Colombi (33:53):
All right. You may have seen this question in the chat. My understanding is that GDPR has ruled explicitly that tokenized PII data is not considered non-PII. So would still be considered PII, I guess. And they ask, have you heard of this?
Adam Kamor (34:12):
So what I'm familiar with, with GDPR, and I'm by no means a GDPR expert, but we do have people at Tonic that are so please reach out, we're happy to chat, is that GDPR looks kindly upon pseudonymization. And if your data is pseudonymized, there are things that you can do with it that you otherwise wouldn't be allowed to do under the regulations. For example, you're allowed to keep the data longer. I believe you don't have to adhere to data deletion requests and you can use the data for use cases that otherwise you can't use it for because I believe GDPR requires that if you collect data for a purpose, you can only use it for that purpose. But if it's been pseudonymized, you don't have to listen to that. So then the question is, okay, well, what does it mean to pseudonymize data? And there, I think there's a little more wiggle room and it's kind of like more open to interpretation.
Adam Kamor (35:06):
So I think pseudonymization, it means a lot of different things to a lot of different people. I don't think I want to get into it on this call, but for those that are curious about how Tonic can help with GDPR, there's a lot we can do and there's a lot we currently do for our customers and I'd encourage you to reach out. Chiara, what's the best way for folks to reach out to us actually?
Chiara Colombi (35:25):
You can shoot us an email at email@example.com. You can also email me directly, firstname.lastname@example.org. The name is spelled C-H-I-A-R-A @tonic. And I do want to mention, to this question, we are prepping our next eBook. It is all about data privacy compliance for developers and for these use cases. So it'll speak directly to this of like, what are the requirements for de-identifying data for GDPR? And I think what you've said is accurate. And where GDPR talks about pseudonymization, CCPA, the California rule, talks about de-identification. It's just the same concept, two different ways of calling it. But yes, once data has been de-identified in a way or pseudonymized in a way that it cannot be linked back to an individual, it is considered safe for use even if, like you want to call it, tokenization, the PII is no longer considered PII. And so you can use the data outside of your original purpose for collecting it. Any other questions? Let's see here. Yeah. A couple of questions about working across databases. Can Tonic perform joins across databases?
Adam Kamor (36:44):
Tonic can preserve your joins. I mean, Tonic itself is not doing joins, but Tonic can preserve joins. So let's say you have several tables that when joined together comprise a view, right? It's a materialized view and it's defined by a join statement between two tables. And both of those tables contain sensitive information, specifically there's sensitive information in their primary and foreign key columns, which is what you're joining against, right? If you were to use our tokenization generators, then your joins will continue to work, right? Because you're going to tokenize the primary key value and the foreign key value to the same thing because the same input always gets the same output. That's going to work. If you were to use our non-tokenization generators, for example, our character scramble, which does a good job of maintaining uniqueness, but it doesn't guarantee it. That's the transformation where I was saying, collisions are very unlikely, but they can occur.
Adam Kamor (37:37):
Then again, your joins are going to work, but you do have the chance of adding rows that might join against each other that otherwise shouldn't. But typically, that's pretty unlikely.
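The reason consistent tokenization preserves a join can be shown with a small Python sketch. Everything here is illustrative: the `users` and `orders` tables, the column names, and the HMAC-based `tokenize` helper are hypothetical stand-ins, not Tonic's generators or schema.

```python
import hashlib
import hmac

KEY = b"demo-key-not-for-production"  # hypothetical key for illustration


def tokenize(value):
    """Same input always gives the same output, which is what keeps keys aligned."""
    return hmac.new(KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


def tokenize_column(rows, column):
    """Return copies of rows with one column replaced by its token."""
    return [{**row, column: tokenize(row[column])} for row in rows]


def join(left, right, left_key, right_key):
    """Inner join: pair each right row with the left rows sharing its key."""
    index = {}
    for row in left:
        index.setdefault(row[left_key], []).append(row)
    return [(l, r) for r in right for l in index.get(r[right_key], [])]


# The sensitive value (an email) is both the primary key and the foreign key.
users = [
    {"email": "a@example.com", "plan": "pro"},
    {"email": "b@example.com", "plan": "free"},
]
orders = [
    {"user_email": "a@example.com", "total": 30},
    {"user_email": "a@example.com", "total": 12},
    {"user_email": "b@example.com", "total": 5},
]

source_pairs = join(users, orders, "email", "user_email")
token_pairs = join(
    tokenize_column(users, "email"),
    tokenize_column(orders, "user_email"),
    "email",
    "user_email",
)

# Compare only the non-sensitive columns of each joined row.
source_result = [(l["plan"], r["total"]) for l, r in source_pairs]
token_result = [(l["plan"], r["total"]) for l, r in token_pairs]
```

With a transformation that does not guarantee uniqueness, two different emails could map to the same output, and the join would then pair rows that were unrelated in the source. That is the collision risk described above.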
Chiara Colombi (37:48):
The other question was around across database types, does consistency work across database types?
Adam Kamor (37:53):
Yeah. Consistency works across all of our databases in the exact same way. I mean, we could literally put data into all six of our data warehouses and it's going to get transformed in the same way if you're using consistent generators.
Chiara Colombi (38:07):
And then here's another question that came through the chat. Do you support warehouses running on premise on Oracle?
Adam Kamor (38:15):
We certainly support on-prem, and this is shame on me, something I should have said at the beginning of the webinar, Tonic is typically deployed on-prem. We do host Tonic for a handful of customers, but most of our customers deploy Tonic onto their prem. It could be into their own data center or it could be into their own network in their public cloud, like Azure, AWS, GCP, et cetera. So in terms of a data warehouse on Oracle, is this something like Exadata? I think Amit asked this question. Amit, are you talking about Exadata or some other Oracle technology? Ah, just x86? Amit, we do support Oracle. I think we support up to 19c right now, but we add support for new versions of Oracle as needed. So if you have data in Oracle that's on-prem that you would need de-identified, that's certainly something we can support.
Chiara Colombi (39:11):
Great. Another question, and I think you touched on this somewhat earlier, how automated are customers able to make Tonic's processes for data warehouses? Thinking about, you mentioned partitioning the sheer amount of data that is coming in.
Adam Kamor (39:25):
Right. So typically what we see is that, when a data warehouse customer first comes in, they're using the Tonic UI to kind of get everything set up and take care of that initial big bang that is operating on all the data that's currently in the warehouse. This could be petabytes of data for all we know, right? And then after that's done, typically they move Tonic... They don't move. They start using Tonic less via the UI and more via our API and through their own types of job scheduling software, right? They might want to run a job every night that processes yesterday's data. We have customers that manually trigger our API to do so via cron jobs, or we have folks using, oh God, is it Airflow or Airtable? I always mix up the two names. I think it's Airflow. Using tools like Airflow to trigger these jobs as well. And at that point, it becomes pretty hands free.
Adam Kamor (40:16):
You can also use Tonic's Webhooks to do things like receive a Slack notification or an email when a job succeeds or fails. You can use the API to poll for job status, et cetera. We make it very easy to automate all of this.
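The polling half of that automation might look like the following Python sketch. It deliberately does not call any real Tonic endpoint: `get_status` is a hypothetical callable that you would back with your own API request, and the status names in `TERMINAL_STATES` are assumptions made for the example.

```python
import time

# Assumed status names; check the real API for the values it returns.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_for_job(get_status, interval_seconds=30.0, max_polls=120, sleep=time.sleep):
    """Call get_status() until it returns a terminal state, then return that state."""
    for attempt in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if attempt < max_polls - 1:
            sleep(interval_seconds)
    raise TimeoutError(f"job still running after {max_polls} polls")


# Stubbed usage: a fake status source stands in for the API call, and the
# sleep function is replaced so the example finishes immediately.
fake_statuses = iter(["Queued", "Running", "Running", "Completed"])
final_status = wait_for_job(lambda: next(fake_statuses), sleep=lambda _seconds: None)
```

A nightly cron entry or an Airflow task could trigger the job and then run a loop like this one, failing the task when the final status is not the successful one.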
Chiara Colombi (40:30):
Awesome. And kind of a follow up to that is how often do Tonic users run generations?
Adam Kamor (40:38):
The answer's different for application databases and data warehouses. Data warehouses, I mean, depending on how frequently the data in the warehouse is getting updated, typically they're running it on the same schedule as the data updates come in, right? Nightly is pretty common. For application databases, you're typically rerunning Tonic whenever there's a schema change or really whenever your data has changed in a meaningful way, I would say, which it could be every day, but it can also be less often.
Chiara Colombi (41:04):
Yeah. Yeah. I remember in some case studies, teams have mentioned four times a day and those have been, I think, the application databases.
Adam Kamor (41:13):
That's right. Four times a day, that's very frequent. Certainly we have customers doing that, but I think normal is like once a day, four times a week, something along those lines.
Chiara Colombi (41:24):
Yep. Here's another question for you. With regard to observability of Tonic systems, what options do you provide?
Adam Kamor (41:33):
Sure. So the Tonic service is deployed on your prem, as I said. Resume share. Okay, sorry. Sorry, lost my train of thought. I got a weird Zoom notification. So Tonic is deployed on your prem, as we said. It's deployed via Docker containers. Each of the containers has health check endpoints that you can hit at whatever cadence you like. If you deploy Tonic via Kubernetes, which many of our customers do, you can use the suite of options available to you in Kubernetes for observability and for the restarting of services and pods and containers as things happen. Additionally, Tonic logs everything it does to standard out on each container and you can take those logs and put them into your own observability systems like Splunk or Elasticsearch plus Kibana, or really whatever it is you need to do.
Chiara Colombi (42:27):
And there was a question I was going to ask you earlier around role based access, if you wanted to talk about, because the question that came in around the audit log, what levels we provide and what that entails.
Adam Kamor (42:38):
Sure. So Tonic ships by default, I think, with three different access levels, or really four. The top level is the owner of the workspace. So you guys saw, I created those workspaces today. I'm the owner. As the owner, I can share my workspace with other people. When I share it with others, I have to kind of state what level of permission I want to give them. And in fact, it might be best just to do this in the product. Let's go to the workspace manager. These are all my different workspaces. So let's go to Databricks. Other, sure. Let's go here. I'll share it. So I need to choose who I'm going to share it with. Chiara, I will share it with you for fun, Chiara@tonic, and I can make Chiara a viewer, which really gives them only read-only access to the Tonic workspace configuration. A viewer can't even see the data from the data source so they wouldn't be able to see that table preview, for example.
Adam Kamor (43:36):
An auditor has a few additional permissions. I believe they can see the data and they can also see the Tonic workspace configuration. And an editor can do everything an owner can do, but they can't share it out.
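The four access levels described in this answer can be summarized as a permission matrix. The Python sketch below models only what was said in the talk. The `Permission` names and the `can` helper are hypothetical, and the auditor row reflects an "I believe", so treat it as unconfirmed.

```python
from enum import Flag, auto


class Permission(Flag):
    VIEW_CONFIG = auto()  # read the workspace configuration
    VIEW_DATA = auto()    # see source data, such as the table preview
    EDIT_CONFIG = auto()  # change the workspace configuration
    SHARE = auto()        # share the workspace with other people


ROLE_PERMISSIONS = {
    "viewer": Permission.VIEW_CONFIG,
    "auditor": Permission.VIEW_CONFIG | Permission.VIEW_DATA,
    "editor": Permission.VIEW_CONFIG | Permission.VIEW_DATA | Permission.EDIT_CONFIG,
    "owner": (
        Permission.VIEW_CONFIG
        | Permission.VIEW_DATA
        | Permission.EDIT_CONFIG
        | Permission.SHARE
    ),
}


def can(role, permission):
    """Return True if the given role includes the given permission."""
    return permission in ROLE_PERMISSIONS[role]


viewer_sees_preview = can("viewer", Permission.VIEW_DATA)
editor_can_share = can("editor", Permission.SHARE)
```

Read this way, each level adds one capability to the level below it, and sharing is the only thing that separates an owner from an editor.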
Chiara Colombi (43:50):
That's great. That's helpful to see in action. And feel free to share that with me because I don't think I have a workspace for Databricks.
Adam Kamor (43:58):
Yeah. Okay. Sorry. Clicked the wrong thing. Yeah. I'm going to do it right now. Otherwise, I'll definitely forget. Chiara@tonic, editor. All right. A lot of responsibility. There you go.
Chiara Colombi (44:11):
Thanks. All right. I think that's most of the questions. Oh, there was a question around performance. So how does the size of the data warehouse impact Tonic's performance? But maybe that's more on that first configuration.
Adam Kamor (44:26):
Right. So it depends entirely on how big your cluster is and which data warehouse you're working with, right? On things like Spark or Snowflake, you can change the compute, right? If your data size doubles, but you also double the number of Databricks Spark workers you have or double the compute size of your Snowflake warehouse, then the job time's not going to go up at all. It's going to stay the same, right? That's really the nice thing about most of these data warehouse technologies. They scale very well with data size. You can always just throw more compute at it in order to make the job go faster, and that's different than application-style databases like Postgres or SQL Server, which are going to be more constrained by the single point of data access on the server itself.
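The scaling claim can be checked with back-of-the-envelope arithmetic. The Python sketch below assumes an idealized model in which work divides perfectly evenly across workers. The function name and the throughput figure are made up for illustration, and real jobs carry overhead that this model ignores.

```python
def estimated_job_hours(data_tb, workers, tb_per_worker_hour):
    """Idealized job time: total data divided by total cluster throughput."""
    return data_tb / (workers * tb_per_worker_hour)


# Baseline: 10 TB on 4 workers, each processing 0.5 TB per hour.
baseline_hours = estimated_job_hours(10, 4, 0.5)

# Double the data and double the workers: the estimate does not change.
doubled_hours = estimated_job_hours(20, 8, 0.5)

# Double the data on the same cluster: the estimate doubles.
same_cluster_hours = estimated_job_hours(20, 4, 0.5)
```

A single-server application database has no equivalent of the `workers` term to raise, which is why it tends to be limited by the server itself.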
Chiara Colombi (45:15):
Well, that is all the questions that I see. If anyone has one more question, feel free to throw it in to the chat or the Q&A and we can get that answered. And if not, I will just close things out by saying thank you to everyone for joining. Thank you in particular to Adam. We do have another question. So I'll go ahead and ask that before I close things up. Do you document anywhere what generators are available across different data stores?
Adam Kamor (45:39):
Sure. So in general, and there are a few exceptions, the same generators are available across all data stores. Again, there are a handful of exceptions to that, but that is a good rule of thumb. I think you can go to our documentation page to get a full list of transformations though.
Chiara Colombi (45:55):
Yeah. Could you just open up the docs real fast and show that.
Adam Kamor (46:00):
Certainly. Yes. It is going to be here, generators. Great. Here, we list all of our generators and it goes all the way down. This list might not be perfectly up to date. We're constantly adding new generators and very rarely do we take them away, so we're probably just missing some. But there are a few generators on here, for example, that we don't support on Databricks at the moment, but we're constantly adding support for new ones.
Chiara Colombi (46:36):
And another, I feel like I keep on bringing up eBooks, but we are currently working on an eBook that is going to be more of like a glossary of our generators. I'm glad that question was asked because we could specify in that glossary this may not apply to the data warehouse use case.
Adam Kamor (46:51):
Chiara Colombi (46:51):
Yeah. Yeah. I'm excited about that eBook. It's probably going to be more like 2022 that we'll launch that one. But look out for the compliance eBook. That one is coming up very soon. I would say within the next two or three weeks, that should go live. So excited about that. Other things, we do have another webinar coming up in early December based on our fake data anti-patterns eBook. So if you've seen that eBook or if you haven't, go check that out. Definitely sign up for the webinar as well. We'll go through that in more detail. So thanks again to everyone for joining. This has been great. Adam, awesome presentation. Thank you. And if you'd like to learn more about Tonic, head on over to tonic.ai. You can book a demo with us or you can also request a sandbox, which is basically a two week free trial of Tonic if you want to get your hands on the features and try things out yourself. Awesome. Thanks to everyone. Thank you, Adam. And we will see you next time.
Adam Kamor (47:46):
Thank you. Bye.
Chiara Colombi (47:47):