February 14, 2022 4:00 PM
Scaling Fake Data for the Enterprise
What are the challenges involved in data synthesis for the enterprise? What are the pitfalls to avoid and the solutions to rely on? Join Tonic.ai Head of Engineering and co-founder Adam Kamor and Mockaroo CTO and founder Mark Brocato as they share their insight and experiences on the frontlines of big data generation.
Founder & CEO
Adam Kamor, PhD
Co-Founder & Head of Engineering
“At scale”—two small words for a universe of challenges in the realm of software development. From performance issues to endless integration requirements to the risks of relying on open source tools, building developer solutions at the scale required by the enterprise is riddled with complexities. And when it comes to test data generation, big data is making “at scale” an ever bigger problem.
What are the challenges involved in data synthesis for the enterprise? What are the pitfalls to avoid and the solutions to rely on? Join Tonic.ai Head of Engineering and co-founder Adam Kamor and Mockaroo CTO and founder Mark Brocato as they share their insight and experiences on the frontlines of big data generation. They’ll reflect on what they’ve learned over the years in scaling up their products, as well as scaling up software companies over the course of their careers from startups to industry leaders.
They’ll speak to:
lessons learned from Log4j, and the inherent risks of open source software
the trade-offs of building tools in-house vs. third-party solutions
optimizing platform performance for large-scale deployments
and the growing pains along the journey from startup to scale-up and beyond.
Chiara Colombi (00:06):
Hi everyone. Welcome to today's conversation about scaling fake data for the enterprise, with the founders of Mockaroo and Tonic. My name is Chiara Colombi. I am the product marketing manager at Tonic, and I'm very glad to be joined by our speakers today. We have Mark Brocato, founder and CTO of Mockaroo, and Adam Kamor, co-founder and head of engineering at Tonic.
Adam Kamor (00:26):
Happy to be here. Thank you.
Chiara Colombi (00:28):
This is the third time we've sat down for kind of an informal chat like this with Mark of Mockaroo. I always enjoy these conversations and the stories that turn up between these leaders in the field of fake data. For context, both of our products fall within the category of data synthesis, most often in the context of test data generation, but we approach the challenge from two very different angles.
Chiara Colombi (00:49):
I think it's helpful to set that stage before we get started with the questions. Mockaroo is a platform that enables you to create fake data from scratch based on rules that you define to dictate how that data looks. So you might set like the ratios and distributions within certain data types. Tonic, on the other hand, is a platform that enables you to create fake data based on your data.
Chiara Colombi (01:11):
So you connect Tonic to your production databases and use Tonic's generators to mimic your real-world data. And what you're creating is fake data that looks, acts and behaves like production data, but stripped clean of sensitive information like PII and PHI. So since we approach the same problem from such different angles, I can ask the same questions of our speakers and get very different but always very valid answers.
Chiara Colombi (01:35):
On that note, we've got a lot of great questions already lined up, but I'd love to hear your questions. So please drop them in the Q&A, drop them in the chat at any time and I'll keep my eyes out for them and ask them if they come up. Awesome. Let's get started. We're talking about scaling up data generation to meet the needs specifically of the enterprise.
Chiara Colombi (01:54):
And I think it'd be helpful to establish some baselines here around the specific needs and challenges involved. So my first question for each of you is, what, in your view, are the top needs of the enterprise when it comes to data generation?
Mark Brocato (02:08):
A completely unrealistic number of fields. It's incredible some of these data models that people have to live with. I mean, I get it. I remember working... Actually, when I created Mockaroo, I was working at a biotech company. It was a biotech startup.
Mark Brocato (02:25):
One of the products we were, I don't know, aiming to use or integrating with was like Oracle's healthcare and life sciences data model, which was literally so big that they had to create like a Google Map style browser for traversing the thing, like zooming in and zooming out and panning over in terms of number of tables.
Mark Brocato (02:47):
But one of the things that always shocks me, and it's a different scaling than maybe you think of. You might think of scaling as like, all right, well, I need to generate a terabyte of data or a petabyte of data. You think of it like longitudinally, number of rows.
Mark Brocato (03:02):
But the complexity of the schemas that you see in the real world and in the enterprise, just the number of columns, or fields in Mockaroo-speak, is sometimes really where the challenge can be. And so I'm always careful. Sometimes people will ask me, "Hey, how long is it going to take to generate 10 million rows of something?" And it depends on rows and columns. It's a product. And sometimes I'm surprised.
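To make the "rows and columns, it's a product" point concrete, here is a minimal sketch of a back-of-the-envelope estimate. The function name and the throughput figure are hypothetical and chosen only for illustration; they are not measurements from Mockaroo or Tonic.

```python
def estimate_generation_seconds(rows: int, columns: int, cells_per_second: float) -> float:
    """Estimate generation time from the total cell count (rows x columns)."""
    if cells_per_second <= 0:
        raise ValueError("cells_per_second must be positive")
    return (rows * columns) / cells_per_second


# Same row count, very different workloads.
# The throughput of 5,000,000 cells per second is an assumed figure.
narrow = estimate_generation_seconds(10_000_000, 10, cells_per_second=5_000_000)
wide = estimate_generation_seconds(10_000_000, 500, cells_per_second=5_000_000)
print(f"10 columns: {narrow:.0f} s, 500 columns: {wide:.0f} s")
```

Under these assumptions, the same 10 million rows take 50 times longer when the schema is 50 times wider, which is why a row count alone does not answer the question.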
Adam Kamor (03:26):
Size on disk is how we typically look at it, really. Because, I mean, when it comes down to it, that's really kind of the product of the rows and columns, as you're saying. Mark, I think Tonic's record, in terms of just like that horizontal width and the total number of tables and columns, I think our record with a customer, it was approximately like 47,000 tables, probably averaging 10 columns per table.
Adam Kamor (03:55):
So hitting close to half a million individual columns. I mean, forget the generating for a second, right? I mean, that needs to load in your UI. I mean, rendering 500,000 of anything in the UI can actually be just complex to begin with. So step one was making sure the UI would even load, which it did, but it took some work.
Mark Brocato (04:19):
Virtualize your React components, folks.
Adam Kamor (04:21):
Yeah. Virtualize everything when you can. I'm going to add on to what Mark said, because yeah, enterprises come to you with a tremendous amount of complexity. And it could be like complexity between columns or just like massive scale in terms of tables and columns and their numbers.
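"Virtualizing" a long list generally means rendering only the rows that are currently on screen. The following is a minimal sketch of the index arithmetic behind that idea, written in Python for illustration rather than as a React component. The function name, the fixed row height, and the `overscan` parameter are assumptions, not details of Tonic's UI.

```python
def visible_range(scroll_top: int, viewport_height: int, row_height: int,
                  total_rows: int, overscan: int = 5) -> range:
    """Return the indices of the rows worth rendering for a scroll position.

    Assumes every row has the same pixel height. `overscan` adds a few extra
    rows above and below the viewport so fast scrolling does not show gaps.
    """
    first = max(0, scroll_top // row_height - overscan)
    last = min(total_rows, (scroll_top + viewport_height) // row_height + 1 + overscan)
    return range(first, last)


# Roughly half a million columns listed in a UI, but only a screenful is rendered.
rows_to_render = visible_range(scroll_top=120_000, viewport_height=800,
                               row_height=24, total_rows=470_000)
print(len(rows_to_render), "of 470000 rows rendered")
```

The cost of a render then depends on the viewport size rather than on the size of the schema.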
Adam Kamor (04:37):
I think two other things they care about, and I would just put these at the same level as what Mark said, would be the utility of the output data. I mean, you're generating this for a reason. So why are you generating it? Well, presumably it's to test something or to use it in a sales demo or for machine learning purposes, something like this.
Adam Kamor (04:56):
These are all use cases that Tonic and Mockaroo customers all have. So there's the utility of that output data. And then there's the privacy of that data. Depending on how you generate your output data, it might be too similar to the real data, in which case, okay, maybe you shouldn't be using it.
Adam Kamor (05:12):
You can't use the real data. So can you really use a fake version of it that is too similar to it? So it's really just the privacy utility conversation. They have to find the balance that they want to strike. As they increase the privacy of their data set, the utility of the data goes down. So they might not want to do that.
Adam Kamor (05:31):
So both tools try to give you that control knob for like optimizing the balance between the utility and the privacy. And I think that's especially important for enterprises because it's super non-trivial to do at the scale that they are at.
Chiara Colombi (05:44):
Yeah. I think you both touched on like the three key points. There's utility, there's privacy, but there's also just the complexity involved of the data that you're trying to create. So what would you say are the top challenges in meeting those needs?
Mark Brocato (06:02):
One is design, because a design that works really well at small scale may not be the same design that you would use really well at large scale. And a design that accommodates for large scale may be overly complicated for small scale. So Mockaroo really was designed more with the small scale in mind.
Mark Brocato (06:26):
And I've added on things that help it scale out over time, just design-wise, never mind performance-wise. In the beginning, all I ever envisioned was somebody generating like a CSV file worth of stuff. And then people were like, "Well, I have multiple tables. So how do we solve the integrity constraints between them?"
Mark Brocato (06:47):
And so it's been a long journey to where we are now. Originally, I allowed people to upload their own CSV files and then reference them in Mockaroo. And then it became obvious, well, people were generating a series of CSV files, all corresponding to tables that related.
Mark Brocato (07:03):
So you would go through this workflow of generate the data in Mockaroo, download it, re-upload it back in, and then reference it in future schemas. And it was actually only within about the last year that it would do that whole process for you, where you could just click "create dataset."
Mark Brocato (07:22):
And then you could reference that later and you didn't have to download and re-upload. And then you could do it at scale. So there used to be a pretty small constraint on how big a stored dataset could be in Mockaroo, and it wasn't working for some of the biggest customers. So I wound up doing some pretty radical refactoring to make that possible. One thing I did, and sorry to ramble, but I feel this is an interesting cautionary tale for people using cloud services.
Chiara Colombi (07:49):
These are the tales I love.
Mark Brocato (07:53):
In order to be able to pick random rows or sequential rows from a really large dataset without like downloading and parsing it, I used S3. And S3 has this wonderful capability called S3 Select, where you can literally like write SQL against an S3 file.
Mark Brocato (08:08):
And you can just basically say, "Oh, pick me some random rows." And it does it really quickly. And it's really expensive. It's not obvious that they charge you for the compute behind all that until you see the bill. So I ran...
Adam Kamor (08:20):
Mark, is it AWS Athena is what you're talking about?
Mark Brocato (08:25):
They may share the underlying same technology, but there's literally this feature called S3 Select.
Adam Kamor (08:31):
Oh, okay. Interesting.
Mark Brocato (08:33):
Yeah. Where you can literally just... It's one of the many thousands of objects in their SDK. You can call that thing up and then you can send it a select statement. And it's pretty complete as to what it can do. I got a bill one month that was like 10X what it normally is, like total surprise. And then I wound up re-implementing that.
Mark Brocato (08:53):
I took it from Ruby to Crystal, basically. I used to do it locally with Ruby. Then I switched to doing it in the cloud with S3 Select and it was like ridiculously expensive. And then now it's optimized in Crystal, which gives you about a thousand times the performance of Ruby. So it's actually been pretty good. That's the long journey of accommodating that complexity and then the scaling challenges involved in accommodating that complexity.
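One common way to pick random rows from a large file locally, without loading the whole thing into memory, is reservoir sampling. The sketch below is an illustration of that general technique in Python. It is not Mockaroo's Crystal implementation, and the function name and sample data are assumptions.

```python
import csv
import io
import random
from typing import Iterable, List, Optional


def sample_rows(lines: Iterable[str], k: int, seed: Optional[int] = None) -> List[dict]:
    """Pick up to k rows uniformly at random from CSV text in a single pass.

    Uses reservoir sampling (Algorithm R), so memory use is proportional
    to k rather than to the size of the file.
    """
    rng = random.Random(seed)
    reservoir: List[dict] = []
    for i, row in enumerate(csv.DictReader(lines)):
        if i < k:
            reservoir.append(row)       # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                reservoir[j] = row      # replace with probability k / (i + 1)
    return reservoir


# In practice `lines` would be an open file; a small in-memory CSV stands in here.
data = io.StringIO("id,name\n" + "".join(f"{n},user{n}\n" for n in range(1000)))
picked = sample_rows(data, k=3, seed=42)
print(picked)
```

Because the file is streamed line by line, the same function works on inputs far larger than available memory, at the cost of reading the whole file once per sample.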
Adam Kamor (09:23):
On the Tonic side, I'll remind folks of the question, it's, what's the biggest challenge you face when you are solving these enterprise challenges?
Chiara Colombi (09:32):
Yeah. And how are you addressing it, I would say, is the follow-up.
Adam Kamor (09:34):
Right. I like Mark's answer a lot on design and it's definitely true. And that's a good point, Mark. I guess I knew it, but I had never said it out loud before. It's like the design you choose for your smaller early customers is potentially very different than what you choose for your later customers, who bring those 47,000 tables that you're talking about.
Adam Kamor (09:55):
I like Mark's answer. I'll try to add another one that is maybe a little more Tonic-specific. And this is something that we get asked a lot with our enterprise customers. It's, okay, you've generated a dataset for us. It has good utility. We're happy with it because we've tested it.
Adam Kamor (10:11):
But how private is it, really? So answering that question about, okay, well, how private is the data you've generated and how accurate is it really? Because when you have those tens of thousands of tables, you can't go through by hand and manually check everything.
Adam Kamor (10:24):
So being able to actually answer that question on privacy and utility and kind of give a customer an idea of this is how safe you are, these are the risks you're taking, is a very challenging problem. And it's one that we've only... I'm saying recent, but it's been in the past six months or nine months, but we've only recently started to tackle it head on via a new Tonic feature called... God, I hope this is the name of the feature, it's called Privacy Reports, I believe.
Adam Kamor (10:49):
And that's where we answer those types of questions for you. That's been a real challenge and it continues to be one just because like the mathematics behind privacy and the guarantees that you get are, A, challenging and, B, it's actually like a field of research that is constantly improving and getting better with like the latest and greatest algorithms and heuristics that are coming out, and Tonic tries to stay up-to-date with all of that.
Chiara Colombi (11:14):
Right. Which I think also speaks to another challenge within just test data generation: with any solution that you build, there's so much maintenance involved to keep up-to-date and to improve along the way. We have a couple of questions from the audience and I'm going to jump in with those quickly. Number one, are there any language-specific libraries that can be used to generate data right inside the code using Tonic or Mockaroo?
Adam Kamor (11:38):
Oh, interesting. Mark, I'll let you answer first.
Mark Brocato (11:40):
Well, yeah, Mockaroo has always supported Ruby syntax in a few different places. So there's the formula data type, where you can use Ruby code to compute values based on other columns. And then there's inline formulas, which is that little Sigma symbol to the right of every field, where you can use the same formula syntax to just like adjust the output of that column slightly. So that's built in.
Mark Brocato (12:03):
Mark Brocato (12:24):
Adam Kamor (12:50):
Interesting. So on the Tonic side, our approach to this is a bit different. Even though we have explored the idea of letting customers write their own code directly in the product, typically when it comes to the point where they need like a super specific custom generator, Tonic's engineers and support team will write it for the customer.
Adam Kamor (13:08):
Turnaround on things like that is typically just a day or two, depending on the complexity of the transformation they need. But a lot of Tonic's generators are actually like composable and they kind of plug and play with each other nicely.
Adam Kamor (13:20):
So you can actually build quite complex logic just with some of Tonic's more like base transformations, transformations like conditional generators, for example. This is a transformation that allows you to conditionally apply transformations on a given column based on condition statements that you kind of build up based on values either in that column or on different columns. And that's typically how we approach complexity.
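The idea of a conditional generator that composes simpler transformations can be sketched in a few lines. This is an illustration of the concept only, not Tonic's API. Every name below (`conditional`, `mask`, `constant`, `passthrough`) and every column name is hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
Generator = Callable[[Any, Row], Any]   # (current value, whole row) -> new value
Condition = Callable[[Row], bool]


def conditional(rules: List[Tuple[Condition, Generator]], default: Generator) -> Generator:
    """Build a generator that applies the first rule whose condition matches."""
    def generate(value: Any, row: Row) -> Any:
        for condition, generator in rules:
            if condition(row):
                return generator(value, row)
        return default(value, row)
    return generate


def passthrough(value: Any, row: Row) -> Any:
    return value


def mask(value: Any, row: Row) -> Any:
    return "*" * len(str(value))


def constant(replacement: Any) -> Generator:
    def generate(value: Any, row: Row) -> Any:
        return replacement
    return generate


# Conditions can look at other columns of the same row.
email_generator = conditional(
    rules=[
        (lambda row: row["account_type"] == "internal", passthrough),
        (lambda row: row["country"] == "US", constant("user@example.com")),
    ],
    default=mask,
)

row = {"account_type": "customer", "country": "DE", "email": "anna@firma.de"}
print(email_generator(row["email"], row))
```

Because `conditional` returns a generator with the same signature as its parts, conditional generators can themselves be nested inside other rules.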
Chiara Colombi (13:45):
There's another question around just capabilities that I'll ask before we go back to the subject of scaling. "We are in healthcare," the question says, "and need C-CDA documents in XML output. Is XML output supported?"
Mark Brocato (13:59):
In Mockaroo, yes.
Adam Kamor (14:01):
In Tonic, yes, as well.
Chiara Colombi (14:02):
Okay. Awesome. Thanks. All right. So going back to the topic of scaling up, let's broaden the conversation and think about scaling up software products in general. The first topic I'd like to cover there is that of in-house tools versus third party solutions. As you've been scaling up your products, what trade-offs have you encountered between building the tools you need in-house versus relying on third party solutions along the way?
Adam Kamor (14:26):
Oh, we go third party whenever we can. Our team is very good at building data privacy tools and we'd like to keep everyone focused on that. So we go with third party solutions whenever possible. That could be, for example, like logging and telemetry solutions that we bake into the product. It could be even simpler things like our VPN provider and where we maintain our source code and things like this. We go hosted wherever we can.
Mark Brocato (15:06):
Yeah. Same. I've had kind of, I don't know, a belief system in the teams that I've built that you should only own one thing, and that should be the one thing that makes your business different.
Adam Kamor (15:22):
Yeah. And makes it money, to be frank.
Mark Brocato (15:26):
Yeah. Differentiated and valuable. And so Mockaroo, it's just one of my projects. I have a bigger day job. But even with Mockaroo, it's always been on cloud providers. I use Ruby on Rails. The whole point of Ruby on Rails is it does like literally everything for you with the minimal amount of effort and maybe the minimal amount of performance too, but it's meant to get things up and running and going quickly.
Mark Brocato (15:53):
And so I really just wanted to focus on the design of the application and the data generation, and I'm using all the standard patterns and services. I should have been using Rollbar since like day one for error tracking. There may be other services out there now, but I've always been happy with Rollbar. I switched cloud providers a few times, actually.
Mark Brocato (16:16):
I try to go as low cost as possible. So I don't actually run on Amazon for the most part. I run on DigitalOcean and more recently Hetzner. So if anybody's looking for a really good cheap cloud provider out there, Hetzner based out of Germany is actually quite good.
Mark Brocato (16:33):
But yeah, in other businesses where we've handled web traffic, hosting platforms, we say we only want to own the critical path, so the path that the request takes through the system, and everything else we want to rent from somebody. Because there are just so many solutions out there right now, and there's so much downward price pressure on every piece of software, that it's very difficult to recreate something for cheaper, especially when you take the human cost into the comparison.
Adam Kamor (17:01):
Mark, given the one-man nature of how it's operated, I mean, you have to be even more careful, I suppose.
Mark Brocato (17:11):
Yeah, for sure. Because you have to live with everything that you build. You have to maintain everything that you build. So you feel it kind of viscerally. I am a man of limited talents. I'm not a Unix guy. I'm not a sysadmin. I barely am functional when it comes to like keeping a server up and running. Maybe some of my users can attest to that from time to time. So I really am just a software developer and I try to use as high level managed services as I can.
Adam Kamor (17:41):
That's true. I mean, even at Tonic where we have more engineers, it's the same ethos. It really is. It comes down to what you said. It comes down to like you want to focus your people where you're making your money, which in this case is a data privacy or a data generation tool.
Chiara Colombi (17:59):
So following up on that, why do you think it is that we so often encounter, especially in the enterprise as well, teams that are saying, "No, no, no. We're going to build our test data generation in-house?"
Adam Kamor (18:13):
I'll take a stab at that first. And maybe this is controversial, but I think they underestimate typically the challenge in doing it yourself. And that leads them down a path that often, and there are exceptions, but it often leads to pain and ruin. I can get into why it's so complicated, but that's kind of a longer conversation.
Adam Kamor (18:39):
I'll try to say it briefly though. I'll try to kind of paint a picture and explain this. When you want to create like a de-identified or fake version of your production database for testing, it's one thing to fill that output database with data and columns that is the correct data type, for example, right? Anyone could do that.
Adam Kamor (18:58):
But then you got to go connect your application to this now de-identified database. And that application has certain assumptions it makes of the data, like of the relationships between rows and the relationships between columns. Like, if the value in this column is one or two, then the value in this other column must be null. Things like this.
Adam Kamor (19:16):
And then if the application doesn't get data that it's expecting, it's going to crash or fail, and you're not going to be able to do your tests. So what you end up doing is when you create these like programs to generate fake data for yourself in-house, you end up having to recreate all of the business application logic in your test data generation application.
Adam Kamor (19:36):
Otherwise, your test data isn't worth anything because you can't actually run your application against it. So it gets very complex very quickly and it's not obvious at the beginning that it's going to be so bad. Because most people don't appreciate how hard something is when they first start it because they don't have a lot of knowledge of it to begin with.
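The kind of implicit cross-column assumption described above can be written down as an explicit check that generated rows must pass. The sketch below is illustrative only. The table, the column names, and the second rule are hypothetical; only the shape of the first rule (a value of one or two in one column forces null in another) comes from the conversation.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]
Rule = Tuple[str, Callable[[Row], bool]]

# Each rule returns True when the row satisfies the application's assumption.
RULES: List[Rule] = [
    ("closed_at must be null while status is 1 or 2",
     lambda r: r["status"] not in (1, 2) or r["closed_at"] is None),
    ("closed_at is required once status is 3",
     lambda r: r["status"] != 3 or r["closed_at"] is not None),
]


def violations(rows: List[Row], rules: List[Rule]) -> List[Tuple[int, str]]:
    """Return (row index, rule description) for every broken rule."""
    return [(i, name)
            for i, row in enumerate(rows)
            for name, holds in rules
            if not holds(row)]


generated = [
    {"id": 1, "status": 1, "closed_at": None},
    {"id": 2, "status": 2, "closed_at": "2022-01-05"},   # breaks the first rule
    {"id": 3, "status": 3, "closed_at": None},           # breaks the second rule
]

for index, rule in violations(generated, RULES):
    print(f"row {index}: {rule}")
```

A real application carries many such rules, most of them unwritten, which is where the hidden effort of an in-house generator tends to accumulate.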
Adam Kamor (19:53):
And to the previous conversation of you want to go third party when you can because it's their expertise. Tonic and Mockaroo are both very good at solving different shapes of this problem. And we've been doing it for a long time. We do it across many different customers and we know how to generate that really good test data.
Mark Brocato (20:15):
I think I totally agree. And I see another angle too, which is more psychological. It's more fun to build than to buy, in general. Engineers, we're put on the...
Adam Kamor (20:29):
Yeah, especially a greenfield app, something from scratch. Nothing is more fun than that.
Mark Brocato (20:34):
And it's especially tempting to build tools, because they're conceptually easy to understand. It's a problem that you can create all of your own requirements and just set about starting to build stuff. It's probably an easier problem to wrap your head around than actually what you're getting paid to build.
Mark Brocato (20:55):
Oftentimes the reason why... For Tonic to sit there and for somebody to build a CI/CD pipeline, it's a lot easier to understand than it is for somebody who had to figure out how to de-identify that data and come up with that core algorithm. I'd rather go and work on a thing where yeah, I get to exercise my muscles as an engineer and build stuff and get the reward mechanism of seeing something run.
Mark Brocato (21:17):
But I'd rather not solve this really incredibly difficult problem. So let me go and work on some tooling. I see that a lot from very good engineers. So it's like we have some counterproductive instincts as engineers, actually. That we'd rather go work on things that are easy, as long as they kind of give us that reward mechanism of shipping something.
Chiara Colombi (21:36):
Yeah. I remember you bringing that up in a previous conversation as well. It's kind of like engineers can sometimes be their own worst enemy if they're saying, "I can build it. I'll do this." And then you're totally distracting yourself from your core work.
Adam Kamor (21:50):
I mean, I'll take it a step further. I mean, maybe it's just me, but I bet every engineer in this call has started a project thinking it's going to be easy. And then after day two or three, you start seeing like, oh my God, this is very complex. And you stop and you don't do it anymore. That's very common when writing these test data generation applications.
Mark Brocato (22:09):
Yeah. I was saying that underestimating level of effort, all progress depends on it. We would never do anything if we were good at estimating level of effort.
Adam Kamor (22:20):
That's a good counterpoint, actually. Yeah, that's true.
Chiara Colombi (22:23):
That is great. I'm going to take that back to the team because I think that'll make people feel a little better. So I do have another question from the audience I'll pose quickly. Please address NoSQL solutions like MongoDB, JSON queries. If you could talk about those.
Adam Kamor (22:38):
Go ahead, Mark.
Mark Brocato (22:40):
Well, let's see. Mockaroo does have some built-in data types specifically for Mongo, like the ID type that you generate right there. JSON output is one of the most popular types of output. And as of recently, you can now link JSON data sets that you generate in Mockaroo to further schemas that you generate in Mockaroo. So the relational integrity is there as well. I think JSON and CSV are maybe 50/50 in terms of usage on Mockaroo.
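Linking one generated dataset to another so that references stay valid can be sketched as follows. This is a generic illustration of referential integrity in generated data, not Mockaroo's implementation. The function names, field names, and value ranges are assumptions.

```python
import json
import random
from typing import Dict, List


def generate_customers(n: int) -> List[Dict]:
    """Generate the parent dataset first so its ids can be referenced later."""
    return [{"id": i + 1, "name": f"customer_{i + 1}"} for i in range(n)]


def generate_orders(n: int, customers: List[Dict], rng: random.Random) -> List[Dict]:
    """Generate child rows whose customer_id always points at an existing customer."""
    customer_ids = [c["id"] for c in customers]
    return [
        {
            "id": i + 1,
            "customer_id": rng.choice(customer_ids),   # sampled from the parent dataset
            "total": round(rng.uniform(5, 500), 2),
        }
        for i in range(n)
    ]


rng = random.Random(7)
customers = generate_customers(5)
orders = generate_orders(20, customers, rng)
print(json.dumps(orders[:2], indent=2))
```

Sampling foreign keys from the already generated parent rows, rather than generating them independently, is what keeps every order attached to a real customer.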
Adam Kamor (23:13):
So on the Tonic side, we have full support for Mongo. So that means you can connect Tonic directly to your production Mongo database or to a copy of it, and apply transformations on various paths in your JSON documents. And then Tonic will generate a new Mongo database that has all the same data as production, but the specified paths will be de-identified according to whatever transformations you've selected.
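Applying transformations at specific paths in a JSON document can be sketched like this. It is a conceptual illustration in plain Python, not Tonic's Mongo connector. The dotted-path syntax, the function names, and the sample document are assumptions, and the two transformations shown are placeholders rather than privacy guarantees.

```python
import copy
from typing import Any, Callable, Dict

Transform = Callable[[Any], Any]


def apply_at_path(doc: Any, path: str, transform: Transform) -> None:
    """Apply `transform` in place at a dotted path such as 'patient.name'.

    Lists are traversed element by element; missing keys are skipped.
    """
    head, _, rest = path.partition(".")
    if isinstance(doc, list):
        for item in doc:
            apply_at_path(item, path, transform)
    elif isinstance(doc, dict) and head in doc:
        if rest:
            apply_at_path(doc[head], rest, transform)
        else:
            doc[head] = transform(doc[head])


def deidentify(doc: Dict, transforms: Dict[str, Transform]) -> Dict:
    """Return a transformed copy; paths that are not listed keep their values."""
    result = copy.deepcopy(doc)
    for path, transform in transforms.items():
        apply_at_path(result, path, transform)
    return result


source = {
    "patient": {"name": "Jane Roe", "ssn": "123-45-6789"},
    "visits": [{"date": "2022-01-03", "notes": "follow-up"},
               {"date": "2022-02-01", "notes": "annual exam"}],
}

output = deidentify(source, {
    "patient.name": lambda v: "Patient A",      # placeholder replacement
    "patient.ssn": lambda v: "000-00-0000",     # placeholder replacement
    "visits.notes": lambda v: "REDACTED",       # applied to every element of the list
})
print(output)
```

The source document is left untouched, and the output keeps the same shape, which is what lets an application read the de-identified copy without changes.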
Chiara Colombi (23:39):
Great, thanks. We've been talking about third party solutions. Let's narrow down that category to what I think is, given some events earlier in the year, a rather timely and possibly even controversial category, that of open source tools. So I'd like to hear if either of you were impacted by the Log4j vulnerability. We could also talk about, if we wanted to, what happened with Faker. But first, just generally, what are your thoughts about relying on open source software within your platforms?
Adam Kamor (24:09):
To me, the answer is very similar to, do we go with third party tools when we can? Yeah, we go open source whenever humanly possible. Most of the database drivers that we use for connecting to the various databases we support are open source. Our stack itself is primarily .NET, which is now open source through Microsoft.
Adam Kamor (24:35):
We were not affected by Log4j in any serious way, just because we don't really run any Java. For a short period of time, we were using a log shipping solution that I believe used Log4j in it, but we haven't had that in any version of the product for well over 18 months, I think. So we weren't really impacted in any serious way, I'd say.
Mark Brocato (25:02):
Yeah, I remember previous jobs were Java based and Log4j is such a staple there. I imagine the...
Adam Kamor (25:07):
Oh yeah, it's everywhere.
Mark Brocato (25:09):
Wide-ranging and will be rippling through time for a while, I think. Fortunately, Mockaroo is Ruby on Rails. So it wasn't susceptible to that particular problem. On another project I was working on, we were hit by not Faker, but the guy...
Mark Brocato (25:29):
Mark Brocato (25:53):
He was angry that Fortune 500 companies were using his code and not paying him for it. It caused a certain amount of panic. And there have been a few of these things over the last year. Remember, going back a few years ago, there was left-pad. The guy made this one-line library, then pulled it down off of npm.
Adam Kamor (26:09):
Yeah. It broke everything.
Mark Brocato (26:12):
Yeah. So it does seem like the community's becoming a little bit more unhinged in the last few years. We are all incredibly lucky to have this amazing driving force of open source software to build on. I hope it keeps going that way. But I understand people's worries about open source.
Mark Brocato (26:38):
When you really think of it, we're all like dancing on the Titanic right now. Every day we push up code, that CI/CD runs and it runs an npm install and it pulls down like a thousand libraries, and you hope nobody did anything wrong with those libraries and hope you're locking your files correctly and somehow people didn't figure out how to get around that.
Adam Kamor (26:57):
Mark, what do you think of the new... Well, it's been at least a year now, the GitHub feature, where I think you can donate to open source projects directly through GitHub now, can't you?
Mark Brocato (27:04):
Yeah. I have not been a part of an effort that's done that yet. But I will say that we've used this product called Snyk. Do you guys use Snyk at all?
Adam Kamor (27:19):
Is it S-N-Y-K?
Mark Brocato (27:20):
Adam Kamor (27:21):
No, we don't use it right now.
Mark Brocato (27:23):
So we used Dependabot for a while, which is pretty easy to use in GitHub. And it's very kind of overzealous at pointing out every possible thing you could upgrade and then you upgrade them all and you break everything. Snyk is like much more judicious.
Mark Brocato (27:37):
So if you're worried about open source software vulnerabilities, I find Snyk is a good balance between just like crying wolf all the time and actually providing you with meaningful feedback. And they have a module in there that'll actually inspect your code for vulnerabilities. And it finds like some pretty shocking things.
Adam Kamor (27:53):
Oh yeah. We actually use Dependabot for dependencies. Interesting what you've said about Snyk though. Maybe we'll give it a go. We use SonarCloud for scanning our own code base. And it does various types of static analysis, including security vulnerabilities, which it typically points out good things, I'd say. There have been times where we're like, "No, that's actually not an issue," and you can just dismiss it, then it goes away forever. But it certainly points out issues as well. It helps.
Chiara Colombi (28:25):
Yeah. It's interesting. I feel like just going back quickly to Log4j, the only way we were impacted was a bunch of customers being like, "What do we need to do to resolve this?" And we were like, "Oh, nothing."
Adam Kamor (28:34):
Nothing. It was great. I know. It was real nice.
Chiara Colombi (28:37):
Mark Brocato (28:38):
People were reaching out to me the next day and they were like, "Does Mockaroo use Log4j?" And they must have had to reach out to every one of their thousand...
Adam Kamor (28:45):
Oh yeah. A hundred percent. I mean, there are tools on the market that will scan shipped binaries for what libraries and whatnot they're using. Tonic is typically a tool that is run on the customer's prem and we just ship Docker containers. And many of our customers actually scan our Docker containers at every release, looking for known vulnerabilities that will...
Adam Kamor (29:09):
And we're also scanning them before we release. So typically, nothing is found. But things have been found in the past and that would be a good time where they would be able to confirm whether or not we're using Log4j. Save everyone an email chain, I suppose.
Chiara Colombi (29:22):
Yeah. So I want to switch gears a little bit. I can imagine in our space, the scale of a user's data, we've already talked about this a little bit, is going to put a certain strain on performance, which would require a significant amount of optimization. Can you offer any insight around optimizing platform performance for the enterprise and large scale deployments?
Adam Kamor (29:44):
Yeah. Performance is something near and dear to my heart. It's one of the things that I very much like to be involved in at Tonic and still actually actively write code in that area, even though not as much as I used to. I'll give two good rules for performance optimization.
Adam Kamor (30:01):
One, don't do it at all until everything works. And two, only do it after you've actually profiled running code. Trying to optimize things without actually looking at performance traces and performance profiles I think is like, and I want to really not mince words here, I think it's really a fool's errand.
Adam Kamor (30:20):
You can't optimize unless you know where the actual hotspots are. And there might be slow pieces of code. But if they're not the actual bottleneck, optimizing them doesn't do anything for you. You're really just wasting time and adding complexity.
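As a rough illustration of the "profile before you optimize" rule, here is a minimal sketch using Python's standard cProfile module. It is not Tonic's built-in profiler; the job and the function names (dedupe_rows, format_rows, run_job) are made up for the example. The point is only that the profile, not intuition, shows which step dominates.

```python
import cProfile
import pstats


def dedupe_rows(rows):
    """Deliberately quadratic: membership tests on a list are linear scans."""
    seen = []
    for row in rows:
        if row not in seen:
            seen.append(row)
    return seen


def format_rows(rows):
    """Cheap step that might look worth tuning but is not the bottleneck."""
    return [str(row) for row in rows]


def run_job(n=5000):
    """Hypothetical job: build some rows, dedupe them, then format them."""
    rows = [i % 1000 for i in range(n)]
    return format_rows(dedupe_rows(rows))


def profile_job():
    """Run the job under cProfile and return cumulative seconds per function name."""
    profiler = cProfile.Profile()
    profiler.enable()
    run_job()
    profiler.disable()
    stats = pstats.Stats(profiler)
    # stats.stats maps (file, line, name) ->
    # (primitive calls, total calls, own time, cumulative time, callers)
    return {key[2]: value[3] for key, value in stats.stats.items()}


if __name__ == "__main__":
    timings = profile_job()
    for name in ("dedupe_rows", "format_rows"):
        print(f"{name}: {timings[name]:.4f}s cumulative")
```

In this toy job nearly all of the time lands in dedupe_rows, so tuning format_rows would add complexity without moving the total.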
Adam Kamor (30:31):
So the way we solve this at Tonic is two-fold. We actually ship Tonic with a built-in profiler. So when customers are having performance issues, we can actually run a profile on their site, using their infrastructure with their data and get like a very accurate profile of where things are slow.
Adam Kamor (30:49):
And then we can use that to make actual educated decisions on where to optimize the code. And then recently we've actually started shipping tooling within the product that is actually constantly running profiles. So we can look at things at like a more holistic level, like, oh, how does this release differ from previous releases in terms of hotspots, et cetera? And that's really helped us up our performance game.
Mark Brocato (31:11):
Yeah, it's a great point. I've said that many times to engineers who work for me, that performance optimization is really kind of a harsh mistress in that often it doesn't make sense why something got slower or faster. That's why measuring is like really critical. Be numbers based and don't be emotional about it.
Mark Brocato (31:35):
For example, I heard recently that Ruby 3 introduced a really good just-in-time compiler optimizer. But actually it has some pretty poor results for Rails, which is like the biggest library for Ruby, obviously. So you should always measure before and after you do anything.
Mark Brocato (31:58):
Don't assume that you actually understand all of the chaotic effects that are lending themselves to performance. For instance, sometimes because of just-in-time optimizations, the first time you run something, it'll be much, much slower. But if you're taking advantage of those, it'll get much, much faster.
Mark Brocato (32:16):
So a small change you make might actually make the first one slower, but then subsequent runs much faster. So which do you actually need? And then another piece of advice I would give is do your best to decouple the code from infrastructure, because you want to make use of scalable elastic infrastructure as much as possible in this day and age.
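The first-run versus later-runs point can be sketched with a small timing helper. This is an assumption-laden stand-in: CPython's warm-up here is a lazily built lookup table rather than a just-in-time compiler, and the Scorer class and measure function are hypothetical names. What it shows is why each run should be timed separately instead of being folded into one average.

```python
import statistics
import time


class Scorer:
    """The first call pays a one-time warm-up cost; later calls reuse the result."""

    def __init__(self):
        self._lookup = None

    def score(self, value):
        if self._lookup is None:
            # One-time cost, standing in for JIT compilation or cache warm-up.
            self._lookup = {i: i * i for i in range(200_000)}
        return self._lookup[value % 200_000]


def measure(func, arg, runs=20):
    """Time every call on its own so the first run is not averaged away."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        func(arg)
        durations.append(time.perf_counter() - start)
    return {"first": durations[0], "steady": statistics.median(durations[1:])}


if __name__ == "__main__":
    result = measure(Scorer().score, 42)
    print(f"first run: {result['first']:.6f}s, steady state: {result['steady']:.6f}s")
```

Whether the first-run number or the steady-state number matters depends on the workload: a short-lived job mostly sees the first, a long-running service mostly sees the second.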
Mark Brocato (32:40):
Amazon and all the cloud providers have forced prices down so low. In a day of inflation where everything is going up in price, cloud computing really isn't. And so if you make choices in your software that assume any kind of configuration of number of processors or space on disk or anything about the hardware layer on up, you can paint yourself into a corner.
Mark Brocato (33:08):
I was fortunate enough to lock into a setup in Mockaroo where I could scale out across as many processors as I could pay for on AWS. And customers can do that as well. So customers that have much larger needs than what I can accommodate on mockaroo.com, they can run their own infrastructure, spin up 20 48-core servers, generate data across them and then shut them all down. And it costs a dollar or two. So decouple from the physical world as much as possible. Decouple software.
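One way to read "decouple from the physical world" in code is to split work by row count alone and discover the worker count at run time. The sketch below is a guess at the general shape, not Mockaroo's implementation; generate_chunk, plan_chunks, and generate_rows are hypothetical names, and the fake "age" column is only there to have something to generate.

```python
import os
import random
from concurrent.futures import ProcessPoolExecutor


def generate_chunk(task):
    """Generate one chunk of fake rows, seeded per chunk so output is reproducible."""
    chunk_index, start, count = task
    rng = random.Random(chunk_index)
    return [(start + i, rng.randint(18, 90)) for i in range(count)]


def plan_chunks(total_rows, chunk_size):
    """Split the job by row count only, never by how many CPUs happen to exist."""
    return [
        (index, start, min(chunk_size, total_rows - start))
        for index, start in enumerate(range(0, total_rows, chunk_size))
    ]


def generate_rows(total_rows, chunk_size=10_000, workers=None,
                  executor_cls=ProcessPoolExecutor):
    """Fan the chunks out over however many workers the current machine offers."""
    workers = workers or os.cpu_count() or 1
    with executor_cls(max_workers=workers) as pool:
        chunks = pool.map(generate_chunk, plan_chunks(total_rows, chunk_size))
        # map() yields results in submission order, so rows stay ordered.
        return [row for chunk in chunks for row in chunk]
```

Because the chunk plan and the seeds do not depend on the hardware, the same rows come back on a laptop or on a large server; only the wall-clock time changes. When calling generate_rows with the default process pool from a script, the call should sit under an `if __name__ == "__main__":` guard on platforms that spawn worker processes.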
Adam Kamor (33:40):
Mark, the point about separating infrastructure from the code is nice. At Tonic, and this is just food for thought because I think it's interesting, we don't have that exact luxury, unfortunately. Because the beginning of any Tonic job is reading from the source database, the customer's database, and the end of the job is writing to a database.
Adam Kamor (34:05):
The performance bottlenecks within Tonic jobs are often actually the infrastructure itself. Tonic scales out very well with CPUs in terms of actually processing the data, applying the transformations. But you can optimize that all day long; if the output database is slow to write to, then making those performance optimizations doesn't matter.
Adam Kamor (34:24):
And you actually are infrastructure constrained at that point, which is why it comes back to that profiling conversation. Because 9 times out of 10 when we profile, we find that we're just like disk or network constrained. And then it just becomes a discussion of, okay, well, do you mind upping the instance size, for example, like on an RDS instance or something like that? I mean, yeah, it gets complicated, I'd say. Mark, we can't hear you.
Mark Brocato (34:55):
Yeah. Especially with RDS, there's so many knobs to turn there in terms of like dedicated IOPS and all that stuff. They make a dramatic difference on the performance of your database.
Adam Kamor (35:04):
They really do. Something that we've started doing recently is starting to like introspect a bit more into like the source and output databases that Tonic is operating on. So we can kind of like draw better conclusions around where we actually are constrained. And yeah, it gets so complicated.
Adam Kamor (35:22):
That's one thing. And then imagine like, okay, is Tonic installed in the same region as the databases? Are the databases in the same region? How many network hops does it take? Everything matters when you're talking performance.
Chiara Colombi (35:39):
So there's a question somewhat related to infrastructure around how both of the tools work with CI/CD environments that need to stand up databases of test data as part of the CI. Could you both speak to that?
Adam Kamor (35:55):
Yeah. I'll speak for Tonic. It works very well, because Tonic outputs for you not data, it outputs a database filled with data. Once you have that database, you can do with it whatever you like.
Adam Kamor (36:06):
You can take a snapshot of it and restore it programmatically during a CI/CD run. You could even trigger the job itself during a CI/CD run using our REST API. There's many ways to integrate Tonic with like the various workflows in your organization.
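The snapshot-and-restore pattern can be sketched with SQLite from the Python standard library. This is only a stand-in: a real setup would restore whatever database engine the team uses, and nothing here calls Tonic's API. The table, the fake names, and the function names (build_snapshot, restore_for_test) are assumptions for the example.

```python
import sqlite3


def build_snapshot():
    """Stand-in for a de-identified output database produced ahead of the CI run."""
    snapshot = sqlite3.connect(":memory:")
    snapshot.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    snapshot.executemany(
        "INSERT INTO users (name) VALUES (?)", [("Ada Fake",), ("Bob Fake",)]
    )
    snapshot.commit()
    return snapshot


def restore_for_test(snapshot):
    """Give each test run its own copy so tests cannot pollute the snapshot."""
    fresh = sqlite3.connect(":memory:")
    snapshot.backup(fresh)  # copies the snapshot's contents into the new database
    return fresh


if __name__ == "__main__":
    snap = build_snapshot()
    test_db = restore_for_test(snap)
    test_db.execute("DELETE FROM users")  # a destructive test step
    test_db.commit()
    print(snap.execute("SELECT COUNT(*) FROM users").fetchone()[0])
```

Each test run works on a throwaway copy, so a destructive test leaves the snapshot untouched for the next run.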
Mark Brocato (36:22):
Yeah. I think Tonic might be the better answer here. So Mockaroo has quite an extensive API, I'll say that. You can create your own mock API endpoint. So like if you are building a UI and the API that it would be consuming isn't ready yet, you can mock it in Mockaroo.
Mark Brocato (36:42):
And then all the data generation that you do using the website, you could also do via the API as well. And you could build your schema, save it, and then call it from the API if you want. Or you can just create the schema entirely from the API. So it's got a fully functional API. That's basically what you get.
Mark Brocato (37:02):
So you'd have to make use of that in your CI/CD to generate data, download it and load it into your database. It leaves quite a few steps. If what you're aiming to do in CI/CD is like fire up a real database and then run your tests against it, Mockaroo leaves some steps to be done there. Whereas I think Tonic provides you a much more direct integration.
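The "download it and load it into your database" step might look roughly like the following in a CI script. The download itself is left out: DOWNLOADED_CSV is a hard-coded stand-in for whatever the data generation API returns, and load_csv is a hypothetical helper. SQLite is used only so the sketch runs anywhere.

```python
import csv
import io
import sqlite3

# Stand-in for a CSV payload fetched from a data generation API earlier in the pipeline.
DOWNLOADED_CSV = (
    "id,first_name,email\n"
    "1,Ada,ada@example.com\n"
    "2,Bob,bob@example.com\n"
)


def load_csv(conn, table, csv_text):
    """Create a table from the CSV header and insert every data row into it."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames
    # Identifiers are quoted but not validated: table and header must be trusted input.
    column_list = ", ".join(f'"{column}"' for column in columns)
    placeholders = ", ".join("?" for _ in columns)
    conn.execute(f'CREATE TABLE "{table}" ({column_list})')
    rows = [tuple(row[column] for column in columns) for row in reader]
    conn.executemany(
        f'INSERT INTO "{table}" ({column_list}) VALUES ({placeholders})', rows
    )
    conn.commit()
    return len(rows)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    print(load_csv(conn, "people", DOWNLOADED_CSV), "rows loaded")
```

All values load as text here; a real pipeline would also need column types, foreign keys, and load ordering, which are the extra steps mentioned above.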
Adam Kamor (37:21):
That's right. I noticed there's actually a few more folks on the call now than were on the call at the beginning. So Chiara, maybe it'll be worth, again, mentioning the distinction between Tonic and Mockaroo and like the different use cases for which they're typically used.
Chiara Colombi (37:37):
Yeah, of course. I can do my quick high-level overview and then if you want to speak to the use cases.
Adam Kamor (37:43):
Yeah. I think the high-level overview from the beginning would be great.
Chiara Colombi (37:46):
Sure. So basically, what I said at the beginning is we're both in the space of data synthesis, both platforms, but the difference is that Mockaroo is designed to create fake data from scratch. So you are able to set rules. It's rules-based data generation. You define the rules to dictate how that data looks.
Chiara Colombi (38:07):
You can say, "I want ratios." If you're generating... You gave a really great example one time, Mark. You showed generating gasoline prices over the months of the year, and you can say, well, it'd be higher during these months versus these months. I thought that was a great example.
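A rule like "gasoline costs more in these months" can be written down in a few lines. The sketch below is not Mockaroo's formula syntax; the base price, the uplift table, and the gasoline_prices function are invented for illustration.

```python
import random

# Hypothetical rule: a base price plus an uplift for the summer months.
SEASONAL_UPLIFT = {6: 0.40, 7: 0.50, 8: 0.40}


def gasoline_prices(year, base=3.00, noise=0.05, seed=0):
    """Generate one fake price per month, higher in the months named in the rule."""
    rng = random.Random(seed)  # seeded so the same rules give the same data
    rows = []
    for month in range(1, 13):
        price = base + SEASONAL_UPLIFT.get(month, 0.0) + rng.uniform(-noise, noise)
        rows.append({"year": year, "month": month, "price_per_gallon": round(price, 2)})
    return rows


if __name__ == "__main__":
    for row in gasoline_prices(2022):
        print(row)
```

The rule carries the shape of the data (summer above the rest of the year) while the random noise keeps the rows from looking mechanical.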
Chiara Colombi (38:23):
So anyway, yeah, you're creating data from scratch. You're just defining what data types you need and what that data needs to look like, how it needs to be distributed. Whereas Tonic on the other hand is a platform that enables you to create fake data based on existing data.
Chiara Colombi (38:40):
Typically, that's based on your production data, basically. So like Adam was saying, you connect Tonic to your production databases. You can connect it to more than one database and you use Tonic's generators as kind of building blocks to mimic your real-world data.
Chiara Colombi (38:54):
So the data that you get, the fake data, it really is designed to look and behave like production data. But at the same time, a big step of the Tonic process is de-identifying the data. So removing all the sensitive PII and PHI and replacing it with fake data that gives you the same feel.
Adam Kamor (39:13):
That's right. And to the point of the CI/CD question, I mean, the output of a Tonic job is a new database that is schematically identical to the production database. Most of the columns and rows are in fact unchanged. It's just the columns with actual sensitive information are replaced with fake but realistic data.
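The idea of keeping the schema and the non-sensitive columns while swapping out sensitive values can be sketched as below. This is not how Tonic's generators work; the column names, the fake name list, and the hash-based mapping are assumptions, and an unsalted hash like this is an illustration rather than a privacy guarantee.

```python
import hashlib

FAKE_NAMES = ["Alex Rivera", "Sam Patel", "Jordan Kim", "Casey Novak"]
SENSITIVE_COLUMNS = {"name", "email"}  # hypothetical choice of sensitive columns


def _bucket(value, size):
    """Map a string to a stable integer in [0, size)."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % size


def fake_value(column, original):
    """Return a realistic-looking replacement that is stable for a given input."""
    if column == "name":
        return FAKE_NAMES[_bucket(original, len(FAKE_NAMES))]
    if column == "email":
        return f"user{_bucket(original, 1_000_000)}@example.com"
    raise ValueError(f"no fake generator for column {column!r}")


def deidentify(row):
    """Keep the same keys and non-sensitive values; replace only sensitive columns."""
    return {
        column: fake_value(column, value) if column in SENSITIVE_COLUMNS else value
        for column, value in row.items()
    }


if __name__ == "__main__":
    print(deidentify({"id": 7, "name": "Grace Hopper", "email": "grace@corp.test", "plan": "pro"}))
```

Because the same input always maps to the same fake value, a person who appears in several tables still lines up across them after de-identification.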
Adam Kamor (39:30):
We have customers, for example, who have switched from using production data to Tonic data in their development environments and their developers don't even know. They made the switch, waited a few weeks, and no developer said anything. That's how realistic the data can get. So just to kind of put a pin on that. Thank you, Chiara, for doing that.
Chiara Colombi (39:49):
Mark, was there anything you wanted to add as well?
Mark Brocato (39:52):
Yeah, it's funny. We kind of draw some parallels for our build versus buy argument in that... So there have been many customers of Mockaroo that also became customers of Tonic and use both tools concurrently. I think that's one of the luxuries of the problem I'm solving with Mockaroo versus the problem that Tonic solves.
Mark Brocato (40:15):
Mockaroo is necessarily very easy to pick up and play because there's no dependencies. You don't have to give it access to production data. You can just start playing around. So that part of the engineer is like, "I just want to start building something." Mockaroo tickles that fancy a little bit, that you can very easily just start generating some data.
Mark Brocato (40:31):
And then I imagine some of the customers that buy will be like, "Hey, this is proving to us that synthetic data generation is very useful. But if we continue going down this road, it's going to take us forever." And so they bring in Tonic and then it just does the whole database and they're done. I imagine that's one kind of path to upgrade as well.
Chiara Colombi (40:55):
There's another facet of scaling up that we haven't covered yet that I also wanted to talk to, and that's the topic of scaling up your teams and your companies. Because Mark, you've got Mockaroo, but you're also a VP of engineering. You've worked across startups and you've seen them rapidly grow and Tonic is also on that same path. So I was wondering if either of you have words of advice in that department.
Mark Brocato (41:23):
Yeah. I spent most of my career in startups where there was one engineering team, like ranging from a few people to maybe like 15 people. But I have recently served in a role that was much, much larger, where it's a hundred plus engineers.
Mark Brocato (41:41):
The biggest difference I think as you scale up is you have to spend so much time on communication that you want to make sure that you have really good documentation and regular meetings on things like roadmaps and the reason behind why we're doing what we're doing.
Mark Brocato (41:58):
What you aim to create I think is a lot of alignment so that people can act autonomously in a large organization. Because at a certain point beyond maybe 10 or 15 people, you need to break up the team. And then you have two teams and those two teams need to remain in sync.
Mark Brocato (42:14):
They either remain in sync by meeting constantly and stepping on one another or somehow being aligned and acting autonomously in each other's best interests towards the same goals. And so that's the one piece of advice I would give for people that are moving up in the engineering world and are responsible for more and more people, is to focus on communication and alignment because it allows small organizations to work in parallel efficiently.
Adam Kamor (42:41):
Mark, I'm going to make an analogy to something you said earlier. Earlier, you said the design for your early customers that are smaller and simpler is not necessarily the design for the later customers.
Adam Kamor (42:55):
The processes that you put in place for small teams do not scale for larger teams, because it becomes harder to communicate across a larger number of people and to keep everyone kind of marching towards the same goal or really to make sure everyone knows what the goal that they're marching towards even is, to the point that Mark made.
Adam Kamor (43:15):
So to add on to what Mark said, when we started Tonic, there were four founders: myself, Andrew, who unfortunately wasn't able to be here today, and then two others, Ian and Carl. For the first year and a half or so, it was really just a small team, four, five, six people. Chiara, you were there for it.
Adam Kamor (43:35):
And we didn't really hire engineers until a good deal later, just a little bit before COVID started in March of 2020. And I very much had this idea of like if you build it, they will come. Meaning we're a cool startup, we're doing cool stuff. We're going to be flooded with world-class engineers that are all going to be knocking down the door to work here. And boy was I wrong.
Adam Kamor (44:00):
In terms of scaling up the team, I think Mark had some great points. But in terms of actually getting the people so you actually have a bigger team, I didn't appreciate how hard it was to recruit. And I didn't realize that like most of my job, not most, but a lot of my job as a founder early on, but even still today is actually centered around recruiting and trying to attract top talent.
Adam Kamor (44:21):
So to that point, if anyone on this call is an engineer and you're looking for a very exciting engineering role, you should definitely reach out to us after the call. I do this everywhere I go. I let everyone know that we're constantly hiring for engineers and other roles.
Mark Brocato (44:36):
Yeah. Plus one on the recruiting thing. There's an interesting struggle that I've faced too, which is now it's easy to hire people everywhere. Many businesses are willing to work with people as independent contractors. Everybody just works remotely anyway. So it kind of doesn't matter.
Mark Brocato (44:55):
The only thing you're fighting against is time zones. And at a certain size, I was able to put together a team where everybody was in a different country. We had a 10-hour time zone span between Eastern Europe and West Coast US and everybody was on equal footing.
Mark Brocato (45:13):
And at a scale of 10 engineers, people could be flexible enough with their schedules to have enough overlap to make it work. But as you scale up, that does become more painful. When you're a small startup, you're attracting a different kind of people with different motivations than a larger business.
Mark Brocato (45:33):
And so people are a little bit less willing to be that flexible with their schedule and they feel that it's harder for them to have a direct impact in the business. So they're not willing to give up so many different things in their life and live this weird lifestyle of stretching time as you scale up.
Adam Kamor (45:49):
Yeah. That's right.
Chiara Colombi (45:52):
Stretching time. Yeah. I hear that. Well, those are all the questions that I have specific to scaling. If anybody has any more questions, do feel free to ask them in the chat in the Q&A. I'm still looking out for those. Thanks for the questions that have come through.
Chiara Colombi (46:08):
My kind of final question, given that there may be folks in the audience who are enterprise companies and are looking specifically for this solution of test data generation. What is one thought, what is the key takeaway that you would like them to leave with today?
Adam Kamor (46:25):
Mark, would you like to go first or would you like me to?
Mark Brocato (46:26):
Adam Kamor (46:30):
Sure. I'm going to give multiple thoughts, which I know is not what I'm being asked to do, but I'm going to do it anyways. The first thought is, think back to the onset of this webinar, where we talked about kind of the pitfalls of trying to do this yourself. Really be mindful of that.
Adam Kamor (46:44):
If you try to build your own applications for generating test data, it can be very painful for the reasons that we previously stated. There are tons of great providers out there, Mockaroo, Tonic, others.
Adam Kamor (47:01):
If what you need is a de-identified version of your production database that is really identical to production, but the sensitive columns have been replaced with fake and realistic data, then you should definitely give Tonic a try. And the website is www.tonic.ai. Or just reach out to us at email@example.com and say you heard about us on the webinar and we can also get in touch with you that way.
Mark Brocato (47:25):
Yeah, I think I'll reiterate some things I said earlier, and maybe I'll add a new phrase that I heard somewhat recently that I really like, which is called run into the spike. If you see something dangerous that you'd rather not deal with, go against your instincts and run towards it instead of veering away from it.
Mark Brocato (47:44):
And that is be skeptical of bike shedding on tooling. Spend your time on what is differentiated and valuable for your business and rent or buy everything else, if you can. Because in the end your customers are, if you're lucky to have customers, paying for the one thing that makes you different from all the other thousands of businesses out there.
Mark Brocato (48:12):
It's so hard, I'm sure the Tonic founders know this, to come up with an idea for a business that hasn't been done before. And most of us are lucky enough to work at a business that has some idea, it's found some place in the world. Your spidey sense should tingle when you're not spending all of your time on that thing.
Mark Brocato (48:32):
You're spending your time on building tools or other things that aren't part of your core business. It's not a great use of time generally. But at the same time, I do believe in sharpening the saw. So you should have good tools, but you should take a little bit of time to find if they're just out there to buy rather than having to build from the start, because most things are available, most ideas have already been thought of and they're actually quite cheap.
Chiara Colombi (48:58):
Awesome. Thanks to both of you. We do have a couple more questions that came in through the chat. The first one I'd like to ask is I'd love to hear about anyone using these products with Salesforce or other SaaS platforms.
Adam Kamor (49:15):
I know, Mark, you'll also be able to talk about Salesforce, I think. So as a rule with Tonic, as long as you can point Tonic at a database that we support, then Tonic can operate on it. Any tool that actually gives you access to the underlying database used to store the data for the application, Tonic can operate on.
Mark Brocato (49:38):
Yeah. Salesforce, the company, was the first enterprise user of Mockaroo, and it's been widely used at Salesforce for a long time. And I know that certain partner organizations of Salesforce are also wide users of Mockaroo. But despite that, I actually don't have anything specifically built into Mockaroo to target the Salesforce community.
Mark Brocato (50:02):
It's always been a good fit for the kinds of things that you build with Salesforce. I mean, there's a lot of data types around customers and relationship management and organizations and personal data and things like that. But there's nothing hidden in there that's very specific to Salesforce.
Chiara Colombi (50:19):
Right. The other question was, and I can speak to this, asking around why a company would need fake data to begin with if they already have big data. It was also answered pretty well in the chat. But just to reiterate, oftentimes it is just a matter of access.
Chiara Colombi (50:32):
You can't grant everyone access to sensitive production data if you've got customer information in there, personal PII, there's GDPR. There's CCPA here in California that'll regulate how that data can be used and it eliminates the option of using it for testing and development of software.
Chiara Colombi (50:54):
It comes down to data privacy and not only doing what's right by the law, but doing what's right by just being a good, respectful company of people's information.
Adam Kamor (51:05):
Mark Brocato (51:06):
Yeah. I think it's way more common than not that engineers are not allowed to touch the production database without a very high level of clearance and certainly cannot copy the data out of there anywhere for any reason.
Adam Kamor (51:18):
Yeah. It's becoming more and more true, that statement, I would say. You're seeing fewer and fewer companies that are actually using production data for non-production purposes now.
Mark Brocato (51:29):
Yeah. Especially with GDPR, even things that 20 years ago we would've thought were harmless are no longer considered harmless at all.
Adam Kamor (51:39):
No. Absolutely not. No. And GDPR even calls out specifically what you can and cannot use your users' data for. It has to be just the purposes for which it was collected in the first place, which is to say the application and really nothing else.
Chiara Colombi (51:54):
Awesome. Well, thank you both. This has been a pleasure, as always. So much fun to hear you guys talk about this. And thanks to everyone for joining and for all the great questions along the way. If you'd like to learn more about Mockaroo, head on over to mockaroo.com, such a great resource with a very well-deserved fan base. There's also a great community forum as well.
Chiara Colombi (52:14):
And if you'd like to learn more about Tonic, visit tonic.ai, where you can book a demo directly with a member of our team on our website. You can also request a two-week sandbox, which is basically our free trial to take Tonic for a spin. You can connect to your own data if it's in MySQL or Postgres. Either way, we'd love to hear from you. So do reach out. You can also contact us at firstname.lastname@example.org. All right. Thanks to everyone for joining us. Thanks so much to our awesome speakers. It was a great conversation.
Mark Brocato (52:41):
Thank you everybody.
Adam Kamor (52:43):
Chiara Colombi (52:44):