To build or not to build. That is the question.
Join us for a conversation between the CTOs of Mockaroo and Tonic as they share their thoughts on the build vs. buy debate with a unique insider perspective.
Mockaroo Founder and CTO Mark Brocato originally built his product as an in-house solution for his engineering team at a healthtech company. Now, he's selling that solution to other teams who decided to go the buy route. Tonic.ai Co-founder and CTO Andrew Colombi launched Tonic after running into the same build-in-house conundrum with test data throughout his career.
They'll explore the pros and cons of each approach in the realm of data generation and speak to times in which they've chosen to buy dev tool solutions instead of building in-house.
Before you choose to suffer the slings and arrows of custom scripts, tune in to hear what two industry pros have to say.
Chiara Colombi (00:07):
Hello, everyone. Welcome to today's conversation about building versus buying in the context of data generation. I'm your host Chiara from Tonic.ai, and I'm very glad to be joined again by our speakers: Mark Brocato, CTO and founder of Mockaroo, and Andrew Colombi, co-founder and CTO of Tonic.ai. We've got a lot of great questions already lined up to ask them, but we'd also love to hear your questions, so please drop them in the Q&A. I'll also be looking for them in the chat as well; either place, I'll keep my eyes out for them and share them with the speakers as they come up. Cool. I'm going to kick things off with a question for Mark. For those of you who have joined our webinars in the past with Mark, you may have already heard, like I have, kind of a high-level origin story of Mockaroo, about how you were working in health sciences or health tech?
Mark Brocato (00:54):
Yeah, it was a healthcare company, a startup a long time ago. Yeah, probably about 10 years ago now, it's almost coming up on the 10th anniversary of Mockaroo. And yeah, I was leading an engineering team there. And it was just a very difficult domain to learn. And so the QA testers who were very junior, some of them were interns, they didn't really create very meaningful data to test the application just by hand. So I created this in-house tool to make it easier to create meaningful data so that their tests were more meaningful and easier to follow from start to finish in some kind of difficult workflows. So yeah, that's the origin.
Chiara Colombi (01:34):
Cool. So yeah, I was just kind of wondering about that, what led you to build in-house versus seeking out the buy option.
Mark Brocato (01:46):
That's funny. I'm a developer, so I want to build everything. That's my natural tendency. And probably a lot of times in my career, I've gone down the path of building things that I probably shouldn't have. But it has strengthened the muscle in problem solving and building, for sure, while maybe wasting some time along the way. This was one where I actually did look around at the time. And probably still, the number two result on Google for generating fake data is generatedata.com, and it was pretty simplistic. I don't think it's really changed in the 10 years that Mockaroo's been around. And it was really low scale.
Mark Brocato (02:28):
And at the time, I was working with Ruby on Rails, which had some really great faking libraries that were part of the standard Rails way of doing unit testing. And I thought, well, my QA engineers, who were really not coders, are not going to use this library. But what if I just slapped a UI on it? I think it wasn't Faker, it was another one. But I essentially just slapped a UI on some of the things that it could do out of the box. I probably wrote the UI in Bootstrap or something like that, with a Rails backend, and threw it up there. And it was pretty useful immediately.
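For anyone curious what "slapping a UI on a faking library" wraps, the core of such a library is essentially a table of named generators that a UI can expose as field types. Here's a minimal sketch in Python rather than the Rails stack Mark used; the field types and value lists are invented for illustration:

```python
import random

# Toy stand-in for a faking library: each "type" is a function that
# returns a plausible value. Real libraries ship hundreds of these.
FIRST_NAMES = ["Alice", "Bob", "Carol", "Dave"]
DIAGNOSES = ["hypertension", "diabetes", "asthma"]

GENERATORS = {
    "first_name": lambda rng: rng.choice(FIRST_NAMES),
    "age": lambda rng: rng.randint(18, 90),
    "diagnosis": lambda rng: rng.choice(DIAGNOSES),
}

def generate_rows(schema, n, seed=0):
    """Produce n rows for a schema given as {column: field_type}."""
    rng = random.Random(seed)  # seeded, so runs are reproducible
    return [
        {col: GENERATORS[kind](rng) for col, kind in schema.items()}
        for _ in range(n)
    ]

rows = generate_rows({"name": "first_name", "age": "age", "dx": "diagnosis"}, 3)
```

A form-based UI over this is just a way of building the `schema` dict without writing code, which is roughly the leap Mark describes for non-coder QA testers.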
Mark Brocato (03:07):
But I don't know that I really did ... I didn't do the thing that a VP or CTO would do, which is go out and find all the tools and get demos. I was just too eager, as a startup, throwing things together. I've done quite a lot of that in my career. And I've seen some bad ones, too. I remember that same company, many years earlier, had actually written its own bug tracker, if you can believe that. They weren't using Bugzilla, they weren't using JIRA. I think this was before JIRA was mainstream. But they wrote their own bug tracker in ColdFusion.
Mark Brocato (03:41):
And this was an interesting thing, because when we switched, I think we switched to Bugzilla many years later, because that's the thing that you always find with build versus buy: you built it, that was fun, it works in the beginning, and then the guy who built it leaves and you're like, well, now we have to maintain this thing, aside from the things that can make us money. And that's not a good thing. And it was a perfect workflow for our company. Way better than what Bugzilla could provide, it had a lot of assumptions about how we work built in, and it was just perfectly crafted. But then maintaining it was just a nightmare. And at one point, I think the server died and no one knew how to restart it, and it was like, all right, now we're off to Bugzilla. So yeah, some crazy things there.
Andrew Colombi (04:23):
That's like abrupt evolution right there.
Mark Brocato (04:28):
Yeah. It's often, let's see what happens.
Chiara Colombi (04:30):
Did you see the chat? Someone says, "Mark: they built their own bug tracker. Everybody: he's talking about my company!"
Mark Brocato (04:40):
And it's funny, I keep threatening with every team that I'm going to go and build my own darn bug tracker again, because I don't like any of the ones that are out there, to be honest. I've used JIRA over and over again. And now the joke has been, every company I've worked for always says the problem's not JIRA, it's our setup in JIRA. But every one of them says the same thing. So maybe it is actually JIRA, I don't know. I'll get off my rant in a little bit. But we actually just started using an app called ClickUp, which I think has raised a bunch of money, and they're getting super popular. It's like an all-in-one OS for a company. And amongst the many things it does, it can be a bug tracker and track sprints and all that, and it actually does it quite well. So if anybody's interested, check out ClickUp.
Chiara Colombi (05:26):
Andrew, at Tonic, did we build our own bug tracker?
Andrew Colombi (05:29):
No, we didn't. We're too young a company, and too old as developers, to fall into that trap. We used GitHub from the start, and we use Asana a little bit as well. But no, we didn't do that. I think every developer has experienced that moment. And actually, there are some companies that can do it right, like Microsoft and Google. They have their own bug trackers, and they can get away with it, because a fraction of a percent improvement in efficiency for Google means millions of dollars. If you look at the revenue of Google, I bet you a 1% improvement in their productivity is billions of dollars. So they can spend $2 million making bug tracker software if it makes them 1% faster.
Chiara Colombi (06:17):
Oh, well, since I asked you for your take on Tonic, I'd love to get some of your background story as well. Andrew, when you were at Palantir, I know that part of the reason Tonic was founded is because your experience at Palantir, with some of the co-founders, was that you and the engineers often had to build fake data in-house. What did that look like?
Andrew Colombi (06:38):
Yeah. So when we were confronted with this question of how are you going to get the fake data? Our situation was very similar, right? We had QA that wasn't able to get really good passes in when they were doing testing, because they didn't have proper data. And then developers were making bad assumptions as they were developing the tools, because they didn't have good data. And similarly, I kind of just jumped right into how do we solve this; I didn't even consider anything else. I think that's true, by the way, of this problem in general: when you have data that you need to create for your testing, it always starts with let's build something ourselves. And I think there are a couple reasons for that. This is a bit of a tangent, I'm aware, but I think it's worth mentioning.
Andrew Colombi (07:29):
One of them is that I think people think it's easy. It sounds easy, right? We're going to build some fake data, surely that can't be that hard. And two, it feels like it's going to be very bespoke to your needs. Like, I have a MySQL database that has these triggers in it, or integrations with our ED. There are all these stories you tell yourself about how your data or your setup is different than anyone else's and it won't be that hard. That leads you to the conclusion of, well, we should build it ourselves. So I think that's one of the reasons why, when you hear our stories, Mark's and mine are very similar. It's because we're kind of coming from the same place; there are these commonalities there.
Andrew Colombi (08:14):
But going back to the Palantir story. So yeah, me and the other co-founders of Tonic decided we don't want to find these issues in the field anymore; we would like them to be surfaced earlier. And also, by the way, when we found issues in the field, the devs couldn't repro them. So even if we are finding issues in the field, the least we can do is allow the devs to repro them. And so we embarked on the journey, and it was harder than we thought it would be. But it was also more valuable than we thought it would be, which I think is kind of the interesting flip side of that coin.
Andrew Colombi (08:51):
It's like, yeah, it was harder, it took months instead of days. But the data we produced ended up being super useful, well beyond the scope of the project that we initially had scoped it for. Initially, we were like, okay, we're going to use it for this specific project, we're working with a specific client, and we faked data that was specific to that client. But it turned out that there were other clients that had similar data setups, and so the data became useful for them. And also, generally speaking, when our salespeople are doing demos for customers, just having really, really intricate, sophisticated data, even if it isn't exactly the data that the client would otherwise want to see, can be very, very valuable.
Andrew Colombi (09:34):
So I feel like, if you look at the lifetime of the project (not the data project, but rather the project we were trying to support), that project probably had about a year and a half worth of legs to it, where we were using the data that we generated for it. But the data itself got used in other projects for years; three, four, five years later that data was still being used. So that's where, yeah, it was harder than we thought it would be, but also more valuable than we thought it would be. So that kind of made it worth it.
Chiara Colombi (10:03):
And when you say that data, was it just the data you generated at one time, or the infrastructure that you created-
Andrew Colombi (10:09):
The data we generated that one time, because it was really hard to make infrastructure. As hard as it was to make the data once, building infrastructure to make that data? Get out of here. I've seen so many projects to make infrastructure for data fail internally; internal versions of that product are very, very difficult. I mean, that's why we need a company for it. And that's why we have 45 people working on it.
Chiara Colombi (10:32):
Yeah, that was actually another part of my question was how many people at Palantir? You mentioned months to do that and how many people were involved?
Andrew Colombi (10:41):
Honestly, a lot of the development was me. But then there was a lot of research into what should the data look like? That was the other person I was working with. In one project it was Ian, and in the other project it was Karl. Guess what, my co-founders are Ian and Karl.
Chiara Colombi (10:59):
Who would have guessed.
Andrew Colombi (10:59):
So yeah, so there you go.
Mark Brocato (11:00):
That's kind of the problem. The person with the instinct and drive to go create that kind of a solution is not long for maintaining that kind of a solution. They're probably going to go be the CTO somewhere, and some other poor fellow gets stuck with maintaining it.
Chiara Colombi (11:14):
And then in your case Mark, you have maintained Mockaroo, and you've built it out and it's become just more and more powerful over the years. And I'm wondering if you could guesstimate how much time you've spent? Probably-
Mark Brocato (11:30):
Yeah, so I actually looked up when this started, and it surprised me yesterday: I started development on Mockaroo in 2012. So it's actually coming up on a 10 year anniversary. I think it's something like 500 or 600 hours in total. I work in these bursts. So in the beginning, I probably didn't sleep for a little while. It's just the way I am, I try to bang the whole thing out at once. And I say, how hard could it be? And then I look up three days later, and it's like, well, it's a little harder than I thought.
Mark Brocato (12:03):
But it's that kind of dumb optimism, I think, that keeps innovation going. Probably a lot of the great things in the world, somebody started off with, how hard could it be? And then they're looking back and saying, that was ridiculous. I'm never doing that again.
Andrew Colombi (12:18):
Ignorance is bliss.
Mark Brocato (12:19):
Yeah, it really it is.
Chiara Colombi (12:22):
Could you explain some of the aspects that made it really hard? Like, what was it?
Mark Brocato (12:32):
Scaling, for one, was really hard. It's the, I don't know, 80/20 or 90/10 rule: it always feels like you're so close to having something that's viable. And then actually, sometimes when you get something that's viable, the next person says, "Well, if only I could do that, but a little bit more," and then you realize, oops, a little bit more is a lot more work. On the surface, it looks like it's going to be this wonderful solution. But then somebody wants to correlate two datasets, and it's not built to do that. Or somebody likes the 1,000 records it could generate, but they need 10 million, and it's not built to do that either.
Mark Brocato (13:11):
So once you've got something that you're happy with in place, trying to turn it into what people actually need without breaking the original thing that made it good is difficult to do in place, I think. So that took a lot of effort for sure. And then also, when you're a single person, kind of unconstrained in your choices, I made a lot of wandering choices as I was doing it, just to learn things along the way as an engineer. Mockaroo did not start out as a business. It started out as a project for me to learn things and grow my skills, and a way to vent, to do something completely different than what I was doing at the time.
Mark Brocato (13:50):
So I didn't have strict requirements or timelines or anything. So I wandered and made some choices that needed cleaning up later. At one point, I think I had React, Angular, and Backbone on the front end. It was kind of fun. Yeah, hey, don't [crosstalk 00:14:08]
Andrew Colombi (14:08):
How dare you?
Mark Brocato (14:09):
We were talking about Astro. Astro actually has a demo where you can run all of the frameworks on one page. That's actually a value proposition of it, for large companies that do micro frontends. It's actually becoming a thing now. But see, I was ahead of my time. Yeah. So, a very wandering path to where I am now, for sure.
Chiara Colombi (14:32):
Yeah. I mean, you say ahead of your time, well, if you started developing in 2012, I think you were very much definitely ahead of your time. And then it wasn't until 2014 that you launched it online. Is that right?
Mark Brocato (14:43):
It was about that. I think it was probably out there for a year on the internet, just taking PayPal donations or something. I wanted to see if anybody was actually interested in this. And then all of a sudden, there was a very big customer that came and wanted it to do some things that were well beyond its capacity. And like any smart entrepreneur out there, you just always say yes when somebody asks, can it happen, and then figure it out later. Which is what I did. And it really drove me to success; having that pressure was kind of nice.
Chiara Colombi (15:12):
So, like you said, we're almost a decade after the launch of Mockaroo and coming up on four years since the founding of Tonic. And now you're both founders and CTOs of viable fake data solutions, coming from that background of building in-house. I think that speaks volumes in and of itself. So you've kind of touched on this; like you've both said, building in-house is just the default for this type of problem, because of those considerations you mentioned, Andrew. So now that you each sell a product, how often do you speak with teams who are considering building in-house, or who have already attempted to build something in-house?
Andrew Colombi (15:48):
I would say it's every time. Every time, yeah. I mean, that's probably an exaggeration, but it's like 90% of the time. It depends on the company to a degree. If the company is really small, really young, then maybe they haven't tried yet, because they're just that young; their first step was, well, maybe we can buy it. But if the company is more than three or four years old, they're going to have something in place, or they're going to have tried something. And yeah, I think a lot of what Tonic sells to ... a lot of what Tonic competes with is build your own. Also, a lot of what it sells to is replace what you built that you don't like, for sure. Tonic has had customers that did the pilot, and then the pilot did not end up closing; it did not succeed. And then a month later, they come back, and they're like, hey, how about we do this after all. It hasn't happened a lot, it's a small handful, because we don't actually lose that many pilots; almost all of our pilots succeed. But there's been a fraction of those pilots that didn't succeed that actually came back later.
Andrew Colombi (17:02):
And it's because, I think, part of the assumption of build it yourself is, the calculus you do is: well, Tonic is going to cost X, I can hire someone to do it for Y, and they'll do exactly what we need. So maybe I pay the Y instead of the X, and I can get the thing I want. But they don't factor in ... they're just looking at Y and X, that's the only thing they're looking at. What they're not looking at is that Y might actually fail. Building it yourself is really hard, and people don't appreciate that. "Build it yourself" implies that you can build it. And actually, you might not be able to; it's very difficult.
Chiara Colombi (17:46):
Yeah. Mark, did you have something to add to that? I feel like you would recall something.
Mark Brocato (17:51):
So I have a different angle that I've been thinking about in some of my recent work, which is, they're calling this the Great Resignation, right? Maybe you guys might be exempt because you're growing so fast. But probably everybody out there in the audience has known somebody who's been key and left their job recently, especially amongst developers. And so if you're thinking about keeping a team together, I would really want the people on the team only working on the absolutely most interesting and rewarding and important things they can. And if you buy from, let's say, Tonic, Tonic's not going to quit on you. They have a contract, and they're just going to keep serving you what you need until you don't need it anymore.
Mark Brocato (18:35):
But if you have somebody in-house building it, in this day with people moving around so much, there's a very good chance that that small team or that one person ... you can wake up and they're just not there the next day. And that's one of the scariest things about being an engineering leader right now. People are finding all sorts of great opportunities, and it's great for them, but it's very hard for people running teams. So I have really championed buying lots of stuff in my career now, especially AWS cloud services, things of that nature, observability platforms, bug trackers, bug reporting, all that stuff. Because I know that stuff's not going to go away. The price might go up, service might occasionally go down, but I know I don't have to keep a team happy maintaining the software, which is eventually not going to be fun to maintain. That's just the most demoralizing thing for a manager, for a team, when people leave. So it's a very different take on it, but top of mind for me this last year.
Chiara Colombi (19:38):
It's interesting. Yeah, it makes a lot of sense. And it actually kind of ties into another question that I had for you guys, which is: an example of when you chose to build something in-house, you felt that was the better path forward, and after having done it, you feel it was worth the effort of building in-house.
Andrew Colombi (19:59):
Yeah, I mean, for the case of Palantir, yes, I do think so. With that said, at the time Tonic didn't exist. And if I could have bought Tonic, I definitely would have; it just didn't exist. The origin story of Tonic in many ways is creating that infrastructure. We talked about it earlier: was it the infrastructure or the data? And it's like, well, we didn't have infrastructure, because it was hard enough just to make the data. And so that's where Tonic is coming from: how can we create that infrastructure?
Chiara Colombi (20:35):
And just to be clear, I also mean anything outside of data generation that you chose to build in-house and felt was worth the effort that had to be put in.
Andrew Colombi (20:46):
Actually, there are probably a few people in the audience that can point me in some interesting directions on this one. There's one weird thing about Tonic, which is, we're part of a growing cohort of on-prem SaaS products, meaning that we don't have a single instance in the cloud that everyone logs into. Instead, everyone puts their own instance in their own VPC, or whatever the Google Cloud equivalent is. And what that means is that there are a lot of product offerings out there right now for normal SaaS companies, like Datadog or something like that, that don't necessarily work as well once you're an on-prem SaaS company. Like, the pricing might not work, right? Because it's like, oh, we charge per instance, and I have no control over the number of instances; a customer will take the Tonic binary and put it on whatever instances they want, and now I have to pay for every one that they create. That's a problem. Or part of the infrastructure will assume that we have more access than we do.
Andrew Colombi (21:52):
So for Tonic, we've kind of been occasionally forced to build stuff. An example of that is our logging infrastructure, which is a little bit in-house, built on top of Vector, which is a Datadog open source project. That's pretty cool. But yeah, if people have experienced this and seen good ways to build or good ways to buy software that can support an on-prem SaaS product, I'd be interested to hear about that.
Chiara Colombi (22:24):
So you're building in-house right now. But even then, you'd really rather buy something?
Andrew Colombi (22:29):
Yeah, definitely. I mean, for all the reasons that Mark talked about, about keeping people excited about their work, because they're working on the thing that the company is, rather than supporting the thing that the company is. As an engineering leader on the team, that's the kind of stuff that you end up giving to people that can take that, that either want to work on DevOps stuff or that can take the hit of not working on the actual thing. And there are a few of those people, right, but there are more people that want to work on the actual thing than the thing that supports the thing.
Mark Brocato (23:13):
I built a thing myself recently that I don't think exists; there are things that are like it. So, if anybody out there has worked with a monorepo, it's this new thing that everybody's doing now: you publish multiple packages that are all on the same version, and they need to kind of talk to one another, so a lot of people put things in a monorepo. People are growing apps with monorepos now, where the app and all of its libraries that support that app, and maybe support other things, are in that one repo. I think it's actually way overused. I think it was originally created by the folks at Babel, for Babel and all of its plugins. Lerna is what we use in a lot of cases; it's a monorepo manager. But what I built was for a situation where we have a bunch of repos that need to work together, but they're all totally independent. They're released independently. They're on independent versions. There's no concept of they all must be on the same branch. They're very loosely coupled. So I built a thing that's like a monorepo manager, but it's called herding cats.
Mark Brocato (24:18):
And basically, there's one herd master that says, I know what all the other repos are. And then you do herd pull, and it pulls down all those other repos and wires them up together, but they stay very loosely coupled. And you can actually commit to them individually and all that. So you would think that something like that exists, but it turns out I couldn't find anything like that.
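The herding-cats tool itself isn't public, so this is only a guess at the core mechanic Mark describes: a manifest in one "herd master" repo lists the loosely coupled repos, and a pull decides per repo whether to clone or update, with no shared branch or version. The repo names and URLs below are invented:

```python
import json
from pathlib import Path

# Hypothetical manifest format: the "herd master" repo holds a file
# listing the other repos, each cloned and versioned independently.
MANIFEST = json.loads("""
{
  "repos": [
    {"name": "app",        "url": "git@example.com:org/app.git"},
    {"name": "ui-kit",     "url": "git@example.com:org/ui-kit.git"},
    {"name": "data-layer", "url": "git@example.com:org/data-layer.git"}
  ]
}
""")

def herd_pull_commands(manifest, root="."):
    """Return the git commands a 'herd pull' would run: clone any repo
    that is missing, pull the ones already checked out. No coupling of
    branches or versions across repos."""
    commands = []
    for repo in manifest["repos"]:
        target = Path(root) / repo["name"]
        if (target / ".git").exists():
            commands.append(["git", "-C", str(target), "pull"])
        else:
            commands.append(["git", "clone", repo["url"], str(target)])
    return commands

for cmd in herd_pull_commands(MANIFEST):
    print(" ".join(cmd))
```

A real version would execute these with `subprocess.run` and handle failures, but the sketch shows why each repo can keep its own branches and release cadence: the tool only knows locations, not versions.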
Andrew Colombi (24:41):
Yeah, I've heard of this project before. Something similar happened at Palantir when I worked there. But yeah, it's another example of one of those things that, I guess, people tend to frequently build themselves.
Chiara Colombi (24:55):
And it sounds like it was worth the effort in the end.
Mark Brocato (24:59):
It actually was, yeah. The team has been using it for a very important project now for three or four months, and it totally enabled their workflow. And I actually did build it as something that could be open sourced; it's meant to be more general. Whether or not it ever gets there depends on whether or not I have time to do it. It's got tons of cool cat emoji every time you do anything. It's fun.
Chiara Colombi (25:21):
We've got some questions that have come through from the audience, so why don't I ask a couple of those. How flexible are your tools for integrating with my custom automation framework to generate fake data?
Andrew Colombi (25:34):
Well, I'll speak to the Tonic side of things. There are a couple of things that I would say around the flexibility. One, we connect to a lot of different data systems. So we can connect to tabular data systems in a database like Postgres, Oracle, MySQL, et cetera, even Db2. We connect with non-tabular sources like Mongo. We connect to tabular sources that are not traditional databases, like Spark, BigQuery, Redshift. Well, Redshift is kind of a database, and so is BigQuery, but you get the idea: there are a lot of different connectors we can do. So that's thing number one.
Andrew Colombi (26:12):
Thing number two is we have a pretty flexible API, and we also have webhooks in there, so that you can both command Tonic from another process like Jenkins or CircleCI, and also react to Tonic, have Tonic inform you of when various things finish. So that's another part of the customization that we allow. And then the final thing is, you can do simple forms of programming in Tonic. It's more point and click, but it enables more custom logic if you need it. So those are the three areas of customization. I would have to go into more detail about what custom things you have, or what things you would want to address, to be able to speak to the specifics.
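As a rough illustration of the "command it from CI" pattern: a pipeline step authenticates, kicks off a generation job, then waits for a webhook or polls for completion. The route, auth scheme, and payload below are hypothetical placeholders, not Tonic's documented API; consult the actual API guide for real endpoints:

```python
import json
import urllib.request

# Hypothetical base URL and token; in CI these would come from secrets.
BASE_URL = "https://tonic.example.com/api"
API_TOKEN = "REPLACE_ME"

def build_start_job_request(workspace_id):
    """Build (but don't send) a request asking the service to start a
    data-generation job for one workspace. The route is made up."""
    url = f"{BASE_URL}/workspaces/{workspace_id}/generate"
    body = json.dumps({"workspaceId": workspace_id}).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={
            "Authorization": f"Apikey {API_TOKEN}",  # placeholder scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )

# A Jenkins/CircleCI step would then send it and poll, or register a
# webhook so the service calls back when the job finishes:
#   resp = urllib.request.urlopen(build_start_job_request("ws-123"))
```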
Chiara Colombi (27:06):
Just a quick follow-up, and then Mark will answer that first question, and also this one: does Tonic connect to Salesforce?
Andrew Colombi (27:14):
Tonic currently does not have a connector to Salesforce.
Chiara Colombi (27:20):
Okay, so both of those questions to you, Mark.
Mark Brocato (27:22):
Oh, well, so for Mockaroo, if you click on the help menu in the upper right hand corner, there's a guide to the API. And anything you can do through the front end in Mockaroo, you can totally automate with code through the API. It's actually probably the fastest growing, and already a really large, use case of Mockaroo. A lot of people are building on top of Mockaroo. In fact, I think there are a few SaaS companies or training companies that have built products on top of Mockaroo as well, using those APIs.
Mark Brocato (27:53):
And the traffic I get on that is quite large at this point. So it is a very common thing, either in unit tests or in probably a bunch of different use cases, for people to automate. And there are two types of APIs. One is essentially just automating anything you could design through the front end of Mockaroo. And then there's also the ability to create your own mock APIs in Mockaroo. That would be a mock of a REST API that you might one day have the real version of.
Mark Brocato (28:20):
So you can set up the routes and the URL patterns and the responses based on the input parameters. It's a whole API mocking tool. So yeah, that's definitely a primary use case. And Mockaroo, unlike Tonic, doesn't connect to anything. Mockaroo spits out data in files, and you can download it. I've always had a large community of users from the Salesforce community, and I think mostly they're exporting CSV. So it's always been very widely used with Salesforce. In fact, Salesforce was my first enterprise customer.
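The mock-API idea Mark describes boils down to pattern-matched routes with canned responses. Here's a toy sketch, not Mockaroo's actual route DSL; the routes and payloads are invented:

```python
import re

# Each route pairs a URL pattern with a responder that builds a canned
# payload, possibly using captured parameters from the path.
ROUTES = [
    (r"^/users/(?P<id>\d+)$", lambda m: {"id": int(m["id"]), "name": "mock user"}),
    (r"^/users$",             lambda m: [{"id": 1}, {"id": 2}]),
]

def handle(path):
    """Return the canned response for the first matching route, or a
    404-style payload when nothing matches."""
    for pattern, responder in ROUTES:
        match = re.match(pattern, path)
        if match:
            return responder(match)
    return {"error": "not found", "status": 404}
```

A frontend team can code against endpoints like these before the real backend exists, which is the "REST API you might one day have the real version of" workflow.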
Andrew Colombi (28:51):
Something I would add is, while we don't currently have a connector to Salesforce, that shouldn't stop you from reaching out to our sales team; we regularly create new connectors, and it's just a matter of customer demand. We very recently released the Db2 connector, for example, and we released the Mongo connector recently as well. So making new connectors is a common thing.
Chiara Colombi (29:18):
So, a couple more connector questions. Regarding Salesforce in particular: can you connect and sync Salesforce, or SFDC, with MySQL, and then connect MySQL to Tonic? Would that be a workaround? And how about, can you connect to D365 and/or Azure CDS?
Andrew Colombi (29:34):
I don't know what Azure CDS is. We do connect with many common Azure things, like Postgres in Azure, MySQL in Azure, that kind of thing. But I don't know what CDS is. And what was the other one, Chiara?
Chiara Colombi (29:54):
D365.
Andrew Colombi (29:56):
D365. I'd have to look that one up too. Sorry, Microsoft 365, what is this?
Chiara Colombi (30:05):
Common Data Service, plus Microsoft Dynamics 365.
Andrew Colombi (30:10):
Yeah, again, I think this is something that we don't connect to now, but we make new connectors all the time.
Chiara Colombi (30:16):
A question that came in earlier: how are you tying the generated fake data to specific test use cases that you want to test in an app?
Andrew Colombi (30:28):
Yeah. So I'll speak to the Tonic side quickly. Typically, what Tonic does is try to create data like what it sees in the source database. So if you have examples of the scenarios that you want to test in the source database (for example, say you're looking for a customer that has one item in their cart), and that exists in the source database, then Tonic will have some propensity to recreate it. What Tonic is not is a scenario-by-scenario test generation tool.
Andrew Colombi (31:04):
So what our customers who want specific scenarios in the output database have done is identify customers, identify parts of the database that exhibit the things they want to test, and then tell Tonic's subsetter to extract just those examples. So, for example, let's say you want to test the example I gave, one item in the cart. You could tell our subsetter, hey, give me examples of users with one item in their cart, and it would pull the user, pull the cart, pull the item, pull that user's history, pull other items that they purchased in the past, pull maybe other users that have purchased that item. So it's going to pull a whole ecosystem of data based on that seed request. And that's how customers have been doing it: basically using the subsetter to identify specific cases they want to hone in on and extract data that represents that use case.
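A rough sketch of the seed-based subsetting idea Andrew describes, in Python. The table layout, column names, and traversal here are illustrative assumptions, not Tonic's actual schema or algorithm:

```python
# Hypothetical sketch of seed-based subsetting: start from rows matching a
# scenario predicate ("users with one item in their cart"), then follow
# foreign-key relationships to collect a referentially consistent slice.
users = [{"id": 1, "name": "Ana"}, {"id": 2, "name": "Bo"}]
carts = [{"id": 10, "user_id": 1, "items": 1},
         {"id": 11, "user_id": 2, "items": 3}]
orders = [{"id": 100, "user_id": 1}, {"id": 101, "user_id": 2}]

def subset(seed_predicate):
    # Seed rows: carts matching the scenario we want to test.
    seed_carts = [c for c in carts if seed_predicate(c)]
    user_ids = {c["user_id"] for c in seed_carts}
    # Walk relationships outward so the slice stays referentially intact.
    return {
        "users": [u for u in users if u["id"] in user_ids],
        "carts": seed_carts,
        "orders": [o for o in orders if o["user_id"] in user_ids],
    }

slice_ = subset(lambda c: c["items"] == 1)
```

A real subsetter would walk an arbitrary foreign-key graph in both directions; this toy version hard-codes the relationships to show the shape of the idea.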
Mark Brocato (32:06):
That's really smart. I'm kind of jealous, that's really smart.
Andrew Colombi (32:10):
It works pretty well.
Mark Brocato (32:10):
With Mockaroo, in contrast, you're essentially on your own. Mockaroo is great for creating data from scratch, where there is no prior art. So a lot of people use it for creating new products. And they may even simulate years' worth of production data with it, but it's for a product that doesn't have years' worth of production data. So it gives you a lot of tools for shaping the data however you want, for particular use cases, or to simulate trends or sizes or integrities. But yeah, you're designing something from scratch, essentially, with Mockaroo.
Chiara Colombi (32:47):
Oh, we have a question about time series data. Someone is particularly interested in fake time series data: what are the plans for more realistic scenario building, which defines the shape, syncopation, et cetera, of this data type, especially for ML/AI training use cases? It's really hard to build something without using real data.
Andrew Colombi (33:04):
Yeah, we have a technology in the product called Smart Linking that essentially trains a variational autoencoder on data. And what we currently do with it is try to create more data that looks like the input data. But one thing we've been hypothesizing is, could we bias it? Meaning, let's say you wanted to shape it in a certain way; if it's time series data, maybe you want the values to trend up instead of trending flat, or something like that. And we've been wondering if we can bias it, either through some sort of rejection sampling technique, or... well, this is getting in the weeds. I'm going to get in the weeds for just 30 seconds.
Andrew Colombi (34:04):
So basically, the way the variational autoencoder can work, or the way that generation technique can work, is you supply, at one of the layers of the neural net, random noise that is going to produce, on the output layers, your fake data. And if you understand a little bit about the random noise, you can affect what the outputs are going to be like. So if you can provide the random noise such that you're going to get more output in the direction you want, that can be your technique for biasing the output. So the two ideas are basically rejection sampling, which is kind of straightforward: you ask it for an output, and if you don't like it, if it doesn't have the bias that you want, you throw it away. And the alternative is this idea of biasing the noise that creates the new data.
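The rejection-sampling idea can be sketched in a few lines. The generator below is a simple stand-in for a trained model's sampler (an assumption for illustration), and the acceptance predicate plays the role of the bias you want:

```python
import random

# Rejection sampling: draw candidates from a generator and keep only those
# that satisfy the desired bias. In practice the generator would be a
# trained model (e.g. a VAE decoder fed random noise); here it's a stand-in.
def biased_samples(generate, accept, n, max_tries=100_000):
    out = []
    for _ in range(max_tries):
        if len(out) >= n:
            break
        candidate = generate()
        if accept(candidate):
            out.append(candidate)
    return out

random.seed(0)
# Stand-in generator: Gaussian values; bias: keep only values above 0.5.
samples = biased_samples(lambda: random.gauss(0, 1), lambda x: x > 0.5, 10)
```

The cost of this approach is the one Andrew hints at: if the bias you want is rare in the generator's natural output, you throw away most candidates, which is why conditioning the noise directly can be more attractive.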
Andrew Colombi (35:02):
And I talked about variational autoencoders. The other form is a technique called CTGAN, which stands for conditional-something GAN; I don't remember what the T is. But in that formulation, it becomes a lot easier to think about the noise layer, which is otherwise opaque, and condition the noise on some prior thing. Like, condition it on: I want this user to be from Boston. And then you get more Bostonians in your output, because you've conditioned the generation process. I know that was pretty hand-wavy, and also it's very difficult to describe in a minute and 40 seconds. But those are some ideas that we're working with. Right now, we just kind of try to recreate the data that is in there. If you have a use case for biasing your output in a certain direction, we can certainly connect you with our data science team, which looks at this problem on the daily.
Chiara Colombi (35:58):
So, wow, I cannot give a sophisticated answer like that, unfortunately.
Andrew Colombi (36:03):
But we've been thinking about this problem for a while.
Mark Brocato (36:07):
I've been thinking about it, but I think I'm just banging my head on it, because it's really difficult. And this is why I'm hoping somebody out there in the audience will just solve this for me. So, Mockaroo generates data starting from record one to record N, and it does it in order. Then, to make it go faster, I carve that process up across multiple CPUs or multiple machines. So one worker gets records one through 1000, another worker gets records 1001 through 2000, et cetera. And it's difficult to make time series data out of that, because unless you're on a completely regular interval, where record one is at second one and record two is at second two, you don't know where record 1000 is going to fall. So you can't inform that other worker where to start. Anything that's a well-defined sequence is really easy to parallelize, but anything that has to have that kind of randomness is quite hard. So what you can kind of do is start with a sequence and then perturb it a little bit, nudge it up or down, and that generates this kind of pseudo-randomness.
Mark Brocato (37:13):
And then what you can do to simulate trends, if you don't need the randomness and the time series can be more regular, is use a row number generator, which I think is in the default schema in Mockaroo. The row number generator just outputs the row number, and then you derive things from that. You say, okay, my date or my timestamp is a constant plus that many seconds, the row number in seconds; use a formula to derive that. And then you might derive the value that you're associating with that timestamp, again, from the row number, and just say it uses an exponential scale or something to ramp up. In that way, you can generate time series data that fits a lot of use cases, but it doesn't simulate the randomness, the imperfect trend, that you might see a lot in IoT. Yeah, so I'd love to have a solution to that problem that scales, but I don't.
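The row-number technique Mark outlines can be sketched like this. The start date, the one-second interval, and the exponential ramp are illustrative assumptions, not Mockaroo's actual generators:

```python
from datetime import datetime, timedelta
import math

# Row-number-derived time series: every field is a pure function of the row
# number, so the generation is trivially parallelizable (any worker can
# compute any row independently).
start = datetime(2023, 1, 1)

def row(n):
    timestamp = start + timedelta(seconds=n)  # constant + n seconds
    value = math.exp(n / 100)                 # exponential ramp-up trend
    return {"row": n, "ts": timestamp, "value": value}

series = [row(n) for n in range(5)]
```

This is exactly why the approach scales: because row 1000's timestamp is a formula of 1000, a worker starting at row 1000 needs no information from the worker that produced rows one through 999. The randomness Mark wants would break that property.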
Andrew Colombi (38:12):
Sometimes I like to divide the time series problem into two components. One component is: when did the events happen, when did the measurements happen, or whatever you want to call it? And then: what does the data look like at the event? I wrote a blog post in the early days of Tonic, when I still wrote blog posts, about how to do event timing. And there's pretty interesting stuff you can do with Poisson processes. There are more sophisticated answers these days, but I feel like the Poisson process is just your go-to for inter-event arrival times. An oldie but a goodie.
Chiara Colombi (38:50):
We still point people to that all the time whenever they ask about time series. All right, another question: does the data faking process have repeatable results, i.e., are the results the same every time? If yes, do you also have referential integrity across different platforms, different distributed systems, mainframe, et cetera?
Andrew Colombi (39:12):
Yeah is the short answer; then there's some more nuance. Referential integrity is a very important thing for Tonic users, so we've spent a lot of time trying to make that possible. And certainly, almost all of our customers are working with multiple databases and need to link across all of them, so that a user ID in one is going to match the user ID in another, et cetera, et cetera, even after you've applied the transforms. So that's for sure. Now, to the exact parsing of your question: can you always get the same output from Tonic? It does depend a little bit. At least on the referential integrity part, I can give you a thumbs up. But if your input data changes and the statistics of that data change, then Tonic is going to learn a new state of the world. It's going to say, well, this is what my data looks like now, and it's going to try to replicate that. So the answer is caveated with: as long as the input data also doesn't change.
Chiara Colombi (40:19):
And is consistency, then, the answer to this question as well?
Andrew Colombi (40:22):
Yeah, exactly. That's a big part of it. There's a feature in Tonic called consistency, which directly addresses this question of how do I keep IDs consistent across multiple systems. And it's not just IDs; you can do a lot with it, but IDs are a very important use case. Just as an idea, to give you a shape of what's going on: you can tell Tonic, I want this value to be consistent on this ID. So an example would be, I want the name to be consistent on the ID. So whenever I see ID 157, I get the name Andrew. And that means no matter what database I am on, if I see ID 157, I'm going to get the name Andrew, which is really useful for being able to create a database that makes sense, where it's not like this microservice thinks the name is Andrew, and this microservice thinks the name is Maria, and testers are asking, is that a bug, or is the data just bad? And it can never be that the data is just bad; otherwise, our users get angry at us.
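One common way to get this kind of ID-keyed consistency, sketched here as an assumption rather than Tonic's actual implementation, is to derive the fake value deterministically from the real ID, so every system that holds ID 157 maps it to the same fake name:

```python
import hashlib

# Deterministic, ID-keyed fake values: hash the real ID and use the digest
# to index into a value pool. The same ID always yields the same fake name,
# in any database, in any run. The name pool is illustrative.
NAMES = ["Andrew", "Maria", "Chiara", "Mark"]

def consistent_name(user_id):
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    return NAMES[int(digest, 16) % len(NAMES)]

# Every system that sees ID 157 gets the same name back.
name = consistent_name(157)
```

A production tool would also have to worry about collisions, larger value pools, and keeping the mapping secret (a plain hash of a low-entropy ID is reversible by brute force), but the core mechanism is this deterministic keying.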
Mark Brocato (41:23):
So for Mockaroo, there's a very common use case that there's a tutorial video on; just go to the help section in Mockaroo. One of the tutorials I have up there is: you create a schema, and then you can create a data set based on that schema, which is stored in Mockaroo. And then you can reference that data set in future data sets. So you could generate primary key information in one data set and then reference it as foreign key information in another data set. That's how you build out the referential integrity. So it's absolutely a very common thing.
Mark Brocato (41:53):
As far as repeatability, Mockaroo's design is biased towards controlled randomness, or generated trends. So it's less common to want to generate things that are the same every time. Most of the data types you'll find default to generating random things within a range or according to some distribution that you can control. But certainly, with things like formulas and constants, you can have the same data every time. So it's kind of a meta question: when you design a trend, you should get about the same trend every time, but obviously with different data points in it. So you have a fair bit of control, I would say.
Chiara Colombi (42:35):
Well, before we go on, there's one question that came in earlier that I'm happy to answer. Someone asked about the relationship between Tonic and Mockaroo. The relationship is really that we were both early players in the space, big fans of fake data; the founders connected early on, and we really enjoy chatting and having these conversations with Mark, because there's a lot of overlap in what we do. But there are also differences; we solve two different use cases, from transforming existing data to creating data from scratch. So the products are independent, if that was part of the question; not integrated or anything.
Mark Brocato (43:10):
Yeah, I just like hanging out with Chiara and Andrew.
Chiara Colombi (43:14):
We have conversations, and [crosstalk 00:43:16]
Andrew Colombi (43:16):
If you joined early, you would have noticed that before we started, we were just bantering about whatever. And so that's kind of the glue that keeps it all together.
Chiara Colombi (43:25):
Yep. Wish we could have these conversations more frequently, honestly. All right, this actually speaks to those two different use cases. Someone asked: more often than not, we do not have access to the production database. Is there a way Tonic can look at the schema's metadata and generate fake data in a lower-level environment?
Andrew Colombi (43:45):
That is a Mockaroo use case more than a Tonic use case, because Tonic really is designed for when you have some guide, from real data, of what the data should look like.
Mark Brocato (43:56):
Well, thank you so much for punting that one to me, because I get the same question, and it's really tough to answer; you have to think it through. What people are often asking is: okay, take a look at my database table. You've got integer, varchar, varchar, varchar, datetime, integer. Can you generate fake data based on that? And it's like, I don't know anything about what's actually in there, other than that there are some strings and some integers, and I know roughly the length I can stuff in there. But I don't know whether those strings are names or ages or addresses or whatnot. So it is actually really difficult to learn just based on a schema.
Mark Brocato (44:28):
So one of the bare-bones things that Mockaroo does to accelerate that is, you can literally just take the DDL for a table and paste it into Mockaroo, and it'll grab all the column names and prefill them with just blank data types; you have to choose them. That's about the best it can do. So in a way, maybe that is a better use case for Mockaroo, but you're going to have some work; you're going to have to tell Mockaroo what to do.
Andrew Colombi (44:51):
The one thing I would add, which I think is what I was thinking of when I mentioned that, is that Mockaroo has a ton of different industry-specific, and also generic, data creation things. You could probably speak to it better than I can, but you've got everything from VINs to ICD-10 codes to probably shipping container codes. I mean, every ID that you can think of, Mockaroo has a way of creating that ID that fits for that data type.
Andrew Colombi (45:21):
Whereas Tonic is much more like: I see you have IDs that look like this; I can make more IDs like that. It doesn't have a specific way of making a specific kind of ID; it just sort of apes what's already in the database.
Mark Brocato (45:34):
Yeah, maybe that goes to the building side. One of the tips I would have for anybody who's trying to build one of these things on their own: I kind of built Mockaroo to be a little meta, in that there's an easy way for me as a developer to take a CSV data set and just make a data type out of it. And so a lot of the same code is reused to just pick random rows out of that data set. And that's allowed me to extend things with tons of data types over time. So if you're thinking about building something on your own, think about the future: it's a lot easier to be able to take flat-file data sets from the public and plug them into a system than to have to code a generator for every single thing.
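The "CSV file becomes a data type" idea can be sketched simply: load a flat file once, and the generator is just a function that picks random rows from it. The file contents and names here are illustrative:

```python
import csv
import io
import random

# A CSV-backed data type: parse the flat file once, then the generator is
# just "pick a random row". Adding a new data type means adding a new file,
# not writing new generator code.
CSV_DATA = """code,description
A01,Typhoid fever
B05,Measles
J10,Influenza
"""

def make_type(csv_text, seed=None):
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    rng = random.Random(seed)
    return lambda: rng.choice(rows)

# Usage: the returned callable behaves like any other field generator.
icd10 = make_type(CSV_DATA, seed=1)
sample = icd10()
```

This is the "meta" design Mark describes: one piece of sampling code serves every CSV-shaped data set you feed it.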
Chiara Colombi (46:18):
That was actually one of my questions for both of you is like, what are the top three considerations to keep in mind if someone is setting out to build their own solution?
Andrew Colombi (46:28):
It's much harder than you think it is. I've already said that.
Chiara Colombi (46:29):
That's not what I mean.
Andrew Colombi (46:32):
Item one, item two, and item three: it's much harder than you think it is. One thing that kind of surprised us, the reason why it was harder than we thought it would be, to keep harping on that, is that the interconnectedness and richness of data is really deep. If you think you can just say, well, I'm just going to replace this value with a random string, to the point we were talking about earlier, like, I've got int, varchar, varchar, varchar, what do I do with that? You're basically wrong. There are very few use cases where just a random string is going to suffice. And so already you're starting to look at, okay, well, what kinds of strings exist in this database? Addresses, usernames, telephone numbers, et cetera. And so there's a big richness there. And that's just a richness of data type, in the sense that, okay, the column type is string, but really, the data type is address, right?
Andrew Colombi (47:38):
And then there's a richness in the interconnectedness of the data that is very difficult to capture and simulate when you're generating this data. And by that I mean, just as a simple example I use a lot: if you look at claims data... claims data is, every time you go to a doctor, you make a medical claim to your insurance, so they have all of that in data, and everything is in there, right? You got a shot, they gave you aspirin, those are two different claims. And there's going to be a story in there: a story of a person who went to the doctor, then got an MRI on their knee, then got a knee replacement, then got PT. And having those stories in the data is really, really, really hard. It's like peeling the onion: when you're doing this, you think, oh, it's not going to be that bad. Then you start peeling the onion, and you're like, oh my goodness, I need to simulate a person getting a knee replacement. That's going to be hard. So yeah, the richness, I think, is a thing that, if you were going to do this on your own, pay attention to. Think about it upfront.
Chiara Colombi (48:54):
Mark what about you?
Mark Brocato (48:56):
Yeah, so a couple of things. One is usability. As a developer, you build something, and it's probably going to make sense for you. But the need for it to make sense for anybody else, anybody of a different background or different skill set, that's really rough. And a lot of times, the people that could actually build a data generator, something that can generate large volumes of data at scale quickly, are not also the people that like to worry about usability and user interface design and whatnot; most developers hate that stuff. So if your audience is a less technical crew than you are, which is probably likely for anybody who's thinking about building this, building something that they can use and that empowers them is a really serious challenge, I think. I think that's where I kind of got lucky, in that I seem to have done that, and I don't know how.
Mark Brocato (49:50):
Another one is the cost of keeping the resources running. Maybe if you're a big company, you don't care, but if you're not: if you build something and it's not optimized, just keeping those servers running for the 99% of the time when no one's using them to generate data can be a serious cost. There are technologies you can use to mitigate that, like Functions as a Service, or serverless, where you're just paying for the amount of compute time that you use. But then again, you have to know those technologies, and they're kind of new and raw. And it actually takes time to consider how to build this thing in a way that is cost-effective.
Mark Brocato (50:30):
But then also, I think one of the most important things is just the people cost to maintain a system like this, which is going to be really interesting for the month that you're building it, and then not so much for the next 10 years that you have to support it. It's not likely to be anything that's truly core to the business, so the people supporting it will be second-class citizens, and it's just no fun. So yeah, it's funny, because I always want to build everything myself, but if anybody who works for me wants to build something themselves, I always force them to buy instead of build. So yeah, there's a lot of cost to be considered there.
Chiara Colombi (51:06):
We have a question that I think the answer will be short. I know we're coming up on the hour, but can either of your tools create images of people's faces in a JPG format with a limited image size?
Andrew Colombi (51:17):
No, not Tonic; I don't want to speak for Mockaroo. But we've focused on structured data.
Mark Brocato (51:26):
Yeah, for Mockaroo, I think there is an avatar type. But there's a great site out there that I should just integrate with; in fact, I think I'm going to go do that after this, because it's probably really simple. I think if you Google "this person does not exist", it's literally AI that creates a person who looks totally real but doesn't actually exist. And I'm pretty sure they have an API, and you can control image size and such. So I just need to plug that into Mockaroo. That's a great question, because it's a great idea for a new data type.
Chiara Colombi (51:53):
There's another tool I just came across that also uses AI, and I'm blanking on the name, which is not very helpful. But you can basically take multiple images of people's faces and merge them, and kind of say, okay, I want to dial the characteristics of that person's face up or down; you play around with it. If anybody out in the audience knows what that is called... it was something "art"; it had a weird name that grammatically didn't make sense. And grammatical things that don't make sense tend to stick in my brain. So the last comment, not so much a question, that came through, maybe giving you guys an idea for time series, just want to make sure you saw it in the chat: what about generating threads, adding a random interval to the rows, and then a post-processor converting from a known start timestamp by adding those sequentially? Not sure how that-
Mark Brocato (52:44):
I think that was Tom suggesting a solution to my time series problem from before. I've thought about things like that. What makes it difficult is that Mockaroo is, right now, a one-pass algorithm, meaning it just goes from start to finish, and there's no post-processing of all the data. And sometimes generating all that data takes quite a while, and it's an immense volume of data in the end. So making a second pass and reading that data after it's been assembled from all these different workers is a whole order of magnitude more work and complexity. So I have not jumped that shark yet. I've thought about that very solution, but going from fundamentally a one-pass to a two-pass or multi-pass system is a thing I don't want to do if I can help it.
Chiara Colombi (53:34):
I did find that website that is kind of like This Person Does Not Exist, called artbreeder.com. Like I said-
Andrew Colombi (53:42):
Thispersondoesnotexist.com also has a thing.
Chiara Colombi (53:43):
Right, that's it. And Artbreeder is the one where you can take multiple images and merge them in some way and kind of toggle the characteristics of those images. So people use it to create fake faces in the context of, I don't know, creating characters for a story or something. That was interesting. I think that is all the questions that we have. Any final thoughts from either of you on what we discussed?
Andrew Colombi (54:12):
Well, having perused This Person Does Not Exist, it's really convincing, unless there's another person in the image, and then that other person's broken. [crosstalk 00:54:24] I'm going to share my screen; we can end on this. Here we go. This person, super convincing. This person...
Chiara Colombi (54:37):
Does not exist.
Andrew Colombi (54:38):
Does not exist.
Mark Brocato (54:42):
I'm also willing to bet that person on the left mostly exists. I don't think that one's been scraped together from as many different images as the one in the center.
Andrew Colombi (54:52):
Oh, interesting. That's an interesting point.
Chiara Colombi (54:57):
Cool. Well, before we say goodbye, I just wanted to give a shout-out to some more webinars that we have coming up on this subject later in the year. We've got one in November and one in December. December's webinar is again going to be focused on build versus buy, but with a targeted focus on enterprise ROI analysis, so looking at some of the numbers behind what we've discussed here. Keep your eyes out for that on the events page at tonic.ai. And in November, we're going to be looking at fake data antipatterns; we recently released an ebook on the subject, on all the ways fake data can fail you and how you can avoid those pitfalls. So keep your eye out for invitation emails and for the events on the website. Awesome, we got a "thanks from Finland"; awesome to hear that someone's chiming in from Finland. Thanks to everyone for joining us. Thank you, Mark. Thank you, Andrew. Like we said earlier, these conversations are always great. I'm looking forward to the next one.
Mark Brocato (55:49):
Thank you. It's been a pleasure.
Andrew Colombi (55:51):
Great. Thank you.