When companies roll out enterprise AI tools, they often find that their data lake may be deep, but it's messy. Even if they start with carefully curated data, poor data change management can lead to serious consequences downstream.
Chad Sanderson is the CEO and founder of Gable.ai, where he helps organizations improve data quality at scale.
I got to speak with him about the importance of data quality and how data contracts can ensure that applications built on large amounts of data maintain their integrity.
Q: You come from a background as a journalist. Do you want to tell us how you ended up in data and being passionate about data science and data quality?
Chad Sanderson: “Data science was something that I started practicing as a journalist because I was running my own website and I needed to set up web analytics. I learned GA4, started running A/B tests, very basic data science. And then I enjoyed it so much that I made it my full-time job, taught myself statistics, and ended up going to work for Oracle as an analyst and a data scientist.
And then I started managing teams in the data space. First, it was more on experimentation and analytics teams. Then I began moving more into data engineering and then ultimately to infrastructure, data infrastructure platforms.
So I worked on the Microsoft Artificial Intelligence platform. And then I also led the AI and data platform at a late-stage freight tech company called Convoy.”
Q: You recently spoke at MDS Fest about data contracts and how that allows companies to have this federated data governance. Do you want to briefly explain what that’s about?
Chad Sanderson: “Data contracts are a kind of implementation mechanism of federated data governance and federated data management.
Basically, in the old world, so in the legacy world, on-prem, 20 years ago, you had data architects that would build an entire data ecosystem at a company, starting from the transactional databases, the ETL systems, all of the various mechanisms that you transform data and basically prepare it for analysis and data science and AI.
And all of that data was provided to the scientists from a centralized team. You can think of it in the same way that a librarian operates a library.
They make sure what books are coming in, what books are going out, how the books are organized, and that makes it very easy for researchers to find the information they need for their projects.
But what happened 15, 20 years later is that we moved to the cloud, software ate the world, as Marc Andreessen says, and every business decided to become a software business. The way that companies were running software businesses was by letting the engineering teams move as fast as they possibly could to build applications in a super iterative, experimental way.
That meant that all of the data these applications were generating was no longer subject to the data architects planning out the structure and how it was designed and organized. You just took all this information and you threw it into one place called the data lake. And the data lake was very messy.
The responsibility to make some sense out of all of this kind of swampy information fell onto the data engineer. And so there’s a bit of living in both worlds where you have the decentralized, totally federated application layer and a very, very still centralized data layer and data engineering teams doing their best to make some sense out of it.
The data contract is a mechanism for the downstream data teams and data engineering teams to say, hey, we’re starting to use this data in a particular way.
We have some expectations on it. And that means that the engineers who create the data then take ownership of it, the same way that a data architect would have taken ownership of the entire system years earlier. And that is what actually allows governance to scale, quality to scale.
If you don’t have that, then you just get this very chaotic sort of situation.”
Q: And it’s the garbage in, garbage out kind of situation. If you change something very small in your data, that can have profound ramifications downstream.
Chad Sanderson: “Yeah, that’s exactly right. And there’s a lot of businesses that have had really unfortunate impacts from their AI models just by relatively small changes that the application developers don’t think are a big deal.
For example, let’s say that you’re collecting someone’s birthday because you want to automatically send them a very nice birthday message.
You might be storing that birthday information as three columns: birthday month, birthday day, and birthday year. And you take all that information and then you can do some fancy stuff with it. But if the engineer says, you know what, splitting this into three different columns is stupid.
I just want to have one column for the date. That’s fine. And they’re going to do that because it makes their application easier to use.
But anyone who’s downstream that’s using that data is expecting three columns. So if tomorrow they only get one, and the two that they were using are gone, it’s going to blow up everything they had built.
That’s the kind of thing that’s happening all the time at companies.”
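The failure mode Sanderson describes can be made concrete with a small sketch. The column and function names below are hypothetical, but the pattern is the one he outlines: a downstream consumer that expects three birthday columns breaks the moment the upstream schema collapses them into one, so it pays to fail fast with an explicit check.

```python
# Downstream consumer's expectation of the original three-column layout
# (column names are illustrative, not from any real system).
EXPECTED_COLUMNS = {"birthday_month", "birthday_day", "birthday_year"}

def validate_birthday_schema(record: dict) -> None:
    """Raise immediately if the upstream schema no longer matches expectations."""
    missing = EXPECTED_COLUMNS - record.keys()
    if missing:
        raise ValueError(f"Upstream schema changed; missing columns: {sorted(missing)}")

# Yesterday's record: passes silently.
validate_birthday_schema({"birthday_month": 7, "birthday_day": 4, "birthday_year": 1990})

# After the engineer consolidates to a single date column: raises ValueError.
try:
    validate_birthday_schema({"birthday": "1990-07-04"})
except ValueError as err:
    print(err)
```

An explicit check like this turns a silent downstream breakage into a loud, attributable failure at the point where the data first arrives.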
Q: You’re the CEO of a company called Gable. What are some of the core challenges you’re seeing companies facing that you’re hoping to solve? How does your platform address some of those issues?
Chad Sanderson: “So the biggest challenge that we’ve heard from most companies moving into the AI and ML space, at least from the data side, is really two things. The first is ownership. So ownership meaning if I’m someone who’s building out AI systems, I’m building the models, I need someone to take ownership over the data that I am using and make sure that that data is treated like an API.
If you’re a software engineer and you’re relying on someone else’s application, you’re doing so through an interface. That interface is well documented. It has very clear expectations.
There are SLAs. It has a certain amount of uptime that’s expected to work. If there are bugs, then someone actually goes and fixes them.
And this is the reason why you can feel comfortable taking a dependency on applications that are not just the thing that you built. And in data, that’s what we’re doing when we are extracting data from someone else’s data set, like a database for example. And then we’re building a model on top of it.
We’re taking a dependency on an interface, but today there is not much ownership on that interface. There’s no real SLA. There’s not a lot of documentation.
It can change at any time. And if that’s how APIs work, our whole internet ecosystem would be in chaos. Nothing would work.
So this is what a lot of companies and data teams are really craving right now, is the ability to trust that the data that they are using is going to be the same data tomorrow that it was yesterday. That’s one piece. And then one of the really essential outcomes of that is data quality.
We care about making sure that the data matches our expectations. So let’s say that I’m working with some shipping data and I’m consuming some information about shipping distances for freight. I would always expect that shipping distance feature to mean the thing that I expect it to mean and not suddenly mean a different thing, right?
If I say this is shipping distance in miles, then tomorrow I don’t want it to suddenly mean kilometers because the AI is not going to know that it’s changed from miles to kilometers. It doesn’t have the context to understand that.
What Gable is all about is making sure that those very clear expectations and SLAs are in place, that all the data that teams are using for AI is clearly owned, and that the entire organization understands how different people within the company are using the data and where that tender loving care is actually needed.”
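One way to picture the "data treated like an API" idea is a consumer-side contract written down as code. This is a minimal sketch, not Gable's implementation; the field names, the miles unit, and the distance bounds are all assumptions chosen to mirror the shipping example above.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class FieldContract:
    """One field's expectations: type, unit, and plausible value range."""
    name: str
    dtype: type
    unit: Optional[str] = None
    min_value: Optional[float] = None
    max_value: Optional[float] = None

# Hypothetical contract for freight data: distance is declared in miles.
SHIPPING_CONTRACT = [
    FieldContract("shipment_id", str),
    FieldContract("shipping_distance", float, unit="miles",
                  min_value=0.0, max_value=6000.0),
]

def check_contract(row: dict, contract: List[FieldContract]) -> List[str]:
    """Return a list of contract violations for one row (empty list = OK)."""
    violations = []
    for field in contract:
        if field.name not in row:
            violations.append(f"missing field: {field.name}")
            continue
        value = row[field.name]
        if not isinstance(value, field.dtype):
            violations.append(f"{field.name}: expected {field.dtype.__name__}")
            continue
        if field.min_value is not None and value < field.min_value:
            violations.append(f"{field.name}: below minimum {field.min_value}")
        if field.max_value is not None and value > field.max_value:
            violations.append(f"{field.name}: above maximum {field.max_value}")
    return violations

# A normal row passes; a silent switch to kilometers blows past the
# miles-based bound and gets flagged instead of corrupting the model.
print(check_contract({"shipment_id": "S1", "shipping_distance": 412.0}, SHIPPING_CONTRACT))
print(check_contract({"shipment_id": "S2", "shipping_distance": 9650.0}, SHIPPING_CONTRACT))
```

A range check can't prove a unit changed, but it catches the distribution shift a miles-to-kilometers switch produces, which is exactly the kind of semantic drift Sanderson says models can't detect on their own.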
Q: A lot of the emphasis is on ensuring the data quality to enable AI, but is AI enabling you to do that better?
Chad Sanderson: “AI is awesome, frankly. I think that we’re in the middle of a hype cycle, definitely, 100%.
So people are going to be making some claims about what AI can do that are outlandish. But I think if you’re realistic and you just focus on what AI can do right now, there’s already a lot of value it’s adding, for our company in particular. So Gable’s primary value add, the thing that we do differently from everyone else, is code interpretation.
Gable is not a data tool. We are a software engineering tool that is built for the complexities of data. And we can interpret code that ultimately produces data to figure out what that code is doing.
So if I have, let’s say, an event that’s being emitted from a front-end system, and every time somebody clicks a button, there’s code that says, hey, this button is clicked. I want to send an event called button clicked into a database. And then from that database, we’re going to send it to our data lake.
And then from our data lake, we send it to model training for some AI system. And what Gable can do is say that if some software engineer decides to change how that button-clicked event is structured in code, which would have an impact on everyone downstream, we can recognize that that has happened during the DevOps process.
So when a software engineer is going through GitHub and making changes to their code, you can say, oh, wait a second, before you actually make this change, we’ve detected that something has gone wrong here.
A lot of that code interpretation, we’ve built out using more machine learning and static analysis-based methods.
But AI, which is very skilled at recognizing conventions like common coding patterns, does a really great job at providing context into why people are making code changes or what their intent is. So there are a lot of cool ways that we can apply AI for our product in particular.”
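The DevOps-time check Sanderson describes can be sketched as a schema diff run in CI before a pull request merges. This is an illustrative toy, not Gable's product: the event name, field names, and `diff_event_schema` helper are all hypothetical.

```python
# Downstream teams' registered expectation for the "button_clicked" event,
# e.g. stored alongside the codebase (names are illustrative).
REGISTERED_SCHEMA = {
    "event": "button_clicked",
    "fields": ["user_id", "button_id", "timestamp"],
}

def diff_event_schema(proposed: dict, registered: dict) -> list:
    """Flag breaking changes a code change would introduce for downstream consumers."""
    problems = []
    if proposed.get("event") != registered["event"]:
        problems.append(
            f"event renamed: {registered['event']} -> {proposed.get('event')}"
        )
    removed = set(registered["fields"]) - set(proposed.get("fields", []))
    if removed:
        problems.append(f"fields removed: {sorted(removed)}")
    return problems

# A PR that renames button_id to btn gets blocked before merge:
proposed = {"event": "button_clicked", "fields": ["user_id", "btn", "timestamp"]}
for problem in diff_event_schema(proposed, REGISTERED_SCHEMA):
    print("BLOCK:", problem)
```

In practice the "proposed" schema would be extracted from the changed code by static analysis, which is the hard part the quote is pointing at; the diff itself is the easy, mechanical step.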
Q: If companies want to leverage AI, they’re going to need data. What do you see as the biggest opportunities for companies to manage and develop their data? How do they capitalize on that and prepare for it?
Chad Sanderson: “So I think that every company who wants to leverage AI needs to come up with a data strategy. And I think that there’s going to be two data strategies that will be hyper-relevant to every business.
The first is that right now, the big generative models, the public-facing LLMs that we all know about, like OpenAI, Claude, Gemini, Anthropic, are all primarily using publicly available data, data that you can get from the internet.
And this definitely has utility for a broad, general model. But one of the challenges with these LLMs is something called the context window, meaning the more information they have to reason over, the worse a job they do. So the narrower the task you can provide them, with a limited amount of context, the more effective they are.
It’s kind of like a person, right? If I give you, you know, a book’s worth of information and then ask you about a very specific paragraph on page 73, your ability to recall it is likely going to be low. But if I only give you one chapter to read, you’re likely going to do a much better job at that.
So that’s one point: a lot of these general models, I think, are not going to be as useful for big businesses. And we’re going to start to see smaller and smaller models that are more context-driven, based around smaller contexts.
And the way that you get finely tuned, high-quality context is by getting great data about whatever that specific thing is that you’re looking at. And I think data is going to become the competitive moat for most businesses.
So I think that that’s going to be a huge investment that a lot of companies are going to have to make. We need to collect as much high-quality data as we possibly can so that we can feed it into these models and not use the broader models with the larger context windows.”
Q: How are things like GDPR and CCPA in California going to affect how people or companies handle data quality and security?
Chad Sanderson: “I think GDPR and CCPA are really good examples of why a lot of businesses are concerned about what the regulation of these generative models looks like in the future.
Even if the United States says, ‘Hey, this is okay’, if the EU decides that it’s not, ultimately, you have to apply these standards to everyone, right? The big deal with GDPR was you can’t really tell if a customer accessing your website is from Europe or the United States.
And certainly, you can do geolocation and stuff like that. But you might have a European in the United States who is using your application and GDPR does not discriminate between that person and someone who’s actually living in Europe. You have to have the ability to treat them the same.
And that means effectively, you need to treat all customers the same because you really don’t know who this person is on the other side of it. And that requires a lot of governance, a lot of very interesting technological innovation, a lot of changes in how you deal with marketing and things like that. And I think we’re probably going to see something similar with AI when the regulation really starts to come out.
Europe is already beginning to push on it. And this is why it’s just safer for a lot of businesses to do their own stuff, right? I have my own walled garden.
I’m only using the data that I collect from our own applications. And that data is not leaving. We’re not following customers around the internet.
We’re just looking at the patterns of how they actually use our services. I think that’s going to become pretty big. The other thing I think is going to become big is data vendors.
So data vendors have been around for a very long time, or data as a service, where you say, look, I’m going to provide you up-to-the-minute information on the weather, and you pay me for access to that information. And I’m the one who’s already gone through the hurdles of making it safe and making it accessible and making it trustworthy. And I make sure the data quality is high.
That’s already happening. But I think that that’s going to explode over the next five to 10 years if you need data that you can’t collect from your own internal applications. And I think in that world, the concept of these contracts is going to become even more important.
And that’s going to be attached to a literal contract. If I am paying for data to look a certain way, then I have certain expectations for it.
I do not expect that data to suddenly change from the last time you gave it to me to today, because now it can really have an impact on my machine learning model, which has an impact on my bottom line.”
We interact with AI tools on a daily basis, but we hardly ever think about the data these models rely on. Data curation, quality management, and control are going to become more crucial as companies build products that depend on consistently good data, especially companies deploying AI internally.
If you want to know more about data contracts and how to make the most of your company’s data, you can contact Chad Sanderson or learn more at Gable.ai.