Artificial Intelligence Is Only as Good as Your Data, Which Likely Sucks

By Scott Ambler

Artificial intelligence (AI) is constrained by the Law of GIGO: Garbage In, Garbage Out. Or, to put it more positively, the higher the quality of the data that you use to train your AIs, the more accurately they will perform. Yes, there's a lot more to AI than clean data, but without it you're in serious trouble.

Several years ago Robert S. Seiner, in his book Non-Invasive Data Governance, argued that data is the new water. Although many people like to say that data is the new oil, because it's proving to be so valuable, Seiner argues that it's more like water. Just as you won't survive long without clean water, your organization won't survive long without clean data. Let's explore this metaphor.

Just Like You Need Clean Water…

Imagine you're thirsty and would like a glass of water. You get a glass and go to the kitchen tap, but the water that comes out is brown and smelly. You realize that there are several ways to fix this problem. An easy way would be to get a filter jug, pour water into it, and then drink clean water from the jug. But dirty water is also coming from all the other taps in your house, and you really don't want to shower in it. A better solution would be to filter the water as it comes into your house, so that all the water coming from your taps is clean.

Then you discover that your neighbors have the exact same problem: all the water in your town is polluted. A better solution for the town would be to put the filter at the inflow pipe for the town's water supply, so that every house in town gets clean water. But while this solves the problem for the town, it does nothing for the other towns downstream.

Unfortunately, a factory upstream from where you live is putting filth into the river. One strategy would be to put a filter on the factory's outflow pipe so that the water is clean when it enters the river. Better yet, the factory owners could identify what machinery is producing the filth, fix that machinery, and clean up any mess around it. This would fix the real problem at the source, once and for all.

…You Need Clean Data

Just as there are different strategies for improving water quality, each with a different level of impact and effectiveness, there are different strategies for addressing data quality (DQ) problems. The way I see it, there are five maturity levels for how you approach DQ:

  1. Point-specific cleansing. Include cleansing logic in your application, often at the points where you read some data in or write some data out. This strategy typically works with a small amount of data at a time, perhaps to show on a screen or in a report.
  2. Point-of-use cleansing. With this strategy large amounts of data, typically entire tables or data files, are cleansed when the data is to be used. Data scientists commonly take this approach when building AI or other data science solutions.
  3. Copy and transform. A large amount of data is copied from an existing, operational data source and transformed into a clean version en masse. Data warehousing teams follow this approach via an extract-transform-load (ETL) or extract-load-transform (ELT) strategy.
  4. Source interface cleansing. Read/write access to a data source occurs through a defined interface, such as an API or a web service. The implementation of the interface includes the requisite logic to cleanse the data going out of the data source. Ideally this interface conforms to a data contract, to ensure consistent data semantics.
  5. Fix the source. Apply database refactoring or data repair techniques to fix any problems in the data source itself, and any systems that are working with that data.
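To make the first two levels concrete, here is a minimal sketch in Python. The record fields and cleansing rules are hypothetical assumptions chosen for illustration; the maturity model itself doesn't prescribe any particular logic.

```python
# Hypothetical customer records with inconsistent casing and whitespace.
# The field names and rules below are illustrative assumptions only.

def cleanse_record(record: dict) -> dict:
    """Level 1 (point-specific): cleanse one record at the point
    where it is read in or displayed."""
    return {
        "name": record.get("name", "").strip().title(),
        "email": record.get("email", "").strip().lower(),
    }

def cleanse_table(records: list[dict]) -> list[dict]:
    """Level 2 (point-of-use): cleanse an entire data set before it
    is used, e.g. before training a model on it."""
    return [cleanse_record(r) for r in records]

raw = [
    {"name": "  ada LOVELACE ", "email": "Ada@Example.COM "},
    {"name": "alan turing", "email": " ALAN@example.com"},
]

clean = cleanse_table(raw)
print(clean[0])  # {'name': 'Ada Lovelace', 'email': 'ada@example.com'}
```

Note that both levels leave the source data dirty; every new consumer has to repeat this work, which is exactly the cost argument made below.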

The following table maps the various strategies for cleaning dirty water to the strategies for cleaning dirty data.

Data Cleansing Maturity Level     Water Cleansing Strategy
1 – Point-specific cleansing      Filter jug
2 – Point-of-use cleansing        House filter
3 – Copy and transform            Town inflow filter
4 – Source interface cleansing    Factory outflow filter
5 – Fix the source                Fix the machinery at the source

Comparing Levels

An interesting thing to note is that the strategies at the lower maturity levels are easy and perceived as inexpensive. A filter jug costs twenty dollars and a filter system for your house a few thousand dollars. Filtering the water coming into the town's water supply would take years and millions of dollars for each town to implement, as would the solutions at the factory. But when you consider that tens of thousands of houses would each need to install filters, it's much cheaper to fix the problem as far upstream as you can. The cheap and easy solutions prove quite expensive in practice when you consider the overall situation. Organizations are seeing the exact same thing when it comes to dealing with data quality problems: they're spending a lot of money on the simpler, low-maturity strategies but not fixing the real problem. My advice is to work smarter rather than harder.

Parting Thoughts

Just as people need clean water to survive, our organizations need clean data to survive. Dirty water poisons people, and dirty data poisons our decision-making systems, including the AIs that we build. We need to start treating data as an enterprise asset, and that means we must invest in making it clean. If our data isn't an asset, then it's a liability.

This article is excerpted from my forthcoming book “Building a Continuous Data Pipeline”, to be released in early 2026.