It’s Not AI, but Data Cleansing Is the Next Sexiest Job

By Shammy Narayanan

Interacting with fresh graduates, I couldn’t help wondering whether most of them were living in a mythical dream, speed dating AI algorithms as their magical path to success. Whenever I broached the topic of data with them, I saw expressions like those of someone down with a nasty flu. The same deep-rooted belief exists even among seasoned professionals in the industry. Seldom do they realize that any successful AI program spends the lion’s share of its effort on data cleansing. Those who ignore this unwritten rule end up among the 76% of AI projects globally that fail or are shelved.

Before diving further, let’s revisit what constitutes data cleansing. It is not just a naïve technique of identifying and replacing null or missing values. It is a collection of methodologies to detect and resolve inconsistencies and inaccuracies in data, along with a sound process for dealing with duplicate, decayed, and humongous data, so that AI algorithms can be trained successfully and deliver accurate results. Contrary to the general perception, it’s not about Big data but Better data. If this sounds like a spiel, let’s analyze three AI models that failed recently because their teams took their eye off basic data sanity.

Let’s begin with a simple case of how a well-planned marketing model derailed for want of basic deduplication. The organization had departments such as Billing, Sales, and Marketing collect and store data independently. When the marketing department tried to run a Spring Clearance campaign by integrating all the databases, the campaign went belly-up. What went wrong? The databases were not interoperable. Say the Customer DB had stored a name as William Roger, while the Sales Database had it recorded as Bill Roger (a similar scenario exists for Edward and Ed, Margaret and Maggy, Robert and Bob, to name a few). The system treated the two as independent records. Not only did this result in annoying redundant calls to some of the company’s best and highest-value customers, but it also led to a steady increase in the DND (Do Not Disturb) list, a marketer’s nightmare. The SVP of marketing stepped in to pull the plug. Not only did the well-preserved brand image go for a toss, but a golden opportunity to create a 360-degree view of the customer profile was lost by not focusing on data cleansing.
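To make the William/Bill problem concrete, here is a minimal Python sketch of nickname-aware matching across two departmental lists. The nickname table, the function names, and the record layout are all hypothetical, made up for illustration; a real program would rely on a curated nickname list and stronger keys such as email or phone number, since names alone are a weak identifier.

```python
# Hypothetical nickname table: a tiny illustrative sample, not a complete list.
NICKNAMES = {
    "bill": "william",
    "ed": "edward",
    "maggy": "margaret",
    "bob": "robert",
}


def canonical_name(full_name: str) -> str:
    """Lowercase the name and swap a known nickname in the first token."""
    parts = full_name.strip().lower().split()
    if not parts:
        return ""
    parts[0] = NICKNAMES.get(parts[0], parts[0])
    return " ".join(parts)


def merge_customers(*sources):
    """Group records from several departmental lists by canonical name."""
    merged = {}
    for source in sources:
        for record in source:
            key = canonical_name(record["name"])
            merged.setdefault(key, []).append(record)
    return merged


# Assumed record layout: one dict per customer, per department.
customer_db = [{"name": "William Roger", "dept": "Billing"}]
sales_db = [{"name": "Bill Roger", "dept": "Sales"}]

merged = merge_customers(customer_db, sales_db)
for name, records in merged.items():
    print(name, "appears in", [r["dept"] for r in records])
```

With both records grouped under one key, the campaign would place one call to this customer instead of two.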

Another example is one where an incomplete dataset produced a biased AI model and, in the process, further deepened the dark fissures in society. The Chicago Police Department piloted an algorithm that made a list of people deemed most at risk of being involved in a shooting, either as a victim or as a perpetrator. When details of the algorithm were made public, it turned out that 56 percent of Black men in the city aged 20 to 29 featured on the list. This model, built from an incomplete data set, was eventually scrapped. Crimes that could have been prevented and lives that could have been saved through accurate prediction were lost for want of data sanity.

The third scenario is about the failure of an advanced genome tracking program due to a loss of data focus. A communicable disease department developed an ambitious prescriptive model that would analyze the strain of TB bacteria in existing patients and map it to the county they reside in. With this model, the system could predict the likely strain of infection for a new patient and enable treatment to start with the right level of drugs instead of the time-consuming traditional approach of moving up from the first level of defense. The project used the Patient Database to get the address. At face value, everything was working well. However, the demographic information in the patient database was outdated, having been collected at enrolment. A patient might have moved several times without necessarily updating the hospital database. With this obsolete data, the project could not achieve even half of its well-intended goals.
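Decayed data of this kind can at least be flagged before it reaches a model. The Python sketch below assumes each patient record carries the date on which the address was last verified; that field, the one-year threshold, and the sample records are assumptions for illustration and are not details of the program described above.

```python
from datetime import date

# Assumed policy: an address not verified within a year is treated as stale.
MAX_AGE_DAYS = 365


def is_stale(last_verified: date, today: date, max_age_days: int = MAX_AGE_DAYS) -> bool:
    """Return True when the last verification is older than the allowed age."""
    return (today - last_verified).days > max_age_days


# Hypothetical patient records with an address verification date.
patients = [
    {"id": "P1", "county": "A", "address_verified": date(2015, 3, 1)},
    {"id": "P2", "county": "B", "address_verified": date(2023, 11, 20)},
]

today = date(2024, 1, 15)
usable = [p for p in patients if not is_stale(p["address_verified"], today)]
needs_review = [p for p in patients if is_stale(p["address_verified"], today)]

print("usable for county mapping:", [p["id"] for p in usable])
print("address needs re-verification:", [p["id"] for p in needs_review])
```

Records that fail the check are routed for re-verification and not dropped silently, so the model trains only on addresses someone has recently confirmed.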

While we can enumerate such scenarios endlessly, the fact remains that data cleansing is non-negotiable for the success of any AI model. We will keep falling into this trap as long as we approach data as a file or a standalone dataset. No longer is data a by-product of applications; instead, Data itself is a Product. A few attributes of such a product include source, lineage, transformations, security (encryption levels), access levels, and an expiry tag. It is a paradigm shift, and it needs to sink in. Organizations should make data discoverable and subject to centralized governance policies on relevance and usage. While there are many emerging tools in the market, such as Collibra, OneTrust, and Talend, make no mistake: no tool will fit your situation perfectly, and each will require heavy customization (in my opinion, MS Excel remains the single best tool, covering significant ground on data validations). Once such structure, policies, and governance are established, how data is treated and consumed will change radically from the paleolithic world.
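One way to picture data as a product is as a record that carries its own metadata. The Python sketch below models the attributes listed above (source, lineage, transformations, encryption level, access level, expiry tag) as a small dataclass. The class name, field names, default values, and sample entries are hypothetical, and this is not how Collibra, OneTrust, or Talend represent such metadata.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class DataProduct:
    """Illustrative metadata wrapper; every field name here is an assumption."""
    name: str
    source: str
    lineage: List[str] = field(default_factory=list)
    transformations: List[str] = field(default_factory=list)
    encryption: str = "none"
    access_level: str = "restricted"
    expires_on: Optional[date] = None

    def is_expired(self, today: date) -> bool:
        """A product with no expiry tag never expires."""
        return self.expires_on is not None and today > self.expires_on


# Hypothetical example entry in a data catalogue.
addresses = DataProduct(
    name="patient_addresses",
    source="enrolment_form",
    lineage=["enrolment_form", "patient_db"],
    transformations=["trimmed whitespace", "standardized county names"],
    encryption="AES-256",
    access_level="clinical-staff",
    expires_on=date(2024, 12, 31),
)

print(addresses.name, "expired:", addresses.is_expired(date(2025, 1, 1)))
```

A governance process could then refuse to serve any product whose expiry tag has passed, making the policy enforceable instead of advisory.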

So for those rooting for AI, let’s not get distracted by the glitter of a utopian dream; let’s focus hard on the underlying engine that fuels real growth. When personal computing shook the foundation of the mainframe, it was the GUI and event-driven architecture that gained momentum. With mobile computing, it was the app economy that flourished. Likewise, in the AI galaxy, it’s not the handful of ensemble algorithms but inexorably the Data that is driving the show.

Narayanan is the Data and Analytics Head of a Healthcare GCC. He is 9x certified in Cloud and blogs about emerging tech and strategies. He can be reached at