By Chris Kemp, of Kemp IT Law LLP
We’re all familiar with sayings like “it’s garbage in, garbage out with AI”, “it’s only as good as the data it’s trained on”, and “when bad data meets good AI, bad data usually wins”. Lines like these are ubiquitous in the AI era. But who actually knows what “good data for AI” looks like? The author, a London-based tech lawyer often in the company of technical types for whom this is all second nature, has found the following a helpful way of explaining things.
It helps that there’s a growing corpus of accessible guidance and commentary on this very point. In the UK, as in many places, the focus is often on government data, governments generally being custodians of the oldest, largest and most unwieldy datasets, and keenly aware of the untapped social and economic upside of successfully getting them “ready for AI”.
Two recent examples are the UK Open Data Institute’s (“ODI”) A Framework for AI-ready data (May 2025)[1] and the UK government Department for Science, Innovation & Technology’s (“DSIT”) Guidelines and best practices for making government datasets ready for AI (January 2026)[2]. DSIT’s guidelines take the ODI’s four-component framework for “the AI-readiness of a dataset” and apply it in a more granular way to public sector datasets. Of course, these reports will be just as helpful to private sector organisations looking for a way to start rationalising their data estates for the AI age.
DSIT lands on a “four pillars of AI-ready data” edifice as follows:[3]

Pillar 1: Technical Optimisation – the basic question here is: “can AI actually use that dataset?”. It focuses on the efficient data formats, APIs and infrastructure that will let the data get to the models quickly and reliably. There’s some interesting nuance here. For example, when thinking about processing data at “scale”, DSIT’s report looks beyond “database size” and notes that “scale isn’t just about how large the database is; it’s about minimising data movement. Reducing unnecessary data transfers is critical for maintaining efficiency and throughput.”
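To make the “minimising data movement” point concrete, here is a minimal sketch. The table and column names are invented for illustration; the pattern is simply that filtering and column selection are pushed down to the data store, so only the data the model pipeline actually needs ever travels across the wire.

```python
import sqlite3

# Hypothetical stand-in for a large operational dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER, label TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [(i, "spam" if i % 2 else "ham", "x" * 1000) for i in range(1000)],
)

# Anti-pattern: pull every column of every row, then filter in application code.
all_rows = conn.execute("SELECT * FROM records").fetchall()
spam_ids_slow = [r[0] for r in all_rows if r[1] == "spam"]

# Better: push the filter and the column selection to the store, so the
# bulky "body" column and the unwanted rows never move at all.
spam_ids_fast = [
    r[0]
    for r in conn.execute("SELECT id FROM records WHERE label = 'spam' ORDER BY id")
]

assert spam_ids_slow == spam_ids_fast
```

Both queries produce the same answer; the difference is how much data had to move to get there, which is exactly the efficiency-and-throughput point DSIT is making.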
Pillar 2: Data & Metadata Quality – the next thing to think about is the data itself, starting from the ground up: “what datasets already exist?”. On data quality, even the best model can’t rescue missing, messy or mysterious data. Here the focus is on treating data as a product. This means clear ownership and sufficiently rich metadata so people (and machines) can tell what a dataset is, where it came from, and how far it can be trusted.
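What “sufficiently rich metadata” might look like in practice can be sketched as a simple record that ships alongside the dataset. The field names below are illustrative assumptions, not drawn from the DSIT or ODI reports; the point is that ownership, provenance and trustworthiness are captured in a form both humans and machines can read.

```python
from dataclasses import dataclass, field, asdict

# Hypothetical minimal "data as a product" metadata record.
@dataclass
class DatasetMetadata:
    name: str
    owner: str                # accountable team or person (clear ownership)
    source: str               # provenance: where the data came from
    licence: str              # terms under which it may be used
    last_validated: str       # ISO date of the last quality check
    known_issues: list = field(default_factory=list)  # honesty about trust

meta = DatasetMetadata(
    name="customer-orders",
    owner="data-platform-team",
    source="orders service, nightly export",
    licence="internal use only",
    last_validated="2026-01-15",
    known_issues=["pre-2020 rows lack currency codes"],
)

# Machine-readable form that can be published alongside the dataset itself.
record = asdict(meta)
```

A record like this is what lets a downstream AI team answer “what is this, where did it come from, and how far can it be trusted?” without tracking down the person who built the pipeline.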
Pillar 3: Organisation & Infrastructure Context – getting datasets ready for AI isn’t just a technical process like upgrading a database. This pillar gets to the human and organisational scaffolding: the governance, skills, documentation and ways of working that should lead to sustainable data practices within an organisation. An interesting point here is about silos – both data silos and siloed thinking when it comes to data strategy: “the most valuable AI insights often emerge from combining data across organisational boundaries. AI-ready datasets should therefore support governed information sharing and collaboration while enforcing purpose limitation and access controls.”
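The “purpose limitation and access controls” idea in that quotation can be sketched very simply: each dataset declares the purposes it may be used for, and a gate checks every request against that declaration. The dataset and purpose names here are hypothetical.

```python
# Hypothetical purpose-limitation register: each dataset declares the
# purposes for which it may lawfully be used.
ALLOWED_PURPOSES = {
    "customer-orders": {"fraud-detection", "demand-forecasting"},
    "support-tickets": {"service-improvement"},
}

def may_use(dataset: str, purpose: str) -> bool:
    """Allow access only if the stated purpose is declared for the dataset."""
    return purpose in ALLOWED_PURPOSES.get(dataset, set())

assert may_use("customer-orders", "fraud-detection")
assert not may_use("customer-orders", "marketing")
```

Real implementations sit inside data platforms and identity systems rather than a Python dictionary, but the governance logic (sharing across boundaries is permitted, yet gated by declared purpose) is the same.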
Pillar 4: Legal, Security & Ethical Compliance – finally, the legals. And it does not escape the author’s notice that this section comes last. That is, there is a lot to think about before the entry-level legal question even arises: are you allowed to use this data, for this AI system, in this way? This pillar groups the privacy, security and ethics work needed so that datasets are not only ready as a technical matter, but also lawful to use and capable of surviving contact with regulators, journalists, users and the like.
“Making datasets ready for AI” sounds like fairly abstract work. But the DSIT report shows that it boils down to four very human questions. Can your systems get to the data in the first place? Do you understand and trust the data? Do you have the people and organisational structures to use it in a sustainable way over time? And are you genuinely allowed to use it in the way the AI team wants?
The public sector guidance is written with government datasets in mind, but the questions are still relevant for the private sector. If you are an enterprise trying to rationalise years of operational databases, SaaS exports and neglected data lakes, treating your key datasets as products and stress-testing them against DSIT’s four pillars is a good place to start. Likewise, if you’re a lawyer looking for a way to understand what the “garbage in, garbage out” cliché means in practice, DSIT’s pillars are a great non-technical explainer.
[3] See here p. 3. Note the DSIT report is published under the UK OGL (see p. 2).
