The AI industry's most urgent problem is not semiconductor shortages, energy costs, or regulatory friction. It is something far more fundamental: the world is running out of usable training data, and the consequences are arriving faster than most enterprise AI strategies have accounted for.
Why the Training Data Well Is Drying Up
Large language models learn by ingesting vast quantities of text, images, and code. Each successive generation demands significantly more training material than the last. GPT-3 trained on roughly 300 billion tokens. Its successors required trillions. The exponential appetite of these systems has outpaced the linear growth of the internet by a considerable margin.
Researchers at Epoch AI, one of the most credible independent research organisations tracking AI scaling trends, have published findings suggesting that the stock of high-quality text data could be substantially consumed within a few years. Low-quality data remains abundant, but feeding it into models introduces noise, bias, and degraded performance. The distinction between quantity and quality has become the central tension in AI development, not just in the laboratory but in every boardroom that has staked its digital strategy on continued model improvement.
This matters well beyond the research labs building foundation models. If the training data bottleneck cannot be resolved, the pace of improvement across every AI-powered product category, from automated translation and medical diagnostics to financial modelling and legal document review, will slow. For European enterprises that have made multiyear commitments to AI transformation programmes, that scenario is not a distant theoretical risk. It is a live planning assumption that deserves honest scrutiny.
The Synthetic Data Gamble
Faced with scarcity, labs have turned to a controversial solution: synthetic data, training material generated by AI models themselves. The logic is straightforward. If real-world data is insufficient, generate artificial substitutes that mimic its statistical properties at scale.
Companies including Nvidia, Google DeepMind, and a number of European AI developers have invested heavily in synthetic data pipelines. Early results are genuinely mixed. Synthetic data performs well for narrow, well-defined tasks such as code generation and mathematical reasoning. For open-ended language understanding, however, models trained primarily on synthetic material can develop subtle and compounding distortions, a phenomenon researchers have labelled model collapse.
Model collapse occurs when AI-generated content feeds back into training loops, gradually amplifying errors and reducing the diversity of expression. It is the machine learning equivalent of photocopying a photocopy: each generation loses fidelity. Several published studies have already demonstrated measurable degradation in models trained through multiple generations of synthetic content. The risk is not hypothetical; it is observable and documented.
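The photocopy analogy can be made concrete with a toy simulation. The sketch below is an illustrative assumption, not a real training pipeline: each "generation" is modelled as a Gaussian fitted to samples drawn from the previous generation's model. Because each refit is based on a finite sample, the estimated spread tends to shrink over repeated generations, which is the diversity loss the model collapse literature describes in miniature.

```python
import random
import statistics

# Toy model-collapse simulation: each generation is a Gaussian refitted
# to samples drawn from the previous generation's Gaussian. Repeated
# finite-sample re-estimation erodes the fitted spread -- the machine
# learning equivalent of photocopying a photocopy.

def refit_generations(mu, sigma, n_samples, n_generations, seed=0):
    """Repeatedly sample from the current model and refit mean/std."""
    rng = random.Random(seed)
    history = [(mu, sigma)]
    for _ in range(n_generations):
        samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)  # fitted spread this generation
        history.append((mu, sigma))
    return history

if __name__ == "__main__":
    hist = refit_generations(mu=0.0, sigma=1.0, n_samples=20, n_generations=200)
    for gen in (0, 50, 100, 200):
        print(f"generation {gen:3d}: fitted spread = {hist[gen][1]:.4f}")
```

Real training loops are vastly more complicated, but the direction of travel is the same: without fresh real-world data entering the loop, diversity decays rather than recovers.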
The table below summarises the principal trade-offs between real-world and synthetic training data approaches currently debated across the research community.
| Approach | Strengths | Weaknesses | Best Used For |
|---|---|---|---|
| Real-world data | Diverse, authentic, grounded | Finite, legally contested, expensive | General language understanding |
| Synthetic data | Scalable, controllable, cheap | Model collapse risk, low diversity | Code, maths, narrow tasks |
| Federated learning | Accesses private data without centralising it | Complex infrastructure, slower | Healthcare, finance, government |
| Active learning | Reduces data volume needed | Requires expert annotation | Specialised domains |

The Copyright Minefield Complicating European AI Development
Data scarcity has sharpened legal battles over training data that are already well advanced in Europe. Publishers, news organisations, and creative professionals have pursued litigation against AI companies for using copyrighted material without permission or compensation. These cases are actively reshaping what data can legally be used and at what cost, with European law providing a particularly consequential battleground.
The EU AI Act, which entered into force in August 2024, includes explicit provisions requiring AI developers to disclose the sources of their training data. Dragoș Tudorache, the Romanian MEP who served as one of the Act's lead negotiators in the European Parliament, has been unambiguous that transparency obligations are non-negotiable: providers of general-purpose AI models must publish sufficiently detailed summaries of the content used for training. That requirement alone is forcing a reckoning inside every lab that built its data pipeline on broad internet scraping without meticulous provenance tracking.
If European courts consistently rule against AI companies in copyright disputes, substantial portions of the internet's highest-quality content (journalism, academic writing, literary works) will move behind licensing walls. That would accelerate the data scarcity problem considerably and concentrate competitive advantage among the few organisations that secured licensing agreements early.
- United Kingdom: The government launched a consultation on copyright and AI training; tensions between the creative industries and technology sector remain unresolved as of mid-2025.
- European Union: The AI Act mandates training data transparency; ongoing legal challenges from publishers are testing its scope.
- Germany: German publishers have been among the most aggressive in pursuing licensing frameworks, reflecting the country's strong tradition of press rights.
- France: Mistral AI, headquartered in Paris, has publicly committed to building legally clean datasets, setting a precedent other European labs are watching closely.
- Switzerland: ETH Zurich researchers are actively developing privacy-preserving training methods that sidestep some data rights complications, a line of work attracting significant European Research Council funding.
This licensing race has created a new category of strategic asset. Organisations holding large, high-quality, legally clean datasets, whether hospitals, financial institutions, national broadcasters, or public research bodies, now find themselves with considerable leverage they did not previously recognise. European data protection frameworks, often portrayed as an obstacle to AI development, may in fact force the kind of rigorous data governance that makes those assets more valuable, not less.
The European and UK Multilingual Data Gap
Europe faces its own multilingual data problem, and it is more acute than the continent's reputation for linguistic diversity might suggest. English dominates existing training corpora. Models perform materially better in English than in most other European languages, including languages with tens of millions of speakers such as Polish, Dutch, Romanian, Greek, and Czech.
Yann LeCun, Chief AI Scientist at Meta and one of the field's most credible long-term voices, has repeatedly argued that the dominance of English in training data creates structurally weaker models for the majority of the world's population, a population that includes the majority of EU citizens. The practical consequence is a two-tier AI landscape: users operating in English receive demonstrably better performance from frontier models than users working in their native European languages.
Several European initiatives are attempting to close this gap, with varying levels of ambition and resource.
- European Language Grid: A European Commission-funded infrastructure project assembling multilingual datasets for EU languages, though coverage and quality remain uneven.
- BLOOM and OpenEuroLLM: Collaborative open-source model efforts involving European research institutions, specifically designed to improve non-English language performance.
- UK Research and Innovation (UKRI): Has funded several academic programmes on low-resource language modelling, with particular attention to Welsh, Scots Gaelic, and minority languages of Commonwealth communities.
- Germany and France: National AI strategies from both countries have identified language model competitiveness as a sovereignty concern, with state-backed investment in German- and French-language corpora.
These efforts remain modest in scale compared to the resources available to frontier labs in the United States and China. The gap is not purely financial. European data governance architecture, while sound in principle, has in practice created barriers to the aggregation and curation of data at the volumes modern models require. Federated approaches and privacy-preserving techniques offer a partial answer, but they introduce their own infrastructure complexity.
What the Industry Is Doing About It
The search for solutions extends well beyond synthetic data and licensing negotiations. Federated learning allows models to train on distributed datasets without centralising them, potentially unlocking private data held by hospitals, banks, and public bodies. This approach is directly relevant to Europe's legal environment, where GDPR constraints make centralised data aggregation difficult or impossible in many sectors.
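The core idea of federated learning can be sketched in a few lines. The following is a minimal, assumption-laden illustration of federated averaging (FedAvg-style), not any particular framework's API: the "model" is a single linear weight vector, the "clients" hold small private datasets, and in each round every client takes a local gradient step while only the updated weights, never the raw records, travel back to the server for averaging.

```python
from typing import List, Tuple

# Minimal federated averaging sketch. Each client trains on its own
# data locally; the server only ever sees weight vectors, never the
# underlying records -- the property that matters under GDPR-style
# constraints on centralised data aggregation.

def local_step(weights: List[float],
               data: List[Tuple[Tuple[float, ...], float]],
               lr: float = 0.1) -> List[float]:
    """One pass of least-squares gradient descent on a client's private data."""
    w = list(weights)
    for x, y in data:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

def fed_avg_round(global_weights, client_datasets):
    """Each client trains locally; the server averages the returned weights."""
    client_weights = [local_step(global_weights, d) for d in client_datasets]
    n = len(client_weights)
    return [sum(ws) / n for ws in zip(*client_weights)]

if __name__ == "__main__":
    # Two hypothetical "hospitals" whose private data is consistent with
    # the target relationship y = 2*x1 + 1*x2.
    clients = [
        [((1.0, 0.0), 2.0), ((0.0, 1.0), 1.0)],
        [((1.0, 1.0), 3.0), ((2.0, 0.0), 4.0)],
    ]
    w = [0.0, 0.0]
    for _ in range(200):
        w = fed_avg_round(w, clients)
    print("learned weights:", [round(wi, 2) for wi in w])
```

The infrastructure cost the article mentions lives in what this sketch omits: secure aggregation, stragglers, heterogeneous data distributions, and communication overhead.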
Active learning techniques help models identify and request only the most informative training examples, reducing total data requirements substantially. Architectural innovations are also emerging: Google DeepMind's Gemini and Anthropic's Claude have both demonstrated improved data efficiency relative to earlier model generations, extracting more learning signal from the same volume of training material. Whether efficiency gains can keep pace with the industry's scaling ambitions remains genuinely uncertain.
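The selection rule behind active learning is simple enough to show directly. The sketch below uses uncertainty sampling, one common active learning strategy; the pool, scoring function, and budget are illustrative assumptions. Rather than labelling everything, the model requests annotation only for the examples it is least certain about, those with predicted probability closest to 0.5.

```python
# Uncertainty-sampling sketch: send annotators only the pool items the
# current model is least confident about, i.e. those whose predicted
# probability for the positive class sits nearest 0.5.

def select_for_labelling(pool, predict_proba, budget):
    """Return the `budget` pool items with the smallest prediction margin."""
    scored = [(abs(predict_proba(x) - 0.5), x) for x in pool]
    scored.sort(key=lambda pair: pair[0])  # smallest margin = most uncertain
    return [x for _, x in scored[:budget]]

if __name__ == "__main__":
    pool = [0.05, 0.2, 0.45, 0.5, 0.55, 0.8, 0.95]
    # Toy stand-in model: predicted probability equals the feature value.
    chosen = select_for_labelling(pool, predict_proba=lambda x: x, budget=3)
    print("examples sent to annotators:", sorted(chosen))
```

The data saving comes from the budget: confidently classified examples are never sent for expensive expert annotation, which is why the approach suits the specialised domains noted in the table above.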
There is a harder question that receives insufficient attention in the commercial AI conversation: what happens to AI capabilities if the data problem is not solved? The risk is not that AI stops working. It is that progress plateaus at precisely the moment when enormous capital investments have been committed on the assumption of continued improvement. That is a scenario with serious consequences for every business and government that has built its AI strategy around the expectation of perpetual capability gains.