The AI industry's most urgent problem is not semiconductor shortages, energy costs, or regulatory friction. It is something far more fundamental: the world is running out of usable training data, and the consequences are arriving faster than most enterprise AI strategies have accounted for.
Why the Training Data Well Is Drying Up
Large language models learn by ingesting vast quantities of text, images, and code. Each successive generation demands significantly more training material than the last. GPT-3 trained on roughly 300 billion tokens. Its successors required trillions. The exponential appetite of these systems has outpaced the linear growth of the internet by a considerable margin.
Researchers at Epoch AI, one of the most credible independent research organisations tracking AI scaling trends, have published findings suggesting that the stock of high-quality text data could be substantially consumed within a few years. Low-quality data remains abundant, but feeding it into models introduces noise, bias, and degraded performance. The distinction between quantity and quality has become the central tension in AI development, not just in the laboratory but in every boardroom that has staked its digital strategy on continued model improvement.
This matters well beyond the research labs building foundation models. If the training data bottleneck cannot be resolved, the pace of improvement across every AI-powered product category, from automated translation and medical diagnostics to financial modelling and legal document review, will slow. For European enterprises that have made multiyear commitments to AI transformation programmes, that scenario is not a distant theoretical risk. It is a live planning assumption that deserves honest scrutiny.
The Synthetic Data Gamble
Faced with scarcity, labs have turned to a controversial solution: synthetic data, training material generated by AI models themselves. The logic is straightforward. If real-world data is insufficient, generate artificial substitutes that mimic its statistical properties at scale.
Companies including Nvidia, Google DeepMind, and a number of European AI developers have invested heavily in synthetic data pipelines. Early results are genuinely mixed. Synthetic data performs well for narrow, well-defined tasks such as code generation and mathematical reasoning. For open-ended language understanding, however, models trained primarily on synthetic material can develop subtle and compounding distortions, a phenomenon researchers have labelled model collapse.
Model collapse occurs when AI-generated content feeds back into training loops, gradually amplifying errors and reducing the diversity of expression. It is the machine learning equivalent of photocopying a photocopy: each generation loses fidelity. Several published studies have already demonstrated measurable degradation in models trained through multiple generations of synthetic content. The risk is not hypothetical; it is observable and documented.
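The photocopy analogy can be made concrete with a toy simulation. The sketch below is an illustrative assumption, not a real training pipeline: each "generation" is modelled as a Gaussian fitted to samples drawn from the previous generation's model. Because each refit is based on a finite sample, the estimated spread tends to shrink over repeated generations, which is the diversity loss the model collapse literature describes in miniature.

```python
import random
import statistics

# Toy model-collapse simulation: each generation is a Gaussian refitted
# to samples drawn from the previous generation's Gaussian. Repeated
# finite-sample re-estimation erodes the fitted spread -- the machine
# learning equivalent of photocopying a photocopy.

def refit_generations(mu, sigma, n_samples, n_generations, seed=0):
    """Repeatedly sample from the current model and refit mean/std."""
    rng = random.Random(seed)
    history = [(mu, sigma)]
    for _ in range(n_generations):
        samples = [rng.gauss(mu, sigma) for _ in range(n_samples)]
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)  # fitted spread this generation
        history.append((mu, sigma))
    return history

if __name__ == "__main__":
    hist = refit_generations(mu=0.0, sigma=1.0, n_samples=20, n_generations=200)
    for gen in (0, 50, 100, 200):
        print(f"generation {gen:3d}: fitted spread = {hist[gen][1]:.4f}")
```

Real training loops are vastly more complicated, but the direction of travel is the same: without fresh real-world data entering the loop, diversity decays rather than recovers.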
The table below summarises the principal trade-offs between real-world and synthetic training data approaches currently debated across the research community.
| Approach | Strengths | Weaknesses | Best Used For |
|---|---|---|---|
| Real-world data | Diverse, authentic, grounded | Finite, legally contested, expensive | General language understanding |
| Synthetic data | Scalable, controllable, cheap | Model collapse risk, low diversity | Code, maths, narrow tasks |
| Federated learning | Accesses private data without centralising it | Complex infrastructure, slower | Healthcare, finance, government |
| Active learning | Reduces data volume needed | Requires expert annotation | Specialised domains |

The Copyright Minefield Complicating European AI Development
Data scarcity has sharpened legal battles over training data that are already well advanced in Europe. Publishers, news organisations, and creative professionals have pursued litigation against AI companies for using copyrighted material without permission or compensation. These cases are actively reshaping what data can legally be used and at what cost, with European law providing a particularly consequential battleground.
The EU AI Act, which entered into force in August 2024, includes explicit provisions requiring AI developers to disclose the sources of their training data. Dragoș Tudorache, the Romanian MEP who served as one of the Act's lead negotiators in the European Parliament, has been unambiguous that transparency obligations are non-negotiable: providers of general-purpose AI models must publish sufficiently detailed summaries of the content used for training. That requirement alone is forcing a reckoning inside every lab that built its data pipeline on broad internet scraping without meticulous provenance tracking.
If European courts consistently rule against AI companies in copyright disputes, substantial portions of the internet's highest-quality content (journalism, academic writing, literary works) will move behind licensing walls. That would accelerate the data scarcity problem considerably and concentrate competitive advantage among the few organisations that secured licensing agreements early.
- United Kingdom: The government launched a consultation on copyright and AI training; tensions between the creative industries and technology sector remain unresolved as of mid-2025.
- European Union: The AI Act mandates training data transparency; ongoing legal challenges from publishers are testing its scope.
- Germany: German publishers have been among the most aggressive in pursuing licensing frameworks, reflecting the country's strong tradition of press rights.
- France: Mistral AI, headquartered in Paris, has publicly committed to building legally clean datasets, setting a precedent other European labs are watching closely.
- Switzerland: ETH Zurich researchers are actively developing privacy-preserving training methods that sidestep some data rights complications, a line of work attracting significant European Research Council funding.
This licensing race has created a new category of strategic asset. Organisations holding large, high-quality, legally clean datasets, whether hospitals, financial institutions, national broadcasters, or public research bodies, now find themselves with considerable leverage they did not previously recognise. European data protection frameworks, often portrayed as an obstacle to AI development, may in fact force the kind of rigorous data governance that makes those assets more valuable, not less.
The European and UK Multilingual Data Gap
Europe faces its own multilingual data problem, and it is more acute than the continent's reputation for linguistic diversity might suggest. English dominates existing training corpora. Models perform materially better in English than in most other European languages, including languages with tens of millions of speakers such as Polish, Dutch, Romanian, Greek, and Czech.
Yann LeCun, Chief AI Scientist at Meta and one of the field's most credible long-term voices, has repeatedly argued that the dominance of English in training data creates structurally weaker models for the majority of the world's population, a population that includes the majority of EU citizens. The practical consequence is a two-tier AI landscape: users operating in English receive demonstrably better performance from frontier models than users working in their native European languages.
Several European initiatives are attempting to close this gap, with varying levels of ambition and resource.
- European Language Grid: A European Commission-funded infrastructure project assembling multilingual datasets for EU languages, though coverage and quality remain uneven.
- BLOOM and OpenEuroLLM: Collaborative open-source model efforts involving European research institutions, specifically designed to improve non-English language performance.
- UK Research and Innovation (UKRI): Has funded several academic programmes on low-resource language modelling, with particular attention to Welsh, Scots Gaelic, and minority languages of Commonwealth communities.
- Germany and France: National AI strategies from both countries have identified language model competitiveness as a sovereignty concern, with state-backed investment in German- and French-language corpora.
These efforts remain modest in scale compared to the resources available to frontier labs in the United States and China. The gap is not purely financial. European data governance architecture, while sound in principle, has in practice created barriers to the aggregation and curation of data at the volumes modern models require. Federated approaches and privacy-preserving techniques offer a partial answer, but they introduce their own infrastructure complexity.
What the Industry Is Doing About It
The search for solutions extends well beyond synthetic data and licensing negotiations. Federated learning allows models to train on distributed datasets without centralising them, potentially unlocking private data held by hospitals, banks, and public bodies. This approach is directly relevant to Europe's legal environment, where GDPR constraints make centralised data aggregation difficult or impossible in many sectors.
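The core idea of federated learning can be sketched in a few lines. The following is a minimal, assumption-laden illustration of federated averaging (FedAvg-style), not any particular framework's API: the "model" is a single linear weight vector, the "clients" hold small private datasets, and in each round every client takes a local gradient step while only the updated weights, never the raw records, travel back to the server for averaging.

```python
from typing import List, Tuple

# Minimal federated averaging sketch. Each client trains on its own
# data locally; the server only ever sees weight vectors, never the
# underlying records -- the property that matters under GDPR-style
# constraints on centralised data aggregation.

def local_step(weights: List[float],
               data: List[Tuple[Tuple[float, ...], float]],
               lr: float = 0.1) -> List[float]:
    """One pass of least-squares gradient descent on a client's private data."""
    w = list(weights)
    for x, y in data:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

def fed_avg_round(global_weights, client_datasets):
    """Each client trains locally; the server averages the returned weights."""
    client_weights = [local_step(global_weights, d) for d in client_datasets]
    n = len(client_weights)
    return [sum(ws) / n for ws in zip(*client_weights)]

if __name__ == "__main__":
    # Two hypothetical "hospitals" whose private data is consistent with
    # the target relationship y = 2*x1 + 1*x2.
    clients = [
        [((1.0, 0.0), 2.0), ((0.0, 1.0), 1.0)],
        [((1.0, 1.0), 3.0), ((2.0, 0.0), 4.0)],
    ]
    w = [0.0, 0.0]
    for _ in range(200):
        w = fed_avg_round(w, clients)
    print("learned weights:", [round(wi, 2) for wi in w])
```

The infrastructure cost the article mentions lives in what this sketch omits: secure aggregation, stragglers, heterogeneous data distributions, and communication overhead.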
Active learning techniques help models identify and request only the most informative training examples, reducing total data requirements substantially. Architectural innovations are also emerging: Google DeepMind's Gemini and Anthropic's Claude have both demonstrated improved data efficiency relative to earlier model generations, extracting more learning signal from the same volume of training material. Whether efficiency gains can keep pace with the industry's scaling ambitions remains genuinely uncertain.
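The selection rule behind active learning is simple enough to show directly. The sketch below uses uncertainty sampling, one common active learning strategy; the pool, scoring function, and budget are illustrative assumptions. Rather than labelling everything, the model requests annotation only for the examples it is least certain about, those with predicted probability closest to 0.5.

```python
# Uncertainty-sampling sketch: send annotators only the pool items the
# current model is least confident about, i.e. those whose predicted
# probability for the positive class sits nearest 0.5.

def select_for_labelling(pool, predict_proba, budget):
    """Return the `budget` pool items with the smallest prediction margin."""
    scored = [(abs(predict_proba(x) - 0.5), x) for x in pool]
    scored.sort(key=lambda pair: pair[0])  # smallest margin = most uncertain
    return [x for _, x in scored[:budget]]

if __name__ == "__main__":
    pool = [0.05, 0.2, 0.45, 0.5, 0.55, 0.8, 0.95]
    # Toy stand-in model: predicted probability equals the feature value.
    chosen = select_for_labelling(pool, predict_proba=lambda x: x, budget=3)
    print("examples sent to annotators:", sorted(chosen))
```

The data saving comes from the budget: confidently classified examples are never sent for expensive expert annotation, which is why the approach suits the specialised domains noted in the table above.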
There is a harder question that receives insufficient attention in the commercial AI conversation: what happens to AI capabilities if the data problem is not solved? The risk is not that AI stops working. It is that progress plateaus at precisely the moment when enormous capital investments have been committed on the assumption of continued improvement. That is a scenario with serious consequences for every business and government that has built its AI strategy around the expectation of perpetual capability gains.