Why Data, Not Compute, Is the Real Bottleneck for Europe's Multilingual AI Ambitions

The public debate about large language models obsesses over parameter counts and benchmark scores. But practitioners building multilingual AI systems for Europe's diverse linguistic landscape know the harder truth: authentic, representative datasets are the defining infrastructure problem, and the gap between English and every other language is still vast.

Data curation, not compute power, is the defining bottleneck for anyone serious about building high-quality AI systems in languages other than English. That is the consensus among European AI researchers and practitioners who have spent years trying to close the gap, and it applies with particular force to any language community that has been slower to digitise its written and spoken culture at scale.

The lesson is directly relevant to European policymakers and AI developers. The continent hosts dozens of official languages, hundreds of regional varieties, and vast differences in how much machine-readable text exists for each. Minority and regional languages face precisely the same structural disadvantage that Arabic confronts at a global level: the raw material for training competitive language models simply is not there, or is fragmented, proprietary, and poorly documented.


The Scale and Scope Challenge

The fundamental imbalance is straightforward. The English-language internet aggregates centuries of published literature, technical documentation, professional content, and user-generated material from multiple countries and generations. No other language comes close in terms of the volume of high-quality, machine-readable text available for model training. Estimates suggest that total publicly available high-quality non-English text represents a small fraction of equivalent English content, and the gap is felt most acutely in specialised domains: medicine, law, finance, and public administration.

For Europe, this matters enormously. The EU AI Act creates regulatory expectations around model performance, transparency, and non-discrimination that implicitly require AI systems deployed in member states to work reliably in local languages. A model that performs well in English but poorly in Polish, Romanian, or Catalan is not merely a commercial problem; under the Act's requirements around fundamental rights impact assessments and high-risk system deployment, it is a compliance risk.

Margrethe Vestager, in her final months as European Commission Executive Vice-President overseeing digital policy, repeatedly framed data access and data quality as prerequisites for European AI sovereignty. That framing was correct, but the policy response has been slow. The Common European Data Spaces initiative aims to make sectoral data more accessible, but progress in making that data machine-learning-ready for language model development has been limited.

Yoshua Bengio, whose research group at the Montreal Institute for Learning Algorithms has shaped thinking on dataset bias and model alignment, has argued that linguistic underrepresentation in training data is not simply a technical inconvenience but a form of embedded inequality. Models trained overwhelmingly on English-language material will reflect English-language cultural frameworks, reference points, and assumptions, even when generating text in other languages. That structural bias is difficult to audit and harder still to correct after the fact.

The Dialectal and Regional Diversity Problem

Europe's linguistic landscape presents challenges that mirror those faced by teams working on Arabic, though they are often discussed less frankly. Standard written German coexists with Bavarian, Swiss German, and Austrian varieties that diverge substantially in phonology, grammar, and lexicon. French spoken in metropolitan Paris differs from French in Marseille, Brussels, or the overseas territories. These are not merely accent differences; they represent genuine variation in vocabulary, idiom, and pragmatic convention that affects how AI models perform in practice.

The same pattern holds at the level of minority and regional languages. Breton, Welsh, Scottish Gaelic, Basque, Occitan, Sorbian, and dozens of other languages have small communities of speakers, limited digital text, and almost no representation in the training corpora of commercially deployed large language models. Welsh is a partial exception: the Welsh Government has invested in digital language resources, and S4C's archive provides some broadcast material. But even Welsh, with active institutional support, is nowhere near parity with English in terms of available training data.

Models trained primarily on majority-language data will exhibit systematic performance degradation when processing minority varieties. This is not a hypothetical concern. It has been documented by researchers at institutions including ETH Zurich, where work on multilingual model evaluation has shown measurable performance gaps even across the Swiss national languages, with the country's smaller linguistic minorities faring worse still. Deploying such models in public sector contexts, where they might assist with benefit claims, legal documents, or healthcare queries, raises immediate questions of fairness and equal treatment under EU law.

[Image: researchers at a European computational linguistics facility reviewing annotated text data on monitor arrays.]

The Translation Trap

One of the most widespread and most problematic shortcuts in building non-English datasets is translating existing English datasets into other languages. The logic is seductive: English-language datasets for instruction tuning, alignment, and domain specialisation have been extensively developed and validated. Machine translation has become good enough to produce grammatically plausible output. Why not leverage this work?

The answer is that translation from English is not a neutral substitution of words. English and other European languages embody different conceptual frameworks, formal conventions, and cultural reference points. An instruction-following dataset translated from English into French will often contain framings that feel alien to native French speakers, not because the translation is technically wrong, but because the underlying assumptions are culturally specific to English-speaking contexts.

At scale, across thousands of training examples, these small dissonances accumulate. The resulting model generates text that is technically fluent but pragmatically odd: it speaks French using English logic. For instruction-tuning and alignment tasks, this matters enormously because the model learns not just language patterns but the cultural and conceptual frameworks embedded in its training data. Translated datasets encode the knowledge hierarchies, reference points, and evaluative frameworks of English-language sources into models serving other linguistic communities.

The OECD AI Policy Observatory has flagged this issue in its analysis of multilingual AI deployment, noting that translation-based dataset construction risks embedding cultural bias at a structural level that is difficult to detect through standard benchmark evaluation. Benchmarks themselves are often translated from English, meaning they test a model's ability to reflect English-language assumptions back in another language, rather than its ability to reason authentically within a different cultural framework.
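One partial mitigation some teams apply is a lightweight screen for translation artifacts before examples enter a training mix. The sketch below is a minimal illustration; the anglicism lexicon and threshold are hypothetical placeholders, and a real screen would use per-language marker lists curated by native editors, likely alongside a statistical translationese classifier.

```python
import re

# Hypothetical markers of "English logic" in French text: calques and
# borrowings that native editors often flag. A real lexicon would be
# curated per language and far larger than this placeholder.
ANGLICISM_MARKERS = {"définitivement", "supporter", "digital", "opportunité"}

def translationese_score(text: str) -> float:
    """Fraction of tokens matching known translation-artifact markers."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in ANGLICISM_MARKERS for t in tokens) / len(tokens)

def screen(examples: list[str], threshold: float = 0.02) -> list[str]:
    """Keep examples below the marker threshold; route the rest to review."""
    return [e for e in examples if translationese_score(e) < threshold]
```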

Practitioners who have worked on genuinely native-language datasets consistently report that authentic approaches (curating real content, generating natively authored synthetic examples, and validating against native-speaking judges) produce models that feel more natural to users, even when the volume of authentic data is smaller than what translation could provide.
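What that native validation step looks like varies by team; one common shape is a quorum rule, where an example enters the training set only after independent approval by multiple native speakers. A minimal sketch, with the data structure and quorum threshold as assumptions:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """A candidate training example plus its native-reviewer verdicts."""
    text: str
    approvals: int = 0
    rejections: int = 0

def accepted(c: Candidate, quorum: int = 2) -> bool:
    """Accept only examples approved by a quorum and rejected by no one."""
    return c.approvals >= quorum and c.rejections == 0

# Usage: filter a reviewed batch down to the examples that cleared review.
batch = [Candidate("Exemple validé", approvals=3),
         Candidate("Exemple contesté", approvals=2, rejections=1)]
kept = [c for c in batch if accepted(c)]  # keeps only the first example
```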

Synthetic Data and Domain Specialisation

Faced with data scarcity, teams increasingly turn to synthetic data generation: using existing large language models to create training examples for more specialised or linguistically focused successors. The most systematic efforts have targeted domain specialisation where authentic data is particularly scarce.

Medical language is a clear example. Healthcare professionals across Europe use a mixture of formal medical terminology, national language conventions, and colloquial patient-facing language that varies considerably by country and region. Building comprehensive datasets for medical AI applications requires examples spanning diagnostic conversations, clinical documentation, patient education materials, and administrative communication. Authentic medical text in local languages is limited and often proprietary, held by healthcare organisations that have legitimate reasons for not releasing patient-related records.

Synthetic generation using frontier models has become a pragmatic workaround. Large-scale synthetic medical dialogue datasets have been constructed by research teams using GPT-4o and similar systems, creating training material that would otherwise not exist. These datasets allow models to achieve basic competence in medical language without waiting for healthcare organisations to release proprietary archives.
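As a concrete illustration, the generation step is often little more than repeated prompting of a frontier model through its API. The sketch below assumes the OpenAI Python client and access to a gpt-4o model; the prompt, language, and clinical scenario are invented for illustration, and real pipelines add clinician review, deduplication, and provenance tagging before anything reaches training.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Write a short, realistic dialogue in German between a patient and a "
    "general practitioner about managing type 2 diabetes. Use everyday "
    "patient language rather than textbook terminology."
)

def generate_dialogue(model: str = "gpt-4o") -> str:
    """Request one synthetic doctor-patient dialogue from the model."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0.9,  # higher temperature for more varied examples
    )
    return response.choices[0].message.content

dialogues = [generate_dialogue() for _ in range(3)]  # scale up in practice
```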

Yet synthetic medical data carries implicit limitations. The synthetic content reflects the training of the models that generated it, which means Western biomedical frameworks and English-influenced medical terminology. Synthetic examples may miss regional variations in clinical practice, vernacular terms used by patients in specific communities, or the particular conventions of medical communication in different national health systems. A synthetic dataset generated for training a model on UK medical English will encode NHS conventions and NHS patient interaction norms. Applying that dataset to train a model for deployment in the German statutory health insurance system introduces assumptions that may not hold.

More broadly, the growing reliance on synthetic data creates a subtle feedback loop. If the majority of training data for a particular domain is synthetic and derived from English-trained models, the resulting model will reflect English-language patterns and framings more than authentic local practice. The intervention that looks like a solution is also a mechanism for perpetuating the original bias.

The Documentation and Sharing Gap

Paradoxically, one of the most significant dataset challenges facing European AI development is not technical but organisational: the absence of shared, well-documented language datasets covering the continent's full linguistic range. The machine learning field has developed strong norms around releasing datasets with detailed documentation covering sources, composition, known biases, and usage rights. Hugging Face provides a central repository where researchers can share datasets alongside datasheets and evaluation benchmarks. These norms exist because the English NLP community learned, over years, that shared infrastructure accelerates research faster than any individual organisation's proprietary advantage.

The European multilingual ecosystem remains far less mature. Organisations building language models for specific national markets often treat their datasets as competitive advantages, keeping them proprietary. This is commercially rational: a high-quality dataset in, say, Finnish or Romanian represents substantial investment and genuine differentiation. But it means the broader research community cannot access, validate, or build upon this work. Different teams independently solve similar problems in isolation. Data quality standards, dialectal representation choices, and evaluation methodologies are not shared across the ecosystem. Progress is slower, more redundant, and less transparent than it needs to be.

The EU's Gaia-X initiative and the broader push for European data spaces were partly conceived to address this fragmentation, but the translation from policy ambition to usable machine-learning infrastructure has been halting. Mistral AI, the Paris-based frontier model developer, has invested substantially in multilingual capability and has been vocal about the need for European-origin training data, but even Mistral's open model releases do not fully solve the documentation and community infrastructure problem for smaller language communities.

What the field needs is not just more data but better data governance: communities that maintain shared benchmarks, documentation standards that describe dataset composition with enough specificity for meaningful reproducibility, and open-source tooling for data curation and validation that does not require a large commercial team to deploy. Building this infrastructure represents one of the highest-leverage investments the European AI community could make. It remains underfunded relative to its importance, partly because it does not produce the kind of benchmark-topping results that generate media coverage and investor attention.
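To make "enough specificity for meaningful reproducibility" concrete, a datasheet can be a small machine-readable record rather than free-form prose. The field names below are illustrative assumptions, not an existing standard, and the dataset itself is hypothetical:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Datasheet:
    """Minimal machine-readable dataset documentation (illustrative fields)."""
    name: str
    language: str
    dialect_coverage: list[str]
    n_examples: int
    pct_authentic: float    # human-authored in the target language
    pct_translated: float   # machine-translated from another language
    pct_synthetic: float    # model-generated
    known_biases: list[str]
    license: str

sheet = Datasheet(
    name="example-finnish-instructions",  # hypothetical dataset
    language="fi",
    dialect_coverage=["standard written Finnish"],
    n_examples=120_000,
    pct_authentic=0.55,
    pct_translated=0.25,
    pct_synthetic=0.20,
    known_biases=["urban registers over-represented"],
    license="CC-BY-4.0",
)
print(json.dumps(asdict(sheet), indent=2, ensure_ascii=False))
```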

What European Public Sector Procurement Should Demand

Public sector bodies across the EU and UK are the largest single buyers of AI systems, and their procurement decisions shape what the market produces. At present, most public sector AI tenders focus on benchmark performance, security certification, and price. Very few require suppliers to document the linguistic composition of their training data, disclose the proportion of synthetic versus authentic content, or demonstrate that their models have been validated by native speakers of the target language.

This is a missed opportunity. Public sector procurement has historically driven standards in markets from pharmaceuticals to construction. There is no structural reason why it cannot drive standards in AI dataset quality and linguistic representation. Requiring suppliers to publish dataset datasheets, disclose translation ratios, and demonstrate performance on authentic regional language benchmarks would create commercial incentives to invest in the authentic data curation that the field currently under-produces.
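The translation-ratio disclosure in particular is cheap to produce if suppliers tag each training example with its provenance at curation time; the aggregation is a few lines. A sketch, with the tag vocabulary as an assumption:

```python
from collections import Counter

def composition_report(provenance_tags: list[str]) -> dict[str, float]:
    """Aggregate per-example provenance tags into disclosure ratios."""
    counts = Counter(provenance_tags)
    total = sum(counts.values())
    return {tag: count / total for tag, count in sorted(counts.items())}

# Hypothetical training set of 1,000 tagged examples:
tags = ["authentic"] * 550 + ["translated"] * 250 + ["synthetic"] * 200
print(composition_report(tags))
# {'authentic': 0.55, 'synthetic': 0.2, 'translated': 0.25}
```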

The UK's AI Safety Institute has begun developing evaluation frameworks for large language models, but its published work has focused primarily on safety and capability rather than linguistic equity and dataset provenance. Expanding that remit, or creating a parallel evaluation track within the EU AI Office, would be a concrete policy intervention with measurable benefits for multilingual performance and representational fairness.

The fundamental lesson from those who have tried to build high-quality non-English AI systems at scale is consistent: compute is not the constraint, and algorithmic innovation is not the constraint. Data quality, data authenticity, and data sharing infrastructure are the constraints. Solving them requires deliberate investment, shared community norms, and procurement policies that reward substance over surface-level benchmark performance. Europe has the institutional capacity to lead on all three. The question is whether it will choose to.

