European large language model (LLM) projects that do not train on all European languages, and so do not reflect the continent’s full linguistic diversity, fail to truly represent Europe. Many important European languages are still missing, including Catalan (10 million speakers), Galician (2.5 million), Occitan (1.5 million), Scots (1.5 million), Sardinian (1 million), Welsh (0.9 million), Basque (0.75 million), Venetian (0.7 million), Friulian (0.6 million), Ladin (0.2 million), Breton (0.2 million), Asturian (0.4 million), Aragonese (0.1 million), Kashubian (0.1 million), Romani (4–5 million), Aromanian/Vlach (0.1–0.2 million), Rusyn (0.1 million), Sámi languages (about 30,000), Yiddish (50,000), Cornish (a few thousand), and Manx (a few hundred).
It is particularly surprising that a model trained at the Barcelona Supercomputing Center does not include Catalan, a language spoken by around 10 million people and central to Catalonia’s cultural and linguistic identity.