Why does TildeOpen not include all European languages?

#11
by mcanet - opened

Why do EU-funded European LLM projects not include training in all European languages, reflecting the continent’s full linguistic diversity?

At present, many European languages remain excluded, such as Catalan (10 million speakers), Galician (2.5 million), Occitan (1.5 million), Scots (1.5 million), Sardinian (1 million), Welsh (0.9 million), Basque (0.75 million), Venetian (0.7 million), Friulian (0.6 million), Ladin (0.2 million), Breton (0.2 million), Asturian (0.4 million), Aragonese (0.1 million), Kashubian (0.1 million), Romani (4–5 million), Aromanian/Vlach (0.1–0.2 million), Rusyn (0.1 million), Sámi languages (around 30,000), Yiddish (50,000), Cornish (a few thousand), and Manx (a few hundred).

If European language models are to truly represent and serve Europe, they should embrace this linguistic richness rather than overlook it.

It's a valid question. On one hand, one could argue that it is indeed a shortcoming: many languages are left out, and why is that?
On the other hand, formal support offered to score political points does not benefit the speakers of these languages. We trained TildeOpen to provide equal support for the languages we did cover. Compared to other LLMs, this results in text outputs of high linguistic quality for languages with a small speaker base, spoken by 170 million Europeans in total. This was already very challenging for some languages, such as Icelandic, Albanian, and Montenegrin. It definitely wouldn't work for Cornish, Welsh, or Scots, which have very little written data. Providing formal support when the quality is not there seems dishonest and minimizes the actual problem.

As for Catalan, there are already very good models out there for it, for example https://huggingface.co/BSC-LT/ALIA-40b
When we started our work, there was very weak language support in most of Europe east of Berlin.

Hope this clarifies.
