An Analysis of Multilingual Models on Hugging Face

Published September 18, 2025

Many people recognize that low-resource languages exist, but few have to face the reality of working on languages other than English, Chinese, and a handful of European languages. Here, we analyze the models on Hugging Face to get a sense of the landscape of available models for a wide range of languages. The main takeaways: for many languages, there are extremely few models; the models that do exist are not necessarily tailored to the language, since they are often massively multilingual; and most models are trained on English, which imbues most NLP research with an English-centric bias. We find all of these trends concerning.

Before conducting this survey, we had some sense of these trends, just from looking for models ourselves. This was a big motivation for the Goldfish models. As part of publishing the Goldfish paper, we realized that not everyone knows this, which motivated us to quantitatively analyze the models on Hugging Face.

Method

To conduct the survey, we first collected all the models tagged with the text-generation or text-classification task labels on Hugging Face in the first week of September 2025. We identified 272,512 text-generation models and 100,604 text-classification models. These labels generally correspond to decoder-only/autoregressive and encoder-only/bidirectional models, respectively. In total, that’s 373,116 models.
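A minimal sketch of how such a collection can be done with the huggingface_hub client, assuming a recent library version (the variable names are ours, and the exact keyword arguments may differ across versions):

```python
from huggingface_hub import HfApi

api = HfApi()

# Pull metadata for every model under the two pipeline tags.
# cardData=True asks the Hub to also return the model card metadata,
# which is where language tags live.
text_generation = list(api.list_models(pipeline_tag="text-generation", cardData=True))
text_classification = list(api.list_models(pipeline_tag="text-classification", cardData=True))

print(len(text_generation), "text-generation models")
print(len(text_classification), "text-classification models")
```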

We only consider a set of 325 languages, which represents the unique ISO 639-3 codes among the languages in the Goldfish suite of models. We do this to set a practical cut-off point for the analysis. Using ISO 639-3 codes alone collapses script distinctions, i.e. languages that differ only in script (e.g. srp_cyrl and srp_latn) are merged, so we end up with fewer than the 350 languages in the Goldfish paper.
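Collapsing the Goldfish language codes (which follow the iso639_script convention, e.g. srp_cyrl) down to bare ISO 639-3 codes is a one-liner; a sketch, with an illustrative list of codes standing in for the full Goldfish set:

```python
# Illustrative subset of Goldfish language codes in iso639_script form.
goldfish_codes = ["eng_latn", "srp_cyrl", "srp_latn", "zho_hans", "zho_hant"]

# Drop the script suffix and de-duplicate: srp_cyrl and srp_latn both map to srp.
iso_639_3 = sorted({code.split("_")[0] for code in goldfish_codes})
print(iso_639_3)  # ['eng', 'srp', 'zho']
```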

We considered de-duplicating models based on some simple, manually determined rules. For example, there may be many fine-tuned or quantized versions of popular models like those in the Llama family, but these are all quite similar. However, the goal of this analysis is not necessarily to determine how many “unique” models exist (which would be difficult to define). Instead, we are simulating a more practical scenario in which a user is looking for existing models for a given language. We ask how many models are tagged for that language. Therefore, our figures here may in some sense over-estimate the number of “unique” models, but this seems preferable to having to determine what counts as a unique model and risk excluding models that should be included.

For all 373,116 models, we collected the language tags. Hugging Face does not standardize language tags, so we normalized across ISO 639-1 codes (‘en’), ISO 639-3 codes (‘eng’), and language names (‘English’), converting everything to ISO 639-3 codes for consistency.
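A sketch of the kind of normalization involved; the lookup table here is a tiny illustrative subset, and in practice one would build it from a full ISO 639 table or a library such as langcodes:

```python
# Illustrative subset of a mapping from ISO 639-1 codes and language names to ISO 639-3.
TO_ISO_639_3 = {
    "en": "eng", "english": "eng",
    "zh": "zho", "chinese": "zho",
    "fr": "fra", "french": "fra",
}

def normalize_language_tag(tag: str) -> str | None:
    """Map a raw Hugging Face language tag to an ISO 639-3 code, if possible."""
    tag = tag.strip().lower()
    # Simplification: treat any three-letter alphabetic tag as already being ISO 639-3.
    if len(tag) == 3 and tag.isalpha():
        return tag
    return TO_ISO_639_3.get(tag)

print(normalize_language_tag("en"), normalize_language_tag("English"))  # eng eng
```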

Out of the models we found, 198,993 (73%) text-generation and 88,207 (88%) text-classification models are not tagged for any language. So, here we only consider the 85,916 models with language tags. The pervasive lack of language tags is an example of the #BenderRule: when the language being worked on is not stated, English is implied to be the sole language of study. This is the result of a deep English-centricity in the field. If we assume that even half of these unlabelled models are English-only (the rate is probably much higher), the reality is that language representation is even more skewed towards English than we show here. For the purpose of this survey, however, we discard the unlabelled models from our analysis.
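Reading the language tags off the model metadata and separating out untagged models is then straightforward; a sketch, continuing from the snippets above and assuming the models are ModelInfo objects returned by the listing call (the attribute name for the card metadata has varied across huggingface_hub versions, hence the defensive lookup):

```python
def language_tags(model) -> list[str]:
    """Normalized ISO 639-3 language tags for a model, dropping anything we can't map."""
    card = getattr(model, "card_data", None) or getattr(model, "cardData", None)
    raw = getattr(card, "language", None) or []
    if isinstance(raw, str):
        raw = [raw]
    normalized = (normalize_language_tag(tag) for tag in raw)
    return sorted({tag for tag in normalized if tag is not None})

all_models = text_generation + text_classification
tagged = [m for m in all_models if language_tags(m)]
untagged = [m for m in all_models if not language_tags(m)]
print(f"{len(untagged)} of {len(all_models)} models have no language tag")
```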

Distribution of Models by Language

Next, we counted the number of models tagged for each language, in order to characterize the distribution of models across languages. Unsurprisingly, the vast majority of models that have language tags are tagged for English: 88% of models. The language with the next-highest number of models is Mandarin Chinese; however, Chinese has less than one tenth the number of models that English does (8%, or 6,398 Chinese models compared to 75,577 English models). Furthermore, 5,542 (87%) of the models tagged for Chinese are also tagged for English.
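The per-language counts boil down to one Counter over the tagged models; a sketch, continuing from the snippets above:

```python
from collections import Counter

# language_tags() already de-duplicates, so each model counts once per language.
models_per_language = Counter(
    lang for model in tagged for lang in language_tags(model)
)

for lang, count in models_per_language.most_common(5):
    print(lang, count)
```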


The 25 languages with the most models available, in descending order, are English, Chinese, French, Spanish, Italian, Korean, Portuguese, Japanese, Thai, Hindi, Russian, Arabic, Vietnamese, Turkish, Polish, Indonesian, Dutch, Czech, Romanian, Finnish, Ukrainian, Swedish, Danish, Bengali, and Farsi. Unsurprisingly, European languages are highly represented in this list, and they are disproportionately well-represented relative to their speaker populations. For example, Czech (9.6M speakers; 670 models), Finnish (5M speakers; 621 models), and Danish (6M speakers; 534 models) are all in the top 25 languages, while languages with much larger speaker populations fall much lower in the rankings with a fraction of the number of models. Swahili, which has approximately 100M speakers, has only 328 models and is ranked 34th. Telugu and Tagalog are in similar positions, with 80-100M speakers each; they have 323 and 221 models, respectively. Despite having 10-20x as many speakers as some European languages, these languages have half the number of available models.

Within the top 25 languages, the number of models continues to decrease. But after the 25th language, the number of models plateaus and slowly trails off. This nicely illustrates the long tail of languages by resource level in NLP.


Interestingly, there are languages for which there are text generation models but no text classification models. However, there are no languages for which there are only text classification models. The languages for which a higher proportion of models are text generation models are a mix of extremely high-resource languages and much lower-resource languages. However, the languages for which most models are text-classification models are generally lower-resource languages like Toka-Leya-Dombe, Kedah-Perak Malay, Zapotec, Mari, Inuktitut, Brunei Bisaya-Dusun, and Kituba. Anecdotally, we have seen that languages in this group are often represented in massively multilingual models like AfriBERTa or Glot-500. This may be because the amount of training data available for those languages is better suited to training bidirectional models, or to training smaller classification models for tasks like POS tagging and NER. Or, perhaps more worryingly, it could indicate that low-resource languages have been left even further behind by the shift from bidirectional to autoregressive models in recent years.
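The comparison across the two tasks is a simple set difference over the per-task language coverage; a sketch, again continuing from the snippets above:

```python
generation_langs = {lang for m in text_generation for lang in language_tags(m)}
classification_langs = {lang for m in text_classification for lang in language_tags(m)}

# Languages covered by at least one text-generation model but no text-classification model.
generation_only = generation_langs - classification_langs
# In our data this reverse set is empty: no language has only text-classification models.
classification_only = classification_langs - generation_langs

print(len(generation_only), "generation-only languages")
print(len(classification_only), "classification-only languages")
```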

English Dominance

Most models with language tags are multilingual. Across all languages, the mean proportion of multilingual models is 76% (69% for text generation models, 90% for text classification models). Of those, 70% have English as one of the language tags (57% for text generation, 86% for text classification). The proportion of English tags is higher for classification models, while text generation models tend to be less multilingual and less English-centric.

Looking at individual languages, 159 languages have all of their multilingual models trained on at least some English. Breaking down by task, 182 languages have all of their multilingual text generation models trained on some English, and 234 languages have all of their multilingual text classification models trained on some English.
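These per-language statistics all follow the same pattern: group the models by language tag, then inspect the other tags on each model. A sketch of the English co-occurrence check (the helper name is ours):

```python
def models_for(lang: str, models: list) -> list:
    """All models tagged (possibly among others) for the given language."""
    return [m for m in models if lang in language_tags(m)]

all_multilingual_include_english = []
for lang in models_per_language:
    if lang == "eng":
        continue
    multilingual = [m for m in models_for(lang, tagged) if len(language_tags(m)) > 1]
    if multilingual and all("eng" in language_tags(m) for m in multilingual):
        all_multilingual_include_english.append(lang)

print(len(all_multilingual_include_english),
      "languages whose multilingual models all include English")
```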

Multilingual vs. Monolingual Models

For most languages (83%), most of their models are multilingual (68% of languages considering only text generation; 95% of languages considering only text classification). Most of the research for languages other than English, Chinese, and some European languages is being done in a (massively) multilingual context.


Out of 325 languages, 150 do not have any monolingual models other than the Goldfish models. In other words, for 150 languages, the release of the Goldfish models represents the first open model dedicated to that language. There are also two languages for which there were no models on Hugging Face at all until the release of the Goldfish models: Kituba and Brunei Bisaya-Dusun. There are still many languages which are not represented at all. On average, the Goldfish suite increases the number of models available for a given language by 60%.

If we consider only text generation models, for 215 languages, the Goldfish models are the only monolingual autoregressive models. For 47 of those languages, the Goldfish models are the only autoregressive models at all. These languages are: Gilaki, Central Aymara, Tosk Albanian, Kituba, Central Kanuri, Low German, Khalkha Mongolian, Zazaki, Standard Latvian, Tedim Chin, Standard Estonian, Banjar, Lak, Veps, Nigerian Fulfulde, Brunei Bisaya, Northern Frisian, Southern Pashto, Western Punjabi, Bishnupriya Manipuri, Piedmontese, West Central Oromo, Rusyn, North Azerbaijani, Northern Kurdish, Neapolitan, Kabyle, Wu Chinese, Southwestern Dinka, Lingua Franca Nova, West Flemish, Dargwa, Mirandese, Eastern Mari, Buginese, North Levantine Arabic, Northern Uzbek, Cusco Quechua, Central Bikol, Ingush, Lezghian, Gulf Arabic, Eastern Yiddish, Plateau Malagasy, Mingrelian, Komi-Permyak, Min Nan Chinese. These languages tend to be quite under-resourced in NLP.

We calculated, for each language, the mean number of languages that its models are tagged for. The language with the lowest mean* is English, at 1.93; in other words, models tagged for English typically carry only one or two language tags. Given that there are 61,575 monolingual English models (81% of English models), this makes sense. At the other end of the scale, for about half of the languages, the mean number of languages per model is 100 or more, meaning that the majority of the models for those languages are massively multilingual. This matters because not all representation is equal: being represented in a monolingual model is not the same as being represented in a massively multilingual model where the target language makes up only 0.1% of the training data. In some ways the monolingual model may be better, even if it is much smaller. This is one of the main points of discussion in the Goldfish paper. (*We exclude two languages whose only models are the Goldfish models, so their mean is 1.)
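A sketch of the per-language mean, reusing the helpers defined above:

```python
from statistics import mean

mean_langs_per_model = {
    lang: mean(len(language_tags(m)) for m in models_for(lang, tagged))
    for lang in models_per_language
}

# English sits at the low end; massively multilingual models drag the mean
# for many lower-resource languages up to 100 or more.
for lang, value in sorted(mean_langs_per_model.items(), key=lambda kv: kv[1])[:5]:
    print(lang, round(value, 2))
```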


Conclusion

The figures here paint a pretty sad picture of the state of multilingual NLP. While the Goldfish models have made a difference in language representation on Hugging Face, there is still a massive gap between the “have” and “have-not” languages. Even without new models, the community would benefit from better practices around adding language tags to models and datasets. This would give us a better picture of what we have and would make it easier for people to find what they need.

We encourage you to dig into the data (all data and code are on GitHub) and see what other interesting trends you can find!
