MaLA corpus
MaLA Corpus for Massive Language Adaptation of Large Language Models https://mala-lm.github.io
Note: The noisy version of the MaLA monolingual corpus, which integrates texts from different sources without cleaning.
MaLA-LM/mala-monolingual-filter
Note: The filtered version of the MaLA monolingual corpus, which applies further data filtering.
MaLA-LM/mala-monolingual-dedup
Note: The deduplicated version of the MaLA monolingual corpus, which removes repeated data points.
MaLA-LM/mala-monolingual-split
Note: The final version of the MaLA monolingual corpus, produced by splitting the filtered and deduplicated version into training and test sets.
MaLA-LM/mala-bilingual-translation-corpus
Note: The MaLA bilingual translation corpus contains parallel data in more than 2,500 language pairs (500+ languages).
MaLA-LM/mala-code-reasoning
Note: The first version of the MaLA code and reasoning dataset, used for training https://huggingface.co/MaLA-LM/emma-500-llama2-7b
MaLA-LM/mala-code-reasoning-v2
Note: The second version of the MaLA code and reasoning dataset, used for training the EMMA-500 Llama 3(.1) Mono/Bi model series.
MaLA-LM/mala-opus-dedup-2410
Note: mala-opus-dedup-2410 is the bilingual part of the MaLA Corpus. It is a cleaned and deduplicated version of the OPUS corpus, collected from OPUS with a cutoff of October 2024 (2410).
MaLA-LM/mala-opus-dedup-2410-sample
Note: A sampled set of MaLA-LM/mala-opus-dedup-2410.
MaLA-LM/mala-opus-dedup-shuffle-2410
Note: A shuffled version of MaLA-LM/mala-opus-dedup-2410.
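
The notes on the monolingual corpus describe a progression from deduplication to a train/test split. The sketch below illustrates that general idea on a toy list of strings. It is not the MaLA pipeline: the function names `dedup_exact` and `split_train_test`, the whitespace normalization, the SHA-256 hashing, and the `test_fraction` parameter are all assumptions made for illustration, and the actual corpus may well use different criteria.

```python
import hashlib


def dedup_exact(texts):
    """Keep the first occurrence of each text (hypothetical helper).

    Texts are compared after collapsing runs of whitespace, which is an
    assumption of this sketch, not a documented MaLA rule.
    """
    seen = set()
    unique = []
    for text in texts:
        normalized = " ".join(text.split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


def split_train_test(texts, test_fraction=0.1):
    """Split texts into train and test lists by hashing each text (hypothetical helper).

    Hash-based assignment keeps the split deterministic across runs
    without storing a random seed.
    """
    threshold = int(test_fraction * 2**32)
    train, test = [], []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big")
        if bucket < threshold:
            test.append(text)
        else:
            train.append(text)
    return train, test


# Toy input: two entries duplicate the first one up to whitespace.
docs = ["hello world", "hello   world", "bonjour le monde", "hello world"]
unique_docs = dedup_exact(docs)
train_docs, test_docs = split_train_test(unique_docs, test_fraction=0.5)
```

Hashing each text, rather than shuffling with a random seed, means a given document always lands in the same partition, which is one simple way to keep a test set stable when the corpus is rebuilt.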