Moroccan Darija Datasets
A collection of all available Moroccan Darija datasets for pretraining LLMs.
A collection of Moroccan Darija texts (155M tokens). Can be used for pretraining Moroccan Darija LMs.
atlasia/TerjamaBench
A culturally aligned translation benchmark for evaluating machine translation of Moroccan Darija.
atlasia/DODa-audio-dataset
A collection of 12,743 parallel text and speech samples for Moroccan Darija, with transcriptions in both Latin and Arabic scripts and English translations.
atlasia/moroccan_darija_domain_classifier_dataset
A collection of 190,000 synthetically generated texts (produced with Gemini-2.0-Flash) across 26 topics. Can be used to train text classification models.
atlasia/Moroccan-Darija-Wiki-Dataset
A collection of 10,044 parallel text samples of Moroccan Darija sourced from Darija Wikipedia.
atlasia/Moroccan-Darija-Wiki-Audio-Dataset
A collection of 551 parallel text and speech samples of Moroccan Darija sourced from Darija Wikipedia.
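
For readers who want to inspect these datasets programmatically, the following is a minimal sketch rather than an official recipe. It assumes the repositories are hosted on the Hugging Face Hub under the IDs listed above, that the third-party `datasets` package is installed, and that each repository exposes a "train" split (not confirmed by this listing; check each dataset card). The helper names `hub_url` and `peek` are hypothetical.

```python
# Repository IDs copied from the listing above.
DATASET_IDS = [
    "atlasia/TerjamaBench",
    "atlasia/DODa-audio-dataset",
    "atlasia/moroccan_darija_domain_classifier_dataset",
    "atlasia/Moroccan-Darija-Wiki-Dataset",
    "atlasia/Moroccan-Darija-Wiki-Audio-Dataset",
]


def hub_url(repo_id: str) -> str:
    """Build the Hub page URL for a dataset repository (assumes Hub hosting)."""
    return f"https://huggingface.co/datasets/{repo_id}"


def peek(repo_id: str, n: int = 3, split: str = "train"):
    """Return the first n records of a dataset via streaming.

    Assumption: the repository has a split named `split`; the default
    "train" is a guess and may need adjusting per dataset.
    """
    # Imported lazily so the rest of the module works without the dependency.
    from datasets import load_dataset

    stream = load_dataset(repo_id, split=split, streaming=True)
    return list(stream.take(n))


if __name__ == "__main__":
    # Print the page URL for each listed dataset; no network access needed.
    for repo_id in DATASET_IDS:
        print(hub_url(repo_id))
```

Calling `peek("atlasia/TerjamaBench")` would then fetch a few records over the network. Streaming is used here so that the larger corpora do not have to be downloaded in full just to look at a handful of rows.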