The UniversalCEFR Data Directory
UniversalCEFR is a large-scale, multilingual, multidimensional dataset of texts annotated according to the CEFR (Common European Framework of Reference). The collection comprises a total of 505,807 CEFR-labeled texts in 13 languages, listed below:
English (en), Spanish (es), German (de), Dutch (nl), Czech (cs), Italian (it), French (fr), Estonian (et), Portuguese (pt), Arabic (ar), Hindi (hi), Russian (ru), Welsh (cy)
UniversalCEFR Data Format / Schema
To ensure interoperability, ease of transformation, and machine readability, we adopted a standardised JSON format for each CEFR-labeled text. Each record includes fields for the source dataset, language, granularity (document, paragraph, sentence, discourse), production category (learner or reference), and license.
| Field | Description |
|---|---|
| `title` | The unique title of the text retrieved from its original corpus (`NA` if there is no title, e.g., for CEFR-assessed sentences or paragraphs). |
| `lang` | The source language of the text in ISO 639-1 format (e.g., `en` for English). |
| `source_name` | The name of the source dataset from which the text was collected, as indicated in its source dataset, paper, and/or documentation (e.g., `cambridge-exams` from Xia et al., 2016). |
| `format` | The level of granularity of the text, as indicated in its source dataset, paper, and/or documentation. The recognized formats are: `document-level`, `paragraph-level`, `discourse-level`, `sentence-level`. |
| `category` | The classification of the text in terms of who created the material: `reference` for texts created by experts, teachers, and language-learning professionals, and `learner` for texts written by language learners and students. |
| `cefr_level` | The CEFR level associated with the text. The six recognized CEFR levels are `A1`, `A2`, `B1`, `B2`, `C1`, `C2`. A small fraction (<1%) of texts in UniversalCEFR are unlabelled, carry plus signs (e.g., `A1+`), or have only a coarse level indicator (e.g., `A`, `B`). |
| `license` | The licensing information associated with the text (e.g., `CC-BY-NC-SA`, or `Unknown` if not stated). |
| `text` | The actual content of the text itself. |
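Under this schema, a single entry might look like the following. This is a hypothetical record for illustration only; the field values are invented and not drawn from an actual UniversalCEFR dataset.

```python
import json

# Hypothetical example record following the UniversalCEFR schema.
# All field values below are invented for illustration.
record = {
    "title": "NA",
    "lang": "en",
    "source_name": "cambridge-exams",
    "format": "sentence-level",
    "category": "reference",
    "cefr_level": "B1",
    "license": "Unknown",
    "text": "The museum closes at five, so we should leave soon.",
}

# Basic sanity checks against the schema described above.
REQUIRED_FIELDS = {"title", "lang", "source_name", "format",
                   "category", "cefr_level", "license", "text"}
assert set(record) == REQUIRED_FIELDS
assert record["format"] in {"document-level", "paragraph-level",
                            "discourse-level", "sentence-level"}
assert record["category"] in {"reference", "learner"}

print(json.dumps(record, indent=2))
```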
Accessing UniversalCEFR
If you're interested in a specific dataset or group of datasets from UniversalCEFR, you can access their transformed, standardised versions here: https://huggingface.co/UniversalCEFR
A separate GitHub organization containing the code from the UniversalCEFR paper is also available: https://github.com/UniversalCEFR
If you use any of the datasets indexed in UniversalCEFR, please cite the original papers associated with each dataset; these are listed on each dataset's page in this organization.
Note that a few datasets in UniversalCEFR (EFCAMDAT, APA-LHA, BEA Shared Task 2019 Write and Improve, and DEPlain) are not directly available from the UniversalCEFR Hugging Face organization, as they require users to agree to their Terms of Use before using them for non-commercial research. Once you have done so, you can use the preprocessing Python scripts in the universalcefr-experiments repository to transform the raw versions into the UniversalCEFR format.
How to Join?
We want to grow this community of researchers, language experts, and educators to further advance openly accessible CEFR/language proficiency assessment datasets for all.
If you're interested in this direction and/or have an open dataset (or datasets) you want to add to UniversalCEFR for greater exposure and utility to researchers, please fill out this form.
When we index your dataset in UniversalCEFR, we will credit you and the paper/project from which the dataset originated across the UniversalCEFR platforms.
Contact
For questions, concerns, clarifications, and issues, please contact Joseph Marvin Imperial ([email protected]).
Reference
@article{imperial2025universalcefr,
title = {{UniversalCEFR: Enabling Open Multilingual Research on Language Proficiency Assessment}},
author = {Joseph Marvin Imperial and Abdullah Barayan and Regina Stodden and Rodrigo Wilkens and Ricardo Muñoz Sánchez and Lingyun Gao and Melissa Torgbi and Dawn Knight and Gail Forey and Reka R. Jablonkai and Ekaterina Kochmar and Robert Reynolds and Eugénio Ribeiro and Horacio Saggion and Elena Volodina and Sowmya Vajjala and Thomas François and Fernando Alva-Manchego and Harish Tayyar Madabushi},
journal = {arXiv preprint arXiv:2506.01419},
year = {2025},
url = {https://arxiv.org/abs/2506.01419}
}