Pretraining datasets?
#6
by
saattrupdan
- opened
Do you have an overview of the pretraining datasets that you've trained the model on?
You're linking to several datasets in your YAML metadata (e.g., HPLT 2.0, FineWeb-2 and MADLAD-400). Is that an exhaustive list?
That would help a lot with transparency :)
We have listed: CulturaX, FineWeb-2, HPLT, HPLT 2.0, MADLAD-400.
Not listed, because the data is not on HF: Wikipedia, SpeakLeash, Eurolex, and corpora from OPUS (CCMatrix, ParaCrawl, Europarl, etc.). There were also some data donations from Slovenian, Slovak and Estonian institutions, which were mostly data already found in other data sources.
We will detail this more precisely in the technical report when we get to it.
saattrupdan
changed discussion status to
closed