Pretraining datasets?
#6
by
saattrupdan
- opened
Do you have an overview of the pretraining datasets that you've trained the model on?
You're linking to several datasets in your YAML metadata (e.g., HPLT 2.0, FineWeb-2 and MADLAD-400). Is that an exhaustive list?
That would help a lot with transparency :)
We have listed: CulturaX, FineWeb-2, HPLT, HPLT 2.0, MADLAD-400.
Not listed, because the data is not on HF: Wikipedia, SpeakLeash, Eurolex, and corpora from OPUS (CCMatrix, ParaCrawl, Europarl, etc.). There were also some data donations from Slovenian, Slovak and Estonian institutions, which were mostly data already found in other data sources.
We will detail this more precisely in the technical report when we get to it.
saattrupdan
changed discussion status to
closed