Dataset Language Distribution

#44
by aslawliet - opened

What ratio of English to Chinese data was Yi-34B trained on? Was it trained on at least 2 trillion tokens of English?

Hi there! Thank you for your question! Yi-34B was indeed trained on 2 Trillion+ tokens of English!

@MeisterDeLaV is it 2.4 Trillion+ English?

@aslawliet Yes it is!

@MeisterDeLaV Then how much Chinese/multilingual data was it trained on?

01-ai org
richardllin changed discussion status to closed
