Dataset Language Distribution
#44 by aslawliet - opened
What ratio of English to Chinese data was Yi-34B trained on? Was it trained on at least 2 trillion tokens of English?
Hi there! Thank you for your question! Yi-34B was indeed trained on 2 trillion+ tokens of English!
@MeisterDeLaV Is it 2.4 trillion+ tokens of English?
@MeisterDeLaV Then how much Chinese/multilingual data was it trained on?
The report is out: https://arxiv.org/abs/2403.04652
Discord channel: https://discord.gg/zQ4A6b6H
richardllin changed discussion status to closed