Dataset Language Distribution

#44

by aslawliet - opened Jan 6, 2024

Jan 6, 2024

What ratio of English to Chinese data was Yi-34B trained on? Was it trained on at least 2 trillion tokens of English?

Jan 11, 2024

Hi there! Thank you for your question! Yi-34B was indeed trained on 2 Trillion+ tokens of English!

Jan 12, 2024

@MeisterDeLaV is it 2.4 Trillion+ English?

Jan 15, 2024

@aslawliet Yes it is!

Jan 15, 2024

@MeisterDeLaV Then how much Chinese/multilingual data was it trained on?

01-ai org Mar 11, 2024

richardllin changed discussion status to closed Mar 19, 2024
