Special thanks to:
- The Common Crawl folks (Greg Lindahl, @pjox and others)
- The datatrove and FineWeb teams at Hugging Face (@guipenedo and others)
head?" or "if multiple licenses were found, do they contradict each other?", which makes further filtering a breeze. Nice! In my experience, preference tuning with the UltraFeedback datasets does not really change benchmark scores (and sometimes even makes them worse), but it does seem to improve the real-world user experience when chatting with the model.
I'm also not sure if ORPO only on UltraFeedback (UF) is better than SFT on UltraChat (UC) + DPO on UF, especially if you're also trying to do language adaptation. That, or first continue pretraining the model and then doing ORPO.
While the "rules" of OpenAI do get frustrating from time to time, I do not blame others who do not follow the same path as I do. If I am asked why my licenses are different from someone else's I will answer according to what I've written in the post above (the rules suck and our vague, I understand why people do what they do and I do what I do because of other reasons). But I definitely do not want to go around and point fingers pre-emptively in hopes that people just use my models. Our community for Dutch is already quite small so I rather just lift each other up and build on each others work through friendly "competition" than to compete in bad faith.
So I think that for my future models, I'll just make use of UltraChat + UltraFeedback, which should be cleared for training Apache 2.0 models because they were created with Azure. This may negatively impact the model's performance (especially for code, because it does not include the Stack Overflow set), but I hope the impact is limited.
What do you mean by compliance in this context? I'm not sure how I can market being non-commercial as a good thing.
Cool! Looking forward to what you'll build with this!
I think the correct place for this is a new issue on their issue tracker: https://github.com/ScandEval/ScandEval/issues
You can use scandeval (https://github.com/ScandEval/ScandEval) to run your own benchmarks with. As part of project "Leesplank" (with Michiel Buisman and Maarten Lens-FitzGerald) we recently added GPT-4-1106-preview scores to add a good "target" to the leaderboard.

Understandable. I'm especially attracted to the broad vocabulary, which can be of use for language adaptation.
What kind of weird results? In terms of loss, or in the actual qualitative output?
Indeed, there is not a lot of metadata. There's also a discrepancy between the number of scores/languages and the number of paragraphs in the text; I've notified the authors about that. CulturaX is an attractive dataset, too!
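If you want to check that mismatch yourself, here is a minimal sketch; the dataset name and the per-paragraph `langs`/`scores` columns are placeholder assumptions and will differ per dataset:

```python
# Sketch: count rows where the per-paragraph language/score metadata does not
# line up with the number of paragraphs in `text`. Column names are assumed.
from datasets import load_dataset

ds = load_dataset("some-org/some-multilingual-crawl", split="train", streaming=True)

mismatches = 0
checked = 0
for row in ds.take(10_000):
    paragraphs = row["text"].split("\n")
    if len(row["langs"]) != len(paragraphs) or len(row["scores"]) != len(paragraphs):
        mismatches += 1
    checked += 1

print(f"{mismatches}/{checked} rows where the metadata length does not match the paragraph count")
```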