Bram Vanroy PRO

BramVanroy

https://bramvanroy.github.io/

AI & ML interests

Artificial intelligence, natural language processing, computational linguistics

Recent Activity

upvoted a paper 3 days ago

Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training

updated a model 3 days ago

BramVanroy/BLEURT-20

updated a model 3 days ago

BramVanroy/BLEURT-20-D12

Posts 15

Post

244

What are currently the best multilingual models with at most 72B parameters? Are Llama 3.3 70B and Qwen 2.5 72B still king?

Post

726

By popular request, I've just added two subsets to the CommonCrawl-Creative Commons Corpus (C5; BramVanroy/CommonCrawl-CreativeCommons) so that you do not have to filter manually:

- C5f (BramVanroy/CommonCrawl-CreativeCommons-fine): only retains high-quality samples that are also present in FineWeb or FineWeb-2;
- C5r (https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons-recommended): additional strict filtering that removes samples with license disagreement, non-commercial licenses, and Wikipedia samples. Wikipedia samples are removed because you should probably get those from a more reliable source that provides better-parsed content.
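
For anyone who would rather apply the C5r-style filtering by hand, the three removal rules above can be sketched roughly as follows. This is only an illustration on plain Python dicts: the field names `license_disagreement`, `license_abbr`, and `url`, as well as the helper `keep_recommended`, are hypothetical and may not match the actual dataset schema, so check the dataset page before reusing it.

```python
from urllib.parse import urlparse


def keep_recommended(sample: dict) -> bool:
    """Return True if a sample passes the three C5r-style checks.

    NOTE: the field names used here are assumptions for illustration,
    not the confirmed schema of the dataset.
    """
    # 1. Drop samples whose detected licenses disagree with each other.
    if sample["license_disagreement"]:
        return False
    # 2. Drop non-commercial licenses (e.g. "by-nc", "by-nc-sa").
    if "nc" in sample["license_abbr"].lower().split("-"):
        return False
    # 3. Drop Wikipedia pages (any language subdomain).
    host = urlparse(sample["url"]).hostname or ""
    if host == "wikipedia.org" or host.endswith(".wikipedia.org"):
        return False
    return True


# Toy records, made up for this example.
samples = [
    {"url": "https://example.org/a", "license_abbr": "by-sa", "license_disagreement": False},
    {"url": "https://example.org/b", "license_abbr": "by-nc-sa", "license_disagreement": False},
    {"url": "https://en.wikipedia.org/wiki/X", "license_abbr": "by-sa", "license_disagreement": False},
    {"url": "https://example.org/c", "license_abbr": "by", "license_disagreement": True},
]

recommended = [s for s in samples if keep_recommended(s)]
```

In this toy run only the first record survives, which mirrors how strongly such filters cut down the data.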

It goes without saying that these filters lead to a massive reduction in quantity. Doc and token counts are given on the dataset pages.
