Bram Vanroy

BramVanroy

AI & ML interests

Artificial intelligence, natural language processing, computational linguistics

Recent Activity

updated a collection 31 minutes ago
BLEURT
liked a dataset 39 minutes ago
common-pile/dolma-cccc
updated a Space about 2 hours ago
BramVanroy/mateo-demo

Organizations

Language and Translation Technology Team, BigScience Workshop, How to teach Hugging Face?, Hugging Face Fellows, Blog-explorers, HPLT, ZeroGPU Explorers, TENACITY, signon-project, Occiglot, Social Post Explorers, Hugging Face Discord Community, Networks of Ideas and Knowledge in the Ancient World, Instituut voor de Nederlandse Taal / Dutch Language Institute, ml-fw-prerelease, Common Crawl Creative Commons, ModernBERT-NL

Posts 13

📢💾 Introducing the Common Crawl Creative Commons Corpus (C5)!

C5 is a large-scale effort to heavily filter web-crawled data, as collected by the non-profit Common Crawl, down to only documents that are Creative Commons-licensed (such as cc-by-4.0) or public domain (cc0). At this stage, 150 billion tokens have been collected.

---
📄 data: BramVanroy/CommonCrawl-CreativeCommons
🧰 software: https://github.com/BramVanroy/CommonCrawl-CreativeCommons
---

</> To build C5, HTML pages are scrutinized and all links (if any) to CC licenses are collected, both from regular hyperlinks and from metadata. Additional data fields are included, such as "was the license found in the head?" or "if multiple licenses were found, do they contradict each other?", which makes further filtering a breeze.

🌐 In this first version of C5, 8 languages are included (Afrikaans, German, English, French, Frisian, Italian, Dutch and Spanish). The language set was limited for two reasons: computational and storage constraints, and a collaboration with GPT-NL, which requested CC data for these languages to train a Dutch-focused, copyright-conscious LLM. In total, this V1 release contains almost 150 thousand documents and 150 billion tokens. This data was neither filtered for quality nor deduplicated, so that you can decide for yourself how much data to keep. To give some quality indication, a dataset field describes whether a document is also included in the FineWeb(-2) datasets, which are of high quality.
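To illustrate how those metadata fields enable post-hoc filtering, here is a minimal sketch in plain Python. The field names (`license_in_head`, `license_disagreement`, `found_in_fw`) are assumptions for illustration only; check the dataset card for the actual schema.

```python
# Hypothetical sketch of filtering C5-style records on license metadata.
# Field names are illustrative assumptions, not the confirmed C5 schema.
records = [
    {"text": "doc A", "license_in_head": True,  "license_disagreement": False, "found_in_fw": True},
    {"text": "doc B", "license_in_head": False, "license_disagreement": False, "found_in_fw": True},
    {"text": "doc C", "license_in_head": True,  "license_disagreement": True,  "found_in_fw": False},
]

def keep(rec: dict) -> bool:
    """Keep documents whose CC license was found in the page head,
    with no contradictory licenses, that also appear in FineWeb(-2)."""
    return (
        rec["license_in_head"]
        and not rec["license_disagreement"]
        and rec["found_in_fw"]
    )

kept = [r["text"] for r in records if keep(r)]
print(kept)  # ['doc A']
```

The same predicate could be passed to `Dataset.filter` when loading the data with the 🤗 `datasets` library.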

🔍 More work needs to be done! Only 7 out of 100+ Common Crawl crawls have been processed so far. That's encouraging, because it means there is a lot more Creative Commons data to be collected! But to get there I need help in terms of compute. The current processing was already heavily sponsored by the Flemish Supercomputer, but more is needed. If you have compute available and wish to collaborate in an open and transparent manner, please get in touch!
In the spirit of "Better late than never", I've finally written a brief overview paper for GEITje 7B Ultra. Initially released 10 months ago (oops), but still reaching around 1300 monthly downloads across the HF ecosystem (not including ollama).

GEITje 7B Ultra: A Conversational Model for Dutch (2412.04092)

While the paper discusses the model a little bit, I especially wanted to write about the datasets, which to this day remain an important asset for Dutch LLM training (SFT and preference tuning). We have a long way to go for Dutch, but publishing transparent and reproducible artefacts seems an important step to me, alongside having open discussions about data, bias, and architectures.

In that spirit, thanks are in order for the creation of GEITje 7B Ultra and all related datasets:

- Michiel Buisman and UWV for providing the means to create the datasets
- Flemish Supercomputer Center (VSC) for the compute
- The Hugging Face Fellows and rest of the team for their discussions and insights
- The Dutch NLP community, notably @Rijgersberg for building the base GEITje model and the fruitful discussions we've had

More to come, step by step!

BramVanroy/geitje-7b-ultra-65c1ee010ad80fd1f6a8f208