ml-fw-prerelease

Enterprise

community

Recent Activity

hynky

authored a paper about 17 hours ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published 2 days ago • 23

guipenedo

authored a paper about 17 hours ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published 2 days ago • 23

guipenedo

authored a paper 21 days ago

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published 22 days ago • 42

alielfilali01

authored a paper 30 days ago

Llama-3-Nanda-10B-Chat: An Open Generative Large Language Model for Hindi

Paper • 2504.06011 • Published Apr 8 • 1

Zaid

authored 5 papers about 1 month ago

Masader: Metadata Sourcing for Arabic Text and Speech Data Resources

Paper • 2110.06744 • Published Oct 13, 2021

Arabic Stable LM: Adapting Stable LM 2 1.6B to Arabic

Paper • 2412.04277 • Published Dec 5, 2024

Rephrasing natural text data with different languages and quality levels for Large Language Model pre-training

Paper • 2410.20796 • Published Oct 28, 2024

Ashaar: Automatic Analysis and Generation of Arabic Poetry Using Deep Learning Approaches

Paper • 2307.06218 • Published Jul 12, 2023

MOLE: Metadata Extraction and Validation in Scientific Papers Using LLMs

Paper • 2505.19800 • Published May 26 • 1

SivilTaram

authored a paper about 1 month ago

General-Reasoner: Advancing LLM Reasoning Across All Domains

Paper • 2505.14652 • Published May 20 • 22

BramVanroy

posted an update about 2 months ago

📢💾 Introducing the Common Crawl Creative Commons Corpus (C5)!

C5 is a large-scale effort to heavily filter web-crawled data, as collected by the non-profit Common Crawl, down to only those documents that carry a Creative Commons license, such as cc-by-4.0 or the public-domain cc0. At this stage, 150 billion tokens have been collected.

---
📄 data: BramVanroy/CommonCrawl-CreativeCommons
🧰 software: https://github.com/BramVanroy/CommonCrawl-CreativeCommons
---

</> To build C5, HTML pages are scrutinized and all links (if any) to CC licenses are collected, both in regular hyperlinks and in metadata. Additional data fields are included, such as "was the license found in the head?" or "if multiple licenses were found, do they contradict each other?", which makes further filtering a breeze.
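As a rough illustration of the idea described above (not the actual C5 code, which lives in the linked repository), here is a minimal sketch using only the Python standard library. It assumes licenses can be recognized by a `creativecommons.org` URL in an `href` or `content` attribute, and it treats "more than one distinct license" as a disagreement, which is a simplification. The names `CCLicenseFinder` and `find_cc_licenses`, and the output keys, are made up for this example.

```python
import re
from html.parser import HTMLParser

# Assumption: a CC license is identified by a URL such as
# https://creativecommons.org/licenses/by/4.0/ or
# https://creativecommons.org/publicdomain/zero/1.0/
CC_URL = re.compile(
    r"https?://creativecommons\.org/(licenses|publicdomain)/([a-z\-]+)(?:/(\d\.\d))?",
    re.IGNORECASE,
)


class CCLicenseFinder(HTMLParser):
    """Collects CC license URLs from hyperlinks and metadata tags."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.found = []  # one dict per license URL encountered

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        attrs = dict(attrs)
        # <a href>, <link href> and <meta content> can all carry a license URL
        for key in ("href", "content"):
            match = CC_URL.search(attrs.get(key) or "")
            if match:
                self.found.append({
                    "license": match.group(2).lower(),
                    "version": match.group(3),
                    "tag": tag,
                    "in_head": self.in_head,
                })

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def find_cc_licenses(html):
    """Return the licenses found plus two hypothetical summary fields."""
    parser = CCLicenseFinder()
    parser.feed(html)
    distinct = {item["license"] for item in parser.found}
    return {
        "licenses": parser.found,
        "found_in_head": any(item["in_head"] for item in parser.found),
        # Simplification: any two different licenses count as a disagreement
        "disagreement": len(distinct) > 1,
    }
```

Calling `find_cc_licenses(page_html)` on a page with a `<link rel="license">` in the head and a different CC link in the body would report both licenses, `found_in_head` as true, and `disagreement` as true.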

🌐 In this first version of C5, 8 languages are included (Afrikaans, German, English, French, Frisian, Italian, Dutch and Spanish). The language set was limited for two reasons: computational and storage limitations, and a collaboration with GPT-NL, which requested CC data for these languages to train a Dutch-focused, copyright-conscious LLM. In total, this V1 release contains almost 150 thousand documents and 150 billion tokens. This data was neither filtered for quality nor deduplicated, so that you can decide for yourself how much data to keep. To give some indication of quality, a dataset field describes whether a document is included in the FineWeb(-2) datasets, which are of high quality.
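Since the release leaves quality filtering to the user, a downstream filter might look something like the sketch below. The field names `license_in_head`, `license_disagreement` and `found_in_fw` are assumptions for illustration and should be checked against the dataset card; the sample records are invented.

```python
def keep_document(doc, require_head=True, require_fineweb=True):
    """Decide whether to keep a record, based on assumed metadata fields."""
    # Drop documents whose detected licenses contradict each other
    if doc.get("license_disagreement"):
        return False
    # Optionally require that the license was declared in the HTML head
    if require_head and not doc.get("license_in_head"):
        return False
    # Optionally require that the document also appears in FineWeb(-2)
    if require_fineweb and not doc.get("found_in_fw"):
        return False
    return True


# Invented records, only to show the filter in use
docs = [
    {"id": "a", "license_in_head": True, "license_disagreement": False, "found_in_fw": True},
    {"id": "b", "license_in_head": False, "license_disagreement": False, "found_in_fw": True},
    {"id": "c", "license_in_head": True, "license_disagreement": True, "found_in_fw": True},
    {"id": "d", "license_in_head": True, "license_disagreement": False, "found_in_fw": False},
]

strict = [d["id"] for d in docs if keep_document(d)]
relaxed = [d["id"] for d in docs if keep_document(d, require_head=False, require_fineweb=False)]
```

Relaxing the two optional checks keeps more data at the cost of weaker license and quality signals, which is the trade-off the unfiltered release leaves open.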

🔍 More work needs to be done! Only 7 out of 100+ Common Crawl crawls have been processed so far. That's encouraging because it means there is a lot more Creative Commons data to be collected! But to get there I need help in terms of compute. The current processing was already heavily sponsored by the Flemish Supercomputer, but more is needed. If you have the compute available and wish to collaborate in an open and transparent manner, please get in touch!