Jared Sulzdorf PRO (jsulz)

AI & ML interests

Infrastructure, law, policy

Recent Activity

reacted to BramVanroy's post with šŸš€ about 5 hours ago

šŸ“¢šŸ’¾ Introducing the Common Crawl Creative Commons Corpus (C5)!

C5 is a large-scale effort to heavily filter web-crawled data, as collected by the non-profit Common Crawl, down to only documents that are Creative Commons-licensed, such as cc-by-4.0 or public-domain cc0. At this stage, 150 billion tokens have been collected.

šŸ“„ data: https://huggingface.co/datasets/BramVanroy/CommonCrawl-CreativeCommons
🧰 software: https://github.com/BramVanroy/CommonCrawl-CreativeCommons

</> To build C5, HTML pages are scrutinized and all links (if any) to CC licenses are collected, both in regular hyperlinks and in metadata. Additional data fields are included, such as "was the license found in the `head`?" or "if multiple licenses were found, do they contradict each other?", which makes further filtering a breeze.

🌐 In this first version of C5, 8 languages are included (Afrikaans, German, English, French, Frisian, Italian, Dutch, and Spanish). The language set was limited for two reasons: computational and storage limitations, and a collaboration with GPT-NL, which requested CC data for these languages to train a Dutch-focused, copyright-conscious LLM. In total, this V1 release contains almost 150 thousand documents and 150 billion tokens. This data was not filtered on quality or deduplicated, so you can decide for yourself how much data to keep. To give some quality indication, a dataset field describes whether a document is included in the FineWeb(-2) datasets, which are of high quality.

šŸ” More work needs to be done! Only 7 of 100+ Common Crawl crawls have been processed so far. That's encouraging, because it means there is a lot more Creative Commons data to be collected! But to get there, I need help in terms of compute. The current processing was already heavily sponsored by the Flemish Supercomputer, but more is needed. If you have the compute available and wish to collaborate in an open and transparent manner, please get in touch!
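The post describes scanning HTML pages for CC license links in both regular hyperlinks and `head` metadata. A minimal sketch of that idea in Python, assuming nothing about the actual C5 pipeline (the regex, license-id format, and returned field layout here are illustrative assumptions, not the project's real implementation):

```python
# Hedged sketch: detect Creative Commons license links in an HTML page,
# recording whether each was found inside <head> (as one of the C5
# post's described data fields). Illustrative only, not the C5 code.
import re
from html.parser import HTMLParser

# Matches e.g. creativecommons.org/licenses/by/4.0/ and
# creativecommons.org/publicdomain/zero/1.0/
CC_LICENSE_RE = re.compile(
    r"creativecommons\.org/(licenses/(?P<abbr>[a-z-]+)/(?P<version>\d+\.\d+)"
    r"|publicdomain/zero/(?P<zero>\d+\.\d+))"
)

class CCLicenseFinder(HTMLParser):
    """Collect (license_id, found_in_head) pairs from hrefs and metadata."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.licenses = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        attr_map = dict(attrs)
        # Check both hyperlink targets and <meta content=...> values.
        url = attr_map.get("href") or attr_map.get("content") or ""
        m = CC_LICENSE_RE.search(url)
        if m:
            if m.group("zero"):
                license_id = f"cc0-{m.group('zero')}"
            else:
                license_id = f"cc-{m.group('abbr')}-{m.group('version')}"
            self.licenses.append((license_id, self.in_head))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False

def find_cc_licenses(html: str):
    finder = CCLicenseFinder()
    finder.feed(html)
    return finder.licenses

page = """
<html><head>
  <link rel="license" href="https://creativecommons.org/licenses/by/4.0/">
</head><body>
  <a href="https://creativecommons.org/publicdomain/zero/1.0/">CC0</a>
</body></html>
"""
print(find_cc_licenses(page))
# [('cc-by-4.0', True), ('cc0-1.0', False)]
```

A real pipeline at Common Crawl scale would run over WARC records and handle malformed markup, but the core signal (a CC URL, plus where in the document it appears) is the same.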
reacted to BramVanroy's post with ā¤ļø about 5 hours ago

Organizations

Hugging Face, Spaces Examples, Georgia Tech (Georgia Institute of Technology), Blog-explorers, Journalists on Hugging Face, Hugging Face Discord Community, Xet Team, open/ acc, wut?, Inference Endpoints Images

jsulz's activity

upvoted an article 4 days ago: Welcoming Llama Guard 4 on Hugging Face Hub • 30
upvoted an article 10 days ago: 17 Reasons Why Gradio Isn't Just Another UI Library • 28
upvoted an article 16 days ago: Introduction to ggml • 190
upvoted an article 19 days ago: Cohere on Hugging Face Inference Providers šŸ”„ • 124
upvoted 2 articles about 1 month ago:
Welcome Llama 4 Maverick & Scout on Hugging Face! • 142
Training and Finetuning Reranker Models with Sentence Transformers v4 • 125
upvoted an article about 2 months ago: The New and Fresh analytics in Inference Endpoints • 19