Jared Sulzdorf (jsulz)

AI & ML interests

Infrastructure, law, policy

Recent Activity

reacted to BramVanroy's post with 🚀 about 10 hours ago
reacted to BramVanroy's post with ❤️ about 10 hours ago
published a model 3 days ago: jsulz/test

Organizations

Hugging Face · Spaces Examples · Georgia Tech (Georgia Institute of Technology) · Blog-explorers · Journalists on Hugging Face · Hugging Face Discord Community · Xet Team · open/ acc · wut? · Inference Endpoints Images

jsulz's activity

reacted to BramVanroy's post with πŸš€β€οΈ about 10 hours ago
πŸ“’πŸ’Ύ Introducing the Common Crawl Creative Commons Corpus (C5)!

C5 is a large-scale effort to heavily filter web-crawled data, as collected by the non-profit Common Crawl, down to only documents that carry a Creative Commons license such as cc-by-4.0 or the public-domain dedication cc0. At this stage, 150 billion tokens have been collected.

---
πŸ“„ data: BramVanroy/CommonCrawl-CreativeCommons
🧰 software: https://github.com/BramVanroy/CommonCrawl-CreativeCommons
---

</> To build C5, HTML pages are scrutinized and all links (if any) to CC licenses are collected, both in regular hyperlinks and in metadata. Additional data fields are included, such as "was the license found in the head?" or "if multiple licenses were found, do they contradict each other?", which makes further filtering a breeze.
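The link-scanning step described above can be sketched with Python's built-in HTML parser. This is an illustrative toy, not the actual C5 pipeline (see the linked GitHub repo for that); the class name and the `(url, found_in_head)` record shape are made up for this sketch:

```python
# Toy sketch: collect Creative Commons license links from an HTML page,
# in both regular hyperlinks (<a>) and metadata (<link> in <head>).
from html.parser import HTMLParser

CC_PREFIX = "https://creativecommons.org/"

class CCLicenseFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_head = False
        self.licenses = []  # (url, found_in_head) pairs

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "head":
            self.in_head = True
        url = attrs.get("href", "")
        if tag in ("a", "link") and url.startswith(CC_PREFIX):
            self.licenses.append((url, self.in_head))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False

html = """<html><head>
<link rel="license" href="https://creativecommons.org/licenses/by/4.0/">
</head><body>
<a href="https://creativecommons.org/publicdomain/zero/1.0/">CC0</a>
</body></html>"""

finder = CCLicenseFinder()
finder.feed(html)
urls = {u for u, _ in finder.licenses}
# "Do the licenses contradict each other?" boils down to: did we find
# more than one distinct license URL on the same page?
contradictory = len(urls) > 1
```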

🌐 In this first version of C5, 8 languages are included (Afrikaans, German, English, French, Frisian, Italian, Dutch, and Spanish). The language set was limited for two reasons: computational and storage limitations, and a collaboration with GPT-NL, which requested CC data for these languages to train a Dutch-focused, copyright-conscious LLM. In total, this V1 release contains almost 150 thousand documents and 150 billion tokens. This data was neither filtered on quality nor deduplicated, so that you can decide for yourself how much data to keep. To give some quality indication, a dataset field describes whether a document is also included in the high-quality FineWeb(-2) datasets.
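As a toy illustration of the filtering these fields enable, here is a sketch over hand-made records. The field names (`license_in_head`, `licenses_contradict`, `in_fineweb`) are assumptions for illustration only; consult the dataset card for the real schema:

```python
# Hand-made records mimicking (assumed) C5 per-document fields.
records = [
    {"text": "doc a", "license_in_head": True,  "licenses_contradict": False, "in_fineweb": True},
    {"text": "doc b", "license_in_head": False, "licenses_contradict": True,  "in_fineweb": False},
    {"text": "doc c", "license_in_head": True,  "licenses_contradict": False, "in_fineweb": False},
]

# Example policy: keep only documents whose license was declared in the
# <head> and whose discovered licenses agree with each other.
kept = [r for r in records
        if r["license_in_head"] and not r["licenses_contradict"]]
```

Since the release itself is unfiltered and non-deduplicated, each downstream user can apply a policy like this as strictly or loosely as their use case demands.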

πŸ” More work needs to be done! Only 7 out of 100+ Common Crawl crawls have been processed so far. That's encouraging because it means there is a lot more Creative Commons data to be collected! But to get there I need help in terms of compute. The current processing was already heavily sponsored by the Flemish Supercomputer but more is needed. If you have the compute available and which to collaborate in an open and transparent manner, please get in touch!
published a model 3 days ago
upvoted an article 5 days ago

Welcoming Llama Guard 4 on Hugging Face Hub

β€’ 30
reacted to merve's post with β€οΈπŸ€—πŸ‘πŸš€πŸ”₯ 5 days ago
Meta released Llama Guard 4 and new Prompt Guard 2 models πŸ”₯

Llama Guard 4 is a new model to filter model inputs/outputs, both text-only and image 🛡️ Use it before and after LLMs/VLMs! meta-llama/Llama-Guard-4-12B

Prompt Guard 2 22M & 86M are smol models to prevent model jailbreaks and prompt injections ⚔ meta-llama/Llama-Prompt-Guard-2-22M meta-llama/Llama-Prompt-Guard-2-86M

Both come with the new release of transformers 🤗
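The "use it before and after LLMs/VLMs" pattern can be sketched in plain Python, with stub functions standing in for Llama Guard 4 and the chat model. This is only the control flow, not the real API (the actual models are loaded via transformers; see the notebook link for that):

```python
# Stub safety classifier and stub LLM, illustrating the guard-before-
# and-guard-after pattern. BLOCKLIST is a toy stand-in for a real model.
BLOCKLIST = {"build a bomb"}

def guard(text: str) -> bool:
    """Stand-in safety check: True means the text is considered safe."""
    return not any(phrase in text.lower() for phrase in BLOCKLIST)

def llm(prompt: str) -> str:
    """Stand-in for the actual LLM/VLM call."""
    return f"Echo: {prompt}"

def guarded_chat(prompt: str) -> str:
    if not guard(prompt):        # filter the input...
        return "[input refused]"
    response = llm(prompt)
    if not guard(response):      # ...and the output
        return "[output withheld]"
    return response
```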

Try the model right away 👉🏻 https://github.com/huggingface/huggingface-llama-recipes/blob/main/llama_guard_4.ipynb

Read our blog to learn more and easily get started πŸ‘‰πŸ» https://huggingface.co/blog/llama-guard-4 πŸ¦™
reacted to AdinaY's post with πŸš€ 5 days ago
DeepSeek, Alibaba, Skywork, Xiaomi, Bytedance.....
And those are just some of the companies from the Chinese community that released open models in April 🤯

zh-ai-community/april-2025-open-releases-from-the-chinese-community-67ea699965f6e4c135cab10f

🎬 Video
> MAGI-1 by SandAI
> SkyReels-A2 & SkyReels-V2 by Skywork
> Wan2.1-FLF2V by Alibaba-Wan

🎨 Image
> HiDream-I1 by Vivago AI
> Kimi-VL by Moonshot AI
> InstantCharacter by InstantX & Tencent-Hunyuan
> Step1X-Edit by StepFun
> EasyControl by Shanghai Jiaotong University

🧠 Reasoning
> MiMo by Xiaomi
> Skywork-R1V 2.0 by Skywork
> ChatTS by ByteDance
> Kimina by Moonshot AI & Numina
> GLM-Z1 by Zhipu AI
> Skywork OR1 by Skywork
> Kimi-VL-Thinking by Moonshot AI

πŸ”Š Audio
> Kimi-Audio by Moonshot AI
> IndexTTS by BiliBili
> MegaTTS3 by ByteDance
> Dolphin by DataOceanAI

πŸ”’ Math
> DeepSeek Prover V2 by DeepSeek

🌍 LLM
> Qwen by Alibaba-Qwen
> InternVL3 by Shanghai AI lab
> Ernie4.5 (demo) by Baidu

πŸ“Š Dataset
> PHYBench by Eureka-Lab
> ChildMandarin & Seniortalk by BAAI

Please feel free to add if I missed anything!
posted an update 6 days ago
At xet-team we've been hard at work bringing a new generation of storage to the Hugging Face community, and we’ve crossed some major milestones:

πŸ‘· Over 2,000 builders and nearing 100 organizations with access to Xet
πŸš€ Over 70,000 model and dataset repositories are Xet-backed
🀯 1.4 petabytes managed by Xet

As we move repos from LFS to Xet for everyone we onboard, we're pushing our content-addressed store (CAS) harder than ever. Check out the chart below 👇 of CAS hitting up to 150 Gb/s throughput this past week.

All of this growth is helping us build richer insights. We expanded our repo graph, which maps how Xet-backed repositories on the Hub share bytes with each other.

Check out the current network in the image below (nodes are repositories, edges are where repos share bytes) and visit the space to see how different versions of Qwen, Llama, and Phi models are grouped together xet-team/repo-graph
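The byte-sharing idea behind the repo graph can be illustrated with a toy content-addressed store, where a chunk's address is the hash of its bytes, so identical chunks across repositories are stored exactly once. (Illustrative only; Xet's real CAS chunks and deduplicates at a much finer granularity than whole files.)

```python
import hashlib

# Toy content-addressed store: key = SHA-256 of the chunk's bytes.
store: dict[str, bytes] = {}

def put(chunk: bytes) -> str:
    key = hashlib.sha256(chunk).hexdigest()
    store[key] = chunk  # idempotent: same bytes always land at the same key
    return key

# Two "repos" upload an identical weight shard plus one unique file each.
repo_a = [put(b"shared-weights"), put(b"readme-a")]
repo_b = [put(b"shared-weights"), put(b"readme-b")]

# Chunks present in both repos: these are the edges in the repo graph.
shared = set(repo_a) & set(repo_b)
```

Four uploads, but only three stored chunks: the shared shard is deduplicated, and the overlap is exactly what an edge between two repository nodes represents.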

Join the waitlist to get access! https://huggingface.co/join/xet
reacted to danielhanchen's post with πŸš€β€οΈπŸ€—πŸ”₯ 10 days ago
πŸ¦₯ Introducing Unsloth Dynamic v2.0 GGUFs!
Our v2.0 quants set new benchmarks on 5-shot MMLU and KL Divergence, meaning you can now run & fine-tune quantized LLMs while preserving as much accuracy as possible.

Llama 4: unsloth/Llama-4-Scout-17B-16E-Instruct-GGUF
DeepSeek-R1: unsloth/DeepSeek-R1-GGUF-UD
Gemma 3: unsloth/gemma-3-27b-it-GGUF

We made selective layer quantization much smarter. Instead of modifying only a subset of layers, we now dynamically quantize all layers so that every layer can use a different bit width. Our dynamic method can now be applied to all LLM architectures, not just MoEs.
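A deliberately simplified sketch of the per-layer idea: assign each layer its own bit width and quantize it independently. The bit assignments below are arbitrary toy choices, and the actual dynamic method is far more involved (it selects widths per layer from calibration data):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy "layers" of random weights.
layers = {"embed": rng.normal(size=64),
          "attn":  rng.normal(size=64),
          "mlp":   rng.normal(size=64)}
# Hypothetical per-layer bit assignment instead of one global bit width.
bits = {"embed": 8, "attn": 4, "mlp": 6}

def quantize(w: np.ndarray, n_bits: int) -> np.ndarray:
    """Symmetric uniform quantization to n_bits, then dequantize."""
    levels = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max() / levels
    return np.round(w / scale) * scale

quantized = {name: quantize(w, bits[name]) for name, w in layers.items()}
errors = {name: float(np.abs(layers[name] - quantized[name]).max())
          for name in layers}
```

Layers quantized to fewer bits incur larger round-off error, which is why giving sensitive layers more bits preserves accuracy at a given model size.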

Blog with Details: https://docs.unsloth.ai/basics/dynamic-v2.0

All our future GGUF uploads will leverage Dynamic 2.0 and our hand curated 300K–1.5M token calibration dataset to improve conversational chat performance.

For accurate benchmarking, we built an evaluation framework to match the reported 5-shot MMLU scores of Llama 4 and Gemma 3. This allowed apples-to-apples comparisons between full-precision vs. Dynamic v2.0, QAT and standard iMatrix quants.
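The KL-divergence side of that comparison can be sketched with toy logits: compute the next-token distributions of a full-precision and a quantized model and measure how far apart they are. (Toy numbers only; the real evaluation runs both models over a shared calibration corpus.)

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_div(p, q):
    """KL(p || q) in nats; lower means the quantized model matches better."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

full = softmax([2.0, 1.0, 0.1])          # full-precision next-token dist.
quant_good = softmax([1.9, 1.05, 0.1])   # mild quantization error
quant_bad = softmax([0.5, 1.8, 0.1])     # severe quantization error
```

A quantization scheme that "minimizes the performance gap" is one whose distributions stay close to full precision, i.e. low KL divergence across the corpus.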

Dynamic v2.0 aims to minimize the performance gap between full-precision models and their quantized counterparts.