FineData

community

AI & ML interests

We release large pre-training datasets to accelerate open LLM development. Part of the Hugging Face Science team (hf.co/science)

Recent Activity

guipenedo updated a dataset 3 days ago

HuggingFaceFW/fineweb

guipenedo updated a dataset 3 days ago

HuggingFaceFW/fineweb-edu-score-2

thomwolf authored a paper 7 days ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

View all activity

guipenedo

updated 2 datasets 3 days ago

HuggingFaceFW/fineweb

Viewer • Updated Jan 31 • 25B • 210k • 2.23k

HuggingFaceFW/fineweb-edu-score-2

Viewer • Updated Apr 11 • 13.1B • 7.54k • 77

thomwolf

authored a paper 7 days ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published 8 days ago • 55

lvwerra

authored a paper 7 days ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published 8 days ago • 55

craffel

authored a paper 7 days ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published 8 days ago • 55

negar-foroutan

authored a paper 7 days ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published 8 days ago • 55

vsabolcec

authored a paper 7 days ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published 8 days ago • 55

hynky

authored a paper 7 days ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published 8 days ago • 55

guipenedo

authored a paper 7 days ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published 8 days ago • 55

davanstrien

posted an update 25 days ago

Post

2861

Inspired by Hugging Face's official MCP server, I've developed a complementary tool that exposes my semantic search API to enhance discovery across the HF platform.

Key capabilities:

- AI-powered semantic search for models and datasets
- Parameter count analysis via safetensors metadata
- Trending content discovery
- Find similar models/datasets functionality
- 11 tools total for enhanced ecosystem navigation

The semantic search goes beyond simple keyword matching, understanding context and relationships between different models and datasets.

Example query: "Find around 10 reasoning Hugging Face datasets published in 2025 focusing on topics other than maths and science. Show a link and a short summary for each dataset." (results in video!)

https://github.com/davanstrien/hub-semantic-search-mcp

loubnabnl

authored a paper 27 days ago

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published 29 days ago • 42

guipenedo

authored a paper 27 days ago

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published 29 days ago • 42

craffel

authored a paper 27 days ago

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published 29 days ago • 42

eliebak

authored a paper 27 days ago

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published 29 days ago • 42

thomwolf

authored a paper about 1 month ago

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Paper • 2506.01844 • Published Jun 2 • 107

clefourrier

posted an update about 2 months ago

Post

823

Always surprised that so few people actually read the FineTasks blog, on
✨how to select training evals with the highest signal✨

If you're serious about training models without wasting compute on shitty runs, you absolutely should read it!!

An high signal eval actually tells you precisely, during training, how wel & what your model is learning, allowing you to discard the bad runs/bad samplings/...!

The blog covers in depth prompt choice, metrics, dataset, across languages/capabilities, and my fave section is "which properties should evals have"👌
(to know on your use case how to select the best evals for you)

Blog: HuggingFaceFW/blogpost-fine-tasks

2 replies

loubnabnl

posted an update about 2 months ago

Post

2955

SmolVLM is now available on PocketPal — you can run it offline on your smartphone to interpret the world around you. 🌍📱

And check out this real-time camera demo by @ngxson , powered by llama.cpp:
https://github.com/ngxson/smolvlm-realtime-webcam
https://x.com/pocketpal_ai

3 replies

davanstrien

posted an update 2 months ago

Post

2260

Came across a very nice submission from @marcodsn for the reasoning datasets competition (https://huggingface.co/blog/bespokelabs/reasoning-datasets-competition).

The dataset distils reasoning chains from arXiv research papers in biology and economics. Some nice features of the dataset:

- Extracts both the logical structure AND researcher intuition from academic papers
- Adopts the persona of researchers "before experiments" to capture exploratory thinking
- Provides multi-short and single-long reasoning formats with token budgets - Shows 7.2% improvement on MMLU-Pro Economics when fine-tuning a 3B model

It's created using the Curator framework with plans to scale across more scientific domains and incorporate multi-modal reasoning with charts and mathematics.

I personally am very excited about datasets like this, which involve creativity in their creation and don't just rely on $$$ to produce a big dataset with little novelty.

Dataset can be found here: marcodsn/academic-chains (give it a like!)

meg

posted an update 2 months ago

Post

2633

New launch: See the energy use of chatbot conversations, in real time. =)
jdelavande/chat-ui-energy
Great work from @JulienDelavande !

thomwolf

posted an update 3 months ago

Post

5366

If you've followed the progress of robotics in the past 18 months, you've likely noticed how robotics is increasingly becoming the next frontier that AI will unlock.

At Hugging Face—in robotics and across all AI fields—we believe in a future where AI and robots are open-source, transparent, and affordable; community-built and safe; hackable and fun. We've had so much mutual understanding and passion working with the Pollen Robotics team over the past year that we decided to join forces!

You can already find our open-source humanoid robot platform Reachy 2 on the Pollen website and the Pollen community and people here on the hub at

pollen-robotics

We're so excited to build and share more open-source robots with the world in the coming months!

1 reply

AI & ML interests

Recent Activity

Team members 17

HuggingFaceFW's activity