Hugging Face
rasgaard
7 followers · 29 following
AI & ML interests
None yet
Recent Activity
liked a model 9 days ago
hexgrad/Kokoro-82M
reacted to davanstrien's post with 🤗 17 days ago
Introducing scandi-fine-web-cleaner (https://huggingface.co/davanstrien/scandi-fine-web-cleaner), the first model trained on FineWeb-C community annotations!

FineWeb2 is a massive multilingual dataset for pre-training language models. Like any web-scale dataset, it contains low-quality content. How can we improve it?

Over the past months, an amazing community of 400+ annotators has been labelling content quality (using Argilla) across 23 languages through the FineWeb-C initiative. Today, I'm happy to share the first classifier trained on this data.

🔍 What we've built:
- A lightweight classifier that efficiently removes low-quality content
- 90%+ precision demonstrated on Danish & Swedish
- Can process the 43M+ documents in Danish FineWeb2 with minimal compute

🌍 Why this matters:
The approach can be reproduced for any of the 23 languages in FineWeb-C (https://huggingface.co/datasets/data-is-better-together/fineweb-c). We can improve training data quality at scale without massive compute resources by starting with community annotations and training small, efficient classifiers.

Want to build a classifier for your language? Check out the full blog post with code examples and implementation details: https://danielvanstrien.xyz/posts/2025/FineWeb-c/scandinavian-content-filtering-fineweb.html
upvoted a collection about 2 months ago
Danish Text Datasets
Organizations
rasgaard's activity
liked a model 9 days ago
hexgrad/Kokoro-82M • Text-to-Speech • Updated about 9 hours ago • 46.6k • 2.57k
liked a dataset about 2 months ago
HuggingFaceFW/fineweb-2 • Viewer • Updated 22 days ago • 12.5B • 71.4k • 398
liked 2 Spaces about 2 months ago
🌐📊 FineWeb 2 - Community Leaderboard • Running • 12
🌐 FineWeb-c - Annotation • Running on CPU Upgrade • 33
liked a Space 2 months ago
❤️ Kokoro TTS • Running on Zero • 1.72k
Now in 5 languages!
liked a model 6 months ago
parler-tts/parler-tts-mini-v1 • Text-to-Speech • Updated Nov 25, 2024 • 28.5k • 134
liked 2 Spaces 6 months ago
⚔️ MTEB Arena • Running • 86
🎨 Artist • Running on Zero • 181
Aesthetically Controllable Text-Driven Stylization w/o Train
liked 2 models 7 months ago
facebook/fasttext-language-identification • Text Classification • Updated Jun 9, 2023 • 2.94M • 216
jinaai/jina-clip-v1 • Feature Extraction • Updated 24 days ago • 79.9k • 235
liked a model 8 months ago
EmergentMethods/gliner_medium_news-v2.1 • Token Classification • Updated Jun 18, 2024 • 508k • 72
liked 2 models about 1 year ago
mhenrichsen/danskgpt-tiny-chat • Text Generation • Updated Jan 27, 2024 • 60 • 12
mhenrichsen/danskgpt-tiny • Text Generation • Updated Jan 13, 2024 • 96 • 18
liked a dataset almost 2 years ago
ekinakyurek/ftrace • Viewer • Updated Oct 23, 2022 • 1.59M • 68 • 5
liked 2 models about 2 years ago
strombergnlp/dant5-large • Text2Text Generation • Updated Aug 26, 2022 • 21 • 6
strombergnlp/dant5-small • Text2Text Generation • Updated Aug 23, 2023 • 170 • 4
liked a model over 2 years ago
alexandrainst/da-offensive-detection-base • Text Classification • Updated Sep 20, 2023 • 106 • 3