Kenneth C. Enevoldsen
KennethEnevoldsen
AI & ML interests
NLP, multimodal learning, Scandinavian NLP, Theory of Mind, Medical NLP, Psychiatry
Recent Activity
liked
a Space
about 23 hours ago
mteb/leaderboard_2_demo
new activity
4 days ago
danish-foundation-models/danish-dynaword:Add datasets quality metrics
reacted
to
davanstrien's
post
5 days ago
Introducing scandi-fine-web-cleaner https://huggingface.co/davanstrien/scandi-fine-web-cleaner, the first model trained on FineWeb-C community annotations!
FineWeb2 is a massive multilingual dataset for pre-training language models. Like any web-scale dataset, it contains low-quality content. How can we improve it?
Over the past months, an amazing community of 400+ annotators has been labelling content quality (using Argilla) across 23 languages through the FineWeb-C initiative.
Today, I'm happy to share the first classifier trained on this data.
What we've built:
- A lightweight classifier that efficiently removes low-quality content
- 90%+ precision demonstrated on Danish & Swedish
- Can process the 43M+ documents in Danish FineWeb2 with minimal compute
Why this matters: The approach can be reproduced for any of the 23 languages in FineWeb-C (https://huggingface.co/datasets/data-is-better-together/fineweb-c). We can improve training data quality at scale without massive compute resources by starting with community annotations and training small, efficient classifiers.
Want to build a classifier for your language? Check out the full blog post with code examples and implementation details: https://danielvanstrien.xyz/posts/2025/FineWeb-c/scandinavian-content-filtering-fineweb.html
KennethEnevoldsen's activity
Add datasets quality metrics
#39 opened 4 days ago
by
KennethEnevoldsen
Added some [emoji]
#19 opened 9 days ago
by
KennethEnevoldsen
Sources for the data
#1 opened 11 days ago
by
KennethEnevoldsen
add Danish-PD
#38 opened 11 days ago
by
KennethEnevoldsen
Center dataset statistics
#37 opened 12 days ago
by
KennethEnevoldsen
Add reference to license for datasets
#36 opened 12 days ago
by
KennethEnevoldsen
Add contributor note at the bottom of each data sheet
#35 opened 12 days ago
by
KennethEnevoldsen
Do we accept synthetic datasets?
1
#33 opened 12 days ago
by
KennethEnevoldsen
Add a notion of dataset quality to new and current datasets
#34 opened 12 days ago
by
KennethEnevoldsen
Add CoRAL
#32 opened 12 days ago
by
KennethEnevoldsen
Add Danish github repositories
#31 opened 12 days ago
by
KennethEnevoldsen
Datasets to add
#6 opened about 1 month ago
by
KennethEnevoldsen
make a test that checks if descriptive statistics have been updated
#30 opened 14 days ago
by
KennethEnevoldsen
Add n. tokens to dataset columns
#29 opened 14 days ago
by
KennethEnevoldsen
remove empty documents
#28 opened 14 days ago
by
KennethEnevoldsen
add lex
1
#24 opened 16 days ago
by
KennethEnevoldsen
add test to check for duplicates and remove existing duplicates
1
#25 opened 16 days ago
by
KennethEnevoldsen
Add OpenSubtitles
1
#21 opened 17 days ago
by
KennethEnevoldsen
Adding CI
#27 opened 15 days ago
by
KennethEnevoldsen
mypr
#26 opened 15 days ago
by
KennethEnevoldsen