Hugging Face
Negar Foroutan
negar-foroutan
3 followers · 8 following
http://negar.foroutan.info
negarforoutan
negar-foroutan
AI & ML interests
NLP, Multilingual LLMs, Cross-lingual Transfer
Recent Activity
authored a paper about 13 hours ago
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
reacted to thomwolf's post 7 months ago
We are proud to announce https://huggingface.co/datasets/HuggingFaceFW/fineweb-2: A sparkling update to https://huggingface.co/datasets/HuggingFaceFW/fineweb with 1000s of 🗣️ languages. We applied the same data-driven approach that led to SOTA English performance in 🍷 FineWeb to thousands of languages. FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments. The dataset is released under the permissive ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public. We will very soon announce a big community project, and are working on a blogpost walking you through the entire dataset creation process. Stay tuned! In the meantime, come ask us questions on our chat place: https://huggingface.co/spaces/HuggingFaceFW/discussion H/t @guipenedo @hynky @lvwerra as well as @vsabolcec Bettina Messmer @negar-foroutan and @mjaggi
reacted to thomwolf's post with 🤗 7 months ago
Organizations
Papers
1
arxiv:2506.20920
models
0
None public yet
datasets
0
None public yet