Stefan Schweter PRO
stefan-it
AI & ML interests
Flair Library, NER & PoS Tagging, LM Pretraining (mostly encoder-only & encoder-decoder), Historical Language Models, German Language Models
Recent Activity
upvoted a paper 1 day ago
Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training
upvoted a paper 2 days ago
XToM: Exploring the Multilingual Theory of Mind for Large Language Models
upvoted an article 2 days ago
No GPU left behind: Unlocking Efficiency with Co-located vLLM in TRL
stefan-it's activity

reacted to merve's post 13 days ago

reacted to merve's post 30 days ago
Post
6574
A real-time object detector much faster and more accurate than YOLO, with an Apache 2.0 license, just landed in Hugging Face transformers 🔥
D-FINE is the SOTA real-time object detector that runs on a T4 (free Colab) 🤩
> Collection with all checkpoints and demo: ustc-community/d-fine-68109b427cbe6ee36b4e7352
Notebooks:
> Tracking https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/DFine_tracking.ipynb
> Inference https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/DFine_inference.ipynb
> Fine-tuning https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/DFine_finetune_on_a_custom_dataset.ipynb
h/t @vladislavbro @qubvel-hf @ariG23498 and the authors of the paper
Regular object detectors attempt to predict bounding boxes in pixel-perfect (x, y, w, h) coordinates, which is very rigid and hard to solve 🥲☹️
D-FINE formulates object detection as a distribution over bounding-box coordinates and refines it iteratively, which is more accurate 🤩
Another core idea behind this model is Global Optimal Localization Self-Distillation ⤵️
The model uses the final layer's distribution output (sort of like a teacher) to distill into earlier layers, making them more performant.
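For a quick start, here is a minimal inference sketch using the transformers object-detection pipeline; the checkpoint name is an assumption for illustration, so pick an actual D-FINE checkpoint from the collection above:

from transformers import pipeline

# checkpoint name is illustrative - browse the ustc-community collection for real ones
detector = pipeline("object-detection", model="ustc-community/dfine-medium-coco")

# the pipeline returns a list of {"score", "label", "box"} dicts per detected object
for detection in detector("street_scene.jpg", threshold=0.5):
    print(detection["label"], round(detection["score"], 3), detection["box"])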

reacted to BramVanroy's post with ❤️ about 1 month ago
Post
3168
📢 Introducing the Common Crawl Creative Commons Corpus (C5)!
C5 is a large-scale effort to heavily filter web-crawled data, as collected by the non-profit Common Crawl, down to only documents that are Creative Commons-licensed (such as cc-by-4.0) or public domain (cc0). At this stage, 150 billion tokens have been collected.
---
data: BramVanroy/CommonCrawl-CreativeCommons
software: https://github.com/BramVanroy/CommonCrawl-CreativeCommons
---
</> To build C5, HTML pages are scrutinized and all links (if any) to CC licenses are collected, both in regular hyperlinks and in metadata. Additional data fields are included, such as "was the license found in the head?" or "if multiple licenses were found, do they contradict each other?", which makes further filtering a breeze.
In this first version of C5, 8 languages are included (Afrikaans, German, English, French, Frisian, Italian, Dutch and Spanish). The language set was limited for two reasons: computational and storage limitations, and a collaboration with GPT-NL, which requested CC data for these languages to train a Dutch-focused, copyright-conscious LLM. In total, this V1 release contains almost 150 thousand documents and 150 billion tokens. This data was neither filtered on quality nor deduplicated, so that you can decide for yourself how much data to keep. To give some quality indication, a dataset field describes whether a document is included in the FineWeb(-2) datasets, which are of high quality.
More work needs to be done! Only 7 out of 100+ Common Crawl crawls have been processed so far. That's encouraging, because it means there is a lot more Creative Commons data to be collected! But to get there, I need help in terms of compute. The current processing was already heavily sponsored by the Flemish Supercomputer, but more is needed. If you have the compute available and wish to collaborate in an open and transparent manner, please get in touch!
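A minimal sketch of how such license fields could be used to filter the dataset with the datasets library; the config name and column names here are assumptions based on the description above, so check the dataset card for the real schema:

from datasets import load_dataset

# config and column names are assumed for illustration - see the dataset card
ds = load_dataset("BramVanroy/CommonCrawl-CreativeCommons", "deu", split="train", streaming=True)

# keep documents whose license link appeared in the HTML head and whose
# (possibly multiple) detected licenses do not contradict each other
filtered = ds.filter(lambda doc: doc["license_in_head"] and not doc["license_disagreement"])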

reacted to jsulz's post with 🔥 about 1 month ago
Post
2633
At xet-team we've been hard at work bringing a new generation of storage to the Hugging Face community, and we've crossed some major milestones:
• Over 2,000 builders and nearing 100 organizations with access to Xet
• Over 70,000 model and dataset repositories are Xet-backed
• 1.4 petabytes managed by Xet
As we move repos from LFS to Xet for everyone we onboard, we're pushing our content-addressed store (CAS). Check out the chart below of CAS hitting up to 150 Gb/s throughput this past week.
All of this growth is helping us build richer insights. We expanded our repo graph, which maps how Xet-backed repositories on the Hub share bytes with each other.
Check out the current network in the image below (nodes are repositories, edges are where repos share bytes) and visit the space to see how different versions of Qwen, Llama, and Phi models are grouped together: xet-team/repo-graph
Join the waitlist to get access! https://huggingface.co/join/xet
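As a rough intuition for why content addressing enables deduplication, here is a toy Python sketch; Xet's actual CAS is a distributed system and works quite differently, so treat this purely as an illustration of the idea:

import hashlib

class ToyContentAddressedStore:
    """Chunks are addressed by their hash, so identical bytes are stored only once."""

    def __init__(self):
        self._chunks: dict[str, bytes] = {}

    def put(self, chunk: bytes) -> str:
        digest = hashlib.sha256(chunk).hexdigest()
        self._chunks.setdefault(digest, chunk)  # dedupe: no-op for already-known chunks
        return digest

    def get(self, digest: str) -> bytes:
        return self._chunks[digest]

store = ToyContentAddressedStore()
a = store.put(b"model weights, shard 1")
b = store.put(b"model weights, shard 1")  # re-upload of identical content
assert a == b and len(store._chunks) == 1  # stored exactly once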


reacted to anakin87's post about 1 month ago
Post
3433
I trained a Language Model to schedule events with GRPO!
✍️ Blog post: https://huggingface.co/blog/anakin87/qwen-scheduler-grpo
I experimented with GRPO lately.
I am fascinated by models learning from prompts and rewards - no example answers needed, unlike in Supervised Fine-Tuning.
After the DeepSeek boom, everyone is trying GRPO with GSM8K or the Countdown Game...
I wanted a different challenge, like teaching a model to create a schedule from a list of events and priorities.
Choosing an original problem forced me to:
- Think about the problem setting
- Generate data
- Choose the right base model
- Design reward functions (and experience reward hacking)
- Run multiple rounds of training, hoping that my model would learn something.
A fun and rewarding experience.
I learned a lot of things that I want to share with you.
✍️ Blog post: https://huggingface.co/blog/anakin87/qwen-scheduler-grpo
💻 Code: https://github.com/anakin87/qwen-scheduler-grpo
🤗 Hugging Face collection (dataset and model): anakin87/qwen-scheduler-grpo-680bcc583e817390525a8837
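To give a flavor of reward-function design for GRPO, here is a minimal format-reward sketch in the shape custom reward functions typically take for TRL's GRPOTrainer; the tag names and scoring are assumptions, and the blog post's actual rewards also score schedule validity and priority-weighted coverage:

import re

def format_reward(completions: list[str], **kwargs) -> list[float]:
    """Toy reward: 1.0 if the completion wraps its answer in <schedule> tags."""
    pattern = re.compile(r"<schedule>.*?</schedule>", re.DOTALL)
    return [1.0 if pattern.search(c) else 0.0 for c in completions]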

reacted to hannayukhymenko's post with 🔥 about 1 month ago
Post
3449
We are delighted to announce MamayLM, a new state-of-the-art efficient Ukrainian LLM!
MamayLM surpasses similar-sized models in both English and Ukrainian, while matching or overtaking up to 10x larger models.
MamayLM is a 9B model that can run on a single GPU, enabling cost-efficient AI autonomy and adoption across sectors in Ukraine such as education, legal, healthcare and public services (e.g., by specializing it to particular use cases). MamayLM is also attractive for organizations wishing to preserve data privacy, as its efficiency allows it to run on a local machine.
MamayLM is trained on high-quality Ukrainian data and understands Ukrainian language, culture, and history. It is built on top of Google's Gemma 2 9B model, but uses a number of new advances stemming from INSAIT's experience in creating BgGPT, a Bulgarian LLM we released last year, now adopted nationwide and profiled several times by Google as a worldwide success case.
MamayLM was developed in a collaboration between researchers at INSAIT and ETH Zürich and is trained entirely via donations to INSAIT for AI compute resources.
MamayLM is now freely available to download on INSAIT's Hugging Face page in both full and quantized versions. We also publicly release all Ukrainian benchmarks we evaluated on.
Further, we release blog posts in both English and Ukrainian, sharing our approach to creating MamayLM, hoping to drive further improvements by the community.
The release of LLMs for various languages is part of INSAIT's mission of ensuring countries can achieve AI autonomy in a cost-efficient, controlled, safe and predictable manner.
MamayLM model and benchmarks:
INSAIT-Institute
Blog (EN): https://huggingface.co/blog/INSAIT-Institute/mamaylm
Blog (UKR): https://huggingface.co/blog/INSAIT-Institute/mamaylm-ukr
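A minimal generation sketch with transformers; the exact repository name below is an assumption, so check the INSAIT-Institute org page for the released checkpoints:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "INSAIT-Institute/MamayLM-Gemma-2-9B-IT"  # assumed repo name - verify on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# "Tell me about the history of Kyiv." in Ukrainian
inputs = tokenizer("Розкажи про історію Києва.", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))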

reacted to jsulz's post with 🔥 about 2 months ago
Post
3768
Huge week for xet-team as Llama 4 is the first major model on Hugging Face uploaded with Xet providing the backing! Every byte downloaded comes through our infrastructure.
Using Xet on Hugging Face is the fastest way to download and iterate on open source models, and we've proved it with Llama 4, giving a boost of ~25% across all models.
We expect builders on the Hub to see even more improvements, helping power innovation across the community.
With the models on our infrastructure, we can peer in and see how well our dedupe performs across the Llama 4 family. On average, we're seeing ~25% dedupe, providing huge savings to the community who iterate on these state-of-the-art models. The attached image shows a few selected models and how they perform on Xet.
Thanks to the meta-llama team for launching on Xet!


posted an update 2 months ago
Post
2572
Woohoo 🥳 I have finished my 2025 GPU workstation build and I am very excited to train awesome new open source models on it.
I built my last GPU workstation 5 years ago, featuring an AMD Ryzen 5900X and 64GB of G.SKILL Trident Z RGB on an ASRock X570 Taichi, cooled by an Alphacool Eisbär 420. The GPU was a Zotac RTX 3090 AMP Extreme. Unfortunately, I was never satisfied with the case - a Fractal Define 7 - as it is definitely too small, airflow is not optimal (I had to keep the front door open all the time), and it also arrived with a partly damaged side panel.
For my new build, I used the following components: an outstanding new AMD Ryzen 9950X3D with 64GB of Corsair Dominator Titanium (what a name). As a huge Noctua fan - warm greetings to my Austrian neighbors - I am using the brand new Noctua NH-D15 G2 on an ASRock X870E Taichi in an amazing Lian Li LANCOOL III chassis. One joke that only NVIDIA Blackwell users will understand: you definitely need a tempered glass panel to check if your GPU cables/connectors start melting. And the best is yet to come: I returned my previously bought Zotac RTX 5090 Solid to the eBay seller (because of... missing ROPs - again, only NVIDIA Blackwell users will understand) and bought a Zotac 5090 AMP Extreme INFINITY (yes, the long name indicates that this is Zotac's flagship model) from a more trustworthy source (NBB in Germany).
I am so happy to start training and fine-tuning new open source models - stay tuned!!!

reacted to wassemgtk's post with ❤️ 2 months ago
Post
2100
For fun, a new project: SuperTokenizer! A BPE tokenizer trained on C4 to beat GPT-4. Byte-level, A100-powered, and open-source. Messing around with tokens!
https://github.com/wassemgtk/SuperTokenizer
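For context, training a byte-level BPE tokenizer with the Hugging Face tokenizers library takes only a few lines; this is a generic sketch under assumed settings (vocab size, special tokens, a local C4 text shard), not the repo's actual training script:

from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# byte-level BPE: operate on bytes so any input can be tokenized without <unk>
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

# vocab size, special tokens and the input file are assumptions for illustration
trainer = trainers.BpeTrainer(vocab_size=100_000, special_tokens=["<|endoftext|>"])
tokenizer.train(files=["c4_shard.txt"], trainer=trainer)
tokenizer.save("supertokenizer.json")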

reacted to clem's post with 🔥 3 months ago
Post
2611
Nice new space to see how fast your personal or organization followers are growing on HF:
julien-c/follow-history
As you can see, I still have more followers than @julien-c, even if he's trying to change this by building such cool spaces.

reacted to csabakecskemeti's post with 🔥 3 months ago
Post
1978
-UPDATED-
4-bit inference is working! The blog post is updated with a code snippet and requirements.txt:
https://devquasar.com/uncategorized/all-about-amd-and-rocm/
I've played around with an MI100 and ROCm and collected my experience in a blog post:
https://devquasar.com/uncategorized/all-about-amd-and-rocm/
Unfortunately, I could not make inference or training work with the model loaded in 8-bit or using BnB, but I did everything else and documented my findings.
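For reference, a generic 4-bit loading sketch with transformers and bitsandbytes; this is not necessarily the snippet from the blog post, and bitsandbytes support on ROCm/MI100 is exactly the part that needed extra work, so treat the setup as an assumption and follow the blog's requirements.txt:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantized weights with fp16 compute
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

model_id = "facebook/opt-1.3b"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")

inputs = tokenizer("Hello from the MI100!", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))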

replied to their post 3 months ago
Let's see if BERT5urk can make it into @merve's weekly recap of open AI.

posted an update 3 months ago
Post
999
🇹🇷 I'm very happy to finally announce my new Turkish LM called "BERT5urk":
stefan-it/bert5urk
It is a 1.42B T5-based model, trained with the UL2 pretraining objective on the Turkish part of the awesome HuggingFaceFW/fineweb-2 dataset.
Feel free to check it out!
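A minimal loading sketch: since BERT5urk is T5-based, the seq2seq auto classes should apply, but the exact inference recipe (UL2 mode tokens, sentinel masks) is an assumption here - see the model card for specifics:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stefan-it/bert5urk")
model = AutoModelForSeq2SeqLM.from_pretrained("stefan-it/bert5urk")

# span-infilling probe with a T5 sentinel token ("Istanbul is Turkey's <mask> city.")
inputs = tokenizer("İstanbul, Türkiye'nin <extra_id_0> şehridir.", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=10)[0]))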

posted an update 3 months ago
Post
3218
After running some 3DMark and FurMark benchmarks on Windows to make sure that my new 5090 is not melting any cables [1], and after some nice shots with a thermal camera (I don't think that's too much), running fine-tuning experiments with my favorite Flair & Transformers libraries is very easy.
Important steps:
A good idea is to start with a fresh Ubuntu 24.04 installation with the latest CUDA 12.8 and the open NVIDIA driver - follow the advice from [2]:
sudo apt -y install cuda-toolkit-12-8 nvidia-open
I tried updating an existing Ubuntu installation with an older CUDA and driver version, and it resulted in a non-bootable system.
If you are using PyTorch 2.6 built with CUDA 12.6, it will result in:
NVIDIA Graphics Device with CUDA capability sm_120 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_50 sm_60 sm_70 sm_75 sm_80 sm_86 sm_90.
But no worries! For PyTorch, you just need to use a nightly 2.7 version that was built with CUDA 12.8. This can easily be done via:
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128
After that, the latest Flair version can be installed and fine-tuning will work!
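A quick sanity check after installing the nightly (a minimal sketch; the (12, 0) capability for Blackwell is what the sm_120 error message above implies):

import torch

# expect a 2.7 nightly built against CUDA 12.8, and capability (12, 0) on Blackwell
print(torch.__version__, torch.version.cuda)
print(torch.cuda.get_device_name(0), torch.cuda.get_device_capability(0))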
References:
[1]: https://www.reddit.com/r/nvidia/comments/1inpox7/rtx_50_series_12vhpwr_megathread/
[2]: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=24.04&target_type=deb_network

reacted to jsulz's post 3 months ago
Post
3656
Time flies!
Six months after joining Hugging Face, the Xet team is kicking off the first migrations from LFS to our storage for a number of repositories on the Hub.
More on the nitty-gritty details behind the migration soon, but here are the big takeaways:
We've successfully completed the first migrations from LFS -> Xet to test the infrastructure and prepare for a wider release.
No action on your part is needed - you can work with a Xet-backed repo like any other repo on the Hub (for now - major improvements are on their way!)
Keep an eye out for the Xet logo to see if a repo you know is on our infra! See the screenshots below to spot the difference.
⏩ ⏩ ⏩ Blazing uploads and downloads coming soon. We're gearing up for a full integration with the Hub's Python library that will make building on the Hub faster than ever - special thanks to @celinah and @Wauplin for their assistance.
Want early access? If you're curious and want to test out the bleeding edge that will power the development experience on the Hub, we'd love to partner with you. Let me know!
This is the culmination of a lot of effort from the entire team. Big round of applause to @sirahd @brianronan @jgodlewski @hoytak @seanses @assafvayner @znation @saba9 @rajatarya @port8080 @yuchenglow

replied to their post 3 months ago

reacted to nicolay-r's post 4 months ago
Post
2354
📢 If you wish to empower an LLM with IR and a named-entity-recognition module, then I've got relevant findings.
I just tested Flair; below is how you can get started adapting it to process your CSV / JSONL data via bulk-ner.
👩‍💻 code: https://github.com/nicolay-r/nlp-thirdgate/blob/master/tutorials/ner_flair_0151.sh
🤗 models: flair
Provider: https://raw.githubusercontent.com/nicolay-r/nlp-thirdgate/refs/heads/master/ner/flair_0151.py
Framework: https://github.com/nicolay-r/bulk-ner
Performance with the default NER model (ThinkPad X1 Nano):
Batch size 1: 6 it/sec
Batch size 10+: 12 it/sec
Other wrappers for bulk-ner in nlp-thirdgate: https://github.com/nicolay-r/nlp-thirdgate
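For reference, plain Flair usage (without the bulk-ner wrapper) looks like this; "ner" loads Flair's default 4-class English model, which is presumably the one behind the throughput numbers above:

from flair.data import Sentence
from flair.models import SequenceTagger

# load the default English NER model
tagger = SequenceTagger.load("ner")
sentence = Sentence("George Washington went to Washington.")
tagger.predict(sentence)

for entity in sentence.get_spans("ner"):
    print(entity)  # e.g. Span "George Washington" -> PER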

reacted to davanstrien's post with 🔥 4 months ago
Post
2065
Big step for multilingual AI data!
The Hugging Face community has rated educational content in languages spoken by 1.6 billion people! New additions:
• Japanese
• Italian
• Old High German
Learn more and contribute: https://huggingface.co/blog/davanstrien/fineweb2-community
These ratings can help enhance training data for major world languages.