
Stefan Schweter PRO

stefan-it

AI & ML interests

Flair Library 💕, NER & PoS Tagging, LM Pretraining (mostly encoder-only & encoder-decoder), Historical Language Models, German Language Models

Recent Activity

Organizations

Bayerische Staatsbibliothek · flair · Flax Community · dumitrescustefan-org · GermanT5 · BigScience: LMs for Historical Texts · Universal NER · BigLAM: BigScience Libraries, Archives and Museums · Libre Euro Lingua-Alliance · Lang UK · BabyLM Challenge · hmByT5 Preliminary · hmByT5 · Blog-explorers · German Wikipedia LMs · hmBERT · hmTEAMS · HIPE · hmBERT Tiny · hmBERT 64k · LSV @ Saarland University · GERMATRON · PleIAs · German LLM Tokenizers · Occiglot · Social Post Explorers · GERTuraX · Stefmal · Hugging Face Discord Community · ScaDS.AI German LLM · ENGEBA · Nerdy Face · TensorFlow Model Garden LMs · Hugging Face MCP Course

stefan-it's activity

reacted to merve's post with 🚀 13 days ago
You can translate this post 🤗💗
reacted to merve's post with 🚀 30 days ago
A real-time object detector that is much faster and more accurate than YOLO, with an Apache 2.0 license, just landed in Hugging Face transformers 🔥

D-FINE is the state-of-the-art real-time object detector, and it runs on a T4 (free Colab) 🤩

> Collection with all checkpoints and demo: ustc-community/d-fine-68109b427cbe6ee36b4e7352

Notebooks:
> Tracking https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/DFine_tracking.ipynb
> Inference https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/DFine_inference.ipynb
> Fine-tuning https://github.com/qubvel/transformers-notebooks/blob/main/notebooks/DFine_finetune_on_a_custom_dataset.ipynb
h/t @vladislavbro @qubvel-hf @ariG23498 and the authors of the paper 🎩

Regular object detectors attempt to predict bounding boxes as pixel-perfect (x, y, w, h) coordinates, which is a very rigid formulation and hard to optimize 🥲☹️



D-FINE instead formulates object detection as predicting probability distributions over bounding-box coordinates and refines them iteratively, which makes it more accurate 🤩

Another core idea behind this model is Global Optimal Localization Self-Distillation (GO-LSD) ⤵️

The model uses the final layer's distribution output (sort of like a teacher) and distills it into the earlier layers to make them more performant.
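
Trying it out is a one-liner with the transformers object-detection pipeline - a minimal sketch, assuming an illustrative checkpoint ID (see the collection above for the exact names):

from transformers import pipeline

# Checkpoint ID is an assumption - pick a real one from the collection above.
detector = pipeline("object-detection", model="ustc-community/dfine-medium-coco")
results = detector("http://images.cocodataset.org/val2017/000000039769.jpg")
for result in results:
    print(result["label"], round(result["score"], 3), result["box"])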

  • 2 replies
ยท
reacted to BramVanroy's post with ❤️ about 1 month ago
📢💾 Introducing the Common Crawl Creative Commons Corpus (C5)!

C5 is a large-scale effort to heavily filter web-crawled data, as collected by the non-profit Common Crawl, down to only documents that are Creative Commons-licensed (such as cc-by-4.0) or in the public domain (cc0). At this stage, 150 billion tokens have been collected.

---
📄 data: BramVanroy/CommonCrawl-CreativeCommons
🧰 software: https://github.com/BramVanroy/CommonCrawl-CreativeCommons
---

</> To build C5, HTML pages are scrutinized and all links (if any) to CC licenses are collected, both from regular hyperlinks and from metadata. Additional data fields are included, such as "was the license found in the head?" or "if multiple licenses were found, do they contradict each other?", which makes further filtering a breeze.

๐ŸŒ In this first version of C5, 8 languages are included (Afrikaans, German, English, French, Frysian, Italian, Dutch and Spanish). The language set was limited for two reasons: computational and storage limitations, and a collaboration with GPT-NL, which requested CC data for these languages to train a Dutch-focused, copyright-conscious LLM. In total, this V1 release contains almost 150 thousand documents and 150 billion tokens. This data was not filtered on quality nor deduplicated so that you can decide for yourself how much data to keep. To give some quality indication, a dataset field is present to describe whether a document is included in the FineWeb(-2) datasets, which are of high quality.

๐Ÿ” More work needs to be done! Only 7 out of 100+ Common Crawl crawls have been processed so far. That's encouraging because it means there is a lot more Creative Commons data to be collected! But to get there I need help in terms of compute. The current processing was already heavily sponsored by the Flemish Supercomputer but more is needed. If you have the compute available and which to collaborate in an open and transparent manner, please get in touch!
  • 1 reply
ยท
reacted to jsulz's post with 🔥 about 1 month ago
At xet-team we've been hard at work bringing a new generation of storage to the Hugging Face community, and we've crossed some major milestones:

👷 Over 2,000 builders and nearing 100 organizations with access to Xet
🚀 Over 70,000 model and dataset repositories are Xet-backed
🤯 1.4 petabytes managed by Xet

As we move repos from LFS to Xet for everyone we onboard, we're pushing our content-addressed store (CAS). Check out the chart below 👇 of CAS hitting up to 150 Gb/s throughput this past week.

All of this growth is helping us build richer insights. We expanded our repo graph, which maps how Xet-backed repositories on the Hub share bytes with each other.

Check out the current network in the image below (nodes are repositories, edges are where repos share bytes), and visit the space to see how different versions of Qwen, Llama, and Phi models are grouped together: xet-team/repo-graph

Join the waitlist to get access! https://huggingface.co/join/xet
reacted to anakin87's post with 👍 about 1 month ago
I trained a Language Model to schedule events with GRPO! 👑 🗓️

โœ๏ธ Blog post: https://huggingface.co/blog/anakin87/qwen-scheduler-grpo

I experimented with GRPO lately.

I am fascinated by models learning from prompts and rewards - no example answers needed, unlike in Supervised Fine-Tuning.

After the DeepSeek boom, everyone is trying GRPO with GSM8K or the Countdown Game...

I wanted a different challenge, like teaching a model to create a schedule from a list of events and priorities.

Choosing an original problem forced me to:
🤔 Think about the problem setting
🧬 Generate data
🤝 Choose the right base model
🏆 Design reward functions (and experience reward hacking) - see the sketch after this list
🔄 Run multiple rounds of training, hoping that my model would learn something.
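
As a flavor of what this looks like, here is a minimal GRPO sketch with trl - the toy dataset, reward function, and base model choice are illustrative, not the exact code from the blog post:

from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Toy prompt-only dataset (illustrative).
train_dataset = Dataset.from_dict({"prompt": ["Create a schedule for: standup (high), gym (low)."]})

# Reward completions that wrap their answer in the expected tags; the real
# rewards in the blog post are more involved (and got hacked at times!).
def format_reward(completions, **kwargs):
    return [1.0 if "<schedule>" in c and "</schedule>" in c else 0.0 for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",  # base model choice is an assumption
    reward_funcs=format_reward,
    args=GRPOConfig(output_dir="qwen-scheduler-grpo"),
    train_dataset=train_dataset,
)
trainer.train()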

A fun and rewarding 😄 experience.


I learned a lot of things that I want to share with you. 👇
✍️ Blog post: https://huggingface.co/blog/anakin87/qwen-scheduler-grpo
💻 Code: https://github.com/anakin87/qwen-scheduler-grpo
🤗 Hugging Face collection (dataset and model): anakin87/qwen-scheduler-grpo-680bcc583e817390525a8837
  • 2 replies
ยท
reacted to hannayukhymenko's post with 🔥 about 1 month ago
🚀 We are delighted to announce MamayLM, a new state-of-the-art efficient Ukrainian LLM!

📈 MamayLM surpasses similar-sized models in both English and Ukrainian, while matching or overtaking models up to 10x larger.

📊 MamayLM is a 9B model that can run on a single GPU, enabling cost-efficient AI autonomy and adoption across sectors in Ukraine such as education, legal, healthcare, and public services (e.g., by specializing it to particular use cases). MamayLM is also attractive for organizations wishing to preserve data privacy, as its efficiency allows it to run on a local machine.

🧠 MamayLM is trained on high-quality Ukrainian data and understands the Ukrainian language, culture, and history. It is built on top of Google's Gemma 2 9B model, but uses a number of new advances stemming from INSAIT's experience in creating BgGPT, a Bulgarian LLM we released last year, now adopted nationwide and profiled several times by Google as a worldwide success case.

๐Ÿค MamayLM is developed in a collaboration between researchers at INSAIT and ETH Zรผrich and is trained entirely via donations to INSAIT for AI compute resources.

📥 MamayLM is now freely available to download from INSAIT's Hugging Face organization, in both full and quantized versions. We also publicly release all the Ukrainian benchmarks we evaluated on.

๐Ÿ“ Further, we release blog posts in both English and Ukrainian, sharing our approach to creating MamayLM, hoping to drive further improvements by the community.

🌎 The release of LLMs for various languages is part of INSAIT's mission to ensure countries can achieve AI autonomy in a cost-efficient, controlled, safe, and predictable manner.

MamayLM model and benchmarks: INSAIT-Institute
Blog (EN): https://huggingface.co/blog/INSAIT-Institute/mamaylm
Blog (UKR): https://huggingface.co/blog/INSAIT-Institute/mamaylm-ukr
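
For reference, a minimal loading sketch with transformers - the repo ID below is illustrative, see the INSAIT-Institute organization page for the exact model names:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "INSAIT-Institute/MamayLM-Gemma-2-9B-IT-v0.1"  # illustrative ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("Розкажи про Київ.", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))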
  • 1 reply
ยท
reacted to jsulz's post with 🔥 about 2 months ago
Huge week for xet-team as Llama 4 is the first major model on Hugging Face uploaded with Xet providing the backing! Every byte downloaded comes through our infrastructure.

Using Xet on Hugging Face is the fastest way to download and iterate on open source models, and we've proved it with Llama 4, giving a boost of ~25% across all models.

We expect builders on the Hub to see even more improvements, helping power innovation across the community.

With the models on our infrastructure, we can peer in and see how well our dedupe performs across the Llama 4 family. On average, we're seeing ~25% dedupe, providing huge savings to the community who iterate on these state-of-the-art models. The attached image shows a few selected models and how they perform on Xet.

Thanks to the meta-llama team for launching on Xet!
posted an update 2 months ago
Woohoo 🥳 I have finished my 2025 GPU workstation build, and I am very excited to train new awesome open source models on it.

I built my last GPU workstation 5 years ago, featuring an AMD Ryzen 9 5900X and 64GB of G.SKILL Trident Z RGB on an ASRock X570 Taichi, cooled by an Alphacool Eisbär 420. The GPU was a Zotac RTX 3090 AMP Extreme. Unfortunately, I was never satisfied with the case - a Fractal Define 7 - as it is definitely too small, airflow is not optimal (I had to keep the front door open all the time), and it also arrived with a partly damaged side panel.

For my new build, I've used the following components: an outstanding new AMD Ryzen 9 9950X3D with 64GB of Corsair Dominator Titanium (what a name). As a huge Noctua fan - warm greetings to my Austrian neighbors - I am using the brand-new Noctua NH-D15 G2 on an ASRock X870E Taichi in an amazing Lian Li LANCOOL III chassis. One joke that only NVIDIA Blackwell users will understand: you definitely need a tempered glass panel to check if your GPU cables/connectors start melting 😂 And the best is yet to come: I returned my previously bought Zotac RTX 5090 Solid to the eBay seller (because of... missing ROPs; again, only NVIDIA Blackwell users will understand) and bought a Zotac RTX 5090 AMP Extreme INFINITY (yes, the long name indicates that this is Zotac's flagship model) from a more trustworthy source (NBB in Germany).

I am so happy to start training and fine-tuning new open source models - stay tuned!!!
  • 2 replies
ยท
reacted to wassemgtk's post with ❤️ 2 months ago
For fun, a new project: SuperTokenizer! A BPE tokenizer trained on C4 to beat the GPT-4 tokenizer. Byte-level, A100-powered, and open source. Messing around with tokens!
https://github.com/wassemgtk/SuperTokenizer
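
For context, training a byte-level BPE tokenizer with the Hugging Face tokenizers library looks roughly like this - a generic sketch, not necessarily the exact SuperTokenizer setup:

from tokenizers import ByteLevelBPETokenizer

# "c4_sample.txt" is a placeholder for a local C4 text dump.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["c4_sample.txt"], vocab_size=100_000, min_frequency=2)
tokenizer.save_model(".")  # writes vocab.json and merges.txt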
  • 1 reply
ยท
reacted to clem's post with 🔥 3 months ago
Nice new space to see how fast your personal or organization followers are growing on HF:
julien-c/follow-history

As you can see, I still have more followers than @julien-c even if he's trying to change this by building such cool spaces 😍😍😍
reacted to clem's post with 🚀 3 months ago
We just crossed 1,500,000 public models on Hugging Face (and 500k spaces, 330k datasets, 50k papers). One new repository is created every 15 seconds. Congratulations all!
reacted to csabakecskemeti's post with 🔥 3 months ago
-UPDATED-
4-bit inference is working! The blog post is updated with a code snippet and requirements.txt:
https://devquasar.com/uncategorized/all-about-amd-and-rocm/
-UPDATED-
I've played around with an MI100 and ROCm and collected my experience in a blog post:
https://devquasar.com/uncategorized/all-about-amd-and-rocm/
Unfortunately, I could not get inference or training to work with the model loaded in 8-bit, or use BnB, but I did everything else and documented my findings.
  • 4 replies
ยท
replied to their post 3 months ago

Let's see if BERT5urk can make it into @merve's weekly recap of open AI 🤗

posted an update 3 months ago
🇹🇷 😍 I'm very happy to finally announce my new Turkish LM called "BERT5urk":

stefan-it/bert5urk

It is a 1.42B-parameter T5-based model, trained with the UL2 pretraining objective on the Turkish part of the awesome HuggingFaceFW/fineweb-2 dataset.

Feel free to check it out!
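
A minimal usage sketch, assuming the checkpoint loads with the standard transformers seq2seq classes:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("stefan-it/bert5urk")
model = AutoModelForSeq2SeqLM.from_pretrained("stefan-it/bert5urk")

inputs = tokenizer("Merhaba dünya!", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))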
  • 1 reply
ยท
posted an update 3 months ago
After running some 3DMark and FurMark benchmarks on Windows to make sure that my new 5090 is not melting any cables [1], and after taking some nice shots with a thermal camera (I don't think that's too much), running fine-tuning experiments with my favorite Flair & Transformers libraries is very easy.

Important steps:

It is a good idea to start with a fresh Ubuntu 24.04 installation with the latest CUDA 12.8 and the open NVIDIA driver - follow the advice from [2]:

sudo apt -y install cuda-toolkit-12-8 nvidia-open

I tried updating from an existing Ubuntu installation with an older CUDA and driver version, and it resulted in a non-bootable system.

If you are using PyTorch 2.6 built with CUDA 12.6, it will result in:

NVIDIA Graphics Device with CUDA capability sm_120 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_50 sm_60 sm_70 sm_75 sm_80 sm_86 sm_90.

But no worries! For PyTorch, you just need to use a nightly 2.7 version built with CUDA 12.8. This can easily be done via:

pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu128

After that, the latest Flair version can be installed, and fine-tuning will work!
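
A quick sanity check that the installed nightly actually targets Blackwell (standard PyTorch APIs):

import torch

print(torch.__version__, torch.version.cuda)  # e.g. 2.7.0.dev..., 12.8
print(torch.cuda.get_device_capability(0))    # (12, 0) on Blackwell
print(torch.cuda.get_arch_list())             # should include 'sm_120'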

References:

[1]: https://www.reddit.com/r/nvidia/comments/1inpox7/rtx_50_series_12vhpwr_megathread/
[2]: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=24.04&target_type=deb_network
  • 1 reply
ยท
reacted to jsulz's post with 🚀 3 months ago
Time flies!

Six months after joining Hugging Face, the Xet team is kicking off the first migrations from LFS to our storage for a number of repositories on the Hub.

More on the nitty gritty details behind the migration soon, but here are the big takeaways:

🤖 We've successfully completed the first migrations from LFS -> Xet to test the infrastructure and prepare for a wider release

✅ No action needed on your part - you can work with a Xet-backed repo like any other repo on the Hub (for now - major improvements are on their way!)

👀 Keep an eye out for the Xet logo to see if a repo you know is on our infra! See the screenshots below to spot the difference 👇

โฉ โฉ โฉ Blazing uploads and downloads coming soon. Wโ€™re gearing up for a full integration with the Hub's Python library that will make building on the Hub faster than ever - special thanks to @celinah and @Wauplin for their assistance.

🎉 Want early access? If you're curious and want to test out the bleeding edge that will power the development experience on the Hub, we'd love to partner with you. Let me know!

This is the culmination of a lot of effort from the entire team. Big round of applause to @sirahd @brianronan @jgodlewski @hoytak @seanses @assafvayner @znation @saba9 @rajatarya @port8080 @yuchenglow
  • 1 reply
ยท
replied to their post 3 months ago
posted an update 3 months ago
She arrived 😍

[Expect more models soon...]
  • 2 replies
ยท
reacted to nicolay-r's post with 🚀 4 months ago
📢 If you wish to empower an LLM with IR and a named entity recognition module, I have relevant findings.
I just tested Flair; below is how you can start adapting it to process your CSV / JSONL data via bulk-ner.
👩‍💻 code: https://github.com/nicolay-r/nlp-thirdgate/blob/master/tutorials/ner_flair_0151.sh
🤖 models: flair

Provider: https://raw.githubusercontent.com/nicolay-r/nlp-thirdgate/refs/heads/master/ner/flair_0151.py
Framework: https://github.com/nicolay-r/bulk-ner

🚀 Performance with the default NER model (ThinkPad X1 Nano):
Batch size 1: 6 it/sec
Batch size 10+: 12 it/sec

🌌 Other bulk-ner wrappers in nlp-thirdgate: https://github.com/nicolay-r/nlp-thirdgate
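
If you prefer to call Flair directly instead of going through bulk-ner, the standard Flair API looks like this (the model choice here is mine, not from the post):

from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("flair/ner-english")  # model choice is illustrative
sentence = Sentence("George Washington went to Washington.")
tagger.predict(sentence)
for entity in sentence.get_spans("ner"):
    print(entity)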
reacted to davanstrien's post with 🔥 4 months ago
๐ŸŒ Big step for multilingual AI data!

The Hugging Face community has rated educational content in languages spoken by 1.6 billion people! New additions:
• Japanese
• Italian
• Old High German

Learn more and contribute: https://huggingface.co/blog/davanstrien/fineweb2-community

These ratings can help enhance training data for major world languages.
  • 1 reply
ยท