Activity Feed

Recent Activity

merve 
posted an update about 12 hours ago
🤯 241B VLM with an Apache-2.0 license: internlm/Intern-S1

internlm released Intern-S1: a multimodal reasoning model built on the 235B-parameter Qwen3 MoE and the 6B InternViT vision encoder 😍

benchmarks look great (👑 best model ✅ best open model)
sergiopaniego 
posted an update 5 days ago
Yet Another New Multimodal Fine-Tuning Recipe 🥧

🧑‍🍳 In this Hugging Face Cookbook notebook, we demonstrate how to align a multimodal model (VLM) with Mixed Preference Optimization (MPO) using trl.

💡 This recipe is powered by the new MPO support in trl, enabled through a recent upgrade to the DPO trainer!

We align the multimodal model using multiple optimization objectives (losses), guided by a preference dataset (chosen vs. rejected multimodal pairs).

Check it out! ➡️ https://huggingface.co/learn/cookbook/fine_tuning_vlm_mpo
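
For a sense of what the MPO setup looks like in code, here's a minimal sketch against trl's DPO trainer; the model and dataset below are illustrative placeholders rather than the cookbook's exact choices, and the loss-type/weight combination follows the MPO usage documented for trl's DPOTrainer:

```python
# Hedged sketch: MPO in trl = DPOTrainer with a weighted mix of loss types.
# Model/dataset are placeholder choices; see the cookbook for the real recipe.
from datasets import load_dataset
from transformers import AutoModelForImageTextToText, AutoProcessor
from trl import DPOConfig, DPOTrainer

model_id = "Qwen/Qwen2-VL-2B-Instruct"  # assumption: any trl-supported VLM
model = AutoModelForImageTextToText.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# A preference dataset of chosen vs. rejected multimodal pairs.
dataset = load_dataset("HuggingFaceH4/rlaif-v_formatted", split="train[:1%]")

# MPO mixes several objectives: preference (sigmoid), quality (bco_pair),
# and generation (sft), each with its own weight.
args = DPOConfig(
    output_dir="vlm-mpo",
    loss_type=["sigmoid", "bco_pair", "sft"],
    loss_weights=[0.8, 0.2, 1.0],
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=processor,  # the processor handles both text and images
)
trainer.train()
```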
merve 
posted an update 5 days ago
so many open LLMs and image LoRAs dropped this past week, here are some picks for you 🫡 merve/releases-july-18-687e3fbd2ab9b39c51f9238b

LLMs
> ByteDance released a bunch of translation models called Seed-X-RM (7B) ByteDance-Seed/Seed-X-RM-7B
> NVIDIA released reasoning models, of which the 32B surpasses the giant Qwen3-235B, under a CC-BY-4.0 license 👏 nvidia/openreasoning-nemotron-687730dae0170059860f1f01
> LG released a new EXAONE model (32B) LGAI-EXAONE/EXAONE-4.0-32B

VLMs/any-to-any
> vidore/colqwen-omni-v0.1 is a new any-to-any retriever (MIT)
> HiDream-ai/HiDream-E1-1 is an image+text-in, image+text-out model (MIT)

LoRAs
> There's a bunch of LoRAs based on Flux Kontext, gotta check out the collection 🤠
AtAndDev 
posted an update 6 days ago
Qwen 3 Coder is a personal attack on K2, and I love it.
It achieves near-SOTA on LCB (LiveCodeBench) while not having reasoning.
Finally people are understanding that reasoning isn't necessary for high benchmarks...

Qwen ftw!

DECENTRALIZE DECENTRALIZE DECENTRALIZE
MaziyarPanahi 
posted an update 7 days ago
🧬 Breaking news in Clinical AI: Introducing the OpenMed NER Model Discovery App on Hugging Face 🔬

OpenMed is back! 🔥 Finding the right biomedical NER model just became as precise as a PCR assay!

I'm thrilled to unveil my comprehensive OpenMed Named Entity Recognition Model Discovery App that puts 384 specialized biomedical AI models at your fingertips.

🎯 Why This Matters in Healthcare AI:
Traditional clinical text mining required hours of manual model evaluation. My Discovery App instantly connects researchers, clinicians, and data scientists with the exact NER models they need for their biomedical entity extraction tasks.

🔬 What You Can Discover:
✅ Pharmacological Models - Extract "chemical compounds", "drug interactions", and "pharmaceutical" entities from clinical notes
✅ Genomics & Proteomics - Identify "DNA sequences", "RNA transcripts", "gene variants", "protein complexes", and "cell lines"
✅ Pathology & Disease Detection - Recognize "pathological formations", "cancer types", and "disease entities" in medical literature
✅ Anatomical Recognition - Map "anatomical systems", "tissue types", "organ structures", and "cellular components"
✅ Clinical Entity Extraction - Detect "organism species", "amino acids", "protein families", and "multi-tissue structures"

💡 Advanced Features:
🔍 Intelligent Entity Search - Find models by specific biomedical entities (e.g., "Show me models detecting CHEM + DNA + Protein")
🏥 Domain-Specific Filtering - Browse by Oncology, Pharmacology, Genomics, Pathology, Hematology, and more
📊 Model Architecture Insights - Compare BERT, RoBERTa, and DeBERTa implementations
⚡ Real-Time Search - Auto-filtering as you type, no search buttons needed
🎨 Clinical-Grade UI - Beautiful, intuitive interface designed for medical professionals

Ready to revolutionize your biomedical NLP pipeline?

🔗 Try it now: OpenMed/openmed-ner-models
🧬 Built with: Gradio, Transformers, Advanced Entity Mapping
sergiopaniego 
posted an update 10 days ago
🧑‍🍳 New Multimodal Fine-Tuning Recipe 🧑‍🍳

⚡️ In this new Hugging Face Cookbook recipe, I walk you through the process of fine-tuning a Vision Language Model (VLM) for object detection with visual grounding, using TRL.

🔍 Object detection typically involves detecting categories in images (e.g., vase).

By combining it with visual grounding, we add contextual understanding so instead of detecting just "vase", we can detect "middle vase" in an image.

VLMs are super powerful!

In this case, I use PaliGemma 2 which already supports object detection and extend it to also add visual grounding.

🤗 Check it out here: https://huggingface.co/learn/cookbook/fine_tuning_vlm_object_detection_grounding
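
To make the grounding idea concrete, here's a hedged inference sketch using PaliGemma's detection prompt format; the checkpoint and image file are placeholders, and a base checkpoint only handles plain categories, while the fine-tune in the recipe is what makes grounded phrases like "middle vase" work:

```python
# Hedged sketch of PaliGemma-style detection prompting; not the recipe's code.
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma2-3b-pt-448"  # placeholder PaliGemma 2 checkpoint
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("vases.jpg")   # placeholder image
prompt = "detect middle vase"     # plain detection would be "detect vase"

inputs = processor(text=prompt, images=image, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)

# PaliGemma emits bounding boxes as <locYYYY> location tokens followed by
# the label, e.g. "<loc0123><loc0456><loc0789><loc0987> middle vase".
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```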
sergiopaniego 
posted an update 11 days ago
Multiple NEW notebooks and scripts added to the Hugging Face Gemma recipes repo!

Thanks to the community 🫶, we're adding more and more recipes using Gemma 💎

Fine-tuning for all modalities, function calling, RAG...

Repo: https://github.com/huggingface/huggingface-gemma-recipes

We're also open to new ideas from the community 🤗!
merve 
posted an update 12 days ago
Fine-tune Gemma3n on videos with audio inside, on a Colab A100 🔥
Just dropped the notebook where you can learn how to fine-tune Gemma3n on images+audio+text at the same time!

keep in mind, it's made for educational purposes 🫡 we do LoRA, audio resampling & video downsampling to be able to train in <40 GB VRAM

stretch modalities and unfreeze layers as you wish! 🙏🏻 merve/smol-vision
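
If you're curious what the LoRA part of such a setup looks like, here's a minimal sketch with peft; the rank, target modules, and even the checkpoint name are illustrative assumptions, not the notebook's exact values:

```python
# Hedged sketch of a LoRA setup for Gemma3n; all values are illustrative.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained("google/gemma-3n-E2B-it")

lora_config = LoraConfig(
    r=16,             # adapter rank (assumption)
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights trains
```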
merve 
posted an update 14 days ago
past week had huuuge releases 💗
here's our picks 🔥 find more models, datasets, demos here merve/releases-july-11-68750452c358c98b0fa663f7

> moonshotai/Kimi-K2-Instruct is the new SOTA LLM with 1T total / 32B active parameters 🤯

> HuggingFaceTB/SmolLM3-3B is the new best LM for its size; it offers a thinking mode 💭 and ships with the dataset HuggingFaceTB/smoltalk2

> Alibaba-NLP/WebSailor-3B is the new agentic LLM for complex browsing

> Google DeepMind released medical vision LMs with an agentic doctor-patient app google/medgemma-release-680aade845f90bec6a3f60c4

> fal released a LoRA to improve details on face images fal/Realism-Detailer-Kontext-Dev-LoRA
sergiopaniego 
posted an update 19 days ago
Test SmolLM3, the newest fully open model released by @HuggingFaceTB!

It's smol (3B), multilingual (6 languages), comes with dual-mode reasoning (think/no_think) and supports long context (128k).

Try it now in the notebook below!! ⬇️

Colab notebook: https://colab.research.google.com/github/sergiopaniego/samples/blob/main/smollm3_3b_inference.ipynb
GitHub notebook: https://github.com/sergiopaniego/samples/blob/main/smollm3_3b_inference.ipynb
blog: https://huggingface.co/blog/smollm3
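
As a quick taste of the dual-mode behavior, here's a hedged sketch of switching between the think/no_think modes through the system prompt; the /no_think flag usage follows the modes mentioned above, and the blog documents the full interface:

```python
# Hedged sketch: SmolLM3 inference with extended thinking switched off via
# the /no_think system flag (use /think to get a reasoning trace instead).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "system", "content": "/no_think"},  # direct answers, no <think>
    {"role": "user", "content": "Explain KV caching in one paragraph."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```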
merve 
posted an update 20 days ago
GitHub has been refusing to render notebooks for a long time now 💔

so smol-vision now lives in a Hugging Face model repository 🤗 merve/smol-vision
merve 
posted an update 20 days ago
ByteDance released Tar 1.5B and 7B: image-text-in, image-text-out models, fully open-source 👏 ByteDance-Seed/tar-6864cf0d9fe59a3b91cc4260

They have an image tokenizer unified with text, and they de-tokenize using either of two decoders (an LLM or a diffusion model).
The model itself is a full LLM (Qwen2); the tokenizer converts images into tokens 🤯