Activity Feed

Recent Activity

merve 
posted an update about 12 hours ago
🤯 241B VLM with an Apache-2.0 license: internlm/Intern-S1

internlm released Intern-S1: a multimodal reasoning model built on the 235B-parameter Qwen3 MoE and the 6B InternViT vision encoder 😍

benchmarks look great (👑 best model ✅ best open model)
sergiopaniego 
posted an update 5 days ago
Yet Another New Multimodal Fine-Tuning Recipe 🥧

🧑‍🍳 In this Hugging Face Cookbook notebook, we demonstrate how to align a multimodal model (VLM) with Mixed Preference Optimization (MPO) using trl.

💡 This recipe is powered by the new MPO support in trl, enabled through a recent upgrade to the DPO trainer!

We align the multimodal model using multiple optimization objectives (losses), guided by a preference dataset (chosen vs. rejected multimodal pairs).

Check it out! ➡️ https://huggingface.co/learn/cookbook/fine_tuning_vlm_mpo
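
For a sense of what the MPO setup looks like in code, here's a minimal sketch against trl's DPO trainer; the model and dataset below are illustrative placeholders rather than the cookbook's exact choices, and the loss-type/weight combination follows the MPO usage documented for trl's DPOTrainer:

```python
# Hedged sketch: MPO in trl = DPOTrainer with a weighted mix of loss types.
# Model/dataset are placeholder choices; see the cookbook for the real recipe.
from datasets import load_dataset
from transformers import AutoModelForImageTextToText, AutoProcessor
from trl import DPOConfig, DPOTrainer

model_id = "Qwen/Qwen2-VL-2B-Instruct"  # assumption: any trl-supported VLM
model = AutoModelForImageTextToText.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# A preference dataset of chosen vs. rejected multimodal pairs.
dataset = load_dataset("HuggingFaceH4/rlaif-v_formatted", split="train[:1%]")

# MPO mixes several objectives: preference (sigmoid), quality (bco_pair),
# and generation (sft), each with its own weight.
args = DPOConfig(
    output_dir="vlm-mpo",
    loss_type=["sigmoid", "bco_pair", "sft"],
    loss_weights=[0.8, 0.2, 1.0],
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=processor,  # the processor handles both text and images
)
trainer.train()
```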
merve 
posted an update 5 days ago
so many open LLMs and image LoRAs dropped this past week, here are some picks for you 🫡 merve/releases-july-18-687e3fbd2ab9b39c51f9238b

LLMs
> ByteDance released a bunch of translation models called Seed-X-RM (7B) ByteDance-Seed/Seed-X-RM-7B
> NVIDIA released reasoning models, of which the 32B surpasses the giant Qwen3-235B, under a CC-BY-4.0 license 👏 nvidia/openreasoning-nemotron-687730dae0170059860f1f01
> LG released a new EXAONE model (32B) LGAI-EXAONE/EXAONE-4.0-32B

VLMs/any-to-any
> vidore/colqwen-omni-v0.1 is a new any-to-any retriever (MIT)
> HiDream-ai/HiDream-E1-1 is an image+text-in, image+text-out model (MIT)

LoRAs
> There's a bunch of LoRAs based on Flux Kontext, gotta check out the collection 🤠
AtAndDev 
posted an update 6 days ago
Qwen 3 Coder is a personal attack on K2, and I love it.
It achieves near-SOTA on LCB (LiveCodeBench) while not having reasoning.
Finally people are understanding that reasoning isn't necessary for high benchmarks...

Qwen ftw!

DECENTRALIZE DECENTRALIZE DECENTRALIZE
MaziyarPanahi 
posted an update 7 days ago
🧬 Breaking news in Clinical AI: Introducing the OpenMed NER Model Discovery App on Hugging Face 🔬

OpenMed is back! 🔥 Finding the right biomedical NER model just became as precise as a PCR assay!

I'm thrilled to unveil my comprehensive OpenMed Named Entity Recognition Model Discovery App that puts 384 specialized biomedical AI models at your fingertips.

🎯 Why This Matters in Healthcare AI:
Traditional clinical text mining required hours of manual model evaluation. My Discovery App instantly connects researchers, clinicians, and data scientists with the exact NER models they need for their biomedical entity extraction tasks.

🔬 What You Can Discover:
✅ Pharmacological Models - Extract "chemical compounds", "drug interactions", and "pharmaceutical" entities from clinical notes
✅ Genomics & Proteomics - Identify "DNA sequences", "RNA transcripts", "gene variants", "protein complexes", and "cell lines"
✅ Pathology & Disease Detection - Recognize "pathological formations", "cancer types", and "disease entities" in medical literature
✅ Anatomical Recognition - Map "anatomical systems", "tissue types", "organ structures", and "cellular components"
✅ Clinical Entity Extraction - Detect "organism species", "amino acids", "protein families", and "multi-tissue structures"

💡 Advanced Features:
🔍 Intelligent Entity Search - Find models by specific biomedical entities (e.g., "Show me models detecting CHEM + DNA + Protein")
🏥 Domain-Specific Filtering - Browse by Oncology, Pharmacology, Genomics, Pathology, Hematology, and more
📊 Model Architecture Insights - Compare BERT, RoBERTa, and DeBERTa implementations
⚡ Real-Time Search - Auto-filtering as you type, no search buttons needed
🎨 Clinical-Grade UI - Beautiful, intuitive interface designed for medical professionals

Ready to revolutionize your biomedical NLP pipeline?

🔗 Try it now: OpenMed/openmed-ner-models
🧬 Built with: Gradio, Transformers, Advanced Entity Mapping
sergiopaniego 
posted an update 10 days ago
🧑‍🍳 New Multimodal Fine-Tuning Recipe 🧑‍🍳

⚡️ In this new Hugging Face Cookbook recipe, I walk you through the process of fine-tuning a Vision Language Model (VLM) for object detection with visual grounding, using TRL.

🔍 Object detection typically involves detecting categories in images (e.g., vase).

By combining it with visual grounding, we add contextual understanding so instead of detecting just "vase", we can detect "middle vase" in an image.

VLMs are super powerful!

In this case, I use PaliGemma 2 which already supports object detection and extend it to also add visual grounding.

🤗 Check it out here: https://huggingface.co/learn/cookbook/fine_tuning_vlm_object_detection_grounding
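
To make the grounding idea concrete, here's a hedged inference sketch using PaliGemma's detection prompt format; the checkpoint and image file are placeholders, and a base checkpoint only handles plain categories, while the fine-tune in the recipe is what makes grounded phrases like "middle vase" work:

```python
# Hedged sketch of PaliGemma-style detection prompting; not the recipe's code.
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma2-3b-pt-448"  # placeholder PaliGemma 2 checkpoint
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("vases.jpg")   # placeholder image
prompt = "detect middle vase"     # plain detection would be "detect vase"

inputs = processor(text=prompt, images=image, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=32)

# PaliGemma emits bounding boxes as <locYYYY> location tokens followed by
# the label, e.g. "<loc0123><loc0456><loc0789><loc0987> middle vase".
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:],
                       skip_special_tokens=True))
```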
sergiopaniego 
posted an update 11 days ago
Multiple NEW notebooks and scripts added to the Hugging Face Gemma recipes repo!

Thanks to the community 🫶, we're adding more and more recipes using Gemma 💎

Fine-tuning for all modalities, function calling, RAG...

Repo: https://github.com/huggingface/huggingface-gemma-recipes

We're also open to new ideas from the community 🤗!
merve 
posted an update 12 days ago
Fine-tune Gemma3n on videos with audio inside, on a Colab A100 🔥
Just dropped the notebook where you can learn how to fine-tune Gemma3n on images+audio+text at the same time!

keep in mind, it's made for educational purposes 🫡 we do LoRA, audio resampling & video downsampling to be able to train in <40 GB VRAM

stretch modalities and unfreeze layers as you wish! 🙏🏻 merve/smol-vision
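
If you're curious what the LoRA part of such a setup looks like, here's a minimal sketch with peft; the rank, target modules, and even the checkpoint name are illustrative assumptions, not the notebook's exact values:

```python
# Hedged sketch of a LoRA setup for Gemma3n; all values are illustrative.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained("google/gemma-3n-E2B-it")

lora_config = LoraConfig(
    r=16,             # adapter rank (assumption)
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights trains
```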
merve 
posted an update 14 days ago
past week had huuuge releases 💗
here's our picks 🔥 find more models, datasets, demos here merve/releases-july-11-68750452c358c98b0fa663f7

> moonshotai/Kimi-K2-Instruct is the new SOTA LLM with 1T total / 32B active parameters 🤯

> HuggingFaceTB/SmolLM3-3B is the new best LM for its size; it offers a thinking mode 💭 and ships with the dataset HuggingFaceTB/smoltalk2

> Alibaba-NLP/WebSailor-3B is the new agentic LLM for complex browsing

> Google DeepMind released medical vision LMs with an agentic doctor-patient app google/medgemma-release-680aade845f90bec6a3f60c4

> fal released a LoRA to improve details on face images fal/Realism-Detailer-Kontext-Dev-LoRA
sergiopaniego 
posted an update 19 days ago
Test SmolLM3, the newest fully open model released by @HuggingFaceTB!

It's smol (3B), multilingual (6 languages), comes with dual-mode reasoning (think/no_think) and supports long context (128k).

Try it now in the notebook below!! ⬇️

Colab notebook: https://colab.research.google.com/github/sergiopaniego/samples/blob/main/smollm3_3b_inference.ipynb
GitHub notebook: https://github.com/sergiopaniego/samples/blob/main/smollm3_3b_inference.ipynb
blog: https://huggingface.co/blog/smollm3
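
As a quick taste of the dual-mode behavior, here's a hedged sketch of switching between the think/no_think modes through the system prompt; the /no_think flag usage follows the modes mentioned above, and the blog documents the full interface:

```python
# Hedged sketch: SmolLM3 inference with extended thinking switched off via
# the /no_think system flag (use /think to get a reasoning trace instead).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [
    {"role": "system", "content": "/no_think"},  # direct answers, no <think>
    {"role": "user", "content": "Explain KV caching in one paragraph."},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```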
merve 
posted an update 20 days ago
GitHub has been refusing to render notebooks for a long time now 💔

so smol-vision now lives in a Hugging Face model repository 🤗 merve/smol-vision
merve 
posted an update 20 days ago
ByteDance released Tar 1.5B and 7B: image-text-in, image-text-out models, fully open-source 👏 ByteDance-Seed/tar-6864cf0d9fe59a3b91cc4260

They have an image tokenizer unified with text, and they de-tokenize using either of two decoders (an LLM or a diffusion model).
The model itself is a full LLM (Qwen2); the tokenizer converts images into tokens 🤯