Fine-tune Gemma3n on videos with audio tracks on a Colab A100 🔥 Just dropped the notebook where you can learn how to fine-tune Gemma3n on images + audio + text at the same time!
Keep in mind, it's made for educational purposes. We use LoRA, audio resampling & video frame downsampling to fit training under 40 GB of VRAM. Stretch modalities and unfreeze layers as you wish! Notebook: merve/smol-vision
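For the curious, here is roughly what the LoRA + preprocessing setup looks like. This is a minimal sketch, not the actual notebook: it assumes a recent transformers release with Gemma 3n support, and the checkpoint id, target module names, and 16 kHz audio rate are my assumptions.

```python
import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForImageTextToText
from peft import LoraConfig, get_peft_model

model_id = "google/gemma-3n-E2B-it"  # assumed checkpoint name, swap in the one you use

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision keeps the memory footprint low
    device_map="auto",
)

# Attach LoRA adapters to the attention projections; the base weights stay frozen,
# so only a small fraction of parameters is actually trained.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed module names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

def resample_audio(waveform, orig_sr, target_sr=16_000):
    # Resampling audio down to the encoder's expected rate keeps token sequences short.
    return torchaudio.functional.resample(waveform, orig_sr, target_sr)
```

Training only the adapters while the base model stays frozen, plus the audio/video downsampling, is what keeps the run inside the 40 GB budget.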
multimodal
> new moondream (VLM) is out: a 4-bit quantized (with QAT) version of moondream-2b that runs on 2.5 GB VRAM at 184 tps with only a 0.6% drop in accuracy (OS)
> ByteDance released BAGEL-7B, an omni model that understands and generates both image + text. They also released Dolphin, a document parsing VLM (OS)
> Google DeepMind dropped MedGemma at I/O, a VLM that can interpret medical scans, and Gemma 3n, an omni model with competitive LLM performance
> MMaDa is a new 8B diffusion language model that can generate both images and text
Just completed the AI Agents course and wow, that capstone project really makes you understand how to build agents that can handle real-world complexity!
The final project uses the GAIA dataset: your agent has to solve tasks like analyzing Excel files, processing audio recordings, answering questions about YouTube videos, and diving into research papers. These aren't toy examples; it's the messy, multimodal stuff agents need to handle in practice.
Whether you're just getting started with agents or want to go deeper with tools like LangChain, LlamaIndex, and SmolAgents, this course has tons of useful stuff. A few key insights:
- Code agents are incredibly versatile once you get the architecture right (see the sketch below)
- The sweet spot is finding the right balance of guidance vs autonomy for each use case
- Once the logic clicks, the possibilities really are endless - it's like letting LLMs break free from the chatbox
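To make the code-agent point concrete, here is a minimal sketch of the kind of agent the course has you build with smolagents. It is not the course solution: the model id, tool list, and the spreadsheet task are assumptions, and recent smolagents versions name the Hugging Face inference wrapper InferenceClientModel.

```python
from smolagents import CodeAgent, DuckDuckGoSearchTool, InferenceClientModel

# Placeholder model; GAIA-style tasks also need tools for spreadsheets,
# audio transcription, YouTube transcripts, and document parsing.
model = InferenceClientModel(model_id="Qwen/Qwen2.5-Coder-32B-Instruct")

agent = CodeAgent(
    tools=[DuckDuckGoSearchTool()],
    model=model,
    additional_authorized_imports=["pandas"],  # lets generated code open Excel/CSV files
)

# Hypothetical task: the agent writes and executes Python to answer the question.
answer = agent.run(
    "Open 'sales_q1.xlsx' and report the total of the 'revenue' column."
)
print(answer)
```

The agent answers by writing and running code rather than emitting a fixed tool-call schema, which is what makes code agents so flexible once the tool set is in place.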
The course is free and the certification deadline is July 1st, 2025.
ByteDance also released Seed1.5-VL, a vision-language model for general-purpose multimodal reasoning. It's not open-source, but the paper and demo are available here.