huggingface-projects (Huggingface Projects)

merve

posted an update about 19 hours ago

Post

1671

Huge drops in open AI past week!
Find more models, datasets, demos here merve/releases-july-4-686bcc54ed7c45c341fbf654
Some of our picks 🫡
⏯️ BAAI/MTVCraft is a new Veo3-like text-to-video model, demo is here BAAI/MTVCraft
🧑🏻‍💻 apple/diffucoder-6868139f56672ae046fe04e8 is a new family of diffusion LLMs (7B base and instruct) for coding
🗣️ kyutai/tts-1.6b-en_fr is a new small TTS model for English and France
👀 aharley/alltracker is a new pixel tracking model by Stanford, demo is here aharley/alltracker
📖 racineai/OGC_MEGA_MultiDomain_DocRetrieval is a new large visual document retrieval dataset

1 reply

·

ThomasSimonini

updated a dataset about 23 hours ago

huggingface-projects/Deep-RL-Course-Certification

Viewer • Updated about 23 hours ago • 1.5k • 307 • 15

giadap

authored a paper 5 days ago

Can AI be Consentful?

Paper • 2507.01051 • Published 11 days ago

andito

posted an update 5 days ago

Post

3781

🧠👁️ Can AI visualize solutions?

Humans often solve visual problems by sketching ideas in our minds. What if Vision-Language Models (VLMs) could do something similar, not by generating full images, but by using internal “mental sketches”?

That’s the idea behind Mirage, a new framework that empowers VLMs to reason using latent visual tokens. Instead of just thinking in words, Mirage mixes in abstract visual representations that help the model solve complex tasks.

These aren't photorealistic images. They're compact, internal representations optimized purely to support reasoning.

🔧 Mirage is trained in two phases:

1) Grounding: It learns to produce latent tokens anchored in real images.
2) Refinement: The model drops the images and learns to generate visual tokens on its own.

📈 And yes, it works!
On challenging benchmarks like Visual Spatial Planning, Jigsaw puzzles, and Spatial Attention Tasks, Mirage clearly outperforms GPT-4o and other strong baselines.
Smart sketches > empty words.

By mimicking the way humans visualize solutions, Mirage gives AI a new kind of imagination, one that’s faster, more efficient, and more human-like.
Kudos to the teams at UMass Amherst and MIT behind this exciting work.
Check the paper: Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens (2506.17218)

2 replies

·

AdinaY

posted an update 5 days ago

Post

1884

The Chinese Open Source Heatmap is live 🔥
You can now track the companies/ research labs/ communities powering China’s open source AI movement.

zh-ai-community/model-release-heatmap-zh

Some highlights:

✨Giant Tech are investing more in open source.
-Alibaba: Full stack open ecosystem
-Tecent: Hunyuan image/video/3D
-Bytedance: Catching up fast in 2025
-Baidu: New player in open LLM

✨New players emerging post–DeepSeek moment.
-Xiaomi
-Red Note
-Bilibili
-MiniMax
-Moonshot AI

✨Startup list is shifting fast! Those who find a direction aligned with their strengths are the ones who endure.
-DeepSeek
-MiniMax
-StepFun
-Moonshot AI
-Zhipu AI
-OpenBMB

✨Research Lab & Community are making key contributions.
-BAAI
-Shanghai AI Lab
-OpenMOSS
-MAP

sergiopaniego

posted an update 5 days ago

Post

1843

Updated my HF Space for vibe testing smol VLMs on object detection, visual grounding, keypoint detection & counting! 👓

🆕 Compare Qwen2.5 VL 3B vs Moondream 2B side-by-side with annotated images & text outputs.

Try examples or test your own images! 🏃

📱Space: sergiopaniego/vlm_object_understanding

AdinaY

posted an update 6 days ago

Post

3274

🔥 June highlights from China’s open source ecosystem.

zh-ai-community/june-2025-open-works-from-the-chinese-community-683d66c188f782dc5570ba15

✨Baidu & MiniMax both launched open foundation models
- Baidu: Ernie 4.5 ( from 0.3B -424B ) 🤯
- MiniMax: MiniMax -M1 ( Hybrid MoE reasoning model )

✨Multimodal AI is moving from fusion to full-stack reasoning: unified Any-to-Any pipelines across text, vision, audio, and 3D
- Baidu: ERNIE-4.5-VL-424B
- Moonshot AI: Kimi-VL-A3B
- Alibaba: Ovis-U1
- BAAI: Video-XL-2/OmniGen2
- AntGroup: Ming-Lite-Omni
- Chinese Academy of Science: Stream-Omni
- Bytedance: SeedVR2-3B
- Tencent: Hunyuan 3D 2.1/ SongGeneration
- FishAudio: Openaudio-s1-mini

✨Domain specific models are rapidly emerging
- Alibaba DAMO: Lingshu-7B (medical MLLM)
- BAAI: RoboBrain (Robotics)

✨ So many small models!
- OpenBMB: MiciCPM4 ( on device )
- Qwen: Embedding/Reranker (0.6B)
- Alibaba: Ovis-U1-3B
- Moonshot AI: Kimi-VL-A3B
- Bytedance: SeedVR2-3B

merve

posted an update 6 days ago

Post

832

SOOOO MANY MODEL RELEASES 😍
Here's some picks from past week 🤗

> ByteDance/XVerse is a new identity preserving image generation model 🖼️
> google/gemma-3n-E4B-it, any-to-text model supported by transformers 🤗
> nvidia/llama-nemoretriever-colembed-3b-v1 two new state-of-the-art visual document retrievers 📑
> New version of Dia TTS model is up nari-labs/Dia-1.6B-0626
> Black Forest Labs releases Kontext benchmark black-forest-labs/kontext-bench

Find more here merve/releases-june-27-6864e8eb17f7e3a8b444083c

AdinaY

posted an update 6 days ago

Post

281

MTVCraft 🔥 Veo3 style Audio-Video model by BAAI

Model:
BAAI/MTVCraft
Demo:
BAAI/MTVCraft

✨ Text > [Speech + SFX + BGM] > Synchronized Video
✨ Built with Qwen3 + ElevenLabs + MTV

merve

posted an update 6 days ago

Post

2935

visual reasoning is now in transformers 🔥
THUDM/GLM-4.1V-9B-Thinking is just released and merged into transformers, we gave it a vibe test run 🤠

it's very good, comes with 64k context length and MIT license 😍
it supports 4k image tokens and any aspect ratio as well!
Notebook: http://colab.research.google.com/drive/1atODIiV57hOZLv16Bjzwd6fwx0yoTorj?usp=sharing
Demo: THUDM/GLM-4.1V-9B-Thinking-Demo

AdinaY

posted an update 6 days ago

Post

2265

GLM-4.1V-Thinking 🔥 New open vision reasoning model by Zhipu AI

THUDM/glm-41v-thinking-6862bbfc44593a8601c2578d

✨ 9B base & Thinking - MIT license
✨ CoT + RL with Curriculum Sampling
✨ 64k context, 4K image, any aspect ratio
✨ Support English & Chinese
✨ Outperforms GPT 4O -2024/11/20

sergiopaniego

posted an update 8 days ago

Post

989

📣 CALL FOR CONTRIBUTORS! 📣

Following last week’s full release of Gemma 3n, we launched a dedicated recipes repo to explore and share use cases. We already added some! 🧑‍🍳

Now we’re inviting the community to contribute and showcase how these models shine! ✨

Let them cook.

Check it out: https://github.com/huggingface/huggingface-gemma-recipes/issues/4

1 reply

·

merve

posted an update 8 days ago

Post

2483

so many multimodal releases these days 🤠
> ERNIE-4.5-VL: new vision language MoE models by Baidu https://huggingface.co/models?search=ernie-4.5-vl
> new visual document retrievers by NVIDIA (sota on ViDoRe!) nvidia/llama-nemoretriever-colembed-3b-v1 nvidia/llama-nemoretriever-colembed-1b-v1
> Ovis-3b: new image-text in image-text out models by Alibaba ⤵️ https://huggingface.co/spaces/AIDC-AI/Ovis-U1-

AdinaY

posted an update 8 days ago

Post

959

Pangu Pro MoE 🔥 Huawei's first open model!

Paper:
Pangu Pro MoE: Mixture of Grouped Experts for Efficient Sparsity (2505.21411)
Model:
https://gitcode.com/ascend-tribe/pangu-pro-moe-model

✨ MoGE: Mixture of Grouped Experts
✨ 16B activated params - 48 layers
✨ Trained on 15T tokens
✨ Natively optimized for Ascend hardware

1 reply

·

AdinaY

posted an update 8 days ago

Post

322

Baidu kept its promise, releasing 10 open models on the very last day of June🚀 Let's meet ERNIE 4.5 🔥

baidu/ernie-45-6861cd4c9be84540645f35c9

✨ From 0.3B to 424B total params
✨ Includes 47B & 3B active param MoE models + a 0.3B dense model
✨ Apache 2.0
✨ 128K context length
✨ Text+Vision co-training with ViT & UPO

sergiopaniego

in huggingface-projects/gemma-3n-E4B-it 11 days ago

Add video example

🔥 1

#2 opened 11 days ago by

sergiopaniego

AdinaY

posted an update 11 days ago

Post

3080

Hunyuan-A13B 🔥 New MoE LLM by TencentHunyuan

tencent/Hunyuan-A13B-Instruct

✨80B total / 13B active params
✨256K context window
✨Dual-mode reasoning: fast & slow thinking
✨Efficient inference (GQA + quantization)

hysts

updated a Space 11 days ago

93

Gemma 3n E4B It

⚡

Describe images, videos, and audio

hysts

published a Space 12 days ago

93

Gemma 3n E4B It

⚡

Describe images, videos, and audio

merve

posted an update 12 days ago

Post

573

Dataset Viewer for PDFs just landed on Hugging Face 📖🤗 you can now preview all the PDFs easier than before!

on top of this, there's PdfFolder format to load the PDF datasets quicker 💨
> to use it, your dataset should follow a directory format like folder/train/doc1.pdf, folder/train/doc1.pdf
> if you want to include bounding boxes, labels etc. you can keep them in a metadata.csv file in the same folder 🤝

read document dataset docs https://huggingface.co/docs/datasets/main/en/document_dataset
check all the document datasets here https://huggingface.co/datasets?modality=modality:document&sort=trending 📖

1 reply

·

Huggingface Projects

AI & ML interests

Recent Activity

huggingface-projects/Deep-RL-Course-Certification

Can AI be Consentful?

Add video example

Gemma 3n E4B It

Gemma 3n E4B It

AI & ML interests

Recent Activity

Team members 25

huggingface-projects's activity

Add video example

Gemma 3n E4B It

Gemma 3n E4B It