AI & ML interests

computer-vision, image-processing, machine-learning, deep-learning

Recent Activity

kornia's activity

merve 
posted an update about 14 hours ago
merve 
posted an update 1 day ago
view post
Post
1050
Past week was insanely packed for open AI! 😱
Luckily we picked some highlights for you ❤️ lfg!

💬 LLMs/VLMs
> Deepseek 🐳 released deepseek-ai/DeepSeek-R1-0528, 38B model, only 0.2 and 1.4 points behind o3 in AIME 24/25 🤯 they also released an 8B distilled version based on Qwen3 (OS) deepseek-ai/deepseek-r1-678e1e131c0169c0bc89728d
> Xiaomi released MiMo-7B-RL (LLM for code and math) and MiMo-VL-7B-RL (VLM for visual reasoning, GUI agentic task and general use) (OS) 😍 XiaomiMiMo/mimo-vl-68382ccacc7c2875500cd212
> NVIDIA released , new reasoning model nvidia/Nemotron-Research-Reasoning-Qwen-1.5B
> DS: MiniMax released https://huggingface.co/MiniMaxAI/SynLogic, new 49k logical reasoning examples across 35 tasks including solving cipher, sudoku and more!

🖼️ Image/Video Generation
> tencent released tencent/HunyuanPortrait, a new model for consistent portrait generation with SVD Research license. They also released tencent/HunyuanVideo-Avatar, audio driven avatar generation (OS)
> showlab released showlab/OmniConsistency, consistent stylization model (OS)
> Rapidata/text-2-video-human-preferences-veo3 is a new T2V preference dataset based on videos from Veo3 with 46k examples (OS)

Audio🗣️
> https://huggingface.co/ResembleAI/Chatterbox is a new 500M text-to-speech model preferred more than ElevenLabs (OS) 😍
> PlayHT/PlayDiffusion is a new speech editing model (OS)

Other
> https://huggingface.co/NX-AI/TiReX is a new time series foundation model
> Yandex released a huge (4.79B examples!) video recommendation dataset https://huggingface.co/yandex/yambda

OS ones have Apache2.0 or MIT licenses, find more models and datasets here merve/releases-30-may-6840097345e0b1e915bff843
merve 
posted an update 2 days ago
view post
Post
994
Yesterday was the day of vision language action models (VLAs)!

> SmolVLA: open-source small VLA for robotics by Hugging Face LeRobot team 🤖
Blog: https://huggingface.co/blog/smolvla
Model: lerobot/smolvla_base

> Holo-1: 3B & 7B web/computer use agentic VLAs by H Company 💻
Model family: Hcompany/holo1-683dd1eece7eb077b96d0cbd
Demo: https://huggingface.co/spaces/multimodalart/Holo1
Blog: https://huggingface.co/blog/Hcompany/holo1
super exciting times!!
merve 
posted an update 3 days ago
merve 
posted an update 4 days ago
merve 
posted an update 5 days ago
prithivMLmods 
posted an update 6 days ago
view post
Post
4714
OpenAI, Google, Hugging Face, and Anthropic have released guides and courses on building agents, prompting techniques, scaling AI use cases, and more. Below are 10+ minimalistic guides and courses that may help you in your progress. 📖

⤷ Agents Companion : https://www.kaggle.com/whitepaper-agent-companion
⤷ Building Effective Agents : https://www.anthropic.com/engineering/building-effective-agents
⤷ Guide to building agents by OpenAI : https://cdn.openai.com/business-guides-and-resources/a-practical-guide-to-building-agents.pdf
⤷ Prompt engineering by Google : https://www.kaggle.com/whitepaper-prompt-engineering
⤷ Google: 601 real-world gen AI use cases : https://cloud.google.com/transform/101-real-world-generative-ai-use-cases-from-industry-leaders
⤷ Prompt engineering by IBM : https://www.ibm.com/think/topics/prompt-engineering-guide
⤷ Prompt Engineering by Anthropic : https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering/overview
⤷ Scaling AI use cases : https://cdn.openai.com/business-guides-and-resources/identifying-and-scaling-ai-use-cases.pdf
⤷ Prompting Guide 101 : https://services.google.com/fh/files/misc/gemini-for-google-workspace-prompting-guide-101.pdf
⤷ AI in the Enterprise by OpenAI : https://cdn.openai.com/business-guides-and-resources/ai-in-the-enterprise.pdf

by HF🤗 :
⤷ AI Agents Course by Huggingface : https://huggingface.co/learn/agents-course/unit0/introduction
⤷ Smol-agents Docs : https://huggingface.co/docs/smolagents/en/tutorials/building_good_agents
⤷ MCP Course by Huggingface : https://huggingface.co/learn/mcp-course/unit0/introduction
⤷ Other Course (LLM, Computer Vision, Deep RL, Audio, Diffusion, Cookbooks, etc..) : https://huggingface.co/learn
  • 2 replies
·
merve 
posted an update 7 days ago
view post
Post
1939
HOT: MiMo-VL new 7B vision LMs by Xiaomi surpassing gpt-4o (Mar), competitive in GUI agentic + reasoning tasks ❤️‍🔥 XiaomiMiMo/mimo-vl-68382ccacc7c2875500cd212

not only that, but also MIT license & usable with transformers 🔥
prithivMLmods 
posted an update 7 days ago
view post
Post
2137
Just made a demo for Cosmos-Reason1, a physical AI model that understands physical common sense and generates appropriate embodied decisions in natural language through long chain-of-thought reasoning. Also added video understanding support to it. 🤗🚀

✦ Try the demo here : prithivMLmods/DocScope-R1

⤷ Cosmos-Reason1-7B : nvidia/Cosmos-Reason1-7B
⤷ docscopeOCR-7B-050425-exp : prithivMLmods/docscopeOCR-7B-050425-exp
⤷ Captioner-Relaxed : Ertugrul/Qwen2.5-VL-7B-Captioner-Relaxed

⤷ Multimodal Implementations : prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

⤷ GitHub :
https://github.com/PRITHIVSAKTHIUR/Cosmos-x-DocScope
https://github.com/PRITHIVSAKTHIUR/Nvidia-Cosmos-Reason1-Demo.

To know more about it, visit the model card of the respective model. !!
merve 
posted an update 8 days ago
view post
Post
2686
introducing: VLM vibe eval 🪭 visionLMsftw/VLMVibeEval

vision LMs are saturated over benchmarks, so we built vibe eval 💬

> compare different models with refreshed in-the-wild examples in different categories 🤠
> submit your favorite model for eval
no numbers -- just vibes!
AtAndDev 
posted an update 8 days ago
view post
Post
2663
deepseek-ai/DeepSeek-R1-0528

This is the end
  • 1 reply
·
merve 
posted an update 10 days ago
view post
Post
2532
emerging trend: models that can understand image + text and generate image + text

don't miss out ⤵️
> MMaDA: single 8B diffusion model aligned with CoT (reasoning!) + UniGRPO Gen-Verse/MMaDA
> BAGEL: 7B MoT model based on Qwen2.5, SigLIP-so-400M, Flux VAE ByteDance-Seed/BAGEL
both by ByteDance! 😱

I keep track of all any input → any output models here https://huggingface.co/collections/merve/any-to-any-models-6822042ee8eb7fb5e38f9b62
  • 1 reply
·
merve 
posted an update 11 days ago
view post
Post
3108
what happened in open AI past week? so many vision LM & omni releases 🔥 merve/releases-23-may-68343cb970bbc359f9b5fb05

multimodal 💬🖼️
> new moondream (VLM) is out: it's 4-bit quantized (with QAT) version of moondream-2b, runs on 2.5GB VRAM at 184 tps with only 0.6% drop in accuracy (OS) 🌚
> ByteDance released BAGEL-7B, an omni model that understands and generates both image + text. they also released Dolphin, a document parsing VLM 🐬 (OS)
> Google DeepMind dropped MedGemma in I/O, VLM that can interpret medical scans, and Gemma 3n, an omni model with competitive LLM performance

> MMaDa is a new 8B diffusion language model that can generate image and text



LLMs
> Mistral released Devstral, a 24B coding assistant (OS) 👩🏻‍💻
> Fairy R1-32B is a new reasoning model -- distilled version of DeepSeek-R1-Distill-Qwen-32B (OS)
> NVIDIA released ACEReason-Nemotron-14B, new 14B math and code reasoning model
> sarvam-m is a new Indic LM with hybrid thinking mode, based on Mistral Small (OS)
> samhitika-0.0.1 is a new Sanskrit corpus (BookCorpus translated with Gemma3-27B)

image generation 🎨
> MTVCrafter is a new human motion animation generator
  • 1 reply
·
merve 
posted an update 15 days ago
view post
Post
2582
Google released MedGemma on I/O'25 👏 google/medgemma-release-680aade845f90bec6a3f60c4

> 4B and 27B instruction fine-tuned vision LMs and a 4B pre-trained vision LM for medicine
> available with transformers from the get-go 🤗

they also released a cool demo for scan reading ➡️ google/rad_explain

use with transformers ⤵️
  • 1 reply
·
merve 
posted an update 15 days ago
view post
Post
3096
Bu post'u çevirebilirsiniz 🤗💗
·
merve 
posted an update 15 days ago
view post
Post
2373
tis the year of any-to-any/omni models 🤠
ByteDance-Seed/BAGEL-7B-MoT 7B native multimodal model that understands and generates both image + text

it outperforms leading VLMs like Qwen 2.5-VL 👏 and has Apache 2.0 license 😱
prithivMLmods 
posted an update 16 days ago
view post
Post
2281
Got access to Google's all-new Gemini Diffusion a state-of-the-art text diffusion model. It delivers the performance of Gemini 2.0 Flash-Lite at 5x the speed, generating over 1000 tokens in a fraction of a second and producing impressive results. Below are some initial outputs generated using the model. ♊🔥

Gemini Diffusion Playground ✦ : https://deepmind.google.com/frontiers/gemini-diffusion

Get Access Here : https://docs.google.com/forms/d/1aLm6J13tAkq4v4qwGR3z35W2qWy7mHiiA0wGEpecooo/viewform?edit_requested=true

🔗 To know more, visit: https://deepmind.google/models/gemini-diffusion/
  • 1 reply
·
merve 
posted an update 17 days ago
view post
Post
1713
NVIDIA released new vision reasoning model for robotics: Cosmos-Reason1-7B 🤖 nvidia/cosmos-reason1-67c9e926206426008f1da1b7

> first reasoning model for robotics
> based on Qwen 2.5-VL-7B, use with Hugging Face transformers or vLLM 🤗
> comes with SFT & alignment datasets and a new benchmark 👏