Qwen2.5-Omni is soooo good that people build multimodal reasoning models off of it 🥹
> KE-Team/Ke-Omni-R-3B is an open-source audio reasoning model, SOTA on benchmark averages, based on Qwen/Qwen2.5-Omni-3B 🗣️
> Haoz0206/Omni-R1 is a video reasoning model with pixel-level grounding (see below), and it's super competitive ⏯️ based on Qwen/Qwen2.5-Omni-7B
NEW: Real-time conversational AI models can now run 100% locally in your browser! 🤯
🔐 Privacy by design (no data leaves your device)
💰 Completely free... forever
📦 Zero installation required, just visit a website
⚡️ Blazingly fast WebGPU-accelerated inference
For those interested, here's how it works:
- Silero VAD for voice activity detection
- Whisper for speech recognition
- SmolLM2-1.7B for text generation
- Kokoro for text-to-speech
Powered by Transformers.js and ONNX Runtime Web! 🤗 I hope you like it!
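For the curious, the speech-recognition and text-generation legs of that chain map almost directly onto Transformers.js pipelines. A minimal sketch, assuming Transformers.js v3 and illustrative checkpoint ids (the demo's exact configuration may differ; Silero VAD gates when this runs, and Kokoro ships as a separate library for the spoken reply):

```ts
// Minimal sketch of the speech loop with Transformers.js v3.
// Checkpoints and options below are illustrative assumptions.
import { pipeline } from "@huggingface/transformers";

const asr = await pipeline(
  "automatic-speech-recognition",
  "onnx-community/whisper-base",         // assumed Whisper checkpoint
  { device: "webgpu" },                  // WebGPU-accelerated inference
);
const llm = await pipeline(
  "text-generation",
  "HuggingFaceTB/SmolLM2-1.7B-Instruct",
  { device: "webgpu" },
);

// One conversational turn: microphone audio in, reply text out.
async function respond(audio: Float32Array): Promise<string> {
  const { text } = (await asr(audio)) as { text: string };
  const out: any = await llm(
    [{ role: "user", content: text }],
    { max_new_tokens: 128 },
  );
  return out[0].generated_text.at(-1).content; // last message = assistant reply
}
```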
✨ 3 models: 7B / 32B / Mix-3-32B (MIT license)
✨ Dataset: 35 verifiable logic tasks (Sudoku, Cipher, Arrow Maze, etc.)
✨ RL training with auto-verifiable rewards
✨ Generalizes to math without explicit math training
✨ +6 pts on BBEH, +9.5 on KOR-Bench vs. baselines
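"Auto-verifiable" here means the reward comes from a deterministic program check rather than a learned judge: the model's answer to a logic puzzle either passes the rule check or it doesn't. A minimal sketch of what such a reward could look like for the Sudoku task (the grid encoding and 0/1 reward values are illustrative assumptions, not the release's actual implementation):

```ts
// Illustrative auto-verifiable reward: a completed Sudoku grid either passes
// a deterministic rule check (reward 1) or fails it (reward 0).
function sudokuReward(grid: number[][]): number {
  const ok = (cells: number[]) =>
    cells.length === 9 &&
    new Set(cells).size === 9 &&
    cells.every(v => v >= 1 && v <= 9);

  for (let i = 0; i < 9; i++) {
    const row = grid[i];
    const col = grid.map(r => r[i]);
    // i also indexes the nine 3x3 boxes.
    const boxRow = Math.floor(i / 3) * 3, boxCol = (i % 3) * 3;
    const box: number[] = [];
    for (let r = 0; r < 3; r++)
      for (let c = 0; c < 3; c++) box.push(grid[boxRow + r][boxCol + c]);
    if (!ok(row) || !ok(col) || !ok(box)) return 0; // verifiably wrong
  }
  return 1; // verifiably correct -> positive RL reward
}
```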
✨ Apache 2.0
✨ Handles 10,000+ frames on a single GPU
✨ 2048-frame encoding in just 12s
✨ Efficient chunk-based prefilling & bi-granularity KV decoding
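Chunk-based prefilling generally means encoding the frame sequence a slice at a time, with each new chunk attending to the keys/values already cached, so peak memory tracks the chunk size rather than the full 10,000-frame video. A schematic sketch under that assumption (the types, the 512-frame chunk size, and the encode signature are all illustrative, not the model's real API):

```ts
// Schematic of chunk-based prefilling: encode a long frame sequence chunk by
// chunk, appending to a KV cache that is never recomputed.
type KV = { keys: Float32Array; values: Float32Array };

function prefillInChunks(
  frames: Float32Array[],
  // Hypothetical encoder: each chunk attends to itself plus the cache so far.
  encodeChunk: (chunk: Float32Array[], cache: KV[]) => KV,
  chunkSize = 512,
): KV[] {
  const cache: KV[] = [];
  for (let start = 0; start < frames.length; start += chunkSize) {
    const chunk = frames.slice(start, start + chunkSize);
    cache.push(encodeChunk(chunk, cache)); // append, never re-encode old frames
  }
  return cache;
}
```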
vision LMs are saturating benchmarks, so we built vibe eval 💬
> compare different models with refreshed in-the-wild examples in different categories 🤠
> submit your favorite model for eval
no numbers -- just vibes!
🔥 New benchmark & dataset for Subject-to-Video generation
OpenS2V-Nexus by Peking University
✨ Fine-grained evaluation for subject consistency: BestWishYsh/OpenS2V-Eval
✨ 5M-scale dataset: BestWishYsh/OpenS2V-5M
✨ New metrics: automatic scores for identity, realism, and text match
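Identity scores of this kind are commonly computed as cosine similarity between an embedding of the reference subject and embeddings of each generated frame, averaged over the clip. A hedged sketch of that pattern (how OpenS2V-Eval actually derives its identity score may differ):

```ts
// Sketch of an identity-consistency style score: mean cosine similarity
// between a reference subject embedding and per-frame embeddings.
// The embedding source is an assumption, not OpenS2V-Eval's exact method.
function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function identityScore(ref: Float32Array, frames: Float32Array[]): number {
  const sims = frames.map(f => cosine(ref, f));
  return sims.reduce((s, x) => s + x, 0) / sims.length; // mean, in [-1, 1]
}
```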
✨ Emotion-controlled, high-dynamic avatar videos
✨ Multi-character support with separate audio control
✨ Works with any style: cartoon, 3D, real face, while keeping identity consistent