
AV LLMs
A collection of Audio, Video and Visual LLMs.
- Text-to-Speech • Updated • 438
- 1.04k
OpenVoice
dataautogpt3/ProteusV0.3
Text-to-Image • Updated • 143k • 93
ByteDance/SDXL-Lightning
Text-to-Image • Updated • 128k • 2k
openai/whisper-large-v3
Automatic Speech Recognition • Updated • 3.98M • 4.14k
stabilityai/TripoSR
Image-to-3D • Updated • 33.9k • 525
Efficient-Large-Model/VILA-7b
Text Generation • Updated • 312 • 26
google/paligemma-3b-pt-896
Image-Text-to-Text • Updated • 128k • 117
microsoft/Phi-3-vision-128k-instruct
Text Generation • Updated • 94k • 956
stabilityai/stable-audio-open-1.0
Text-to-Audio • Updated • 22.9k • 1.09k
OpenVLA: An Open-Source Vision-Language-Action Model
Paper • 2406.09246 • Published • 37
aiola/whisper-medusa-v1
Updated • 40 • 178
merve/idefics3llama-vqav2
Updated • 8
black-forest-labs/FLUX.1-schnell
Text-to-Image • Updated • 1.51M • 3.49k
- 113
Llama3.1 S V0.2 Checkpoint 2024 08 20
Convert text to audio and vice versa
gpt-omni/mini-omni
Text-to-Speech • Updated • 423
fishaudio/fish-speech-1.4
Text-to-Speech • Updated • 1.15k • 450
- 170
Tonic's GOT OCR
GOT - OCR (from: UCAS, Beijing)
stepfun-ai/GOT-OCR2_0
Image-Text-to-Text • Updated • 80.6k • 1.41k
apple/coreml-sam2-large
Mask Generation • Updated • 40 • 25
coreml-projects/sam-2-studio
Updated • 23
mistralai/Pixtral-12B-2409
Image-Text-to-Text • Updated • 619
allenai/Molmo-72B-0924
Image-Text-to-Text • Updated • 4.42k • 282
openai/whisper-large-v3-turbo
Automatic Speech Recognition • Updated • 9.56M • 2.09k
Revai/reverb-asr
Automatic Speech Recognition • Updated • 9 • 82
- 358
GOT Online
Extract text from images using various OCR modes
facebook/vfusion3d
Image-to-3D • Updated • 168 • 66
facebook/cotracker
Updated • 776 • 35
rhymes-ai/Aria
Image-Text-to-Text • Updated • 22.4k • 618
SWivid/F5-TTS
Text-to-Speech • Updated • 951k • 935
- 65
Ichigo Llama3.1 S Instruct
Generate text from audio recordings
kyutai/moshiko-mlx-q4
Updated • 874 • 28
kyutai/moshiko-mlx-q8
Updated • 219 • 5
- 101
Open VLM Video Leaderboard
VLMEvalKit Eval Results in video understanding benchmark
jimmycarter/LibreFLUX
Text-to-Image • Updated • 112 • 161
microsoft/OmniParser
Image-Text-to-Text • Updated • 2.57k • 1.64k
- 281
Aya Models
Interact with the Aya family of models.
CohereForAI/aya-expanse-32b
Text Generation • Updated • 112k • 232
stabilityai/stable-diffusion-3.5-medium
Text-to-Image • Updated • 165k • 635
OuteAI/OuteTTS-0.1-350M
Text-to-Speech • Updated • 4.32k • 300
vidore/colpali
Visual Document Retrieval • Updated • 17.5k • 425
vidore/colpali-v1.2
Visual Document Retrieval • Updated • 69.4k • 105
si-pbc/hertz-dev
Audio-to-Audio • Updated • 211
- 38
Talk To Ultravox
Talk to Fixie.ai's Ultravox with WebRTC
LLaVA-o1: Let Vision Language Models Reason Step-by-Step
Paper • 2411.10440 • Published • 114
Xkev/Llama-3.2V-11B-cot
Image-Text-to-Text • Updated • 4.19k • 147
google/paligemma-3b-pt-224
Image-Text-to-Text • Updated • 30.5k • 303
apple/coreml-mobileclip
Updated • 275 • 41
InstantX/InstantIR
Image-to-Image • Updated • 1 • 166
- 81
InstantIR
Diffusion-based Image Restoration model
- 141
Flux IP Adapter
Prompt with Images in flux[dev]
- 38
Image Preferences - Argilla annotation space
A community project to create an image preferences dataset.
fishaudio/fish-speech-1.5
Text-to-Speech • Updated • 14.2k • 494
meta-llama/Llama-3.3-70B-Instruct
Text Generation • Updated • 740k • 2.11k
- 44
Paligemma2 Vqav2
PaliGemma2 LoRA finetuned on VQAv2
VisionZip: Longer is Better but Not Necessary in Vision Language Models
Paper • 2412.04467 • Published • 107
fancyfeast/llama-joycaption-alpha-two-hf-llava
Updated • 30.2k • 156
taohu/mask
Updated • 5
[MASK] is All You Need
Paper • 2412.06787 • Published • 3
- 655
Open VLM Leaderboard
VLMEvalKit Evaluation Results Collection
microsoft/LLM2CLIP-Llama3.2-1B-EVA02-L-14-336
Zero-Shot Image Classification • Updated • 10
LLM2CLIP: Powerful Language Model Unlock Richer Visual Representation
Paper • 2411.04997 • Published • 37
Generative Powers of Ten
Paper • 2312.02149 • Published • 8
- 62
StoryStar
Fantasy story generator
GoodiesHere/Apollo-LMMs-Apollo-7B-t32
Video-Text-to-Text • Updated • 530 • 53
Apollo: An Exploration of Video Understanding in Large Multimodal Models
Paper • 2412.10360 • Published • 140
Qwen/Qwen2-VL-7B-Instruct
Image-Text-to-Text • Updated • 1.42M • 1.15k
XiaoduoAILab/Xmodel_VLM
Text Generation • Updated • 347 • 12
nvidia/Cosmos-1.0-Diffusion-14B-Text2World
Updated • 85.8k • 49
nvidia/Cosmos-1.0-Autoregressive-12B
Updated • 523 • 29
nvidia/Cosmos-1.0-Autoregressive-13B-Video2World
Updated • 507 • 31
nvidia/Cosmos-1.0-Diffusion-7B-Text2World
Updated • 177k • 208
nvidia/Cosmos-1.0-Diffusion-14B-Video2World
Updated • 56.5k • 52
- 387
Stable Point-Aware 3D
Create 3D models from images
hexgrad/Kokoro-82M
Text-to-Speech • Updated • 1.54M • 3.6k
- 2.21k
Kokoro TTS
Upgraded to v1.0!
openbmb/MiniCPM-o-2_6
Any-to-Any • Updated • 413k • 1.03k
- 311
TTS Spaces Arena
Blind vote on HF TTS models!
google/paligemma2-10b-pt-896
Image-Text-to-Text • Updated • 1.35k • 32
NovaSky-AI/Sky-T1-32B-Preview
Text Generation • Updated • 5.22k • 539
MiniMaxAI/MiniMax-VL-01
Image-Text-to-Text • Updated • 465 • 245
- 53
SmolVLM
Generate descriptions from images and text prompts
HKUSTAudio/Llasa-3B
Text-to-Speech • Updated • 3.66k • 469
HuggingFaceTB/SmolVLM-500M-Instruct
Image-Text-to-Text • Updated • 26.8k • 111
deepseek-ai/Janus-Pro-7B
Any-to-Any • Updated • 372k • 3.2k
- 250
Kokoro TTS Zero
[With v1.0.0] Accelerated TTS on Kokoro-82M
kyutai/hibiki-2b-mlx-bf16
Translation • Updated • 260 • 15
kyutai/hibiki-2b-pytorch-bf16
Translation • Updated • 380 • 48
ARTPARK-IISc/Vaani
Viewer • Updated • 9.72M • 2.15k • 43
Zyphra/Zonos-v0.1-hybrid
Text-to-Speech • Updated • 58.1k • 1.03k
Zyphra/Zonos-v0.1-transformer
Text-to-Speech • Updated • 201k • 379
microsoft/OmniParser-v2.0
Image-Text-to-Text • Updated • 8.78k • 1.12k
- 87
Paligemma2 Mix
Generate text or segment objects from an image
google/paligemma2-3b-mix-448
Image-Text-to-Text • Updated • 10.4k • 39
google/paligemma2-3b-mix-224
Image-Text-to-Text • Updated • 5.91k • 22
google/paligemma2-28b-mix-224
Image-Text-to-Text • Updated • 89 • 4
google/paligemma2-28b-mix-448
Image-Text-to-Text • Updated • 725 • 25
google/paligemma2-10b-mix-224
Image-Text-to-Text • Updated • 1.08k • 7
google/paligemma2-10b-mix-448
Image-Text-to-Text • Updated • 9.65k • 22
stepfun-ai/stepvideo-t2v
Text-to-Video • Updated • 1.82k • 416
stepfun-ai/stepvideo-t2v-turbo
Updated • 83
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
Paper • 2502.10248 • Published • 51
HuggingFaceTB/SmolVLM2-2.2B-Instruct
Image-Text-to-Text • Updated • 429k • 106
nvidia/canary-1b
Automatic Speech Recognition • Updated • 21.5k • 375
Wan-AI/Wan2.1-I2V-14B-720P
Image-to-Video • Updated • 59.1k • 341
fastrtc/kokoro-onnx
Updated • 8
- 2
Fastphone
Download and run an app from a Hugging Face repository
microsoft/Phi-4-multimodal-instruct
Automatic Speech Recognition • Updated • 231k • 1.04k
microsoft/Magma-8B
Image-Text-to-Text • Updated • 10.5k • 320
- 19
Magma UI
Magma-8B model for UI Agents
- 308
DiffRhythm
Blazingly Fast and Embarrassingly Simple Song Generation
DiffRhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion
Paper • 2503.01183 • Published • 26
ASLP-lab/DiffRhythm-vae
Updated • 30
ASLP-lab/DiffRhythm-base
Updated • 108
Large Language Diffusion Models
Paper • 2502.09992 • Published • 99
GSAI-ML/LLaDA-8B-Instruct
Text Generation • Updated • 18.6k • 192