MoshiVis v0.1 Collection MoshiVis is a Vision Speech Model built as a perceptually-augmented version of Moshi v0.1 for conversing about image inputs • 8 items • Updated 9 days ago • 16
Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese Paper • 2408.12480 • Published Aug 22, 2024 • 23
🍓 Ichigo v0.5 Collection The experimental family designed to train LLMs to understand sound natively. • 1 item • Updated Dec 29, 2024 • 3
Article Powerful ASR + diarization + speculative decoding with Hugging Face Inference Endpoints May 1, 2024 • 74
Molmo Collection Artifacts for open multimodal language models. • 5 items • Updated 17 days ago • 299