MoshiVis v0.1

kyutai's Collections

updated 25 days ago

MoshiVis is a Vision Speech Model, built as a perceptually augmented version of Moshi v0.1, for conversing about image inputs.

kyutai/Babillage

Viewer • Updated 25 days ago • 465k • 663 • 9

Note Evaluation benchmark for Vision Speech Models, based on COCO-Captions, OCR-VQA and VQAv2

kyutai/moshika-vis-pytorch-bf16

Updated 24 days ago • 56

Note `bfloat16` weights for the PyTorch backend

kyutai/moshika-vis-candle-bf16

Updated 28 days ago • 1

Note `bfloat16` weights for the Rust backend

kyutai/moshika-vis-candle-q8

Updated 28 days ago • 8.31k

Note 8-bit quantised weights for the Rust backend

kyutai/moshika-vis-mlx

Updated 28 days ago • 2

Note weights for the MLX backend (`bfloat16`, 8-bit and 4-bit quantised)
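
As a rough illustration of how the weight repositories above map to backends, here is a minimal Python sketch. The repository ids are copied from this collection; the `WEIGHTS` table, its backend and precision labels, and the helpers `repo_for` and `fetch` are hypothetical names introduced for this example only. The download step assumes the `huggingface_hub` package is installed and uses its `snapshot_download` function; it is not taken from the MoshiVis codebase.

```python
# Repository ids come from the collection above; the (backend, precision)
# keys are labels invented for this sketch.
WEIGHTS = {
    ("pytorch", "bf16"): "kyutai/moshika-vis-pytorch-bf16",
    ("rust", "bf16"): "kyutai/moshika-vis-candle-bf16",
    ("rust", "q8"): "kyutai/moshika-vis-candle-q8",
    ("mlx", "any"): "kyutai/moshika-vis-mlx",
}


def repo_for(backend, precision="bf16"):
    """Return the repository id listed for a backend and precision."""
    backend = backend.lower()
    if backend == "mlx":
        # A single MLX repository holds the bfloat16, 8-bit and 4-bit variants.
        return WEIGHTS[("mlx", "any")]
    try:
        return WEIGHTS[(backend, precision.lower())]
    except KeyError:
        raise ValueError(
            f"no weights listed for {backend!r} / {precision!r}"
        ) from None


def fetch(backend, precision="bf16"):
    """Download the repository snapshot and return its local path.

    Needs network access and `pip install huggingface_hub`.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_for(backend, precision))
```

Calling `fetch("rust", "q8")` would, under these assumptions, download the 8-bit quantised weights for the Rust backend into the local Hugging Face cache.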