kyutai/Babillage
MoshiVis is a Vision Speech Model built as a perceptually augmented version of Moshi v0.1 for conversing about image inputs.
Note Evaluation Benchmark for Vision Speech Models, based on COCO-Captions, OCR-VQA and VQAv2
Note `bfloat16` weights for the PyTorch backend
Note `bfloat16` weights for the Rust backend
Note 8-bit quantised weights for the Rust backend
Note weights for the MLX backend (`bfloat16`, 8-bit and 4-bit quantised)