Model Card for MoshiVis
Model Details
Model Description
MoshiVis ([Project Page](https://kyutai.org/moshivis) | [arXiv](https://arxiv.org/abs/2503.15633)) is a perceptually augmented version of Moshi, giving it the ability to freely discuss images whilst maintaining its natural conversation style and low latency. To achieve this, Moshi has been extended with a visual backbone and a cross-attention mechanism that infuses visual information into the language model. To train MoshiVis, we add a small number of parameters (~200M) on top of a frozen Moshi backbone (text/speech modeling, ~7B parameters) and a PaliGemma2 vision encoder (image encoding, ~400M parameters).
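The adaptation mechanism can be pictured as a small gated cross-attention block inserted into the frozen backbone. The sketch below is illustrative only (module and dimension names are assumptions, not the actual MoshiVis code): hidden states of the speech/text stream attend to image tokens from the frozen vision encoder, and a zero-initialized gate keeps the frozen model's behavior intact at the start of training.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Illustrative adapter: text/speech hidden states attend to image tokens."""

    def __init__(self, d_model: int, d_image: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(
            embed_dim=d_model, num_heads=n_heads,
            kdim=d_image, vdim=d_image, batch_first=True,
        )
        # Scalar gate, zero-initialized: the adapter is a no-op before training starts.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, hidden: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model) from a frozen Moshi transformer layer
        # image_tokens: (batch, n_tokens, d_image) from the frozen vision encoder
        attended, _ = self.attn(self.norm(hidden), image_tokens, image_tokens)
        return hidden + torch.tanh(self.gate) * attended
```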
This model page contains the Moshika (female voice) model weights for the PyTorch backend of the MoshiVis repo, in `bfloat16`.
We provide the same model weights for other backends and quantization formats in the associated model collection.
- Developed by: Kyutai
- Model type: Multimodal speech+vision+text foundation model
- Language(s) (NLP): English
- License: CC-BY-4.0
- Finetuned from model: Moshika and PaliGemma2
Model Sources
- Project Page: https://kyutai.org/moshivis
- Preprint: https://arxiv.org/abs/2503.15633
- Repository: https://github.com/kyutai-labs/moshivis
- Demo: Talk to Moshi
Uses
Direct Use
Similar to Moshi itself, MoshiVis can be used as a conversational agent for casual conversations, basic facts and advice (e.g. recipes, trivia), roleplay, etc. In addition, MoshiVis is able to recognize and discuss images in a natural way, whilst still allowing for low-latency interactions.
Downstream Use
Since MoshiVis was designed to infuse visual signal into a frozen Moshi backbone with only a few trainable parameters, the model could be adapted to different downstream scenarios by further finetuning these parameters, for instance to work with a different off-the-shelf image encoder or on different visual domains.
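As a hedged illustration of that setup (the parameter-name filter below is a hypothetical convention, not the actual MoshiVis module naming), such finetuning would keep the backbone and vision encoder frozen and optimize only the adapter weights:

```python
import torch

def adapter_parameters(model: torch.nn.Module, keyword: str = "cross_attention"):
    """Freeze everything except parameters whose name matches the adapter keyword."""
    for name, param in model.named_parameters():
        param.requires_grad = keyword in name
        if param.requires_grad:
            yield param

# e.g. optimizer = torch.optim.AdamW(adapter_parameters(model), lr=1e-4)
```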
Out-of-Scope Use
The model is not intended to be used to impersonate other people or for any malicious use of any kind. This model is for research only, and we do not recommend using it to provide advice or to perform any professional duty.
Bias, Risks, and Limitations
MoshiVis has been designed to perceptually augment the original Moshi model with vision capabilities and is expected to inherit similar biases and limitations.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
How to Get Started with the Model
See our GitHub repository for getting started.
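The repository documents the actual entry points and commands; purely as an illustration (not the official workflow), the `bfloat16` PyTorch weights from this page can be fetched with `huggingface_hub` before following the repo instructions:

```python
from huggingface_hub import snapshot_download

# Download the bf16 PyTorch checkpoint from this model page into a local folder.
local_dir = snapshot_download("kyutai/moshika-vis-pytorch-bf16")
print(local_dir)
```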
Training Details
Stay tuned for our technical report, in which we will describe the training procedure in detail as well as report evaluation results.
Training Data
For information on the training data used for the base models, see PaliGemma2 and Moshi respectively. To train the cross-attention and gating mechanism that MoshiVis uses for processing images, we rely on a collection of publicly available datasets.
Technical Specifications
Compute Infrastructure
MoshiVis was designed as a relatively low-cost adaptation of Moshi (~200M extra trainable parameters) and was trained on a single DGX node with 8 H100 GPUs.
Software
Our training code was implemented in PyTorch. Our inference code is available for PyTorch, Rust, and MLX.
Citation
@article{kyutai2025moshivis,
author = {Amélie Royer and Moritz Böhle and Gabriel de Marmiesse and
Laurent Mazaré and Alexandre Défossez and Neil Zeghidour and Patrick Pérez},
year = {2025},
title = {Vision-Speech Models: Teaching Speech Models to Converse about Images},
journal = {ArXiv},
url = {https://arxiv.org/abs/2503.15633}
}
Model Card Authors and Contact
- Amélie Royer
- Moritz Böhle