Model Card for MoshiVis
Model Details
Model Description
MoshiVis ([Project Page](https://kyutai.org/moshivis) | [arXiv](https://arxiv.org/abs/2503.15633)) is a perceptually augmented version of Moshi, giving it the ability to freely discuss images whilst maintaining its natural conversation style and low latency. To achieve this, Moshi has been extended with a visual backbone and a cross-attention mechanism that infuses visual information into the language model. To train MoshiVis, we add a small number of parameters (~200M) on top of a frozen Moshi backbone (text/speech modeling, ~7B parameters) and a PaliGemma2 vision encoder (image encoding, ~400M parameters).
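The adaptation mechanism can be pictured as a small gated cross-attention block inserted into the frozen backbone. The sketch below is illustrative only (module and dimension names are assumptions, not the actual MoshiVis code): hidden states of the speech/text stream attend to image tokens from the frozen vision encoder, and a zero-initialized gate keeps the frozen model's behavior intact at the start of training.

```python
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Illustrative adapter: text/speech hidden states attend to image tokens."""

    def __init__(self, d_model: int, d_image: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(
            embed_dim=d_model, num_heads=n_heads,
            kdim=d_image, vdim=d_image, batch_first=True,
        )
        # Scalar gate, zero-initialized: the adapter is a no-op before training starts.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, hidden: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model) from a frozen Moshi transformer layer
        # image_tokens: (batch, n_tokens, d_image) from the frozen vision encoder
        attended, _ = self.attn(self.norm(hidden), image_tokens, image_tokens)
        return hidden + torch.tanh(self.gate) * attended
```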
This model page contains the Moshika (female voice) model weights for the PyTorch backend of the MoshiVis repo, in `bfloat16`.
We provide the same model weights for other backends and quantization formats in the associated model collection.
- Developed by: Kyutai
- Model type: Multimodal speech+vision+text foundation model
- Language(s) (NLP): English
- License: CC-BY-4.0
- Finetuned from model: Moshika and PaliGemma2
Model Sources
- Project Page: https://kyutai.org/moshivis
- Preprint: https://arxiv.org/abs/2503.15633
- Repository: https://github.com/kyutai-labs/moshivis
- Demo: Talk to Moshi
Uses
Direct Use
Similar to Moshi itself, MoshiVis can be used as a conversational agent for casual conversations, basic facts and advice (e.g. recipes, trivia), roleplay, etc. In addition, MoshiVis is able to recognize and discuss images in a natural way, whilst still allowing for low-latency interactions.
Downstream Use
Since MoshiVis was designed to infuse visual signal into a frozen Moshi backbone with only a few trainable parameters, the model could be adapted to different downstream scenarios by further finetuning these parameters, for instance to work with a different off-the-shelf image encoder or on different visual domains.
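As a hedged illustration of that setup (the parameter-name filter below is a hypothetical convention, not the actual MoshiVis module naming), such finetuning would keep the backbone and vision encoder frozen and optimize only the adapter weights:

```python
import torch

def adapter_parameters(model: torch.nn.Module, keyword: str = "cross_attention"):
    """Freeze everything except parameters whose name matches the adapter keyword."""
    for name, param in model.named_parameters():
        param.requires_grad = keyword in name
        if param.requires_grad:
            yield param

# e.g. optimizer = torch.optim.AdamW(adapter_parameters(model), lr=1e-4)
```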
Out-of-Scope Use
The model is not intended to be used to impersonate other people or for any malicious use of any kind. This model is for research only, and we do not recommend using it to provide advice or to perform any professional duty.
Bias, Risks, and Limitations
MoshiVis has been designed to perceptually augment the original Moshi model with vision capabilities and is expected to inherit similar biases and limitations.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
How to Get Started with the Model
See our GitHub repository for getting started.
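The repository documents the actual entry points and commands; purely as an illustration (not the official workflow), the `bfloat16` PyTorch weights from this page can be fetched with `huggingface_hub` before following the repo instructions:

```python
from huggingface_hub import snapshot_download

# Download the bf16 PyTorch checkpoint from this model page into a local folder.
local_dir = snapshot_download("kyutai/moshika-vis-pytorch-bf16")
print(local_dir)
```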
Training Details
Stay tuned for our technical report, in which we will describe the training procedure in detail as well as report evaluation results.
Training Data
For information on the training data used for the base models, see PaliGemma2 and Moshi respectively. To train the cross-attention and gating mechanism that MoshiVis uses for processing images, we rely on a collection of publicly available datasets.
Technical Specifications
Compute Infrastructure
MoshiVis was designed as a relatively low-cost adaptation of Moshi (~200M extra trainable parameters) and was trained on a single DGX node with 8 H100 GPUs.
Software
Our training code was implemented in PyTorch. Our inference code is available for PyTorch, Rust, and MLX.
Citation
@article{kyutai2025moshivis,
author = {Amélie Royer and Moritz Böhle and Gabriel de Marmiesse and
Laurent Mazaré and Alexandre Défossez and Neil Zeghidour and Patrick Pérez},
year = {2025},
title = {Vision-Speech Models: Teaching Speech Models to Converse about Images},
journal = {ArXiv},
url = {https://arxiv.org/abs/2503.15633}
}
Model Card Authors and Contact
- Amélie Royer
- Moritz Böhle