Model Card for MoshiVis

Model Details

Model Description

MoshiVis (Project Page | arXiv) is a perceptually augmented version of Moshi, giving it the ability to freely discuss images whilst maintaining its natural conversation style and low latency. To achieve this, Moshi has been extended with a visual backbone and a cross-attention mechanism that infuses the visual information into the language model. To train MoshiVis, we add a small set of trainable parameters (~200M) on top of a frozen Moshi backbone (for the text/speech modeling, ~7B parameters) and a PaliGemma2 vision encoder (for the image encoding, ~400M parameters).
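
The cross-attention and gating design is described in the paper; purely as an illustration, such a gated cross-attention block could look like the following PyTorch sketch (all dimensions and module names below are assumptions made for this example, not the actual MoshiVis implementation):

```python
# Illustrative sketch of a gated cross-attention adapter; dimensions and module
# names are assumptions for this example, not the actual MoshiVis code.
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    def __init__(self, d_model: int = 4096, d_image: int = 1152, n_heads: int = 32):
        super().__init__()
        # Cross-attention: speech/text hidden states attend to image tokens.
        self.attn = nn.MultiheadAttention(
            d_model, n_heads, kdim=d_image, vdim=d_image, batch_first=True
        )
        self.norm = nn.LayerNorm(d_model)
        # Zero-initialized gate, so the frozen Moshi backbone is left unchanged
        # at the start of training.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, hidden: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model); image_tokens: (batch, n_tokens, d_image)
        attended, _ = self.attn(self.norm(hidden), image_tokens, image_tokens)
        return hidden + torch.tanh(self.gate) * attended
```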

This model page contains the Moshika (female voice) model weights for the PyTorch backend of the MoshiVis repo, in bfloat16. We provide the same model weights for other backends and quantization formats in the associated model collection.
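
As an example, the checkpoint files on this page can be fetched with the huggingface_hub client; actually running the model is done through the MoshiVis repository (see "How to Get Started with the Model" below):

```python
# Download the bf16 PyTorch checkpoint of Moshika-Vis from the Hugging Face Hub.
# This only fetches the files; the inference entry points live in the MoshiVis repo.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="kyutai/moshika-vis-pytorch-bf16")
print(f"Checkpoint downloaded to: {local_dir}")
```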

  • Developed by: Kyutai
  • Model type: Multimodal speech+vision+text foundation model
  • Language(s) (NLP): English
  • License: CC-BY-4.0
  • Finetuned from model: Moshika and PaliGemma2

Model Sources

  • Repository: the MoshiVis GitHub repo (see "How to Get Started with the Model" below)
  • Paper: https://arxiv.org/abs/2503.15633

Uses

Direct Use

Similar to Moshi itself, MoshiVis can be used as a conversational agent for casual conversations, basic facts and advice (e.g. recipes, trivia), roleplay, etc. In addition, MoshiVis is able to recognize and discuss images in a natural way, whilst still allowing for low-latency interactions.

Downstream Use

Since MoshiVis was designed to infuse visual signal into a frozen Moshi backbone with only a few trainable parameters, the model could be adapted to different downstream scenarios by further finetuning these parameters, for instance to adapt MoshiVis to a different off-the-shelf image encoder or to new visual domains.
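
As a rough sketch of what such an adaptation could look like in PyTorch (the attribute name `model.cross_attention` is a placeholder for this example and does not reflect the actual MoshiVis module layout):

```python
# Hypothetical finetuning setup: freeze everything, then train only the
# cross-attention adapter parameters (~200M). Attribute names are placeholders.
import torch

def setup_adapter_finetuning(model: torch.nn.Module, lr: float = 1e-4):
    for p in model.parameters():
        p.requires_grad_(False)
    adapter_params = list(model.cross_attention.parameters())  # placeholder name
    for p in adapter_params:
        p.requires_grad_(True)
    return torch.optim.AdamW(adapter_params, lr=lr)
```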

Out-of-Scope Use

The model is not intended to be used to impersonate other people or for any other malicious purpose. This model is for research only, and we do not recommend using it to provide advice or to perform any professional duty.

Bias, Risks, and Limitations

MoshiVis has been designed to perceptually augment the original Moshi model with vision capabilities and is expected to inherit similar biases and limitations.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.

How to Get Started with the Model

See our GitHub repository to get started.

Training Details

Stay tuned for our technical report, in which we will describe the training procedure in detail as well as report evaluation results.

Training Data

For information on the training data used for the base models, see PaliGemma2 and Moshi. To train the cross-attention and gating mechanism that MoshiVis uses for processing images, we rely on a collection of publicly available datasets.

Technical Specifications

Compute Infrastructure

MoshiVis was designed as a relatively low-cost adaptation of Moshi (~200M extra trainable parameters) and was trained on a single DGX node with 8 H100 GPUs.
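
One way to check such a trainable/frozen split on any PyTorch model instance is to count parameters by their `requires_grad` flag (the `model` variable below is assumed to be an already-constructed MoshiVis instance):

```python
# Report trainable (adapter) vs. frozen (backbone + image encoder) parameter counts.
def report_parameter_split(model) -> None:
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    frozen = sum(p.numel() for p in model.parameters() if not p.requires_grad)
    print(f"trainable: {trainable / 1e6:.0f}M, frozen: {frozen / 1e9:.2f}B")
```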

Software

Our training code was implemented in PyTorch. Our inference code is available for PyTorch, Rust, and MLX.

Citation

@article{kyutai2025moshivis,
  author = {Amélie Royer and Moritz Böhle and Gabriel de Marmiesse and
  Laurent Mazaré and Alexandre Défossez and Neil Zeghidour and Patrick Pérez},
  year = {2025},
  title = {Vision-Speech Models: Teaching Speech Models to Converse about Images},
  journal = {ArXiv},
  url = {https://arxiv.org/abs/2503.15633}
}

Model Card Authors and Contact

  • Amélie Royer
  • Moritz Böhle