---
license: cc-by-4.0
language:
- en
base_model:
- google/paligemma2-3b-pt-448
- kyutai/moshika-pytorch-bf16
---

# Model Card for MoshiVis


## Model Details

### Model Description

**MoshiVis** ([Project Page](https://kyutai.org/moshivis) | [arXiv](https://arxiv.org/abs/2503.15633)) is a perceptually augmented version of Moshi, giving it the ability to freely discuss images whilst maintaining its natural conversation style and low latency.
To achieve this, Moshi has been extended with a visual backbone and a cross-attention mechanism to infuse the visual information into the language model.
To train MoshiVis, we add a few parameters (~200M) on top of a frozen Moshi backbone (for the text/speech modeling aspect, ~7B params) 
and a PaliGemma2 vision encoder (for the image encoding part, ~400M parameters).

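As a purely schematic illustration of this gated cross-attention mechanism, here is a minimal PyTorch sketch with assumed dimensions and module names; it is not the actual MoshiVis implementation:

```python
import torch
import torch.nn as nn

class GatedCrossAttentionAdapter(nn.Module):
    """Illustrative gated cross-attention adapter.

    The frozen speech/text stream attends to image features from a vision
    encoder; a learned gate, initialised at zero, controls how much visual
    signal is injected, so the backbone is unchanged at initialisation.
    Dimensions below are placeholders, not the real MoshiVis sizes.
    """

    def __init__(self, d_model: int = 4096, d_visual: int = 1152, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=d_model, num_heads=n_heads,
            kdim=d_visual, vdim=d_visual, batch_first=True,
        )
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init gate

    def forward(self, text_states: torch.Tensor, image_features: torch.Tensor) -> torch.Tensor:
        attended, _ = self.cross_attn(
            query=self.norm(text_states), key=image_features, value=image_features
        )
        return text_states + torch.tanh(self.gate) * attended
```

Adapter modules along these lines carry the ~200M trainable parameters mentioned above, while the Moshi backbone itself stays frozen.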
This model page contains the `Moshika` (female voice) model weights for the `PyTorch` backend of the MoshiVis repo, in `bfloat16`.
We provide the same model weights for other backends and quantization formats in the associated model collection. 

- **Developed by:** Kyutai
- **Model type:** Multimodal speech+vision+text foundation model
- **Language(s) (NLP):** English
- **License:** CC-BY-4.0
- **Finetuned from model:** [Moshika](https://huggingface.co/kyutai/moshika-pytorch-bf16) and [PaliGemma2](https://huggingface.co/google/paligemma2-3b-pt-448)


### Model Sources 

- **Project Page:** [kyutai.org/moshivis](https://kyutai.org/moshivis)
- **Preprint:** [arXiv:2503.15633](https://arxiv.org/abs/2503.15633)
- **Repository:** [GitHub kyutai-labs/moshivis](https://github.com/kyutai-labs/moshivis)
- **Demo:** [Talk to Moshi](http://vis.moshi.chat)

## Uses

### Direct Use

Similar to Moshi itself, MoshiVis can be used as a conversational agent for casual conversations, basic facts and advice (e.g. recipes, trivia), roleplay, etc. 
In addition, MoshiVis is able to recognize and discuss images in a natural way, whilst still allowing for low-latency interactions.


### Downstream Use 

Since MoshiVis was designed to infuse visual signals into a frozen Moshi backbone with only a few trainable parameters,
the model could be adapted to different downstream scenarios by further finetuning these parameters:
for instance, adapting MoshiVis to a different off-the-shelf image encoder or to different visual domains. A rough sketch of this workflow is shown below.
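
As a rough sketch of that workflow (the parameter-name filter below is hypothetical and does not reflect the actual MoshiVis module names):

```python
from torch import nn

def select_adapter_parameters(model: nn.Module, adapter_tag: str = "cross_attn"):
    """Freeze all parameters except those whose name contains `adapter_tag`.

    The tag is a placeholder for whatever naming the adapter modules use;
    returns the list of parameters left trainable for the optimizer.
    """
    trainable = []
    for name, param in model.named_parameters():
        param.requires_grad = adapter_tag in name
        if param.requires_grad:
            trainable.append(param)
    return trainable

# Hypothetical usage with an already-loaded MoshiVis model:
#   trainable = select_adapter_parameters(moshivis_model)
#   optimizer = torch.optim.AdamW(trainable, lr=1e-4)
```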

### Out-of-Scope Use

The model is not intended to be used to impersonate other people, nor for any malicious use of any kind.
This model is for research only; we do not recommend using it to provide advice or to perform any professional duty.


## Bias, Risks, and Limitations

MoshiVis has been designed to perceptually augment the original [Moshi](https://huggingface.co/kyutai/moshika-pytorch-bf16)
model with vision capabilities and is expected to inherit similar biases and limitations. 


### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model; we have no further recommendations at this time.

## How to Get Started with the Model

See our [GitHub repository](https://github.com/kyutai-labs/moshivis) to get started.
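
For reference, the weights hosted on this page can be fetched programmatically before following the repository's instructions. This is a minimal sketch using `huggingface_hub`; the repository id below is an assumption and should be checked against this model page's URL.

```python
# Minimal sketch: download the weights, then follow the MoshiVis README to
# launch the PyTorch backend. The repo id is assumed; verify it first.
from huggingface_hub import snapshot_download

REPO_ID = "kyutai/moshika-vis-pytorch-bf16"  # assumed id of this model page
local_dir = snapshot_download(repo_id=REPO_ID)
print(f"Weights downloaded to: {local_dir}")
```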


## Training Details

Stay tuned for our technical report, in which we will describe the training procedure in detail as well as report evaluation results.

### Training Data

For information on the training data used for the base models, see [PaliGemma2](https://huggingface.co/google/paligemma2-3b-pt-448) and
[Moshi](https://huggingface.co/kyutai/moshika-pytorch-bf16) respectively.
To train the cross-attention and gating mechanism that MoshiVis uses for processing images, 
we rely on a collection of publicly available datasets, namely:
- [DOCCI](https://google.github.io/docci/)   
- [PixMo](https://huggingface.co/datasets/allenai/pixmo-cap)   
- [Pixelprose](https://arxiv.org/abs/2406.10328)   
- [TallyQA](https://arxiv.org/abs/1810.12440)   
- [OCR-VQA](https://ocr-vqa.github.io/)   
- [RenderedText](https://huggingface.co/datasets/wendlerc/RenderedText)   
- [DocVQA](https://arxiv.org/abs/2007.00398)    



## Technical Specifications 


### Compute Infrastructure

MoshiVis was designed as a relatively low-cost adaptation of Moshi (~200M extra trainable parameters)
and was trained on a single DGX node with 8 H100 GPUs.


#### Software

Our training code was implemented in PyTorch. Our inference code is available for PyTorch, Rust, and MLX.

## Citation

```
@article{kyutai2025moshivis,
  author = {Amélie Royer and Moritz Böhle and Gabriel de Marmiesse and
  Laurent Mazaré and Alexandre Défossez and Neil Zeghidour and Patrick Pérez},
  year = {2025},
  title = {Vision-Speech Models: Teaching Speech Models to Converse about Images},
  journal = {ArXiv},
  url = {https://arxiv.org/abs/2503.15633}
}
```




## Model Card Authors and Contact

  * Amélie Royer
  * Moritz Böhle