---
license: cc-by-4.0
language:
- en
base_model:
- google/paligemma2-3b-pt-448
- kyutai/moshika-pytorch-bf16
---
# Model Card for MoshiVis
## Model Details
### Model Description
**MoshiVis** ([Project Page](https://kyutai.org/moshivis) | [arXiv](https://arxiv.org/abs/2503.15633)) is a perceptually augmented version of Moshi, giving it the ability to freely discuss images whilst maintaining its natural conversation style and low latency.
To achieve this, Moshi has been extended with a visual backbone and a cross-attention mechanism to infuse the visual information into the language model.
To train MoshiVis, we add a small number of trainable parameters (~200M) on top of a frozen Moshi backbone (~7B parameters, handling the text/speech modeling)
and a frozen PaliGemma2 vision encoder (~400M parameters, handling the image encoding).
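As a rough illustration of this design (a minimal sketch, not the actual MoshiVis implementation; all module names and dimensions below are hypothetical), a gated cross-attention adapter that injects projected image features into the hidden states of a frozen backbone could look as follows in PyTorch:

```python
# Illustrative sketch only (not the real MoshiVis code): a small trainable
# adapter that infuses image features into a frozen language-model stream
# via cross-attention and a learnable gate. Names/dimensions are hypothetical.
import torch
import torch.nn as nn


class GatedCrossAttentionAdapter(nn.Module):
    def __init__(self, d_model: int = 1024, d_image: int = 1152, n_heads: int = 8):
        super().__init__()
        # Project vision-encoder features into the language-model dimension.
        self.img_proj = nn.Linear(d_image, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Learnable gate initialised at zero.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, hidden: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # hidden:      (batch, seq_len, d_model)   from the frozen speech/text backbone
        # image_feats: (batch, n_tokens, d_image)  from the frozen vision encoder
        img = self.img_proj(image_feats)
        attn_out, _ = self.cross_attn(query=hidden, key=img, value=img)
        # Gated residual update: only the adapter parameters are trainable.
        return hidden + torch.tanh(self.gate) * attn_out
```

In this sketch, the gate starts at zero so that the adapted model initially behaves exactly like the frozen backbone and only the newly added parameters have to learn.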
This model page contains the `Moshika` (female voice) model weights for the `PyTorch` backend of the MoshiVis repo, in `bfloat16`.
We provide the same model weights for other backends and quantization formats in the associated model collection.
- **Developed by:** Kyutai
- **Model type:** Multimodal speech+vision+text foundation model
- **Language(s) (NLP):** English
- **License:** CC-BY-4.0
- **Finetuned from model:** [Moshika](https://huggingface.co/kyutai/moshika-pytorch-bf16) and [PaliGemma2](https://huggingface.co/google/paligemma2-3b-pt-448)
### Model Sources
- **Project Page:** [kyutai.org/moshivis](https://kyutai.org/moshivis)
- **Preprint:** [arXiv:2503.15633](https://arxiv.org/abs/2503.15633)
- **Repository:** [GitHub kyutai-labs/moshivis](https://github.com/kyutai-labs/moshivis)
- **Demo:** [Talk to Moshi](http://vis.moshi.chat)
## Uses
### Direct Use
Similar to Moshi itself, MoshiVis can be used as a conversational agent for casual conversations, basic facts and advice (e.g. recipes, trivia), roleplay, etc.
In addition, MoshiVis is able to recognize and discuss images in a natural way, whilst still allowing for low-latency interactions.
### Downstream Use
Since MoshiVis was designed to infuse visual signals into a frozen Moshi backbone with only a few trainable parameters,
the model could be adapted to different downstream scenarios by further finetuning these parameters:
for instance, adapting MoshiVis to a different off-the-shelf image encoder or to different visual domains, as sketched below.
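As a hypothetical sketch of such a finetuning setup (the `model` object and the `"cross_attn"` parameter naming are placeholders, not the real MoshiVis API), one would freeze the backbone and the vision encoder and only optimize the adapter parameters:

```python
# Hypothetical sketch: keep the speech/text backbone and the image encoder
# frozen, and only update the small set of adapter (cross-attention/gating)
# parameters. The keyword used to select them is a placeholder.
import torch


def trainable_adapter_params(model: torch.nn.Module, keyword: str = "cross_attn"):
    params = []
    for name, p in model.named_parameters():
        if keyword in name:
            p.requires_grad = True   # small trainable adapter
            params.append(p)
        else:
            p.requires_grad = False  # frozen backbone / vision encoder
    return params


# optimizer = torch.optim.AdamW(trainable_adapter_params(model), lr=1e-4)
```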
### Out-of-Scope Use
The model is not intended to be used to impersonate other people or for any malicious use of any kind.
This model is for research only, and we do not recommend using it to provide advice or to perform any professional duty.
## Bias, Risks, and Limitations
MoshiVis has been designed to perceptually augment the original [Moshi](https://huggingface.co/kyutai/moshika-pytorch-bf16)
model with vision capabilities and is expected to inherit similar biases and limitations.
### Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.
## How to Get Started with the Model
See our [GitHub repository](https://github.com/kyutai-labs/moshivis) to get started.
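For reference, a minimal sketch for fetching the weights from this page with `huggingface_hub` (the repository id below is a placeholder; use the id of this model page and follow the repository README for actual inference):

```python
# Hedged example: download the bfloat16 PyTorch weights before following the
# kyutai-labs/moshivis instructions. The repo_id below is a placeholder.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="kyutai/<this-model-repo>")
print("Model files downloaded to:", local_dir)
```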
## Training Details
Stay tuned for our technical report, in which we will describe the training procedure in detail as well as report evaluation results.
### Training Data
For information on the training data used for the base models, see [PaliGemma2](https://huggingface.co/google/paligemma2-3b-pt-448) and
[Moshi](https://huggingface.co/kyutai/moshika-pytorch-bf16) respectively.
To train the cross-attention and gating mechanism that MoshiVis uses for processing images,
we rely on a collection of publicly available datasets, namely:
- [DOCCI](https://google.github.io/docci/)
- [PixMo](https://huggingface.co/datasets/allenai/pixmo-cap)
- [Pixelprose](https://arxiv.org/abs/2406.10328)
- [TallyQA](https://arxiv.org/abs/1810.12440)
- [OCR-VQA](https://ocr-vqa.github.io/)
- [RenderedText](https://huggingface.co/datasets/wendlerc/RenderedText)
- [DocVQA](https://arxiv.org/abs/2007.00398)
## Technical Specifications
### Compute Infrastructure
MoshiVis was designed as a relatively low-cost adaptation of Moshi (~200M extra trainable parameters)
and was trained on a single DGX node with 8 H100 GPUs.
#### Software
Our training code was implemented in PyTorch. Our inference code is available for PyTorch, Rust, and MLX.
## Citation
```bibtex
@article{kyutai2025moshivis,
  author  = {Amélie Royer and Moritz Böhle and Gabriel de Marmiesse and
             Laurent Mazaré and Alexandre Défossez and Neil Zeghidour and Patrick Pérez},
  year    = {2025},
  title   = {Vision-Speech Models: Teaching Speech Models to Converse about Images},
  journal = {ArXiv},
  url     = {https://arxiv.org/abs/2503.15633}
}
```
## Model Card Authors and Contact
* Amelie Royer
* Moritz Boehle