|
--- |
|
license: cc-by-4.0 |
|
language: |
|
- en |
|
base_model: |
|
- google/paligemma2-3b-pt-448 |
|
- kyutai/moshika-pytorch-bf16 |
|
--- |
|
|
|
# Model Card for MoshiVis |
|
|
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
**MoshiVis** ([Project Page](https://kyutai.org/moshivis) | [arXiv](https://arxiv.org/abs/2503.15633)) is a perceptually augmented version of Moshi, giving it the ability to freely discuss images whilst maintaining its natural conversation style and low latency. |
|
To achieve this, Moshi has been extended with a visual backbone and a cross-attention mechanism to infuse the visual information into the language model. |
|
To train MoshiVis, we add a relatively small number of trainable parameters (~200M) on top of a frozen Moshi backbone (for the text/speech modeling, ~7B parameters)

and a PaliGemma2 vision encoder (for the image encoding, ~400M parameters).
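
For intuition, here is a minimal PyTorch sketch of what such a gated cross-attention adapter could look like; the module name, dimensions, and gating function are illustrative assumptions for exposition, not the actual MoshiVis implementation (see the repository for the real code).

```python
# Illustrative sketch only: a gated cross-attention adapter of the kind
# described above. Module names, dimensions, and the gating function are
# assumptions, not the actual MoshiVis implementation.
import torch
import torch.nn as nn


class GatedCrossAttention(nn.Module):
    def __init__(self, d_model: int = 512, d_image: int = 1152, n_heads: int = 8):
        super().__init__()
        self.img_proj = nn.Linear(d_image, d_model)  # map image features to model width
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # The gate starts at zero, so the adapter is initially a no-op and the
        # behavior of the frozen Moshi backbone is preserved early in training.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, hidden: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # hidden:      (batch, seq, d_model) activations from the frozen backbone
        # image_feats: (batch, patches, d_image) from the frozen vision encoder
        kv = self.img_proj(image_feats)
        attn_out, _ = self.attn(self.norm(hidden), kv, kv, need_weights=False)
        return hidden + torch.tanh(self.gate) * attn_out


# Shape check with dummy tensors
block = GatedCrossAttention()
out = block(torch.randn(2, 16, 512), torch.randn(2, 256, 1152))
print(out.shape)  # torch.Size([2, 16, 512])
```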
|
|
|
This model page contains the `Moshika` (female voice) model weights for the `Pytorch` backend of the MoshiVis repo, in `bfloat16`. |
|
We provide the same model weights for other backends and quantization formats in the associated model collection. |
|
|
|
- **Developed by:** Kyutai |
|
- **Model type:** Multimodal speech+vision+text foundation model |
|
- **Language(s) (NLP):** English |
|
- **License:** CC-BY-4.0 |
|
- **Finetuned from model:** [Moshika](https://huggingface.co/kyutai/moshika-pytorch-bf16) and [PaliGemma2](https://huggingface.co/google/paligemma2-3b-pt-448) |
|
|
|
|
|
### Model Sources |
|
|
|
- **Project Page:** [kyutai.org/moshivis](https://kyutai.org/moshivis)

- **Preprint:** [arXiv:2503.15633](https://arxiv.org/abs/2503.15633)

- **Repository:** [GitHub kyutai-labs/moshivis](https://github.com/kyutai-labs/moshivis)
|
- **Demo:** [Talk to Moshi](http://vis.moshi.chat) |
|
|
|
## Uses |
|
|
|
### Direct Use |
|
|
|
Similar to Moshi itself, MoshiVis can be used as a conversational agent for casual conversations, basic facts and advice (e.g. recipes, trivia), roleplay, etc. |
|
In addition, MoshiVis is able to recognize and discuss images in a natural way, whilst still allowing for low-latency interactions. |
|
|
|
|
|
### Downstream Use |
|
|
|
Since MoshiVis was designed to infuse visual signals into a frozen Moshi backbone with only a few trainable parameters,

the model could be adapted to different downstream scenarios by further finetuning these parameters:

for instance, adapting MoshiVis to a different off-the-shelf image encoder or to different visual domains.
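
As a rough illustration of what such finetuning could look like in PyTorch, one would freeze everything except the adapter parameters and pass only those to the optimizer; the `"cross_attn"` name filter below is hypothetical and would need to match the module names actually used in the MoshiVis codebase.

```python
# Rough sketch of adapter-only finetuning (hypothetical parameter-name filter,
# not the actual MoshiVis API): freeze the speech backbone and vision encoder,
# and optimize only the cross-attention / gating parameters.
import torch


def select_adapter_parameters(model: torch.nn.Module, keyword: str = "cross_attn"):
    adapter_params = []
    for name, param in model.named_parameters():
        if keyword in name:
            param.requires_grad_(True)   # adapter weights stay trainable
            adapter_params.append(param)
        else:
            param.requires_grad_(False)  # frozen backbone / vision encoder
    return adapter_params


# Usage sketch, assuming `model` is a loaded MoshiVis-style model:
# optimizer = torch.optim.AdamW(select_adapter_parameters(model), lr=1e-4)
```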
|
|
|
### Out-of-Scope Use |
|
|
|
The model is not intended to be used to impersonate other people or for any other malicious use.

This model is for research only, and we do not recommend using it to provide advice or to perform any professional duty.
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
|
MoshiVis has been designed to perceptually augment the original [Moshi](https://huggingface.co/kyutai/moshika-pytorch-bf16)
|
model with vision capabilities and is expected to inherit similar biases and limitations. |
|
|
|
|
|
### Recommendations |
|
|
|
|
|
|
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.
|
|
|
## How to Get Started with the Model |
|
|
|
See our [GitHub repository](https://github.com/kyutai-labs/moshivis) to get started.
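
For reference, the checkpoint files on this page can also be fetched programmatically with `huggingface_hub`; the repository id below is a placeholder for this model page's id, and inference itself is run through the scripts documented in the GitHub repository.

```python
# Minimal sketch: download this page's bfloat16 PyTorch checkpoint locally.
# The repo id is a placeholder; replace it with the id of this model page.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="kyutai/<this-model-repo-id>")
print("Checkpoint files downloaded to:", local_dir)
```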
|
|
|
|
|
## Training Details |
|
|
|
Stay tuned for our technical report, in which we will describe the training procedure in detail as well as report evaluation results. |
|
|
|
### Training Data |
|
|
|
For information on the training data used for the base models, see [PaliGemma2](https://huggingface.co/google/paligemma2-3b-pt-448) and
|
[Moshi](https://huggingface.co/kyutai/moshika-pytorch-bf16) respectively. |
|
To train the cross-attention and gating mechanism that MoshiVis uses for processing images, |
|
we rely on a collection of publicly available datasets, namely: |
|
- [DOCCI](https://google.github.io/docci/) |
|
- [PixMo](https://huggingface.co/datasets/allenai/pixmo-cap) |
|
- [Pixelprose](https://arxiv.org/abs/2406.10328) |
|
- [TallyQA](https://arxiv.org/abs/1810.12440) |
|
- [OCR-VQA](https://ocr-vqa.github.io/) |
|
- [RenderedText](https://huggingface.co/datasets/wendlerc/RenderedText) |
|
- [DocVQA](https://arxiv.org/abs/2007.00398) |
|
|
|
|
|
|
|
## Technical Specifications |
|
|
|
|
|
### Compute Infrastructure |
|
|
|
MoshiVis was designed as a relatively low-cost adaptation of Moshi (~200M extra trainable parameters) |
|
and was trained on a single DGX node with 8 H100 GPUs. |
|
|
|
|
|
#### Software |
|
|
|
Our training code was implemented in PyTorch. Our inference code is available for PyTorch, Rust, and MLX.
|
|
|
## Citation |
|
|
|
``` |
|
@article{kyutai2025moshivis, |
|
author = {Amélie Royer and Moritz Böhle and Gabriel de Marmiesse and |
|
Laurent Mazaré and Alexandre Défossez and Neil Zeghidour and Patrick Pérez}, |
|
year = {2025}, |
|
title = {Vision-Speech Models: Teaching Speech Models to Converse about Images}, |
|
journal = {ArXiv}, |
|
url = {https://arxiv.org/abs/2503.15633} |
|
} |
|
``` |
|
|
|
|
|
|
|
|
|
## Model Card Authors and Contact |
|
|
|
* Amelie Royer |
|
* Moritz Boehle |