Update README.md #1
by mboehle - opened
- README.md +44 -73
- model.safetensors +0 -3
- tokenizer-e351c8d8-checkpoint125.safetensors +0 -3
- tokenizer_spm_32k_3.model +0 -3
README.md CHANGED
@@ -1,128 +1,99 @@

Old version:

---
license:
language:
- en
base_model:
- google/paligemma2-3b-pt-448
- kyutai/moshika-pytorch-bf16
---

## Model Details

### Model Description

MoshiVis is a perceptually augmented version of Moshi, giving it the ability to freely discuss images whilst maintaining its natural conversation style and low latency.
To achieve this, Moshi has been extended with a visual backbone and a cross-attention mechanism to infuse the visual information into the language model.
To train MoshiVis, we add a few parameters (~200M) on top of a frozen Moshi backbone (for the text/speech modeling aspect, ~7B params)
and a PaliGemma2 vision encoder (for the image encoding part, ~400M parameters).

This model page contains the `Moshika` (female voice) model weights for the `PyTorch` backend of the MoshiVis repo, in `bfloat16`.
We provide the same model weights for other backends and quantization formats in the associated model collection.

- **Developed by:** Kyutai
- **Model type:** Multimodal speech+vision+text foundation model
- **Language(s) (NLP):** English
- **License:**
- **Finetuned from model:** [Moshika](https://huggingface.co/kyutai/moshika-pytorch-bf16) and [PaliGemma2](https://huggingface.co/google/paligemma2-3b-pt-448)

### Model Sources

- **Repository:** [Github kyutai-labs/moshivis](https://github.com/kyutai-labs/moshivis)
- **Demo:** [Talk to Moshi](http://vis.moshi.chat)

## Uses

### Direct Use

Similar to Moshi itself, MoshiVis can be used as a conversational agent for casual conversations, basic facts and advice (e.g. recipes, trivia), roleplay, etc.
In addition, MoshiVis is able to recognize and discuss images in a natural way, whilst still allowing for low-latency interactions.

### Downstream Use

Since MoshiVis was designed to infuse visual signals into a frozen Moshi backbone with only a few trainable parameters,
the model could be adapted to different downstream scenarios by further finetuning these parameters:
for instance, adapting MoshiVis to a different off-the-shelf image encoder or to different visual domains.

### Out-of-Scope Use

The model is not intended to be used to impersonate other people or for any malicious use of any kind.
This model is for research only and we do not recommend it for providing advice or performing any professional duty.

## Bias, Risks, and Limitations

MoshiVis perceptually augments the original Moshi model with vision capabilities and is expected to inherit similar biases and limitations.

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

## How to Get Started with the Model

See the [Github repository](https://github.com/kyutai-labs/moshivis) for how to get started.

## Training Details

For information on the training data used for the base models, see [Pixtral](https://mistral.ai/news/pixtral-12b/) and
[Moshi](https://huggingface.co/kyutai/moshika-pytorch-bf16) respectively.
To train the cross-attention and gating mechanism that MoshiVis uses for processing images,
we rely on a collection of publicly available datasets, namely:
- [DOCCI](https://google.github.io/docci/)
- [PixMo](https://huggingface.co/datasets/allenai/pixmo-cap)
- [Pixelprose](https://arxiv.org/abs/2406.10328)
- [TallyQA](https://arxiv.org/abs/1810.12440)
- [OCR-VQA](https://ocr-vqa.github.io/)
- [RenderedText](https://huggingface.co/datasets/wendlerc/RenderedText)
- [DocVQA](https://arxiv.org/abs/2007.00398)

## Technical Specifications

### Compute Infrastructure

MoshiVis was designed as a relatively low-cost adaptation of Moshi (~200M extra trainable parameters)
and was trained on a single DGX node with 8 H100 GPUs.

## Citation

```
@article{kyutai2025moshivis,
  author  = {Amélie Royer and Moritz Böhle and Gabriel de Marmiesse and
             Laurent Mazaré and Alexandre Défossez and Neil Zeghidour and Patrick Pérez},
  year    = {2025},
  title   = {Vision-Speech Models: Teaching Speech Models to Converse about Images},
  journal = {ArXiv},
  url     = {https://arxiv.org/abs/2503.15633}
}
```

## Model Card Authors

* Moritz Boehle
New version:

---
license: apache-2.0
language:
- en
base_model:
- kyutai/moshika-pytorch-bf16
- mistralai/Pixtral-12B-2409
- mistral-community/pixtral-12b
---

# Model Card for Moshika Vision

<!-- Provide a quick summary of what the model is/does. -->

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->
MoshiVis is a perceptually augmented version of Moshi, giving it the ability to freely discuss images whilst maintaining its natural conversation style and low latency.
To achieve this, Moshi has been extended with a visual backbone and a cross-attention mechanism to infuse the visual information into the language model.

- **Developed by:** Kyutai
- **Model type:** Multimodal speech+vision+text foundation model
- **Language(s) (NLP):** English
- **License:** Apache License 2.0
- **Finetuned from model:** [Moshika](https://huggingface.co/kyutai/moshika-pytorch-bf16) and [Pixtral](https://huggingface.co/mistral-community/pixtral-12b)

### Model Sources [optional]

<!-- Provide the basic links for the model. -->

- **Repository:** [Github kyutai-labs/moshivis](https://github.com/kyutai-labs/moshivis) <!-- TODO: Update / check link -->
- **Demo [optional]:** [moshi.chat](https://moshi.chat/)

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
Similar to Moshi itself, MoshiVis can be used as a conversational agent for casual conversations, basic facts and advice (e.g. recipes, trivia), roleplay, etc.
In addition, MoshiVis is able to recognize and discuss images in a natural way, whilst still allowing for low-latency interactions.

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
The model is not intended to be used to impersonate other people or for any malicious use of any kind.
This model is for research only and we do not recommend it for providing advice or performing any professional duty.

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->
MoshiVis has been designed to perceptually augment the original Moshi model with vision capabilities and is expected to inherit similar biases and limitations; see also [Moshika](https://huggingface.co/kyutai/moshika-pytorch-bf16).
Our analysis of how much MoshiVis diverges from the original model is still ongoing.

## How to Get Started with the Model

See the [README file](https://github.com/kyutai-labs/moshivis) for getting started. <!-- TODO: Update / check link -->
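For readers who only want to inspect the released checkpoint rather than run the full conversational stack, the snippet below is a minimal sketch of downloading and opening the bf16 weights with `huggingface_hub` and `safetensors`. The repo id `kyutai/moshika-vis-pytorch-bf16` and the filename `model.safetensors` are assumptions based on this repository's file listing, so verify them before use; for actually talking to the model, follow the GitHub README linked above.

```python
# Minimal sketch: download and inspect the released bf16 checkpoint.
# Assumptions: repo id "kyutai/moshika-vis-pytorch-bf16" and filename
# "model.safetensors" -- verify against the actual repository contents.
# Note: the checkpoint is large (~17 GB), so the download takes a while.
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

ckpt_path = hf_hub_download(
    repo_id="kyutai/moshika-vis-pytorch-bf16",  # assumed repo id
    filename="model.safetensors",               # assumed filename
)

# Returns a dict mapping tensor names to torch tensors (bfloat16, on CPU).
state_dict = load_file(ckpt_path)

# Print a few parameter names and shapes to get a feel for the architecture.
for name, tensor in list(state_dict.items())[:10]:
    print(f"{name}: {tuple(tensor.shape)} {tensor.dtype}")
```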
## Training Details

### Model Architecture and Objective

Our goal was to design an efficient and effective adaptation mechanism that allows Moshi to discuss images whilst maintaining its previous conversational capabilities.
To achieve this, we train a cross-attention mechanism to insert image information from a pretrained and frozen vision backbone into the underlying language model, which is also kept frozen.
An additional gating mechanism ensures that the insertion of visual information does not impact the interaction with Moshi outside of discussions of images, allowing for a seamless back and forth between general and image-specific conversations.
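As an illustration of the adapter pattern described above, the following is a minimal PyTorch sketch of a gated cross-attention block: a trainable cross-attention layer reads tokens from a frozen vision encoder, and a learned gate scales its output before it is added back to the frozen language-model stream. Module names, dimensions, and the tanh gating are illustrative assumptions, not the actual MoshiVis implementation.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionAdapter(nn.Module):
    """Illustrative adapter: only this module is trained; the surrounding
    speech/text transformer and the vision encoder stay frozen."""

    def __init__(self, d_model: int, d_image: int, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=d_model, num_heads=n_heads,
            kdim=d_image, vdim=d_image, batch_first=True,
        )
        self.norm = nn.LayerNorm(d_model)
        # Gate initialised at zero: at the start of training the adapter is a
        # no-op, so the frozen backbone behaves exactly as before.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, hidden: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
        # hidden:       (batch, seq_len, d_model) activations of the frozen LM
        # image_tokens: (batch, n_img_tokens, d_image) from the frozen vision encoder
        attn_out, _ = self.cross_attn(
            query=self.norm(hidden), key=image_tokens, value=image_tokens
        )
        # tanh gating lets the model smoothly dial visual information in or out.
        return hidden + torch.tanh(self.gate) * attn_out


# Toy usage with random tensors (all shapes are illustrative only).
adapter = GatedCrossAttentionAdapter(d_model=1024, d_image=768)
hidden = torch.randn(2, 16, 1024)        # frozen LM activations
image_tokens = torch.randn(2, 64, 768)   # frozen vision-encoder outputs
out = adapter(hidden, image_tokens)
print(out.shape)  # torch.Size([2, 16, 1024])
```

With the gate at zero, the block is initially an identity mapping on the language-model stream, so the frozen backbone's behaviour is untouched until training moves the gate away from zero; this is one common way to add a new modality without disturbing existing capabilities.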
### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
Stay tuned for our technical report, in which we will describe the training procedure in detail!

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
For information on the training data used for the base models, see [Pixtral](https://mistral.ai/news/pixtral-12b/) and [Moshi](https://huggingface.co/kyutai/moshika-pytorch-bf16) respectively.
To train the cross-attention and gating mechanism that MoshiVis uses for processing images, we rely on a collection of publicly available datasets:
- [Pixelprose](https://arxiv.org/abs/2406.10328)
- [DOCCI](https://arxiv.org/abs/2404.19753)
- [TallyQA](https://arxiv.org/abs/1810.12440)
- [OCR-VQA](https://ocr-vqa.github.io/)
- [RenderedText](https://huggingface.co/datasets/wendlerc/RenderedText)
- [DocVQA](https://arxiv.org/abs/2007.00398)
- [ChartQA](https://aclanthology.org/2022.findings-acl.177/)

We will share additional details soon, stay tuned!

### Compute Infrastructure

MoshiVis was designed as a relatively low-cost adaptation of Moshi and was trained on a single DGX node with 8 H100 GPUs provided by Scaleway.

## Model Card Authors

Amélie Royer, Moritz Böhle
model.safetensors DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:73d2e92ff89b99c200d8c6e625d3649022481da2fb9d10fd85b5ae12fcc6226b
- size 17445080792
tokenizer-e351c8d8-checkpoint125.safetensors DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:09b782f0629851a271227fb9d36db65c041790365f11bbe5d3d59369cf863f50
- size 384644900
tokenizer_spm_32k_3.model DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:78d4336533ddc26f9acf7250d7fb83492152196c6ea4212c841df76933f18d2d
- size 552778