ameroyer and mboehle committed
Commit 3bfce6c · verified · 0 Parent(s)

kyutai/moshika-vis-pytorch-bf16 v0.1


Co-authored-by: mboehle <[email protected]>

.gitattributes ADDED
@@ -0,0 +1,35 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,117 @@
+ ---
+ license: cc-by-4.0
+ language:
+ - en
+ base_model:
+ - google/paligemma2-3b-pt-448
+ - kyutai/moshika-pytorch-bf16
+ ---
+
+ # Model Card for MoshiVis
+
+
+ ## Model Details
+
+ ### Model Description
+
+ MoshiVis is a perceptually augmented version of Moshi, giving it the ability to freely discuss images whilst maintaining its natural conversation style and low latency.
+ To achieve this, Moshi has been extended with a visual backbone and a cross-attention mechanism to infuse the visual information into the language model.
+ To train MoshiVis, we add a few parameters (~200M) on top of a frozen Moshi backbone (for the text/speech modeling aspect, ~7B params)
+ and a PaliGemma2 vision encoder (for the image encoding part, ~400M parameters).
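+
+ As an illustration of the general mechanism (a minimal sketch with made-up module and dimension names, not the actual MoshiVis implementation), such an adapter can be thought of as a gated cross-attention block in which the speech/text hidden states attend to the image tokens, with a gate initialised at zero so that the frozen Moshi backbone is left unchanged at initialisation:
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class GatedCrossAttentionAdapter(nn.Module):
+     """Illustrative adapter: hidden states attend to image tokens via a gated residual."""
+
+     def __init__(self, d_model: int, d_image: int, n_heads: int = 8):
+         super().__init__()
+         self.norm = nn.LayerNorm(d_model)
+         self.attn = nn.MultiheadAttention(
+             d_model, n_heads, kdim=d_image, vdim=d_image, batch_first=True
+         )
+         # Zero-initialised gate: the adapter starts out as an identity function.
+         self.gate = nn.Parameter(torch.zeros(1))
+
+     def forward(self, hidden: torch.Tensor, image_tokens: torch.Tensor) -> torch.Tensor:
+         # hidden: (batch, seq, d_model); image_tokens: (batch, n_img, d_image)
+         attended, _ = self.attn(self.norm(hidden), image_tokens, image_tokens)
+         return hidden + torch.tanh(self.gate) * attended
+
+ # Toy shapes only; the real model uses Moshi's and PaliGemma2's hidden sizes.
+ adapter = GatedCrossAttentionAdapter(d_model=1024, d_image=1152)
+ x = torch.randn(1, 16, 1024)     # speech/text hidden states
+ img = torch.randn(1, 256, 1152)  # image tokens from the vision encoder
+ out = adapter(x, img)            # same shape as x
+ ```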
21
+
22
+ This model page contains the `Moshika` (female voice) model weights for the `Pytorch` backend of the MoshiVis repo, in `bfloat16`.
23
+ We provide the same model weights for other backends and quantization formats in the associated model collection.
24
+
25
+ - **Developed by:** Kyutai
26
+ - **Model type:** Multimodal speech+vision+text foundation model
27
+ - **Language(s) (NLP):** English
+ - **License:** CC-BY 4.0
+ - **Finetuned from model:** [Moshika](https://huggingface.co/kyutai/moshika-pytorch-bf16) and [PaliGemma2](https://huggingface.co/google/paligemma2-3b-pt-448)
+
+
+ ### Model Sources
+
+ - **Repository:** [GitHub kyutai-labs/moshivis](https://github.com/kyutai-labs/moshivis)
+ - **Demo:** [Talk to Moshi](http://vis.moshi.chat)
+
+ ## Uses
+
+ ### Direct Use
+
+ Similar to Moshi itself, MoshiVis can be used as a conversational agent for casual conversations, basic facts and advice (e.g. recipes, trivia), roleplay, etc.
+ In addition, MoshiVis is able to recognize and discuss images in a natural way, whilst still allowing for low-latency interactions.
+
+
+ ### Downstream Use
+
+ Since MoshiVis was designed to infuse a visual signal into a frozen Moshi backbone with only a few trainable parameters,
+ the model could be adapted to different downstream scenarios by further finetuning these parameters:
+ for instance, adapting MoshiVis to a different off-the-shelf image encoder or to different visual domains.
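+
+ As a rough sketch of what such finetuning could look like (a hypothetical example, assuming the adapter modules can be selected by a name such as `cross_attention`; the actual module names in the MoshiVis codebase may differ), one would freeze the full model and re-enable gradients only for the adapter parameters:
+
+ ```python
+ import torch
+
+ def select_adapter_parameters(model: torch.nn.Module, keyword: str = "cross_attention"):
+     """Freeze everything, then unfreeze only parameters whose name contains `keyword`."""
+     for name, param in model.named_parameters():
+         param.requires_grad = keyword in name
+     trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
+     total = sum(p.numel() for p in model.parameters())
+     print(f"trainable: {trainable / 1e6:.0f}M / total: {total / 1e6:.0f}M")
+     return [p for p in model.parameters() if p.requires_grad]
+
+ # params = select_adapter_parameters(moshivis_model)  # hypothetical model object
+ # optimizer = torch.optim.AdamW(params, lr=1e-4)      # then finetune as usual
+ ```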
+
+ ### Out-of-Scope Use
+
+ The model is not intended to be used to impersonate other people or for any malicious use of any kind.
+ This model is for research only, and we do not recommend it for providing advice or for performing any professional duty.
+
+
+ ## Bias, Risks, and Limitations
+
+ MoshiVis has been designed to perceptually augment the original [Moshi](https://huggingface.co/kyutai/moshika-pytorch-bf16)
+ model with vision capabilities and is expected to inherit similar biases and limitations.
+
+
+ ### Recommendations
+
+ Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.
+
+ ## How to Get Started with the Model
+
+ See our [GitHub repository](https://github.com/kyutai-labs/moshivis) for getting started.
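+
+ For instance, the weights added in this commit can be fetched programmatically with `huggingface_hub` before following the PyTorch instructions in the repository (a minimal sketch; the inference entry points themselves live in the GitHub repository linked above):
+
+ ```python
+ from huggingface_hub import snapshot_download
+
+ # Downloads the files listed in this commit (model.safetensors and the tokenizer files)
+ # into the local Hugging Face cache and returns the local directory.
+ local_dir = snapshot_download("kyutai/moshika-vis-pytorch-bf16")
+ print(local_dir)
+ ```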
+
+
+ ## Training Details
+
+ Stay tuned for our technical report, in which we will describe the training procedure in detail as well as report evaluation results.
+
+ ### Training Data
+
+ For information on the training data used for the base models, see [PaliGemma2](https://huggingface.co/google/paligemma2-3b-pt-448) and
+ [Moshi](https://huggingface.co/kyutai/moshika-pytorch-bf16), respectively.
+ To train the cross-attention and gating mechanism that MoshiVis uses for processing images,
+ we rely on a collection of publicly available datasets, namely:
+ - [DOCCI](https://google.github.io/docci/)
+ - [PixMo](https://huggingface.co/datasets/allenai/pixmo-cap)
+ - [Pixelprose](https://arxiv.org/abs/2406.10328)
+ - [TallyQA](https://arxiv.org/abs/1810.12440)
+ - [OCR-VQA](https://ocr-vqa.github.io/)
+ - [RenderedText](https://huggingface.co/datasets/wendlerc/RenderedText)
+ - [DocVQA](https://arxiv.org/abs/2007.00398)
+
+
+ ## Technical Specifications
+
+ ### Compute Infrastructure
+
+ MoshiVis was designed as a relatively low-cost adaptation of Moshi (~200M extra trainable parameters)
+ and was trained on a single DGX node with 8 H100 GPUs.
+
+ #### Software
+
+ Our training code was implemented in PyTorch. Our inference code is available for PyTorch, Rust, and MLX.
+
+ ## Citation
+
+ Blog post: https://kyutai.org/
+
+
+ ## Model Card Authors and Contact
+
+ * Amelie Royer
+ * Moritz Boehle
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:73d2e92ff89b99c200d8c6e625d3649022481da2fb9d10fd85b5ae12fcc6226b
+ size 17445080792
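
The pointer above records the SHA-256 (the LFS `oid`) and byte size of the actual weights file; a minimal sketch for checking a downloaded copy against them (assuming `model.safetensors` sits in the current directory):

```python
import hashlib
import os

EXPECTED_OID = "73d2e92ff89b99c200d8c6e625d3649022481da2fb9d10fd85b5ae12fcc6226b"
EXPECTED_SIZE = 17445080792

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

path = "model.safetensors"  # adjust to wherever the file was downloaded
assert os.path.getsize(path) == EXPECTED_SIZE, "unexpected file size"
assert sha256_of(path) == EXPECTED_OID, "checksum mismatch"
```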
tokenizer-e351c8d8-checkpoint125.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:09b782f0629851a271227fb9d36db65c041790365f11bbe5d3d59369cf863f50
+ size 384644900
tokenizer_spm_32k_3.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:78d4336533ddc26f9acf7250d7fb83492152196c6ea4212c841df76933f18d2d
+ size 552778