MohamedRashad committed · Commit 85baa5b · verified · 1 Parent(s): 1a460fa

Update README.md

Files changed (1): README.md +321 -183
README.md CHANGED

---
language:
- en
- fr
- de
- es
- it
- pt
- nl
- hi
license: apache-2.0
library_name: transformers
inference: false
extra_gated_description: >-
  If you want to learn more about how we process your personal data, please read
  our <a href="https://mistral.ai/terms/">Privacy Policy</a>.
pipeline_tag: audio-text-to-text
---
# Voxtral Mini 1.0 (3B) - 2507

Voxtral Mini is an enhancement of [Ministral 3B](https://mistral.ai/news/ministraux), incorporating state-of-the-art audio input capabilities while retaining best-in-class text performance. It excels at speech transcription, translation and audio understanding.

Learn more about Voxtral in our blog post [here](https://mistral.ai/news/voxtral).

## Key Features

Voxtral builds upon Ministral-3B with powerful audio understanding capabilities:
- **Dedicated transcription mode**: Voxtral can operate in a pure speech-transcription mode to maximize performance. By default, it automatically predicts the source audio language and transcribes the text accordingly.
- **Long-form context**: With a 32k-token context length, Voxtral handles audio up to 30 minutes long for transcription, or 40 minutes for understanding.
- **Built-in Q&A and summarization**: Supports asking questions directly about the audio, analyzing it, and generating structured summaries, without the need for separate ASR and language models.
- **Natively multilingual**: Automatic language detection and state-of-the-art performance in the world's most widely used languages (English, Spanish, French, Portuguese, Hindi, German, Dutch, Italian).
- **Function-calling straight from voice**: Enables direct triggering of backend functions, workflows, or API calls based on spoken user intents.
- **Highly capable at text**: Retains the text understanding capabilities of its language-model backbone, Ministral-3B.

## Benchmark Results

### Audio

Average word error rate (WER) over the FLEURS, Mozilla Common Voice and Multilingual LibriSpeech benchmarks:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64161701107962562e9b1006/puASxtajF1lDeGYPrRK5y.png)

### Text

![image/png](https://cdn-uploads.huggingface.co/production/uploads/5dfcb1aada6d0311fd3d5448/iH9V8JVtMoaGlqJd6FIri.png)

## Usage

The model can be used with the following frameworks:
- [`Transformers` 🤗](https://github.com/huggingface/transformers): See [here](#transformers-🤗)

**Notes**:

- Use `temperature=0.2` and `top_p=0.95` for chat completion (*e.g. Audio Understanding*) and `temperature=0.0` for transcription
- Multiple audios per message and multiple user turns with audio are supported
- System prompts are not yet supported

### Transformers 🤗

Voxtral is supported natively in Transformers!

Install Transformers from source:

```bash
pip install git+https://github.com/huggingface/transformers
```
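
The processor loads audio from local paths or URLs. Depending on your environment, audio decoding may additionally require the `mistral-common` audio extra; this is an assumption rather than something stated in this card, so verify against the Transformers documentation:

```bash
# Assumed optional dependency for audio decoding; skip if your environment already handles audio I/O
pip install --upgrade "mistral-common[audio]"
```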

#### Audio Instruct

<details>
<summary>➡️ multi-audio + text instruction</summary>

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Mini-3B-2507"

# Load the processor (audio + text preprocessing) and the model in bfloat16 on the target device
processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

# Audio content entries point to local files or URLs (URLs are used here)
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/mary_had_lamb.mp3",
            },
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
            },
            {"type": "text", "text": "What sport and what nursery rhyme are referenced?"},
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
# Strip the prompt tokens so only the newly generated answer is decoded
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
```
</details>
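
The examples in this section call `generate` with its default (greedy) decoding. To apply the sampling settings recommended in the notes above for chat completion, the call can be adapted as in this minimal sketch; it reuses `model` and `inputs` from the example above, and the arguments are standard Transformers generation parameters rather than anything Voxtral-specific:

```python
# Recommended chat-completion settings from the notes above: temperature=0.2, top_p=0.95.
# For transcription, temperature=0.0 is equivalent to greedy decoding (do_sample=False).
outputs = model.generate(
    **inputs,
    max_new_tokens=500,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
)
```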

<details>
<summary>➡️ multi-turn</summary>

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Mini-3B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
            },
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
            },
            {"type": "text", "text": "Describe briefly what you can hear."},
        ],
    },
    {
        "role": "assistant",
        "content": "The audio begins with the speaker delivering a farewell address in Chicago, reflecting on his eight years as president and expressing gratitude to the American people. The audio then transitions to a weather report, stating that it was 35 degrees in Barcelona the previous day, but the temperature would drop to minus 20 degrees the following day.",
    },
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
            },
            {"type": "text", "text": "Ok, now compare this new audio with the previous one."},
        ],
    },
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
```
</details>

<details>
<summary>➡️ text only</summary>

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Mini-3B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Why should AI models be open-sourced?",
            },
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
```
</details>

<details>
<summary>➡️ audio only</summary>

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Mini-3B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
            },
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated response:")
print("=" * 80)
print(decoded_outputs[0])
print("=" * 80)
```
</details>
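
The Key Features section also mentions built-in summarization; that capability uses the same chat API as the examples above, only with a summarization instruction as the text content. A minimal sketch follows (the prompt wording is illustrative and not taken from the original card):

<details>
<summary>➡️ summarization (sketch)</summary>

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Mini-3B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

# Same chat format as the examples above; the text entry asks for a structured summary
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio",
                "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
            },
            {"type": "text", "text": "Summarize this audio as a short list of bullet points."},
        ],
    }
]

inputs = processor.apply_chat_template(conversation)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(decoded_outputs[0])
```
</details>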

<details>
<summary>➡️ batched inference</summary>

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Mini-3B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

conversations = [
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3",
                },
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/bcn_weather.mp3",
                },
                {
                    "type": "text",
                    "text": "Who's speaking in the speech and what city's weather is being discussed?",
                },
            ],
        }
    ],
    [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "path": "https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/winning_call.mp3",
                },
                {"type": "text", "text": "What can you tell me about this audio?"},
            ],
        }
    ],
]

inputs = processor.apply_chat_template(conversations)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated responses:")
print("=" * 80)
for decoded_output in decoded_outputs:
    print(decoded_output)
    print("=" * 80)
```
</details>

#### Transcription

<details>
<summary>➡️ transcribe</summary>

```python
from transformers import VoxtralForConditionalGeneration, AutoProcessor
import torch

device = "cuda"
repo_id = "mistralai/Voxtral-Mini-3B-2507"

processor = AutoProcessor.from_pretrained(repo_id)
model = VoxtralForConditionalGeneration.from_pretrained(repo_id, torch_dtype=torch.bfloat16, device_map=device)

inputs = processor.apply_transcrition_request(language="en", audio="https://huggingface.co/datasets/hf-internal-testing/dummy-audio-samples/resolve/main/obama.mp3", model_id=repo_id)
inputs = inputs.to(device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=500)
decoded_outputs = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)

print("\nGenerated responses:")
print("=" * 80)
for decoded_output in decoded_outputs:
    print(decoded_output)
    print("=" * 80)
```
</details>
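
The transcription request above fixes the language to English; for audio in another of the supported languages you would pass the corresponding code instead. A minimal sketch of that variation, assuming the same method signature as the example above (the French audio URL is a hypothetical placeholder):

```python
# Hypothetical variation of the example above: transcribe a French recording.
# Replace the placeholder URL with a real audio file.
inputs = processor.apply_transcrition_request(
    language="fr",
    audio="https://example.com/audio/french_sample.mp3",
    model_id=repo_id,
)
```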