xiongwang committed · verified
Commit ae9e169 · Parent(s): 08f233e

Upload README.md

Files changed (1)
  1. README.md +149 -75
README.md CHANGED
@@ -93,6 +93,10 @@ We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates stro
 <td class="tg-0lax">Baichuan-Omni-1.5</td>
 <td class="tg-0lax">-|-|-|42.90%</td>
 </tr>
+<tr>
+<td class="tg-0lax">Qwen2.5-Omni-3B</td>
+<td class="tg-0lax">52.14%|52.08%|52.83%|52.19%</td>
+</tr>
 <tr>
 <td class="tg-0lax">Qwen2.5-Omni-7B</td>
 <td class="tg-0lax"><strong>55.25%</strong>|<strong>60.00%</strong>|52.83%|<strong>56.13%</strong></td>
@@ -116,7 +120,7 @@ We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates stro
 <td class="tg-9j4x" colspan="3">ASR</td>
 </tr>
 <tr>
-<td class="tg-0lax" rowspan="11">Librispeech<br>dev-clean | dev other | test-clean | test-other</td>
+<td class="tg-0lax" rowspan="12">Librispeech<br>dev-clean | dev other | test-clean | test-other</td>
 <td class="tg-0lax">SALMONN</td>
 <td class="tg-0lax">-|-|2.1|4.9</td>
 </tr>
@@ -156,12 +160,16 @@ We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates stro
 <td class="tg-0lax">Qwen2-Audio</td>
 <td class="tg-0lax"><strong>1.3</strong>|<strong>3.4</strong>|<strong>1.6</strong>|3.6</td>
 </tr>
+<tr>
+<td class="tg-0lax">Qwen2.5-Omni-3B</td>
+<td class="tg-0lax">2.0|4.1|2.2|4.5</td>
+</tr>
 <tr>
 <td class="tg-0lax">Qwen2.5-Omni-7B</td>
 <td class="tg-0lax">1.6|3.5|1.8|3.4</td>
 </tr>
 <tr>
-<td class="tg-0lax" rowspan="4">Common Voice 15<br>en | zh | yue | fr</td>
+<td class="tg-0lax" rowspan="5">Common Voice 15<br>en | zh | yue | fr</td>
 <td class="tg-0lax">Whisper-large-v3</td>
 <td class="tg-0lax">9.3|12.8|10.9|10.8</td>
 </tr>
@@ -173,12 +181,16 @@ We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates stro
 <td class="tg-0lax">Qwen2-Audio</td>
 <td class="tg-0lax">8.6|6.9|<strong>5.9</strong>|9.6</td>
 </tr>
+<tr>
+<td class="tg-0lax">Qwen2.5-Omni-3B</td>
+<td class="tg-0lax">9.1|6.0|11.6|9.6</td>
+</tr>
 <tr>
 <td class="tg-0lax">Qwen2.5-Omni-7B</td>
 <td class="tg-0lax"><strong>7.6</strong>|<strong>5.2</strong>|7.3|<strong>7.5</strong></td>
 </tr>
 <tr>
-<td class="tg-0lax" rowspan="7">Fleurs<br>zh | en</td>
+<td class="tg-0lax" rowspan="8">Fleurs<br>zh | en</td>
 <td class="tg-0lax">Whisper-large-v3</td>
 <td class="tg-0lax">7.7|4.1</td>
 </tr>
@@ -201,13 +213,17 @@ We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates stro
 <tr>
 <td class="tg-0lax">Qwen2-Audio</td>
 <td class="tg-0lax">7.5|-</td>
+</tr>
+<tr>
+<td class="tg-0lax">Qwen2.5-Omni-3B</td>
+<td class="tg-0lax">3.2|5.4</td>
 </tr>
 <tr>
 <td class="tg-0lax">Qwen2.5-Omni-7B</td>
 <td class="tg-0lax"><strong>3.0</strong>|4.1</td>
 </tr>
 <tr>
-<td class="tg-0lax" rowspan="5">Wenetspeech<br>test-net | test-meeting</td>
+<td class="tg-0lax" rowspan="6">Wenetspeech<br>test-net | test-meeting</td>
 <td class="tg-0lax">Seed-ASR-Chinese</td>
 <td class="tg-0lax"><strong>4.7|5.7</strong></td>
 </tr>
@@ -223,12 +239,16 @@ We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates stro
 <td class="tg-0lax">MinMo</td>
 <td class="tg-0lax">6.8|7.4</td>
 </tr>
+<tr>
+<td class="tg-0lax">Qwen2.5-Omni-3B</td>
+<td class="tg-0lax">6.3|8.1</td>
+</tr>
 <tr>
 <td class="tg-0lax">Qwen2.5-Omni-7B</td>
 <td class="tg-0lax">5.9|7.7</td>
 </tr>
 <tr>
-<td class="tg-0lax" rowspan="3">Voxpopuli-V1.0-en</td>
+<td class="tg-0lax" rowspan="4">Voxpopuli-V1.0-en</td>
 <td class="tg-0lax">Llama-3-8B</td>
 <td class="tg-0lax">6.2</td>
 </tr>
@@ -236,6 +256,10 @@ We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates stro
 <td class="tg-0lax">Llama-3-70B</td>
 <td class="tg-0lax"><strong>5.7</strong></td>
 </tr>
+<tr>
+<td class="tg-0lax">Qwen2.5-Omni-3B</td>
+<td class="tg-0lax">6.6</td>
+</tr>
 <tr>
 <td class="tg-0lax">Qwen2.5-Omni-7B</td>
 <td class="tg-0lax">5.8</td>
@@ -244,7 +268,7 @@ We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates stro
 <td class="tg-9j4x" colspan="3">S2TT</td>
 </tr>
 <tr>
-<td class="tg-0lax" rowspan="8">CoVoST2<br>en-de | de-en | en-zh | zh-en</td>
+<td class="tg-0lax" rowspan="9">CoVoST2<br>en-de | de-en | en-zh | zh-en</td>
 <td class="tg-0lax">SALMONN</td>
 <td class="tg-0lax">18.6|-|33.1|-</td>
 </tr>
@@ -272,6 +296,10 @@ We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates stro
 <td class="tg-0lax">Qwen2-Audio</td>
 <td class="tg-0lax">29.9|35.2|45.2|24.4</td>
 </tr>
+<tr>
+<td class="tg-0lax">Qwen2.5-Omni-3B</td>
+<td class="tg-0lax">28.3|38.1|41.4|26.6</td>
+</tr>
 <tr>
 <td class="tg-0lax">Qwen2.5-Omni-7B</td>
 <td class="tg-0lax"><strong>30.2</strong>|37.7|41.4|<strong>29.4</strong></td>
@@ -280,7 +308,7 @@ We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates stro
 <td class="tg-9j4x" colspan="3">SER</td>
 </tr>
 <tr>
-<td class="tg-0lax" rowspan="5">Meld</td>
+<td class="tg-0lax" rowspan="6">Meld</td>
 <td class="tg-0lax">WavLM-large</td>
 <td class="tg-0lax">0.542</td>
 </tr>
@@ -296,6 +324,10 @@ We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates stro
 <td class="tg-0lax">Qwen2-Audio</td>
 <td class="tg-0lax">0.553</td>
 </tr>
+<tr>
+<td class="tg-0lax">Qwen2.5-Omni-3B</td>
+<td class="tg-0lax">0.558</td>
+</tr>
 <tr>
 <td class="tg-0lax">Qwen2.5-Omni-7B</td>
 <td class="tg-0lax"><strong>0.570</strong></td>
@@ -304,7 +336,7 @@ We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates stro
 <td class="tg-9j4x" colspan="3">VSC</td>
 </tr>
 <tr>
-<td class="tg-0lax" rowspan="5">VocalSound</td>
+<td class="tg-0lax" rowspan="6">VocalSound</td>
 <td class="tg-0lax">CLAP</td>
 <td class="tg-0lax">0.495</td>
 </tr>
@@ -320,6 +352,10 @@ We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates stro
 <td class="tg-0lax">Qwen2-Audio</td>
 <td class="tg-0lax"><strong>0.939</strong></td>
 </tr>
+<tr>
+<td class="tg-0lax">Qwen2.5-Omni-3B</td>
+<td class="tg-0lax">0.936</td>
+</tr>
 <tr>
 <td class="tg-0lax">Qwen2.5-Omni-7B</td>
 <td class="tg-0lax"><strong>0.939</strong></td>
@@ -328,28 +364,36 @@ We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates stro
 <td class="tg-9j4x" colspan="3">Music</td>
 </tr>
 <tr>
-<td class="tg-0lax" rowspan="2">GiantSteps Tempo</td>
+<td class="tg-0lax" rowspan="3">GiantSteps Tempo</td>
 <td class="tg-0lax">Llark-7B</td>
 <td class="tg-0lax">0.86</td>
 </tr>
+<tr>
+<td class="tg-0lax">Qwen2.5-Omni-3B</td>
+<td class="tg-0lax"><strong>0.88</strong></td>
+</tr>
 <tr>
 <td class="tg-0lax">Qwen2.5-Omni-7B</td>
 <td class="tg-0lax"><strong>0.88</strong></td>
 </tr>
 <tr>
-<td class="tg-0lax" rowspan="2">MusicCaps</td>
+<td class="tg-0lax" rowspan="3">MusicCaps</td>
 <td class="tg-0lax">LP-MusicCaps</td>
-<td class="tg-0lax">0.291|0.149|0.089|<strong>0.061</strong>|<strong>0.129</strong>|0.130</td>
+<td class="tg-0lax">0.291|0.149|0.089|<strong>0.061</strong>|0.129|0.130</td>
+</tr>
+<tr>
+<td class="tg-0lax">Qwen2.5-Omni-3B</td>
+<td class="tg-0lax">0.325|<strong>0.163</strong>|<strong>0.093</strong>|0.057|<strong>0.132</strong>|<strong>0.229</strong></td>
 </tr>
 <tr>
 <td class="tg-0lax">Qwen2.5-Omni-7B</td>
-<td class="tg-0lax"><strong>0.328</strong>|<strong>0.162</strong>|<strong>0.090</strong>|0.055|0.127|<strong>0.225</strong></td>
+<td class="tg-0lax"><strong>0.328</strong>|0.162|0.090|0.055|0.127|0.225</td>
 </tr>
 <tr>
 <td class="tg-9j4x" colspan="3">Audio Reasoning</td>
 </tr>
 <tr>
-<td class="tg-0lax" rowspan="3">MMAU<br>Sound | Music | Speech | Avg</td>
+<td class="tg-0lax" rowspan="4">MMAU<br>Sound | Music | Speech | Avg</td>
 <td class="tg-0lax">Gemini-Pro-V1.5</td>
 <td class="tg-0lax">56.75|49.40|58.55|54.90</td>
 </tr>
@@ -357,15 +401,19 @@ We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates stro
 <td class="tg-0lax">Qwen2-Audio</td>
 <td class="tg-0lax">54.95|50.98|42.04|49.20</td>
 </tr>
+<tr>
+<td class="tg-0lax">Qwen2.5-Omni-3B</td>
+<td class="tg-0lax"><strong>70.27</strong>|60.48|59.16|63.30</td>
+</tr>
 <tr>
 <td class="tg-0lax">Qwen2.5-Omni-7B</td>
-<td class="tg-0lax"><strong>67.87|69.16|59.76|65.60</strong></td>
+<td class="tg-0lax">67.87|<strong>69.16|59.76|65.60</strong></td>
 </tr>
 <tr>
 <td class="tg-9j4x" colspan="3">Voice Chatting</td>
 </tr>
 <tr>
-<td class="tg-0lax" rowspan="8">VoiceBench<br>AlpacaEval | CommonEval | SD-QA | MMSU</td>
+<td class="tg-0lax" rowspan="9">VoiceBench<br>AlpacaEval | CommonEval | SD-QA | MMSU</td>
 <td class="tg-0lax">Ultravox-v0.4.1-LLaMA-3.1-8B</td>
 <td class="tg-0lax"><strong>4.55</strong>|3.90|53.35|47.17</td>
 </tr>
@@ -392,13 +440,17 @@ We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates stro
 <tr>
 <td class="tg-0lax">Qwen2-Audio</td>
 <td class="tg-0lax">3.74|3.43|35.71|35.72</td>
+</tr>
+<tr>
+<td class="tg-0lax">Qwen2.5-Omni-3B</td>
+<td class="tg-0lax">4.32|4.00|49.37|50.23</td>
 </tr>
 <tr>
 <td class="tg-0lax">Qwen2.5-Omni-7B</td>
 <td class="tg-0lax">4.49|3.93|<strong>55.71</strong>|<strong>61.32</strong></td>
 </tr>
 <tr>
-<td class="tg-0lax" rowspan="8">VoiceBench<br>OpenBookQA | IFEval | AdvBench | Avg</td>
+<td class="tg-0lax" rowspan="9">VoiceBench<br>OpenBookQA | IFEval | AdvBench | Avg</td>
 <td class="tg-0lax">Ultravox-v0.4.1-LLaMA-3.1-8B</td>
 <td class="tg-0lax">65.27|<strong>66.88</strong>|98.46|71.45</td>
 </tr>
@@ -426,6 +478,10 @@ We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates stro
 <td class="tg-0lax">Qwen2-Audio</td>
 <td class="tg-0lax">49.45|26.33|96.73|55.35</td>
 </tr>
+<tr>
+<td class="tg-0lax">Qwen2.5-Omni-3B</td>
+<td class="tg-0lax">74.73|42.10|98.85|68.81</td>
+</tr>
 <tr>
 <td class="tg-0lax">Qwen2.5-Omni-7B</td>
 <td class="tg-0lax"><strong>81.10</strong>|52.87|<strong>99.42</strong>|<strong>74.12</strong></td>
@@ -436,52 +492,52 @@ We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates stro
 <details>
 <summary>Image -> Text</summary>
 
-| Dataset | Qwen2.5-Omni-7B | Other Best | Qwen2.5-VL-7B | GPT-4o-mini |
-|---|---|---|---|---|
-| MMMU<sub>val</sub> | 59.2 | 53.9 | 58.6 | **60.0** |
-| MMMU-Pro<sub>overall</sub> | 36.6 | - | **38.3** | 37.6 |
-| MathVista<sub>testmini</sub> | 67.9 | **71.9** | 68.2 | 52.5 |
-| MathVision<sub>full</sub> | 25.0 | 23.1 | **25.1** | - |
-| MMBench-V1.1-EN<sub>test</sub> | 81.8 | 80.5 | **82.6** | 76.0 |
-| MMVet<sub>turbo</sub> | 66.8 | **67.5** | 67.1 | 66.9 |
-| MMStar | **64.0** | **64.0** | 63.9 | 54.8 |
-| MME<sub>sum</sub> | 2340 | **2372** | 2347 | 2003 |
-| MuirBench | 59.2 | - | **59.2** | - |
-| CRPE<sub>relation</sub> | **76.5** | - | 76.4 | - |
-| RealWorldQA<sub>avg</sub> | 70.3 | **71.9** | 68.5 | - |
-| MME-RealWorld<sub>en</sub> | **61.6** | - | 57.4 | - |
-| MM-MT-Bench | 6.0 | - | **6.3** | - |
-| AI2D | 83.2 | **85.8** | 83.9 | - |
-| TextVQA<sub>val</sub> | 84.4 | 83.2 | **84.9** | - |
-| DocVQA<sub>test</sub> | 95.2 | 93.5 | **95.7** | - |
-| ChartQA<sub>test Avg</sub> | 85.3 | 84.9 | **87.3** | - |
-| OCRBench_V2<sub>en</sub> | **57.8** | - | 56.3 | - |
-
-
-| Dataset | Qwen2.5-Omni-7B | Qwen2.5-VL-7B | Grounding DINO | Gemini 1.5 Pro |
-|---|---|---|---|---|
-| Refcoco<sub>val</sub> | 90.5 | 90.0 | **90.6** | 73.2 |
-| Refcoco<sub>textA</sub> | **93.5** | 92.5 | 93.2 | 72.9 |
-| Refcoco<sub>textB</sub> | 86.6 | 85.4 | **88.2** | 74.6 |
-| Refcoco+<sub>val</sub> | 85.4 | 84.2 | **88.2** | 62.5 |
-| Refcoco+<sub>textA</sub> | **91.0** | 89.1 | 89.0 | 63.9 |
-| Refcoco+<sub>textB</sub> | **79.3** | 76.9 | 75.9 | 65.0 |
-| Refcocog+<sub>val</sub> | **87.4** | 87.2 | 86.1 | 75.2 |
-| Refcocog+<sub>test</sub> | **87.9** | 87.2 | 87.0 | 76.2 |
-| ODinW | 42.4 | 37.3 | **55.0** | 36.7 |
-| PointGrounding | 66.5 | **67.3** | - | - |
+| Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Other Best | Qwen2.5-VL-7B | GPT-4o-mini |
+|---|---|---|---|---|---|
+| MMMU<sub>val</sub> | 59.2 | 53.1 | 53.9 | 58.6 | **60.0** |
+| MMMU-Pro<sub>overall</sub> | 36.6 | 29.7 | - | **38.3** | 37.6 |
+| MathVista<sub>testmini</sub> | 67.9 | 59.4 | **71.9** | 68.2 | 52.5 |
+| MathVision<sub>full</sub> | 25.0 | 20.8 | 23.1 | **25.1** | - |
+| MMBench-V1.1-EN<sub>test</sub> | 81.8 | 77.8 | 80.5 | **82.6** | 76.0 |
+| MMVet<sub>turbo</sub> | 66.8 | 62.1 | **67.5** | 67.1 | 66.9 |
+| MMStar | **64.0** | 55.7 | **64.0** | 63.9 | 54.8 |
+| MME<sub>sum</sub> | 2340 | 2117 | **2372** | 2347 | 2003 |
+| MuirBench | 59.2 | 48.0 | - | **59.2** | - |
+| CRPE<sub>relation</sub> | **76.5** | 73.7 | - | 76.4 | - |
+| RealWorldQA<sub>avg</sub> | 70.3 | 62.6 | **71.9** | 68.5 | - |
+| MME-RealWorld<sub>en</sub> | **61.6** | 55.6 | - | 57.4 | - |
+| MM-MT-Bench | 6.0 | 5.0 | - | **6.3** | - |
+| AI2D | 83.2 | 79.5 | **85.8** | 83.9 | - |
+| TextVQA<sub>val</sub> | 84.4 | 79.8 | 83.2 | **84.9** | - |
+| DocVQA<sub>test</sub> | 95.2 | 93.3 | 93.5 | **95.7** | - |
+| ChartQA<sub>test Avg</sub> | 85.3 | 82.8 | 84.9 | **87.3** | - |
+| OCRBench_V2<sub>en</sub> | **57.8** | 51.7 | - | 56.3 | - |
+
+
+| Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Qwen2.5-VL-7B | Grounding DINO | Gemini 1.5 Pro |
+|---|---|---|---|---|---|
+| Refcoco<sub>val</sub> | 90.5 | 88.7 | 90.0 | **90.6** | 73.2 |
+| Refcoco<sub>textA</sub> | **93.5** | 91.8 | 92.5 | 93.2 | 72.9 |
+| Refcoco<sub>textB</sub> | 86.6 | 84.0 | 85.4 | **88.2** | 74.6 |
+| Refcoco+<sub>val</sub> | 85.4 | 81.1 | 84.2 | **88.2** | 62.5 |
+| Refcoco+<sub>textA</sub> | **91.0** | 87.5 | 89.1 | 89.0 | 63.9 |
+| Refcoco+<sub>textB</sub> | **79.3** | 73.2 | 76.9 | 75.9 | 65.0 |
+| Refcocog+<sub>val</sub> | **87.4** | 85.0 | 87.2 | 86.1 | 75.2 |
+| Refcocog+<sub>test</sub> | **87.9** | 85.1 | 87.2 | 87.0 | 76.2 |
+| ODinW | 42.4 | 39.2 | 37.3 | **55.0** | 36.7 |
+| PointGrounding | 66.5 | 46.2 | **67.3** | - | - |
 </details>
 
 
 <details>
 <summary>Video(without audio) -> Text</summary>
 
-| Dataset | Qwen2.5-Omni-7B | Other Best | Qwen2.5-VL-7B | GPT-4o-mini |
-|---|---|---|---|---|
-| Video-MME<sub>w/o sub</sub> | 64.3 | 63.9 | **65.1** | 64.8 |
-| Video-MME<sub>w sub</sub> | **72.4** | 67.9 | 71.6 | - |
-| MVBench | **70.3** | 67.2 | 69.6 | - |
-| EgoSchema<sub>test</sub> | **68.6** | 63.2 | 65.0 | - |
+| Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Other Best | Qwen2.5-VL-7B | GPT-4o-mini |
+|---|---|---|---|---|---|
+| Video-MME<sub>w/o sub</sub> | 64.3 | 62.0 | 63.9 | **65.1** | 64.8 |
+| Video-MME<sub>w sub</sub> | **72.4** | 68.6 | 67.9 | 71.6 | - |
+| MVBench | **70.3** | 68.7 | 67.2 | 69.6 | - |
+| EgoSchema<sub>test</sub> | **68.6** | 61.4 | 63.2 | 65.0 | - |
 </details>
 
 <details>
@@ -499,7 +555,7 @@ We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates stro
 <td class="tg-9j4x" colspan="3">Content Consistency</td>
 </tr>
 <tr>
-<td class="tg-0lax" rowspan="9">SEED<br>test-zh | test-en | test-hard </td>
+<td class="tg-0lax" rowspan="11">SEED<br>test-zh | test-en | test-hard </td>
 <td class="tg-0lax">Seed-TTS_ICL</td>
 <td class="tg-0lax">1.11 | 2.24 | 7.58</td>
 </tr>
@@ -527,6 +583,14 @@ We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates stro
 <td class="tg-0lax">CosyVoice 2-S</td>
 <td class="tg-0lax">1.45 | 2.38 | 8.08</td>
 </tr>
+<tr>
+<td class="tg-0lax">Qwen2.5-Omni-3B_ICL</td>
+<td class="tg-0lax">1.95 | 2.87 | 9.92</td>
+</tr>
+<tr>
+<td class="tg-0lax">Qwen2.5-Omni-3B_RL</td>
+<td class="tg-0lax">1.58 | 2.51 | 7.86</td>
+</tr>
 <tr>
 <td class="tg-0lax">Qwen2.5-Omni-7B_ICL</td>
 <td class="tg-0lax">1.70 | 2.72 | 7.97</td>
@@ -539,7 +603,7 @@ We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates stro
 <td class="tg-9j4x" colspan="3">Speaker Similarity</td>
 </tr>
 <tr>
-<td class="tg-0lax" rowspan="9">SEED<br>test-zh | test-en | test-hard </td>
+<td class="tg-0lax" rowspan="11">SEED<br>test-zh | test-en | test-hard </td>
 <td class="tg-0lax">Seed-TTS_ICL</td>
 <td class="tg-0lax">0.796 | 0.762 | 0.776</td>
 </tr>
@@ -567,6 +631,14 @@ We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates stro
 <td class="tg-0lax">CosyVoice 2-S</td>
 <td class="tg-0lax">0.753 | 0.654 | 0.732</td>
 </tr>
+<tr>
+<td class="tg-0lax">Qwen2.5-Omni-3B_ICL</td>
+<td class="tg-0lax">0.741 | 0.635 | 0.748</td>
+</tr>
+<tr>
+<td class="tg-0lax">Qwen2.5-Omni-3B_RL</td>
+<td class="tg-0lax">0.744 | 0.635 | 0.746</td>
+</tr>
 <tr>
 <td class="tg-0lax">Qwen2.5-Omni-7B_ICL</td>
 <td class="tg-0lax">0.752 | 0.632 | 0.747</td>
@@ -581,18 +653,18 @@ We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates stro
 <details>
 <summary>Text -> Text</summary>
 
-| Dataset | Qwen2.5-Omni-7B | Qwen2.5-7B | Qwen2-7B | Llama3.1-8B | Gemma2-9B |
-|---|---|---|---|---|---|
-| MMLU-Pro | 47.0 | **56.3** | 44.1 | 48.3 | 52.1 |
-| MMLU-redux | 71.0 | **75.4** | 67.3 | 67.2 | 72.8 |
-| LiveBench<sub>0831</sub> | 29.6 | **35.9** | 29.2 | 26.7 | 30.6 |
-| GPQA | 30.8 | **36.4** | 34.3 | 32.8 | 32.8 |
-| MATH | 71.5 | **75.5** | 52.9 | 51.9 | 44.3 |
-| GSM8K | 88.7 | **91.6** | 85.7 | 84.5 | 76.7 |
-| HumanEval | 78.7 | **84.8** | 79.9 | 72.6 | 68.9 |
-| MBPP | 73.2 | **79.2** | 67.2 | 69.6 | 74.9 |
-| MultiPL-E | 65.8 | **70.4** | 59.1 | 50.7 | 53.4 |
-| LiveCodeBench<sub>2305-2409</sub> | 24.6 | **28.7** | 23.9 | 8.3 | 18.9 |
+| Dataset | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Qwen2.5-7B | Qwen2.5-3B | Qwen2-7B | Llama3.1-8B | Gemma2-9B |
+|---|---|---|---|---|---|---|---|
+| MMLU-Pro | 47.0 | 40.4 | **56.3** | 43.7 | 44.1 | 48.3 | 52.1 |
+| MMLU-redux | 71.0 | 60.9 | **75.4** | 64.4 | 67.3 | 67.2 | 72.8 |
+| LiveBench<sub>0831</sub> | 29.6 | 22.3 | **35.9** | 26.8 | 29.2 | 26.7 | 30.6 |
+| GPQA | 30.8 | 34.3 | **36.4** | 30.3 | 34.3 | 32.8 | 32.8 |
+| MATH | 71.5 | 63.6 | **75.5** | 65.9 | 52.9 | 51.9 | 44.3 |
+| GSM8K | 88.7 | 82.6 | **91.6** | 86.7 | 85.7 | 84.5 | 76.7 |
+| HumanEval | 78.7 | 70.7 | **84.8** | 74.4 | 79.9 | 72.6 | 68.9 |
+| MBPP | 73.2 | 70.4 | **79.2** | 72.7 | 67.2 | 69.6 | 74.9 |
+| MultiPL-E | 65.8 | 57.6 | **70.4** | 60.2 | 59.1 | 50.7 | 53.4 |
+| LiveCodeBench<sub>2305-2409</sub> | 24.6 | 16.5 | **28.7** | 19.9 | 23.9 | 8.3 | 18.9 |
 </details>
 
 ## Quickstart
@@ -600,7 +672,7 @@ We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates stro
 Below, we provide simple examples to show how to use Qwen2.5-Omni with 🤗 Transformers. The code for Qwen2.5-Omni is available in the latest Hugging Face transformers, and we advise you to build from source with the following command:
 ```
 pip uninstall transformers
-pip install git+https://github.com/huggingface/transformers
+pip install git+https://github.com/huggingface/transformers@v4.51.3-Qwen2.5-Omni-preview
 pip install accelerate
 ```
 or you might encounter the following error:
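For orientation, here is a minimal sketch of the inference flow that the Quickstart hunk above sets up. The class names and the `qwen-omni-utils` helper are assumptions based on the preview transformers build pinned above; the full, authoritative example lives in the README itself.

```python
# Hedged sketch of the Quickstart flow; class and helper names are assumed
# from the preview transformers build (v4.51.3-Qwen2.5-Omni-preview).
import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info  # pip install qwen-omni-utils

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

conversation = [
    {"role": "user", "content": [{"type": "text", "text": "Introduce yourself briefly."}]},
]

# Build the prompt and collect any audio/image/video referenced in the conversation.
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=True)
inputs = processor(
    text=text, audio=audios, images=images, videos=videos,
    return_tensors="pt", padding=True, use_audio_in_video=True,
).to(model.device).to(model.dtype)

# Generate both text tokens and a speech waveform, then decode/save them.
text_ids, audio = model.generate(**inputs, use_audio_in_video=True)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
sf.write("output.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)
```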
@@ -680,10 +752,12 @@ sf.write(
 <details>
 <summary>Minimum GPU memory requirements</summary>
 
-| Precision | 15(s) Video | 30(s) Video | 60(s) Video |
-|---|---|---|---|
-| FP32 | 93.56 GB | Not Recommend | Not Recommend |
-| BF16 | 31.11 GB | 41.85 GB | 60.19 GB |
+| Model | Precision | 15(s) Video | 30(s) Video | 60(s) Video |
+|---|---|---|---|---|
+| Qwen-Omni-3B | FP32 | 89.10 GB | Not Recommend | Not Recommend |
+| Qwen-Omni-3B | BF16 | 18.38 GB | 22.43 GB | 28.22 GB |
+| Qwen-Omni-7B | FP32 | 93.56 GB | Not Recommend | Not Recommend |
+| Qwen-Omni-7B | BF16 | 31.11 GB | 41.85 GB | 60.19 GB |
 
 Note: The table above presents the theoretical minimum memory requirements for inference with `transformers`, and `BF16` is tested with `attn_implementation="flash_attention_2"`; however, in practice, the actual memory usage is typically at least 1.2 times higher. For more information, see the linked resource [here](https://huggingface.co/docs/accelerate/main/en/usage_guides/model_size_estimator).
 </details>
 