Upload README.md
README.md (CHANGED)
We conducted a comprehensive evaluation of Qwen2.5-Omni, which demonstrates strong performance across all modalities.

<td class="tg-0lax">Baichuan-Omni-1.5</td>
<td class="tg-0lax">-|-|-|42.90%</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-3B</td>
<td class="tg-0lax">52.14%|52.08%|52.83%|52.19%</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-7B</td>
<td class="tg-0lax"><strong>55.25%</strong>|<strong>60.00%</strong>|52.83%|<strong>56.13%</strong></td>

<td class="tg-9j4x" colspan="3">ASR</td>
</tr>
<tr>
<td class="tg-0lax" rowspan="12">Librispeech<br>dev-clean | dev other | test-clean | test-other</td>
<td class="tg-0lax">SALMONN</td>
<td class="tg-0lax">-|-|2.1|4.9</td>
</tr>

<td class="tg-0lax">Qwen2-Audio</td>
<td class="tg-0lax"><strong>1.3</strong>|<strong>3.4</strong>|<strong>1.6</strong>|3.6</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-3B</td>
<td class="tg-0lax">2.0|4.1|2.2|4.5</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-7B</td>
<td class="tg-0lax">1.6|3.5|1.8|3.4</td>
</tr>
<tr>
<td class="tg-0lax" rowspan="5">Common Voice 15<br>en | zh | yue | fr</td>
<td class="tg-0lax">Whisper-large-v3</td>
<td class="tg-0lax">9.3|12.8|10.9|10.8</td>
</tr>

<td class="tg-0lax">Qwen2-Audio</td>
<td class="tg-0lax">8.6|6.9|<strong>5.9</strong>|9.6</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-3B</td>
<td class="tg-0lax">9.1|6.0|11.6|9.6</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-7B</td>
<td class="tg-0lax"><strong>7.6</strong>|<strong>5.2</strong>|7.3|<strong>7.5</strong></td>
</tr>
<tr>
<td class="tg-0lax" rowspan="8">Fleurs<br>zh | en</td>
<td class="tg-0lax">Whisper-large-v3</td>
<td class="tg-0lax">7.7|4.1</td>
</tr>

<tr>
<td class="tg-0lax">Qwen2-Audio</td>
<td class="tg-0lax">7.5|-</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-3B</td>
<td class="tg-0lax">3.2|5.4</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-7B</td>
<td class="tg-0lax"><strong>3.0</strong>|4.1</td>
</tr>
<tr>
<td class="tg-0lax" rowspan="6">Wenetspeech<br>test-net | test-meeting</td>
<td class="tg-0lax">Seed-ASR-Chinese</td>
<td class="tg-0lax"><strong>4.7|5.7</strong></td>
</tr>

<td class="tg-0lax">MinMo</td>
<td class="tg-0lax">6.8|7.4</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-3B</td>
<td class="tg-0lax">6.3|8.1</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-7B</td>
<td class="tg-0lax">5.9|7.7</td>
</tr>
<tr>
<td class="tg-0lax" rowspan="4">Voxpopuli-V1.0-en</td>
<td class="tg-0lax">Llama-3-8B</td>
<td class="tg-0lax">6.2</td>
</tr>

<td class="tg-0lax">Llama-3-70B</td>
<td class="tg-0lax"><strong>5.7</strong></td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-3B</td>
<td class="tg-0lax">6.6</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-7B</td>
<td class="tg-0lax">5.8</td>

<td class="tg-9j4x" colspan="3">S2TT</td>
</tr>
<tr>
<td class="tg-0lax" rowspan="9">CoVoST2<br>en-de | de-en | en-zh | zh-en</td>
<td class="tg-0lax">SALMONN</td>
<td class="tg-0lax">18.6|-|33.1|-</td>
</tr>

<td class="tg-0lax">Qwen2-Audio</td>
<td class="tg-0lax">29.9|35.2|45.2|24.4</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-3B</td>
<td class="tg-0lax">28.3|38.1|41.4|26.6</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-7B</td>
<td class="tg-0lax"><strong>30.2</strong>|37.7|41.4|<strong>29.4</strong></td>

<td class="tg-9j4x" colspan="3">SER</td>
</tr>
<tr>
<td class="tg-0lax" rowspan="6">Meld</td>
<td class="tg-0lax">WavLM-large</td>
<td class="tg-0lax">0.542</td>
</tr>

<td class="tg-0lax">Qwen2-Audio</td>
<td class="tg-0lax">0.553</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-3B</td>
<td class="tg-0lax">0.558</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-7B</td>
<td class="tg-0lax"><strong>0.570</strong></td>

<td class="tg-9j4x" colspan="3">VSC</td>
</tr>
<tr>
<td class="tg-0lax" rowspan="6">VocalSound</td>
<td class="tg-0lax">CLAP</td>
<td class="tg-0lax">0.495</td>
</tr>

<td class="tg-0lax">Qwen2-Audio</td>
<td class="tg-0lax"><strong>0.939</strong></td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-3B</td>
<td class="tg-0lax">0.936</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-7B</td>
<td class="tg-0lax"><strong>0.939</strong></td>

<td class="tg-9j4x" colspan="3">Music</td>
</tr>
<tr>
<td class="tg-0lax" rowspan="3">GiantSteps Tempo</td>
<td class="tg-0lax">Llark-7B</td>
<td class="tg-0lax">0.86</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-3B</td>
<td class="tg-0lax"><strong>0.88</strong></td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-7B</td>
<td class="tg-0lax"><strong>0.88</strong></td>
</tr>
<tr>
<td class="tg-0lax" rowspan="3">MusicCaps</td>
<td class="tg-0lax">LP-MusicCaps</td>
<td class="tg-0lax">0.291|0.149|0.089|<strong>0.061</strong>|0.129|0.130</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-3B</td>
<td class="tg-0lax">0.325|<strong>0.163</strong>|<strong>0.093</strong>|0.057|<strong>0.132</strong>|<strong>0.229</strong></td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-7B</td>
<td class="tg-0lax"><strong>0.328</strong>|0.162|0.090|0.055|0.127|0.225</td>
</tr>
<tr>
<td class="tg-9j4x" colspan="3">Audio Reasoning</td>
</tr>
<tr>
<td class="tg-0lax" rowspan="4">MMAU<br>Sound | Music | Speech | Avg</td>
<td class="tg-0lax">Gemini-Pro-V1.5</td>
<td class="tg-0lax">56.75|49.40|58.55|54.90</td>
</tr>

<td class="tg-0lax">Qwen2-Audio</td>
<td class="tg-0lax">54.95|50.98|42.04|49.20</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-3B</td>
<td class="tg-0lax"><strong>70.27</strong>|60.48|59.16|63.30</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-7B</td>
<td class="tg-0lax">67.87|<strong>69.16|59.76|65.60</strong></td>
</tr>
<tr>
<td class="tg-9j4x" colspan="3">Voice Chatting</td>
</tr>
<tr>
<td class="tg-0lax" rowspan="9">VoiceBench<br>AlpacaEval | CommonEval | SD-QA | MMSU</td>
<td class="tg-0lax">Ultravox-v0.4.1-LLaMA-3.1-8B</td>
<td class="tg-0lax"><strong>4.55</strong>|3.90|53.35|47.17</td>
</tr>

<tr>
<td class="tg-0lax">Qwen2-Audio</td>
<td class="tg-0lax">3.74|3.43|35.71|35.72</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-3B</td>
<td class="tg-0lax">4.32|4.00|49.37|50.23</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-7B</td>
<td class="tg-0lax">4.49|3.93|<strong>55.71</strong>|<strong>61.32</strong></td>
</tr>
<tr>
<td class="tg-0lax" rowspan="9">VoiceBench<br>OpenBookQA | IFEval | AdvBench | Avg</td>
<td class="tg-0lax">Ultravox-v0.4.1-LLaMA-3.1-8B</td>
<td class="tg-0lax">65.27|<strong>66.88</strong>|98.46|71.45</td>
</tr>

<td class="tg-0lax">Qwen2-Audio</td>
<td class="tg-0lax">49.45|26.33|96.73|55.35</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-3B</td>
<td class="tg-0lax">74.73|42.10|98.85|68.81</td>
</tr>
<tr>
<td class="tg-0lax">Qwen2.5-Omni-7B</td>
<td class="tg-0lax"><strong>81.10</strong>|52.87|<strong>99.42</strong>|<strong>74.12</strong></td>

<details>
<summary>Image -> Text</summary>

| Dataset                        | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Other Best | Qwen2.5-VL-7B | GPT-4o-mini |
|--------------------------------|-----------------|-----------------|------------|---------------|-------------|
| MMMU<sub>val</sub>             | 59.2            | 53.1            | 53.9       | 58.6          | **60.0**    |
| MMMU-Pro<sub>overall</sub>     | 36.6            | 29.7            | -          | **38.3**      | 37.6        |
| MathVista<sub>testmini</sub>   | 67.9            | 59.4            | **71.9**   | 68.2          | 52.5        |
| MathVision<sub>full</sub>      | 25.0            | 20.8            | 23.1       | **25.1**      | -           |
| MMBench-V1.1-EN<sub>test</sub> | 81.8            | 77.8            | 80.5       | **82.6**      | 76.0        |
| MMVet<sub>turbo</sub>          | 66.8            | 62.1            | **67.5**   | 67.1          | 66.9        |
| MMStar                         | **64.0**        | 55.7            | **64.0**   | 63.9          | 54.8        |
| MME<sub>sum</sub>              | 2340            | 2117            | **2372**   | 2347          | 2003        |
| MuirBench                      | 59.2            | 48.0            | -          | **59.2**      | -           |
| CRPE<sub>relation</sub>        | **76.5**        | 73.7            | -          | 76.4          | -           |
| RealWorldQA<sub>avg</sub>      | 70.3            | 62.6            | **71.9**   | 68.5          | -           |
| MME-RealWorld<sub>en</sub>     | **61.6**        | 55.6            | -          | 57.4          | -           |
| MM-MT-Bench                    | 6.0             | 5.0             | -          | **6.3**       | -           |
| AI2D                           | 83.2            | 79.5            | **85.8**   | 83.9          | -           |
| TextVQA<sub>val</sub>          | 84.4            | 79.8            | 83.2       | **84.9**      | -           |
| DocVQA<sub>test</sub>          | 95.2            | 93.3            | 93.5       | **95.7**      | -           |
| ChartQA<sub>test Avg</sub>     | 85.3            | 82.8            | 84.9       | **87.3**      | -           |
| OCRBench_V2<sub>en</sub>       | **57.8**        | 51.7            | -          | 56.3          | -           |

| Dataset                  | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Qwen2.5-VL-7B | Grounding DINO | Gemini 1.5 Pro |
|--------------------------|-----------------|-----------------|---------------|----------------|----------------|
| Refcoco<sub>val</sub>    | 90.5            | 88.7            | 90.0          | **90.6**       | 73.2           |
| Refcoco<sub>textA</sub>  | **93.5**        | 91.8            | 92.5          | 93.2           | 72.9           |
| Refcoco<sub>textB</sub>  | 86.6            | 84.0            | 85.4          | **88.2**       | 74.6           |
| Refcoco+<sub>val</sub>   | 85.4            | 81.1            | 84.2          | **88.2**       | 62.5           |
| Refcoco+<sub>textA</sub> | **91.0**        | 87.5            | 89.1          | 89.0           | 63.9           |
| Refcoco+<sub>textB</sub> | **79.3**        | 73.2            | 76.9          | 75.9           | 65.0           |
| Refcocog+<sub>val</sub>  | **87.4**        | 85.0            | 87.2          | 86.1           | 75.2           |
| Refcocog+<sub>test</sub> | **87.9**        | 85.1            | 87.2          | 87.0           | 76.2           |
| ODinW                    | 42.4            | 39.2            | 37.3          | **55.0**       | 36.7           |
| PointGrounding           | 66.5            | 46.2            | **67.3**      | -              | -              |
</details>

<details>
<summary>Video (without audio) -> Text</summary>

| Dataset                     | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Other Best | Qwen2.5-VL-7B | GPT-4o-mini |
|-----------------------------|-----------------|-----------------|------------|---------------|-------------|
| Video-MME<sub>w/o sub</sub> | 64.3            | 62.0            | 63.9       | **65.1**      | 64.8        |
| Video-MME<sub>w sub</sub>   | **72.4**        | 68.6            | 67.9       | 71.6          | -           |
| MVBench                     | **70.3**        | 68.7            | 67.2       | 69.6          | -           |
| EgoSchema<sub>test</sub>    | **68.6**        | 61.4            | 63.2       | 65.0          | -           |
</details>
542 |
|
543 |
<details>
|
|
|
555 |
<td class="tg-9j4x" colspan="3">Content Consistency</td>
|
556 |
</tr>
|
557 |
<tr>
|
558 |
+
<td class="tg-0lax" rowspan="11">SEED<br>test-zh | test-en | test-hard </td>
|
559 |
<td class="tg-0lax">Seed-TTS_ICL</td>
|
560 |
<td class="tg-0lax">1.11 | 2.24 | 7.58</td>
|
561 |
</tr>
|
|
|
583 |
<td class="tg-0lax">CosyVoice 2-S</td>
|
584 |
<td class="tg-0lax">1.45 | 2.38 | 8.08</td>
|
585 |
</tr>
|
586 |
+
<tr>
|
587 |
+
<td class="tg-0lax">Qwen2.5-Omni-3B_ICL</td>
|
588 |
+
<td class="tg-0lax">1.95 | 2.87 | 9.92</td>
|
589 |
+
</tr>
|
590 |
+
<tr>
|
591 |
+
<td class="tg-0lax">Qwen2.5-Omni-3B_RL</td>
|
592 |
+
<td class="tg-0lax">1.58 | 2.51 | 7.86</td>
|
593 |
+
</tr>
|
594 |
<tr>
|
595 |
<td class="tg-0lax">Qwen2.5-Omni-7B_ICL</td>
|
596 |
<td class="tg-0lax">1.70 | 2.72 | 7.97</td>
|
|
|
603 |
<td class="tg-9j4x" colspan="3">Speaker Similarity</td>
|
604 |
</tr>
|
605 |
<tr>
|
606 |
+
<td class="tg-0lax" rowspan="11">SEED<br>test-zh | test-en | test-hard </td>
|
607 |
<td class="tg-0lax">Seed-TTS_ICL</td>
|
608 |
<td class="tg-0lax">0.796 | 0.762 | 0.776</td>
|
609 |
</tr>
|
|
|
631 |
<td class="tg-0lax">CosyVoice 2-S</td>
|
632 |
<td class="tg-0lax">0.753 | 0.654 | 0.732</td>
|
633 |
</tr>
|
634 |
+
<tr>
|
635 |
+
<td class="tg-0lax">Qwen2.5-Omni-3B_ICL</td>
|
636 |
+
<td class="tg-0lax">0.741 | 0.635 | 0.748</td>
|
637 |
+
</tr>
|
638 |
+
<tr>
|
639 |
+
<td class="tg-0lax">Qwen2.5-Omni-3B_RL</td>
|
640 |
+
<td class="tg-0lax">0.744 | 0.635 | 0.746</td>
|
641 |
+
</tr>
|
642 |
<tr>
|
643 |
<td class="tg-0lax">Qwen2.5-Omni-7B_ICL</td>
|
644 |
<td class="tg-0lax">0.752 | 0.632 | 0.747</td>
|
|
|

<details>
<summary>Text -> Text</summary>

| Dataset                           | Qwen2.5-Omni-7B | Qwen2.5-Omni-3B | Qwen2.5-7B | Qwen2.5-3B | Qwen2-7B | Llama3.1-8B | Gemma2-9B |
|-----------------------------------|-----------------|-----------------|------------|------------|----------|-------------|-----------|
| MMLU-Pro                          | 47.0            | 40.4            | **56.3**   | 43.7       | 44.1     | 48.3        | 52.1      |
| MMLU-redux                        | 71.0            | 60.9            | **75.4**   | 64.4       | 67.3     | 67.2        | 72.8      |
| LiveBench<sub>0831</sub>          | 29.6            | 22.3            | **35.9**   | 26.8       | 29.2     | 26.7        | 30.6      |
| GPQA                              | 30.8            | 34.3            | **36.4**   | 30.3       | 34.3     | 32.8        | 32.8      |
| MATH                              | 71.5            | 63.6            | **75.5**   | 65.9       | 52.9     | 51.9        | 44.3      |
| GSM8K                             | 88.7            | 82.6            | **91.6**   | 86.7       | 85.7     | 84.5        | 76.7      |
| HumanEval                         | 78.7            | 70.7            | **84.8**   | 74.4       | 79.9     | 72.6        | 68.9      |
| MBPP                              | 73.2            | 70.4            | **79.2**   | 72.7       | 67.2     | 69.6        | 74.9      |
| MultiPL-E                         | 65.8            | 57.6            | **70.4**   | 60.2       | 59.1     | 50.7        | 53.4      |
| LiveCodeBench<sub>2305-2409</sub> | 24.6            | 16.5            | **28.7**   | 19.9       | 23.9     | 8.3         | 18.9      |
</details>

## Quickstart

Below, we provide simple examples showing how to use Qwen2.5-Omni with 🤗 Transformers. The Qwen2.5-Omni code is included in the latest Hugging Face transformers, and we advise you to build from source with the following commands:
```
pip uninstall transformers
pip install git+https://github.com/huggingface/transformers@v4.51.3-Qwen2.5-Omni-preview
pip install accelerate
```
Otherwise, you might encounter the following error:
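
After installing the preview build, a minimal end-to-end call looks roughly like the sketch below. The class names (`Qwen2_5OmniForConditionalGeneration`, `Qwen2_5OmniProcessor`), the `(text_ids, audio)` return value of `generate`, and the 24 kHz output rate are assumptions based on the preview branch, not guarantees; check the full Quickstart in this README and the model card for the exact interface.

```
# Sketch only: class names, the (text_ids, audio) return value, and the 24 kHz
# sample rate are assumptions based on the Qwen2.5-Omni preview branch.
import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

# Build a chat-style prompt; audio/image/video inputs would be added to "content".
conversation = [
    {"role": "user", "content": [{"type": "text", "text": "Briefly introduce yourself."}]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = processor(text=text, return_tensors="pt", padding=True).to(model.device)

# The Omni models can return speech alongside text; keep both outputs.
text_ids, audio = model.generate(**inputs)
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
sf.write("output.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)
```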

<details>
<summary>Minimum GPU memory requirements</summary>

| Model           | Precision | 15(s) Video | 30(s) Video     | 60(s) Video     |
|-----------------|-----------|-------------|-----------------|-----------------|
| Qwen2.5-Omni-3B | FP32      | 89.10 GB    | Not Recommended | Not Recommended |
| Qwen2.5-Omni-3B | BF16      | 18.38 GB    | 22.43 GB        | 28.22 GB        |
| Qwen2.5-Omni-7B | FP32      | 93.56 GB    | Not Recommended | Not Recommended |
| Qwen2.5-Omni-7B | BF16      | 31.11 GB    | 41.85 GB        | 60.19 GB        |

Note: The table above presents the theoretical minimum memory requirements for inference with `transformers`; the `BF16` rows were tested with `attn_implementation="flash_attention_2"`. In practice, the actual memory usage is typically at least 1.2 times higher. For more information, see the linked resource [here](https://huggingface.co/docs/accelerate/main/en/usage_guides/model_size_estimator).
</details>
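
The BF16 rows above assume FlashAttention-2. A loading configuration along those lines might look like the following sketch; the class name is assumed as in the Quickstart sketch above, and the `flash-attn` package must be installed separately.

```
# Sketch: load in BF16 with FlashAttention-2, the setting assumed by the BF16 rows above.
# The class name is an assumption based on the preview branch; flash-attn must be installed.
import torch
from transformers import Qwen2_5OmniForConditionalGeneration

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-Omni-3B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```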