Update README.md

README.md (changed)
@@ -44,22 +44,68 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
 ### Performance
 
 <details>
-<summary>
-
-
-
-
-
-
-
-
-
-
-
-
 </details>
 
 <details>
 <summary>Audio -> Text</summary>
 
@@ -451,67 +497,6 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
 | EgoSchema<sub>test</sub> | **69.6** | 63.2 | 65.0 | - |
 </details>
 
-<details>
-<summary>Multimodality -> Text</summary>
-
-<style type="text/css">
-.tg {border-collapse:collapse;border-spacing:0;}
-.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
-overflow:hidden;padding:10px 5px;word-break:normal;}
-.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
-font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
-.tg .tg-0lax{text-align:left;vertical-align:top}
-</style>
-<table class=""><thead>
-<tr>
-<th class="tg-0lax">Datasets</th>
-<th class="tg-0lax">Model</th>
-<th class="tg-0lax">Performance</th>
-</tr></thead>
-<tbody>
-<tr>
-<td class="tg-0lax" rowspan="10">OmniBench<br>Speech | Sound Event | Music | Avg</td>
-<td class="tg-0lax">Gemini-1.5-Pro</td>
-<td class="tg-0lax">42.67%|42.26%|46.23%|42.91%</td>
-</tr>
-<tr>
-<td class="tg-0lax">MIO-Instruct</td>
-<td class="tg-0lax">36.96%|33.58%|11.32%|33.80%</td>
-</tr>
-<tr>
-<td class="tg-0lax">AnyGPT (7B)</td>
-<td class="tg-0lax">17.77%|20.75%|13.21%|18.04%</td>
-</tr>
-<tr>
-<td class="tg-0lax">video-SALMONN</td>
-<td class="tg-0lax">34.11%|31.70%|<span style="font-weight:bold">56.60%</span>|35.64%</td>
-</tr>
-<tr>
-<td class="tg-0lax">UnifiedIO2-xlarge</td>
-<td class="tg-0lax">39.56%|36.98%|29.25%|38.00%</td>
-</tr>
-<tr>
-<td class="tg-0lax">UnifiedIO2-xxlarge</td>
-<td class="tg-0lax">34.24%|36.98%|29.25%|38.00%</td>
-</tr>
-<tr>
-<td class="tg-0lax">MiniCPM-o</td>
-<td class="tg-0lax">34.24%|36.98%|24.53%|33.98%</td>
-</tr>
-<tr>
-<td class="tg-0lax">Baichuan-Omni-1.5</td>
-<td class="tg-0lax">-|-|-|40.50%</td>
-</tr>
-<tr>
-<td class="tg-0lax">Qwen2-Audio</td>
-<td class="tg-0lax">-|-|-|42.90%</td>
-</tr>
-<tr>
-<td class="tg-0lax">Qwen2.5-Omni-7B</td>
-<td class="tg-0lax"><span style="font-weight:bold">55.25%</span>|<span style="font-weight:bold">60.00%</span>|52.83%|<span style="font-weight:bold">56.13%</span></td>
-</tr>
-</tbody></table>
-</details>
 
 <details>
 <summary>Zero-shot Speech Generation</summary>
@@ -615,6 +600,23 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
 </tbody></table>
 </details>
 
 ## Quickstart
 
 Below, we provide simple examples showing how to use Qwen2.5-Omni with 🤗 Transformers. The Qwen2.5-Omni code for Hugging Face Transformers is still at the pull-request stage and has not been merged into the main branch yet, so you may need to build Transformers from source with the following command:
@@ -886,4 +888,21 @@ model = Qwen2_5OmniModel.from_pretrained(
 )
 ```
 
 <br>
@@ -44,22 +44,68 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
 ### Performance
 
 <details>
+<summary>Multimodality -> Text</summary>
 
+<style type="text/css">
+.tg {border-collapse:collapse;border-spacing:0;}
+.tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
+overflow:hidden;padding:10px 5px;word-break:normal;}
+.tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px;
+font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;}
+.tg .tg-0lax{text-align:left;vertical-align:top}
+</style>
+<table class=""><thead>
+<tr>
+<th class="tg-0lax">Datasets</th>
+<th class="tg-0lax">Model</th>
+<th class="tg-0lax">Performance</th>
+</tr></thead>
+<tbody>
+<tr>
+<td class="tg-0lax" rowspan="10">OmniBench<br>Speech | Sound Event | Music | Avg</td>
+<td class="tg-0lax">Gemini-1.5-Pro</td>
+<td class="tg-0lax">42.67%|42.26%|46.23%|42.91%</td>
+</tr>
+<tr>
+<td class="tg-0lax">MIO-Instruct</td>
+<td class="tg-0lax">36.96%|33.58%|11.32%|33.80%</td>
+</tr>
+<tr>
+<td class="tg-0lax">AnyGPT (7B)</td>
+<td class="tg-0lax">17.77%|20.75%|13.21%|18.04%</td>
+</tr>
+<tr>
+<td class="tg-0lax">video-SALMONN</td>
+<td class="tg-0lax">34.11%|31.70%|<span style="font-weight:bold">56.60%</span>|35.64%</td>
+</tr>
+<tr>
+<td class="tg-0lax">UnifiedIO2-xlarge</td>
+<td class="tg-0lax">39.56%|36.98%|29.25%|38.00%</td>
+</tr>
+<tr>
+<td class="tg-0lax">UnifiedIO2-xxlarge</td>
+<td class="tg-0lax">34.24%|36.98%|29.25%|38.00%</td>
+</tr>
+<tr>
+<td class="tg-0lax">MiniCPM-o</td>
+<td class="tg-0lax">34.24%|36.98%|24.53%|33.98%</td>
+</tr>
+<tr>
+<td class="tg-0lax">Baichuan-Omni-1.5</td>
+<td class="tg-0lax">-|-|-|40.50%</td>
+</tr>
+<tr>
+<td class="tg-0lax">Qwen2-Audio</td>
+<td class="tg-0lax">-|-|-|42.90%</td>
+</tr>
+<tr>
+<td class="tg-0lax">Qwen2.5-Omni-7B</td>
+<td class="tg-0lax"><span style="font-weight:bold">55.25%</span>|<span style="font-weight:bold">60.00%</span>|52.83%|<span style="font-weight:bold">56.13%</span></td>
+</tr>
+</tbody></table>
 </details>
 
+
 <details>
 <summary>Audio -> Text</summary>
 
@@ -451,67 +497,6 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
 | EgoSchema<sub>test</sub> | **69.6** | 63.2 | 65.0 | - |
 </details>
 
 
 <details>
 <summary>Zero-shot Speech Generation</summary>
@@ -615,6 +600,23 @@ Qwen2.5-Omni is an end-to-end multimodal model designed to perceive diverse moda
 </tbody></table>
 </details>
 
+<details>
+<summary>Text -> Text</summary>
+
+| Dataset | Qwen2.5-Omni-7B | Qwen2.5-7B | Qwen2-7B | Llama3.1-8B | Gemma2-9B |
+|-----------------------------------|-----------|------------|----------|-------------|-----------|
+| MMLU-Pro | 47.0 | **56.3** | 44.1 | 48.3 | 52.1 |
+| MMLU-redux | 71.0 | **75.4** | 67.3 | 67.2 | 72.8 |
+| LiveBench<sub>0831</sub> | 29.6 | **35.9** | 29.2 | 26.7 | 30.6 |
+| GPQA | 30.8 | **36.4** | 34.3 | 32.8 | 32.8 |
+| MATH | 71.5 | **75.5** | 52.9 | 51.9 | 44.3 |
+| GSM8K | 88.7 | **91.6** | 85.7 | 84.5 | 76.7 |
+| HumanEval | 79.9 | **84.8** | 79.9 | 72.6 | 68.9 |
+| MBPP | 73.7 | **79.2** | 67.2 | 69.6 | 74.9 |
+| MultiPL-E | 67.0 | **70.4** | 59.1 | 50.7 | 53.4 |
+| LiveCodeBench<sub>2305-2409</sub> | 25.2 | **28.7** | 23.9 | 8.3 | 18.9 |
+</details>
+
 ## Quickstart
 
 Below, we provide simple examples showing how to use Qwen2.5-Omni with 🤗 Transformers. The Qwen2.5-Omni code for Hugging Face Transformers is still at the pull-request stage and has not been merged into the main branch yet, so you may need to build Transformers from source with the following command:
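Until that pull request lands, it can be useful to check whether the Transformers build you already have ships the Qwen2.5-Omni classes before rebuilding. A minimal sketch (our own helper, not part of the README; it only probes for the `Qwen2_5OmniModel` name used in the README's snippets):

```python
import importlib


def has_qwen2_5_omni_support() -> bool:
    """Return True if the installed transformers package exposes Qwen2_5OmniModel.

    False means the package is missing or predates the Qwen2.5-Omni pull
    request, i.e. a from-source build is still required.
    """
    try:
        transformers = importlib.import_module("transformers")
    except ImportError:
        return False
    return hasattr(transformers, "Qwen2_5OmniModel")
```

If this returns `False`, install Transformers from the pull-request source as described above and re-run the check.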
@@ -886,4 +888,21 @@ model = Qwen2_5OmniModel.from_pretrained(
 )
 ```
 
+
+<!-- ## Citation
+
+If you find our paper and code useful in your research, please consider giving a star :star: and a citation :pencil: :)
+
+
+
+```BibTeX
+
+@article{Qwen2.5-Omni,
+  title={Qwen2.5-Omni Technical Report},
+  author={},
+  journal={arXiv preprint arXiv:2502.13923},
+  year={2025}
+}
+
+``` -->
+
 <br>