Add pipeline tag and link to paper (#1)
Add pipeline tag and link to paper (5fce1cd2af86d44e21770dd970b02198670ff195)
Co-authored-by: Niels Rogge <[email protected]>
README.md
CHANGED
@@ -1,25 +1,26 @@
 ---
-tags:
-- espnet
-- audio
-- automatic-speech-recognition
-- speech-translation
-- language-identification
-language: multilingual
 datasets:
 - espnet/yodas_owsmv4
+language: multilingual
+library_name: espnet
 license: cc-by-4.0
 metrics:
 - cer
 - bleu
 - accuracy
-
+tags:
+- espnet
+- audio
+- automatic-speech-recognition
+- speech-translation
+- language-identification
+pipeline_tag: automatic-speech-recognition
 ---
 
 [OWSM-CTC](https://aclanthology.org/2024.acl-long.549/) (Peng et al., ACL 2024) is an encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC.
 It follows the design of the project, [Open Whisper-style Speech Model (OWSM)](https://www.wavlab.org/activities/2024/owsm/).
 
-[OWSM-CTC v4](https://
+[OWSM-CTC v4](https://huggingface.co/papers/2506.00338) is trained for three epochs on 320k hours of public audio data covering multilingual speech recognition, any-to-any speech translation, and language identification.
 The newly curated data will be publicly released. Please stay tuned!
 
 To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:
|
@@ -32,8 +33,6 @@ espnet_model_zoo
 
 **The recipe can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/owsm_ctc_v3.1/s2t1
 
-
-
 ### Example script for batched inference
 
 `Speech2TextGreedySearch` now provides a unified batched inference method `batch_decode`. It performs CTC greedy decoding for a batch of short-form or long-form audios. If an audio is shorter than 30s, it will be padded to 30s; otherwise it will be split into overlapped segments (same as the "long-form ASR/ST" method below).
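The pad-or-split behaviour described above can be sketched in plain NumPy. This is a minimal illustration, not the ESPnet implementation; the 16 kHz sample rate matches OWSM's input format, while the symmetric 4-second context overlap is an assumption chosen to mirror the `context_len_in_secs` parameter mentioned in the README's long-form example.

```python
import numpy as np

SAMPLE_RATE = 16000              # OWSM consumes 16 kHz audio
SEGMENT_LEN = 30 * SAMPLE_RATE   # fixed 30 s decoding window

def split_or_pad(speech: np.ndarray, context_len_in_secs: float = 4.0):
    """Pad short audio to 30 s; cut long audio into overlapped 30 s windows.

    The overlap of `context_len_in_secs` on each side is illustrative only.
    """
    if len(speech) <= SEGMENT_LEN:
        # short-form: right-pad with zeros to exactly 30 s
        return [np.pad(speech, (0, SEGMENT_LEN - len(speech)))]
    # long-form: hop so consecutive windows share context on both sides
    hop = SEGMENT_LEN - 2 * int(context_len_in_secs * SAMPLE_RATE)
    segments = []
    for start in range(0, len(speech), hop):
        seg = speech[start:start + SEGMENT_LEN]
        if len(seg) < SEGMENT_LEN:
            seg = np.pad(seg, (0, SEGMENT_LEN - len(seg)))
        segments.append(seg)
        if start + SEGMENT_LEN >= len(speech):
            break
    return segments

short_clip = np.zeros(10 * SAMPLE_RATE)  # 10 s -> one padded 30 s segment
long_clip = np.zeros(60 * SAMPLE_RATE)   # 60 s -> overlapped 30 s segments
print(len(split_or_pad(short_clip)[0]))  # 480000 samples = 30 s
print(len(split_or_pad(long_clip)))      # 3
```

In the real API, `batch_decode` handles this segmentation internally and merges the per-segment hypotheses back into one transcript.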
|
@@ -157,8 +156,6 @@ segments = aligner(speech, text)
 print(segments)
 ```
 
-
-
 ### OWSM series
 
 #### Encoder-decoder OWSM
|
@@ -173,7 +170,6 @@ print(segments)
 | OWSM v4 small | 370M | https://huggingface.co/espnet/owsm_v4_small_370M |
 | OWSM v4 medium | 1.02B | https://huggingface.co/espnet/owsm_v4_medium_1B |
 
-
 #### CTC-based OWSM
 
 | Name | Size | Hugging Face Repo |
|
@@ -182,8 +178,6 @@ print(segments)
 | OWSM-CTC v3.2 medium | 1.01B | https://huggingface.co/espnet/owsm_ctc_v3.2_ft_1B |
 | OWSM-CTC v4 medium | 1.01B | https://huggingface.co/espnet/owsm_ctc_v4_1B |
 
-
-
 ### Citations
 
 #### OWSM v4