Fhrozen and nielsr (HF Staff) committed
Commit 0f96e13 · verified · Parent(s): 36cbf93

Add pipeline tag and link to paper (#1)

- Add pipeline tag and link to paper (5fce1cd2af86d44e21770dd970b02198670ff195)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1)
  1. README.md (+10 -16)
README.md CHANGED

````diff
@@ -1,25 +1,26 @@
 ---
-tags:
-- espnet
-- audio
-- automatic-speech-recognition
-- speech-translation
-- language-identification
-language: multilingual
 datasets:
 - espnet/yodas_owsmv4
+language: multilingual
+library_name: espnet
 license: cc-by-4.0
 metrics:
 - cer
 - bleu
 - accuracy
-library_name: espnet
+tags:
+- espnet
+- audio
+- automatic-speech-recognition
+- speech-translation
+- language-identification
+pipeline_tag: automatic-speech-recognition
 ---
 
 [OWSM-CTC](https://aclanthology.org/2024.acl-long.549/) (Peng et al., ACL 2024) is an encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC.
 It follows the design of the project, [Open Whisper-style Speech Model (OWSM)](https://www.wavlab.org/activities/2024/owsm/).
 
-[OWSM-CTC v4](https://arxiv.org/abs/2506.00338) is trained for three epochs on 320k hours of public audio data covering multilingual speech recognition, any-to-any speech translation, and language identification.
+[OWSM-CTC v4](https://huggingface.co/papers/2506.00338) is trained for three epochs on 320k hours of public audio data covering multilingual speech recognition, any-to-any speech translation, and language identification.
 The newly curated data will be publicly released. Please stay tuned!
 
 To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:
@@ -32,8 +33,6 @@ espnet_model_zoo
 
 **The recipe can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/owsm_ctc_v3.1/s2t1
 
-
-
 ### Example script for batched inference
 
 `Speech2TextGreedySearch` now provides a unified batched inference method `batch_decode`. It performs CTC greedy decoding for a batch of short-form or long-form audios. If an audio is shorter than 30s, it will be padded to 30s; otherwise it will be split into overlapped segments (same as the "long-form ASR/ST" method below).
@@ -157,8 +156,6 @@ segments = aligner(speech, text)
 print(segments)
 ```
 
-
-
 ### OWSM series
 
 #### Encoder-decoder OWSM
@@ -173,7 +170,6 @@ print(segments)
 | OWSM v4 small | 370M | https://huggingface.co/espnet/owsm_v4_small_370M |
 | OWSM v4 medium | 1.02B | https://huggingface.co/espnet/owsm_v4_medium_1B |
 
-
 #### CTC-based OWSM
 
 | Name | Size | Hugging Face Repo |
@@ -182,8 +178,6 @@ print(segments)
 | OWSM-CTC v3.2 medium | 1.01B | https://huggingface.co/espnet/owsm_ctc_v3.2_ft_1B |
 | OWSM-CTC v4 medium | 1.01B | https://huggingface.co/espnet/owsm_ctc_v4_1B |
 
-
-
 ### Citations
 
 #### OWSM v4
````
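
The README text carried through this diff describes `batch_decode`'s input handling: audio shorter than 30 s is padded to 30 s, and longer audio is split into overlapping segments. A minimal sketch of that pad-or-segment rule is below; the function name, the `context_secs` parameter, and the exact hop arithmetic are illustrative assumptions, not espnet's actual implementation.

```python
# Sketch of the pad-or-segment rule described for batch_decode:
# short audio is zero-padded to one 30 s window; long audio is cut into
# 30 s windows overlapping by 2 * context_secs. Names are illustrative,
# not espnet's API.

def pad_or_segment(samples, sample_rate=16000, segment_secs=30.0, context_secs=4.0):
    """Return a list of equal-length 30 s segments covering `samples`."""
    seg_len = int(segment_secs * sample_rate)
    if len(samples) <= seg_len:
        # Short-form: zero-pad to exactly one segment.
        return [samples + [0.0] * (seg_len - len(samples))]
    # Long-form: hop forward by (segment - 2 * context) samples so that
    # consecutive windows overlap by 2 * context_secs seconds.
    hop = seg_len - 2 * int(context_secs * sample_rate)
    segments = []
    for start in range(0, len(samples), hop):
        chunk = samples[start:start + seg_len]
        segments.append(chunk + [0.0] * (seg_len - len(chunk)))
        if start + seg_len >= len(samples):
            break  # this window already reaches the end of the audio
    return segments
```

Each returned segment would then be decoded independently and the overlapped context trimmed when stitching hypotheses back together.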