---
datasets:
- espnet/yodas_owsmv4
language: multilingual
library_name: espnet
license: cc-by-4.0
metrics:
- cer
- bleu
- accuracy
tags:
- espnet
- audio
- automatic-speech-recognition
- speech-translation
- language-identification
pipeline_tag: automatic-speech-recognition
---
[OWSM-CTC](https://aclanthology.org/2024.acl-long.549/) (Peng et al., ACL 2024) is an encoder-only speech foundation model based on hierarchical multi-task self-conditioned CTC.
It follows the design of the [Open Whisper-style Speech Model (OWSM)](https://www.wavlab.org/activities/2024/owsm/) project.
[OWSM-CTC v4](https://huggingface.co/papers/2506.00338) is trained for three epochs on 320k hours of public audio data covering multilingual speech recognition, any-to-any speech translation, and language identification.
The newly curated data will be publicly released. Please stay tuned!
To use the pre-trained model, please install `espnet` and `espnet_model_zoo`. The requirements are:
```
librosa
torch
espnet
espnet_model_zoo
```
**The recipe can be found in ESPnet:** https://github.com/espnet/espnet/tree/master/egs2/owsm_ctc_v3.1/s2t1
### Example script for batched inference
`Speech2TextGreedySearch` now provides a unified batched inference method `batch_decode`. It performs CTC greedy decoding for a batch of short-form or long-form audio inputs. If an input is shorter than 30s, it will be padded to 30s; otherwise it will be split into overlapping segments (the same strategy as the "long-form ASR/ST" method below).
```python
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch
s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v4_1B",
    device="cuda",
    use_flash_attn=False,  # set to True for better efficiency if flash attn is installed and dtype is float16 or bfloat16
    lang_sym='<eng>',
    task_sym='<asr>',
)

res = s2t.batch_decode(
    "audio.wav",  # a single audio (path or 1-D array/tensor) as input
    batch_size=16,
    context_len_in_secs=4,
)  # res is a single str, i.e., the predicted text without special tokens

res = s2t.batch_decode(
    ["audio1.wav", "audio2.wav", "audio3.wav"],  # a list of audios as input
    batch_size=16,
    context_len_in_secs=4,
)  # res is a list of str
# Please check the code of `batch_decode` for all supported inputs
```
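The same `batch_decode` interface can also be used for speech translation by changing the language and task tokens. The snippet below is a minimal sketch: the token names `<jpn>` (source language) and `<st_eng>` (translate into English) follow the convention used in other OWSM releases and should be verified against this model's token list, and `audio_jpn.wav` is a hypothetical input file.

```python
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch

# Sketch: translate Japanese speech into English text (ST).
# NOTE: the token names below (<jpn>, <st_eng>) are assumptions based on the
# general OWSM token convention; please check them against this model's vocabulary.
s2t_st = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v4_1B",
    device="cuda",
    lang_sym='<jpn>',     # language spoken in the input audio
    task_sym='<st_eng>',  # translate into English
)

res = s2t_st.batch_decode(
    "audio_jpn.wav",  # hypothetical input file
    batch_size=16,
    context_len_in_secs=4,
)
print(res)
```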
### Example script for short-form ASR/ST/LID
Our models are trained on 16kHz audio with a fixed duration of 30s. When using the pre-trained model, please ensure the input speech is 16kHz and pad or truncate it to 30s.
```python
import librosa
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch
s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v4_1B",
    device="cuda",
    generate_interctc_outputs=False,
    lang_sym='<eng>',
    task_sym='<asr>',
)
# NOTE: OWSM-CTC is trained on 16kHz audio with a fixed 30s duration. Please ensure your input has the correct sample rate; otherwise resample it to 16k before feeding it to the model
speech, rate = librosa.load("xxx.wav", sr=16000)
speech = librosa.util.fix_length(speech, size=(16000 * 30))
res = s2t(speech)[0]
print(res)
```
### Example script for long-form ASR/ST
```python
import soundfile as sf
import torch
from espnet2.bin.s2t_inference_ctc import Speech2TextGreedySearch
context_len_in_secs = 4 # left and right context when doing buffered inference
batch_size = 32 # depends on the GPU memory
s2t = Speech2TextGreedySearch.from_pretrained(
    "espnet/owsm_ctc_v4_1B",
    device='cuda' if torch.cuda.is_available() else 'cpu',
    generate_interctc_outputs=False,
    lang_sym='<eng>',
    task_sym='<asr>',
)

speech, rate = sf.read("xxx.wav")

text = s2t.decode_long_batched_buffered(
    speech,
    batch_size=batch_size,
    context_len_in_secs=context_len_in_secs,
)
print(text)
```
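Note that `soundfile` reads the audio at its native sample rate and does not resample. If your file is not already 16 kHz, resample it before calling `decode_long_batched_buffered`. Below is a minimal sketch using `librosa` (already listed in the requirements), assuming a mono file named `xxx.wav`.

```python
import librosa
import soundfile as sf

speech, rate = sf.read("xxx.wav")  # assumes a mono file; convert to mono first if needed
if rate != 16000:
    # resample to the 16 kHz rate the model was trained on
    speech = librosa.resample(speech, orig_sr=rate, target_sr=16000)
    rate = 16000
```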
### Example of CTC forced alignment using `ctc-segmentation`
CTC segmentation can be applied efficiently to audio of arbitrary length.
```python
import soundfile as sf
from espnet2.bin.s2t_ctc_align import CTCSegmentation
from espnet_model_zoo.downloader import ModelDownloader
# Download model first
d = ModelDownloader()
downloaded = d.download_and_unpack("espnet/owsm_ctc_v4_1B")
aligner = CTCSegmentation(
    **downloaded,
    fs=16000,
    ngpu=1,
    batch_size=32,  # batched parallel decoding; reduce it if your GPU memory is smaller
    kaldi_style_text=True,
    time_stamps="auto",  # "auto" can be more accurate than "fixed" when converting token index to timestamp
    lang_sym="<eng>",
    task_sym="<asr>",
    context_len_in_secs=2,  # left and right context in buffered decoding
)

speech, rate = sf.read("./test_utils/ctc_align_test.wav")
print(f"speech duration: {len(speech) / rate:.2f} seconds")
text = """
utt1 THE SALE OF THE HOTELS
utt2 IS PART OF HOLIDAY'S STRATEGY
utt3 TO SELL OFF ASSETS
utt4 AND CONCENTRATE ON PROPERTY MANAGEMENT
"""
segments = aligner(speech, text)
print(segments)
```
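`print(segments)` produces Kaldi-style segment lines (utterance id, file id, start time, end time, confidence score, text). If you prefer to work with the alignments as Python objects, the sketch below parses that printed form; it assumes the output follows this layout, so please adapt it if your version formats the segments differently.

```python
# Parse the Kaldi-style lines printed above into Python tuples.
# Assumed line layout: "<utt_id> <file_id> <start> <end> <score> <text ...>"
alignments = []
for line in str(segments).strip().splitlines():
    utt_id, file_id, start, end, score, text = line.split(maxsplit=5)
    alignments.append((utt_id, float(start), float(end), float(score), text))

for utt_id, start, end, score, text in alignments:
    print(f"{utt_id}: {start:.2f}s - {end:.2f}s  {text}")
```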
### OWSM series
#### Encoder-decoder OWSM
| Name | Size | Hugging Face Repo |
| :--- | ---: | :---------------- |
| OWSM v3.1 base | 101M | https://huggingface.co/espnet/owsm_v3.1_ebf_base |
| OWSM v3.1 small | 367M | https://huggingface.co/espnet/owsm_v3.1_ebf_small |
| OWSM v3.1 medium | 1.02B | https://huggingface.co/espnet/owsm_v3.1_ebf |
| OWSM v3.2 small | 367M | https://huggingface.co/espnet/owsm_v3.2 |
| OWSM v4 base | 102M | https://huggingface.co/espnet/owsm_v4_base_102M |
| OWSM v4 small | 370M | https://huggingface.co/espnet/owsm_v4_small_370M |
| OWSM v4 medium | 1.02B | https://huggingface.co/espnet/owsm_v4_medium_1B |
#### CTC-based OWSM
| Name | Size | Hugging Face Repo |
| :--- | ---: | :---------------- |
| OWSM-CTC v3.1 medium | 1.01B | https://huggingface.co/espnet/owsm_ctc_v3.1_1B |
| OWSM-CTC v3.2 medium | 1.01B | https://huggingface.co/espnet/owsm_ctc_v3.2_ft_1B |
| OWSM-CTC v4 medium | 1.01B | https://huggingface.co/espnet/owsm_ctc_v4_1B |
### Citations
#### OWSM v4
```BibTex
@inproceedings{owsm-v4,
  title={{OWSM} v4: Improving Open Whisper-Style Speech Models via Data Scaling and Cleaning},
  author={Yifan Peng and Shakeel Muhammad and Yui Sudo and William Chen and Jinchuan Tian and Chyi-Jiunn Lin and Shinji Watanabe},
  booktitle={Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH) (accepted)},
  year={2025},
}
```
#### OWSM-CTC
```BibTex
@inproceedings{owsm-ctc,
  title={{OWSM}-{CTC}: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification},
  author={Peng, Yifan and Sudo, Yui and Shakeel, Muhammad and Watanabe, Shinji},
  booktitle={Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)},
  year={2024},
  month={8},
  url={https://aclanthology.org/2024.acl-long.549},
}
```
#### OWSM v3.1 and v3.2
```BibTex
@inproceedings{owsm-v32,
  title={On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models},
  author={Jinchuan Tian and Yifan Peng and William Chen and Kwanghee Choi and Karen Livescu and Shinji Watanabe},
  booktitle={Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH)},
  year={2024},
  month={9},
  pdf={https://arxiv.org/pdf/2406.09282},
}
@inproceedings{owsm-v31,
  title={{OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer}},
  author={Yifan Peng and Jinchuan Tian and William Chen and Siddhant Arora and Brian Yan and Yui Sudo and Muhammad Shakeel and Kwanghee Choi and Jiatong Shi and Xuankai Chang and Jee-weon Jung and Shinji Watanabe},
  booktitle={Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH)},
  year={2024},
  month={9},
  pdf={https://arxiv.org/pdf/2401.16658},
}
```
#### Initial OWSM (v1, v2, v3)
```BibTex
@inproceedings{owsm,
  title={Reproducing Whisper-Style Training Using An Open-Source Toolkit And Publicly Available Data},
  author={Yifan Peng and Jinchuan Tian and Brian Yan and Dan Berrebbi and Xuankai Chang and Xinjian Li and Jiatong Shi and Siddhant Arora and William Chen and Roshan Sharma and Wangyou Zhang and Yui Sudo and Muhammad Shakeel and Jee-weon Jung and Soumi Maiti and Shinji Watanabe},
  booktitle={Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  year={2023},
  month={12},
  pdf={https://arxiv.org/pdf/2309.13876},
}
```