Commit 7c45ace by pyf98 · 1 Parent(s): 3de552a
Files changed (23)
  1. README.md +113 -0
  2. data/token_list/bpe_unigram50000/bpe.model +3 -0
  3. exp/s2t_stats_raw_bpe50000/train/feats_stats.npz +3 -0
  4. exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/config.yaml +0 -0
  5. exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/acc.png +0 -0
  6. exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/backward_time.png +0 -0
  7. exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/cer.png +0 -0
  8. exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/cer_ctc.png +0 -0
  9. exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/clip.png +0 -0
  10. exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/forward_time.png +0 -0
  11. exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/gpu_max_cached_mem_GB.png +0 -0
  12. exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/grad_norm.png +0 -0
  13. exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/iter_time.png +0 -0
  14. exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/loss.png +0 -0
  15. exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/loss_att.png +0 -0
  16. exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/loss_ctc.png +0 -0
  17. exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/loss_scale.png +0 -0
  18. exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/optim0_lr0.png +0 -0
  19. exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/optim_step_time.png +0 -0
  20. exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/train_time.png +0 -0
  21. exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/wer.png +0 -0
  22. exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/valid.total_count.ave_5best.pth +3 -0
  23. meta.yaml +8 -0
README.md ADDED
@@ -0,0 +1,113 @@
---
tags:
- espnet
- audio
- automatic-speech-recognition
- speech-translation
- language-identification
language: multilingual
datasets:
- owsm_v4
license: cc-by-4.0
---

## Open Whisper-style Speech Model (OWSM)

OWSM aims to develop fully open speech foundation models using publicly available data and open-source toolkits, including [ESPnet](https://github.com/espnet/espnet).

Inference examples can be found on our [project page](https://www.wavlab.org/activities/2024/owsm/).
The Gradio demo is [here](https://huggingface.co/spaces/pyf98/OWSM_v3_demo).

OWSM v4 is the latest version in the OWSM series. It significantly outperforms OWSM v3.1 in language identification (LID) and multilingual ASR.

This repo contains a small-sized model with 370M parameters, trained on 320k hours of public speech data. It supports the following speech-to-text tasks:
- Language identification
- Speech recognition
- Speech translation
- Utterance-level timestamp prediction
- Long-form transcription

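The tasks above can be run through ESPnet's `Speech2Text` inference interface. A minimal sketch, assuming `espnet` and `soundfile` are installed; the `lang_sym`/`task_sym` arguments follow the pattern used in other OWSM model cards and may need adjusting for this release:

```python
# Sketch: short-form ASR with this checkpoint via ESPnet's S2T inference API.
# Assumption: the constructor kwargs below mirror other OWSM model cards.
MODEL_ID = "espnet/owsm_v4_small_370M"  # this repo's Hugging Face ID

def transcribe(wav_path: str, lang: str = "<eng>", task: str = "<asr>") -> str:
    """Download the model (once) and transcribe a single utterance."""
    import soundfile as sf
    from espnet2.bin.s2t_inference import Speech2Text

    model = Speech2Text.from_pretrained(
        MODEL_ID,
        lang_sym=lang,  # language token, e.g. "<eng>"
        task_sym=task,  # "<asr>" for recognition; translation uses a task token
        beam_size=5,
    )
    speech, _rate = sf.read(wav_path)  # expected: 16 kHz mono
    # Each n-best hypothesis starts with the decoded text.
    text = model(speech)[0][0]
    return text

# Example (downloads the checkpoint, so not run here):
# print(transcribe("speech.wav"))
```

Swapping `lang` selects the output language for multilingual ASR; the same interface covers LID and translation via the condition tokens.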
### OWSM series

#### Encoder-decoder OWSM

| Name | Size | Hugging Face Repo |
| :--- | ---: | :---------------- |
| OWSM v3.1 base | 101M | https://huggingface.co/espnet/owsm_v3.1_ebf_base |
| OWSM v3.1 small | 367M | https://huggingface.co/espnet/owsm_v3.1_ebf_small |
| OWSM v3.1 medium | 1.02B | https://huggingface.co/espnet/owsm_v3.1_ebf |
| OWSM v3.2 small | 367M | https://huggingface.co/espnet/owsm_v3.2 |
| OWSM v4 base | 102M | https://huggingface.co/espnet/owsm_v4_base_102M |
| OWSM v4 small | 370M | https://huggingface.co/espnet/owsm_v4_small_370M |
| OWSM v4 medium | 1.02B | https://huggingface.co/espnet/owsm_v4_medium_1B |

#### CTC-based OWSM

| Name | Size | Hugging Face Repo |
| :--- | ---: | :---------------- |
| OWSM-CTC v3.1 medium | 1.01B | https://huggingface.co/espnet/owsm_ctc_v3.1_1B |
| OWSM-CTC v3.2 medium | 1.01B | https://huggingface.co/espnet/owsm_ctc_v3.2_ft_1B |
| OWSM-CTC v4 medium | 1.01B | https://huggingface.co/espnet/owsm_ctc_v4_1B |

### Citations

#### OWSM v4

```bibtex
Coming soon
```

#### OWSM-CTC

```bibtex
@inproceedings{owsm-ctc,
  title = {{OWSM-CTC}: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification},
  author = {Peng, Yifan and Sudo, Yui and Shakeel, Muhammad and Watanabe, Shinji},
  booktitle = {Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)},
  year = {2024},
  month = {8},
  url = {https://aclanthology.org/2024.acl-long.549},
}
```

#### OWSM v3.1 and v3.2

```bibtex
@inproceedings{owsm-v32,
  title = {On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models},
  author = {Jinchuan Tian and Yifan Peng and William Chen and Kwanghee Choi and Karen Livescu and Shinji Watanabe},
  booktitle = {Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH)},
  year = {2024},
  month = {9},
  pdf = {https://arxiv.org/pdf/2406.09282},
}

@inproceedings{owsm-v31,
  title = {{OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer}},
  author = {Yifan Peng and Jinchuan Tian and William Chen and Siddhant Arora and Brian Yan and Yui Sudo and Muhammad Shakeel and Kwanghee Choi and Jiatong Shi and Xuankai Chang and Jee-weon Jung and Shinji Watanabe},
  booktitle = {Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH)},
  year = {2024},
  month = {9},
  pdf = {https://arxiv.org/pdf/2401.16658},
}
```

#### Initial OWSM (v1, v2, v3)

```bibtex
@inproceedings{owsm,
  title = {Reproducing Whisper-Style Training Using An Open-Source Toolkit And Publicly Available Data},
  author = {Yifan Peng and Jinchuan Tian and Brian Yan and Dan Berrebbi and Xuankai Chang and Xinjian Li and Jiatong Shi and Siddhant Arora and William Chen and Roshan Sharma and Wangyou Zhang and Yui Sudo and Muhammad Shakeel and Jee-weon Jung and Soumi Maiti and Shinji Watanabe},
  booktitle = {Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  year = {2023},
  month = {12},
  pdf = {https://arxiv.org/pdf/2309.13876},
}
```
data/token_list/bpe_unigram50000/bpe.model ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7ddb01f03dab493c18ab69391e98744c090f897890d8b529b30cae52a8d9eef4
size 1044580
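The three-line stubs in this commit are Git LFS pointer files: the repo stores only each object's hash and size, while the real payload lives in LFS storage. A small illustrative parser (the `parse_lfs_pointer` helper is ours, not part of any tool):

```python
# Parse a Git LFS pointer file into its key/value fields.
# Format per https://git-lfs.github.com/spec/v1: one "key value" pair per line.

def parse_lfs_pointer(text: str) -> dict:
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

# The pointer stored for data/token_list/bpe_unigram50000/bpe.model:
pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:7ddb01f03dab493c18ab69391e98744c090f897890d8b529b30cae52a8d9eef4
size 1044580
"""

info = parse_lfs_pointer(pointer)
print(info["oid"])        # content hash of the real file
print(int(info["size"]))  # payload size in bytes (~1 MB here)
```

The same format describes the feature-statistics and checkpoint stubs below; only the `oid` and `size` values differ.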
exp/s2t_stats_raw_bpe50000/train/feats_stats.npz ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:00c22dba27594df8f1d8f74a491b20c6e6e8c17e92159f81dfd634f98c098654
size 1786
exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/config.yaml ADDED
The diff for this file is too large to render. See raw diff
exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/acc.png ADDED
exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/backward_time.png ADDED
exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/cer.png ADDED
exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/cer_ctc.png ADDED
exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/clip.png ADDED
exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/forward_time.png ADDED
exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/gpu_max_cached_mem_GB.png ADDED
exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/grad_norm.png ADDED
exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/iter_time.png ADDED
exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/loss.png ADDED
exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/loss_att.png ADDED
exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/loss_ctc.png ADDED
exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/loss_scale.png ADDED
exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/optim0_lr0.png ADDED
exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/optim_step_time.png ADDED
exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/train_time.png ADDED
exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/images/wer.png ADDED
exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/valid.total_count.ave_5best.pth ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ae83197c66c2d8f5fbd45ec7fa845fbee3e62a708c5f3612f2b5e42dc719c564
size 1478780074
meta.yaml ADDED
@@ -0,0 +1,8 @@
espnet: '202412'
files:
  s2t_model_file: exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/valid.total_count.ave_5best.pth
python: 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:26:55) [GCC 12.3.0]
timestamp: 1738817987.260615
torch: 2.5.1
yaml_files:
  s2t_train_config: exp/s2t_train_conv2d8_size768_e9_d9_mel128_raw_bpe50000/config.yaml