---
base_model:
- Qwen/Qwen2.5-0.5B
datasets:
- openslr/librispeech_asr
- slprl/SpokenSwag
- slprl/sTinyStories
library_name: transformers
license: mit
pipeline_tag: audio-to-audio
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
---

# Slamming: Training a Speech Language Model on One GPU in a Day

The model was presented in the paper [Slamming: Training a Speech Language Model on One GPU in a Day](https://arxiv.org/abs/2502.15814).

# Paper abstract

We introduce Slam, a recipe for training high-quality Speech Language Models (SLMs) on a single academic GPU in 24 hours. We do so through empirical analysis of model initialisation and architecture, synthetic training data, preference optimisation with synthetic data and tweaking all other components. We empirically demonstrate that this training recipe also scales well with more compute getting results on par with leading SLMs in a fraction of the compute cost. We hope these insights will make SLM training and research more accessible. In the context of SLM scaling laws, our results far outperform predicted compute optimal performance, giving an optimistic view to SLM feasibility. See code, data, models, samples at - https://pages.cs.huji.ac.il/adiyoss-lab/slamming .

# Model Card
This is a Speech Language Model (SLM) trained to generate speech continuations over discrete [HuBERT tokens](https://huggingface.co/slprl/mhubert-base-25hz).

## Model Details

### Model Description
This Speech Language Model, introduced in ["_Slamming_: Training a Speech Language Model on One GPU in a Day"](https://arxiv.org/abs/2502.15814), focuses on efficient training.
It was fine-tuned from [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) over a vocabulary of 500 speech tokens extracted from
the 11th layer of [mhubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz).

The model was pre-trained using next-token prediction on a subset of LibriSpeech, Libri-Light and the synthetic dataset
[sTinyStories](https://huggingface.co/datasets/slprl/sTinyStories). It was subsequently fine-tuned with DPO on
[SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).

- **Developed by:** [SLP-RL](https://huggingface.co/slprl)
- **Model type:** SpeechLM
- **License:** MIT
- **Finetuned from model:** [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B)

### Model Sources

- **Repository:** [https://github.com/slp-rl/slamkit](https://github.com/slp-rl/slamkit)
- **Paper:** [https://arxiv.org/abs/2502.15814](https://arxiv.org/abs/2502.15814)
- **Demo:** [https://pages.cs.huji.ac.il/adiyoss-lab/slamming/](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/)

## Uses
This base SpeechLM can be used to generate continuations for speech segments, or as a base for further tuning. See the _SlamKit_
[codebase](https://github.com/slp-rl/slamkit) for more details on usage, and check out the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/) for some generation examples.

### Out-of-Scope Use
This model was trained on curated speech datasets consisting mainly of audiobooks and stories; as such, its outputs should not be treated as factual in any way.

## How to Get Started with the Model
We refer users to the official [GitHub repository](https://github.com/slp-rl/slamkit) for full usage instructions.
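
As a rough orientation only (the SlamKit repository remains the authoritative reference), the sketch below assumes the checkpoint loads as a standard 🤗transformers causal LM and that the prompt audio has already been converted to discrete HuBERT units; the repo id placeholder and the `<unit_k>` token format are illustrative assumptions, not part of this card.

```python
# Hypothetical sketch: continue a sequence of discrete speech units.
# The exact loading / tokenisation API is defined by SlamKit (https://github.com/slp-rl/slamkit);
# "<repo-id>" and the "<unit_k>" token naming below are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<repo-id>"  # replace with this model's Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Suppose the prompt audio was already quantised to de-duplicated HuBERT units (IDs 0..499).
prompt_units = [17, 204, 93, 408, 12]
prompt = "".join(f"<unit_{u}>" for u in prompt_units)  # assumed speech-token format

inputs = tokenizer(prompt, return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.8)
print(tokenizer.decode(generated[0]))  # continuation units, to be vocoded back to audio
```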

## Training Details
We highly encourage users to read the full [paper](https://arxiv.org/abs/2502.15814) for complete training details; a brief overview is provided below.

### Training Data
This model was trained on a subset of the [LibriSpeech](https://huggingface.co/datasets/openslr/librispeech_asr) train split,
[Libri-Light](https://ai.meta.com/tools/libri-light/) and the synthetic dataset
[sTinyStories](https://huggingface.co/datasets/slprl/sTinyStories) for the pre-training phase. It was also trained with DPO on the synthetic
dataset [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).

### Training Procedure
This model was trained with next-token prediction over several datasets, and then fine-tuned with DPO on [SpokenSwag](https://huggingface.co/datasets/slprl/SpokenSwag).
Please refer to the [paper](https://arxiv.org/abs/2502.15814) or [code](https://github.com/slp-rl/slamkit) for the full training recipes.
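
For orientation, the preference-optimisation stage uses the standard DPO objective (Rafailov et al., 2023); the formula below is the generic form rather than the paper's exact setup, whose hyperparameters and data handling are described in the paper:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$

where x is a spoken prompt, y_w and y_l are the preferred and rejected speech-token continuations (here drawn from SpokenSwag), π_ref is the pre-trained model before DPO, and β is the DPO temperature.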

#### Preprocessing
Speech tokens are extracted from the audio using [mhubert-25hz](https://huggingface.co/slprl/mhubert-base-25hz), and quantised using the
official k-means released with the model in [textlesslib](https://github.com/facebookresearch/textlesslib/tree/main). Units are de-duplicated.
We encourage you to explore the official [repository](https://github.com/slp-rl/slamkit) for full details.
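
De-duplication simply collapses consecutive repeats of the same k-means unit (at 25 Hz, one second of audio yields roughly 25 raw units before this step). A minimal sketch of just that step, assuming the units have already been extracted and quantised with the tools above (the helper name is illustrative, not the SlamKit API):

```python
from itertools import groupby

def deduplicate_units(units: list[int]) -> list[int]:
    """Collapse runs of identical speech units into a single occurrence."""
    return [unit for unit, _ in groupby(units)]

# Example: raw unit IDs from the k-means quantiser (vocabulary of 500 units).
print(deduplicate_units([12, 12, 12, 407, 407, 3]))  # -> [12, 407, 3]
```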

## Evaluation
The paper provides full results; we give some results here and also refer to the [demo page](https://pages.cs.huji.ac.il/adiyoss-lab/slamming/) to listen to some samples.

| Model | GPUs | Params | Num Tokens | sBLIMP ↑ | sStoryCloze ↑ | tStoryCloze ↑ | GenPPL ↓ | Auto-BLEU ↓ |
|-------------------------------------------|---------|--------|---------------|-----------|---------------|---------------|----------|-------------|
| **Speech only pre-training** | | | | | | | | |
| GSLM | 8×V100 | 100M | 1B | 54.2 | 53.3 | 66.6 | — | — |
| SyllableLM | 4×A40 | 300M | 16B | 63.7 | — | 75.4 | — | — |
| TWIST-350M | 8×V100 | 305M | 10.8B | 56.2 | — | — | 137.3 | 3.46 |
| TWIST-1.3B | 32×V100 | 1B | 10.8B | 57.0 | 52.4 | 70.6 | 131.8 | 3.20 |
| TWIST-7B | 32×V100 | 7B | 36B | 59.0 | 55.3 | 74.1 | 93.74 | 3.06 |
| TWIST-13B | 32×V100 | 13B | 36B | 59.2 | 55.4 | 76.4 | — | — |
| Scaled Optimal | — | 823M | 82B | **61.3** | 56.7 | 78.0 | — | — |
| Moshi | ?×H100 | 7B | ? | 58.9 | **58.7** | **81.8** | — | — |
| SpiritLM | 64×A100 | 7B | 100B | 58.0 | 54.8 | 72.9 | — | — |
| **With text / preference optimization** | | | | | | | | |
| Scaling Interleaving | — | 9B | ~1T | — | **62.4** | 82.9 | — | — |
| Moshi | ?×H100 | 7B | ~720B | 58.8 | 60.8 | 83.0 | — | — |
| SpiritLM | 64×A100 | 7B | 100B | 58.3 | 61.0 | 82.9 | — | — |
| AlignSLM-1.3B | 64×A100 | 1B | 10.8B + ~158B | 59.8 | 55.0 | 80.0 | — | — |
| AlignSLM-7B | 64×A100 | 7B | 36B + ~158B | **62.3** | 61.1 | **86.8** | — | — |
| **Ours (_Slam_)** | | | | | | | | |
| _Slam_ (-DPO) | 2×A100 | 358M | 16.7B | 58.53 | 58.15 | 80.71 | 67.3 | 3.25 |
| _Slam_ | 1×A5000 | 358M | 1.4B + 5M | 58.86 | 58.04 | 82.04 | 62.8 | 3.88 |
| _Slam_ (scaled) | 2×A100 | 358M | 16.7B + 9M | **61.11** | **61.30** | **84.18** | **46.6** | 3.75 |

### Compute Infrastructure
This model was trained as part of ["*Slamming*: Training a Speech Language Model on One GPU in a Day"](https://arxiv.org/abs/2502.15814), focusing on efficient training.

#### Hardware
This model was trained using **only 2 Nvidia A100 GPUs** for **48 hours**.

#### Software
The model was trained using the [*SlamKit*](https://github.com/slp-rl/slamkit) codebase, which builds upon 🤗transformers, extending it to support
easy and efficient training of Speech Language Models.

## Citation

**BibTeX:**
```
@misc{maimon2025slamming,
      title={Slamming: Training a Speech Language Model on One GPU in a Day},
      author={Gallil Maimon and Avishai Elmakies and Yossi Adi},
      year={2025},
      eprint={2502.15814},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.15814},
}
```