---
library_name: transformers
base_model: openai/whisper-base
language:
- sv
pipeline_tag: automatic-speech-recognition
license: apache-2.0
datasets:
- KBLab/rixvox-v2
tags:
- ctranslate2
---
## KB-Whisper Base

The National Library of Sweden releases a new suite of Whisper models trained on over 50,000 hours of Swedish speech. In evaluations across [FLEURS](https://huggingface.co/datasets/google/fleurs), [CommonVoice](https://huggingface.co/datasets/mozilla-foundation/common_voice_16_1) and [NST](https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-56/), our best performing model reduces the Word Error Rate (WER) by an average of 47% compared to OpenAI's `whisper-large-v3`. The performance of smaller Whisper model sizes on Swedish speech has also substantially improved, with `kb-whisper-small` outperforming `openai/whisper-large-v3` (a model six times its size).

| Model size  |   | FLEURS | CommonVoice | NST  |
|------------|---------|--------|-------------|------|
| [tiny](https://huggingface.co/KBLab/kb-whisper-tiny)       | **KBLab**   | **13.2**  | **12.9**  | **11.2**  |
|            | OpenAI  | 59.2   | 67.8   | 85.2   |
| [base](https://huggingface.co/KBLab/kb-whisper-base)       | **KBLab**   | **9.1**   | **8.7**   | **7.8**   |
|            | OpenAI  | 39.6   | 52.1   | 53.4   |
| [small](https://huggingface.co/KBLab/kb-whisper-small)      | **KBLab**   | **7.3**   | **6.4**   | **6.6**   |
|            | OpenAI  | 20.6   | 26.4   | 26.4   |
| [medium](https://huggingface.co/KBLab/kb-whisper-medium)     | **KBLab**   | **6.6**   | **5.4**   | **5.8**   |
|            | OpenAI  | 12.1   | 15.8   | 17.1   |
| [large-v3](https://huggingface.co/KBLab/kb-whisper-large)   | **KBLab**   | **5.4**   | **4.1**   | **5.2**   |
|            | OpenAI  | 7.8    | 9.5    | 11.3    |

Table: **Word Error Rate (WER)** comparison between KBLab's Whisper models and the corresponding OpenAI versions. 

### Usage

We provide checkpoints in different formats: `Hugging Face`, `whisper.cpp` (GGML), `onnx`, and `ctranslate2` (used in `faster-whisper` and `WhisperX`).

### 2025-05-13 Update!
The default version when loading our models through Hugging Face is **Stage 2**.
As of May 2025, two additional **Stage 2** versions are available alongside the default, namely **Subtitle** and **Strict**, which differ in transcription style.
Specifying `revision="subtitle"` in `.from_pretrained()` loads the version with a more condensed transcription style.
Specifying `revision="strict"` in `.from_pretrained()` loads the more verbatim version of the model.
Below is an example of how this argument is passed to `.from_pretrained()`:
```python
import torch
from transformers import AutoModelForSpeechSeq2Seq

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "KBLab/kb-whisper-base"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, use_safetensors=True, cache_dir="cache", revision="strict"
)
```
In terms of verbosity, the three versions range from **Subtitle** (least verbose), through **Stage 2** (default), to **Strict** (most verbose).

#### Hugging Face

Inference example for using `KB-Whisper` with Hugging Face:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "KBLab/kb-whisper-base"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, use_safetensors=True, cache_dir="cache"
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

generate_kwargs = {"task": "transcribe", "language": "sv"}
# Add return_timestamps=True for output with timestamps
res = pipe(
    "audio.mp3",
    chunk_length_s=30,
    generate_kwargs=generate_kwargs,
)
print(res)
```
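As noted in the comment above, passing `return_timestamps=True` to the pipeline call returns chunk-level timestamps alongside the text. A minimal sketch (the audio path is a placeholder):

```python
# Chunk-level timestamps: the result gains a "chunks" list of {"timestamp": (start, end), "text": ...}
res = pipe(
    "audio.mp3",
    chunk_length_s=30,
    return_timestamps=True,
    generate_kwargs={"task": "transcribe", "language": "sv"},
)
for chunk in res["chunks"]:
    start, end = chunk["timestamp"]
    print(f"[{start} -> {end}] {chunk['text']}")
```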

#### Faster-whisper

[Faster-whisper](https://github.com/SYSTRAN/faster-whisper) provides fast and efficient inference via a reimplementation of Whisper using `ctranslate2`. 

```python
#### faster-whisper model ####
from faster_whisper import WhisperModel

model_id = "KBLab/kb-whisper-base"
model = WhisperModel(
    model_id,
    device="cuda",
    compute_type="float16",
    download_root="cache", # cache directory
    # condition_on_previous_text = False # Can reduce hallucinations if we don't use prompts
)

# Transcribe audio.wav (convert to 16khz mono wav first via ffmpeg)
segments, info = model.transcribe("audio.wav", condition_on_previous_text=False)
print("Detected language '%s' with probability %f" % (info.language, info.language_probability))

for segment in segments:
    print("[%.2fs -> %.2fs] %s" % (segment.start, segment.end, segment.text))
```
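The comment in the snippet above assumes the audio has already been converted to 16 kHz mono WAV. A typical `ffmpeg` command for that conversion (file names are placeholders) is:

```
ffmpeg -i input.mp3 -ar 16000 -ac 1 audio.wav
```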

#### WhisperX

[WhisperX](https://github.com/m-bain/whisperX) provides a convenient method of getting accurate word level timestamps. The library combines (force aligns) the text output of Whisper with the accurate timestamps of Wav2vec2. We provide an example below of how to use `KB-Whisper` together with [KBLab/wav2vec2-large-voxrex-swedish](https://huggingface.co/KBLab/wav2vec2-large-voxrex-swedish).

```python
import whisperx

device = "cuda"
audio_file = "audio.wav"
batch_size = 16  # reduce if low on GPU mem
compute_type = "float16"  # change to "int8" if low on GPU mem (may reduce accuracy)

# 1. Transcribe with original whisper (batched)
model = whisperx.load_model(
    "KBLab/kb-whisper-base", device, compute_type=compute_type, download_root="cache"  # cache_dir
)

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print(result["segments"])  # before alignment

# delete model if low on GPU resources
# import gc; gc.collect(); torch.cuda.empty_cache(); del model

# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"],
    device=device,
    model_name="KBLab/wav2vec2-large-voxrex-swedish",
    model_dir="cache",  # cache_dir
)
result = whisperx.align(
    result["segments"], model_a, metadata, audio, device, return_char_alignments=False
)

print(result["segments"])  # word level timestamps after alignment
```

#### Whisper.cpp / GGML

We provide GGML checkpoints for use with `whisper.cpp` and applications built on it, such as `MacWhisper`. To use our model with `whisper.cpp`, first clone the repository and build the library:

```
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build
cmake --build build --config Release
```

To use the model you need to download one of the GGML checkpoints we have uploaded. You can either press the download buttons [here](https://huggingface.co/KBLab/kb-whisper-base/tree/main), or download using `wget`:

```
wget https://huggingface.co/KBLab/kb-whisper-base/resolve/main/ggml-model-q5_0.bin # Quantized version
# wget https://huggingface.co/KBLab/kb-whisper-base/resolve/main/ggml-model.bin # Non-quantized version
```

Run inference by specifying the model path after the argument `-m`, along with the path to the audio file as the last positional argument.

```
./build/bin/whisper-cli -m ggml-model-q5_0.bin ../audio.wav
```
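By default, whisper.cpp auto-detects the language. To force Swedish decoding, the language flag can typically be passed as well (check `./build/bin/whisper-cli --help` for the exact options available in your build):

```
./build/bin/whisper-cli -m ggml-model-q5_0.bin -l sv ../audio.wav
```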

#### onnx (optimum) and transformers.js usage

You can use the `onnx` checkpoints via Hugging Face's `optimum` library in the following manner:

```python
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq
from transformers import AutoProcessor
import soundfile as sf

model_id = "KBLab/kb-whisper-base"
processor = AutoProcessor.from_pretrained(model_id, cache_dir="cache")
model = ORTModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    cache_dir="cache",
    subfolder="onnx",
)

# soundfile returns a (data, sample_rate) tuple
audio, sample_rate = sf.read("audio.wav")

inputs = processor.feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
gen_tokens = model.generate(**inputs, max_length=300)
print(processor.decode(gen_tokens[0], skip_special_tokens=True))
```

An example of an app that runs inference locally in the browser with `transformers.js` and `KB-Whisper` can be found at [https://whisper.mesu.re/](https://whisper.mesu.re/) (created by Pierre Mesure). A template for setting up such an app with javascript can be found at [https://github.com/xenova/whisper-web](https://github.com/xenova/whisper-web). 

### Training data

Our models have been trained on over 50,000 hours of Swedish audio with text transcriptions. The models were trained in two stages, each applying different quality filters and filter thresholds.

Stage 1 employed low threshold values (0 to 0.30 BLEU depending on dataset), whereas Stage 2 used stricter thresholds (`BLEU >= 0.7`, weighted ROUGE-N `>= 0.7`, CER of first and last 10 characters `<= 0.2`).
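As a rough illustration of the Stage 2 filter (a sketch only, not our actual pipeline: the BLEU and weighted ROUGE-N scores are assumed to be precomputed on a 0 to 1 scale, and text normalization details are omitted), a segment could be kept or dropped like this:

```python
def cer(ref: str, hyp: str) -> float:
    """Character error rate: Levenshtein distance divided by reference length."""
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, row[0] = row[0], i
        for j, h in enumerate(hyp, 1):
            prev, row[j] = row[j], min(row[j] + 1, row[j - 1] + 1, prev + (r != h))
    return row[-1] / max(len(ref), 1)


def keep_segment(reference: str, hypothesis: str, bleu: float, weighted_rouge_n: float) -> bool:
    """Stage 2-style filter: keep only segments that pass all thresholds."""
    return (
        bleu >= 0.7
        and weighted_rouge_n >= 0.7
        and cer(reference[:10], hypothesis[:10]) <= 0.2    # first 10 characters
        and cer(reference[-10:], hypothesis[-10:]) <= 0.2  # last 10 characters
    )
```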

| Dataset      | Continued pretraining (h) -- Stage 1 | Finetuning (h) -- Stage 2 |
|-------------|--------------------------|--------------|
| Subtitles   | 34,261                   | 3,110        |
| Riksdag     | 21,949                   | 5,119        |
| ISOF        | 54                       | 54           |
| NST         | 250                      | 250          |
| **Total**   | **56,514**               | **8,533**    |


The default when loading our models through Hugging Face is **Stage 2**. We have, however, also uploaded continued-pretraining checkpoints and tagged them. You can load these other checkpoints by specifying the `revision` argument in `.from_pretrained()`. The pretrained checkpoint tag can, for example, be found here: [`pretrained-checkpoint`](https://huggingface.co/KBLab/kb-whisper-large/tree/pretrained-checkpoint). The Stage 2 default model tag is named `standard`. We also supply two additional Stage 2 checkpoints: `subtitle`, which has a more condensed transcription style, and `strict`, which is more verbose.
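For instance, loading the continued-pretraining checkpoint instead of the Stage 2 default might look like the sketch below (the tag name follows the `kb-whisper-large` example linked above; replace the `revision` value with `subtitle` or `strict` to load those Stage 2 variants):

```python
from transformers import AutoModelForSpeechSeq2Seq

# Load the continued-pretraining checkpoint rather than the Stage 2 default ("standard")
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "KBLab/kb-whisper-base",
    revision="pretrained-checkpoint",
    cache_dir="cache",
)
```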

### Evaluation


#### WER compared to OpenAI
| Model size  |  | FLEURS | CommonVoice | NST  |
|------------|---------|--------|-------------|------|
| [tiny](https://huggingface.co/KBLab/kb-whisper-tiny)       | **KBLab**   | **13.2**  | **12.9**  | **11.2**  |
|            | OpenAI  | 59.2   | 67.8   | 85.2   |
| [base](https://huggingface.co/KBLab/kb-whisper-base)       | **KBLab**   | **9.1**   | **8.7**   | **7.8**   |
|            | OpenAI  | 39.6   | 52.1   | 53.4   |
| [small](https://huggingface.co/KBLab/kb-whisper-small)     | **KBLab**   | **7.3**   | **6.4**   | **6.6**   |
|            | OpenAI  | 20.6   | 26.4   | 26.4   |
| [medium](https://huggingface.co/KBLab/kb-whisper-medium)   | **KBLab**   | **6.6**   | **5.4**   | **5.8**   |
|            | OpenAI  | 12.1   | 15.8   | 17.1   |
| [large-v3](https://huggingface.co/KBLab/kb-whisper-large)  | **KBLab**   | **5.4**   | **4.1**   | **5.2**   |
|            | OpenAI  | 7.8    | 9.5    | 11.3    |

#### WER for different KBLab stage2 versions

| Model size  |  | FLEURS | CommonVoice | NST  |
|------------|---------|--------|-------------|------|
| [tiny](https://huggingface.co/KBLab/kb-whisper-tiny)       | **standard**   | **13.2**  | **12.9**  | **11.2**  |
|            | strict    | 14.1   | 13.4   | 11.0   |
|            | subtitle  | 13.3   | 12.9   | 11.4   |
| [base](https://huggingface.co/KBLab/kb-whisper-base)       | **standard**   | **9.1**   | **8.7**   | **7.8**   |
|            | strict    | 10.4   | 9.6    | 8.4    |
|            | subtitle  | 9.1    | 8.7    | 7.9    |
| [small](https://huggingface.co/KBLab/kb-whisper-small)     | **standard**   | **7.3**   | **6.4**   | **6.6**   |
|            | strict    | 8.2    | 7.0    | 6.7    |
|            | subtitle  | 7.3    | 6.4    | 6.6    |
| [medium](https://huggingface.co/KBLab/kb-whisper-medium)   | **standard**   | **6.6**   | **5.4**   | **5.8**   |
|            | strict    | 6.8    | 5.4    | 6.0    |
| [large-v3](https://huggingface.co/KBLab/kb-whisper-large)  | **standard**   | **5.4**   | **4.1**   | **5.2**   |
|            | strict    | 5.3    | 4.0    | 5.1    |
|            | subtitle  | 5.3    | 4.1    | 5.3    |


#### BLEU Score compared to OpenAI
| Model size  |   | FLEURS | CommonVoice | NST  |
|------------|---------|--------|-------------|------|
| [tiny](https://huggingface.co/KBLab/kb-whisper-tiny)       | **KBLab**   | **76.6**  | **73.7**  | **74.3**  |
|            | OpenAI  | 26.9   | 21.1   | 24.0   |
| [base](https://huggingface.co/KBLab/kb-whisper-base)       | **KBLab**   | **83.2**   | **79.9**   | **78.3**   |
|            | OpenAI  | 41.1   | 32.5   | 36.9   |
| [small](https://huggingface.co/KBLab/kb-whisper-small)     | **KBLab**   | **86.6**   | **83.5**   | **79.6**   |
|            | OpenAI  | 64.0   | 56.5   | 58.2   |
| [medium](https://huggingface.co/KBLab/kb-whisper-medium)   | **KBLab**   | **87.6**   | **85.0**   | **80.2**   |
|            | OpenAI  | 77.1   | 70.1   | 68.9   |
| [large-v3](https://huggingface.co/KBLab/kb-whisper-large)  | **KBLab**   | **89.8**   | **87.2**   | **81.1**   |
|            | OpenAI  | 84.9    | 79.1    | 75.1    |

#### BLEU Score for different KBLab stage2 versions
| Model size  |   | FLEURS | CommonVoice | NST  |
|------------|---------|--------|-------------|------|
| [tiny](https://huggingface.co/KBLab/kb-whisper-tiny)       | **standard**   | **76.6**  | **73.7**  | **74.3**  |
|            | strict      | 75.3    | 72.9    | 74.6    |
|            | subtitle    | 76.6    | 73.7    | 74.1    |
| [base](https://huggingface.co/KBLab/kb-whisper-base)       | **standard**   | **83.2**   | **79.9**   | **78.3**   |
|            | strict      | 81.0    | 78.4    | 77.5    |
|            | subtitle    | 83.2    | 79.8    | 78.2    |
| [small](https://huggingface.co/KBLab/kb-whisper-small)     | **standard**   | **86.6**   | **83.5**   | **79.6**   |
|            | strict      | 84.9    | 82.4    | 79.3    |
|            | subtitle    | 86.6    | 83.5    | 79.6    |
| [medium](https://huggingface.co/KBLab/kb-whisper-medium)   | **standard**   | **87.6**   | **85.0**   | **80.2**   |
|            | strict      | 87.3    | 84.9    | 80.1    |
| [large-v3](https://huggingface.co/KBLab/kb-whisper-large)  | **standard**   | **89.8**   | **87.2**   | **81.1**   |
|            | strict      | 90.0    | 87.4    | 81.2    |
|            | subtitle    | 89.8    | 87.3    | 81.0    |



### Acknowledgements

We acknowledge the EuroHPC Joint Undertaking for awarding this project access to the EuroHPC supercomputer LEONARDO, hosted by CINECA (Italy) and the LEONARDO consortium, through a EuroHPC AI and Data-Intensive Applications Access call.


### Citation

Paper reference coming soon.