KB-Whisper Base

The National Library of Sweden has released a new suite of Whisper models trained on over 50,000 hours of Swedish speech. In evaluations across FLEURS, CommonVoice and NST, our best-performing model reduces the Word Error Rate (WER) by an average of 47% compared to OpenAI's whisper-large-v3. The smaller model sizes have also improved substantially on Swedish speech: kb-whisper-small outperforms openai/whisper-large-v3, a model six times its size.

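| Model size |        | FLEURS | CommonVoice | NST  |
|------------|--------|--------|-------------|------|
| tiny       | KBLab  | 13.2   | 12.9        | 11.2 |
|            | OpenAI | 59.2   | 67.8        | 85.2 |
| base       | KBLab  | 9.1    | 8.7         | 7.8  |
|            | OpenAI | 39.6   | 52.1        | 53.4 |
| small      | KBLab  | 7.3    | 6.4         | 6.6  |
|            | OpenAI | 20.6   | 26.4        | 26.4 |
| medium     | KBLab  | 6.6    | 5.4         | 5.8  |
|            | OpenAI | 12.1   | 15.8        | 17.1 |
| large-v3   | KBLab  | 5.4    | 4.1         | 5.2  |
|            | OpenAI | 7.8    | 9.5         | 11.3 |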
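
The headline figure can be recomputed from the large-v3 rows of the table above. The sketch below copies the WER values by hand; the variable and function names are illustrative, and it assumes the 47% figure is the unweighted mean of the per-dataset relative reductions, which the text does not state explicitly.

```python
# WER (%) values copied from the table, in the order (FLEURS, CommonVoice, NST).
kb_large = (5.4, 4.1, 5.2)
openai_large = (7.8, 9.5, 11.3)
kb_small = (7.3, 6.4, 6.6)


def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of `improved` versus `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


# Per-dataset relative WER reduction of kb-whisper-large vs whisper-large-v3.
reductions = [relative_reduction(o, k) for o, k in zip(openai_large, kb_large)]

# Assumption: the headline number is the unweighted mean over the three test sets.
average_reduction = sum(reductions) / len(reductions)

# Check the second claim: kb-whisper-small has lower WER than
# whisper-large-v3 on every test set.
small_beats_large = all(k < o for k, o in zip(kb_small, openai_large))

print([round(r, 1) for r in reductions])
print(round(average_reduction, 1))
print(small_beats_large)
```

Under that assumption the mean comes out at roughly 47%, consistent with the figure quoted above.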

Usage

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "KBLab/kb-whisper-base"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, use_safetensors=True, cache_dir="cache"
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

generate_kwargs = {"task": "transcribe", "language": "sv"}
# Add return_timestamps=True for output with timestamps
res = pipe(
    "audio.mp3",
    chunk_length_s=30,
    generate_kwargs=generate_kwargs,
)
print(res["text"])
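
When return_timestamps=True is passed, the pipeline result carries a "chunks" list alongside "text", where each chunk holds a "text" string and a "timestamp" pair of (start, end) seconds. A minimal sketch of turning that list into SRT-style subtitle entries follows. The helpers format_time and chunks_to_srt and the sample result are hypothetical, and the sketch assumes that a missing end time (None) should fall back to the start time.

```python
def format_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = int(round(seconds * 1000))
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def chunks_to_srt(result: dict) -> str:
    """Convert a pipeline result with a "chunks" list into SRT text."""
    entries = []
    for index, chunk in enumerate(result.get("chunks", []), start=1):
        start, end = chunk["timestamp"]
        if end is None:
            # Assumption: with no end time available, reuse the start time.
            end = start
        entries.append(
            f"{index}\n"
            f"{format_time(start)} --> {format_time(end)}\n"
            f"{chunk['text'].strip()}\n"
        )
    return "\n".join(entries)


# Hand-written stand-in for the output of:
#   pipe("audio.mp3", chunk_length_s=30, return_timestamps=True,
#        generate_kwargs={"task": "transcribe", "language": "sv"})
sample = {
    "text": " Hej och välkommen. Vi börjar nu.",
    "chunks": [
        {"timestamp": (0.0, 2.5), "text": " Hej och välkommen."},
        {"timestamp": (2.5, None), "text": " Vi börjar nu."},
    ],
}

srt_text = chunks_to_srt(sample)
print(srt_text)
```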

Training data

Our models have been trained on over 50,000 hours of Swedish audio with text transcriptions. Training ran in two stages, each applying different quality filters and different thresholds for those filters.

Stage 1 applied lenient thresholds (BLEU of 0.15 to 0.30), whereas Stage 2 applied stricter ones (BLEU >= 0.7, weighted ROUGE-N >= 0.7, and a CER of at most 0.2 on the first and last 10 characters).
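
The Stage 2 thresholds can be sketched as a simple filter. This is only an illustration under stated assumptions: the text does not say how the scores are computed, so the sketch assumes each dataset transcript is compared against a second transcription of the same audio, that BLEU and weighted ROUGE-N arrive precomputed on a 0 to 1 scale, and that the CER limit applies to the first and the last 10 characters separately. The function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def passes_stage2(
    reference: str,
    hypothesis: str,
    bleu: float,
    rouge_n: float,
    n_chars: int = 10,
) -> bool:
    """Apply the Stage 2 thresholds quoted in the text above."""
    # Assumption: head and tail are checked separately against the 0.2 limit.
    head_cer = cer(reference[:n_chars], hypothesis[:n_chars])
    tail_cer = cer(reference[-n_chars:], hypothesis[-n_chars:])
    return bleu >= 0.7 and rouge_n >= 0.7 and head_cer <= 0.2 and tail_cer <= 0.2


# Identical texts with high scores pass; a garbled opening does not.
clean = passes_stage2("det här är ett test", "det här är ett test", 0.9, 0.85)
garbled = passes_stage2("det här är ett test", "xxx xxx är ett test", 0.9, 0.85)
print(clean, garbled)
```

Checking the first and last characters separately is one way to catch transcripts whose boundaries are misaligned with the audio even when the overall overlap scores are high.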

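| Dataset   | Continued pretraining (h), Stage 1 | Finetuning (h), Stage 2 |
|-----------|------------------------------------|-------------------------|
| Subtitles | 34,261                             | 3,110                   |
| Riksdag   | 21,949                             | 5,119                   |
| ISOF      | 54                                 | 54                      |
| NST       | 250                                | 250                     |
| Total     | 56,514                             | 8,533                   |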

Loading our models through Hugging Face gives you the Stage 2 model by default. We have also uploaded and tagged the checkpoints from continued pretraining (Stage 1). You can load these by specifying the revision, for example pretrained-checkpoint. The default Stage 2 model is tagged standard.
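
A minimal sketch of selecting a checkpoint by tag, using the revision argument of from_pretrained. The tag names come from the paragraph above; the "stage1" and "stage2" keys and the load_checkpoint helper are hypothetical names introduced only for this example.

```python
MODEL_ID = "KBLab/kb-whisper-base"

# Tag names as given in the text; the dictionary keys are illustrative labels.
REVISIONS = {
    "stage1": "pretrained-checkpoint",  # continued pretraining
    "stage2": "standard",               # default finetuned model
}


def load_checkpoint(stage: str = "stage2"):
    """Load the model at the tagged revision for the given stage."""
    # Imported here so the mapping above can be inspected without
    # transformers installed.
    from transformers import AutoModelForSpeechSeq2Seq

    return AutoModelForSpeechSeq2Seq.from_pretrained(
        MODEL_ID,
        revision=REVISIONS[stage],
    )


# Example (downloads the continued-pretraining checkpoint):
# model = load_checkpoint("stage1")
```

The same revision argument can be passed to AutoProcessor.from_pretrained if you want the processor pinned to the same tag.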

Evaluation

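WER (%), lower is better:

| Model size |        | FLEURS | CommonVoice | NST  |
|------------|--------|--------|-------------|------|
| tiny       | KBLab  | 13.2   | 12.9        | 11.2 |
|            | OpenAI | 59.2   | 67.8        | 85.2 |
| base       | KBLab  | 9.1    | 8.7         | 7.8  |
|            | OpenAI | 39.6   | 52.1        | 53.4 |
| small      | KBLab  | 7.3    | 6.4         | 6.6  |
|            | OpenAI | 20.6   | 26.4        | 26.4 |
| medium     | KBLab  | 6.6    | 5.4         | 5.8  |
|            | OpenAI | 12.1   | 15.8        | 17.1 |
| large-v3   | KBLab  | 5.4    | 4.1         | 5.2  |
|            | OpenAI | 7.8    | 9.5         | 11.3 |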
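
BLEU score, higher is better:

| Model size |        | FLEURS | CommonVoice | NST  |
|------------|--------|--------|-------------|------|
| tiny       | KBLab  | 76.6   | 73.7        | 74.3 |
|            | OpenAI | 26.9   | 21.1        | 24.0 |
| base       | KBLab  | 83.2   | 79.9        | 78.3 |
|            | OpenAI | 41.1   | 32.5        | 36.9 |
| small      | KBLab  | 86.6   | 83.5        | 79.6 |
|            | OpenAI | 64.0   | 56.5        | 58.2 |
| medium     | KBLab  | 87.6   | 85.0        | 80.2 |
|            | OpenAI | 77.1   | 70.1        | 68.9 |
| large-v3   | KBLab  | 89.8   | 87.2        | 81.1 |
|            | OpenAI | 84.9   | 79.1        | 75.1 |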
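
For reference, WER is the number of word-level substitutions, deletions and insertions needed to turn the model output into the reference, divided by the number of reference words. The sketch below is a plain-Python illustration of that definition, not the evaluation code behind the tables. The text normalisation applied before scoring is not described here, so lowercasing and whitespace splitting are assumptions, and the function name is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase and split on whitespace; real evaluations
    # typically apply more careful text normalisation.
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Dynamic-programming edit distance over words, one row at a time.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1] / len(ref_words)


# One deleted word out of four reference words.
example = word_error_rate("jag bor i stockholm", "jag bor stockholm")
print(f"{100 * example:.1f}")
```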