How biased is Whisper? Evaluating Whisper Models for Robustness to Diverse English Accents

Community Article Published January 29, 2025

1. Introduction

The diversity of English accents worldwide presents a significant challenge for Automatic Speech Recognition (ASR) systems. English is spoken as a first or second language by over a billion people, each influenced by unique cultural, regional, and linguistic factors. While ASR systems have made remarkable strides in accuracy, they still struggle with less represented accents, including African-accented English and English varieties spoken in countries such as India, Jamaica, and Scotland. These shortcomings highlight critical gaps in ASR robustness, particularly in global and multilingual settings.

Datasets such as the Edinburgh International Accents of English Corpus (EdAcc) and AfriSpeech aim to address this challenge by providing diverse benchmarks to evaluate ASR performance on English accents. EdAcc features dyadic conversations with speakers representing over 40 English accents, emphasizing linguistic diversity in conversational settings. Meanwhile, AfriSpeech focuses on African-accented English, including over 120 indigenous accents across 13 countries, combining clinical and general domains. Both datasets reveal considerable performance gaps in current state-of-the-art ASR models, underscoring the need for robust systems capable of handling global English variations.

This article evaluates OpenAI’s Whisper models, including both full-sized and distilled variants, to study their robustness to English accents. Using the EdAcc and AfriSpeech datasets, alongside the Open ASR benchmark as a general baseline, we analyze how models perform across diverse linguistic contexts. By exploring word error rates (WER) and identifying accent-specific challenges, this study aims to uncover trends in model robustness and guide the development of more inclusive ASR systems.
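Word error rate, the metric used throughout this article, counts the substitutions, deletions, and insertions needed to turn a hypothesis transcript into the reference, divided by the number of reference words. A minimal, illustrative implementation (this `wer` helper is our own sketch, not a library function; in practice a package such as jiwer or evaluate would be used):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("the cat sat on the mat", "the cat sat on mat")` gives 1/6: one deleted word out of six reference words.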


2. Overview of Whisper Models

The Whisper family of models, developed by OpenAI, represents a state-of-the-art solution in Automatic Speech Recognition (ASR), excelling in diverse linguistic and acoustic settings. These models are designed to handle a wide range of tasks, including transcription, translation, and language identification, and have been trained on vast datasets to ensure robustness. In this section, we break down the Whisper family, focusing on their variations and relevance to accent diversity.

2.1 Model Families

OpenAI Whisper

The Whisper family includes models of varying sizes—Tiny, Base, Small, Medium, and Large—offering a trade-off between computational efficiency and performance. Larger models, such as large-v3, leverage their greater capacity to handle complex linguistic variations and achieve superior accuracy, particularly with diverse accents. In contrast, smaller models like tiny and base are optimized for speed and resource-constrained environments but may struggle with less common or challenging accents.

Distilled Models

To improve efficiency without significantly compromising accuracy, Distil-Whisper models use knowledge distillation. This process trains a smaller "student" model to mimic the outputs of a larger "teacher" model. The distil-small.en, distil-medium.en, distil-large-v2, and distil-large-v3 models strike a balance between computational resource usage and transcription accuracy. These models are particularly noteworthy for their ability to handle accented English speech while maintaining efficiency, making them viable options for real-time or on-device ASR tasks.
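The core idea of distillation can be sketched as a loss that pulls the student's output distribution toward the teacher's at each decoding step. The toy example below is an illustration of that principle only, not Distil-Whisper's actual training recipe (which also trains on pseudo-labelled teacher transcripts):

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-smoothed softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) for one output distribution over the vocabulary.

    Zero when the student exactly matches the teacher; grows as the
    student's distribution drifts away from the teacher's.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))
```

A student whose logits already match the teacher's incurs zero loss; any disagreement produces a positive penalty that gradient descent can minimize.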

English-only vs. Multilingual

Whisper offers .en (English-only) and multilingual variants of its models.

  • English-only models (.en) are optimized for transcribing English and often show superior performance on datasets focusing exclusively on English accents, such as EdAcc or AfriSpeech. These models avoid the overhead of handling multiple languages, allowing more capacity to model English-specific features, including accent diversity.
  • Multilingual models, on the other hand, are designed to handle over 90 languages. While this broader capability can dilute their performance on English accents, they are indispensable in scenarios where mixed-language or multilingual input is common.


You can easily use these models with transformers!

from transformers import pipeline

# Load a Whisper model (e.g., "openai/whisper-large-v3")
# For a distilled model, use "distil-whisper/distil-large-v3"
asr_pipeline = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

# Transcribe an audio file
audio_file = "path/to/audio/file.wav"
transcription = asr_pipeline(audio_file)

print("Transcription:", transcription["text"])

3. Datasets for Evaluation

To evaluate the robustness of Whisper models across diverse English accents, we use two key datasets: EdAcc and AfriSpeech. Each offers unique characteristics, enabling us to analyze performance on general-purpose, regional, and underrepresented accents. We use the Open ASR Leaderboard as a baseline.

3.1 Open ASR Leaderboard

The Open ASR Leaderboard evaluation draws from a diverse collection of publicly available datasets, including AMI, Earnings22, LibriSpeech, GigaSpeech, SPGISpeech, TED-LIUM, and VoxPopuli. These datasets encompass a variety of speaking styles, contexts, and acoustic conditions:

  • AMI: Meeting recordings with overlapping speech, emphasizing conversational ASR challenges.
  • Earnings22: Financial earnings calls, testing transcription on domain-specific business jargon.
  • LibriSpeech: Clean and noisy read speech, serving as a benchmark for ASR performance on standard English.
  • GigaSpeech and SPGISpeech: Large-scale datasets of read and spontaneous speech with wide vocabulary coverage.
  • TED-LIUM: TED talks featuring diverse speakers, accents, and topics.
  • VoxPopuli: Multilingual and English political speeches and debates, adding variability in speaking styles.
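The leaderboard's headline number is an average WER over these test sets, so a model is rewarded for being consistently good rather than excelling on one corpus. Conceptually (the per-dataset scores below are made-up illustrative values, not leaderboard results):

```python
def macro_average_wer(per_dataset_wer):
    """Unweighted mean WER across benchmark test sets (in percent)."""
    return sum(per_dataset_wer.values()) / len(per_dataset_wer)

# Illustrative per-dataset WERs for a hypothetical model
scores = {
    "ami": 16.0,          # hardest: overlapping meeting speech
    "librispeech": 2.0,   # easiest: clean read speech
    "tedlium": 4.0,
    "voxpopuli": 8.0,
}
average = macro_average_wer(scores)  # 7.5
```

Because the mean is unweighted, a single hard dataset (here AMI) can dominate a model's overall ranking.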

Relevance: Open ASR serves as a general-purpose benchmark, covering a wide range of linguistic and acoustic conditions. While not heavily focused on accent diversity, it provides a baseline for measuring overall ASR robustness.

3.2 EdAcc

The Edinburgh International Accents of English Corpus (EdAcc) is designed to highlight the diversity of English as spoken globally. This dataset features approximately 40 hours of dyadic video-call conversations between participants with over 40 distinct self-reported English accents. These accents span both first-language (L1) and second-language (L2) speakers, with detailed linguistic background profiles for each participant.

  • Features: Natural conversational speech recorded in video-call settings, annotated for speaker accents and linguistic backgrounds.
  • Challenges: Includes accents such as Indian, Jamaican, and Nigerian English, which pose significant challenges for ASR systems trained primarily on standard US or UK English.

Relevance: EdAcc provides a focused evaluation of accent-specific robustness, offering insight into how well ASR models generalize to diverse English accents in real-world conversational settings.

3.3 AfriSpeech

The AfriSpeech dataset is a Pan-African accented English corpus featuring 200 hours of speech from 120 indigenous accents across 13 countries, such as Nigeria, Kenya, and South Africa. It includes speech from over 2,400 speakers and spans both clinical and general domains, making it one of the most comprehensive datasets for African-accented English.

For this evaluation, we focused on the Out-of-Distribution (OOD) test subset, which includes 20 distinct accents that were not present in the training set. This subset highlights accents that represent the greatest challenges for ASR systems, due to their low representation in global speech datasets. The OOD subset comprises the following accents:

  • Agatu, Angas, Bajju, Bini, Brass, Delta, Eggon, Ekene, Ekpeye, Gbagyi, Igarra, Ijaw-Nembe, Ikulu, Jaba, Jukun, Khana, Mada, Mwaghavul, Ukwuani, and Yoruba-Hausa.

  • Features:

    • Each accent in the OOD test subset is annotated with demographic and linguistic details, ensuring high-quality evaluation.
    • The accents reflect significant linguistic and cultural diversity, drawn from multiple language families such as Niger-Congo and Afro-Asiatic.
  • Challenges:

    • These accents are underrepresented in global ASR datasets, making them particularly challenging for pre-trained models that lack exposure to African linguistic features.
    • Names, phrases, and speech patterns unique to these accents test the generalization capability of ASR systems.

You can easily load these datasets with the datasets library!

from datasets import load_dataset

afrispeech = load_dataset("tobiolatunji/afrispeech-200", "all")
edacc = load_dataset("Steveeeeeeen/edacc_test", "Bulgarian_female")
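Once loaded, evaluation reduces to transcribing each clip and aggregating errors per accent. The aggregation step can be sketched as below; `per_accent_wer` and `wer_fn` are our own illustrative names (any per-utterance WER implementation can be plugged in), not functions from datasets or transformers:

```python
from collections import defaultdict

def per_accent_wer(samples, wer_fn):
    """Corpus-level WER per accent.

    samples: iterable of (accent, reference, hypothesis) triples.
    wer_fn:  per-utterance WER function, e.g. jiwer.wer.

    Weights each utterance by its reference length, so the result is
    total word errors / total reference words for each accent.
    """
    totals = defaultdict(lambda: [0.0, 0])  # accent -> [errors, ref_words]
    for accent, reference, hypothesis in samples:
        n_words = len(reference.split())
        totals[accent][0] += wer_fn(reference, hypothesis) * n_words
        totals[accent][1] += n_words
    return {accent: errors / n for accent, (errors, n) in totals.items()}
```

Length-weighted aggregation matters: averaging per-utterance WERs directly would let very short clips distort an accent's score.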

4. Results

4.1 Overall average results


The results clearly show that the large models consistently top the leaderboards across all datasets, demonstrating their robustness and accuracy. The distilled large models, such as distil-large-v2 and distil-large-v3, achieve performance close to their full-sized counterparts in some scenarios, particularly in the Open ASR dataset. However, a deeper look reveals that the distilled models are less robust for datasets emphasizing diverse accents. On the EdAcc dataset, which focuses on English accent diversity, the distilled models exhibit a slight performance drop compared to the full-sized models. This trend becomes even more pronounced with the AfriSpeech OOD subset, where the distilled models perform significantly worse, highlighting their limitations in handling underrepresented and challenging accents. This suggests that while distillation offers computational efficiency, it may come at the cost of robustness, particularly for datasets featuring high accent variability.

4.2 English only vs Multilingual

| Model | Open ASR Ranking | EdAcc Ranking | AfriSpeech OOD Ranking |
| --- | --- | --- | --- |
| openai/whisper-medium | 7 | 8 (-1) | 5 (+2) |
| distil/whisper-medium.en | 8 | 6 (+2) | 6 (+2) |
| openai/whisper-small.en | 10 | 11 (-1) | 10 |
| openai/whisper-small | 12 | 12 | 9 (+3) |
| distil/whisper-base.en | 13 | 13 | 14 (-1) |
| openai/whisper-base | 14 | 14 | 13 (+1) |
| distil/whisper-tiny.en | 15 | 15 | 16 (-1) |
| distil/whisper-tiny | 16 | 16 | 15 (+1) |
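The bracketed shifts in the rankings are simply rank differences relative to the Open ASR column; a quick sketch using the openai/whisper-medium row from the table above:

```python
def ranking_shift(open_asr_rank, other_rank):
    """Rank change relative to Open ASR.

    Positive = the model ranks higher (better) on the other benchmark;
    negative = it drops places.
    """
    return open_asr_rank - other_rank

# openai/whisper-medium: Open ASR rank 7, EdAcc rank 8, AfriSpeech OOD rank 5
edacc_shift = ranking_shift(7, 8)       # -1, as shown in the table
afrispeech_shift = ranking_shift(7, 5)  # +2, as shown in the table
```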

English-only models generally perform better on datasets closer to standard English, such as EdAcc, where accent diversity is moderate, while multilingual models excel on datasets with high accent variability, such as AfriSpeech OOD, showing stronger generalization to underrepresented accents. This suggests that multilingual models are better suited for global applications requiring broad accent coverage, while English-only models can be advantageous in standard English environments due to their efficiency!

4.3 Distil vs Full-sized

| Model | Open ASR Ranking | EdAcc Ranking | AfriSpeech OOD Ranking |
| --- | --- | --- | --- |
| openai/whisper-large-v3 | 1 | 1 | 1 |
| distil/whisper-distil-large-v3 | 2 | 3 (-1) | 7 (-5) |
| openai/whisper-large-v3-turbo | 3 | 2 (+1) | 2 (+1) |
| openai/whisper-large-v2 | 4 | 7 (-3) | 3 (+1) |
| distil/whisper-distil-large-v2 | 5 | 4 (+1) | 8 (-3) |
| openai/whisper-large-v1 | 6 | 5 (+1) | 4 (+2) |
| distil/whisper-distil-small.en | 9 | 9 | 12 (-3) |
| distil/whisper-distil-medium.en | 11 | 10 (+1) | 11 |

Distil models, which are trained exclusively on English, generally track their full-sized counterparts. However, they underperform markedly on out-of-distribution tasks like AfriSpeech, where they drop several ranks. While they maintain reasonable performance on in-distribution tasks like Open ASR and EdAcc, their ability to generalize appears compromised. It remains unclear whether this limitation stems primarily from the distillation process or from their English-only training. Smaller distil models trade robustness for efficiency, making them suitable for standard datasets but less effective on diverse or challenging conditions, especially underrepresented accents. Further investigation is needed to disentangle the impact of distillation and English-only training on their performance.

5. Conclusion

In addition to the observations regarding the strengths and weaknesses of Whisper models, it is crucial to highlight the interplay between training objectives and model robustness. The results suggest that while distillation improves computational efficiency, it may introduce trade-offs that compromise performance on diverse and underrepresented accents, such as those in AfriSpeech OOD. Furthermore, the fact that distil models are trained exclusively on English raises an important question: does their underperformance stem primarily from the distillation process or from the reduced linguistic diversity in their training data? Addressing this uncertainty requires further investigation into the role of training data diversity and the impact of knowledge distillation on generalization to challenging linguistic conditions.

This deeper understanding will guide the development of future ASR models that balance efficiency, robustness, and inclusivity, ensuring better handling of global linguistic diversity while maintaining high performance on standard datasets. Additionally, targeted improvements, such as fine-tuning on diverse accent datasets or hybrid approaches combining multilingual and distilled models, could be explored to address these limitations effectively.
