---
base_model:
- openai/whisper-large-v3
datasets:
- mozilla-foundation/common_voice_11_0
- ajd12342/paraspeechcaps
language:
- en
license: openrail
metrics:
- accuracy
pipeline_tag: audio-classification
tags:
- model_hub_mixin
- pytorch_model_hub_mixin
- speaker_dialect_classification
library_name: transformers
---
# Whisper-Large v3 for English Dialect Classification
# Model Description
This model includes the implementation of English dialect classification described in <a href="https://arxiv.org/abs/2508.01691"><strong>Voxlect: A Speech Foundation Model Benchmark for Modeling Dialects and Regional Languages Around the Globe</strong></a>.

GitHub repository: https://github.com/tiantiaf0627/voxlect
The included English dialects are:
```
[
'East Asia',
'English',
'Germanic',
'Irish',
'North America',
'Northern Irish',
'Oceania',
'Other',
'Romance',
'Scottish',
'Semitic',
'Slavic',
'South African',
'Southeast Asia',
'South Asia',
'Welsh'
]
```
Compared to the Vox-Profile English accent/dialect models, this model is trained with additional speech data from TIMIT and ParaSpeechCaps.
# How to use this model
## Download repo
```bash
git clone git@github.com:tiantiaf0627/voxlect
```
## Install the package
```bash
conda create -n voxlect python=3.8
cd voxlect
pip install -e .
```
## Load the model
```python
# Load libraries
import torch
import torch.nn.functional as F
from src.model.dialect.whisper_dialect import WhisperWrapper
# Find device
device = torch.device("cuda") if torch.cuda.is_available() else "cpu"
# Load the model from Hugging Face
model = WhisperWrapper.from_pretrained("tiantiaf/voxlect-english-dialect-whisper-large-v3").to(device)
model.eval()
```
## Prediction
```python
# Label List
dialect_list = [
'East Asia',
'English',
'Germanic',
'Irish',
'North America',
'Northern Irish',
'Oceania',
'Other',
'Romance',
'Scottish',
'Semitic',
'Slavic',
'South African',
'Southeast Asia',
'South Asia',
'Welsh'
]
# Load data; here we use zeros as a placeholder example
# Our training data filters out audio shorter than 3 seconds (unreliable predictions) and longer than 15 seconds (computation limitation)
# So prepare your audio as 16 kHz, mono, with a maximum length of 15 seconds
max_audio_length = 15 * 16000
data = torch.zeros([1, 16000]).float().to(device)[:, :max_audio_length]
logits, embeddings = model(data, return_feature=True)
# Probability and output
dialect_prob = F.softmax(logits, dim=1)
print(dialect_list[torch.argmax(dialect_prob).detach().cpu().item()])
```
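To run the model on a real recording instead of the zero tensor above, the minimal sketch below (assuming `torchaudio` is installed; the file path `speech.wav` is a placeholder, not part of the Voxlect repo) converts the audio to 16 kHz mono, truncates it to 15 seconds, and prints the three most likely dialects with their probabilities. It reuses `model`, `device`, and `dialect_list` from the snippets above; recordings shorter than 3 seconds may yield unreliable predictions, as noted in the comments.
```python
# Minimal sketch: prepare a real recording for the model
# (assumes torchaudio; "speech.wav" is a placeholder path)
import torch
import torch.nn.functional as F
import torchaudio

waveform, sample_rate = torchaudio.load("speech.wav")
# Convert to mono by averaging channels if necessary
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)
# Resample to the 16 kHz rate the model expects
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
# Keep at most 15 seconds (see the training-data filtering note above)
max_audio_length = 15 * 16000
data = waveform[:, :max_audio_length].float().to(device)

with torch.no_grad():
    logits, embeddings = model(data, return_feature=True)
dialect_prob = F.softmax(logits, dim=1)

# Print the top-3 predicted dialects and their probabilities
top_prob, top_idx = torch.topk(dialect_prob[0], k=3)
for p, i in zip(top_prob.tolist(), top_idx.tolist()):
    print(f"{dialect_list[i]}: {p:.3f}")
```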
**Responsible Use**: Users should respect the privacy and consent of the data subjects, and adhere to the relevant laws and regulations in their jurisdictions when using Voxlect.
## If you have any questions, please contact: Tiantian Feng ([email protected])
❌ **Out-of-Scope Use**
- Clinical or diagnostic applications
- Surveillance
- Privacy-invasive applications
- No commercial use
#### If you like our work or use our models in your work, kindly cite the following. We appreciate your recognition!
```
@article{feng2025voxlect,
title={Voxlect: A Speech Foundation Model Benchmark for Modeling Dialects and Regional Languages Around the Globe},
author={Feng, Tiantian and Huang, Kevin and Xu, Anfeng and Shi, Xuan and Lertpetchpun, Thanathai and Lee, Jihwan and Lee, Yoonjeong and Byrd, Dani and Narayanan, Shrikanth},
journal={arXiv preprint arXiv:2508.01691},
year={2025}
}
``` |