---
license: apache-2.0
datasets:
- mict-zhaw/chall
language:
- en
pipeline_tag: automatic-speech-recognition
---
# Model Card for chall_wav2vec2_xlsr_300m
This model, named `ChaLL-300M`, is an Automatic Speech Recognition (ASR) system specifically designed and fine-tuned to transcribe the spontaneous
speech of young English learners while preserving their errors.
It was developed as part of a project to improve speaking practice and provide corrective feedback in language-learning environments.
## Model Details
### Model Description
`ChaLL-300M` is an ASR model fine-tuned from the Wav2Vec-XLSR-300M base model.
It represents the best-performing fold (Fold 5) from the k-fold cross-validation training process described in the [paper](https://openreview.net/forum?id=XPIwvlqIfI).
It addresses the unique challenges of transcribing the spontaneous speech of young English learners by preserving their grammatical, lexical, and pronunciation errors.
- **Developed by:** Zurich University of Applied Sciences, in collaboration with Swiss public schools
- **Funded by:** Swiss Innovation Agency (Innosuisse)
- **Shared by:** Zurich University of Applied Sciences
- **Model type:** Automatic Speech Recognition (ASR)
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** Wav2Vec-XLSR-300M
### Model Sources
<!-- Provide the basic links for the model. -->
- **Repository:** https://github.com/mict-zhaw/chall_e2e_stt
- **Paper:** [Error-preserving Automatic Speech Recognition of Young English Learners' Language](https://openreview.net/forum?id=XPIwvlqIfI)
## Uses
### Direct Use
`ChaLL-300M` can be directly used for transcribing the English speech of young learners,
particularly in educational software designed to provide real-time feedback on language learning.
Its ability to accurately preserve and transcribe grammatical, lexical, and pronunciation errors
makes it especially useful for applications aimed at improving language acquisition.
### Downstream Use
`ChaLL-300M` can be integrated into larger systems for more complex applications.
Examples include:
- **Voice-based Chatbot for Language Learners (ChaLL)**: ChaLL is a chatbot-based solution designed to enhance language learners' speaking skills through personalized, interactive practice. Unlike traditional language-learning methods, it focuses on real-world communication.
- *Interactive Speaking*: Engages learners in conversation, providing real-time feedback.
- *Personalized Feedback*: Uses the `ChaLL-300M` model to preserve and correct speech errors.
- *Supportive Environment*: Mimics real-world scenarios for practical learning.
- *Enhanced Competence*: Builds fluency and confidence for effective communication.
### Out-of-Scope Use
The `ChaLL-300M` model is not suitable for certain applications:
- **Transcription of Error-Free Speech**: The model is optimized for preserving errors in speech; therefore, it is not ideal for contexts where error-free transcription of adult native speakers' English speech is required.
- **High-Stakes Professional Transcriptions**: The model should not be used in professional or formal contexts where the accuracy of fluent or highly technical speech is crucial.
- **Bias-Sensitive Applications**: Avoid using the model in applications where the bias towards young learners’ speech patterns could negatively impact the results, such as in automated systems assessing adult speech or in regions with significantly different accents or dialects than those present in the training data.
## Bias, Risks, and Limitations
While `ChaLL-300M` is tailored to preserve the speech errors of young English learners,
it may not perform well with adult native speaker data or read-aloud tasks.
Additionally, the model could potentially amplify biases present in the training data,
such as regional accents or typical errors made by specific age groups.
There is also a risk that the model might not handle code-switching or non-English phrases accurately.
### Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.
It's recommended to carefully evaluate the context and specific requirements before deploying it in an application.
## How to Get Started with the Model
Use the code below to get started with the `ChaLL-300M` model:
```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("mict-zhaw/chall_wav2vec2_xlsr_300m")
model = Wav2Vec2ForCTC.from_pretrained("mict-zhaw/chall_wav2vec2_xlsr_300m")

# Load your own audio file and resample to the 16 kHz the model expects
waveform, sample_rate = torchaudio.load("path_to_your_audio_file.wav")
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

# The processor expects a 1-D waveform per utterance; use the first channel.
# Keep the batch dimension on input_values so the model accepts it directly.
input_values = processor(waveform[0], sampling_rate=16000, return_tensors="pt").input_values

# Perform transcription
with torch.no_grad():
    logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription)
```
## Training Details
### Training Data
The training data consists of 85 hours of spontaneous English speech recordings from young learners in Swiss public schools,
encompassing grades 4 to 6. The dataset contains 45,004 utterances from 327 distinct speakers.
The complete dataset cannot be freely accessed due to privacy concerns but can be obtained through a collaboration agreement.
Find more information in
[mict-zhaw/chall](https://huggingface.co/datasets/mict-zhaw/chall).
### Training Procedure
#### Preprocessing
- Removal of error annotations and transcript conventions
- Conversion to lowercase
- Standardization of text (e.g., removing unnecessary spaces, normalizing special characters)
- Transformation of digits to words using `num2words`
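As an illustration, this normalization could be sketched as below; the bracketed annotation pattern and the exact cleanup rules are assumptions made for this example, not the project's documented transcript conventions:
```python
import re

from num2words import num2words


def normalize_transcript(text: str) -> str:
    # Remove error annotations / transcript conventions
    # (bracketed markup is assumed here for illustration)
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # Convert to lowercase
    text = text.lower()
    # Transform digits to words, e.g. "3" -> "three"
    text = re.sub(r"\d+", lambda m: num2words(int(m.group())), text)
    # Standardize whitespace
    return re.sub(r"\s+", " ", text).strip()


print(normalize_transcript("I have [err:gr] 3 apples"))  # -> "i have three apples"
```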
#### Training Hyperparameters
- **Training regime:** fp32
- **Learning rate:** 3e-5
- **Batch size per device:** 14
- **Gradient accumulation steps:** 15 (total batch size corresponds to approximately 2 hours of audio)
- **Optimizer:** 8-bit AdamW
- **Training steps:** 4000
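For orientation, a hypothetical Hugging Face `TrainingArguments` configuration matching these hyperparameters might look as follows; the authors' actual training script lives in the linked repository and may differ:
```python
from transformers import TrainingArguments

# A sketch only: the argument values mirror the hyperparameters listed above,
# while output_dir and any omitted settings are placeholders.
training_args = TrainingArguments(
    output_dir="chall_wav2vec2_xlsr_300m",
    learning_rate=3e-5,
    per_device_train_batch_size=14,
    gradient_accumulation_steps=15,
    max_steps=4000,
    optim="adamw_bnb_8bit",  # 8-bit AdamW via bitsandbytes
    fp16=False,              # fp32 training regime
)
```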
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
Testing uses the same corpus as training, partitioned into five speaker-disjoint folds for cross-validation.
Each fold includes a variety of grades (4 to 6) to simulate a real-world scenario in which the ASR system is exposed to new, unseen speakers.
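A toy sketch of such a speaker-disjoint five-fold split, assuming utterances are grouped by speaker ID (illustrative only, not the authors' actual fold assignment):
```python
from sklearn.model_selection import GroupKFold

# Ten toy utterances from five speakers; grouping by speaker guarantees
# that no speaker appears in both the training and the test fold.
utterances = [f"utt_{i}" for i in range(10)]
speakers = ["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4", "s5", "s5"]

for fold, (train_idx, test_idx) in enumerate(
    GroupKFold(n_splits=5).split(utterances, groups=speakers), start=1
):
    test_speakers = sorted({speakers[i] for i in test_idx})
    print(f"Fold {fold}: unseen test speakers = {test_speakers}")
```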
#### Factors
- Grade levels (4th to 6th grade)
- School area codes
#### Metrics
To measure error preservation, we use the error annotations that were manually added to each utterance and a custom phonetic word-level alignment algorithm.
This algorithm aligns two or more sequences (e.g., a reference and one or multiple hypotheses), identifying matches, substitutions (S), insertions (I), and deletions (D) at the word level.
Our metric, WEPR (Word-Based Error Preservation Rate), considers only those word pairs where the reference word contains an error annotation.
WEPR is calculated according to the formula:

$$\mathrm{WEPR} = \frac{S + D}{N}$$

where $N$ is the number of word pairs whose reference word carries an error annotation.
In addition to WEPR, we also compute the following general ASR metrics using all words in the utterance:
Word Error Rate (WER), Character Error Rate (CER), and character n-gram F-Score (chrF).
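A minimal sketch of how WEPR could be computed from a precomputed alignment follows; the tuple representation and the leading `*` marker for error-annotated reference words are hypothetical conventions for this example, not the paper's data format:
```python
def wepr(alignment):
    """Alignment entries are (reference_word, hypothesis_word, op) tuples,
    with op one of "match", "sub", "ins", "del"; error-annotated reference
    words are marked with a leading "*" (a convention for this sketch)."""
    n = s_plus_d = 0
    for ref, _hyp, op in alignment:
        if ref is None or not ref.startswith("*"):
            continue  # only error-annotated reference words enter the metric
        n += 1
        if op in ("sub", "del"):
            s_plus_d += 1  # the learner's error was not preserved
    return s_plus_d / n if n else 0.0


# "*goed" reproduced verbatim preserves the error; "*he" replaced by "she" does not.
alignment = [("*goed", "goed", "match"), ("*he", "she", "sub"), ("home", "home", "match")]
print(wepr(alignment))  # 0.5
```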
### Results
#### Summary
The `ChaLL-300M` model achieves strong performance in preserving the errors made by young English learners,
significantly outperforming baseline models in terms of WEPR.
The model also achieves competitive WER, CER, and chrF scores, indicating its effectiveness for the target demographic:
- **WER:** 0.30 ± 0.01 (lower is better)
- **CER:** 0.16 ± 0.01 (lower is better)
- **chrF:** 0.68 ± 0.01 (higher is better)
- **WEPR:** 0.38 ± 0.03 (lower is better)
## Model Examination
Further qualitative analysis on preserved errors indicates that the model effectively retains grammatical, lexical, and pronunciation errors
made by young learners. This aspect makes it particularly suitable for educational applications focused on language learning and correction.
## Citation
**BibTeX:**
```bibtex
@inproceedings{michot2024errorpreserving,
  title={Error-preserving Automatic Speech Recognition of Young English Learners' Language},
  author={Michot, Janick and Hürlimann, Manuela and Deriu, Jan and Sauer, Luzia and Mlynchyk, Katsiaryna and Cieliebak, Mark},
  booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics},
  year={2024},
  url={https://openreview.net/forum?id=XPIwvlqIfI}
}
```
## Model Card Contact
[@mict-zhaw](https://github.com/mict-zhaw)