---
license: apache-2.0
datasets:
- mict-zhaw/chall
language:
- en
pipeline_tag: automatic-speech-recognition
---
# Model Card for chall_wav2vec2_xlsr_300m
This model, named `ChaLL-300M`, is an Automatic Speech Recognition (ASR) system specifically designed and fine-tuned to transcribe the spontaneous
speech of young English learners while preserving their errors.
It was developed as part of a project to improve speaking practice and provide corrective feedback in language-learning environments.
## Model Details
### Model Description
`ChaLL-300M` is an ASR model fine-tuned from the Wav2Vec-XLSR-300M base model.
It represents the best-performing fold (Fold 5) from the k-fold cross-validation training process described in the [paper](https://openreview.net/forum?id=XPIwvlqIfI).
It addresses the unique challenges of transcribing the spontaneous speech of young English learners by preserving their grammatical, lexical, and pronunciation errors.
- **Developed by:** Zurich University of Applied Sciences, in collaboration with Swiss public schools
- **Funded by:** Swiss Innovation Agency (Innosuisse)
- **Shared by:** Zurich University of Applied Sciences
- **Model type:** Automatic Speech Recognition (ASR)
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** Wav2Vec-XLSR-300M
### Model Sources
<!-- Provide the basic links for the model. -->
- **Repository:** https://github.com/mict-zhaw/chall_e2e_stt
- **Paper:** [Error-preserving Automatic Speech Recognition of Young English Learners' Language](https://openreview.net/forum?id=XPIwvlqIfI)
## Uses
### Direct Use
`ChaLL-300M` can be directly used for transcribing the English speech of young learners,
particularly in educational software designed to provide real-time feedback on language learning.
Its ability to accurately preserve and transcribe grammatical, lexical, and pronunciation errors
makes it especially useful for applications aimed at improving language acquisition.
### Downstream Use
`ChaLL-300M` can be integrated into larger systems for more complex applications.
Examples include:
- **Voice-based Chatbot for Language Learners (ChaLL)**: ChaLL is a chatbot-based solution designed to enhance language learners' speaking skills through personalized, interactive practice. Unlike traditional language-learning methods, it focuses on real-world communication.
- *Interactive Speaking*: Engages learners in conversation, providing real-time feedback.
- *Personalized Feedback*: Uses the `ChaLL-300M` model to preserve and correct speech errors.
- *Supportive Environment*: Mimics real-world scenarios for practical learning.
- *Enhanced Competence*: Builds fluency and confidence for effective communication.
### Out-of-Scope Use
The `ChaLL-300M` model is not suitable for certain applications:
- **Transcription of Error-Free Speech**: The model is optimized for preserving errors in speech; therefore, it is not ideal for contexts where error-free transcription of adult native speakers' English speech is required.
- **High-Stakes Professional Transcriptions**: The model should not be used in professional or formal contexts where the accuracy of fluent or highly technical speech is crucial.
- **Bias-Sensitive Applications**: Avoid using the model in applications where the bias towards young learners’ speech patterns could negatively impact the results, such as in automated systems assessing adult speech or in regions with significantly different accents or dialects than those present in the training data.
## Bias, Risks, and Limitations
While `ChaLL-300M` is tailored to preserve the speech errors of young English learners,
it may not perform well with adult native speaker data or read-aloud tasks.
Additionally, the model could potentially amplify biases present in the training data,
such as regional accents or typical errors made by specific age groups.
There is also a risk that the model might not handle code-switching or non-English phrases accurately.
### Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.
It's recommended to carefully evaluate the context and specific requirements before deploying it in an application.
## How to Get Started with the Model
Use the code below to get started with the `ChaLL-300M` model:
```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("mict-zhaw/chall_wav2vec2_xlsr_300m")
model = Wav2Vec2ForCTC.from_pretrained("mict-zhaw/chall_wav2vec2_xlsr_300m")

# Load your own audio file and resample to the 16 kHz the model expects
waveform, sample_rate = torchaudio.load("path_to_your_audio_file.wav")
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

# The processor expects a 1-D waveform per utterance; use the first channel.
# Keep the batch dimension on input_values so the model accepts it directly.
input_values = processor(waveform[0], sampling_rate=16000, return_tensors="pt").input_values

# Perform transcription
with torch.no_grad():
    logits = model(input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription)
```
## Training Details
### Training Data
The training data consists of 85 hours of spontaneous English speech recordings from young learners in Swiss public schools,
encompassing grades 4 to 6. The dataset contains 45,004 utterances from 327 distinct speakers.
The complete dataset cannot be freely accessed due to privacy concerns but can be obtained through a collaboration agreement.
Find more information in
[mict-zhaw/chall](https://huggingface.co/datasets/mict-zhaw/chall).
### Training Procedure
#### Preprocessing
- Removal of error annotations and transcript conventions
- Conversion to lowercase
- Standardization of text (e.g., removing unnecessary spaces, normalizing special characters)
- Transformation of digits to words using `num2words`
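As an illustration, this normalization could be sketched as below; the bracketed annotation pattern and the exact cleanup rules are assumptions made for this example, not the project's documented transcript conventions:
```python
import re

from num2words import num2words


def normalize_transcript(text: str) -> str:
    # Remove error annotations / transcript conventions
    # (bracketed markup is assumed here for illustration)
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # Convert to lowercase
    text = text.lower()
    # Transform digits to words, e.g. "3" -> "three"
    text = re.sub(r"\d+", lambda m: num2words(int(m.group())), text)
    # Standardize whitespace
    return re.sub(r"\s+", " ", text).strip()


print(normalize_transcript("I have [err:gr] 3 apples"))  # -> "i have three apples"
```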
#### Training Hyperparameters
- **Training regime:** fp32
- **Learning rate:** 3e-5
- **Batch size per device:** 14
- **Gradient accumulation steps:** 15 (total batch size corresponds to approximately 2 hours of audio)
- **Optimizer:** 8-bit AdamW
- **Training steps:** 4000
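For orientation, a hypothetical Hugging Face `TrainingArguments` configuration matching these hyperparameters might look as follows; the authors' actual training script lives in the linked repository and may differ:
```python
from transformers import TrainingArguments

# A sketch only: the argument values mirror the hyperparameters listed above,
# while output_dir and any omitted settings are placeholders.
training_args = TrainingArguments(
    output_dir="chall_wav2vec2_xlsr_300m",
    learning_rate=3e-5,
    per_device_train_batch_size=14,
    gradient_accumulation_steps=15,
    max_steps=4000,
    optim="adamw_bnb_8bit",  # 8-bit AdamW via bitsandbytes
    fp16=False,              # fp32 training regime
)
```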
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
Testing uses the same corpus as training, partitioned into five speaker-disjoint folds for cross-validation.
Each fold includes a variety of grades (4 to 6) to simulate a real-world scenario in which the ASR system is exposed to new, unseen speakers.
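A toy sketch of such a speaker-disjoint five-fold split, assuming utterances are grouped by speaker ID (illustrative only, not the authors' actual fold assignment):
```python
from sklearn.model_selection import GroupKFold

# Ten toy utterances from five speakers; grouping by speaker guarantees
# that no speaker appears in both the training and the test fold.
utterances = [f"utt_{i}" for i in range(10)]
speakers = ["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4", "s5", "s5"]

for fold, (train_idx, test_idx) in enumerate(
    GroupKFold(n_splits=5).split(utterances, groups=speakers), start=1
):
    test_speakers = sorted({speakers[i] for i in test_idx})
    print(f"Fold {fold}: unseen test speakers = {test_speakers}")
```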
#### Factors
- Grade levels (4th to 6th grade)
- School area codes
#### Metrics
To measure error preservation, we use the error annotations that were manually added to each utterance and a custom phonetic word-level alignment algorithm.
This algorithm aligns two or more sequences (e.g., a reference and one or multiple hypotheses), identifying matches, substitutions (S), insertions (I), and deletions (D) at the word level.
Our metric, WEPR (Word-Based Error Preservation Rate), considers only those word pairs where the reference word contains an error annotation.
WEPR is calculated according to the formula:

$$\mathrm{WEPR} = \frac{S + D}{N}$$

where $N$ is the number of word pairs whose reference word carries an error annotation.
In addition to WEPR, we also compute the following general ASR metrics using all words in the utterance:
Word Error Rate (WER), Character Error Rate (CER), and character n-gram F-Score (chrF).
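A minimal sketch of how WEPR could be computed from a precomputed alignment follows; the tuple representation and the leading `*` marker for error-annotated reference words are hypothetical conventions for this example, not the paper's data format:
```python
def wepr(alignment):
    """Alignment entries are (reference_word, hypothesis_word, op) tuples,
    with op one of "match", "sub", "ins", "del"; error-annotated reference
    words are marked with a leading "*" (a convention for this sketch)."""
    n = s_plus_d = 0
    for ref, _hyp, op in alignment:
        if ref is None or not ref.startswith("*"):
            continue  # only error-annotated reference words enter the metric
        n += 1
        if op in ("sub", "del"):
            s_plus_d += 1  # the learner's error was not preserved
    return s_plus_d / n if n else 0.0


# "*goed" reproduced verbatim preserves the error; "*he" replaced by "she" does not.
alignment = [("*goed", "goed", "match"), ("*he", "she", "sub"), ("home", "home", "match")]
print(wepr(alignment))  # 0.5
```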
### Results
#### Summary
The `ChaLL-300M` model achieves strong performance in preserving the errors made by young English learners,
significantly outperforming baseline models in terms of WEPR.
The model also achieves competitive WER, CER, and chrF scores, indicating its effectiveness for the target demographic:
- **WER:** 0.30 ± 0.01 (lower is better)
- **CER:** 0.16 ± 0.01 (lower is better)
- **chrF:** 0.68 ± 0.01 (higher is better)
- **WEPR:** 0.38 ± 0.03 (lower is better)
## Model Examination
Further qualitative analysis on preserved errors indicates that the model effectively retains grammatical, lexical, and pronunciation errors
made by young learners. This aspect makes it particularly suitable for educational applications focused on language learning and correction.
## Citation
**BibTeX:**
```bibtex
@inproceedings{michot2024errorpreserving,
  title={Error-preserving Automatic Speech Recognition of Young English Learners' Language},
  author={Michot, Janick and Hürlimann, Manuela and Deriu, Jan and Sauer, Luzia and Mlynchyk, Katsiaryna and Cieliebak, Mark},
  booktitle={Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics},
  year={2024},
  url={https://openreview.net/forum?id=XPIwvlqIfI}
}
```
## Model Card Contact
[@mict-zhaw](https://github.com/mict-zhaw)