---
library_name: transformers
license: mit
datasets:
- h-j-han/SpeechQE-CoVoST2
language:
- es
- en
base_model:
- Unbabel/TowerInstruct-7B-v0.2
- openai/whisper-large-v2
---
# [SpeechQE: Estimating the Quality of Direct Speech Translation](https://aclanthology.org/2024.emnlp-main.1218)

This is an end-to-end model for the task of quality estimation for speech translation (SpeechQE).

| Task | E2E Model | Trained Domain |
|---|---|---|
| SpeechQE for English-to-German Speech Translation | [h-j-han/SpeechQE-TowerInstruct-7B-en2de](https://huggingface.co/h-j-han/SpeechQE-TowerInstruct-7B-en2de) | CoVoST2 |
| SpeechQE for Spanish-to-English Speech Translation | [h-j-han/SpeechQE-TowerInstruct-7B-es2en](https://huggingface.co/h-j-han/SpeechQE-TowerInstruct-7B-es2en) | CoVoST2 |

## Architecture and Training
Our design incorporates a pretrained speech encoder (whisper-large-v2) and a large language model (TowerInstruct-7B-v0.2) to leverage their existing capabilities in extracting high-quality audio features and handling translation-related tasks.

The model is trained with a two-phase approach. In the first phase, we train only an adapter on ASR and ST tasks while freezing the text LLM, so that training focuses solely on mapping between the speech and text modalities. In the second phase, we continue training on the SpeechQE task so that the LLM learns the previously unseen task of QE: the adapter pretrained in the first phase is frozen, while the text LLM is trained with LoRA (a hypothetical code sketch of this phase-2 setup appears at the end of this card).

## Setup
We provide code in the GitHub repo: https://github.com/h-j-han/SpeechQE
```bash
$ git clone https://github.com/h-j-han/SpeechQE.git
$ cd SpeechQE
```
```bash
$ conda create -n speechqe python=3.11 pytorch=2.0.1 pytorch-cuda=11.7 torchvision torchaudio -c pytorch -c nvidia
$ conda activate speechqe
$ pip install -r requirements.txt
```

## Download Audio Data
Download the audio data from Common Voice. Here, we use mozilla-foundation/common_voice_4_0.
```python
import datasets

# Download the Spanish split of Common Voice 4.0; the audio is cached under
# cache_dir, which later serves as BASE_AUDIO_PATH during evaluation.
cv4en = datasets.load_dataset(
    "mozilla-foundation/common_voice_4_0",
    "es",
    cache_dir='path/to/cv4/download',
)
```

## Evaluation
We provide the SpeechQE benchmark: [h-j-han/SpeechQE-CoVoST2](https://huggingface.co/datasets/h-j-han/SpeechQE-CoVoST2). `BASE_AUDIO_PATH` is the path to the downloaded Common Voice dataset. A short sketch for inspecting the benchmark directly appears at the end of this card.
```bash
$ python speechqe/score_speechqe.py \
    --speechqe_model=h-j-han/SpeechQE-TowerInstruct-7B-es2en \
    --dataset_name=h-j-han/SpeechQE-CoVoST2 \
    --base_audio_path=$BASE_AUDIO_PATH \
    --dataset_config_name=es2en \
    --test_split_name=test
```

## Reference
Please find more details in [this EMNLP 2024 paper](https://aclanthology.org/2024.emnlp-main.1218):
```bibtex
@misc{han2024speechqe,
      title={SpeechQE: Estimating the Quality of Direct Speech Translation},
      author={HyoJung Han and Kevin Duh and Marine Carpuat},
      year={2024},
      eprint={2410.21485},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
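For readers who want a concrete picture of the phase-2 setup described above (adapter frozen, text LLM trained with LoRA), here is a hypothetical sketch using the `peft` library. The adapter stand-in and all LoRA hyperparameters are illustrative assumptions, not the configuration from the paper; the actual training code lives in the GitHub repo.

```python
# Hypothetical sketch of the phase-2 freeze/train split (not the authors' code).
import torch.nn as nn
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Stand-in for the phase-1 speech adapter; the real module is defined in the repo.
adapter = nn.Linear(1280, 4096)  # whisper-large-v2 hidden size -> 7B LLM hidden size
for p in adapter.parameters():
    p.requires_grad = False  # phase 2: the pretrained adapter stays frozen

# Wrap the text LLM with LoRA so only low-rank updates are trained.
llm = AutoModelForCausalLM.from_pretrained("Unbabel/TowerInstruct-7B-v0.2")
lora_config = LoraConfig(  # hyperparameters here are illustrative, not the paper's
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
llm = get_peft_model(llm, lora_config)
llm.print_trainable_parameters()  # only LoRA parameters should be trainable
```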
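Before running the scoring script, it can help to sanity-check the benchmark itself. The following is a minimal sketch, assuming only the `datasets` library from the Setup step; since the exact field names are defined by the benchmark, the sketch prints them rather than guessing.

```python
# Minimal sketch: inspect the SpeechQE benchmark used by score_speechqe.py.
import datasets

benchmark = datasets.load_dataset(
    "h-j-han/SpeechQE-CoVoST2",
    "es2en",       # same value as --dataset_config_name
    split="test",  # same value as --test_split_name
)
print(benchmark)               # number of rows and feature schema
print(benchmark.column_names)  # available fields
print(benchmark[0])            # one example record
```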