---
library_name: transformers
license: mit
datasets:
- h-j-han/SpeechQE-CoVoST2
language:
- es
- en
base_model:
- Unbabel/TowerInstruct-7B-v0.2
- openai/whisper-large-v2
---
# [SpeechQE: Estimating the Quality of Direct Speech Translation](https://aclanthology.org/2024.emnlp-main.1218)

This is an end-to-end (E2E) model for the task of quality estimation for speech translation (SpeechQE).

| Task | E2E Model | Trained Domain |
|---|---|---|
| SpeechQE for English-to-German Speech Translation | [h-j-han/SpeechQE-TowerInstruct-7B-en2de](https://huggingface.co/h-j-han/SpeechQE-TowerInstruct-7B-en2de) | CoVoST2 |
| SpeechQE for Spanish-to-English Speech Translation | [h-j-han/SpeechQE-TowerInstruct-7B-es2en](https://huggingface.co/h-j-han/SpeechQE-TowerInstruct-7B-es2en) | CoVoST2 |
## Architecture and Training

Our design incorporates a pretrained speech encoder (whisper-large-v2) and a large language model (TowerInstruct-7B-v0.2) to leverage their existing capabilities in extracting high-quality audio features and handling translation-related tasks.

The model is trained with a two-phase approach. In the first phase, we train only an adapter on ASR and ST tasks while freezing the text LLM, so that training focuses solely on mapping between the speech and text modalities. In the second phase, we continue training on the SpeechQE task so that the LLM learns the unseen task of QE; here the adapter pre-trained in the first phase is frozen, while the text LLM is trained with LoRA.
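For concreteness, here is a minimal Python sketch of that setup. The `Adapter` module, its projection design, and the LoRA hyperparameters are illustrative assumptions, not the exact implementation from the paper; see the GitHub repo below for the actual code.

```python
# Sketch of the E2E SpeechQE setup described above.
# The Adapter design and LoRA settings are assumptions for illustration.
import torch.nn as nn
from transformers import AutoModelForCausalLM, WhisperModel
from peft import LoraConfig, get_peft_model

speech_encoder = WhisperModel.from_pretrained("openai/whisper-large-v2").encoder
llm = AutoModelForCausalLM.from_pretrained("Unbabel/TowerInstruct-7B-v0.2")

class Adapter(nn.Module):
    """Projects speech-encoder states into the LLM embedding space (hypothetical design)."""

    def __init__(self, d_speech: int, d_llm: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_speech, d_llm),
            nn.GELU(),
            nn.Linear(d_llm, d_llm),
        )

    def forward(self, speech_states):
        return self.proj(speech_states)

adapter = Adapter(speech_encoder.config.d_model, llm.config.hidden_size)

# Phase 1 (ASR + ST): train only the adapter; speech encoder and text LLM stay frozen.
for module in (speech_encoder, llm):
    for p in module.parameters():
        p.requires_grad = False

# Phase 2 (SpeechQE): freeze the pre-trained adapter, train the text LLM with LoRA.
for p in adapter.parameters():
    p.requires_grad = False
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
llm = get_peft_model(llm, lora_config)
```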
## Setup

We provide code in the GitHub repo: https://github.com/h-j-han/SpeechQE
```bash
$ git clone https://github.com/h-j-han/SpeechQE.git
$ cd SpeechQE
```
```bash
$ conda create -n speechqe python=3.11 pytorch=2.0.1 pytorch-cuda=11.7 torchvision torchaudio -c pytorch -c nvidia
$ conda activate speechqe
$ pip install -r requirements.txt
```
## Download Audio Data

Download the audio data from Common Voice. Here, we use mozilla-foundation/common_voice_4_0.
```python
import datasets

# Download the Spanish ("es") split of Common Voice 4.0, which provides
# the source audio for the es2en benchmark.
cv4en = datasets.load_dataset(
    "mozilla-foundation/common_voice_4_0", "es", cache_dir='path/to/cv4/download',
)
```
## Evaluation

We provide the SpeechQE benchmark: [h-j-han/SpeechQE-CoVoST2](https://huggingface.co/datasets/h-j-han/SpeechQE-CoVoST2).
`BASE_AUDIO_PATH` is the path to the downloaded Common Voice dataset.
```bash
$ python speechqe/score_speechqe.py \
    --speechqe_model=h-j-han/SpeechQE-TowerInstruct-7B-es2en \
    --dataset_name=h-j-han/SpeechQE-CoVoST2 \
    --base_audio_path=$BASE_AUDIO_PATH \
    --dataset_config_name=es2en \
    --test_split_name=test
```
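To inspect the benchmark itself, you can also load it directly with `datasets`; the config and split names below match the arguments passed to the scoring script above.

```python
import datasets

# Load the Spanish-to-English test split of the SpeechQE benchmark.
speechqe = datasets.load_dataset("h-j-han/SpeechQE-CoVoST2", "es2en", split="test")
print(speechqe)     # summary of features and number of examples
print(speechqe[0])  # one benchmark example
```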
## Reference

Please find details in [this EMNLP 2024 paper](https://aclanthology.org/2024.emnlp-main.1218):
```bibtex
@misc{han2024speechqe,
      title={SpeechQE: Estimating the Quality of Direct Speech Translation},
      author={HyoJung Han and Kevin Duh and Marine Carpuat},
      year={2024},
      eprint={2410.21485},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```