---
library_name: transformers
license: mit
datasets:
- h-j-han/SpeechQE-CoVoST2
language:
- es
- en
base_model:
- Unbabel/TowerInstruct-7B-v0.2
- openai/whisper-large-v2
---
# [SpeechQE: Estimating the Quality of Direct Speech Translation](https://aclanthology.org/2024.emnlp-main.1218)

This is an end-to-end (E2E) model for the task of quality estimation for speech translation (SpeechQE).

| Task | E2E Model | Trained Domain |
|---|---|---|
| SpeechQE for English-to-German Speech Translation | [h-j-han/SpeechQE-TowerInstruct-7B-en2de](https://huggingface.co/h-j-han/SpeechQE-TowerInstruct-7B-en2de) | CoVoST2 |
| SpeechQE for Spanish-to-English Speech Translation | [h-j-han/SpeechQE-TowerInstruct-7B-es2en](https://huggingface.co/h-j-han/SpeechQE-TowerInstruct-7B-es2en) | CoVoST2 |
## Architecture and Training

Our design incorporates a pretrained speech encoder (whisper-large-v2) and a large language model (TowerInstruct-7B-v0.2) to leverage their existing capabilities in extracting high-quality audio features and handling translation-related tasks.

The model is trained with a two-phase approach. In the first phase, we train only an adapter on ASR and ST tasks while freezing the text LLM, so that training focuses solely on mapping between the speech and text modalities. In the second phase, we continue training on the SpeechQE task so that the LLM learns the unseen task of QE; here the adapter pre-trained in the first phase is frozen, while the text LLM is trained with LoRA.
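For concreteness, here is a minimal Python sketch of that setup. The `Adapter` module, its projection design, and the LoRA hyperparameters are illustrative assumptions, not the exact implementation from the paper; see the GitHub repo below for the actual code.

```python
# Sketch of the E2E SpeechQE setup described above.
# The Adapter design and LoRA settings are assumptions for illustration.
import torch.nn as nn
from transformers import AutoModelForCausalLM, WhisperModel
from peft import LoraConfig, get_peft_model

speech_encoder = WhisperModel.from_pretrained("openai/whisper-large-v2").encoder
llm = AutoModelForCausalLM.from_pretrained("Unbabel/TowerInstruct-7B-v0.2")

class Adapter(nn.Module):
    """Projects speech-encoder states into the LLM embedding space (hypothetical design)."""

    def __init__(self, d_speech: int, d_llm: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_speech, d_llm),
            nn.GELU(),
            nn.Linear(d_llm, d_llm),
        )

    def forward(self, speech_states):
        return self.proj(speech_states)

adapter = Adapter(speech_encoder.config.d_model, llm.config.hidden_size)

# Phase 1 (ASR + ST): train only the adapter; speech encoder and text LLM stay frozen.
for module in (speech_encoder, llm):
    for p in module.parameters():
        p.requires_grad = False

# Phase 2 (SpeechQE): freeze the pre-trained adapter, train the text LLM with LoRA.
for p in adapter.parameters():
    p.requires_grad = False
lora_config = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
llm = get_peft_model(llm, lora_config)
```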
## Setup

We provide code in the GitHub repo: https://github.com/h-j-han/SpeechQE
```bash
$ git clone https://github.com/h-j-han/SpeechQE.git
$ cd SpeechQE
```
```bash
$ conda create -n speechqe python=3.11 pytorch=2.0.1 pytorch-cuda=11.7 torchvision torchaudio -c pytorch -c nvidia
$ conda activate speechqe
$ pip install -r requirements.txt
```
## Download Audio Data

Download the audio data from Common Voice. Here, we use mozilla-foundation/common_voice_4_0.
```python
import datasets

# Download the Spanish ("es") split of Common Voice 4.0, which provides
# the source audio for the es2en benchmark.
cv4en = datasets.load_dataset(
    "mozilla-foundation/common_voice_4_0", "es", cache_dir='path/to/cv4/download',
)
```
## Evaluation

We provide the SpeechQE benchmark: [h-j-han/SpeechQE-CoVoST2](https://huggingface.co/datasets/h-j-han/SpeechQE-CoVoST2).
`BASE_AUDIO_PATH` is the path to the downloaded Common Voice dataset.
```bash
$ python speechqe/score_speechqe.py \
    --speechqe_model=h-j-han/SpeechQE-TowerInstruct-7B-es2en \
    --dataset_name=h-j-han/SpeechQE-CoVoST2 \
    --base_audio_path=$BASE_AUDIO_PATH \
    --dataset_config_name=es2en \
    --test_split_name=test
```
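To inspect the benchmark itself, you can also load it directly with `datasets`; the config and split names below match the arguments passed to the scoring script above.

```python
import datasets

# Load the Spanish-to-English test split of the SpeechQE benchmark.
speechqe = datasets.load_dataset("h-j-han/SpeechQE-CoVoST2", "es2en", split="test")
print(speechqe)     # summary of features and number of examples
print(speechqe[0])  # one benchmark example
```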
## Reference

Please find details in [this EMNLP 2024 paper](https://aclanthology.org/2024.emnlp-main.1218):
```bibtex
@misc{han2024speechqe,
      title={SpeechQE: Estimating the Quality of Direct Speech Translation},
      author={HyoJung Han and Kevin Duh and Marine Carpuat},
      year={2024},
      eprint={2410.21485},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```