Translating English Audio Into Spanish Text

#61
by stvnchnsn - opened

I'm trying to translate audio that is in English into Spanish text using the code below. No errors occur, but the output text is in English with no translation performed. Any clues?

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

translate_pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=1,
    torch_dtype=torch_dtype,
    device=device,
    return_timestamps=True,
    generate_kwargs={"language": "spanish", "task": "translate"}
)

"language" parameter is used to indicate the spoken language in the audio.
The "translate" parameter indicates that the speech must be translated into English.

Whisper was trained on speech recognition (audio in X -> text in X) and speech translation to English (audio in X -> text in En)

You can also 'trick' it into performing more general speech translation (audio in X -> text in Y) with reasonable results, though not as good as the trained tasks. Just set the language to your target language and the task to "transcribe":

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

translate_pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=1,
    torch_dtype=torch_dtype,
    device=device,
    return_timestamps=True,
    generate_kwargs={"language": "spanish", "task": "transcribe"}
)
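To see why swapping the task changes the output language, here is a conceptual sketch of what these flags do under the hood. Whisper steers its decoder with special prompt tokens built from the language and task settings; the token names below mirror Whisper's actual special tokens, but the helper function itself is a hypothetical illustration, not part of the transformers API.

```python
def build_decoder_prompt(language_code: str, task: str, timestamps: bool = True) -> list[str]:
    """Sketch of the forced decoder prompt Whisper starts generation with."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unknown task: {task}")
    prompt = ["<|startoftranscript|>", f"<|{language_code}|>", f"<|{task}|>"]
    if not timestamps:
        # Without timestamps, an extra token suppresses timestamp prediction.
        prompt.append("<|notimestamps|>")
    return prompt

# Trained speech translation: the output language is always English.
print(build_decoder_prompt("en", "translate"))
# -> ['<|startoftranscript|>', '<|en|>', '<|translate|>']

# The 'trick': declare Spanish as the language and ask for "transcribe",
# so the model decodes Spanish text even though the audio is English.
print(build_decoder_prompt("es", "transcribe"))
# -> ['<|startoftranscript|>', '<|es|>', '<|transcribe|>']
```

Because generation is forced to begin with the `<|es|>` token, the decoder produces Spanish text regardless of the audio's actual language.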

Is there any way to improve performance? I didn't find any dataset (English audio - Spanish text) for fine-tuning.


@Daniel981215
You could obtain such a dataset by taking an English speech-to-text dataset and translating the English transcripts to Spanish (using open-source or cloud translation solutions).
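The suggestion above can be sketched as follows. Here `translate_en_to_es` is a hypothetical stand-in using a toy lookup table; in practice you would replace it with a real machine-translation model (for example a Hugging Face translation pipeline) or a cloud translation API, and the field names `audio`/`sentence` are assumptions about the source dataset's schema.

```python
def translate_en_to_es(text: str) -> str:
    """Hypothetical placeholder for a real English->Spanish translator."""
    lookup = {"hello world": "hola mundo"}  # toy mapping for illustration only
    return lookup.get(text, text)

def build_translation_dataset(examples: list[dict]) -> list[dict]:
    """Pair each audio clip with a Spanish translation of its English transcript."""
    return [
        {"audio": ex["audio"], "sentence": translate_en_to_es(ex["sentence"])}
        for ex in examples
    ]

english_asr = [{"audio": "clip_0001.wav", "sentence": "hello world"}]
print(build_translation_dataset(english_asr))
# -> [{'audio': 'clip_0001.wav', 'sentence': 'hola mundo'}]
```

The resulting (English audio, Spanish text) pairs can then be used to fine-tune Whisper on the translate-to-Spanish direction; translation quality of the MT system puts an upper bound on what the fine-tuned model can learn.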

I’ve created a code-switched language dataset for fine-tuning Whisper, including audio data along with CSV and Parquet files, which I’ve stored on Hugging Face. After preparing the dataset, I fine-tuned the model for translation. You can explore the entire end-to-end project in my repo. Here’s the link to check it out: https://github.com/pr0mila/MediBeng-Whisper-Tiny
