Audio transcription is not finishing the full dialogue

#21
by Farhang87 - opened

I am trying to benchmark microsoft/Phi-4-multimodal-instruct on transcribing medical dialogue from a specific dataset, and it sometimes seems to randomly decide not to transcribe the full dialogue. Am I the only one seeing this, or is something wrong with the model?

Also, when running inference in a loop, GPU RAM keeps growing incrementally; I'm not sure why.

Hi Farhang87,

Thanks for your interest in Phi-4-multimodal. Here are some notes on your questions.

Model-wise, how long is the dialogue? The Phi-4-mm model supports up to 40 seconds of audio for the transcription task. For longer audio, the model may miss parts of the recording. You might need to use a VAD (voice activity detection) module to segment the audio, or fine-tune the model on long audio.
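
In case it helps, here is a minimal sketch of that kind of segmentation, assuming the Silero VAD model from torch.hub, a 16 kHz mono WAV file, and a placeholder filename; it just packs detected speech segments into chunks of at most 40 seconds for separate transcription:

```python
# Sketch: split a long recording into <=40 s speech chunks with Silero VAD.
# Assumes a 16 kHz mono WAV; 'dialogue.wav' is a placeholder path.
import torch

SAMPLING_RATE = 16000
MAX_CHUNK_SEC = 40

vad_model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio('dialogue.wav', sampling_rate=SAMPLING_RATE)
speech_ts = get_speech_timestamps(wav, vad_model, sampling_rate=SAMPLING_RATE)

# Greedily pack consecutive speech segments into chunks of at most 40 s.
# (A single continuous segment longer than 40 s would still need extra splitting.)
chunks, current = [], []
for seg in speech_ts:
    if current and (seg['end'] - current[0]['start']) > MAX_CHUNK_SEC * SAMPLING_RATE:
        chunks.append(wav[current[0]['start']:current[-1]['end']])
        current = []
    current.append(seg)
if current:
    chunks.append(wav[current[0]['start']:current[-1]['end']])

# Each chunk (a 1-D tensor) can now be transcribed separately and the
# transcripts concatenated in order.
```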

Inference-wise, how did you set your generation configuration? We suggest running inference with temperature=0 or do_sample=False to get the best transcription quality. Please check our sample ASR inference code here:
https://github.com/huggingface/open_asr_leaderboard/pull/51/commits/a471a825a1b21aafdb8bc3a9732fc656ae1f9ec5#diff-0cc2e2a33d6603fa8731c9277f06335d7826b62e96d47eff8829821598f69027
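
For reference, here is a minimal sketch of greedy transcription along the lines of the model card's audio example; the exact prompt wording, the max_new_tokens value, and the audio path are assumptions to adapt to your setup:

```python
# Sketch: greedy (non-sampled) transcription with Phi-4-multimodal.
# Prompt format follows the model card; paths/values are placeholders.
import soundfile as sf
import torch
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

model_id = "microsoft/Phi-4-multimodal-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto"
).cuda()
generation_config = GenerationConfig.from_pretrained(model_id)

prompt = ("<|user|><|audio_1|>Transcribe the audio clip into text."
          "<|end|><|assistant|>")

audio, sr = sf.read("chunk_00.wav")  # placeholder: one <=40 s chunk
inputs = processor(text=prompt, audios=[(audio, sr)], return_tensors="pt").to("cuda")

with torch.inference_mode():  # no gradient buffers; helps keep memory flat in a loop
    generate_ids = model.generate(
        **inputs,
        max_new_tokens=512,
        generation_config=generation_config,
        do_sample=False,  # greedy decoding, per the recommendation above
    )

# Strip the prompt tokens and decode only the newly generated text.
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
transcript = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(transcript)
```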

The dialogue samples do indeed exceed 40 seconds; they are around 5-10 minutes long.

To be honest, summarization of dialogue (or any long speech segment) seems like one of the best selling points of a multimodal model like Phi-4, but when it's restricted to 40-second segments, it loses a lot of its value.

Thanks for the inference code, I'll give that a try.

Hi Farhang87,

For the summarization task, Phi-4 can support up to 30 minutes of audio. That should work in your case if you want to get the main points of a long dialogue.
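
With the transcription sketch above, only the instruction in the prompt would need to change for summarization; the wording here is just an illustrative assumption, not an official prompt:

```python
# Same setup as the transcription sketch; only the instruction differs.
# The wording below is illustrative, not an official prompt.
prompt = ("<|user|><|audio_1|>Summarize the main points of this dialogue."
          "<|end|><|assistant|>")
```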

The 40-second limit I mentioned applies to the transcription task, where you want to capture what the audio says word for word. The model may miss parts of the transcription when the audio exceeds 40 seconds, so you need VAD segmentation to get the best transcription results.

Thanks!
