Streaming?

#3
by pscar

Thank you NVIDIA team for releasing yet another excellent ASR model!

Is there a guide on how to achieve streaming transcription using the latest parakeet-tdt-0.6b-v2 model?

NVIDIA org

You could do chunked streaming by following this script: https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_chunked_inference/rnnt/speech_to_text_buffered_infer_rnnt.py. Directions on how to use it are inside the script.
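For reference, the script is launched with Hydra-style overrides; a typical invocation looks roughly like the sketch below (the override names are taken from the script's docstring, so check the script itself for the current set; model_stride=8 assumes a FastConformer encoder such as Parakeet's):

```bash
python speech_to_text_buffered_infer_rnnt.py \
    pretrained_name="nvidia/parakeet-tdt-0.6b-v2" \
    dataset_manifest="<path to evaluation manifest>" \
    output_filename="<path for output json>" \
    chunk_len_in_secs=1.6 \
    total_buffer_in_secs=4.0 \
    model_stride=8 \
    batch_size=32
```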

We noticed a bug with TDT for chunked streaming inference; we will push a fix to main soon for everyone to try!

We also have a dedicated cache-aware architecture for streaming use cases: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_fastconformer_hybrid_large_streaming_multi. We are also working on an upgraded, more performant successor to that model.
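If it helps, a minimal way to pull that checkpoint down is NeMo's `from_pretrained`; a sketch assuming a recent NeMo install ("sample.wav" is a placeholder path):

```python
import nemo.collections.asr as nemo_asr

# Download the cache-aware streaming FastConformer hybrid model from NGC.
asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="stt_en_fastconformer_hybrid_large_streaming_multi"
)

# Quick offline sanity check on a local file.
print(asr_model.transcribe(["sample.wav"]))
```

For proper streaming that exercises the cache mechanism, see the cache-aware streaming example under examples/asr/asr_cache_aware_streaming/ in the NeMo repo.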

Hi @nithinraok, thanks for that link. Waiting eagerly for the new streaming models! About the bug: do you recommend waiting for the bugfix if it's major, or can the version on main be used already?

I second the request for live transcription. I would love an alternative to Whisper with a decent interface that runs on my laptop and works offline: press a key, record your voice, let go of the key, and it transcribes and pastes into a field.

Has it been fixed yet?
Or is there any update on the progress?

BatchedFrameASRTDT, ImportError. Could not import.

NVIDIA org

Yes, the fix is now merged to main. Use this script for performing buffered streaming: https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_chunked_inference/rnnt/speech_to_text_buffered_infer_rnnt.py

Hi @nithinraok, thank you so much for the update! One question out of curiosity: according to the relevant commit, TDT does not currently support the greedy_batch decoding strategy, yet the .nemo file in this repository defaults to greedy_batch. Is this expected?

NVIDIA org

Yes, that's used by default for offline inference. For streaming, it gets changed to greedy for now.
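If anyone wants to flip it manually, something along these lines should work with NeMo's `change_decoding_strategy` (a sketch; the decoding config layout can differ across NeMo versions, so verify against yours):

```python
from copy import deepcopy
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/parakeet-tdt-0.6b-v2"
)

# Clone the model's decoding config and switch greedy_batch -> greedy.
decoding_cfg = deepcopy(asr_model.cfg.decoding)
decoding_cfg.strategy = "greedy"
asr_model.change_decoding_strategy(decoding_cfg)
```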

Thanks for the update. I saw that your Hugging Face demo has an interactive interface built with Gradio. Can I deploy the streaming model interface on my own server and use your Gradio app for non-commercial display?

Hi,

I am working on a real-time mic version, and I have a working one ready to test:

https://huggingface.co/spaces/WJ88/NVIDIA-Parakeet-TDT-0.6B-v2-INT8-Real-Time-Mic-Transcription
The whole point of this Space is to fit the model into 2 vCPUs :) and it works!

The UI may not be pretty, but overall just click RECORD, speak, and watch the transcription. After you finish, please refresh the browser tab to free resources.
NOTE: the app is currently public, meaning each user's transcriptions accumulate and other users can see them. I am working on isolation, but it is what it is; it works :)

You can use NVIDIA-Parakeet-TDT-0.6B-v2 without an NVIDIA card in REAL TIME. I encourage you to try it and check the code (it's interesting that the model fits on 2 vCPUs), and finally clone it and build your own version on top! I will stick to optimizations rather than fancy features in my repo.

"I love Pain"

I am on the main branch (commit 259d684e73c45091f0b6144342133e6ceb7e824c).
@nithinraok you mentioned that TDT streaming is fixed. Just checking again.
The script speech_to_text_buffered_infer_rnnt calls BatchedFrameASRTDT for TDT from streaming_utils.py with the argument stateful_decoding, which I pass as True.
But the class BatchedFrameASRTDT in turn calls its parent BatchedFrameASRRNNT like this:
`super().__init__(asr_model, frame_len=frame_len, total_buffer=total_buffer, batch_size=batch_size)`
without passing stateful_decoding, so it remains False as defined by the default.

Is that how you intended it to be? Stateful decoding always false?
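For anyone following along, the pattern I'm describing boils down to a constructor that accepts a keyword argument but never forwards it; a self-contained illustration with stand-in class names (not the actual NeMo classes):

```python
class Parent:
    def __init__(self, stateful_decoding=False):
        self.stateful_decoding = stateful_decoding

class Child(Parent):
    def __init__(self, stateful_decoding=False):
        # Accepted here but never forwarded, so Parent silently
        # falls back to its default of False.
        super().__init__()

print(Child(stateful_decoding=True).stateful_decoding)  # -> False
```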
