OutOfMemoryError: CUDA out of memory. on RTX A5000

#46
by akskuchi - opened

Hi,

Thanks to the team for contributing this impressive model.

I'm trying to transcribe an audio file (native English speech) that spans ~40 minutes. The model is unable to perform the transcription and fails with OutOfMemoryError: CUDA out of memory on an RTX A5000 GPU (24 GB VRAM). Other hardware details: 128 GB RAM, 32 CPUs.
I looked for preprocessing parameters that would chunk the audio into smaller segments (similar to OpenAI's Whisper), but couldn't find any.

Here's the code I used:

import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name="nvidia/parakeet-tdt-0.6b-v2")
output = asr_model.transcribe('input.wav', timestamps=True) 
# NOTE: using timestamps=False did not resolve OOM ERROR :(

Is this expected? I'd be grateful for any ideas or suggestions. Thanks!

Hi, you could do two things:

  1. Apply limited-context (local) attention settings:

import torch

# Switch to local attention with a left/right context of 256 frames
asr_model.change_attention_model("rel_pos_local_attn", [256, 256])
asr_model.change_subsampling_conv_chunking_factor(1)
asr_model.to(torch.bfloat16)  # bfloat16 roughly halves activation memory
output = asr_model.transcribe('input.wav', timestamps=True)

This enables long-form inference; the maximum audio length depends on available GPU memory. It comes at a small accuracy degradation, but not much.

  2. Chunked inference using this script: https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_chunked_inference/rnnt/speech_to_text_buffered_infer_rnnt.py . It's not integrated into .transcribe() yet.
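Until chunking is integrated into .transcribe(), a rough do-it-yourself alternative (a sketch, not the buffered-inference script above) is to compute fixed-length spans over the waveform, write each span to its own WAV file, and pass the list of files to .transcribe(). The 30-second chunk length and the lack of overlap are assumptions here, so words at chunk boundaries may be clipped or duplicated:

```python
def chunk_spans(n_samples, sr, chunk_sec=30):
    # Fixed-length, non-overlapping (start, end) sample spans covering
    # the whole signal; the last span may be shorter than chunk_sec.
    step = chunk_sec * sr
    return [(s, min(s + step, n_samples)) for s in range(0, n_samples, step)]

# Usage sketch (soundfile for WAV I/O is an assumption, any audio
# library works; asr_model is the loaded Parakeet model from above):
# import soundfile as sf
# audio, sr = sf.read("input.wav")
# paths = []
# for i, (s, e) in enumerate(chunk_spans(len(audio), sr)):
#     path = f"chunk_{i:04d}.wav"
#     sf.write(path, audio[s:e], sr)
#     paths.append(path)
# outputs = asr_model.transcribe(paths, timestamps=True)
# full_text = " ".join(o.text for o in outputs)
```

Each ~30-second chunk easily fits in 24 GB, at the cost of losing cross-chunk context that the buffered-inference script handles more carefully.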
