What is the data format for training?

#10
by Nguyen667201 - opened

Thank you, NeMo team, for your great work!
I want to include timestamp, punctuation, and capitalization. Is this the correct format for the metadata?
{"audio_filepath": "audio.wav", "duration": 2.401, "text": "何これ?", "pnc": "yes", "source_lang": "ja", "target_lang": "ja", "task": "asr", "sampling_rate": 16000, "lang": "ja"}
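For anyone generating such a manifest programmatically, here is a minimal sketch that writes one JSON object per line using only the field names from the entry above. The helper names `make_manifest_entry` and `write_manifest` are hypothetical and not part of NeMo; whether every field is required depends on the data loader, so treat the field set as an assumption taken from this post.

```python
import json

def make_manifest_entry(audio_filepath, duration, text, lang="ja",
                        pnc="yes", sampling_rate=16000):
    """Build one manifest record (hypothetical helper, fields copied from the post)."""
    return {
        "audio_filepath": audio_filepath,
        "duration": duration,
        "text": text,            # keep punctuation and casing exactly as transcribed
        "pnc": pnc,
        "source_lang": lang,
        "target_lang": lang,     # same as source_lang for plain ASR
        "task": "asr",
        "sampling_rate": sampling_rate,
        "lang": lang,
    }

def write_manifest(entries, path):
    """Write entries as JSON Lines: one JSON object per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            # ensure_ascii=False keeps non-ASCII text (e.g. Japanese) readable in the file
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")

entry = make_manifest_entry("audio.wav", 2.401, "何これ?")
print(json.dumps(entry, ensure_ascii=False))
```

Writing with `encoding="utf-8"` and `ensure_ascii=False` avoids escaped or mis-encoded characters in the `text` field.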

NVIDIA org

Yes, that looks good! We have two data loading schemes, one with Lhotse and one native to NeMo. See here: https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/datasets.html Feel free to raise an issue on the NeMo GitHub if you need further clarification.

Thanks @nithinraok !
Regarding word-level timestamps, do I need to include markers in the training text, like "text": "<|3|> it's <|7|> <|8|> almost <|9|> <|14|> beyond <|20|> <|20|> conjecture <|28|>", in order to obtain timestamps during inference? Or is it sufficient to simply provide the plain text "text": " it's almost beyond conjecture " and still get word-level timestamps if the TDT mechanism functions similarly to CTC?

NVIDIA org

You don't need to train that way for RNNT, CTC, or TDT decoders. You can extract the timestamps during the decoding process, provided the model is well aligned.
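To illustrate why plain text is enough, here is a toy sketch of CTC-style greedy decoding that recovers token timestamps from frame indices alone. It is not NeMo's implementation, and RNNT and TDT decoders track emission frames differently. The function name, the toy vocabulary, and the 0.08 s frame stride are all assumptions made for the example.

```python
def ctc_greedy_with_timestamps(frame_ids, vocab, blank_id, frame_sec):
    """Collapse per-frame argmax IDs into tokens and keep each token's frame span."""
    tokens = []
    prev = blank_id
    for t, idx in enumerate(frame_ids):
        if idx != blank_id:
            if idx != prev:
                # A new token starts at frame t
                tokens.append({"token": vocab[idx], "start_frame": t, "end_frame": t + 1})
            else:
                # The same token repeats, so extend its span
                tokens[-1]["end_frame"] = t + 1
        prev = idx
    # Convert frame indices to seconds using the (assumed) frame stride
    for tok in tokens:
        tok["start"] = tok["start_frame"] * frame_sec
        tok["end"] = tok["end_frame"] * frame_sec
    return tokens

# Toy input: 0 is the blank symbol, 0.08 s per frame is a hypothetical stride
vocab = {1: "a", 2: "b"}
frames = [0, 1, 1, 0, 2, 2, 2, 0, 1]
for tok in ctc_greedy_with_timestamps(frames, vocab, blank_id=0, frame_sec=0.08):
    print(tok["token"], tok["start_frame"], tok["end_frame"])
```

The timing comes entirely from where the model emits each token along the frame axis, which is why no timestamp markers are needed in the training transcripts. Word-level timestamps would then be obtained by grouping token spans into words.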

@nithinraok That means we just need the plain text and audio to train this model?

@nithinraok , Thanks for the reply.
So, if I want the model to output punctuation during inference, the training dataset should contain "text": "I'm Nguyen" rather than "text": "i m nguyen", right?
