What is the data format for training?
Thank you, NeMo team, for your great work!
I want to include timestamp, punctuation, and capitalization. Is this the correct format for the metadata?
{"audio_filepath": "audio.wav", "duration": 2.401, "text": "δ½γγ?", "pnc": "yes", "source_lang": "ja", "target_lang": "ja", "task": "asr", "sampling_rate": 16000, "lang": "ja"}
Yes, that looks great! We have two data loading schemes, one with Lhotse and one native to NeMo. See here: https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/datasets.html Feel free to raise an issue on the NeMo GitHub for further clarification if needed.
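For anyone assembling such a manifest programmatically, here is a minimal sketch that assumes the manifest is stored as one JSON object per line. The helper name `make_manifest_line` and the sample text are hypothetical; the field names are the ones from the question above.

```python
import json


def make_manifest_line(audio_filepath, duration, text, lang="ja"):
    """Build one manifest entry as a single-line JSON string.

    Hypothetical helper for illustration; the keys mirror the example
    entry in the question, not an authoritative schema.
    """
    entry = {
        "audio_filepath": audio_filepath,
        "duration": duration,
        "text": text,  # keep punctuation and capitalization as spoken
        "pnc": "yes",
        "source_lang": lang,
        "target_lang": lang,
        "task": "asr",
        "sampling_rate": 16000,
        "lang": lang,
    }
    # ensure_ascii=False keeps non-ASCII text readable instead of \u escapes
    return json.dumps(entry, ensure_ascii=False)


# Placeholder utterance, not taken from any real dataset
line = make_manifest_line("audio.wav", 2.401, "こんにちは。")
print(line)
```

Writing each line to a UTF-8 encoded file (for example with `open(path, "w", encoding="utf-8")`) avoids garbled characters in the "text" field.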
Thanks @nithinraok!
Regarding word-level timestamps, do I need to include markers in the training text, like "text": "<|3|> it's <|7|> <|8|> almost <|9|> <|14|> beyond <|20|> <|20|> conjecture <|28|>", in order to obtain timestamps during inference? Or is it sufficient to simply provide the plain text "text": " it's almost beyond conjecture " and still get word-level timestamps if the TDT mechanism functions similarly to CTC?
You don't need to train that way for RNNT, CTC, or TDT decoders. You can extract the timestamps during the decoding process with a well-aligned model.
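To illustrate why plain text is enough, here is a minimal sketch of how word timestamps can fall out of greedy CTC decoding: each output frame has a known duration, so the frame index at which a token is emitted gives its time. This is not NeMo's implementation; the function name, the character-level vocabulary, and the 0.08 s frame duration are all assumptions made for the example.

```python
def ctc_greedy_word_timestamps(frame_ids, vocab, blank_id, frame_sec):
    """Collapse per-frame argmax ids into words with start/end times.

    frame_ids: predicted token id for each output frame
    vocab:     id -> token (character-level here, " " separates words)
    frame_sec: duration of one output frame in seconds (assumed value)
    """
    words = []
    current = None  # [text, start_frame, end_frame_exclusive]
    prev = blank_id
    for t, idx in enumerate(frame_ids):
        if idx != blank_id and idx != prev:
            token = vocab[idx]
            if token == " ":
                # A word boundary closes the word in progress
                if current is not None:
                    words.append(current)
                    current = None
            elif current is None:
                current = [token, t, t + 1]
            else:
                current[0] += token
                current[2] = t + 1
        elif idx != blank_id and current is not None:
            # Repeated frame of the same token: extend the word's end time
            current[2] = t + 1
        prev = idx
    if current is not None:
        words.append(current)
    return [
        {"word": w, "start": round(s * frame_sec, 3), "end": round(e * frame_sec, 3)}
        for w, s, e in words
    ]


# Toy example: id 0 is the CTC blank, id 1 is the word separator
vocab = ["<blank>", " ", "h", "i", "o", "k"]
frames = [0, 2, 2, 3, 0, 1, 4, 5, 5, 0]
result = ctc_greedy_word_timestamps(frames, vocab, blank_id=0, frame_sec=0.08)
for item in result:
    print(item)
```

The training text never contains time markers here; the timing comes entirely from where in the frame sequence the model emits each token, which is why a well-aligned model matters.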
@nithinraok, thanks for the reply.
So, if I want the model to output punctuation and capitalization during inference, the training dataset should contain entries like "text": "I'm Nguyen" rather than "text": "i m nguyen", right?