What is the data format for training?

#10

by Nguyen667201 - opened May 6

May 6

Thank you, NeMo team, for your great work!
I want to include timestamp, punctuation, and capitalization. Is this the correct format for the metadata?
{"audio_filepath": "audio.wav", "duration": 2.401, "text": "何これ?", "pnc": "yes", "source_lang": "ja", "target_lang": "ja", "task": "asr", "sampling_rate": 16000, "lang": "ja"}

nithinraok

NVIDIA org May 6

Yes, looks great! We have two data loading schemes, one with Lhotse and one native to NeMo. See here: https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/datasets.html Feel free to raise an issue on the NeMo GitHub for further clarification if needed.

Nguyen667201

May 6

edited May 6

Thanks @nithinraok !
Regarding word-level timestamps, do I need to include markers in the training text, like "text": "<|3|> it's <|7|> <|8|> almost <|9|> <|14|> beyond <|20|> <|20|> conjecture <|28|>", in order to obtain timestamps during inference? Or is it sufficient to simply provide the plain text "text": " it's almost beyond conjecture " and still get word-level timestamps if the TDT mechanism functions similarly to CTC?

nithinraok

NVIDIA org May 6

You don't need to train that way for RNNT, CTC, or TDT decoders; you can extract the timestamps during the decoding process with a well-aligned model.
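If existing transcripts already carry markers like the `<|3|>` tokens quoted in the question, they can be reduced to plain text before training. The sketch below assumes the markers always take the form `<|integer|>` as in that example; the function name `strip_timestamp_markers` is hypothetical and not part of NeMo.

```python
import re

# Matches markers of the form <|3|>, <|28|>, ... as quoted in the question.
_MARKER = re.compile(r"<\|\d+\|>")


def strip_timestamp_markers(text):
    """Remove <|n|> markers and collapse the leftover whitespace."""
    without_markers = _MARKER.sub(" ", text)
    # split() with no arguments drops leading, trailing, and repeated spaces.
    return " ".join(without_markers.split())


marked = "<|3|> it's <|7|> <|8|> almost <|9|> <|14|> beyond <|20|> <|20|> conjecture <|28|>"
plain = strip_timestamp_markers(marked)
```

Text that contains no markers passes through unchanged apart from whitespace trimming, so the same helper can be applied to a mixed dataset.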

leminhnguyen

May 6

@nithinraok That means we just need the plain text and audio to train this model?

Nguyen667201

May 6

edited May 6

@nithinraok, thanks for the reply.
So, if I want the model to output punctuation during inference, the training dataset should contain "text" like "text": "I'm Nguyen" rather than "text": "i m nguyen", right?
