---
{}
---

# Rimecaster (en-US)

| [![Model architecture](https://img.shields.io/badge/Model_Arch-TitaNet--Large-lightgrey#model-badge)](#model-architecture) | [![Model size](https://img.shields.io/badge/Params-30M-lightgrey#model-badge)](#model-architecture) | [![Language](https://img.shields.io/badge/Language-en--US-lightgrey#model-badge)](#datasets)

Rimecaster was developed by [Rime Labs](https://rime.ai/) with TTS tasks in mind and is useful for speaker conditioning. The model extracts speaker embeddings from speech, which can serve as the backbone for various TTS models. It is adapted from TitaNet-Large, with the embedding dimension raised from 192 to 768. Read more in the [launch announcement blog post](https://www.rime.ai/blog/introducing-rimecaster/).

See the [model architecture](#model-architecture) section and [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/asr/speaker_recognition/models.html#titanet) for complete architecture details.

## NVIDIA NeMo: Training

To train, fine-tune, or experiment with the model, you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you have installed the latest PyTorch version.

```
pip install "nemo_toolkit[all]"
```

## How to Use this Model

The model is available for use in the NeMo toolkit [2] and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.
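Batch extraction, covered in the "Extracting Embeddings for more audio files" section, reads a manifest with one JSON object per line containing `audio_filepath`, `duration`, and `label`. The following is a minimal sketch of how such a manifest could be built with the Python standard library. The helper names `wav_duration_seconds` and `write_manifest` are hypothetical and not part of NeMo, and the sketch assumes uncompressed PCM WAV files, absolute paths, and a numeric duration in seconds.

```python
import json
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def write_manifest(entries, manifest_path):
    """Write one JSON object per line for each (wav_path, speaker_id) pair.

    Hypothetical helper: the field names follow the manifest format shown
    in this document; storing an absolute path is an assumption.
    """
    with open(manifest_path, "w", encoding="utf-8") as out:
        for wav_path, speaker_id in entries:
            record = {
                "audio_filepath": str(Path(wav_path).resolve()),
                "duration": round(wav_duration_seconds(wav_path), 3),
                "label": speaker_id,
            }
            out.write(json.dumps(record) + "\n")
```

For example, `write_manifest([("speaker_a_001.wav", "speaker_a")], "manifest.json")` would write a single-line manifest for one recording; the file name and speaker label here are placeholders.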
### Automatically instantiate the model

```python
import nemo.collections.asr as nemo_asr

# Download the pre-trained checkpoint and instantiate the speaker model
speaker_model = nemo_asr.models.EncDecSpeakerLabelModel.from_pretrained("rimelabs/rimecaster")
```

### Embedding Extraction

To extract an embedding from a single audio file:

```python
emb = speaker_model.get_embedding("an255-fash-b.wav")
```

### Extracting Embeddings for more audio files

To extract embeddings from multiple audio files, first list the files in a `manifest.json` file, one JSON object per line, in the following format:

```json
{"audio_filepath": "/audio_file.wav", "duration": "duration of file in sec", "label": "speaker_id"}
```

Then run the following script, which extracts the embeddings and writes them to the current working directory:

```shell
python /examples/speaker_tasks/recognition/extract_speaker_embeddings.py --manifest=manifest.json --model_path='/path/to/.nemo/file'
```

### Input

This model accepts 16 kHz (16,000 Hz) mono-channel audio (WAV files) as input.

### Output

This model provides a 768-dimensional speaker embedding for an audio file.

## Model Architecture

TitaNet is a depth-wise separable Conv1D model [1] for speaker verification and diarization tasks. More details on this model are available here: [TitaNet-Model](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/speaker_recognition/models.html).

## Training

The NeMo toolkit [2] was used to train the model for several hundred epochs. The model was trained with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/recognition/speaker_reco.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/recognition/conf/titanet-large.yaml).

## References

[1] [TitaNet: Neural Model for Speaker Representation with 1D Depth-wise Separable convolutions and global context](https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9746806)

[2] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)

## License

Use of this model is covered by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.
By downloading the public and release version of the model, you accept the terms and conditions of the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license.