ghees committed on
Commit 7c00ba7 · verified · 1 parent: 34b8e3e

Push model using huggingface_hub.

Files changed (3)
  1. .gitattributes +1 -0
  2. README.md +137 -92
  3. rimecaster.nemo +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ rimecaster.nemo filter=lfs diff=lfs merge=lfs -text
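
For context (an aside, not part of the commit): an attribute line like the one added above is what `git lfs track` writes, so that git stores an LFS pointer instead of the binary checkpoint:

```bash
# Track the .nemo checkpoint with Git LFS and stage the updated attributes.
git lfs track "rimecaster.nemo"
git add .gitattributes
```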
README.md CHANGED
@@ -1,151 +1,196 @@
- # Model Card for Rimecaster 🎸
-
- <!-- Provide a quick summary of what the model is/does. -->
-
- Rimecaster is a foundation model designed to generate extremely rich speaker representations (embeddings) trained by [Rime Labs](www.rime.ai), following the wonderful work by the NeMo team at NVIDIA on [TitaNet](https://huggingface.co/nvidia/speakerverification_en_titanet_large).
-
- ## Model Details
-
- ### Model Description
-
- <!-- Provide a longer summary of what this model is. -->
-
- - **Developed by:** [Rime Labs]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]
-
- ### Model Sources [optional]
-
- <!-- Provide the basic links for the model. -->
-
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
-
- ### Direct Use
-
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-
- [More Information Needed]
-
- ### Downstream Use [optional]
-
- This speaker representation model is intended for use in downstream speech tasks like diarization, speaker-conditioned speech recognition, and multi-speaker text-to-speech models, including in Rime's flagship speech synthesis model, Mist.
-
- ## How to Get Started with the Model
-
- Use the code below to get started with the model.
-
- CODE SAMPLES NEEDED
-
- ## Training Details
-
- ### Training Data
-
- Rimecaster was trained on speech datasets including:
-
- - Voxceleb
- - Fisher
- - Switchboard
- - Librispeech
-
- As well as a massive amount of proprietary speech data collected by Rime in our San Francisco, CA recording studio.
-
- ### Training Procedure
-
- #### Preprocessing [optional]
-
- TBD
-
- #### Training Hyperparameters
-
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-
- #### Speeds, Sizes, Times [optional]
-
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-
- [More Information Needed]
-
- ## Evaluation
-
- <!-- This section describes the evaluation protocols and provides the results. -->
-
- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
-
- ### Results
-
- Performances of the these models are reported in terms of Equal Error Rate (EER%) on speaker verification evaluation trial files and as Diarization Error Rate (DER%) on diarization test sessions.
-
- * Speaker Verification (EER%)
- | Version | Model | Model Size | VoxCeleb1 (Cleaned trial file) |
- |---------|--------------|-----|---------------|
- | 1.0.0 | Rimecaster | XXXXM | XXXXXXX |
-
- * Speaker Diarization (DER%)
- | Version | Model | Model Size | Evaluation Condition | NIST SRE 2000 | AMI (Lapel) | AMI (MixHeadset) | CH109 |
- |---------|--------------|-----|----------------------|---------------|-------------|------------------|-------|
- | 1.0.0 | Rimecaster | XXXXXXX | Oracle VAD KNOWN # of Speakers | XXXXXXX | XXXXXXX | XXXXXXX | XXXXXXX |
- | 1.0.0 | Rimecaster | XXXXXXX | Oracle VAD UNKNOWN # of Speakers | XXXXXXX | XXXXXXX | XXXXXXX | XXXXXXX |
-
- #### Summary
-
- ## Technical Specifications [optional]
-
- ### Model Architecture and Objective
-
- [More Information Needed]
-
- ### Compute Infrastructure
-
- [More Information Needed]
-
- #### Hardware
-
- [More Information Needed]
-
- #### Software
-
- [More Information Needed]
-
- ## Model Card Authors [optional]
-
- [More Information Needed]
-
- ## Model Card Contact
-
- [More Information Needed]
+ ---
+ library_name: nemo
+ license: cc-by-4.0
+ tags:
+ - pytorch
+ - NeMo
+ ---

+ # Rimecaster

+ <style>
+ img {
+  display: inline;
+ }
+ </style>

+ [![Model architecture](https://img.shields.io/badge/Model_Arch-PUT-YOUR-ARCHITECTURE-HERE-lightgrey#model-badge)](#model-architecture)
+ | [![Model size](https://img.shields.io/badge/Params-PUT-YOUR-MODEL-SIZE-HERE-lightgrey#model-badge)](#model-architecture)
+ | [![Language](https://img.shields.io/badge/Language-PUT-YOUR-LANGUAGE-HERE-lightgrey#model-badge)](#datasets)

+ Rimecaster is a foundation model from [Rime Labs](https://www.rime.ai) that produces rich speaker representations (embeddings), following the NeMo team's work on [TitaNet](https://huggingface.co/nvidia/speakerverification_en_titanet_large).

+ See the [model architecture](#model-architecture) section and [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/index.html) for complete architecture details.

+ ## NVIDIA NeMo: Training

+ To train, fine-tune, or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed the latest PyTorch version.

+ ```bash
+ pip install nemo_toolkit['all']
+ ```
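
+ As a quick sanity check (a suggested step, not part of the upstream instructions), confirm the toolkit imports and print its version:

+ ```python
+ # Verifies the NeMo install; the package exposes its version string.
+ import nemo
+ print(nemo.__version__)
+ ```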

+ ## How to Use this Model

+ The model is available for use in the NeMo toolkit [1], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

+ ### Automatically instantiate the model

+ **NOTE**: Please update the model class below to match the class of the model being uploaded.

+ ```python
+ from nemo.core import ModelPT
+ model = ModelPT.from_pretrained("rimelabs/rimecaster")
+ ```

+ ### NOTE

+ Add some information about how to use the model here. An example is provided for ASR inference below, and a speaker-embedding sketch follows it.

+ ### Transcribing using Python

+ First, let's get a sample:

+ ```bash
+ wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
+ ```

+ Then simply do:

+ ```python
+ # assumes the model instantiated above exposes an ASR-style transcribe() method
+ model.transcribe(['2086-149220-0033.wav'])
+ ```

+ ### Transcribing many audio files

+ ```shell
+ python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py pretrained_name="rimelabs/rimecaster" audio_dir=""
+ ```
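
+ ### Extracting speaker embeddings

+ Since the previous revision of this card describes Rimecaster as a speaker-representation model, here is a minimal sketch of embedding extraction. It assumes the checkpoint loads as NeMo's TitaNet-style `EncDecSpeakerLabelModel`; that class choice is an assumption, not confirmed for this checkpoint, so verify the actual model class first.

+ ```python
+ # Sketch only: assumes a TitaNet-style speaker-label checkpoint.
+ from nemo.collections.asr.models import EncDecSpeakerLabelModel
+ 
+ model = EncDecSpeakerLabelModel.from_pretrained("rimelabs/rimecaster")
+ 
+ # Fixed-size speaker embedding for one utterance (a tensor).
+ emb = model.get_embedding("2086-149220-0033.wav")
+ print(emb.shape)
+ 
+ # Same-speaker decision for two files (the paths are placeholders).
+ same = model.verify_speakers("speaker_a.wav", "speaker_b.wav")
+ ```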

+ ### Input

+ **Add some information about what the inputs to this model are**

+ ### Output

+ **Add some information about what the outputs of this model are**

+ ## Model Architecture

+ **Add information here discussing architectural details of the model or any comments to users about the model.**

+ ## Training

+ **Add information here about how the model was trained. It should be as detailed as possible, potentially including the link to the script used to train as well as the base config used to train the model. If extraneous scripts are used to prepare the components of the model, please include them here.**

+ ### NOTE

+ An example is provided below for ASR.

+ The NeMo toolkit [1] was used for training the models for over several hundred epochs. These models are trained with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_transducer/speech_to_text_rnnt_bpe.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/fastconformer/fast-conformer_transducer_bpe.yaml).

+ The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).

+ ### Datasets

+ **Try to provide as detailed a list of datasets as possible. If possible, provide links to the datasets on HF by adding them to the manifest section at the top of the README (marked by ---).**

+ ### NOTE

+ An example of the manifest section is provided below for ASR datasets:

+ ```yaml
+ datasets:
+ - librispeech_asr
+ - fisher_corpus
+ - Switchboard-1
+ - WSJ-0
+ - WSJ-1
+ - National-Singapore-Corpus-Part-1
+ - National-Singapore-Corpus-Part-6
+ - vctk
+ - voxpopuli
+ - europarl
+ - multilingual_librispeech
+ - mozilla-foundation/common_voice_8_0
+ - MLCommons/peoples_speech
+ ```

+ The corresponding text in this section for those datasets is stated below:

+ The model was trained on 64K hours of English speech collected and prepared by the NVIDIA NeMo and Suno teams.

+ The training dataset consists of a private subset with 40K hours of English speech plus 24K hours from the following public datasets:

+ - Librispeech 960 hours of English speech
+ - Fisher Corpus
+ - Switchboard-1 Dataset
+ - WSJ-0 and WSJ-1
+ - National Speech Corpus (Part 1, Part 6)
+ - VCTK
+ - VoxPopuli (EN)
+ - Europarl-ASR (EN)
+ - Multilingual Librispeech (MLS EN) - 2,000 hour subset
+ - Mozilla Common Voice (v7.0)
+ - People's Speech - 12,000 hour subset

+ ## Performance

+ **Add information here about the performance of the model. Discuss which metric is used to evaluate the model, and if there are external links explaining the custom metric, please link to them. Provide any caveats about the results up front so that nuance is not lost. Results should ideally be in tabular format (you can use https://www.tablesgenerator.com/markdown_tables to build markdown tables).**

+ ### NOTE

+ An example is provided below for an ASR metrics list that can be added to the top of the README:

+ ```yaml
+ model-index:
+ - name: PUT_MODEL_NAME
+   results:
+   - task:
+       name: Automatic Speech Recognition
+       type: automatic-speech-recognition
+     dataset:
+       name: AMI (Meetings test)
+       type: edinburghcstr/ami
+       config: ihm
+       split: test
+       args:
+         language: en
+     metrics:
+     - name: Test WER
+       type: wer
+       value: 17.10
+   - task:
+       name: Automatic Speech Recognition
+       type: automatic-speech-recognition
+     dataset:
+       name: Earnings-22
+       type: revdotcom/earnings22
+       split: test
+       args:
+         language: en
+     metrics:
+     - name: Test WER
+       type: wer
+       value: 14.11
+ ```

+ ## Limitations

+ **Discuss any practical limitations to the model when being used in real world cases. They can also be legal disclaimers, or discussion regarding the safety of the model (particularly in the case of LLMs).**

+ ### NOTE

+ An example is provided below.

+ Since this model was trained on publicly available speech datasets, the performance of this model might degrade for speech which includes technical terms, or vernacular that the model has not been trained on. The model might also perform worse for accented speech.

+ ## License

+ License to use this model is covered by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license. By downloading the public and release version of the model, you accept its terms and conditions.

+ ## References

+ **Provide appropriate references in the markdown link format below. Please order them numerically.**

+ [1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
rimecaster.nemo ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cee3b74a85d6c1e004a6d982c98b09d59035c635eee9cab1651d8f399587b6c5
+ size 121405440
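
The three lines above are a Git LFS pointer; the ~121 MB `.nemo` checkpoint itself lives in LFS storage. As a sketch (not part of the commit), one way to fetch the resolved file is via `huggingface_hub`, with the repo id `rimelabs/rimecaster` taken from the README snippet above:

```python
# Downloads the LFS-backed checkpoint; the Hub resolves the pointer
# to the actual binary and returns a local cache path.
from huggingface_hub import hf_hub_download

path = hf_hub_download(repo_id="rimelabs/rimecaster", filename="rimecaster.nemo")
print(path)
```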