ghees committed on
Commit 7c00ba7 · verified · 1 parent: 34b8e3e

Push model using huggingface_hub.

Files changed (3)
  1. .gitattributes +1 -0
  2. README.md +137 -92
  3. rimecaster.nemo +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ rimecaster.nemo filter=lfs diff=lfs merge=lfs -text
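
For context (an aside, not part of the commit): an attribute line like the one added above is what `git lfs track` writes, so that git stores an LFS pointer instead of the binary checkpoint:

```bash
# Track the .nemo checkpoint with Git LFS and stage the updated attributes.
git lfs track "rimecaster.nemo"
git add .gitattributes
```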
README.md CHANGED
@@ -1,151 +1,196 @@
- # Model Card for Rimecaster 🎸
-
- <!-- Provide a quick summary of what the model is/does. -->
-
- Rimecaster is a foundation model designed to generate extremely rich speaker representations (embeddings) trained by [Rime Labs](www.rime.ai), following the wonderful work by the NeMo team at NVIDIA on [TitaNet](https://huggingface.co/nvidia/speakerverification_en_titanet_large).
-
- ## Model Details
-
- ### Model Description
-
- <!-- Provide a longer summary of what this model is. -->
-
- - **Developed by:** [Rime Labs]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]
-
- ### Model Sources [optional]
-
- <!-- Provide the basic links for the model. -->
-
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
-
- ### Direct Use
-
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-
- [More Information Needed]
-
- ### Downstream Use [optional]
-
- This speaker representation model is intended for use in downstream speech tasks like diarization, speaker-conditioned speech recognition, and multi-speaker text-to-speech models, including in Rime's flagship speech synthesis model, Mist.
-
- ## How to Get Started with the Model
-
- Use the code below to get started with the model.
-
- CODE SAMPLES NEEDED
-
- ## Training Details
-
- ### Training Data
-
- Rimecaster was trained on speech datasets including:
-
- - Voxceleb
- - Fisher
- - Switchboard
- - Librispeech
-
- As well as a massive amount of proprietary speech data collected by Rime in our San Francisco, CA recording studio.
-
- ### Training Procedure
-
- #### Preprocessing [optional]
-
- TBD
-
- #### Training Hyperparameters
-
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-
- #### Speeds, Sizes, Times [optional]
-
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-
- [More Information Needed]
-
- ## Evaluation
-
- <!-- This section describes the evaluation protocols and provides the results. -->
-
- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
-
- ### Results
-
- Performances of the these models are reported in terms of Equal Error Rate (EER%) on speaker verification evaluation trial files and as Diarization Error Rate (DER%) on diarization test sessions.
-
- * Speaker Verification (EER%)
- | Version | Model | Model Size | VoxCeleb1 (Cleaned trial file) |
- |---------|--------------|-----|---------------|
- | 1.0.0 | Rimecaster | XXXXM | XXXXXXX |
-
- * Speaker Diarization (DER%)
- | Version | Model | Model Size | Evaluation Condition | NIST SRE 2000 | AMI (Lapel) | AMI (MixHeadset) | CH109 |
- |---------|--------------|-----|----------------------|---------------|-------------|------------------|-------|
- | 1.0.0 | Rimecaster | XXXXXXX | Oracle VAD KNOWN # of Speakers | XXXXXXX | XXXXXXX | XXXXXXX | XXXXXXX |
- | 1.0.0 | Rimecaster | XXXXXXX | Oracle VAD UNKNOWN # of Speakers | XXXXXXX | XXXXXXX | XXXXXXX | XXXXXXX |
-
- #### Summary
-
- ## Technical Specifications [optional]
-
- ### Model Architecture and Objective
-
- [More Information Needed]
-
- ### Compute Infrastructure
-
- [More Information Needed]
-
- #### Hardware
-
- [More Information Needed]
-
- #### Software
-
- [More Information Needed]
-
- ## Model Card Authors [optional]
-
- [More Information Needed]
-
- ## Model Card Contact
-
- [More Information Needed]
+ ---
+ library_name: nemo
+ license: cc-by-4.0
+ tags:
+ - pytorch
+ - NeMo
+ ---

+ # Rimecaster

+ <style>
+ img {
+  display: inline;
+ }
+ </style>

+ [![Model architecture](https://img.shields.io/badge/Model_Arch-PUT-YOUR-ARCHITECTURE-HERE-lightgrey#model-badge)](#model-architecture)
+ | [![Model size](https://img.shields.io/badge/Params-PUT-YOUR-MODEL-SIZE-HERE-lightgrey#model-badge)](#model-architecture)
+ | [![Language](https://img.shields.io/badge/Language-PUT-YOUR-LANGUAGE-HERE-lightgrey#model-badge)](#datasets)

+ Rimecaster is a foundation model from [Rime Labs](https://www.rime.ai) that produces rich speaker representations (embeddings), following the NeMo team's work on [TitaNet](https://huggingface.co/nvidia/speakerverification_en_titanet_large).

+ See the [model architecture](#model-architecture) section and [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/stable/index.html) for complete architecture details.

+ ## NVIDIA NeMo: Training

+ To train, fine-tune, or play with the model you will need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed the latest PyTorch version.

+ ```bash
+ pip install nemo_toolkit['all']
+ ```
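
+ As a quick sanity check (a suggested step, not part of the upstream instructions), confirm the toolkit imports and print its version:

+ ```python
+ # Verifies the NeMo install; the package exposes its version string.
+ import nemo
+ print(nemo.__version__)
+ ```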

+ ## How to Use this Model

+ The model is available for use in the NeMo toolkit [1], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

+ ### Automatically instantiate the model

+ **NOTE**: Please update the model class below to match the class of the model being uploaded.

+ ```python
+ from nemo.core import ModelPT
+ model = ModelPT.from_pretrained("rimelabs/rimecaster")
+ ```

+ ### NOTE

+ Add some information about how to use the model here. An example is provided for ASR inference below, and a speaker-embedding sketch follows it.

+ ### Transcribing using Python

+ First, let's get a sample:

+ ```bash
+ wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav
+ ```

+ Then simply do:

+ ```python
+ # assumes the model instantiated above exposes an ASR-style transcribe() method
+ model.transcribe(['2086-149220-0033.wav'])
+ ```

+ ### Transcribing many audio files

+ ```shell
+ python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py pretrained_name="rimelabs/rimecaster" audio_dir=""
+ ```
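
+ ### Extracting speaker embeddings

+ Since the previous revision of this card describes Rimecaster as a speaker-representation model, here is a minimal sketch of embedding extraction. It assumes the checkpoint loads as NeMo's TitaNet-style `EncDecSpeakerLabelModel`; that class choice is an assumption, not confirmed for this checkpoint, so verify the actual model class first.

+ ```python
+ # Sketch only: assumes a TitaNet-style speaker-label checkpoint.
+ from nemo.collections.asr.models import EncDecSpeakerLabelModel
+ 
+ model = EncDecSpeakerLabelModel.from_pretrained("rimelabs/rimecaster")
+ 
+ # Fixed-size speaker embedding for one utterance (a tensor).
+ emb = model.get_embedding("2086-149220-0033.wav")
+ print(emb.shape)
+ 
+ # Same-speaker decision for two files (the paths are placeholders).
+ same = model.verify_speakers("speaker_a.wav", "speaker_b.wav")
+ ```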

+ ### Input

+ **Add some information about what the inputs to this model are**

+ ### Output

+ **Add some information about what the outputs of this model are**

+ ## Model Architecture

+ **Add information here discussing architectural details of the model or any comments to users about the model.**

+ ## Training

+ **Add information here about how the model was trained. It should be as detailed as possible, potentially including the link to the script used to train as well as the base config used to train the model. If extraneous scripts are used to prepare the components of the model, please include them here.**

+ ### NOTE

+ An example is provided below for ASR.

+ The NeMo toolkit [1] was used for training the models for over several hundred epochs. These models are trained with this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_transducer/speech_to_text_rnnt_bpe.py) and this [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/fastconformer/fast-conformer_transducer_bpe.yaml).

+ The tokenizers for these models were built using the text transcripts of the train set with this [script](https://github.com/NVIDIA/NeMo/blob/main/scripts/tokenizers/process_asr_text_tokenizer.py).

+ ### Datasets

+ **Try to provide as detailed a list of datasets as possible. If possible, provide links to the datasets on HF by adding them to the manifest section at the top of the README (marked by ---).**

+ ### NOTE

+ An example of the manifest section is provided below for ASR datasets:

+ ```yaml
+ datasets:
+ - librispeech_asr
+ - fisher_corpus
+ - Switchboard-1
+ - WSJ-0
+ - WSJ-1
+ - National-Singapore-Corpus-Part-1
+ - National-Singapore-Corpus-Part-6
+ - vctk
+ - voxpopuli
+ - europarl
+ - multilingual_librispeech
+ - mozilla-foundation/common_voice_8_0
+ - MLCommons/peoples_speech
+ ```

+ The corresponding text in this section for those datasets is stated below:

+ The model was trained on 64K hours of English speech collected and prepared by the NVIDIA NeMo and Suno teams.

+ The training dataset consists of a private subset with 40K hours of English speech plus 24K hours from the following public datasets:

+ - Librispeech 960 hours of English speech
+ - Fisher Corpus
+ - Switchboard-1 Dataset
+ - WSJ-0 and WSJ-1
+ - National Speech Corpus (Part 1, Part 6)
+ - VCTK
+ - VoxPopuli (EN)
+ - Europarl-ASR (EN)
+ - Multilingual Librispeech (MLS EN) - 2,000 hour subset
+ - Mozilla Common Voice (v7.0)
+ - People's Speech - 12,000 hour subset

+ ## Performance

+ **Add information here about the performance of the model. Discuss which metric is used to evaluate the model, and if there are external links explaining the custom metric, please link to them. Provide any caveats about the results up front so that nuance is not lost. Results should ideally be in tabular format (you can use https://www.tablesgenerator.com/markdown_tables to build markdown tables).**

+ ### NOTE

+ An example is provided below for an ASR metrics list that can be added to the top of the README:

+ ```yaml
+ model-index:
+ - name: PUT_MODEL_NAME
+   results:
+   - task:
+       name: Automatic Speech Recognition
+       type: automatic-speech-recognition
+     dataset:
+       name: AMI (Meetings test)
+       type: edinburghcstr/ami
+       config: ihm
+       split: test
+       args:
+         language: en
+     metrics:
+     - name: Test WER
+       type: wer
+       value: 17.10
+   - task:
+       name: Automatic Speech Recognition
+       type: automatic-speech-recognition
+     dataset:
+       name: Earnings-22
+       type: revdotcom/earnings22
+       split: test
+       args:
+         language: en
+     metrics:
+     - name: Test WER
+       type: wer
+       value: 14.11
+ ```

+ ## Limitations

+ **Discuss any practical limitations to the model when being used in real world cases. They can also be legal disclaimers, or discussion regarding the safety of the model (particularly in the case of LLMs).**

+ ### NOTE

+ An example is provided below.

+ Since this model was trained on publicly available speech datasets, the performance of this model might degrade for speech which includes technical terms, or vernacular that the model has not been trained on. The model might also perform worse for accented speech.

+ ## License

+ License to use this model is covered by the [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/) license. By downloading the public and release version of the model, you accept its terms and conditions.

+ ## References

+ **Provide appropriate references in the markdown link format below. Please order them numerically.**

+ [1] [NVIDIA NeMo Toolkit](https://github.com/NVIDIA/NeMo)
rimecaster.nemo ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cee3b74a85d6c1e004a6d982c98b09d59035c635eee9cab1651d8f399587b6c5
+ size 121405440
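
The three lines above are a Git LFS pointer; the ~121 MB `.nemo` checkpoint itself lives in LFS storage. As a sketch (not part of the commit), one way to fetch the resolved file is via `huggingface_hub`, with the repo id `rimelabs/rimecaster` taken from the README snippet above:

```python
# Downloads the LFS-backed checkpoint; the Hub resolves the pointer
# to the actual binary and returns a local cache path.
from huggingface_hub import hf_hub_download

path = hf_hub_download(repo_id="rimelabs/rimecaster", filename="rimecaster.nemo")
print(path)
```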