Update README.md
README.md
CHANGED
|
@@ -45,7 +45,7 @@ model-index:
|
|
| 45 |
metrics:
|
| 46 |
- name: Test DER
|
| 47 |
type: der
|
| 48 |
-
value: 13.
|
| 49 |
- task:
|
| 50 |
name: Speaker Diarization
|
| 51 |
type: speaker-diarization-with-post-processing
|
|
@@ -58,7 +58,7 @@ model-index:
|
|
| 58 |
metrics:
|
| 59 |
- name: Test DER
|
| 60 |
type: der
|
| 61 |
-
value: 42.
|
| 62 |
- task:
|
| 63 |
name: Speaker Diarization
|
| 64 |
type: speaker-diarization-with-post-processing
|
|
@@ -71,7 +71,7 @@ model-index:
|
|
| 71 |
metrics:
|
| 72 |
- name: Test DER
|
| 73 |
type: der
|
| 74 |
-
value: 18.
|
| 75 |
- task:
|
| 76 |
name: Speaker Diarization
|
| 77 |
type: speaker-diarization-with-post-processing
|
|
@@ -84,7 +84,7 @@ model-index:
|
|
| 84 |
metrics:
|
| 85 |
- name: Test DER
|
| 86 |
type: der
|
| 87 |
-
value: 6.
|
| 88 |
- task:
|
| 89 |
name: Speaker Diarization
|
| 90 |
type: speaker-diarization-with-post-processing
|
|
@@ -97,7 +97,7 @@ model-index:
|
|
| 97 |
metrics:
|
| 98 |
- name: Test DER
|
| 99 |
type: der
|
| 100 |
-
value: 10.
|
| 101 |
- task:
|
| 102 |
name: Speaker Diarization
|
| 103 |
type: speaker-diarization-with-post-processing
|
|
@@ -110,7 +110,7 @@ model-index:
|
|
| 110 |
metrics:
|
| 111 |
- name: Test DER
|
| 112 |
type: der
|
| 113 |
-
value: 12.
|
| 114 |
- task:
|
| 115 |
name: Speaker Diarization
|
| 116 |
type: speaker-diarization-with-post-processing
|
|
@@ -123,7 +123,7 @@ model-index:
|
|
| 123 |
metrics:
|
| 124 |
- name: Test DER
|
| 125 |
type: der
|
| 126 |
-
value:
|
| 127 |
- task:
|
| 128 |
name: Speaker Diarization
|
| 129 |
type: speaker-diarization-with-post-processing
|
|
@@ -136,7 +136,7 @@ model-index:
|
|
| 136 |
metrics:
|
| 137 |
- name: Test DER
|
| 138 |
type: der
|
| 139 |
-
value:
|
| 140 |
- task:
|
| 141 |
name: Speaker Diarization
|
| 142 |
type: speaker-diarization-with-post-processing
|
|
@@ -149,7 +149,7 @@ model-index:
|
|
| 149 |
metrics:
|
| 150 |
- name: Test DER
|
| 151 |
type: der
|
| 152 |
-
value: 10.
|
| 153 |
- task:
|
| 154 |
name: Speaker Diarization
|
| 155 |
type: speaker-diarization-with-post-processing
|
|
@@ -162,7 +162,7 @@ model-index:
|
|
| 162 |
metrics:
|
| 163 |
- name: Test DER
|
| 164 |
type: der
|
| 165 |
-
value:
|
| 166 |
metrics:
|
| 167 |
- der
|
| 168 |
pipeline_tag: audio-classification
|
|
@@ -187,7 +187,7 @@ This model is a streaming version of Sortformer diarizer. [Sortformer](https://a
|
|
| 187 |
<img src="figures/sortformer_intro.png" width="750" />
|
| 188 |
</div>
|
| 189 |
|
| 190 |
-
[Streaming Sortformer](https://arxiv.org/abs/
|
| 191 |
<div align="center">
|
| 192 |
<img src="figures/streaming_sortformer_ani.gif" width="1400" />
|
| 193 |
</div>
|
|
@@ -205,7 +205,7 @@ Streaming sortformer employs pre-encode layer in the Fast-Conformer to generate
|
|
| 205 |
|
| 206 |
Aside from the speaker-cache management part, Streaming Sortformer follows the architecture of the offline version of Sortformer. Sortformer consists of an L-size (17 layers) [NeMo Encoder for
|
| 207 |
Speech Tasks (NEST)](https://arxiv.org/abs/2408.13106)[3], which is based on the [Fast-Conformer](https://arxiv.org/abs/2305.05084)[4] encoder. This is followed by an 18-layer Transformer[5] encoder with a hidden size of 192,
|
| 208 |
-
and two feedforward layers with 4 sigmoid outputs for each frame input at the top layer. More information can be found in the [Streaming Sortformer paper](https://arxiv.org/abs/
|
| 209 |
|
| 210 |
<div align="center">
|
| 211 |
<img src="figures/sortformer-v1-model.png" width="450" />
|
|
@@ -283,6 +283,7 @@ Streaming configuration is defined by the following parameters, all measured in
|
|
| 283 |
Here are recommended configurations for different scenarios:
|
| 284 |
| **Configuration** | **Latency** | **RTF** | **CHUNK_SIZE** | **RIGHT_CONTEXT** | **FIFO_SIZE** | **UPDATE_PERIOD** | **SPEAKER_CACHE_SIZE** |
|
| 285 |
| :---------------- | :---------- | :------ | :------------- | :---------------- | :------------ | :---------------- | :--------------------- |
|
|
|
|
| 286 |
| high latency | 10.0s | 0.005 | 124 | 1 | 124 | 124 | 188 |
|
| 287 |
| low latency | 1.04s | 0.093 | 6 | 7 | 188 | 144 | 188 |
|
| 288 |
| ultra low latency | 0.32s | 0.180 | 3 | 1 | 188 | 144 | 188 |
|
|
@@ -390,17 +391,19 @@ Data collection methods vary across individual datasets. For example, the above
|
|
| 390 |
* All evaluations include overlapping speech.
|
| 391 |
* Collar tolerance is 0s for DIHARD III Eval, and 0.25s for CALLHOME-part2 and CH109.
|
| 392 |
* Post-Processing (PP) is optimized on two different held-out dataset splits.
|
| 393 |
-
- [DIHARD III Dev Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/
|
| 394 |
-
- [CALLHOME-part1 Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/
|
| 395 |
|
| 396 |
| **Latency** | *PP* | **DIHARD III Eval <=4spk** | **DIHARD III Eval >=5spk** | **DIHARD III Eval full** | **CALLHOME-part2 2spk** | **CALLHOME-part2 3spk** | **CALLHOME-part2 4spk** | **CALLHOME-part2 5spk** | **CALLHOME-part2 6spk** | **CALLHOME-part2 full** | **CH109** |
|
| 397 |
|-------------|------|----------------------------|----------------------------|--------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-----------|
|
| 398 |
-
|
|
| 399 |
-
|
|
| 400 |
-
|
|
| 401 |
-
|
|
| 402 |
-
|
|
| 403 |
-
|
|
|
|
|
|
|
|
| 404 |
|
| 405 |
|
| 406 |
## NVIDIA Riva: Deployment
|
|
@@ -419,7 +422,7 @@ Check out [Riva live demo](https://developer.nvidia.com/riva#demos).
|
|
| 419 |
## References
|
| 420 |
[1] [Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens](https://arxiv.org/abs/2409.06656)
|
| 421 |
|
| 422 |
-
[2] [Streaming Sortformer: Speaker Cache-Based Online Speaker Diarization with Arrival-Time Ordering](https://arxiv.org/abs/
|
| 423 |
|
| 424 |
[3] [NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks](https://arxiv.org/abs/2408.13106)
|
| 425 |
|
|
|
|
| 45 |
metrics:
|
| 46 |
- name: Test DER
|
| 47 |
type: der
|
| 48 |
+
value: 13.24
|
| 49 |
- task:
|
| 50 |
name: Speaker Diarization
|
| 51 |
type: speaker-diarization-with-post-processing
|
|
|
|
| 58 |
metrics:
|
| 59 |
- name: Test DER
|
| 60 |
type: der
|
| 61 |
+
value: 42.56
|
| 62 |
- task:
|
| 63 |
name: Speaker Diarization
|
| 64 |
type: speaker-diarization-with-post-processing
|
|
|
|
| 71 |
metrics:
|
| 72 |
- name: Test DER
|
| 73 |
type: der
|
| 74 |
+
value: 18.91
|
| 75 |
- task:
|
| 76 |
name: Speaker Diarization
|
| 77 |
type: speaker-diarization-with-post-processing
|
|
|
|
| 84 |
metrics:
|
| 85 |
- name: Test DER
|
| 86 |
type: der
|
| 87 |
+
value: 6.57
|
| 88 |
- task:
|
| 89 |
name: Speaker Diarization
|
| 90 |
type: speaker-diarization-with-post-processing
|
|
|
|
| 97 |
metrics:
|
| 98 |
- name: Test DER
|
| 99 |
type: der
|
| 100 |
+
value: 10.05
|
| 101 |
- task:
|
| 102 |
name: Speaker Diarization
|
| 103 |
type: speaker-diarization-with-post-processing
|
|
|
|
| 110 |
metrics:
|
| 111 |
- name: Test DER
|
| 112 |
type: der
|
| 113 |
+
value: 12.44
|
| 114 |
- task:
|
| 115 |
name: Speaker Diarization
|
| 116 |
type: speaker-diarization-with-post-processing
|
|
|
|
| 123 |
metrics:
|
| 124 |
- name: Test DER
|
| 125 |
type: der
|
| 126 |
+
value: 21.68
|
| 127 |
- task:
|
| 128 |
name: Speaker Diarization
|
| 129 |
type: speaker-diarization-with-post-processing
|
|
|
|
| 136 |
metrics:
|
| 137 |
- name: Test DER
|
| 138 |
type: der
|
| 139 |
+
value: 28.74
|
| 140 |
- task:
|
| 141 |
name: Speaker Diarization
|
| 142 |
type: speaker-diarization-with-post-processing
|
|
|
|
| 149 |
metrics:
|
| 150 |
- name: Test DER
|
| 151 |
type: der
|
| 152 |
+
value: 10.70
|
| 153 |
- task:
|
| 154 |
name: Speaker Diarization
|
| 155 |
type: speaker-diarization-with-post-processing
|
|
|
|
| 162 |
metrics:
|
| 163 |
- name: Test DER
|
| 164 |
type: der
|
| 165 |
+
value: 4.88
|
| 166 |
metrics:
|
| 167 |
- der
|
| 168 |
pipeline_tag: audio-classification
|
|
|
|
| 187 |
<img src="figures/sortformer_intro.png" width="750" />
|
| 188 |
</div>
|
| 189 |
|
| 190 |
+
[Streaming Sortformer](https://arxiv.org/abs/2507.18446)[2] employs an Arrival-Order Speaker Cache (AOSC) to store frame-level acoustic embeddings of previously observed speakers.
|
| 191 |
<div align="center">
|
| 192 |
<img src="figures/streaming_sortformer_ani.gif" width="1400" />
|
| 193 |
</div>
|
|
|
|
| 205 |
|
| 206 |
Aside from the speaker-cache management part, Streaming Sortformer follows the architecture of the offline version of Sortformer. Sortformer consists of an L-size (17 layers) [NeMo Encoder for
|
| 207 |
Speech Tasks (NEST)](https://arxiv.org/abs/2408.13106)[3], which is based on the [Fast-Conformer](https://arxiv.org/abs/2305.05084)[4] encoder. This is followed by an 18-layer Transformer[5] encoder with a hidden size of 192,
|
| 208 |
+
and two feedforward layers with 4 sigmoid outputs for each frame input at the top layer. More information can be found in the [Streaming Sortformer paper](https://arxiv.org/abs/2507.18446)[2].
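
Because the top layer emits four independent sigmoid outputs per frame rather than a single softmax, more than one speaker can be marked active in the same frame, which is one way overlapping speech can be represented. The following is a minimal plain-Python sketch of that decoding idea only. The logits, the 0.5 threshold, and the helper name `active_speakers` are illustrative assumptions, not values or APIs from the model card or from NeMo.

```python
import math

NUM_SPEAKERS = 4  # the model card describes 4 sigmoid outputs per frame
THRESHOLD = 0.5   # assumption: illustrative decision threshold, not a model-card value


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def active_speakers(frame_logits):
    """Hypothetical helper: indices of speakers whose sigmoid score exceeds THRESHOLD."""
    if len(frame_logits) != NUM_SPEAKERS:
        raise ValueError(f"expected {NUM_SPEAKERS} logits per frame")
    return [i for i, logit in enumerate(frame_logits) if sigmoid(logit) > THRESHOLD]


# Made-up logits for three frames (not real model output).
FRAMES = [
    [2.0, -3.0, -3.0, -3.0],   # a single speaker
    [1.5, 0.8, -2.0, -4.0],    # two speakers overlapping
    [-2.0, -2.0, -2.0, -2.0],  # no speech
]

DECODED = [active_speakers(frame) for frame in FRAMES]
print(DECODED)
```

Each output is thresholded independently, so the second frame yields two active speakers at once, which a softmax over speakers could not express.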
|
| 209 |
|
| 210 |
<div align="center">
|
| 211 |
<img src="figures/sortformer-v1-model.png" width="450" />
|
|
|
|
| 283 |
Here are recommended configurations for different scenarios:
|
| 284 |
| **Configuration** | **Latency** | **RTF** | **CHUNK_SIZE** | **RIGHT_CONTEXT** | **FIFO_SIZE** | **UPDATE_PERIOD** | **SPEAKER_CACHE_SIZE** |
|
| 285 |
| :---------------- | :---------- | :------ | :------------- | :---------------- | :------------ | :---------------- | :--------------------- |
|
| 286 |
+
| very high latency | 30.4s | 0.002 | 340 | 40 | 40 | 300 | 188 |
|
| 287 |
| high latency | 10.0s | 0.005 | 124 | 1 | 124 | 124 | 188 |
|
| 288 |
| low latency | 1.04s | 0.093 | 6 | 7 | 188 | 144 | 188 |
|
| 289 |
| ultra low latency | 0.32s | 0.180 | 3 | 1 | 188 | 144 | 188 |
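
The latency column appears to follow from the chunk and right-context sizes: every row in the table above is consistent with latency = (CHUNK_SIZE + RIGHT_CONTEXT) × 0.08 s. The 0.08 s frame duration is an assumption inferred from the table values, since this diff does not state the frame length, and `input_latency_seconds` is a hypothetical helper name. A minimal sketch of the arithmetic:

```python
# Assumption: one frame spans 0.08 s (inferred from the table, not stated in this diff).
FRAME_SECONDS = 0.08

# (CHUNK_SIZE, RIGHT_CONTEXT) pairs copied from the table above.
CONFIGS = {
    "very high latency": (340, 40),
    "high latency": (124, 1),
    "low latency": (6, 7),
    "ultra low latency": (3, 1),
}


def input_latency_seconds(chunk_size: int, right_context: int) -> float:
    """Hypothetical helper: latency from the chunk frames plus the look-ahead frames."""
    return round((chunk_size + right_context) * FRAME_SECONDS, 2)


LATENCIES = {name: input_latency_seconds(*sizes) for name, sizes in CONFIGS.items()}

for name, seconds in LATENCIES.items():
    print(f"{name}: {seconds:.2f}s")
```

Under that assumption the four results reproduce the Latency column (30.4s, 10.0s, 1.04s, 0.32s).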
|
|
|
|
| 391 |
* All evaluations include overlapping speech.
|
| 392 |
* Collar tolerance is 0s for DIHARD III Eval, and 0.25s for CALLHOME-part2 and CH109.
|
| 393 |
* Post-Processing (PP) is optimized on two different held-out dataset splits.
|
| 394 |
+
- [DIHARD III Dev Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/diar_streaming_sortformer_4spk-v2_dihard3-dev.yaml) for DIHARD III Eval
|
| 395 |
+
- [CALLHOME-part1 Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/diar_streaming_sortformer_4spk-v2_callhome-part1.yaml) for CALLHOME-part2 and CH109
|
| 396 |
|
| 397 |
| **Latency** | *PP* | **DIHARD III Eval <=4spk** | **DIHARD III Eval >=5spk** | **DIHARD III Eval full** | **CALLHOME-part2 2spk** | **CALLHOME-part2 3spk** | **CALLHOME-part2 4spk** | **CALLHOME-part2 5spk** | **CALLHOME-part2 6spk** | **CALLHOME-part2 full** | **CH109** |
|
| 398 |
|-------------|------|----------------------------|----------------------------|--------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-----------|
|
| 399 |
+
| 30.4s | no | 14.63 | 40.74 | 19.68 | 6.27 | 10.27 | 12.30 | 19.08 | 28.09 | 10.50 | 5.03 |
|
| 400 |
+
| 30.4s | yes | 13.45 | 41.40 | 18.85 | 5.34 | 9.22 | 11.29 | 18.84 | 27.29 | 9.54 | 4.61 |
|
| 401 |
+
| 10.0s | no | 14.90 | 41.06 | 19.96 | 6.96 | 11.05 | 12.93 | 20.47 | 28.10 | 11.21 | 5.28 |
|
| 402 |
+
| 10.0s | yes | 13.75 | 41.41 | 19.10 | 6.05 | 9.88 | 11.72 | 19.66 | 27.37 | 10.15 | 4.80 |
|
| 403 |
+
| 1.04s | no | 14.49 | 42.22 | 19.85 | 7.51 | 11.45 | 13.75 | 23.22 | 29.22 | 11.89 | 5.37 |
|
| 404 |
+
| 1.04s | yes | 13.24 | 42.56 | 18.91 | 6.57 | 10.05 | 12.44 | 21.68 | 28.74 | 10.70 | 4.88 |
|
| 405 |
+
| 0.32s | no | 14.64 | 43.47 | 20.19 | 8.63 | 12.91 | 16.19 | 29.40 | 30.60 | 13.57 | 6.46 |
|
| 406 |
+
| 0.32s | yes | 13.44 | 43.73 | 19.28 | 6.91 | 10.45 | 13.70 | 27.04 | 28.58 | 11.38 | 5.27 |
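
To read the effect of post-processing off the table above, one can subtract the DER with PP from the DER without PP at each latency. The sketch below does this for two columns, with values copied from the table. The variable and function names are illustrative only.

```python
# DER (%) pairs copied from the table above: (without PP, with PP).
CALLHOME_FULL = {
    "30.4s": (10.50, 9.54),
    "10.0s": (11.21, 10.15),
    "1.04s": (11.89, 10.70),
    "0.32s": (13.57, 11.38),
}
DIHARD_5PLUS = {
    "30.4s": (40.74, 41.40),
    "10.0s": (41.06, 41.41),
    "1.04s": (42.22, 42.56),
    "0.32s": (43.47, 43.73),
}


def pp_gain(pairs):
    """Absolute DER reduction from post-processing (positive means PP helps)."""
    return {lat: round(no_pp - with_pp, 2) for lat, (no_pp, with_pp) in pairs.items()}


CALLHOME_GAIN = pp_gain(CALLHOME_FULL)
DIHARD_5PLUS_GAIN = pp_gain(DIHARD_5PLUS)

print("CALLHOME-part2 full:", CALLHOME_GAIN)
print("DIHARD III Eval >=5spk:", DIHARD_5PLUS_GAIN)
```

With these numbers, PP lowers DER on CALLHOME-part2 full at every latency, while on DIHARD III Eval >=5spk the with-PP figure is slightly higher in all four rows.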
|
| 407 |
|
| 408 |
|
| 409 |
## NVIDIA Riva: Deployment
|
|
|
|
| 422 |
## References
|
| 423 |
[1] [Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens](https://arxiv.org/abs/2409.06656)
|
| 424 |
|
| 425 |
+
[2] [Streaming Sortformer: Speaker Cache-Based Online Speaker Diarization with Arrival-Time Ordering](https://arxiv.org/abs/2507.18446)
|
| 426 |
|
| 427 |
[3] [NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks](https://arxiv.org/abs/2408.13106)
|
| 428 |
|