nvidia/diar_streaming_sortformer_4spk-v2

@@ -45,7 +45,7 @@ model-index:
     metrics:
     - name: Test DER
       type: der
-      value: 13.32
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
@@ -58,7 +58,7 @@ model-index:
     metrics:
     - name: Test DER
       type: der
-      value: 42.61
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
@@ -71,7 +71,7 @@ model-index:
     metrics:
     - name: Test DER
       type: der
-      value: 18.97
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
@@ -84,7 +84,7 @@ model-index:
     metrics:
     - name: Test DER
       type: der
-      value: 6.43
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
@@ -97,7 +97,7 @@ model-index:
     metrics:
     - name: Test DER
       type: der
-      value: 10.26
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
@@ -110,7 +110,7 @@ model-index:
     metrics:
     - name: Test DER
       type: der
-      value: 12.40
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
@@ -123,7 +123,7 @@ model-index:
     metrics:
     - name: Test DER
       type: der
-      value: 24.41
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
@@ -136,7 +136,7 @@ model-index:
     metrics:
     - name: Test DER
       type: der
-      value: 27.78
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
@@ -149,7 +149,7 @@ model-index:
     metrics:
     - name: Test DER
       type: der
-      value: 10.79
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
@@ -162,7 +162,7 @@ model-index:
     metrics:
     - name: Test DER
       type: der
-      value: 5.09
 metrics:
 - der
 pipeline_tag: audio-classification
@@ -187,7 +187,7 @@ This model is a streaming version of Sortformer diarizer. [Sortformer](https://a
     <img src="figures/sortformer_intro.png" width="750" />
 </div>
-[Streaming Sortformer](https://arxiv.org/abs/25XX.XXXXX)[2] approach employs an Arrival-Order Speaker Cache (AOSC) to store frame-level acoustic embeddings of previously observed speakers.
 <div align="center">
     <img src="figures/streaming_sortformer_ani.gif" width="1400" />
 </div>
@@ -205,7 +205,7 @@ Streaming sortformer employs pre-encode layer in the Fast-Conformer to generate
 Aside from speaker-cache management part, streaming Sortformer follows the architecture of the offline version of Sortformer. Sortformer consists of an L-size (17 layers) [NeMo Encoder for
 Speech Tasks (NEST)](https://arxiv.org/abs/2408.13106)[3] which is based on [Fast-Conformer](https://arxiv.org/abs/2305.05084)[4] encoder. Following that, an 18-layer Transformer[5] encoder with hidden size of 192,
-and two feedforward layers with 4 sigmoid outputs for each frame input at the top layer. More information can be found in the [Streaming Sortformer paper](https://arxiv.org/abs/25XX.XXXXX)[2].
 <div align="center">
     <img src="figures/sortformer-v1-model.png" width="450" />
@@ -283,6 +283,7 @@ Streaming configuration is defined by the following parameters, all measured in
 Here are recommended configurations for different scenarios:
 | **Configuration** | **Latency** | **RTF** | **CHUNK_SIZE** | **RIGHT_CONTEXT** | **FIFO_SIZE** | **UPDATE_PERIOD** | **SPEAKER_CACHE_SIZE** |
 | :---------------- | :---------- | :------ | :------------- | :---------------- | :------------ | :---------------- | :--------------------- |
 | high latency      | 10.0s       | 0.005   | 124            | 1                 | 124           | 124               | 188                    |
 | low latency       | 1.04s       | 0.093   | 6              | 7                 | 188           | 144               | 188                    |
 | ultra low latency | 0.32s       | 0.180   | 3              | 1                 | 188           | 144               | 188                    |
@@ -390,17 +391,19 @@ Data collection methods vary across individual datasets. For example, the above
 * All evaluations include overlapping speech.
 * Collar tolerance is 0s for DIHARD III Eval, and 0.25s for CALLHOME-part2 and CH109.
 * Post-Processing (PP) is optimized on two different held-out dataset splits.
-    - [DIHARD III Dev Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/sortformer_diar_4spk-v1_dihard3-dev.yaml) for DIHARD III Eval
-    - [CALLHOME-part1 Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/sortformer_diar_4spk-v1_callhome-part1.yaml) for CALLHOME-part2 and CH109
 | **Latency** | *PP* | **DIHARD III Eval <=4spk** | **DIHARD III Eval >=5spk** | **DIHARD III Eval full** | **CALLHOME-part2 2spk** | **CALLHOME-part2 3spk** | **CALLHOME-part2 4spk** | **CALLHOME-part2 5spk** | **CALLHOME-part2 6spk** | **CALLHOME-part2 full** | **CH109** |
 |-------------|------|----------------------------|----------------------------|--------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-----------|
-| 10.0s       | no   | 14.79                      | 41.06                      | 19.88                    | 6.80                    | 11.27                   | 12.21                   | 21.12                   | 27.84                   | 11.10                   | 5.27      |
-| 10.0s       | yes  | 13.67                      | 41.45                      | 19.02                    | 6.06                    | 10.01                   | 11.22                   | 20.34                   | 26.97                   | 10.09                   | 4.82      |
-| 1.04s       | no   | 14.57                      | 42.12                      | 19.89                    | 7.35                    | 11.57                   | 13.83                   | 25.81                   | 29.06                   | 12.00                   | 5.59      |
-| 1.04s       | yes  | 13.32                      | 42.61                      | 18.97                    | 6.43                    | 10.26                   | 12.40                   | 24.41                   | 27.78                   | 10.79                   | 5.09      |
-| 0.32s       | no   | 14.63                      | 43.76                      | 20.25                    | 8.60                    | 13.23                   | 16.08                   | 28.10                   | 30.63                   | 13.66                   | 6.60      |
-| 0.32s       | yes  | 13.43                      | 43.98                      | 19.32                    | 6.86                    | 10.84                   | 13.64                   | 25.78                   | 28.58                   | 11.50                   | 5.41      |
 ## NVIDIA Riva: Deployment
@@ -419,7 +422,7 @@ Check out [Riva live demo](https://developer.nvidia.com/riva#demos).
 ## References
 [1] [Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens](https://arxiv.org/abs/2409.06656)
-[2] [Streaming Sortformer: Speaker Cache-Based Online Speaker Diarization with Arrival-Time Ordering](https://arxiv.org/abs/25XX.XXXXX)
 [3] [NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks](https://arxiv.org/abs/2408.13106)

     metrics:
     - name: Test DER
       type: der
+      value: 13.24
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
     metrics:
     - name: Test DER
       type: der
+      value: 42.56
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
     metrics:
     - name: Test DER
       type: der
+      value: 18.91
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
     metrics:
     - name: Test DER
       type: der
+      value: 6.57
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
     metrics:
     - name: Test DER
       type: der
+      value: 10.05
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
     metrics:
     - name: Test DER
       type: der
+      value: 12.44
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
     metrics:
     - name: Test DER
       type: der
+      value: 21.68
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
     metrics:
     - name: Test DER
       type: der
+      value: 28.74
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
     metrics:
     - name: Test DER
       type: der
+      value: 10.70
   - task:
       name: Speaker Diarization
       type: speaker-diarization-with-post-processing
     metrics:
     - name: Test DER
       type: der
+      value: 4.88
 metrics:
 - der
 pipeline_tag: audio-classification
     <img src="figures/sortformer_intro.png" width="750" />
 </div>
+[Streaming Sortformer](https://arxiv.org/abs/2507.18446)[2] employs an Arrival-Order Speaker Cache (AOSC) to store frame-level acoustic embeddings of previously observed speakers.
 <div align="center">
     <img src="figures/streaming_sortformer_ani.gif" width="1400" />
 </div>
 Aside from speaker-cache management part, streaming Sortformer follows the architecture of the offline version of Sortformer. Sortformer consists of an L-size (17 layers) [NeMo Encoder for
 Speech Tasks (NEST)](https://arxiv.org/abs/2408.13106)[3] which is based on [Fast-Conformer](https://arxiv.org/abs/2305.05084)[4] encoder. Following that, an 18-layer Transformer[5] encoder with hidden size of 192,
+and two feedforward layers with 4 sigmoid outputs for each frame input at the top layer. More information can be found in the [Streaming Sortformer paper](https://arxiv.org/abs/2507.18446)[2].
 <div align="center">
     <img src="figures/sortformer-v1-model.png" width="450" />
 Here are recommended configurations for different scenarios:
 | **Configuration** | **Latency** | **RTF** | **CHUNK_SIZE** | **RIGHT_CONTEXT** | **FIFO_SIZE** | **UPDATE_PERIOD** | **SPEAKER_CACHE_SIZE** |
 | :---------------- | :---------- | :------ | :------------- | :---------------- | :------------ | :---------------- | :--------------------- |
+| very high latency | 30.4s       | 0.002   | 340            | 40                | 40            | 300               | 188                    |
 | high latency      | 10.0s       | 0.005   | 124            | 1                 | 124           | 124               | 188                    |
 | low latency       | 1.04s       | 0.093   | 6              | 7                 | 188           | 144               | 188                    |
 | ultra low latency | 0.32s       | 0.180   | 3              | 1                 | 188           | 144               | 188                    |
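
Every row of the table above is consistent with a latency of (CHUNK_SIZE + RIGHT_CONTEXT) frames at 80 ms per frame. The 80 ms frame length is an assumption inferred from the table values, and `latency_seconds` is a hypothetical helper, not part of NeMo. A minimal sketch that reproduces the Latency column under that assumption:

```python
# Assumption: one frame is 80 ms (inferred from the table, where every row
# satisfies latency = (CHUNK_SIZE + RIGHT_CONTEXT) * 0.08 s).
FRAME_SEC = 0.08

# (CHUNK_SIZE, RIGHT_CONTEXT) per configuration, copied from the table.
CONFIGS = {
    "very high latency": (340, 40),
    "high latency": (124, 1),
    "low latency": (6, 7),
    "ultra low latency": (3, 1),
}


def latency_seconds(chunk_size: int, right_context: int,
                    frame_sec: float = FRAME_SEC) -> float:
    """Hypothetical helper: audio (in seconds) needed before a chunk is processed."""
    return round((chunk_size + right_context) * frame_sec, 2)


for name, (chunk, right_ctx) in CONFIGS.items():
    print(f"{name}: {latency_seconds(chunk, right_ctx):.2f}s")
```

FIFO_SIZE, UPDATE_PERIOD, and SPEAKER_CACHE_SIZE do not enter this calculation; in the table they vary independently of the Latency column.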
 * All evaluations include overlapping speech.
 * Collar tolerance is 0s for DIHARD III Eval, and 0.25s for CALLHOME-part2 and CH109.
 * Post-Processing (PP) is optimized on two different held-out dataset splits.
+    - [DIHARD III Dev Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/diar_streaming_sortformer_4spk-v2_dihard3-dev.yaml) for DIHARD III Eval
+    - [CALLHOME-part1 Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/diar_streaming_sortformer_4spk-v2_callhome-part1.yaml) for CALLHOME-part2 and CH109
 | **Latency** | *PP* | **DIHARD III Eval <=4spk** | **DIHARD III Eval >=5spk** | **DIHARD III Eval full** | **CALLHOME-part2 2spk** | **CALLHOME-part2 3spk** | **CALLHOME-part2 4spk** | **CALLHOME-part2 5spk** | **CALLHOME-part2 6spk** | **CALLHOME-part2 full** | **CH109** |
 |-------------|------|----------------------------|----------------------------|--------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-------------------------|-----------|
+| 30.4s       | no   | 14.63                      | 40.74                      | 19.68                    | 6.27                    | 10.27                   | 12.30                   | 19.08                   | 28.09                   | 10.50                   | 5.03      |
+| 30.4s       | yes  | 13.45                      | 41.40                      | 18.85                    | 5.34                    | 9.22                    | 11.29                   | 18.84                   | 27.29                   | 9.54                    | 4.61      |
+| 10.0s       | no   | 14.90                      | 41.06                      | 19.96                    | 6.96                    | 11.05                   | 12.93                   | 20.47                   | 28.10                   | 11.21                   | 5.28      |
+| 10.0s       | yes  | 13.75                      | 41.41                      | 19.10                    | 6.05                    | 9.88                    | 11.72                   | 19.66                   | 27.37                   | 10.15                   | 4.80      |
+| 1.04s       | no   | 14.49                      | 42.22                      | 19.85                    | 7.51                    | 11.45                   | 13.75                   | 23.22                   | 29.22                   | 11.89                   | 5.37      |
+| 1.04s       | yes  | 13.24                      | 42.56                      | 18.91                    | 6.57                    | 10.05                   | 12.44                   | 21.68                   | 28.74                   | 10.70                   | 4.88      |
+| 0.32s       | no   | 14.64                      | 43.47                      | 20.19                    | 8.63                    | 12.91                   | 16.19                   | 29.40                   | 30.60                   | 13.57                   | 6.46      |
+| 0.32s       | yes  | 13.44                      | 43.73                      | 19.28                    | 6.91                    | 10.45                   | 13.70                   | 27.04                   | 28.58                   | 11.38                   | 5.27      |
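
To read the effect of post-processing off the table, one option is to compute the absolute and relative DER reduction per latency setting. The sketch below does this for the CALLHOME-part2 full column only; `pp_gain` is a hypothetical helper, and the numbers are copied from the rows above.

```python
# DER (%) on CALLHOME-part2 full, copied from the table: latency -> (no PP, with PP).
DER_CALLHOME_PART2_FULL = {
    "30.4s": (10.50, 9.54),
    "10.0s": (11.21, 10.15),
    "1.04s": (11.89, 10.70),
    "0.32s": (13.57, 11.38),
}


def pp_gain(der_no_pp: float, der_pp: float) -> tuple[float, float]:
    """Return (absolute, relative) DER reduction; relative is a fraction of the no-PP DER."""
    absolute = der_no_pp - der_pp
    return absolute, absolute / der_no_pp


for latency, (no_pp, with_pp) in DER_CALLHOME_PART2_FULL.items():
    abs_gain, rel_gain = pp_gain(no_pp, with_pp)
    print(f"{latency}: {abs_gain:.2f} DER points lower ({rel_gain:.1%} relative)")
```

The gain is not uniform across columns: for DIHARD III Eval >=5spk the same computation gives a negative value at every latency (for example 40.74 without PP versus 41.40 with PP at 30.4s), meaning post-processing raises DER on that subset.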
 ## NVIDIA Riva: Deployment
 ## References
 [1] [Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens](https://arxiv.org/abs/2409.06656)
+[2] [Streaming Sortformer: Speaker Cache-Based Online Speaker Diarization with Arrival-Time Ordering](https://arxiv.org/abs/2507.18446)
 [3] [NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks](https://arxiv.org/abs/2408.13106)