Using this open-source model in production?
Consider switching to pyannoteAI for better and faster options.
# 🎹 Speaker diarization 2.5

Modified from pyannote/speaker-diarization-3.0.

This pipeline uses pyannote/segmentation-3.0 for speaker segmentation, but swaps the speaker embedding model for speechbrain/spkrec-ecapa-voxceleb from pyannote/[email protected].

In some tests, embeddings from speechbrain/spkrec-ecapa-voxceleb seem to work better at automatically detecting the number of speakers.
## Requirements

- Install pyannote.audio 3.0 with `pip install pyannote.audio`
- Accept pyannote/segmentation-3.0 user conditions
- Create an access token at hf.co/settings/tokens
## Usage

```python
# instantiate the pipeline
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "Willy030125/speaker-diarization-2.5",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# run the pipeline on an audio file
diarization = pipeline("audio.wav")

# dump the diarization output to disk using RTTM format
with open("audio.rttm", "w") as rttm:
    diarization.write_rttm(rttm)
```
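The returned object is a pyannote.core.Annotation. As a minimal sketch (not part of the original recipe), you can also iterate over it directly to inspect the speaker turns:

```python
# print each speaker turn with its start/end time in seconds
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker={speaker}")
```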
## Processing on GPU

pyannote.audio pipelines run on CPU by default. You can send them to GPU with the following lines:

```python
import torch

pipeline.to(torch.device("cuda"))
```
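If the same script should also run on machines without a GPU, a small hedged variant (plain PyTorch, nothing pipeline-specific) is to pick the device at runtime:

```python
import torch

# fall back to CPU when no CUDA device is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipeline.to(device)
```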
Real-time factor is around 2.5% using one Nvidia Tesla V100 SXM2 GPU (for the neural inference part) and one Intel Cascade Lake 6248 CPU (for the clustering part). In other words, it takes approximately 1.5 minutes to process a one-hour conversation.
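As a quick sanity check of that figure (simple arithmetic; the 2.5% value is hardware-dependent):

```python
# real-time factor of 2.5%: processing time ≈ 0.025 × audio duration
audio_duration_s = 60 * 60          # one hour of audio
rtf = 0.025                         # real-time factor quoted above
print(audio_duration_s * rtf / 60)  # ≈ 1.5 minutes
```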
## Processing from memory

Pre-loading audio files in memory may result in faster processing:

```python
import torchaudio

waveform, sample_rate = torchaudio.load("audio.wav")
diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})
```
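If your file is not already at the 16 kHz sample rate used by the underlying segmentation and embedding models, one option is to resample it up front with torchaudio. This is only a hedged sketch; the pipeline can typically handle other sample rates on its own:

```python
import torchaudio
import torchaudio.functional as F

waveform, sample_rate = torchaudio.load("audio.wav")

target_sample_rate = 16_000  # assumption: rate expected by the underlying models
if sample_rate != target_sample_rate:
    # resample in memory before handing the waveform to the pipeline
    waveform = F.resample(waveform, orig_freq=sample_rate, new_freq=target_sample_rate)
    sample_rate = target_sample_rate

diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})
```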
## Monitoring progress

Hooks are available to monitor the progress of the pipeline:

```python
from pyannote.audio.pipelines.utils.hook import ProgressHook

with ProgressHook() as hook:
    diarization = pipeline("audio.wav", hook=hook)
```
## Controlling the number of speakers

In case the number of speakers is known in advance, one can use the num_speakers option:

```python
diarization = pipeline("audio.wav", num_speakers=2)
```

One can also provide lower and/or upper bounds on the number of speakers using the min_speakers and max_speakers options:

```python
diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)
```
## Benchmark
This pipeline has been benchmarked on a large collection of datasets.
Processing is fully automatic:
- no manual voice activity detection (as is sometimes the case in the literature)
- no manual number of speakers (though it is possible to provide it to the pipeline)
- no fine-tuning of the internal models nor tuning of the pipeline hyper-parameters to each dataset
... with the least forgiving diarization error rate (DER) setup (named "Full" in this paper):
- no forgiveness collar
- evaluation of overlapped speech
| Benchmark | DER (%) | False alarm (%) | Missed detection (%) | Speaker confusion (%) | Expected output | File-level evaluation |
|---|---|---|---|---|---|---|
| AISHELL-4 | 12.3 | 3.8 | 4.4 | 4.1 | RTTM | eval |
| AliMeeting (channel 1) | 24.3 | 4.4 | 10.0 | 9.9 | RTTM | eval |
| AMI (headset mix, only_words) | 19.0 | 3.6 | 9.5 | 5.9 | RTTM | eval |
| AMI (array1, channel 1, only_words) | 22.2 | 3.8 | 11.2 | 7.3 | RTTM | eval |
| AVA-AVD | 49.1 | 10.8 | 15.7 | 22.5 | RTTM | eval |
| DIHARD 3 (Full) | 21.7 | 6.2 | 8.1 | 7.3 | RTTM | eval |
| MSDWild | 24.6 | 5.8 | 8.0 | 10.7 | RTTM | eval |
| REPERE (phase 2) | 7.8 | 1.8 | 2.6 | 3.5 | RTTM | eval |
| VoxConverse (v0.3) | 11.3 | 4.1 | 3.4 | 3.8 | RTTM | eval |
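For reference, here is a hedged sketch of how such file-level numbers can be reproduced with pyannote.metrics under the same "Full" setup (no collar, overlapped speech evaluated). The reference.rttm path and the "audio" URI are placeholders for your own ground truth:

```python
from pyannote.database.util import load_rttm
from pyannote.metrics.diarization import DiarizationErrorRate

# load the ground-truth annotation (placeholder path and file URI)
reference = load_rttm("reference.rttm")["audio"]

# "Full" setup: no forgiveness collar, overlapped speech is evaluated
metric = DiarizationErrorRate(collar=0.0, skip_overlap=False)
der = metric(reference, diarization)
print(f"DER = {100 * der:.1f}%")
```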
## Citations

```bibtex
@inproceedings{Plaquet23,
  author={Alexis Plaquet and Hervé Bredin},
  title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}
```

```bibtex
@inproceedings{Bredin23,
  author={Hervé Bredin},
  title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}
```