---
tags:
- pyannote
- pyannote-audio
- pyannote-audio-pipeline
- audio
- voice
- speech
- speaker
- speaker-diarization
- speaker-change-detection
- voice-activity-detection
- overlapped-speech-detection
- automatic-speech-recognition
license: mit
extra_gated_prompt: "The collected information will help acquire a better knowledge of the pyannote.audio user base and help its maintainers improve it further. Though this pipeline uses the MIT license and will always remain open-source, we will occasionally email you about premium pipelines and paid services around pyannote."
extra_gated_fields:
  Company/university: text
  Website: text
---

Using this open-source model in production?
Consider switching to [pyannoteAI](https://www.pyannote.ai) for better and faster options.

# 🎹 Speaker diarization 3.0

This pipeline has been trained by Séverin Baroudi with [pyannote.audio](https://github.com/pyannote/pyannote-audio) `3.0.0` using a combination of the training sets of AISHELL, AliMeeting, AMI, AVA-AVD, DIHARD, Ego4D, MSDWild, REPERE, and VoxConverse.

It ingests mono audio sampled at 16kHz and outputs speaker diarization as an [`Annotation`](http://pyannote.github.io/pyannote-core/structure.html#annotation) instance:

* stereo or multi-channel audio files are automatically downmixed to mono by averaging the channels.
* audio files sampled at a different rate are resampled to 16kHz automatically upon loading.

## Requirements

1. Install [`pyannote.audio`](https://github.com/pyannote/pyannote-audio) `3.0` with `pip install pyannote.audio`
2. Accept [`pyannote/segmentation-3.0`](https://hf.co/pyannote/segmentation-3.0) user conditions
3. Accept [`pyannote/speaker-diarization-3.0`](https://hf.co/pyannote/speaker-diarization-3.0) user conditions
4. Create an access token at [`hf.co/settings/tokens`](https://hf.co/settings/tokens).
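
Instead of passing the token to every `from_pretrained` call (as in the usage example below), you can authenticate once per machine with `huggingface_hub` — a minimal sketch; substitute your own token for the placeholder:

```python
from huggingface_hub import login

# stores the access token locally so later downloads can find it
login(token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
```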
					
						
## Usage

```python
# instantiate the pipeline
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# run the pipeline on an audio file
diarization = pipeline("audio.wav")

# dump the diarization output to disk using RTTM format
with open("audio.rttm", "w") as rttm:
    diarization.write_rttm(rttm)
```
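
The returned `diarization` is a pyannote.core [`Annotation`](http://pyannote.github.io/pyannote-core/structure.html#annotation), so you can also consume speaker turns directly with its documented `itertracks` API:

```python
# print each speaker turn with its start and end time (in seconds)
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker} speaks from {turn.start:.1f}s to {turn.end:.1f}s")
```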
					
						
### Processing on GPU

`pyannote.audio` pipelines run on CPU by default.
You can send them to GPU with the following lines:

```python
import torch

pipeline.to(torch.device("cuda"))
```
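
If CUDA may not be available on every machine, a common guard (standard PyTorch, not specific to pyannote) is to fall back to CPU:

```python
import torch

# pick the GPU when present, otherwise stay on the (default) CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pipeline.to(device)
```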
					
						
Real-time factor is around 2.5% using one Nvidia Tesla V100 SXM2 GPU (for the neural inference part) and one Intel Cascade Lake 6248 CPU (for the clustering part).

In other words, it takes approximately 1.5 minutes to process a one-hour conversation.
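
That estimate follows directly from the real-time factor:

```python
# 2.5% of a one-hour (60-minute) recording
print(60 * 0.025)  # 1.5 minutes
```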
					
						
### Processing from memory

Pre-loading audio files in memory may result in faster processing:

```python
import torchaudio

# load audio as a (channel, time) tensor plus its sample rate
waveform, sample_rate = torchaudio.load("audio.wav")
diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})
```

### Monitoring progress

Hooks are available to monitor the progress of the pipeline:

```python
from pyannote.audio.pipelines.utils.hook import ProgressHook

with ProgressHook() as hook:
    diarization = pipeline("audio.wav", hook=hook)
```

### Controlling the number of speakers

In case the number of speakers is known in advance, one can use the `num_speakers` option:

```python
diarization = pipeline("audio.wav", num_speakers=2)
```

One can also provide lower and/or upper bounds on the number of speakers using the `min_speakers` and `max_speakers` options:

```python
diarization = pipeline("audio.wav", min_speakers=2, max_speakers=5)
```

## Benchmark

This pipeline has been benchmarked on a large collection of datasets.

Processing is fully automatic:

* no manual voice activity detection (as is sometimes the case in the literature)
* no manual number of speakers (though it is possible to provide it to the pipeline)
* no fine-tuning of the internal models nor tuning of the pipeline hyper-parameters to each dataset

... with the least forgiving diarization error rate (DER) setup (named *"Full"* in [this paper](https://doi.org/10.1016/j.csl.2021.101254)):

* no forgiveness collar
* evaluation of overlapped speech
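
Under this setup, the reported DER is the sum of the three error components, each expressed as a percentage of total reference speech, as a quick check against the AISHELL-4 row below confirms:

```python
# DER% = FA% + Miss% + Conf% (values from the AISHELL-4 row)
fa, miss, conf = 3.8, 4.4, 4.1
print(fa + miss + conf)  # 12.3, the reported DER%
```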
					
						
| Benchmark | [DER%](. "Diarization error rate") | [FA%](. "False alarm rate") | [Miss%](. "Missed detection rate") | [Conf%](. "Speaker confusion rate") | Expected output | File-level evaluation |
| --------- | ---- | --- | ----- | ----- | --------------- | --------------------- |
| [AISHELL-4](http://www.openslr.org/111/) | 12.3 | 3.8 | 4.4 | 4.1 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AISHELL.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AISHELL.SpeakerDiarization.Benchmark.test.eval) |
| [AliMeeting (*channel 1*)](https://www.openslr.org/119/) | 24.3 | 4.4 | 10.0 | 9.9 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AliMeeting.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AliMeeting.SpeakerDiarization.Benchmark.test.eval) |
| [AMI (*headset mix,*](https://groups.inf.ed.ac.uk/ami/corpus/) [*only_words*)](https://github.com/BUTSpeechFIT/AMI-diarization-setup) | 19.0 | 3.6 | 9.5 | 5.9 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AMI.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AMI.SpeakerDiarization.Benchmark.test.eval) |
| [AMI (*array1, channel 1,*](https://groups.inf.ed.ac.uk/ami/corpus/) [*only_words*)](https://github.com/BUTSpeechFIT/AMI-diarization-setup) | 22.2 | 3.8 | 11.2 | 7.3 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AMI-SDM.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AMI-SDM.SpeakerDiarization.Benchmark.test.eval) |
| [AVA-AVD](https://arxiv.org/abs/2111.14448) | 49.1 | 10.8 | 15.7 | 22.5 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AVA-AVD.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/AVA-AVD.SpeakerDiarization.Benchmark.test.eval) |
| [DIHARD 3 (*Full*)](https://arxiv.org/abs/2012.01477) | 21.7 | 6.2 | 8.1 | 7.3 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/DIHARD.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/DIHARD.SpeakerDiarization.Benchmark.test.eval) |
| [MSDWild](https://x-lance.github.io/MSDWILD/) | 24.6 | 5.8 | 8.0 | 10.7 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/MSDWILD.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/MSDWILD.SpeakerDiarization.Benchmark.test.eval) |
| [REPERE (*phase 2*)](https://islrn.org/resources/360-758-359-485-0/) | 7.8 | 1.8 | 2.6 | 3.5 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/REPERE.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/REPERE.SpeakerDiarization.Benchmark.test.eval) |
| [VoxConverse (*v0.3*)](https://github.com/joonson/voxconverse) | 11.3 | 4.1 | 3.4 | 3.8 | [RTTM](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/VoxConverse.SpeakerDiarization.Benchmark.test.rttm) | [eval](https://huggingface.co/pyannote/speaker-diarization-3.0.0/blob/main/reproducible_research/VoxConverse.SpeakerDiarization.Benchmark.test.eval) |
					
						
## Citations

```bibtex
@inproceedings{Plaquet23,
  author={Alexis Plaquet and Hervé Bredin},
  title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}
```

```bibtex
@inproceedings{Bredin23,
  author={Hervé Bredin},
  title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}
```