Simply detect, segment, label, and separate speakers in any language
pyannoteAI makes it easier to understand speakers and conversation context. We focus on identifying speakers and conversation metadata under conditions that reflect real conversations rather than controlled recordings.
What is speaker diarization?
Speaker diarization is the process of automatically partitioning the audio recording of a conversation into segments and labeling them by speaker, answering the question "who spoke when?". As the foundational layer of conversational AI, speaker diarization provides high-level insights for human-human and human-machine conversations, and unlocks a wide range of downstream applications: meeting transcription, call center analytics, voice agents, video dubbing.
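Conceptually, the result is a list of speaker-labeled time segments. The sketch below (with made-up timestamps and generic speaker labels) illustrates the kind of answer diarization produces:

```python
# illustration only: a diarization result boils down to speaker-labeled time segments
segments = [
    (0.0, 3.2, "SPEAKER_00"),   # start (s), end (s), anonymous speaker label
    (3.4, 7.9, "SPEAKER_01"),
    (8.1, 9.0, "SPEAKER_00"),
]

for start, end, speaker in segments:
    print(f"{speaker} speaks between t={start}s and t={end}s")
```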
Getting started
Install the latest pyannote.audio release from PyPI with either uv (recommended) or pip:
$ uv add pyannote.audio
$ pip install pyannote.audio
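A quick sanity check (not part of the official instructions) is to confirm the package imports and print the installed version:

```python
# quick sanity check: confirm pyannote.audio is installed and importable
from importlib.metadata import version
import pyannote.audio  # noqa: F401
print(version("pyannote.audio"))
```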
Enjoy state-of-the-art speaker diarization:
# download pretrained pipeline from Hugging Face
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained('pyannote/speaker-diarization-community-1', token="HUGGINGFACE_TOKEN")

# perform speaker diarization locally
output = pipeline('/path/to/audio.wav')

# print who speaks when
for turn, speaker in output.speaker_diarization:
    print(f"{speaker} speaks between t={turn.start}s and t={turn.end}s")
Read the community-1 model card to make the most of it.
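If you already know how many people are speaking, or want to run on GPU, the pipeline accepts a few extra knobs. The sketch below follows the pyannote.audio API as documented for earlier releases (num_speakers, min_speakers, max_speakers, and moving the pipeline to a torch device); check the community-1 model card for the exact options it supports.

```python
import torch

# run the pipeline on GPU when available
pipeline.to(torch.device("cuda"))

# hint the exact number of speakers...
output = pipeline('/path/to/audio.wav', num_speakers=2)

# ...or a plausible range
output = pipeline('/path/to/audio.wav', min_speakers=2, max_speakers=5)
```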
State-of-the-art models
The pyannoteAI research team trains cutting-edge speaker diarization models, thanks to the Jean Zay 🇫🇷 supercomputer managed by GENCI. They come in two flavors:
- open pyannote.audio models, available on Hugging Face and used by 140k+ developers around the world;
- premium models, available on the pyannoteAI cloud (and on-premise for enterprise customers), that provide state-of-the-art speaker diarization as well as additional enterprise features.
| Benchmark (last updated in 2025-09) | legacy (3.1) | community-1 | precision-2 |
|---|---|---|---|
| AISHELL-4 | 12.2 | 11.7 | 11.4 |
| AliMeeting (channel 1) | 24.5 | 20.3 | 15.2 |
| AMI (IHM) | 18.8 | 17.0 | 12.9 |
| AMI (SDM) | 22.7 | 19.9 | 15.6 |
| AVA-AVD | 49.7 | 44.6 | 37.1 |
| CALLHOME (part 2) | 28.5 | 26.7 | 16.6 |
| DIHARD 3 (full) | 21.4 | 20.2 | 14.7 |
| Ego4D (dev.) | 51.2 | 46.8 | 39.0 |
| MSDWild | 25.4 | 22.8 | 17.3 |
| RAMC | 22.2 | 20.8 | 10.5 |
| REPERE (phase 2) | 7.9 | 8.9 | 7.4 |
| VoxConverse (v0.3) | 11.2 | 11.2 | 8.5 |
Diarization error rate (in %; the lower, the better)
Our models achieve competitive performance across multiple public diarization datasets. Explore the full pyannoteAI performance benchmark at https://www.pyannote.ai/benchmark
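Diarization error rate sums false alarm, missed detection, and speaker confusion, divided by the total duration of reference speech. To reproduce this kind of measurement on your own annotated data, you can use the companion pyannote.metrics package; a minimal sketch with hand-crafted annotations:

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# hand-crafted reference ("who really spoke when") and hypothesis, for illustration only
reference = Annotation()
reference[Segment(0.0, 5.0)] = "alice"
reference[Segment(5.0, 9.0)] = "bob"

hypothesis = Annotation()
hypothesis[Segment(0.0, 4.5)] = "SPEAKER_00"
hypothesis[Segment(4.5, 9.0)] = "SPEAKER_01"

# DER = (false alarm + missed detection + confusion) / total reference speech
metric = DiarizationErrorRate()
print(f"DER = {100 * metric(reference, hypothesis):.1f}%")
```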
Going further, better, and faster
The precision-2 premium model further improves accuracy and processing speed, and brings additional features.
| Features | community-1 | precision-2 |
|---|---|---|
| Set exact/min/max number of speakers | ✅ | ✅ |
| Exclusive speaker diarization (for transcription) | ✅ | ✅ |
| Segmentation confidence scores | ❌ | ✅ |
| Speaker confidence scores | ❌ | ✅ |
| Voiceprinting | ❌ | ✅ |
| Speaker identification | ❌ | ✅ |
| STT Orchestration | ❌ | ✅ |
| Time to process 1h of audio (on H100) | 37s | 14s |
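Exclusive speaker diarization removes overlapping-speech ambiguity by assigning every time frame to at most one speaker, which makes the output much easier to align with word-level STT timestamps. A hedged sketch, assuming the community-1 output exposes it as an exclusive_speaker_diarization attribute alongside speaker_diarization (see the model card for the exact interface):

```python
# reuse `output` from the community-1 pipeline above;
# exclusive diarization assigns each time frame to at most one speaker,
# which simplifies alignment with word-level transcription timestamps
for turn, speaker in output.exclusive_speaker_diarization:
    print(f"{speaker}: {turn.start:.1f}s -> {turn.end:.1f}s")
```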
Create a pyannoteAI account, change one line of code, and enjoy free cloud credits to try precision-2 premium diarization:
# perform premium speaker diarization on pyannoteAI cloud
pipeline = Pipeline.from_pretrained('pyannote/speaker-diarization-precision-2', token="PYANNOTEAI_API_KEY")
better_output = pipeline('/path/to/audio.wav')
Get speaker-attributed transcripts
We host open-source transcription models such as NVIDIA parakeet-tdt-0.6b-v3 and OpenAI whisper-large-v3-turbo, combined with specialized STT + diarization reconciliation logic to produce speaker-attributed transcripts.
STT orchestration combines pyannoteAI precision-2 diarization with these transcription services: instead of running diarization and transcription separately and reconciling the outputs manually, you make one API call and receive a speaker-attributed transcript.
To use this feature, make a request to the diarize API endpoint with the transcription: true flag.
# pip install pyannoteai-sdk
from pyannoteai.sdk import Client

client = Client("your-api-key")

# submit the audio URL and request diarization + transcription in a single call
job_id = client.diarize(
    "https://www.example/audio.wav",
    transcription=True)

# retrieve the job output
job_output = client.retrieve(job_id)

# word-level transcription, with speaker labels
for word in job_output['output']['wordLevelTranscription']:
    print(word['start'], word['end'], word['speaker'], word['text'])

# turn-level transcription, with speaker labels
for turn in job_output['output']['turnLevelTranscription']:
    print(turn['start'], turn['end'], turn['speaker'], turn['text'])
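From there, producing a readable, speaker-attributed transcript only requires the fields shown above; for instance:

```python
# format the turn-level transcription as a simple speaker-attributed transcript,
# using only the start/end/speaker/text fields returned above
def format_transcript(turns):
    lines = []
    for turn in turns:
        lines.append(f"[{turn['start']} - {turn['end']}] {turn['speaker']}: {turn['text']}")
    return "\n".join(lines)

print(format_transcript(job_output['output']['turnLevelTranscription']))
```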