---
license: apache-2.0
tags:
- speech
- audio
- voice
- speaker-diarization
- speaker-change-detection
- coreml
- speaker-segmentation
base_model:
- pyannote/speaker-diarization-3.1
- pyannote/wespeaker-voxceleb-resnet34-LM
pipeline_tag: voice-activity-detection
---


# **<span style="color:#5DAF8D">🧃 Speaker Diarization CoreML </span>**
[![Discord](https://img.shields.io/badge/Discord-Join%20Chat-7289da.svg)](https://discord.gg/WNsvaCtmDe)
[![GitHub Repo stars](https://img.shields.io/github/stars/FluidInference/FluidAudio?style=flat&logo=github)](https://github.com/FluidInference/FluidAudio)

State-of-the-art speaker diarization models optimized for the Apple Neural Engine, enabling real-time, on-device speaker separation with research-competitive accuracy.

The models support any language: they operate on acoustic signatures rather than linguistic content.

## Model Description

This repository contains CoreML-optimized speaker diarization models specifically converted and optimized for Apple devices (macOS 13.0+, iOS 16.0+). These models enable efficient on-device speaker diarization with minimal power consumption while maintaining state-of-the-art accuracy.

## Usage

See the SDK for more details [https://github.com/FluidInference/FluidAudio](https://github.com/FluidInference/FluidAudio)

### With FluidAudio SDK (Recommended)

**Installation**: Add FluidAudio to your project using Swift Package Manager:

```swift
dependencies: [
    .package(url: "https://github.com/FluidInference/FluidAudio.git", from: "0.0.2"),
],
```
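In a `Package.swift` manifest the dependency also needs to be attached to a target. A minimal sketch; the product name `"FluidAudio"` is an assumption, so verify it against the package's own manifest:

```swift
// Target wiring (sketch). The product name "FluidAudio" is assumed to
// match the package name; check FluidAudio's Package.swift if it differs.
.target(
    name: "MyApp",
    dependencies: [
        .product(name: "FluidAudio", package: "FluidAudio")
    ]
),
```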

```swift
import FluidAudio

Task {
    // Download (if needed) and load the CoreML models
    let diarizer = DiarizerManager()
    try await diarizer.initialize()

    // Replace with your own 16 kHz mono samples
    let audioSamples: [Float] = []
    let result = try await diarizer.performCompleteDiarization(
        audioSamples,
        sampleRate: 16000
    )

    for segment in result.segments {
        print("Speaker \(segment.speakerId): \(segment.startTimeSeconds)s - \(segment.endTimeSeconds)s")
    }
}
```

### Direct CoreML Usage
```swift
import CoreML

// Load the model (class generated by Xcode from the bundled model file)
let model = try! SpeakerDiarizationModel(configuration: MLModelConfiguration())

// Prepare input (16 kHz mono audio samples)
let input = SpeakerDiarizationModelInput(audioSamples: audioSamples)

// Run inference
let output = try! model.prediction(input: input)
```
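Because the models are ANE-optimized, you can ask CoreML to prefer the Neural Engine via `MLModelConfiguration`. A minimal sketch using the standard CoreML API; the model class name follows the example above:

```swift
import CoreML

// Prefer the Neural Engine (with CPU fallback) on macOS 13+ / iOS 16+.
// .cpuAndNeuralEngine skips the GPU; use .all to let CoreML choose freely.
let config = MLModelConfiguration()
config.computeUnits = .cpuAndNeuralEngine

let model = try SpeakerDiarizationModel(configuration: config)
```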


## Acknowledgments
These CoreML models are based on excellent work from:

- **sherpa-onnx**: foundational diarization algorithms
- **pyannote-audio**: state-of-the-art diarization research
- **wespeaker**: speaker embedding techniques


### Key Features
- **Apple Neural Engine Optimized**: compute-heavy operations are mapped to the ANE, so the efficiency gains come without an accuracy trade-off
- **Real-time Processing**: RTF of 0.02x, i.e. roughly 50x faster than real-time (see the measurement sketch below)
- **Research-Competitive**: DER of 17.7% on AMI benchmark
- **Power Efficient**: Designed for maximum performance per watt
- **Privacy-First**: All processing happens on-device
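
The real-time factor (RTF) is processing time divided by audio duration, so an RTF of 0.02 means a 60-second recording is diarized in about 1.2 seconds. A minimal sketch for measuring it yourself, reusing only the FluidAudio calls shown above plus standard Foundation timing:

```swift
import Foundation
import FluidAudio

// RTF = processing time / audio duration; RTF < 1 is faster than real time.
func measureRTF(samples: [Float], sampleRate: Int = 16000) async throws -> Double {
    let diarizer = DiarizerManager()
    try await diarizer.initialize()

    let start = Date()
    _ = try await diarizer.performCompleteDiarization(samples, sampleRate: sampleRate)
    let processingTime = Date().timeIntervalSince(start)

    let audioDuration = Double(samples.count) / Double(sampleRate)
    return processingTime / audioDuration  // e.g. 0.02 => 50x real time
}
```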


## Intended Uses & Limitations

### Intended Uses
- **Meeting Transcription**: Real-time speaker identification in meetings
- **Voice Assistants**: Multi-speaker conversation understanding
- **Media Production**: Automated speaker labeling for podcasts/interviews
- **Research**: Academic research in speaker diarization
- **Privacy-Focused Applications**: On-device processing without cloud dependencies

### Limitations
- Optimized for 16kHz audio input
- Best performance with clear audio (no heavy background noise)
- May struggle with heavily overlapping speech
- Requires Apple devices with CoreML support

## Technical Specifications
- **Input**: 16kHz mono audio
- **Output**: Speaker segments with timestamps and IDs
- **Framework**: CoreML (converted from PyTorch)
- **Optimization**: Apple Neural Engine (ANE) optimized operations
- **Precision**: FP32 on CPU/GPU, FP16 on ANE
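
Capture pipelines commonly deliver 44.1 kHz or 48 kHz audio, so convert to 16 kHz mono before inference. A minimal sketch using AVFoundation's `AVAudioConverter`; the function name and error handling are illustrative:

```swift
import AVFoundation

// Convert an arbitrary PCM buffer to the 16 kHz mono Float32 samples
// these models expect. Names here are illustrative.
func resampleTo16kMono(_ input: AVAudioPCMBuffer) throws -> [Float] {
    let outFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32,
                                  sampleRate: 16_000,
                                  channels: 1,
                                  interleaved: false)!
    guard let converter = AVAudioConverter(from: input.format, to: outFormat) else {
        throw NSError(domain: "Resample", code: -1)
    }

    let ratio = 16_000 / input.format.sampleRate
    let capacity = AVAudioFrameCount(Double(input.frameLength) * ratio) + 1
    let output = AVAudioPCMBuffer(pcmFormat: outFormat, frameCapacity: capacity)!

    // Feed the single input buffer once, then signal end of stream.
    var fed = false
    converter.convert(to: output, error: nil) { _, status in
        if fed { status.pointee = .endOfStream; return nil }
        fed = true
        status.pointee = .haveData
        return input
    }
    return Array(UnsafeBufferPointer(start: output.floatChannelData![0],
                                     count: Int(output.frameLength)))
}
```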

## Training Data

These models are converted from open-source variants trained on diverse speaker diarization datasets. The original models were trained on:
- Multi-speaker conversation datasets
- Various acoustic conditions
- Multiple languages and accents

*Note: Specific training data details depend on the original open-source model variant.*