bweng committed · verified
Commit f83a630 · 1 Parent(s): 3fb7321

Update README.md

Files changed (1): README.md (+107 -1)
README.md CHANGED
@@ -7,6 +7,112 @@ tags:
  - speaker-diarization
  - speaker-change-detection
  - coreml
+ base_model:
+ - pyannote/speaker-diarization-3.1
+ - pyannote/wespeaker-voxceleb-resnet34-LM
  ---

- This is still in development.
+ # Speaker Diarization CoreML Models
+
+ State-of-the-art speaker diarization models optimized for the Apple Neural Engine, powering real-time, on-device speaker separation with research-competitive accuracy.
+
+ ## Model Description
+
+ This repository contains CoreML speaker diarization models converted and optimized for Apple devices (macOS 13.0+, iOS 16.0+). They enable efficient on-device speaker diarization with minimal power consumption while maintaining state-of-the-art accuracy.
+
+ ### Key Features
+ - **Apple Neural Engine Optimized**: Runs on the ANE for maximum efficiency with no accuracy trade-off
+ - **Real-time Processing**: RTF of 0.02, i.e. about 50x faster than real time (a 60-minute recording processes in roughly 72 seconds)
+ - **Research-Competitive**: DER of 17.7% on the AMI benchmark
+ - **Power Efficient**: Designed for maximum performance per watt
+ - **Privacy-First**: All processing happens on-device
+
+ ## Intended Uses & Limitations
+
+ ### Intended Uses
+ - **Meeting Transcription**: Real-time speaker identification in meetings
+ - **Voice Assistants**: Multi-speaker conversation understanding
+ - **Media Production**: Automated speaker labeling for podcasts and interviews
+ - **Research**: Academic research in speaker diarization
+ - **Privacy-Focused Applications**: On-device processing without cloud dependencies
+
+ ### Limitations
+ - Expects 16 kHz audio input
+ - Best performance with clear audio; heavy background noise degrades accuracy
+ - May struggle with heavily overlapping speech
+ - Requires an Apple device with CoreML support
+
+ ### Technical Specifications
+ - **Input**: 16 kHz mono audio
+ - **Output**: Speaker segments with timestamps and speaker IDs
+ - **Framework**: CoreML (converted from PyTorch)
+ - **Optimization**: Apple Neural Engine (ANE) optimized operations
+ - **Precision**: FP32
+
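+ Because the models expect 16 kHz mono `Float` samples, most real-world audio needs decoding and resampling first. Here is a minimal sketch using AVFoundation's `AVAudioConverter`; the helper name `loadSamples16k` is our own illustration, not an API of this repository:
+
+ ```swift
+ import AVFoundation
+
+ // Illustrative helper: decode an audio file and resample it to 16 kHz mono Float32,
+ // the input format listed in the specifications above.
+ func loadSamples16k(from url: URL) throws -> [Float] {
+     let file = try AVAudioFile(forReading: url)
+     let outputFormat = AVAudioFormat(commonFormat: .pcmFormatFloat32,
+                                      sampleRate: 16_000, channels: 1, interleaved: false)!
+
+     // Read the whole file in its native format.
+     let inputBuffer = AVAudioPCMBuffer(pcmFormat: file.processingFormat,
+                                        frameCapacity: AVAudioFrameCount(file.length))!
+     try file.read(into: inputBuffer)
+
+     // Resample and downmix to the model's expected format.
+     let converter = AVAudioConverter(from: file.processingFormat, to: outputFormat)!
+     let ratio = outputFormat.sampleRate / file.processingFormat.sampleRate
+     let capacity = AVAudioFrameCount(Double(inputBuffer.frameLength) * ratio) + 1
+     let outputBuffer = AVAudioPCMBuffer(pcmFormat: outputFormat, frameCapacity: capacity)!
+
+     var fed = false
+     _ = converter.convert(to: outputBuffer, error: nil) { _, status in
+         if fed { status.pointee = .endOfStream; return nil }
+         fed = true
+         status.pointee = .haveData
+         return inputBuffer
+     }
+     return Array(UnsafeBufferPointer(start: outputBuffer.floatChannelData![0],
+                                      count: Int(outputBuffer.frameLength)))
+ }
+ ```
+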
+ ## Training Data
+
+ These models are converted from open-source variants trained on diverse speaker diarization datasets. The original models were trained on:
+ - Multi-speaker conversation datasets
+ - Various acoustic conditions
+ - Multiple languages and accents
+
+ *Note: Specific training data details depend on the original open-source model variant.*
+
+ ## Usage
+
+ See the [FluidAudio SDK](https://github.com/FluidInference/FluidAudio) for more details.
+
+ ### With FluidAudio SDK (Recommended)
+
+ **Installation**
+
+ Add FluidAudio to your project using Swift Package Manager:
+
+ ```swift
+ // In Package.swift
+ dependencies: [
+     .package(url: "https://github.com/FluidInference/FluidAudio.git", from: "0.0.2"),
+ ],
+ ```
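+
+ For a standalone project, a complete minimal `Package.swift` could look like the sketch below; the package, target, and product names are illustrative assumptions, not prescribed by the SDK:
+
+ ```swift
+ // swift-tools-version:5.9
+ import PackageDescription
+
+ let package = Package(
+     name: "DiarizationDemo",                    // placeholder project name
+     platforms: [.macOS(.v13), .iOS(.v16)],      // matches the models' OS requirements
+     dependencies: [
+         .package(url: "https://github.com/FluidInference/FluidAudio.git", from: "0.0.2"),
+     ],
+     targets: [
+         .executableTarget(
+             name: "DiarizationDemo",
+             // assumes the SDK exposes a "FluidAudio" library product
+             dependencies: [.product(name: "FluidAudio", package: "FluidAudio")]
+         ),
+     ]
+ )
+ ```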
+
+ Quick start:
+
+ ```swift
+ import FluidAudio
+
+ Task {
+     let diarizer = DiarizerManager()
+     try await diarizer.initialize()
+
+     let audioSamples: [Float] = []  // replace with your 16 kHz mono samples
+     let result = try await diarizer.performCompleteDiarization(
+         audioSamples,
+         sampleRate: 16000
+     )
+
+     for segment in result.segments {
+         print("Speaker \(segment.speakerId): \(segment.startTimeSeconds)s - \(segment.endTimeSeconds)s")
+     }
+ }
+ ```
+
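+ The RTF figure quoted above is simply processing time divided by audio duration, so it is easy to sanity-check. A sketch under the same assumptions as the snippet above (`diarizer` already initialized; `measureRTF` is our own name):
+
+ ```swift
+ import Foundation
+
+ // RTF = processing time / audio duration; lower is faster (0.02 is about 50x real time).
+ func measureRTF(_ samples: [Float], using diarizer: DiarizerManager) async throws -> Double {
+     let audioSeconds = Double(samples.count) / 16_000.0
+     let start = Date()
+     _ = try await diarizer.performCompleteDiarization(samples, sampleRate: 16000)
+     return Date().timeIntervalSince(start) / audioSeconds
+ }
+ ```
+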
+ ### Direct CoreML Usage
+ ```swift
+ import CoreML
+
+ // Load the model (class names are generated by Xcode from the bundled model file)
+ let model = try! SpeakerDiarizationModel(configuration: MLModelConfiguration())
+
+ // Prepare input (16 kHz mono audio)
+ let input = SpeakerDiarizationModelInput(audioSamples: audioArray)
+
+ // Run inference
+ let output = try! model.prediction(input: input)
+ ```
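+
+ If you compile the models yourself, you can also load them generically and steer execution toward the Neural Engine via `MLModelConfiguration.computeUnits`; the file name below is illustrative:
+
+ ```swift
+ import CoreML
+
+ // Prefer the Neural Engine (with CPU fallback) on supported devices.
+ let config = MLModelConfiguration()
+ config.computeUnits = .cpuAndNeuralEngine
+
+ // Illustrative path; point this at the compiled .mlmodelc in your app bundle.
+ let modelURL = URL(fileURLWithPath: "SpeakerDiarizationModel.mlmodelc")
+ let genericModel = try! MLModel(contentsOf: modelURL, configuration: config)
+ ```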
+
+ ## Acknowledgments
+
+ These CoreML models are based on excellent work from:
+
+ - **sherpa-onnx** - Foundational diarization algorithms
+ - **pyannote-audio** - State-of-the-art diarization research
+ - **wespeaker** - Speaker embedding techniques