PyAnnote Segmentation ONNX
This repository provides an ONNX version of the PyAnnote speaker segmentation model for efficient inference across various platforms.
Overview
The PyAnnote segmentation model has been converted from PyTorch to ONNX format (pyannote-segmentation-3.onnx) for improved deployment flexibility and performance. This conversion enables running speaker diarization on platforms without PyTorch dependencies, with potentially faster inference times.
Features
- Platform Independence: Run speaker diarization without PyTorch dependencies
- Optimized Performance: ONNX runtime optimizations for faster inference
- Simple Integration: Straightforward JavaScript/Node.js implementation included
Quick Start
Installation
npm install onnxruntime-node node-fetch wav-decoder
Usage Example
import fs from 'fs/promises';
import fetch from 'node-fetch';
import wav from 'wav-decoder';
import * as ort from 'onnxruntime-node';

async function fetchAudioAsTensor(path, sampling_rate) {
  let audioBuffer;
  // Load the raw WAV bytes from a URL or a local file
  if (path.startsWith('http')) {
    const response = await fetch(path);
    audioBuffer = await response.arrayBuffer();
  } else {
    audioBuffer = await fs.readFile(path);
  }
  const decoded = await wav.decode(new Uint8Array(audioBuffer).buffer);
  if (decoded.sampleRate !== sampling_rate) {
    throw new Error(`Expected ${sampling_rate} Hz audio but got ${decoded.sampleRate} Hz; resample the file first`);
  }
  // Use the first channel only; the model expects mono input
  const channelData = decoded.channelData[0];
  // Tensor shape: [batch, channel, samples]
  return new ort.Tensor('float32', channelData, [1, 1, channelData.length]);
}
function postProcessSpeakerDiarization(logitsData, audioLength, samplingRate) {
  const timeStep = 0.00625; // seconds of audio covered by each output frame
  const numClasses = 7;     // number of output classes per frame

  // Pick the highest-scoring class for each frame
  const numFrames = Math.floor(logitsData.length / numClasses);
  const frames = [];
  for (let i = 0; i < numFrames; i++) {
    const frameData = Array.from(logitsData.slice(i * numClasses, (i + 1) * numClasses));
    const maxVal = Math.max(...frameData);
    const maxIndex = frameData.indexOf(maxVal);
    frames.push({
      start: i * timeStep,
      end: (i + 1) * timeStep,
      id: maxIndex,
      confidence: maxVal
    });
  }
  if (frames.length === 0) return [];

  // Merge consecutive frames that share the same class ID
  const mergedResults = [];
  let currentSegment = frames[0];
  for (let i = 1; i < frames.length; i++) {
    if (frames[i].id === currentSegment.id) {
      currentSegment.end = frames[i].end;
      // Running pairwise average keeps a rough per-segment confidence
      currentSegment.confidence = (currentSegment.confidence + frames[i].confidence) / 2;
    } else {
      mergedResults.push({ ...currentSegment });
      currentSegment = frames[i];
    }
  }
  mergedResults.push(currentSegment);
  return mergedResults;
}
(async () => {
  const model_path = 'pyannote-segmentation-3.onnx'; // local path to the ONNX model
  const audio_path = './mlk.wav';
  const sampling_rate = 16000;

  const session = await ort.InferenceSession.create(model_path);
  const audioTensor = await fetchAudioAsTensor(audio_path, sampling_rate);

  // Run inference; the model takes a single 'input_values' tensor
  const output = await session.run({ input_values: audioTensor });
  const logits = output.logits.data;

  const result = postProcessSpeakerDiarization(logits, audioTensor.dims[2], sampling_rate);
  console.table(result.map(r => ({
    start: Number(r.start.toFixed(5)),
    end: Number(r.end.toFixed(5)),
    id: r.id,
    confidence: r.confidence
  })));
})();
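To try the example end to end, save the script above next to the model file and a 16 kHz mono WAV, then run it with Node.js (the index.mjs filename is an assumption; any ES-module-aware entry point works):

node index.mjs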
Implementation Details
The repository includes a complete Node.js implementation for:
- Loading audio from local files or URLs
- Converting audio to the proper tensor format
- Running inference with ONNX Runtime
- Post-processing diarization results
Speaker ID Interpretation
The model classifies audio segments with IDs representing different speakers or audio conditions (a small labeling helper is sketched after this list):
- ID 0: Primary speaker
- ID 1: Not typically identified by the model
- ID 2: Secondary speaker
- ID 3: Background noise or brief interjections
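For display purposes, the raw IDs can be mapped to human-readable labels. A minimal sketch; the SPEAKER_LABELS map and its label strings are illustrative assumptions, not metadata shipped with the model:

// Illustrative label map based on the ID interpretation above
const SPEAKER_LABELS = {
  0: 'speaker_0',
  2: 'speaker_1',
  3: 'noise/interjection'
};

function labelSegment(segment) {
  // Fall back to a generic class label for IDs the map does not cover
  return { ...segment, label: SPEAKER_LABELS[segment.id] ?? `class_${segment.id}` };
}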
Performance Considerations
- The model processes audio with a time step of 0.00625 seconds
- Best results are achieved with 16kHz mono WAV files
- Processing longer audio files may require splitting them into chunks and running inference per chunk (see the sketch following this list)
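A minimal chunking sketch that reuses the functions defined earlier; the 10-second window, the chunkAudio and diarizeInChunks helper names, and the per-chunk offset bookkeeping are illustrative choices, not something this repository prescribes:

// Split a mono Float32Array into fixed-length windows (illustrative helper)
function* chunkAudio(channelData, samplingRate, windowSeconds = 10) {
  const windowSize = windowSeconds * samplingRate;
  for (let offset = 0; offset < channelData.length; offset += windowSize) {
    yield {
      startTime: offset / samplingRate, // seconds into the original file
      samples: channelData.slice(offset, offset + windowSize)
    };
  }
}

// Each chunk is wrapped in its own [1, 1, n] tensor and run separately;
// segment timestamps are shifted by the chunk's startTime afterwards.
// Note: segments are not merged across chunk boundaries in this sketch.
async function diarizeInChunks(session, channelData, samplingRate) {
  const segments = [];
  for (const chunk of chunkAudio(channelData, samplingRate)) {
    const tensor = new ort.Tensor('float32', chunk.samples, [1, 1, chunk.samples.length]);
    const output = await session.run({ input_values: tensor });
    const chunkSegments = postProcessSpeakerDiarization(output.logits.data, chunk.samples.length, samplingRate);
    for (const s of chunkSegments) {
      segments.push({ ...s, start: s.start + chunk.startTime, end: s.end + chunk.startTime });
    }
  }
  return segments;
}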
Example Results
When run against an audio file, the code outputs a table like this:
┌───────┬─────────┬─────────┬────┬─────────────────────┐
│ Index │ Start   │ End     │ ID │ Confidence          │
├───────┼─────────┼─────────┼────┼─────────────────────┤
│ 0     │ 0.00000 │ 0.38750 │ 0  │ -0.5956847206408247 │
│ 1     │ 0.38750 │ 0.87500 │ 2  │ -0.6725609518399854 │
│ 2     │ 0.87500 │ 1.31875 │ 0  │ -0.6251495976493047 │
│ 3     │ 1.31875 │ 1.68750 │ 2  │ -1.0951091697128392 │
│ 4     │ 1.68750 │ 2.30000 │ 3  │ -1.2232454111418622 │
│ 5     │ 2.30000 │ 3.19375 │ 2  │ -0.7195502450863511 │
│ 6     │ 3.19375 │ 3.71250 │ 0  │ -0.6267317700475712 │
│ 7     │ 3.71250 │ 4.64375 │ 2  │ -1.1656335032519587 │
│ 8     │ 4.64375 │ 4.79375 │ 0  │ -1.0008199909561597 │
└───────┴─────────┴─────────┴────┴─────────────────────┘
Each row represents a segment with:
- start: Start time of the segment (seconds)
- end: End time of the segment (seconds)
- id: Speaker/class ID
- confidence: Model confidence score (negative values closer to 0 indicate higher confidence)
In this example, you can observe speaker transitions between speakers 0 and 2, with a brief segment of background noise (ID 3) around the 2-second mark.
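The confidence column holds the raw winning-class score. If the model's final layer is a log-softmax over the 7 classes, as the all-negative scores suggest (an assumption worth confirming against the original pyannote implementation), per-frame probabilities can be recovered with exp:

// Hypothetical helper; assumes the 7 per-frame scores are log-probabilities
function frameProbabilities(frameScores) {
  return frameScores.map(v => Math.exp(v)); // exp(log p) = p
}

// e.g. exp(-0.5957) ≈ 0.55, i.e. roughly 55% probability for the winning class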
Applications
This ONNX-converted model is suitable for:
- Cross-platform applications
- Edge devices with limited resources
- Server-side processing with Node.js
- Batch processing of audio files