# PyAnnote Segmentation ONNX
This repository provides an ONNX version of the PyAnnote speaker segmentation model for efficient inference across various platforms.
## Overview
The PyAnnote segmentation model has been converted from PyTorch to ONNX format (`pyannote-segmentation-3.onnx`) for improved deployment flexibility and performance. The conversion makes it possible to run speaker diarization on platforms without a PyTorch dependency, potentially with faster inference.
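Before wiring up a full pipeline, it can be useful to confirm the exported graph's input and output names with onnxruntime-node. A minimal sketch, assuming the model file sits in the working directory; the `input_values`/`logits` names match the usage example below:
```javascript
import * as ort from 'onnxruntime-node';

// Load the exported model and print its I/O signature so you can
// confirm the tensor names before running real audio through it.
const session = await ort.InferenceSession.create('pyannote-segmentation-3.onnx');
console.log('inputs:', session.inputNames);   // e.g. [ 'input_values' ]
console.log('outputs:', session.outputNames); // e.g. [ 'logits' ]
```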
## Features
- **Platform Independence**: Run speaker diarization without PyTorch dependencies
- **Optimized Performance**: ONNX runtime optimizations for faster inference
- **Simple Integration**: Straightforward JavaScript/Node.js implementation included
## Quick Start
### Installation
```bash
npm install onnxruntime-node node-fetch wav-decoder
```
### Usage Example
```javascript
import fs from 'fs/promises';
import fetch from 'node-fetch';
import wav from 'wav-decoder';
import * as ort from 'onnxruntime-node';

// Load a WAV file (local path or URL) and wrap its first channel
// in a [1, 1, numSamples] float32 tensor for the model.
async function fetchAudioAsTensor(path, sampling_rate) {
  let audioBuffer;
  if (path.startsWith('http')) {
    const response = await fetch(path);
    audioBuffer = await response.arrayBuffer();
  } else {
    audioBuffer = await fs.readFile(path);
  }
  const decoded = await wav.decode(new Uint8Array(audioBuffer).buffer);
  if (decoded.sampleRate !== sampling_rate) {
    throw new Error(`Expected ${sampling_rate} Hz audio, got ${decoded.sampleRate} Hz`);
  }
  const channelData = decoded.channelData[0]; // first (mono) channel
  return new ort.Tensor('float32', channelData, [1, 1, channelData.length]);
}

// Turn raw frame logits into merged segments: pick the highest-scoring
// class per frame, then merge consecutive frames with the same class.
function postProcessSpeakerDiarization(logitsData) {
  const timeStep = 0.00625; // seconds covered by one output frame
  const numClasses = 7;
  const numFrames = Math.floor(logitsData.length / numClasses);
  const frames = [];
  for (let i = 0; i < numFrames; i++) {
    const frameData = Array.from(logitsData.slice(i * numClasses, (i + 1) * numClasses));
    const maxVal = Math.max(...frameData);
    frames.push({
      start: i * timeStep,
      end: (i + 1) * timeStep,
      id: frameData.indexOf(maxVal),
      confidence: maxVal
    });
  }
  // Merge runs of consecutive frames that share the same class ID.
  const mergedResults = [];
  let currentSegment = { ...frames[0] };
  for (let i = 1; i < frames.length; i++) {
    if (frames[i].id === currentSegment.id) {
      currentSegment.end = frames[i].end;
      currentSegment.confidence = (currentSegment.confidence + frames[i].confidence) / 2;
    } else {
      mergedResults.push(currentSegment);
      currentSegment = { ...frames[i] };
    }
  }
  mergedResults.push(currentSegment);
  return mergedResults;
}

(async () => {
  const model_path = 'pyannote-segmentation-3.onnx';
  const audio_path = './mlk.wav';
  const sampling_rate = 16000;

  const session = await ort.InferenceSession.create(model_path);
  const audioTensor = await fetchAudioAsTensor(audio_path, sampling_rate);
  const output = await session.run({ input_values: audioTensor });
  const result = postProcessSpeakerDiarization(output.logits.data);

  console.table(result.map(r => ({
    start: Number(r.start.toFixed(5)),
    end: Number(r.end.toFixed(5)),
    id: r.id,
    confidence: r.confidence
  })));
})();
```
## Implementation Details
The repository includes a complete Node.js implementation for:
1. Loading audio from local files or URLs
2. Converting audio to the proper tensor format
3. Running inference with ONNX Runtime
4. Post-processing diarization results
## Speaker ID Interpretation
The model classifies audio segments with IDs representing different speakers or audio conditions:
- **ID 0**: Primary speaker
- **ID 1**: Not typically produced by the model
- **ID 2**: Secondary speaker
- **ID 3**: Background noise or brief interjections
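If you want human-readable labels in the output table, one option is to map these IDs to strings before printing. A small sketch; the label names are illustrative, not something the model emits:
```javascript
// Illustrative ID-to-label mapping based on the interpretation above.
const ID_LABELS = {
  0: 'primary speaker',
  1: 'unused',
  2: 'secondary speaker',
  3: 'background noise'
};

// Attach a label to each merged segment (falls back to the raw ID).
function labelSegments(segments) {
  return segments.map(s => ({ ...s, label: ID_LABELS[s.id] ?? `class ${s.id}` }));
}
```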
## Performance Considerations
- The model processes audio with a time step of 0.00625 seconds
- Best results are achieved with 16kHz mono WAV files
- Processing longer audio files may require batching, as in the sketch below
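A straightforward way to batch a long recording is to slice the waveform into fixed-length windows and run the model once per window. A rough sketch, assuming the model accepts variable-length input as in the usage example above; the 10-second window is an illustrative choice, not a model requirement:
```javascript
import * as ort from 'onnxruntime-node';

// Run inference window-by-window over a long Float32Array waveform.
// windowSec is an illustrative choice; use a smaller step for overlap.
async function runInWindows(session, samples, samplingRate, windowSec = 10) {
  const win = windowSec * samplingRate;
  const results = [];
  for (let offset = 0; offset < samples.length; offset += win) {
    const chunk = samples.subarray(offset, Math.min(offset + win, samples.length));
    const tensor = new ort.Tensor('float32', chunk, [1, 1, chunk.length]);
    const { logits } = await session.run({ input_values: tensor });
    // Remember where this window started so segment times can be shifted.
    results.push({ offsetSec: offset / samplingRate, logits: logits.data });
  }
  return results;
}
```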
## Example Results
When run against an audio file, the code outputs a table like this:
```
┌───────┬─────────┬─────────┬────┬─────────────────────┐
│ Index │ Start   │ End     │ ID │ Confidence          │
├───────┼─────────┼─────────┼────┼─────────────────────┤
│ 0     │ 0.00000 │ 0.38750 │ 0  │ -0.5956847206408247 │
│ 1     │ 0.38750 │ 0.87500 │ 2  │ -0.6725609518399854 │
│ 2     │ 0.87500 │ 1.31875 │ 0  │ -0.6251495976493047 │
│ 3     │ 1.31875 │ 1.68750 │ 2  │ -1.0951091697128392 │
│ 4     │ 1.68750 │ 2.30000 │ 3  │ -1.2232454111418622 │
│ 5     │ 2.30000 │ 3.19375 │ 2  │ -0.7195502450863511 │
│ 6     │ 3.19375 │ 3.71250 │ 0  │ -0.6267317700475712 │
│ 7     │ 3.71250 │ 4.64375 │ 2  │ -1.1656335032519587 │
│ 8     │ 4.64375 │ 4.79375 │ 0  │ -1.0008199909561597 │
└───────┴─────────┴─────────┴────┴─────────────────────┘
```
Each row represents a segment with:
- `start`: Start time of segment (seconds)
- `end`: End time of segment (seconds)
- `id`: Speaker/class ID
- `confidence`: Raw model logit for the winning class (not a probability; scores closer to 0 indicate higher confidence)

In this example, you can observe transitions between speakers 0 and 2, with a brief segment of background noise (ID 3) around the 2-second mark.
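Because the scores are raw logits, you can convert a frame's seven class scores into probabilities with a softmax if you prefer confidences in [0, 1]. A small helper sketch:
```javascript
// Softmax over one frame of class logits, yielding probabilities
// that sum to 1. Subtracting the max keeps exp() numerically stable.
function softmax(frame) {
  const max = Math.max(...frame);
  const exps = frame.map(v => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(v => v / sum);
}

// Example: probabilities for the first frame's 7 classes.
// const probs = softmax(Array.from(logitsData.slice(0, 7)));
```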
## Applications
This ONNX-converted model is suitable for:
- Cross-platform applications
- Edge devices with limited resources
- Server-side processing with Node.js
- Batch processing of audio files