---
license: mit
---
# PyAnnote Segmentation ONNX

This repository provides an ONNX version of the PyAnnote speaker segmentation model for efficient inference across various platforms.

## Overview

The PyAnnote segmentation model has been converted from PyTorch to ONNX format (pyannote-segmentation-3.onnx) for improved deployment flexibility and performance. This conversion enables running speaker diarization on platforms without PyTorch dependencies, with potentially faster inference times.

## Features

- **Platform Independence**: Run speaker diarization without PyTorch dependencies
- **Optimized Performance**: ONNX Runtime optimizations for faster inference (see the session-options sketch after this list)
- **Simple Integration**: Straightforward JavaScript/Node.js implementation included
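
ONNX Runtime exposes a few tuning knobs when the session is created. The sketch below is a minimal illustration using `onnxruntime-node` session options; the specific values are assumptions for illustration, not settings required by this model.

```javascript
import * as ort from 'onnxruntime-node';

// Minimal sketch: illustrative session options, not model requirements.
// (Top-level await requires an ES module.)
const session = await ort.InferenceSession.create('pyannote-segmentation-3.onnx', {
  graphOptimizationLevel: 'all', // apply all graph-level optimizations
  intraOpNumThreads: 4,          // parallelism within individual operators
  executionProviders: ['cpu'],   // onnxruntime-node runs on CPU by default
});
```

On CPU-only hosts the defaults are usually sensible; treat these knobs as a starting point for profiling rather than a recommended configuration.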

## Quick Start

### Installation

```bash
npm install onnxruntime-node node-fetch wav-decoder
```

### Usage Example

```javascript
import fs from 'fs/promises';
import fetch from 'node-fetch';
import wav from 'wav-decoder';
import * as ort from 'onnxruntime-node';

// Load a WAV file from a local path or URL and wrap it in a
// [1, 1, numSamples] float32 tensor, as the model expects.
async function fetchAudioAsTensor(path, sampling_rate) {
  let audioBuffer;

  // Check if path is a local file or URL
  if (path.startsWith('http')) {
    const response = await fetch(path);
    audioBuffer = await response.arrayBuffer();
  } else {
    // Read local file
    audioBuffer = await fs.readFile(path);
  }

  const decoded = await wav.decode(new Uint8Array(audioBuffer).buffer);
  if (decoded.sampleRate !== sampling_rate) {
    console.warn(`Expected ${sampling_rate} Hz audio, got ${decoded.sampleRate} Hz; results may degrade.`);
  }

  const channelData = decoded.channelData[0]; // first channel only
  return new ort.Tensor('float32', channelData, [1, 1, channelData.length]);
}

// Turn raw per-frame logits into merged, per-speaker segments.
function postProcessSpeakerDiarization(logitsData, audioLength, samplingRate) {
  const timeStep = 0.00625; // seconds per output frame
  const numClasses = 7;
  const numFrames = Math.floor(logitsData.length / numClasses);

  // Pick the highest-scoring class for each frame.
  const frames = [];
  for (let i = 0; i < numFrames; i++) {
    const frameData = Array.from(logitsData.slice(i * numClasses, (i + 1) * numClasses));
    const maxVal = Math.max(...frameData);
    const maxIndex = frameData.indexOf(maxVal);

    frames.push({
      start: i * timeStep,
      end: (i + 1) * timeStep,
      id: maxIndex,
      confidence: maxVal
    });
  }

  const mergedResults = [];
  if (frames.length === 0) return mergedResults;

  // Merge consecutive frames that share a class ID into one segment.
  let currentSegment = frames[0];
  for (let i = 1; i < frames.length; i++) {
    if (frames[i].id === currentSegment.id) {
      currentSegment.end = frames[i].end;
      // Running average; later frames weigh more heavily than earlier ones.
      currentSegment.confidence = (currentSegment.confidence + frames[i].confidence) / 2;
    } else {
      mergedResults.push({ ...currentSegment });
      currentSegment = frames[i];
    }
  }
  mergedResults.push(currentSegment);

  return mergedResults;
}

(async () => {
  const model_url = 'pyannote-segmentation-3.onnx';
  const audio_path = './mlk.wav'; // local example file
  const sampling_rate = 16000;

  const session = await ort.InferenceSession.create(model_url);
  const audioTensor = await fetchAudioAsTensor(audio_path, sampling_rate);

  const output = await session.run({ input_values: audioTensor });
  const logits = output.logits.data;

  const result = postProcessSpeakerDiarization(logits, audioTensor.dims[2], sampling_rate);

  console.table(result.map(r => ({
    start: Number(r.start.toFixed(5)),
    end: Number(r.end.toFixed(5)),
    id: r.id,
    confidence: r.confidence
  })));
})();
```

## Implementation Details

The repository includes a complete Node.js implementation for:

1. Loading audio from local files or URLs
2. Converting audio to the proper tensor format (see the mono-downmix sketch after this list)
3. Running inference with ONNX Runtime
4. Post-processing diarization results
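
The usage example above reads only the first channel of the decoded WAV. If your input may be stereo, one simple approach (sketched here; it is not part of the repository code) is to average all channels down to mono before building the tensor:

```javascript
// Sketch: average all channels of a wav-decoder result down to mono.
// `decoded` is the object returned by `wav.decode(...)` in the usage example.
function toMono(decoded) {
  const channels = decoded.channelData;
  if (channels.length === 1) return channels[0];

  const mono = new Float32Array(channels[0].length);
  for (let c = 0; c < channels.length; c++) {
    for (let i = 0; i < mono.length; i++) {
      mono[i] += channels[c][i] / channels.length;
    }
  }
  return mono;
}
```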

## Speaker ID Interpretation

The model classifies audio segments with IDs representing different speakers or audio conditions (a labeling sketch follows this list):

- **ID 0**: Primary speaker
- **ID 2**: Secondary speaker
- **ID 3**: Background noise or brief interjections
- **ID 1**: Not typically identified by the model
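
For readability, the merged segments can be tagged with the interpretations above. In this small sketch, the label strings simply restate the list, and `labelSegments` is a hypothetical helper rather than part of the repository:

```javascript
// Sketch: attach readable labels to merged segments, based on the
// ID interpretations above; unknown IDs fall back to a generic name.
const SPEAKER_LABELS = {
  0: 'primary speaker',
  2: 'secondary speaker',
  3: 'background noise / interjection',
};

function labelSegments(segments) {
  return segments.map(s => ({
    ...s,
    label: SPEAKER_LABELS[s.id] ?? `class ${s.id}`,
  }));
}
```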

## Performance Considerations

- The model processes audio with a time step of 0.00625 seconds
- Best results are achieved with 16 kHz mono WAV files
- Processing longer audio files may require batching (see the chunked-inference sketch below)
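
One simple batching strategy, sketched below under the assumption that the exported model accepts variable-length windows (verify this for your export), is to slice the waveform into fixed-size chunks, run the session on each chunk, and shift the resulting timestamps. `diarizeInChunks` is a hypothetical helper that reuses `ort`, `session`, and `postProcessSpeakerDiarization` from the usage example.

```javascript
// Sketch: run inference over fixed-length windows of a long recording.
async function diarizeInChunks(session, samples, samplingRate, chunkSeconds = 10) {
  const chunkSize = chunkSeconds * samplingRate;
  const results = [];

  for (let offset = 0; offset < samples.length; offset += chunkSize) {
    const chunk = samples.subarray(offset, Math.min(offset + chunkSize, samples.length));
    const tensor = new ort.Tensor('float32', chunk, [1, 1, chunk.length]);
    const output = await session.run({ input_values: tensor });

    const offsetSeconds = offset / samplingRate;
    for (const seg of postProcessSpeakerDiarization(output.logits.data, chunk.length, samplingRate)) {
      // Shift chunk-local times into the global timeline.
      results.push({ ...seg, start: seg.start + offsetSeconds, end: seg.end + offsetSeconds });
    }
  }
  return results;
}
```

Note that this naive windowing does not merge segments that straddle a chunk boundary; overlapping windows are a common refinement.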

## Example Results

When run against an audio file, the code outputs a table like this:

```
┌───────┬─────────┬─────────┬────┬─────────────────────┐
│ Index │  Start  │   End   │ ID │     Confidence      │
├───────┼─────────┼─────────┼────┼─────────────────────┤
│   0   │ 0.00000 │ 0.38750 │ 0  │ -0.5956847206408247 │
│   1   │ 0.38750 │ 0.87500 │ 2  │ -0.6725609518399854 │
│   2   │ 0.87500 │ 1.31875 │ 0  │ -0.6251495976493047 │
│   3   │ 1.31875 │ 1.68750 │ 2  │ -1.0951091697128392 │
│   4   │ 1.68750 │ 2.30000 │ 3  │ -1.2232454111418622 │
│   5   │ 2.30000 │ 3.19375 │ 2  │ -0.7195502450863511 │
│   6   │ 3.19375 │ 3.71250 │ 0  │ -0.6267317700475712 │
│   7   │ 3.71250 │ 4.64375 │ 2  │ -1.1656335032519587 │
│   8   │ 4.64375 │ 4.79375 │ 0  │ -1.0008199909561597 │
└───────┴─────────┴─────────┴────┴─────────────────────┘
```

Each row represents a segment with:

- `start`: Start time of the segment (seconds)
- `end`: End time of the segment (seconds)
- `id`: Speaker/class ID
- `confidence`: Model confidence score (negative values closer to 0 indicate higher confidence)

In this example, you can observe speaker transitions between speakers 0 and 2, with a brief segment of background noise (ID 3) around the 2-second mark.
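
The raw confidence scores above are the winning logits. If your export emits log-probabilities (an assumption to verify against your model, not something this README guarantees), `Math.exp` maps a score into the more intuitive (0, 1] range:

```javascript
// Sketch: convert a log-probability-style score into (0, 1].
// Only meaningful if the model's outputs really are log-probabilities.
const toProbability = (confidence) => Math.exp(confidence);

console.log(toProbability(-0.5956847206408247).toFixed(3)); // ≈ 0.551 for the first row above
```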

## Applications

This ONNX-converted model is suitable for:

- Cross-platform applications
- Edge devices with limited resources
- Server-side processing with Node.js
- Batch processing of audio files
|