# PyAnnote Segmentation ONNX
This repository provides an ONNX version of the PyAnnote speaker segmentation model for efficient inference across various platforms.
## Overview
The PyAnnote segmentation model has been converted from PyTorch to ONNX format (`pyannote-segmentation-3.onnx`) for improved deployment flexibility and performance. The conversion makes it possible to run speaker diarization on platforms without a PyTorch dependency, potentially with faster inference.
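Before wiring up a full pipeline, it can be useful to confirm the exported graph's input and output names with onnxruntime-node. A minimal sketch, assuming the model file sits in the working directory; the `input_values`/`logits` names match the usage example below:
```javascript
import * as ort from 'onnxruntime-node';

// Load the exported model and print its I/O signature so you can
// confirm the tensor names before running real audio through it.
const session = await ort.InferenceSession.create('pyannote-segmentation-3.onnx');
console.log('inputs:', session.inputNames);   // e.g. [ 'input_values' ]
console.log('outputs:', session.outputNames); // e.g. [ 'logits' ]
```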
## Features
- **Platform Independence**: Run speaker diarization without PyTorch dependencies
- **Optimized Performance**: ONNX runtime optimizations for faster inference
- **Simple Integration**: Straightforward JavaScript/Node.js implementation included
## Quick Start
### Installation
```bash
npm install onnxruntime-node node-fetch wav-decoder
```
### Usage Example
```javascript
import fs from 'fs/promises';
import fetch from 'node-fetch';
import wav from 'wav-decoder';
import * as ort from 'onnxruntime-node';

// Load a WAV file (local path or URL) and wrap its first channel
// in a [1, 1, numSamples] float32 tensor for the model.
async function fetchAudioAsTensor(path, sampling_rate) {
  let audioBuffer;
  if (path.startsWith('http')) {
    const response = await fetch(path);
    audioBuffer = await response.arrayBuffer();
  } else {
    audioBuffer = await fs.readFile(path);
  }
  const decoded = await wav.decode(new Uint8Array(audioBuffer).buffer);
  if (decoded.sampleRate !== sampling_rate) {
    throw new Error(`Expected ${sampling_rate} Hz audio, got ${decoded.sampleRate} Hz`);
  }
  const channelData = decoded.channelData[0]; // first (mono) channel
  return new ort.Tensor('float32', channelData, [1, 1, channelData.length]);
}

// Turn raw frame logits into merged segments: pick the highest-scoring
// class per frame, then merge consecutive frames with the same class.
function postProcessSpeakerDiarization(logitsData) {
  const timeStep = 0.00625; // seconds covered by one output frame
  const numClasses = 7;
  const numFrames = Math.floor(logitsData.length / numClasses);
  const frames = [];
  for (let i = 0; i < numFrames; i++) {
    const frameData = Array.from(logitsData.slice(i * numClasses, (i + 1) * numClasses));
    const maxVal = Math.max(...frameData);
    frames.push({
      start: i * timeStep,
      end: (i + 1) * timeStep,
      id: frameData.indexOf(maxVal),
      confidence: maxVal
    });
  }
  // Merge runs of consecutive frames that share the same class ID.
  const mergedResults = [];
  let currentSegment = { ...frames[0] };
  for (let i = 1; i < frames.length; i++) {
    if (frames[i].id === currentSegment.id) {
      currentSegment.end = frames[i].end;
      currentSegment.confidence = (currentSegment.confidence + frames[i].confidence) / 2;
    } else {
      mergedResults.push(currentSegment);
      currentSegment = { ...frames[i] };
    }
  }
  mergedResults.push(currentSegment);
  return mergedResults;
}

(async () => {
  const model_path = 'pyannote-segmentation-3.onnx';
  const audio_path = './mlk.wav';
  const sampling_rate = 16000;

  const session = await ort.InferenceSession.create(model_path);
  const audioTensor = await fetchAudioAsTensor(audio_path, sampling_rate);
  const output = await session.run({ input_values: audioTensor });
  const result = postProcessSpeakerDiarization(output.logits.data);

  console.table(result.map(r => ({
    start: Number(r.start.toFixed(5)),
    end: Number(r.end.toFixed(5)),
    id: r.id,
    confidence: r.confidence
  })));
})();
```
## Implementation Details
The repository includes a complete Node.js implementation for:
1. Loading audio from local files or URLs
2. Converting audio to the proper tensor format
3. Running inference with ONNX Runtime
4. Post-processing diarization results
## Speaker ID Interpretation
The model classifies audio segments with IDs representing different speakers or audio conditions:
- **ID 0**: Primary speaker
- **ID 1**: Not typically produced by the model
- **ID 2**: Secondary speaker
- **ID 3**: Background noise or brief interjections
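If you want human-readable labels in the output table, one option is to map these IDs to strings before printing. A small sketch; the label names are illustrative, not something the model emits:
```javascript
// Illustrative ID-to-label mapping based on the interpretation above.
const ID_LABELS = {
  0: 'primary speaker',
  1: 'unused',
  2: 'secondary speaker',
  3: 'background noise'
};

// Attach a label to each merged segment (falls back to the raw ID).
function labelSegments(segments) {
  return segments.map(s => ({ ...s, label: ID_LABELS[s.id] ?? `class ${s.id}` }));
}
```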
## Performance Considerations
- The model processes audio with a time step of 0.00625 seconds
- Best results are achieved with 16kHz mono WAV files
- Processing longer audio files may require batching, as in the sketch below
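A straightforward way to batch a long recording is to slice the waveform into fixed-length windows and run the model once per window. A rough sketch, assuming the model accepts variable-length input as in the usage example above; the 10-second window is an illustrative choice, not a model requirement:
```javascript
import * as ort from 'onnxruntime-node';

// Run inference window-by-window over a long Float32Array waveform.
// windowSec is an illustrative choice; use a smaller step for overlap.
async function runInWindows(session, samples, samplingRate, windowSec = 10) {
  const win = windowSec * samplingRate;
  const results = [];
  for (let offset = 0; offset < samples.length; offset += win) {
    const chunk = samples.subarray(offset, Math.min(offset + win, samples.length));
    const tensor = new ort.Tensor('float32', chunk, [1, 1, chunk.length]);
    const { logits } = await session.run({ input_values: tensor });
    // Remember where this window started so segment times can be shifted.
    results.push({ offsetSec: offset / samplingRate, logits: logits.data });
  }
  return results;
}
```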
## Example Results
When run against an audio file, the code outputs a table like this:
```
┌───────┬─────────┬─────────┬────┬─────────────────────┐
│ Index │ Start   │ End     │ ID │ Confidence          │
├───────┼─────────┼─────────┼────┼─────────────────────┤
│ 0     │ 0.00000 │ 0.38750 │ 0  │ -0.5956847206408247 │
│ 1     │ 0.38750 │ 0.87500 │ 2  │ -0.6725609518399854 │
│ 2     │ 0.87500 │ 1.31875 │ 0  │ -0.6251495976493047 │
│ 3     │ 1.31875 │ 1.68750 │ 2  │ -1.0951091697128392 │
│ 4     │ 1.68750 │ 2.30000 │ 3  │ -1.2232454111418622 │
│ 5     │ 2.30000 │ 3.19375 │ 2  │ -0.7195502450863511 │
│ 6     │ 3.19375 │ 3.71250 │ 0  │ -0.6267317700475712 │
│ 7     │ 3.71250 │ 4.64375 │ 2  │ -1.1656335032519587 │
│ 8     │ 4.64375 │ 4.79375 │ 0  │ -1.0008199909561597 │
└───────┴─────────┴─────────┴────┴─────────────────────┘
```
Each row represents a segment with:
- `start`: Start time of segment (seconds)
- `end`: End time of segment (seconds)
- `id`: Speaker/class ID
- `confidence`: Raw model logit for the winning class (not a probability; scores closer to 0 indicate higher confidence)

In this example, you can observe transitions between speakers 0 and 2, with a brief segment of background noise (ID 3) around the 2-second mark.
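Because the scores are raw logits, you can convert a frame's seven class scores into probabilities with a softmax if you prefer confidences in [0, 1]. A small helper sketch:
```javascript
// Softmax over one frame of class logits, yielding probabilities
// that sum to 1. Subtracting the max keeps exp() numerically stable.
function softmax(frame) {
  const max = Math.max(...frame);
  const exps = frame.map(v => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(v => v / sum);
}

// Example: probabilities for the first frame's 7 classes.
// const probs = softmax(Array.from(logitsData.slice(0, 7)));
```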
## Applications
This ONNX-converted model is suitable for:
- Cross-platform applications
- Edge devices with limited resources
- Server-side processing with Node.js
- Batch processing of audio files