
# PyAnnote Segmentation ONNX

This repository provides an ONNX version of the PyAnnote speaker segmentation model for efficient inference across various platforms.

## Overview

The PyAnnote segmentation model has been converted from PyTorch to ONNX format (`pyannote-segmentation-3.onnx`) for improved deployment flexibility and performance. This conversion enables running speaker diarization on platforms without PyTorch dependencies, with potentially faster inference times.
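A quick way to sanity-check the converted file is to load it with onnxruntime-node and print the model's input and output names (a minimal sketch; the names shown match the usage example below):

```javascript
import * as ort from 'onnxruntime-node';

// Load the ONNX model and inspect its I/O signature.
const session = await ort.InferenceSession.create('pyannote-segmentation-3.onnx');
console.log(session.inputNames);  // [ 'input_values' ]
console.log(session.outputNames); // [ 'logits' ]
```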

## Features

- **Platform Independence**: Run speaker diarization without PyTorch dependencies
- **Optimized Performance**: ONNX runtime optimizations for faster inference
- **Simple Integration**: Straightforward JavaScript/Node.js implementation included

## Quick Start

### Installation

```bash
npm install onnxruntime-node node-fetch wav-decoder
```

### Usage Example

```javascript
import fs from 'fs/promises';
import fetch from 'node-fetch';
import wav from 'wav-decoder';
import * as ort from 'onnxruntime-node';

// Load a WAV file (local path or URL) and wrap its first channel
// in a [1, 1, numSamples] float32 tensor.
async function fetchAudioAsTensor(path, samplingRate) {
  let audioBuffer;

  // Check whether the path is a URL or a local file
  if (path.startsWith('http')) {
    const response = await fetch(path);
    audioBuffer = await response.arrayBuffer();
  } else {
    audioBuffer = await fs.readFile(path);
  }

  const decoded = await wav.decode(new Uint8Array(audioBuffer).buffer);
  if (decoded.sampleRate !== samplingRate) {
    throw new Error(`Expected ${samplingRate} Hz audio, got ${decoded.sampleRate} Hz`);
  }

  const channelData = decoded.channelData[0]; // first channel only
  return new ort.Tensor('float32', channelData, [1, 1, channelData.length]);
}

// Convert per-frame logits into merged segments: take the arg-max class
// per frame, then merge consecutive frames that share the same class.
function postProcessSpeakerDiarization(logitsData, audioLength, samplingRate) {
  const timeStep = 0.00625; // seconds per output frame
  const numClasses = 7;
  const numFrames = Math.floor(logitsData.length / numClasses);

  const frames = [];
  for (let i = 0; i < numFrames; i++) {
    const frameData = Array.from(logitsData.slice(i * numClasses, (i + 1) * numClasses));
    const maxVal = Math.max(...frameData);
    const maxIndex = frameData.indexOf(maxVal);

    frames.push({
      start: i * timeStep,
      end: (i + 1) * timeStep,
      id: maxIndex,
      confidence: maxVal
    });
  }

  // Merge consecutive frames with the same class id
  const mergedResults = [];
  let currentSegment = frames[0];

  for (let i = 1; i < frames.length; i++) {
    if (frames[i].id === currentSegment.id) {
      currentSegment.end = frames[i].end;
      currentSegment.confidence = (currentSegment.confidence + frames[i].confidence) / 2;
    } else {
      mergedResults.push({ ...currentSegment });
      currentSegment = frames[i];
    }
  }
  mergedResults.push(currentSegment);

  return mergedResults;
}

(async () => {
  const modelPath = 'pyannote-segmentation-3.onnx';
  const audioPath = './mlk.wav'; // relative path to a 16 kHz mono WAV
  const samplingRate = 16000;

  const session = await ort.InferenceSession.create(modelPath);
  const audioTensor = await fetchAudioAsTensor(audioPath, samplingRate);

  const output = await session.run({ input_values: audioTensor });
  const logits = output.logits.data;

  const result = postProcessSpeakerDiarization(logits, audioTensor.dims[2], samplingRate);

  console.table(result.map(r => ({
    start: Number(r.start.toFixed(5)),
    end: Number(r.end.toFixed(5)),
    id: r.id,
    confidence: r.confidence
  })));
})();
```
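Note that the loader above keeps only the first channel. If your WAV files are stereo, you may get better results by averaging the channels into mono first; a minimal sketch (the `downmixToMono` helper is illustrative, not part of the repository):

```javascript
// Hypothetical helper: average all channels of a decoded WAV into one
// Float32Array. `decoded` is the object returned by wav.decode().
function downmixToMono(decoded) {
  const channels = decoded.channelData;
  if (channels.length === 1) return channels[0];

  const mono = new Float32Array(channels[0].length);
  for (let i = 0; i < mono.length; i++) {
    let sum = 0;
    for (const channel of channels) sum += channel[i];
    mono[i] = sum / channels.length; // simple average downmix
  }
  return mono;
}
```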

## Implementation Details

The repository includes a complete Node.js implementation for:

1. Loading audio from local files or URLs
2. Converting audio to the proper tensor format
3. Running inference with ONNX Runtime
4. Post-processing diarization results

## Speaker ID Interpretation

The model classifies audio segments with IDs representing different speakers or audio conditions:

- **ID 0**: Primary speaker
- **ID 1**: Not typically identified by the model
- **ID 2**: Secondary speaker
- **ID 3**: Background noise or brief interjections
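For readable output, a simple lookup can translate these IDs into labels (a sketch; the label strings below are illustrative, not defined by the model):

```javascript
// Illustrative mapping from class ID to a human-readable label.
const SPEAKER_LABELS = {
  0: 'primary speaker',
  2: 'secondary speaker',
  3: 'background noise / interjection',
};

const labelFor = (id) => SPEAKER_LABELS[id] ?? `class ${id}`;
```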

## Performance Considerations

- The model processes audio with a time step of 0.00625 seconds
- Best results are achieved with 16kHz mono WAV files
- Processing longer audio files may require running the model in chunks (see the sketch below)
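For long recordings, one option is to run the model window by window and shift the resulting segment times by each window's offset. A minimal sketch, reusing `postProcessSpeakerDiarization` from the usage example; the 10-second window is an assumed value, and overlap/stitching at window boundaries is deliberately omitted:

```javascript
// Sketch of chunked inference over a long waveform. Assumes `session` is an
// ort.InferenceSession and `samples` is a Float32Array of 16 kHz mono audio.
async function diarizeInChunks(session, samples, samplingRate, windowSeconds = 10) {
  const windowSize = windowSeconds * samplingRate;
  const results = [];

  for (let offset = 0; offset < samples.length; offset += windowSize) {
    // slice() copies the window so the tensor gets a tightly packed buffer
    const chunk = samples.slice(offset, offset + windowSize);
    const tensor = new ort.Tensor('float32', chunk, [1, 1, chunk.length]);
    const output = await session.run({ input_values: tensor });

    // Shift segment times by the window's position in the full file
    const offsetSeconds = offset / samplingRate;
    for (const seg of postProcessSpeakerDiarization(output.logits.data, chunk.length, samplingRate)) {
      results.push({ ...seg, start: seg.start + offsetSeconds, end: seg.end + offsetSeconds });
    }
  }
  return results;
}
```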

## Example Results

When run against an audio file, the code outputs a table like this:

```
┌───────┬─────────┬─────────┬────┬──────────────────────┐
│ Index │ Start   │ End     │ ID │ Confidence           │
├───────┼─────────┼─────────┼────┼──────────────────────┤
│ 0     │ 0.00000 │ 0.38750 │ 0  │ -0.5956847206408247  │
│ 1     │ 0.38750 │ 0.87500 │ 2  │ -0.6725609518399854  │
│ 2     │ 0.87500 │ 1.31875 │ 0  │ -0.6251495976493047  │
│ 3     │ 1.31875 │ 1.68750 │ 2  │ -1.0951091697128392  │
│ 4     │ 1.68750 │ 2.30000 │ 3  │ -1.2232454111418622  │
│ 5     │ 2.30000 │ 3.19375 │ 2  │ -0.7195502450863511  │
│ 6     │ 3.19375 │ 3.71250 │ 0  │ -0.6267317700475712  │
│ 7     │ 3.71250 │ 4.64375 │ 2  │ -1.1656335032519587  │
│ 8     │ 4.64375 │ 4.79375 │ 0  │ -1.0008199909561597  │
└───────┴─────────┴─────────┴────┴──────────────────────┘
```


Each row represents a segment with:
- `start`: Start time of segment (seconds)
- `end`: End time of segment (seconds)
- `id`: Speaker/class ID
- `confidence`: Raw model score for the winning class (a logit; negative values closer to 0 indicate higher confidence)

In this example, you can observe speaker transitions between speakers 0 and 2, with a brief segment of background noise (ID 3) around the 2-second mark.
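The confidence values above are raw logits, which is why they are negative. If probabilities are easier to work with, you can apply a softmax across each frame's 7 class scores before taking the arg-max; a minimal sketch (this assumes the `logits` output holds unnormalized scores, as the name and the negative values suggest):

```javascript
// Numerically stable softmax over one frame of class logits.
function softmax(frameData) {
  const maxVal = Math.max(...frameData); // subtract max for stability
  const exps = frameData.map(v => Math.exp(v - maxVal));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(v => v / sum);
}

// e.g. inside postProcessSpeakerDiarization:
// const probs = softmax(frameData);
// ...push({ ..., confidence: Math.max(...probs) });
```

With this change, each segment's confidence becomes a value in (0, 1) that is easier to compare across frames.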

## Applications

This ONNX-converted model is suitable for:

- Cross-platform applications
- Edge devices with limited resources
- Server-side processing with Node.js
- Batch processing of audio files
