PyAnnote Segmentation ONNX
This repository provides an ONNX version of the PyAnnote speaker segmentation model for efficient inference across various platforms.
Overview
The PyAnnote segmentation model has been converted from PyTorch to ONNX format (pyannote-segmentation-3.onnx) for improved deployment flexibility and performance. This conversion enables running speaker diarization on platforms without PyTorch dependencies, with potentially faster inference times.
Features
- Platform Independence: Run speaker diarization without PyTorch dependencies
- Optimized Performance: ONNX runtime optimizations for faster inference
- Simple Integration: Straightforward JavaScript/Node.js implementation included
Quick Start
Installation
npm install onnxruntime-node node-fetch wav-decoder
Usage Example
import fs from 'fs/promises';
import fetch from 'node-fetch';
import wav from 'wav-decoder';
import * as ort from 'onnxruntime-node';

async function fetchAudioAsTensor(path, sampling_rate) {
  let audioBuffer;
  // Load the raw WAV bytes from a URL or a local file
  if (path.startsWith('http')) {
    const response = await fetch(path);
    audioBuffer = await response.arrayBuffer();
  } else {
    audioBuffer = await fs.readFile(path);
  }
  const decoded = await wav.decode(new Uint8Array(audioBuffer).buffer);
  if (decoded.sampleRate !== sampling_rate) {
    throw new Error(`Expected ${sampling_rate} Hz audio but got ${decoded.sampleRate} Hz; resample the file first`);
  }
  // Use the first channel only; the model expects mono input
  const channelData = decoded.channelData[0];
  // Tensor shape: [batch, channel, samples]
  return new ort.Tensor('float32', channelData, [1, 1, channelData.length]);
}
function postProcessSpeakerDiarization(logitsData, audioLength, samplingRate) {
  const timeStep = 0.00625; // seconds of audio covered by each output frame
  const numClasses = 7;     // number of output classes per frame

  // Pick the highest-scoring class for each frame
  const numFrames = Math.floor(logitsData.length / numClasses);
  const frames = [];
  for (let i = 0; i < numFrames; i++) {
    const frameData = Array.from(logitsData.slice(i * numClasses, (i + 1) * numClasses));
    const maxVal = Math.max(...frameData);
    const maxIndex = frameData.indexOf(maxVal);
    frames.push({
      start: i * timeStep,
      end: (i + 1) * timeStep,
      id: maxIndex,
      confidence: maxVal
    });
  }
  if (frames.length === 0) return [];

  // Merge consecutive frames that share the same class ID
  const mergedResults = [];
  let currentSegment = frames[0];
  for (let i = 1; i < frames.length; i++) {
    if (frames[i].id === currentSegment.id) {
      currentSegment.end = frames[i].end;
      // Running pairwise average keeps a rough per-segment confidence
      currentSegment.confidence = (currentSegment.confidence + frames[i].confidence) / 2;
    } else {
      mergedResults.push({ ...currentSegment });
      currentSegment = frames[i];
    }
  }
  mergedResults.push(currentSegment);
  return mergedResults;
}
(async () => {
  const model_path = 'pyannote-segmentation-3.onnx'; // local path to the ONNX model
  const audio_path = './mlk.wav';
  const sampling_rate = 16000;

  const session = await ort.InferenceSession.create(model_path);
  const audioTensor = await fetchAudioAsTensor(audio_path, sampling_rate);

  // Run inference; the model takes a single 'input_values' tensor
  const output = await session.run({ input_values: audioTensor });
  const logits = output.logits.data;

  const result = postProcessSpeakerDiarization(logits, audioTensor.dims[2], sampling_rate);
  console.table(result.map(r => ({
    start: Number(r.start.toFixed(5)),
    end: Number(r.end.toFixed(5)),
    id: r.id,
    confidence: r.confidence
  })));
})();
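To try the example end to end, save the script above next to the model file and a 16 kHz mono WAV, then run it with Node.js (the index.mjs filename is an assumption; any ES-module-aware entry point works):

node index.mjs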
Implementation Details
The repository includes a complete Node.js implementation for:
- Loading audio from local files or URLs
- Converting audio to the proper tensor format
- Running inference with ONNX Runtime
- Post-processing diarization results
Speaker ID Interpretation
The model classifies audio segments with IDs representing different speakers or audio conditions (a small labeling helper is sketched after this list):
- ID 0: Primary speaker
- ID 1: Not typically identified by the model
- ID 2: Secondary speaker
- ID 3: Background noise or brief interjections
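For display purposes, the raw IDs can be mapped to human-readable labels. A minimal sketch; the SPEAKER_LABELS map and its label strings are illustrative assumptions, not metadata shipped with the model:

// Illustrative label map based on the ID interpretation above
const SPEAKER_LABELS = {
  0: 'speaker_0',
  2: 'speaker_1',
  3: 'noise/interjection'
};

function labelSegment(segment) {
  // Fall back to a generic class label for IDs the map does not cover
  return { ...segment, label: SPEAKER_LABELS[segment.id] ?? `class_${segment.id}` };
}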
Performance Considerations
- The model processes audio with a time step of 0.00625 seconds
- Best results are achieved with 16kHz mono WAV files
- Processing longer audio files may require splitting them into chunks and running inference per chunk (see the sketch following this list)
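A minimal chunking sketch that reuses the functions defined earlier; the 10-second window, the chunkAudio and diarizeInChunks helper names, and the per-chunk offset bookkeeping are illustrative choices, not something this repository prescribes:

// Split a mono Float32Array into fixed-length windows (illustrative helper)
function* chunkAudio(channelData, samplingRate, windowSeconds = 10) {
  const windowSize = windowSeconds * samplingRate;
  for (let offset = 0; offset < channelData.length; offset += windowSize) {
    yield {
      startTime: offset / samplingRate, // seconds into the original file
      samples: channelData.slice(offset, offset + windowSize)
    };
  }
}

// Each chunk is wrapped in its own [1, 1, n] tensor and run separately;
// segment timestamps are shifted by the chunk's startTime afterwards.
// Note: segments are not merged across chunk boundaries in this sketch.
async function diarizeInChunks(session, channelData, samplingRate) {
  const segments = [];
  for (const chunk of chunkAudio(channelData, samplingRate)) {
    const tensor = new ort.Tensor('float32', chunk.samples, [1, 1, chunk.samples.length]);
    const output = await session.run({ input_values: tensor });
    const chunkSegments = postProcessSpeakerDiarization(output.logits.data, chunk.samples.length, samplingRate);
    for (const s of chunkSegments) {
      segments.push({ ...s, start: s.start + chunk.startTime, end: s.end + chunk.startTime });
    }
  }
  return segments;
}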
Example Results
When run against an audio file, the code outputs a table like this:
┌───────┬─────────┬─────────┬────┬─────────────────────┐
│ Index │ Start   │ End     │ ID │ Confidence          │
├───────┼─────────┼─────────┼────┼─────────────────────┤
│ 0     │ 0.00000 │ 0.38750 │ 0  │ -0.5956847206408247 │
│ 1     │ 0.38750 │ 0.87500 │ 2  │ -0.6725609518399854 │
│ 2     │ 0.87500 │ 1.31875 │ 0  │ -0.6251495976493047 │
│ 3     │ 1.31875 │ 1.68750 │ 2  │ -1.0951091697128392 │
│ 4     │ 1.68750 │ 2.30000 │ 3  │ -1.2232454111418622 │
│ 5     │ 2.30000 │ 3.19375 │ 2  │ -0.7195502450863511 │
│ 6     │ 3.19375 │ 3.71250 │ 0  │ -0.6267317700475712 │
│ 7     │ 3.71250 │ 4.64375 │ 2  │ -1.1656335032519587 │
│ 8     │ 4.64375 │ 4.79375 │ 0  │ -1.0008199909561597 │
└───────┴─────────┴─────────┴────┴─────────────────────┘
Each row represents a segment with:
- start: Start time of the segment (seconds)
- end: End time of the segment (seconds)
- id: Speaker/class ID
- confidence: Model confidence score (negative values closer to 0 indicate higher confidence)
In this example, you can observe speaker transitions between speakers 0 and 2, with a brief segment of background noise (ID 3) around the 2-second mark.
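The confidence column holds the raw winning-class score. If the model's final layer is a log-softmax over the 7 classes, as the all-negative scores suggest (an assumption worth confirming against the original pyannote implementation), per-frame probabilities can be recovered with exp:

// Hypothetical helper; assumes the 7 per-frame scores are log-probabilities
function frameProbabilities(frameScores) {
  return frameScores.map(v => Math.exp(v)); // exp(log p) = p
}

// e.g. exp(-0.5957) ≈ 0.55, i.e. roughly 55% probability for the winning class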
Applications
This ONNX-converted model is suitable for:
- Cross-platform applications
- Edge devices with limited resources
- Server-side processing with Node.js
- Batch processing of audio files