PyAnnote Segmentation ONNX

This repository provides an ONNX version of the PyAnnote speaker segmentation model for efficient inference across various platforms.

Overview

The PyAnnote segmentation model has been converted from PyTorch to ONNX format (pyannote-segmentation-3.onnx) for improved deployment flexibility and performance. This conversion enables running speaker diarization on platforms without PyTorch dependencies, with potentially faster inference times.

Features

  • Platform Independence: Run speaker diarization without PyTorch dependencies
  • Optimized Performance: ONNX Runtime optimizations for faster inference (see the session-options sketch after this list)
  • Simple Integration: Straightforward JavaScript/Node.js implementation included

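As a minimal sketch of enabling those optimizations with onnxruntime-node: graphOptimizationLevel, executionProviders, and intraOpNumThreads are standard onnxruntime-node session options, and the values shown are illustrative starting points rather than tuned recommendations.

import * as ort from 'onnxruntime-node';

// Create a session with explicit optimization settings.
const session = await ort.InferenceSession.create('pyannote-segmentation-3.onnx', {
    graphOptimizationLevel: 'all',  // apply all graph-level optimizations
    executionProviders: ['cpu'],    // onnxruntime-node executes on CPU
    intraOpNumThreads: 4            // tune to the host machine
});
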
Quick Start

Installation

npm install onnxruntime-node node-fetch wav-decoder

Usage Example


import fs from 'fs/promises';
import fetch from 'node-fetch';
import wav from 'wav-decoder';
import * as ort from 'onnxruntime-node';

async function fetchAudioAsTensor(path, sampling_rate) {
    let audioBuffer;

    // Check if path is a local file or URL
    if (path.startsWith('http')) {
        const response = await fetch(path);
        audioBuffer = await response.arrayBuffer();
    } else {
        // Read local file
        audioBuffer = await fs.readFile(path);
    }

    // wav-decoder expects an ArrayBuffer; copying through a Uint8Array
    // handles both the Node Buffer and the fetch ArrayBuffer cases.
    const decoded = await wav.decode(new Uint8Array(audioBuffer).buffer);

    // The model expects audio at sampling_rate (16 kHz); wav-decoder does
    // not resample, so fail loudly on a mismatched file.
    if (decoded.sampleRate !== sampling_rate) {
        throw new Error(`Expected ${sampling_rate} Hz audio, got ${decoded.sampleRate} Hz`);
    }

    // Use the first channel; the tensor shape is (batch, channel, samples).
    const channelData = decoded.channelData[0];
    return new ort.Tensor('float32', channelData, [1, 1, channelData.length]);
}

function postProcessSpeakerDiarization(logitsData, audioLength, samplingRate) {
    const timeStep = 0.00625;  // duration of one output frame, in seconds
    const numClasses = 7;      // number of speaker/audio classes in the output
    const numFrames = Math.floor(logitsData.length / numClasses);

    // Take the highest-scoring class for each frame (argmax over classes).
    const frames = [];
    for (let i = 0; i < numFrames; i++) {
        const frameData = Array.from(logitsData.slice(i * numClasses, (i + 1) * numClasses));
        const maxVal = Math.max(...frameData);
        const maxIndex = frameData.indexOf(maxVal);

        frames.push({
            start: i * timeStep,
            end: (i + 1) * timeStep,
            id: maxIndex,
            confidence: maxVal
        });
    }

    if (frames.length === 0) return [];

    // Merge consecutive frames that share a class into segments,
    // averaging the per-frame confidence scores across each segment.
    const mergedResults = [];
    let currentSegment = { ...frames[0] };
    let frameCount = 1;

    for (let i = 1; i < frames.length; i++) {
        if (frames[i].id === currentSegment.id) {
            currentSegment.end = frames[i].end;
            currentSegment.confidence += frames[i].confidence;
            frameCount++;
        } else {
            currentSegment.confidence /= frameCount;
            mergedResults.push(currentSegment);
            currentSegment = { ...frames[i] };
            frameCount = 1;
        }
    }
    currentSegment.confidence /= frameCount;

    // The frame grid can overshoot the clip slightly; clamp the final end time.
    currentSegment.end = Math.min(currentSegment.end, audioLength / samplingRate);
    mergedResults.push(currentSegment);

    return mergedResults;
}

(async () => {
    const model_path = 'pyannote-segmentation-3.onnx';  // local path to the ONNX model
    const audio_path = './mlk.wav';                     // local file or http(s) URL
    const sampling_rate = 16000;

    const session = await ort.InferenceSession.create(model_path);
    const audioTensor = await fetchAudioAsTensor(audio_path, sampling_rate);

    // This export takes 'input_values' and returns 'logits' with shape
    // (batch, frames, classes); output.logits.data is the flattened buffer.
    const output = await session.run({ input_values: audioTensor });
    const logits = output.logits.data;

    const result = postProcessSpeakerDiarization(logits, audioTensor.dims[2], sampling_rate);

    console.table(result.map(r => ({
        start: Number(r.start.toFixed(5)),
        end: Number(r.end.toFixed(5)),
        id: r.id,
        confidence: r.confidence
    })));
})();
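
To run the example, note that the import statements require Node's ES-module mode: save the script with a .mjs extension (any filename works; segment.mjs is just an illustrative choice) or set "type": "module" in package.json, place the model and audio files alongside it, and run it with node.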

Implementation Details

The repository includes a complete Node.js implementation for:

  1. Loading audio from local files or URLs
  2. Converting audio to the proper tensor format (see the downmix sketch after this list)
  3. Running inference with ONNX Runtime
  4. Post-processing diarization results
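
For stereo recordings, a small preparation step before building the tensor can help, since the model expects a single channel. A minimal sketch assuming wav-decoder's output shape (decoded.channelData holds one Float32Array per channel); downmixStereo is a hypothetical helper, not part of this repository:

// Hypothetical helper: average the channels of a stereo recording into a
// single mono Float32Array, matching the model's single-channel input.
function downmixStereo(decoded) {
    if (decoded.channelData.length === 1) return decoded.channelData[0];
    const [left, right] = decoded.channelData;
    const mono = new Float32Array(left.length);
    for (let i = 0; i < left.length; i++) {
        mono[i] = (left[i] + right[i]) / 2;
    }
    return mono;
}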

Speaker ID Interpretation

The model classifies audio segments with IDs representing different speakers or audio conditions (see the label-mapping sketch after this list):

  • ID 0: Primary speaker
  • ID 1: Not typically identified by the model
  • ID 2: Secondary speaker
  • ID 3: Background noise or brief interjections
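
When printing results, a small lookup can turn these numeric IDs into readable labels. A minimal sketch based on the interpretation above; the label strings are illustrative, and result is the merged-segment array from the usage example:

// Illustrative labels for the class IDs listed above.
const SPEAKER_LABELS = {
    0: 'primary speaker',
    1: 'unused',
    2: 'secondary speaker',
    3: 'background noise / interjection'
};

console.table(result.map(r => ({
    ...r,
    label: SPEAKER_LABELS[r.id] ?? `class ${r.id}`
})));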

Performance Considerations

  • The model produces one output frame every 0.00625 seconds
  • Best results are achieved with 16 kHz mono WAV files
  • Processing longer audio files may require batching (see the chunking sketch after this list)
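
One simple batching approach is to slice the waveform into fixed-length windows, run the session once per window, and offset the resulting timestamps. A minimal sketch reusing the session and postProcessSpeakerDiarization from the usage example; the 10-second window is an illustrative choice, and segments that straddle a window boundary are not re-merged here:

async function diarizeInChunks(session, audioTensor, samplingRate, windowSeconds = 10) {
    const samples = audioTensor.data;  // flat Float32Array behind the [1, 1, N] tensor
    const windowSize = windowSeconds * samplingRate;
    const allSegments = [];

    for (let offset = 0; offset < samples.length; offset += windowSize) {
        const chunk = samples.slice(offset, offset + windowSize);
        const chunkTensor = new ort.Tensor('float32', chunk, [1, 1, chunk.length]);

        const output = await session.run({ input_values: chunkTensor });
        const segments = postProcessSpeakerDiarization(
            output.logits.data, chunk.length, samplingRate);

        // Shift each segment by the chunk's start time within the full clip.
        const offsetSeconds = offset / samplingRate;
        for (const s of segments) {
            allSegments.push({ ...s, start: s.start + offsetSeconds, end: s.end + offsetSeconds });
        }
    }
    return allSegments;
}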

Example Results

When run against an audio file, the code outputs a table like this:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Index   β”‚  Start   β”‚   End    β”‚ ID β”‚     Confidence       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚   0     β”‚ 0.00000  β”‚ 0.38750  β”‚ 0  β”‚ -0.5956847206408247  β”‚
β”‚   1     β”‚ 0.38750  β”‚ 0.87500  β”‚ 2  β”‚ -0.6725609518399854  β”‚
β”‚   2     β”‚ 0.87500  β”‚ 1.31875  β”‚ 0  β”‚ -0.6251495976493047  β”‚
β”‚   3     β”‚ 1.31875  β”‚ 1.68750  β”‚ 2  β”‚ -1.0951091697128392  β”‚
β”‚   4     β”‚ 1.68750  β”‚ 2.30000  β”‚ 3  β”‚ -1.2232454111418622  β”‚
β”‚   5     β”‚ 2.30000  β”‚ 3.19375  β”‚ 2  β”‚ -0.7195502450863511  β”‚
β”‚   6     β”‚ 3.19375  β”‚ 3.71250  β”‚ 0  β”‚ -0.6267317700475712  β”‚
β”‚   7     β”‚ 3.71250  β”‚ 4.64375  β”‚ 2  β”‚ -1.1656335032519587  β”‚
β”‚   8     β”‚ 4.64375  β”‚ 4.79375  β”‚ 0  β”‚ -1.0008199909561597  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Each row represents a segment with:

  • start: Start time of segment (seconds)
  • end: End time of segment (seconds)
  • id: Speaker/class ID
  • confidence: Model confidence score, taken from the raw model output; negative values closer to 0 indicate higher confidence (see the softmax sketch after this list)
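
Because the scores are raw model outputs, they can be normalized into per-frame probabilities with a softmax when that is easier to interpret. A minimal sketch operating on the seven scores of a single frame, as sliced inside postProcessSpeakerDiarization:

// Convert one frame's raw scores into probabilities that sum to 1.
// Subtracting the max first keeps Math.exp numerically stable.
function softmax(scores) {
    const max = Math.max(...scores);
    const exps = Array.from(scores, s => Math.exp(s - max));
    const sum = exps.reduce((a, b) => a + b, 0);
    return exps.map(e => e / sum);
}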

In this example, you can observe speaker transitions between speakers 0 and 2, with a brief segment of background noise (ID 3) around the 2-second mark.

Applications

This ONNX-converted model is suitable for:

  • Cross-platform applications
  • Edge devices with limited resources
  • Server-side processing with Node.js
  • Batch processing of audio files