PyAnnote Embedding ONNX

About

This repository contains PyAnnote's speaker embedding model converted from PyTorch to ONNX format, reducing the model size from 96MB to just 17MB. The conversion provides substantial benefits for deployment scenarios where size and runtime efficiency are critical.

Key Benefits

  • Reduced Model Size: Compressed from 96MB to 17MB (~82% reduction)
  • Lightweight Runtime: Only requires ONNX Runtime (~2MB) instead of the full PyTorch stack
  • Deployment Flexibility: Easier to deploy in resource-constrained environments
  • Cross-Platform Support: Works with ONNX Runtime across different platforms and languages

Requirements

  • ONNX Runtime (the example uses the onnxruntime-node package)
  • A WAV decoder for audio processing (the example uses the wav-decoder package)

Usage

The repository includes a Node.js implementation demonstrating how to use the converted model for speaker verification tasks. The code loads audio files, extracts embeddings using the ONNX model, and compares them using cosine similarity.

Installation

npm install onnxruntime-node wav-decoder
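
Note that the example below uses ES module imports and top-level await, so it assumes Node.js 14.8 or later with ESM enabled (an .mjs file extension, or "type": "module" in package.json).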

Basic Usage Example

import * as ort from 'onnxruntime-node';
import fs from 'fs/promises';
import wav from 'wav-decoder';

// Naive nearest-neighbor resampling: each output sample is the closest
// source sample. Fine for a demo; a band-limited resampler is preferable
// for production-quality audio.
function resampleAudio(audioData, originalSampleRate, targetSampleRate) {
    const ratio = originalSampleRate / targetSampleRate;
    const newLength = Math.floor(audioData.length / ratio);
    const result = new Float32Array(newLength);

    for (let i = 0; i < newLength; i++) {
        const originalIndex = Math.floor(i * ratio);
        result[i] = audioData[originalIndex];
    }

    return result;
}

// Load a WAV file, take the first channel, resample to 16 kHz,
// and pad or truncate to exactly one second (16000 samples).
async function loadAndPreprocessAudio(audioPath) {
    const audioBuffer = await fs.readFile(audioPath);
    const decoded = await wav.decode(new Uint8Array(audioBuffer).buffer);
    const audioData = decoded.channelData[0]; // use the first channel only

    const targetSampleRate = 16000;
    const resampledData = resampleAudio(audioData, decoded.sampleRate, targetSampleRate);

    // Zero-padded buffer of exactly 16000 samples (1 second at 16 kHz)
    const finalData = new Float32Array(16000);
    finalData.set(resampledData.slice(0, 16000));

    return finalData;
}

// Cosine similarity between two embedding vectors:
// 1 = same direction, 0 = orthogonal.
function cosineSimilarity(embedding1, embedding2) {
    let dotProduct = 0;
    let norm1 = 0;
    let norm2 = 0;

    for (let i = 0; i < embedding1.length; i++) {
        dotProduct += embedding1[i] * embedding2[i];
        norm1 += embedding1[i] * embedding1[i];
        norm2 += embedding2[i] * embedding2[i];
    }

    return dotProduct / (Math.sqrt(norm1) * Math.sqrt(norm2));
}

async function compareSpeakers(modelPath, audioPath1, audioPath2, threshold = 0.75) {
    try {
        const session = await ort.InferenceSession.create(modelPath);
        
        // Get embeddings for both audio files
        const audioData1 = await loadAndPreprocessAudio(audioPath1);
        const audioData2 = await loadAndPreprocessAudio(audioPath2);
        
        // Create input tensors with shape [batch, channel, samples]
        const inputTensor1 = new ort.Tensor('float32', audioData1, [1, 1, 16000]);
        const inputTensor2 = new ort.Tensor('float32', audioData2, [1, 1, 16000]);
        
        // Get embeddings
        const results1 = await session.run({ audio_input: inputTensor1 });
        const results2 = await session.run({ audio_input: inputTensor2 });
        
        const embedding1 = Array.from(results1[session.outputNames[0]].data);
        const embedding2 = Array.from(results2[session.outputNames[0]].data);
        
        // Calculate similarity
        const similarity = cosineSimilarity(embedding1, embedding2);
        
        return {
            isSameSpeaker: similarity > threshold,
            similarity: similarity,
            embedding1: embedding1,
            embedding2: embedding2
        };
        
    } catch (error) {
        console.error('Error during speaker comparison:', error);
        throw error;
    }
}

// Load the model and compare two audio files
const result = await compareSpeakers(
    './pyannote_embedding.onnx', 
    './speaker_1.wav', 
    './speaker_2.wav'
);

console.log('Similarity score:', result.similarity.toFixed(4));
console.log('Same speaker:', result.isSameSpeaker ? 'Yes' : 'No');
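
The default threshold of 0.75 is a heuristic, not a property of the model: cosine similarity scores shift with recording conditions, microphone, and clip length, so in practice it is worth calibrating the threshold on labeled same-speaker and different-speaker pairs from your own data.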

Implementation Details

The included code provides functionality for:

  1. Loading and preprocessing audio files
  2. Resampling audio to the required 16kHz sample rate (a higher-quality variant is sketched after this list)
  3. Running inference with the ONNX model
  4. Computing cosine similarity between speaker embeddings
  5. Making speaker verification decisions based on a similarity threshold
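
The resampleAudio function above uses nearest-neighbor resampling (it simply drops or repeats samples), which is the simplest option but can introduce artifacts. As a minimal sketch, the hypothetical drop-in replacement below interpolates linearly between the two nearest source samples; note that neither version low-pass filters the signal first, so a dedicated resampling library remains the better choice for production audio.

// Hypothetical drop-in replacement for resampleAudio using linear
// interpolation between the two nearest source samples.
function resampleAudioLinear(audioData, originalSampleRate, targetSampleRate) {
    const ratio = originalSampleRate / targetSampleRate;
    const newLength = Math.floor(audioData.length / ratio);
    const result = new Float32Array(newLength);

    for (let i = 0; i < newLength; i++) {
        const pos = i * ratio;
        const left = Math.floor(pos);
        const right = Math.min(left + 1, audioData.length - 1);
        const frac = pos - left;
        // Weighted blend of the two neighboring source samples
        result[i] = audioData[left] * (1 - frac) + audioData[right] * frac;
    }

    return result;
}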

Model Information

  • Input: 1-second audio segment sampled at 16kHz
  • Output: Speaker embedding vector (dimension determined by the model architecture; see the runtime inspection sketch below)
  • Format: ONNX format, optimized for inference
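
The input name (audio_input above) and the embedding dimension are properties of this particular export rather than guaranteed by ONNX itself, so it is safest to verify them at runtime. A minimal sketch using onnxruntime-node:

import * as ort from 'onnxruntime-node';

const session = await ort.InferenceSession.create('./pyannote_embedding.onnx');
console.log('Inputs:', session.inputNames);
console.log('Outputs:', session.outputNames);

// Feed one second of silence to discover the embedding dimension.
const silence = new ort.Tensor('float32', new Float32Array(16000), [1, 1, 16000]);
const outputs = await session.run({ [session.inputNames[0]]: silence });
console.log('Embedding dimension:', outputs[session.outputNames[0]].data.length);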