---
license: mit
---

# PyAnnote Embedding ONNX

## About

This repository contains PyAnnote's speaker embedding model converted from PyTorch to ONNX format, reducing the model size from 96MB to just 17MB. The conversion brings substantial benefits for deployment scenarios where size and runtime efficiency are critical.

## Key Benefits

- **Reduced Model Size**: Compressed from 96MB to 17MB (~82% reduction)
- **Lightweight Runtime**: Only requires ONNX Runtime (~2MB) instead of the full PyTorch stack
- **Deployment Flexibility**: Easier to deploy in resource-constrained environments
- **Cross-Platform Support**: Works with ONNX Runtime across different platforms and languages

## Requirements

- ONNX Runtime (the example uses the Node.js package, `onnxruntime-node`)
- A WAV decoder for audio processing (the example uses `wav-decoder`)

## Usage

The repository includes a Node.js implementation demonstrating how to use the converted model for speaker verification. The code loads audio files, extracts embeddings with the ONNX model, and compares them using cosine similarity.

### Installation

```bash
npm install onnxruntime-node wav-decoder
```

### Basic Usage Example

```javascript
import * as ort from 'onnxruntime-node';
import fs from 'fs/promises';
import wav from 'wav-decoder';

// Nearest-neighbor resampling to the model's expected sample rate
function resampleAudio(audioData, originalSampleRate, targetSampleRate) {
  const ratio = originalSampleRate / targetSampleRate;
  const newLength = Math.floor(audioData.length / ratio);
  const result = new Float32Array(newLength);
  for (let i = 0; i < newLength; i++) {
    const originalIndex = Math.floor(i * ratio);
    result[i] = audioData[originalIndex];
  }
  return result;
}

// Decode a WAV file, take the first channel, resample to 16kHz, and
// pad or truncate to exactly one second (16000 samples)
async function loadAndPreprocessAudio(audioPath) {
  const audioBuffer = await fs.readFile(audioPath);
  const decoded = await wav.decode(new Uint8Array(audioBuffer).buffer);
  const audioData = decoded.channelData[0];
  const targetSampleRate = 16000;
  const resampledData = resampleAudio(audioData, decoded.sampleRate, targetSampleRate);
  const finalData = new Float32Array(16000);
  finalData.set(resampledData.slice(0, 16000));
  return finalData;
}

// Cosine similarity between two embedding vectors
function cosineSimilarity(embedding1, embedding2) {
  let dotProduct = 0;
  let norm1 = 0;
  let norm2 = 0;
  for (let i = 0; i < embedding1.length; i++) {
    dotProduct += embedding1[i] * embedding2[i];
    norm1 += embedding1[i] * embedding1[i];
    norm2 += embedding2[i] * embedding2[i];
  }
  return dotProduct / (Math.sqrt(norm1) * Math.sqrt(norm2));
}

async function compareSpeakers(modelPath, audioPath1, audioPath2, threshold = 0.75) {
  try {
    const session = await ort.InferenceSession.create(modelPath);

    // Get embeddings for both audio files
    const audioData1 = await loadAndPreprocessAudio(audioPath1);
    const audioData2 = await loadAndPreprocessAudio(audioPath2);

    // Create tensors with shape [batch, channel, samples]
    const inputTensor1 = new ort.Tensor('float32', audioData1, [1, 1, 16000]);
    const inputTensor2 = new ort.Tensor('float32', audioData2, [1, 1, 16000]);

    // Get embeddings
    const results1 = await session.run({ audio_input: inputTensor1 });
    const results2 = await session.run({ audio_input: inputTensor2 });

    const embedding1 = Array.from(results1[session.outputNames[0]].data);
    const embedding2 = Array.from(results2[session.outputNames[0]].data);

    // Calculate similarity
    const similarity = cosineSimilarity(embedding1, embedding2);

    return {
      isSameSpeaker: similarity > threshold,
      similarity: similarity,
      embedding1: embedding1,
      embedding2: embedding2
    };
  } catch (error) {
    console.error('Error during speaker comparison:', error);
    throw error;
  }
}
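// Optional helper, not used by the example above: average all channels to
// mono instead of taking only channelData[0] in loadAndPreprocessAudio.
// Whether downmixing helps depends on your recordings; treat this as a
// sketch you can swap in for multi-channel WAV files.
function downmixToMono(channelData) {
  const length = channelData[0].length;
  const mono = new Float32Array(length);
  for (const channel of channelData) {
    for (let i = 0; i < length; i++) {
      mono[i] += channel[i] / channelData.length;
    }
  }
  return mono;
}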
// Load the model and compare two audio files
const result = await compareSpeakers(
  './pyannote_embedding.onnx',
  './speaker_1.wav',
  './speaker_2.wav'
);
console.log('Similarity score:', result.similarity.toFixed(4));
console.log('Same speaker:', result.isSameSpeaker ? 'Yes' : 'No');
```

## Implementation Details

The included code provides functionality for:

1. Loading and preprocessing audio files
2. Resampling audio to the required 16kHz sample rate
3. Running inference with the ONNX model
4. Computing cosine similarity between speaker embeddings
5. Making speaker verification decisions based on a similarity threshold

## Model Information

- **Input**: 1-second audio segment sampled at 16kHz
- **Output**: Speaker embedding vector (dimension determined by the model architecture)
- **Format**: ONNX, optimized for inference
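Since the embedding dimension depends on the model architecture, one way to confirm it (along with the input and output names) is to load the model and run a silent one-second clip. A minimal sketch, assuming the model file sits at `./pyannote_embedding.onnx` as in the usage example:

```javascript
import * as ort from 'onnxruntime-node';

// Print the model's declared input/output names, then run a silent
// one-second input to discover the embedding shape.
const session = await ort.InferenceSession.create('./pyannote_embedding.onnx');
console.log('Inputs:', session.inputNames);   // expected: ['audio_input']
console.log('Outputs:', session.outputNames);

const silence = new ort.Tensor('float32', new Float32Array(16000), [1, 1, 16000]);
const results = await session.run({ [session.inputNames[0]]: silence });
console.log('Embedding shape:', results[session.outputNames[0]].dims);
```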