# PyAnnote Embedding ONNX

## About
This repository contains a converted version of PyAnnote's speaker embedding model from PyTorch to ONNX format, significantly reducing the model size from 96MB to just 17MB. This conversion provides substantial benefits for deployment scenarios where size and runtime efficiency are critical.
## Key Benefits
- Reduced Model Size: Compressed from 96MB to 17MB (~82% reduction)
- Lightweight Runtime: Only requires ONNX Runtime (~2MB) instead of the full PyTorch stack
- Deployment Flexibility: Easier to deploy in resource-constrained environments
- Cross-Platform Support: Works with ONNX Runtime across different platforms and languages
## Requirements

- ONNX Runtime (the example below uses `onnxruntime-node`)
- A WAV decoder for audio loading (the example uses `wav-decoder`)
## Usage
The repository includes a Node.js implementation demonstrating how to use the converted model for speaker verification tasks. The code loads audio files, extracts embeddings using the ONNX model, and compares them using cosine similarity.
### Installation

```bash
npm install onnxruntime-node wav-decoder
```
### Basic Usage Example

```javascript
import * as ort from 'onnxruntime-node';
import fs from 'fs/promises';
import wav from 'wav-decoder';

// Nearest-neighbor resampling; adequate for a demo, though a proper
// low-pass filter would reduce aliasing.
function resampleAudio(audioData, originalSampleRate, targetSampleRate) {
  const ratio = originalSampleRate / targetSampleRate;
  const newLength = Math.floor(audioData.length / ratio);
  const result = new Float32Array(newLength);
  for (let i = 0; i < newLength; i++) {
    const originalIndex = Math.floor(i * ratio);
    result[i] = audioData[originalIndex];
  }
  return result;
}

// Decode a WAV file, take the first channel, resample to 16 kHz, and
// truncate or zero-pad to exactly one second (16000 samples).
async function loadAndPreprocessAudio(audioPath) {
  const audioBuffer = await fs.readFile(audioPath);
  const decoded = await wav.decode(new Uint8Array(audioBuffer).buffer);
  const audioData = decoded.channelData[0];
  const targetSampleRate = 16000;
  const resampledData = resampleAudio(audioData, decoded.sampleRate, targetSampleRate);
  const finalData = new Float32Array(16000); // zero-initialized, so short clips are padded
  finalData.set(resampledData.slice(0, 16000));
  return finalData;
}

// Cosine similarity between two embedding vectors.
function cosineSimilarity(embedding1, embedding2) {
  let dotProduct = 0;
  let norm1 = 0;
  let norm2 = 0;
  for (let i = 0; i < embedding1.length; i++) {
    dotProduct += embedding1[i] * embedding2[i];
    norm1 += embedding1[i] * embedding1[i];
    norm2 += embedding2[i] * embedding2[i];
  }
  return dotProduct / (Math.sqrt(norm1) * Math.sqrt(norm2));
}

async function compareSpeakers(modelPath, audioPath1, audioPath2, threshold = 0.75) {
  try {
    const session = await ort.InferenceSession.create(modelPath);

    // Get embeddings for both audio files
    const audioData1 = await loadAndPreprocessAudio(audioPath1);
    const audioData2 = await loadAndPreprocessAudio(audioPath2);

    // Create input tensors with shape [batch, channel, samples]
    const inputTensor1 = new ort.Tensor('float32', audioData1, [1, 1, 16000]);
    const inputTensor2 = new ort.Tensor('float32', audioData2, [1, 1, 16000]);

    // Run inference to get the embeddings
    const results1 = await session.run({ audio_input: inputTensor1 });
    const results2 = await session.run({ audio_input: inputTensor2 });
    const embedding1 = Array.from(results1[session.outputNames[0]].data);
    const embedding2 = Array.from(results2[session.outputNames[0]].data);

    // Calculate similarity and apply the decision threshold
    const similarity = cosineSimilarity(embedding1, embedding2);

    return {
      isSameSpeaker: similarity > threshold,
      similarity,
      embedding1,
      embedding2
    };
  } catch (error) {
    console.error('Error during speaker comparison:', error);
    throw error;
  }
}

// Load the model and compare two audio files (top-level await requires ESM)
const result = await compareSpeakers(
  './pyannote_embedding.onnx',
  './speaker_1.wav',
  './speaker_2.wav'
);
console.log('Similarity score:', result.similarity.toFixed(4));
console.log('Same speaker:', result.isSameSpeaker ? 'Yes' : 'No');
```
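The same cosine-similarity comparison extends naturally from one-to-one verification to one-to-many identification. The sketch below, using a hypothetical `identifySpeaker` helper (not part of this repository), scores a probe embedding against a set of enrolled embeddings and returns the best match above the threshold:

```javascript
// Cosine similarity between two embedding vectors (same formula as above).
function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Hypothetical helper: pick the enrolled speaker whose embedding is
// most similar to the probe; return null if no score clears the threshold.
function identifySpeaker(probeEmbedding, enrolledEmbeddings, threshold = 0.75) {
  let best = { name: null, similarity: -1 };
  for (const [name, embedding] of Object.entries(enrolledEmbeddings)) {
    const s = cosineSimilarity(probeEmbedding, embedding);
    if (s > best.similarity) best = { name, similarity: s };
  }
  return best.similarity > threshold ? best : { name: null, similarity: best.similarity };
}
```

In practice the enrolled embeddings would come from `compareSpeakers`-style inference runs, stored once per known speaker.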
## Implementation Details
The included code provides functionality for:
- Loading and preprocessing audio files
- Resampling audio to the required 16kHz sample rate
- Running inference with the ONNX model
- Computing cosine similarity between speaker embeddings
- Making speaker verification decisions based on a similarity threshold
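The example's resampler uses nearest-neighbor sampling, which can introduce aliasing. A slightly better drop-in, sketched here under the same signature (linear interpolation between neighboring samples; a real pipeline would add a low-pass filter before downsampling):

```javascript
// Linear-interpolation resampler: a sketch alternative to the
// nearest-neighbor resampleAudio above. Each output sample blends the
// two nearest input samples by their fractional position.
function resampleLinear(audioData, originalSampleRate, targetSampleRate) {
  const ratio = originalSampleRate / targetSampleRate;
  const newLength = Math.floor(audioData.length / ratio);
  const result = new Float32Array(newLength);
  for (let i = 0; i < newLength; i++) {
    const pos = i * ratio;
    const idx = Math.floor(pos);
    const frac = pos - idx;
    const next = Math.min(idx + 1, audioData.length - 1);
    result[i] = audioData[idx] * (1 - frac) + audioData[next] * frac;
  }
  return result;
}
```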
## Model Information

- Input: 1-second audio segment sampled at 16 kHz (tensor shape `[1, 1, 16000]` in the example)
- Output: Speaker embedding vector (dimension determined by the model architecture)
- Format: ONNX, optimized for inference
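Because the input window is fixed at one second, arbitrary-length clips must be truncated or zero-padded before inference. This is what `loadAndPreprocessAudio` does above; isolated as a helper (`fitToModelInput` is a hypothetical name, not part of the repository), the logic is:

```javascript
// Sketch: fit arbitrary-length 16 kHz audio into the model's fixed
// 1-second input window. Longer clips are truncated; shorter clips are
// zero-padded, since Float32Array is zero-initialized.
function fitToModelInput(samples, sampleRate = 16000) {
  const out = new Float32Array(sampleRate);
  out.set(samples.subarray(0, Math.min(samples.length, sampleRate)));
  return out;
}
```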