---
library_name: transformers
pipeline_tag: audio-to-audio
tags:
- speech
- audio
- voice-conversion
- speecht5
- hifigan
- speechbrain
- xvector
- sagemaker
datasets:
- cmu-arctic
base_model:
- microsoft/speecht5_vc
- microsoft/speecht5_hifigan
- speechbrain/spkrec-ecapa-voxceleb
license: other
language: en
---
# Model Card for speech-conversion

Any‑to‑any voice conversion (speech‑to‑speech) powered by Microsoft’s SpeechT5 voice‑conversion model. Convert a source utterance into the timbre of a target speaker using a short reference clip.
This model card documents the repository amirhossein-yousefi/speech-conversion, which wraps the Hugging Face implementation of SpeechT5 (voice conversion) and the matching HiFiGAN vocoder, with a lightweight training loop and optional AWS SageMaker entry points.
## Model Details

### Model Description

- Developed by: Amirhossein Yousefiramandi (repo author)
- Shared by: Amirhossein Yousefiramandi
- Model type: Speech-to-speech voice conversion using SpeechT5 (encoder–decoder) + HiFiGAN vocoder with speaker x‑vectors (ECAPA) for target voice conditioning
- Language(s): English (training examples use CMU ARCTIC)
- License: Repository does not include a license file (treat as “other”). Underlying base models on Hugging Face list MIT (see sources).
- Finetuned from model: `microsoft/speecht5_vc` (with `microsoft/speecht5_hifigan` as vocoder); speaker embeddings via `speechbrain/spkrec-ecapa-voxceleb`. A minimal end-to-end sketch follows below.
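
For orientation, here is a minimal end-to-end sketch using the plain Hugging Face and SpeechBrain APIs rather than the repository's own `scripts/convert_once.py`. Note that the stock `microsoft/speecht5_vc` checkpoint was released with 512-dimensional x-vectors from `speechbrain/spkrec-xvect-voxceleb`, so the sketch uses that encoder for compatibility; swap in `speechbrain/spkrec-ecapa-voxceleb` only for a checkpoint fine-tuned on its 192-dimensional embeddings. File paths are placeholders.

```python
import torch
import soundfile as sf
from transformers import SpeechT5Processor, SpeechT5ForSpeechToSpeech, SpeechT5HifiGan
from speechbrain.inference.speaker import EncoderClassifier

# Load the VC model, its processor, and the matching HiFiGAN vocoder.
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_vc")
model = SpeechT5ForSpeechToSpeech.from_pretrained("microsoft/speecht5_vc")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Speaker encoder producing the 512-dim x-vectors the stock checkpoint expects.
spk_encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb")

src, _ = sf.read("path/to/src.wav")  # source content, mono 16 kHz (placeholder)
ref, _ = sf.read("path/to/ref.wav")  # target-speaker reference, mono 16 kHz (placeholder)

with torch.no_grad():
    # Embed the reference clip and L2-normalize the resulting x-vector.
    emb = spk_encoder.encode_batch(torch.tensor(ref, dtype=torch.float32).unsqueeze(0))
    emb = torch.nn.functional.normalize(emb, dim=-1).squeeze(1)  # shape: (1, 512)

    inputs = processor(audio=src, sampling_rate=16000, return_tensors="pt")
    speech = model.generate_speech(inputs["input_values"], speaker_embeddings=emb, vocoder=vocoder)

sf.write("converted.wav", speech.numpy(), 16000)
```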
### Model Sources

- Repository: https://github.com/amirhossein-yousefi/speech-conversion
- Base model card: https://huggingface.co/microsoft/speecht5_vc
- Paper (SpeechT5): https://arxiv.org/abs/2110.07205
## Training Hardware & Environment

- Device: Laptop (Windows, WDDM driver model)
- GPU: NVIDIA GeForce RTX 3080 Ti Laptop GPU (16 GB VRAM)
- Driver: 576.52
- CUDA (driver): 12.9
- PyTorch: 2.8.0+cu129
- CUDA available: ✅ (a quick check is sketched below)
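
A quick way to confirm the same environment details (the outputs shown are from this machine; yours will differ):

```python
import torch

print(torch.__version__)          # e.g. 2.8.0+cu129
print(torch.cuda.is_available())  # True when a CUDA GPU is visible
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. NVIDIA GeForce RTX 3080 Ti Laptop GPU
```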
## Training Logs & Metrics

- Total FLOPs (training): 1,571,716,275,216,842,800
- Training runtime: 1,688.2899 seconds
- Hardware Type: Single NVIDIA GeForce RTX 3080 Ti Laptop GPU
## Uses

### Direct Use

- Voice conversion (VC): Given a source speech clip (content) and a short reference clip of the target speaker, synthesize the source content in the target’s voice. Intended for prototyping, research demos, and educational exploration.
### Downstream Use

- Fine‑tuning VC on paired speakers (e.g., CMU ARCTIC pairs) for improved similarity.
- Building hosted inference services (e.g., an AWS SageMaker real‑time endpoint) using the provided handlers; a deployment sketch follows below.
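
As a rough illustration of that hosted path, the sketch below uses the SageMaker Python SDK's `HuggingFaceModel`. The bucket, role ARN, and container versions are placeholders/assumptions; the repository's own entry-point scripts define the actual packaging and request handling.

```python
from sagemaker.huggingface import HuggingFaceModel

# Hypothetical deployment sketch: all names, paths, and container versions
# below are placeholders -- align them with the repository's packaging and
# with an available Hugging Face Deep Learning Container.
model = HuggingFaceModel(
    model_data="s3://<your-bucket>/speech-conversion/model.tar.gz",  # placeholder
    role="<your-sagemaker-execution-role-arn>",                      # placeholder
    transformers_version="4.37",  # assumption: pick an available DLC version
    pytorch_version="2.1",
    py_version="py310",
)

predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.xlarge")
```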
### Out‑of‑Scope Use

- Any application that impersonates individuals without explicit consent (e.g., fraud, deepfakes).
- Safety‑critical domains where audio identity must not be spoofed (e.g., voice‑based authentication).
## Bias, Risks, and Limitations

- Identity & consent: Converting into a target speaker’s timbre can be misused for impersonation. Always obtain informed consent from target speakers.
- Dataset coverage: CMU ARCTIC is studio‑quality North American English; performance may degrade on other accents, languages, or noisy conditions.
- Artifacts & intelligibility: Conversion quality depends on model checkpoint quality and speaker embedding robustness; artifacts may appear for out‑of‑domain inputs or poor reference audio.
### Recommendations

- Keep reference audio mono, 16 kHz, clean, and at least a few seconds long (see the resampling sketch after this list).
- Use GPU for real‑time or faster‑than‑real‑time conversion.
- Consider adding post‑filters (denoising, loudness normalization) for production use.
- Obtain explicit consent and disclose synthesized audio where appropriate.
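
To meet the mono 16 kHz requirement from the first item above, a small preprocessing sketch (assuming torchaudio, which is not in the requirement list below):

```python
import torchaudio

# Downmix to mono and resample the reference clip to 16 kHz.
wav, sr = torchaudio.load("ref_any.wav")  # (channels, time)
wav = wav.mean(dim=0, keepdim=True)       # stereo -> mono
wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=16000)
torchaudio.save("ref_16k.wav", wav, 16000)
```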
## How to Get Started with the Model

### CLI (local inference)

```bash
# Requirements (Python 3.10+)
pip install "transformers>=4.42" "datasets>=2.20" "torch>=2.1" \
  "numpy>=1.24" "sentencepiece>=0.1.99" "protobuf>=4.23" \
  "speechbrain>=1.0.0" soundfile

# One-shot conversion (mono 16 kHz WAVs)
python scripts/convert_once.py \
  --checkpoint microsoft/speecht5_vc \
  --src path/to/src.wav \
  --ref path/to/ref.wav \
  --out converted.wav
```