---
library_name: transformers
pipeline_tag: audio-to-audio
tags:
- speech
- audio
- voice-conversion
- speecht5
- hifigan
- speechbrain
- xvector
- sagemaker
datasets:
- cmu-arctic
base_model:
- microsoft/speecht5_vc
- microsoft/speecht5_hifigan
- speechbrain/spkrec-ecapa-voxceleb
license: other
language: en
---
# Model Card for speech-conversion

Any‑to‑any voice conversion (speech‑to‑speech) powered by Microsoft’s SpeechT5 voice‑conversion model. Convert a source utterance into the timbre of a target speaker using a short reference clip.
This model card documents the repository amirhossein-yousefi/speech-conversion, which wraps the Hugging Face implementation of SpeechT5 (voice conversion) and the matching HiFiGAN vocoder, with a lightweight training loop and optional AWS SageMaker entry points.
## Model Details

### Model Description

- Developed by: Amirhossein Yousefiramandi (repo author)
- Shared by: Amirhossein Yousefiramandi
- Model type: Speech-to-speech voice conversion using SpeechT5 (encoder–decoder) + HiFiGAN vocoder with speaker x‑vectors (ECAPA) for target voice conditioning
- Language(s): English (training examples use CMU ARCTIC)
- License: Repository does not include a license file (treat as “other”). Underlying base models on Hugging Face list MIT (see sources).
- Finetuned from model: `microsoft/speecht5_vc` (with `microsoft/speecht5_hifigan` as vocoder); speaker embeddings via `speechbrain/spkrec-ecapa-voxceleb`. A minimal end-to-end sketch follows below.
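
For orientation, here is a minimal end-to-end sketch using the plain Hugging Face and SpeechBrain APIs rather than the repository's own `scripts/convert_once.py`. Note that the stock `microsoft/speecht5_vc` checkpoint was released with 512-dimensional x-vectors from `speechbrain/spkrec-xvect-voxceleb`, so the sketch uses that encoder for compatibility; swap in `speechbrain/spkrec-ecapa-voxceleb` only for a checkpoint fine-tuned on its 192-dimensional embeddings. File paths are placeholders.

```python
import torch
import soundfile as sf
from transformers import SpeechT5Processor, SpeechT5ForSpeechToSpeech, SpeechT5HifiGan
from speechbrain.inference.speaker import EncoderClassifier

# Load the VC model, its processor, and the matching HiFiGAN vocoder.
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_vc")
model = SpeechT5ForSpeechToSpeech.from_pretrained("microsoft/speecht5_vc")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Speaker encoder producing the 512-dim x-vectors the stock checkpoint expects.
spk_encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb")

src, _ = sf.read("path/to/src.wav")  # source content, mono 16 kHz (placeholder)
ref, _ = sf.read("path/to/ref.wav")  # target-speaker reference, mono 16 kHz (placeholder)

with torch.no_grad():
    # Embed the reference clip and L2-normalize the resulting x-vector.
    emb = spk_encoder.encode_batch(torch.tensor(ref, dtype=torch.float32).unsqueeze(0))
    emb = torch.nn.functional.normalize(emb, dim=-1).squeeze(1)  # shape: (1, 512)

    inputs = processor(audio=src, sampling_rate=16000, return_tensors="pt")
    speech = model.generate_speech(inputs["input_values"], speaker_embeddings=emb, vocoder=vocoder)

sf.write("converted.wav", speech.numpy(), 16000)
```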
### Model Sources

- Repository: https://github.com/amirhossein-yousefi/speech-conversion
- Base model card: https://huggingface.co/microsoft/speecht5_vc
- Paper (SpeechT5): https://arxiv.org/abs/2110.07205
## Training Hardware & Environment

- Device: Laptop (Windows, WDDM driver model)
- GPU: NVIDIA GeForce RTX 3080 Ti Laptop GPU (16 GB VRAM)
- Driver: 576.52
- CUDA (driver): 12.9
- PyTorch: 2.8.0+cu129
- CUDA available: ✅ (a quick check is sketched below)
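
A quick way to confirm the same environment details (the outputs shown are from this machine; yours will differ):

```python
import torch

print(torch.__version__)          # e.g. 2.8.0+cu129
print(torch.cuda.is_available())  # True when a CUDA GPU is visible
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. NVIDIA GeForce RTX 3080 Ti Laptop GPU
```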
## Training Logs & Metrics

- Total FLOPs (training): 1,571,716,275,216,842,800
- Training runtime: 1,688.2899 seconds
- Hardware Type: Single NVIDIA GeForce RTX 3080 Ti Laptop GPU
## Uses

### Direct Use

- Voice conversion (VC): Given a source speech clip (content) and a short reference clip of the target speaker, synthesize the source content in the target’s voice. Intended for prototyping, research demos, and educational exploration.
### Downstream Use

- Fine‑tuning VC on paired speakers (e.g., CMU ARCTIC pairs) for improved similarity.
- Building hosted inference services (e.g., an AWS SageMaker real‑time endpoint) using the provided handlers; a deployment sketch follows below.
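
As a rough illustration of that hosted path, the sketch below uses the SageMaker Python SDK's `HuggingFaceModel`. The bucket, role ARN, and container versions are placeholders/assumptions; the repository's own entry-point scripts define the actual packaging and request handling.

```python
from sagemaker.huggingface import HuggingFaceModel

# Hypothetical deployment sketch: all names, paths, and container versions
# below are placeholders -- align them with the repository's packaging and
# with an available Hugging Face Deep Learning Container.
model = HuggingFaceModel(
    model_data="s3://<your-bucket>/speech-conversion/model.tar.gz",  # placeholder
    role="<your-sagemaker-execution-role-arn>",                      # placeholder
    transformers_version="4.37",  # assumption: pick an available DLC version
    pytorch_version="2.1",
    py_version="py310",
)

predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.xlarge")
```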
### Out‑of‑Scope Use

- Any application that impersonates individuals without explicit consent (e.g., fraud, deepfakes).
- Safety‑critical domains where audio identity must not be spoofed (e.g., voice‑based authentication).
## Bias, Risks, and Limitations

- Identity & consent: Converting into a target speaker’s timbre can be misused for impersonation. Always obtain informed consent from target speakers.
- Dataset coverage: CMU ARCTIC is studio‑quality North American English; performance may degrade on other accents, languages, or noisy conditions.
- Artifacts & intelligibility: Conversion quality depends on model checkpoint quality and speaker embedding robustness; artifacts may appear for out‑of‑domain inputs or poor reference audio.
### Recommendations

- Keep reference audio mono, 16 kHz, clean, and at least a few seconds long (see the resampling sketch after this list).
- Use GPU for real‑time or faster‑than‑real‑time conversion.
- Consider adding post‑filters (denoising, loudness normalization) for production use.
- Obtain explicit consent and disclose synthesized audio where appropriate.
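
To meet the mono 16 kHz requirement from the first item above, a small preprocessing sketch (assuming torchaudio, which is not in the requirement list below):

```python
import torchaudio

# Downmix to mono and resample the reference clip to 16 kHz.
wav, sr = torchaudio.load("ref_any.wav")  # (channels, time)
wav = wav.mean(dim=0, keepdim=True)       # stereo -> mono
wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=16000)
torchaudio.save("ref_16k.wav", wav, 16000)
```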
## How to Get Started with the Model

### CLI (local inference)

```bash
# Requirements (Python 3.10+)
pip install "transformers>=4.42" "datasets>=2.20" "torch>=2.1" \
  "numpy>=1.24" "sentencepiece>=0.1.99" "protobuf>=4.23" \
  "speechbrain>=1.0.0" soundfile

# One-shot conversion (mono 16 kHz WAVs)
python scripts/convert_once.py \
  --checkpoint microsoft/speecht5_vc \
  --src path/to/src.wav \
  --ref path/to/ref.wav \
  --out converted.wav
```