---
library_name: transformers
pipeline_tag: audio-to-audio
tags:
  - speech
  - audio
  - voice-conversion
  - speecht5
  - hifigan
  - speechbrain
  - xvector
  - sagemaker
datasets:
  - cmu-arctic
base_model:
  - microsoft/speecht5_vc
  - microsoft/speecht5_hifigan
  - speechbrain/spkrec-ecapa-voxceleb
license: other
language: en
---

# Model Card for speech-conversion

Any‑to‑any voice conversion (speech‑to‑speech) powered by Microsoft’s SpeechT5 voice‑conversion model. Convert a source utterance into the timbre of a target speaker using a short reference clip.

This model card documents the repository amirhossein-yousefi/speech-conversion, which wraps the Hugging Face implementation of SpeechT5 (voice conversion) and the matching HiFiGAN vocoder, with a lightweight training loop and optional AWS SageMaker entry points.

## Model Details

### Model Description

- Developed by: Amirhossein Yousefiramandi (repository author)
- Shared by: Amirhossein Yousefiramandi
- Model type: Speech‑to‑speech voice conversion using SpeechT5 (encoder–decoder) + HiFiGAN vocoder, with speaker x‑vectors (ECAPA) for target‑voice conditioning (see the embedding sketch after this list)
- Language(s): English (training examples use CMU ARCTIC)
- License: The repository does not include a license file (treat as “other”). The underlying base models on Hugging Face list MIT (see Model Sources).
- Finetuned from model: microsoft/speecht5_vc (with microsoft/speecht5_hifigan as vocoder); speaker embeddings via speechbrain/spkrec-ecapa-voxceleb
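
The target voice is represented by a fixed‑size speaker embedding extracted from the reference clip. Below is a minimal sketch of that step, assuming a mono 16 kHz reference file and the speechbrain/spkrec-ecapa-voxceleb encoder listed above. Note that this encoder returns 192‑dimensional embeddings while microsoft/speecht5_vc expects 512‑dimensional speaker vectors, so the repository’s own scripts may use a different encoder or an additional projection.

```python
import torch
import soundfile as sf
from speechbrain.inference.speaker import EncoderClassifier  # speechbrain >= 1.0

# Pretrained speaker encoder named in this card (an assumption about how the
# repository extracts embeddings; adapt to the scripts/ directory if it differs).
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

# Reference audio: mono, 16 kHz (resample beforehand if needed).
wav, sr = sf.read("path/to/ref.wav", dtype="float32")
assert sr == 16_000, "expected 16 kHz reference audio"

with torch.no_grad():
    emb = encoder.encode_batch(torch.from_numpy(wav).unsqueeze(0))  # (1, 1, 192)
    emb = torch.nn.functional.normalize(emb.squeeze(1), dim=-1)     # (1, 192), L2-normalized

print(emb.shape)
```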

### Model Sources

- Repository: amirhossein-yousefi/speech-conversion (lightweight training loop, inference scripts, and optional AWS SageMaker entry points)
- Base models: microsoft/speecht5_vc, microsoft/speecht5_hifigan, speechbrain/spkrec-ecapa-voxceleb

### Training Hardware & Environment

- Device: Laptop (Windows, WDDM driver model)
- GPU: NVIDIA GeForce RTX 3080 Ti Laptop GPU (16 GB VRAM)
- Driver: 576.52
- CUDA (driver): 12.9
- PyTorch: 2.8.0+cu129
- CUDA available: Yes
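
To check whether a local environment matches the setup above, a quick sanity check (a minimal sketch; the printed values are from the training machine and will differ elsewhere):

```python
import torch

# Report the PyTorch build and CUDA visibility.
print("PyTorch:", torch.__version__)            # e.g. 2.8.0+cu129
print("CUDA build:", torch.version.cuda)        # e.g. 12.9
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))  # e.g. NVIDIA GeForce RTX 3080 Ti Laptop GPU
```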

### Training Logs & Metrics

- Total FLOPs (training): 1,571,716,275,216,842,800 (~1.57 × 10^18)
- Training runtime: 1,688.2899 seconds (~28 minutes)
- Hardware type: single NVIDIA GeForce RTX 3080 Ti Laptop GPU

## Uses

### Direct Use

- Voice conversion (VC): Given a source speech clip (content) and a short reference clip of the target speaker, synthesize the source content in the target’s voice. Intended for prototyping, research demos, and educational exploration.

### Downstream Use

- Fine‑tuning VC on paired speakers (e.g., CMU ARCTIC pairs) for improved similarity.
- Building hosted inference services (e.g., an AWS SageMaker real‑time endpoint) using the provided handlers; a hedged deployment sketch follows this list.
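
A minimal deployment sketch using the SageMaker Python SDK’s Hugging Face support. The bucket path, entry‑point name, source directory, and container versions below are placeholders and assumptions, not taken from the repository; point them at the repository’s actual handlers and a model archive you have uploaded to S3.

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # or an explicit IAM role ARN

# All names below are placeholders; adapt them to the repository's handlers.
model = HuggingFaceModel(
    role=role,
    model_data="s3://<your-bucket>/speech-conversion/model.tar.gz",  # placeholder
    entry_point="inference.py",      # assumed handler name
    source_dir="code",               # assumed directory containing the handlers
    transformers_version="4.37",     # must match an available Hugging Face DLC
    pytorch_version="2.1",
    py_version="py310",
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",  # GPU instance for real-time conversion
)
```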

### Out‑of‑Scope Use

- Any application that impersonates individuals without explicit consent (e.g., fraud, deepfakes).
- Safety‑critical domains where audio identity must not be spoofed (e.g., voice‑based authentication).

## Bias, Risks, and Limitations

- Identity & consent: Converting into a target speaker’s timbre can be misused for impersonation. Always obtain informed consent from target speakers.
- Dataset coverage: CMU ARCTIC is studio‑quality North American English; performance may degrade on other accents, languages, or noisy conditions.
- Artifacts & intelligibility: Conversion quality depends on checkpoint quality and speaker‑embedding robustness; artifacts may appear for out‑of‑domain inputs or poor reference audio.

### Recommendations

- Keep reference audio mono, 16 kHz, clean, and at least a few seconds long (see the resampling sketch below).
- Use a GPU for real‑time or faster‑than‑real‑time conversion.
- Consider adding post‑filters (denoising, loudness normalization) for production use.
- Obtain explicit consent and disclose synthesized audio where appropriate.
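
If the source or reference audio is not already mono 16 kHz, convert it first. A minimal sketch using torchaudio (an extra dependency not in the pip list below; any resampler works):

```python
import torchaudio
import torchaudio.functional as F

def to_mono_16k(in_path: str, out_path: str, target_sr: int = 16_000) -> None:
    """Downmix to mono and resample to 16 kHz, the rate SpeechT5 expects."""
    waveform, sr = torchaudio.load(in_path)         # (channels, frames)
    waveform = waveform.mean(dim=0, keepdim=True)   # downmix to mono
    if sr != target_sr:
        waveform = F.resample(waveform, orig_freq=sr, new_freq=target_sr)
    torchaudio.save(out_path, waveform, target_sr)

to_mono_16k("path/to/raw_ref.wav", "path/to/ref.wav")
```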

## How to Get Started with the Model

### CLI (local inference)

```bash
# Requirements (Python 3.10+)
pip install "transformers>=4.42" "datasets>=2.20" "torch>=2.1" \
            "numpy>=1.24" "sentencepiece>=0.1.99" "protobuf>=4.23" \
            "speechbrain>=1.0.0" soundfile

# One‑shot conversion (mono 16 kHz WAVs)
python scripts/convert_once.py \
  --checkpoint microsoft/speecht5_vc \
  --src path/to/src.wav \
  --ref path/to/ref.wav \
  --out converted.wav
```
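
### Python (programmatic inference)

For programmatic use, conversion can also be driven directly through the Transformers API. The sketch below follows the upstream microsoft/speecht5_vc usage pattern rather than the repository’s scripts; it uses speechbrain/spkrec-xvect-voxceleb, whose 512‑dimensional x‑vectors match SpeechT5’s expected speaker‑embedding size (the ECAPA encoder listed in this card produces 192‑dimensional embeddings and would need whatever handling the repository applies).

```python
import torch
import soundfile as sf
from transformers import SpeechT5Processor, SpeechT5ForSpeechToSpeech, SpeechT5HifiGan
from speechbrain.inference.speaker import EncoderClassifier

device = "cuda" if torch.cuda.is_available() else "cpu"

# SpeechT5 voice-conversion model plus HiFiGAN vocoder.
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_vc")
model = SpeechT5ForSpeechToSpeech.from_pretrained("microsoft/speecht5_vc").to(device)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan").to(device)

# 512-dim x-vector encoder used in the upstream SpeechT5 examples (see note above).
spk_encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb")

# Both clips must be mono 16 kHz.
src_wav, src_sr = sf.read("path/to/src.wav", dtype="float32")
ref_wav, ref_sr = sf.read("path/to/ref.wav", dtype="float32")
assert src_sr == ref_sr == 16_000, "expected 16 kHz mono inputs"

# Target-speaker embedding from the reference clip, L2-normalized to (1, 512).
with torch.no_grad():
    spk_emb = spk_encoder.encode_batch(torch.from_numpy(ref_wav).unsqueeze(0))
    spk_emb = torch.nn.functional.normalize(spk_emb.squeeze(1), dim=-1).to(device)

# Encode the source audio and generate the converted waveform.
inputs = processor(audio=src_wav, sampling_rate=16_000, return_tensors="pt").to(device)
with torch.no_grad():
    converted = model.generate_speech(inputs["input_values"], spk_emb, vocoder=vocoder)

sf.write("converted.wav", converted.cpu().numpy(), samplerate=16_000)
```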