---
library_name: transformers
pipeline_tag: audio-to-audio
tags:
- speech
- audio
- voice-conversion
- speecht5
- hifigan
- speechbrain
- xvector
- sagemaker
datasets:
- cmu-arctic
base_model:
- microsoft/speecht5_vc
- microsoft/speecht5_hifigan
- speechbrain/spkrec-ecapa-voxceleb
license: other
language: en
---

# Model Card for `speech-conversion`

> Any‑to‑any **voice conversion** (speech‑to‑speech) powered by Microsoft’s SpeechT5 voice‑conversion model. Convert a source utterance into the timbre of a target speaker using a short reference clip.

This model card documents the repository **amirhossein-yousefi/speech-conversion**, which wraps the Hugging Face implementation of **SpeechT5 (voice conversion)** and the matching **HiFiGAN** vocoder, with a lightweight training loop and optional AWS SageMaker entry points.

## Model Details

### Model Description

- **Developed by:** Amirhossein Yousefiramandi (repo author)
- **Shared by:** Amirhossein Yousefiramandi
- **Model type:** Speech-to-speech **voice conversion** using SpeechT5 (encoder–decoder) + HiFiGAN vocoder with **speaker x‑vectors** (ECAPA) for target‑voice conditioning
- **Language(s):** English (training examples use CMU ARCTIC)
- **License:** The repository does not include a license file (treat as “other”); the underlying base models on Hugging Face are listed as **MIT** (see sources).
- **Finetuned from model:** `microsoft/speecht5_vc` (with `microsoft/speecht5_hifigan` as vocoder); speaker embeddings via `speechbrain/spkrec-ecapa-voxceleb`

### Model Sources

- **Repository:** https://github.com/amirhossein-yousefi/speech-conversion
- **Base model card:** https://huggingface.co/microsoft/speecht5_vc
- **Paper (SpeechT5):** https://arxiv.org/abs/2110.07205

## Training Hardware & Environment

- **Device:** Laptop (Windows, WDDM driver model)
- **GPU:** NVIDIA GeForce **RTX 3080 Ti Laptop GPU** (16 GB VRAM)
- **Driver:** 576.52
- **CUDA (driver):** 12.9
- **PyTorch:** 2.8.0+cu129
- **CUDA available:** ✅

## Training Logs & Metrics

- **Total FLOPs (training):** `1,571,716,275,216,842,800`
- **Training runtime:** `1,688.2899` seconds
- **Hardware Type:** Single NVIDIA GeForce RTX 3080 Ti Laptop GPU

## Uses

### Direct Use

- **Voice conversion (VC):** Given a **source** speech clip (content) and a short **reference** clip of the **target** speaker, synthesize the source content in the target’s voice. Intended for prototyping, research demos, and educational exploration.

### Downstream Use

- **Fine‑tuning VC** on paired speakers (e.g., CMU ARCTIC pairs) for improved similarity.
- Building **hosted inference** services (e.g., an AWS SageMaker real‑time endpoint) using the provided handlers.

### Out‑of‑Scope Use

- Any application that **impersonates** individuals without their explicit consent (e.g., fraud, deepfakes).
- Safety‑critical domains where audio identity must not be spoofed (e.g., voice‑based authentication).

## Bias, Risks, and Limitations

- **Identity & consent:** Converting into a target speaker’s timbre can be misused for impersonation. Always obtain **informed consent** from target speakers.
- **Dataset coverage:** CMU ARCTIC is studio‑quality North American English; performance may degrade on other accents, languages, or noisy conditions.
- **Artifacts & intelligibility:** Conversion quality depends on checkpoint quality and speaker‑embedding robustness; artifacts may appear for out‑of‑domain inputs or poor reference audio.

### Recommendations

- Keep reference audio **mono 16 kHz**, clean, and at least a few seconds long.
- Use a **GPU** for real‑time or faster‑than‑real‑time conversion.
- Consider adding **post‑filters** (denoising, loudness normalization) for production use.
- Obtain explicit **consent** and disclose synthesized audio where appropriate.

## How to Get Started with the Model

### CLI (local inference)

```bash
# Requirements (Python 3.10+)
pip install "transformers>=4.42" "datasets>=2.20" "torch>=2.1" \
  "numpy>=1.24" "sentencepiece>=0.1.99" "protobuf>=4.23" \
  "speechbrain>=1.0.0" soundfile

# One‑shot conversion (mono 16 kHz WAVs)
python scripts/convert_once.py \
  --checkpoint microsoft/speecht5_vc \
  --src path/to/src.wav \
  --ref path/to/ref.wav \
  --out converted.wav
```
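### Python (programmatic inference)

If you would rather call the underlying models directly than go through `scripts/convert_once.py`, the sketch below assembles the same pipeline from the public base checkpoints. One caveat: the stock `microsoft/speecht5_vc` checkpoint expects 512‑dimensional x‑vector speaker embeddings, while `speechbrain/spkrec-ecapa-voxceleb` (listed above) produces 192‑dimensional ECAPA embeddings, so this sketch substitutes `speechbrain/spkrec-xvect-voxceleb` for compatibility; the repo’s own handler may bridge this differently. File paths are placeholders, and `torchaudio` is pulled in as a SpeechBrain dependency.

```python
import torch
import torchaudio
import soundfile as sf
from transformers import SpeechT5Processor, SpeechT5ForSpeechToSpeech, SpeechT5HifiGan
from speechbrain.inference.speaker import EncoderClassifier

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the VC model, its processor, and the HiFiGAN vocoder.
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_vc")
model = SpeechT5ForSpeechToSpeech.from_pretrained("microsoft/speecht5_vc").to(device)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan").to(device)

# Speaker encoder: spkrec-xvect-voxceleb yields the 512-dim x-vectors the stock
# checkpoint was trained with (ECAPA embeddings are 192-dim and would not fit).
spk_encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb")

def load_mono_16k(path: str) -> torch.Tensor:
    """Load audio, downmix to mono, and resample to 16 kHz."""
    wav, sr = torchaudio.load(path)
    wav = wav.mean(dim=0, keepdim=True)  # downmix channels
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    return wav

src = load_mono_16k("path/to/src.wav")  # content to convert
ref = load_mono_16k("path/to/ref.wav")  # target-speaker reference

with torch.no_grad():
    # x-vector of the target speaker, L2-normalized as in the SpeechT5 examples.
    spk_emb = spk_encoder.encode_batch(ref)               # (1, 1, 512)
    spk_emb = torch.nn.functional.normalize(spk_emb, dim=2).squeeze(1).to(device)

    inputs = processor(audio=src.squeeze(0).numpy(), sampling_rate=16000,
                       return_tensors="pt")
    speech = model.generate_speech(
        inputs["input_values"].to(device), spk_emb, vocoder=vocoder
    )

sf.write("converted.wav", speech.cpu().numpy(), samplerate=16000)
```

The helper enforces the mono 16 kHz recommendation above; on a CUDA device this typically runs faster than real time.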
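### Optional post‑filtering

The recommendations above suggest loudness normalization as a production post‑filter. A minimal sketch follows, using the third‑party `pyloudnorm` package (not a repo dependency) and a -23 LUFS target; both the package choice and the target level are assumptions for illustration.

```python
import soundfile as sf
import pyloudnorm as pyln  # pip install pyloudnorm (assumption: not a repo dependency)

data, rate = sf.read("converted.wav")        # float ndarray, 16 kHz

meter = pyln.Meter(rate)                     # ITU-R BS.1770 loudness meter
loudness = meter.integrated_loudness(data)   # measure current loudness (LUFS)

# Normalize to -23 LUFS (EBU R128-style target; tune for your deployment).
normalized = pyln.normalize.loudness(data, loudness, -23.0)
sf.write("converted_norm.wav", normalized, rate)
```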
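### Hosted inference (SageMaker)

The repository advertises SageMaker entry points, but the request/response schema is defined by its handler code and is not documented here. The snippet below is only a `boto3` sketch under the assumption of a JSON body carrying base64‑encoded WAVs and a raw‑audio response; the endpoint name and payload keys are hypothetical.

```python
import base64
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

def b64_wav(path: str) -> str:
    """Read a WAV file and base64-encode it for a JSON payload."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")

# HYPOTHETICAL payload: the real schema is set by the repo's inference handler.
payload = {"source": b64_wav("src.wav"), "reference": b64_wav("ref.wav")}

resp = runtime.invoke_endpoint(
    EndpointName="speech-conversion-endpoint",  # hypothetical endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)

# Assumes the handler streams the converted audio back as raw WAV bytes.
with open("converted.wav", "wb") as f:
    f.write(resp["Body"].read())
```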