---
library_name: transformers
pipeline_tag: audio-to-audio
tags:
- speech
- audio
- voice-conversion
- speecht5
- hifigan
- speechbrain
- xvector
- sagemaker
datasets:
- cmu-arctic
base_model:
- microsoft/speecht5_vc
- microsoft/speecht5_hifigan
- speechbrain/spkrec-ecapa-voxceleb
license: other
language: en
---
# Model Card for `speech-conversion`
> Any‑to‑any **voice conversion** (speech‑to‑speech) powered by Microsoft’s SpeechT5 voice‑conversion model. Convert a source utterance into the timbre of a target speaker using a short reference clip.

This model card documents the repository **amirhossein-yousefi/speech-conversion**, which wraps the Hugging Face implementation of **SpeechT5 (voice conversion)** and the matching **HiFiGAN** vocoder, with a lightweight training loop and optional AWS SageMaker entry points.
## Model Details
### Model Description
- **Developed by:** Amirhossein Yousefiramandi (repo author)
- **Shared by:** Amirhossein Yousefiramandi
- **Model type:** Speech-to-speech **voice conversion** using SpeechT5 (encoder–decoder) + HiFiGAN vocoder with **speaker x‑vectors** (ECAPA) for target voice conditioning
- **Language(s):** English (training examples use CMU ARCTIC)
- **License:** The repository does not include a license file (treated here as “other”). The underlying base models on Hugging Face are listed as **MIT** (see sources).
- **Finetuned from model:** `microsoft/speecht5_vc` (with `microsoft/speecht5_hifigan` as vocoder); speaker embeddings via `speechbrain/spkrec-ecapa-voxceleb`
### Model Sources
- **Repository:** https://github.com/amirhossein-yousefi/speech-conversion
- **Base model card:** https://huggingface.co/microsoft/speecht5_vc
- **Paper (SpeechT5):** https://arxiv.org/abs/2110.07205
## Training Hardware & Environment
- **Device:** Laptop (Windows, WDDM driver model)
- **GPU:** NVIDIA GeForce **RTX 3080 Ti Laptop GPU** (16 GB VRAM)
- **Driver:** 576.52
- **CUDA (driver):** 12.9
- **PyTorch:** 2.8.0+cu129
- **CUDA available:** ✅
## Training Logs & Metrics
- **Total FLOPs (training):** `1,571,716,275,216,842,800`
- **Training runtime:** `1,688.2899` seconds
- **Hardware Type:** Single NVIDIA GeForce RTX 3080 Ti Laptop GPU
## Uses
### Direct Use
- **Voice conversion (VC):** Given a **source** speech clip (content) and a short **reference** clip of the **target** speaker, synthesize the source content in the target’s voice. Intended for prototyping, research demos, and educational exploration.
### Downstream Use
- **Fine‑tuning VC** on paired speakers (e.g., CMU ARCTIC pairs) for improved similarity.
- Building **hosted inference** services (e.g., an AWS SageMaker real‑time endpoint) using the provided handlers; see the deployment sketch below.
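The provided handlers can be hosted behind a SageMaker real‑time endpoint. Below is a minimal deployment sketch using the SageMaker Python SDK's `HuggingFaceModel`; the S3 path, IAM role, and container versions are placeholder assumptions (pick a combination that matches an available Hugging Face DLC), and the request/response payload format depends on the repository's own inference handler.

```python
# Hedged sketch: host the packaged model as a SageMaker real-time endpoint.
# `model.tar.gz` is assumed to contain the checkpoint plus the repository's
# inference code under code/; bucket, role, and versions are illustrative.
from sagemaker.huggingface import HuggingFaceModel

hf_model = HuggingFaceModel(
    model_data="s3://<your-bucket>/speech-conversion/model.tar.gz",
    role="<your-sagemaker-execution-role-arn>",
    transformers_version="4.37",   # adjust to an available DLC combination
    pytorch_version="2.1",
    py_version="py310",
)

predictor = hf_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g4dn.xlarge",  # GPU instance for (near) real-time conversion
)
# predictor.predict(...) — the payload schema is defined by the repository's handler.
```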
### Out‑of‑Scope Use
- Any application that **impersonates** individuals without explicit consent (e.g., fraud, deepfakes).
- Safety‑critical domains where audio identity must not be spoofed (e.g., voice‑based authentication).
## Bias, Risks, and Limitations
- **Identity & consent:** Converting into a target speaker’s timbre can be misused for impersonation. Always obtain **informed consent** from target speakers.
- **Dataset coverage:** CMU ARCTIC is studio‑quality North American English; performance may degrade on other accents, languages, or noisy conditions.
- **Artifacts & intelligibility:** Conversion quality depends on model checkpoint quality and speaker embedding robustness; artifacts may appear for out‑of‑domain inputs or poor reference audio.
### Recommendations
- Keep reference audio **mono 16 kHz**, clean, and at least a few seconds long (see the preprocessing sketch after this list).
- Use **GPU** for real‑time or faster‑than‑real‑time conversion.
- Consider adding **post‑filters** (denoising, loudness normalization) for production use.
- Obtain explicit **consent** and disclose synthesized audio where appropriate.
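Following the recommendations above, here is a minimal preprocessing sketch (assuming `torchaudio`; any resampler such as librosa or ffmpeg works equally well) that downmixes to mono, resamples to 16 kHz, and peak‑normalizes:

```python
# Sketch: prepare a source/reference clip as mono 16 kHz, peak-normalized audio.
import torch
import torchaudio

def prepare_wav(path: str, target_sr: int = 16000) -> torch.Tensor:
    wav, sr = torchaudio.load(path)                  # (channels, samples)
    wav = wav.mean(dim=0, keepdim=True)              # downmix to mono
    if sr != target_sr:
        wav = torchaudio.functional.resample(wav, sr, target_sr)
    wav = wav / wav.abs().max().clamp(min=1e-8)      # simple peak normalization
    return wav.squeeze(0)                            # 1-D tensor at 16 kHz

ref = prepare_wav("path/to/ref.wav")  # hypothetical path
```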
## How to Get Started with the Model
### CLI (local inference)
```bash
# Requirements (Python 3.10+)
pip install "transformers>=4.42" "datasets>=2.20" "torch>=2.1" \
"numpy>=1.24" "sentencepiece>=0.1.99" "protobuf>=4.23" \
"speechbrain>=1.0.0" soundfile
# One‑shot conversion (mono 16 kHz WAVs)
python scripts/convert_once.py \
--checkpoint microsoft/speecht5_vc \
--src path/to/src.wav \
--ref path/to/ref.wav \
--out converted.wav
```
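For programmatic use, the following is a minimal Python sketch of the same one‑shot conversion built directly on the stock Hugging Face checkpoints (the repository's `scripts/convert_once.py` may differ in details). Note that the stock `microsoft/speecht5_vc` checkpoint expects 512‑dimensional x‑vectors, as produced by `speechbrain/spkrec-xvect-voxceleb`; the ECAPA encoder listed above yields embeddings of a different size and is presumably paired with the repository's own fine‑tuned checkpoints.

```python
# Hedged sketch: one-shot voice conversion with the stock checkpoints.
# File paths are placeholders; both WAVs are assumed to be mono 16 kHz.
import torch
import soundfile as sf
from transformers import SpeechT5Processor, SpeechT5ForSpeechToSpeech, SpeechT5HifiGan
from speechbrain.inference.speaker import EncoderClassifier

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_vc")
model = SpeechT5ForSpeechToSpeech.from_pretrained("microsoft/speecht5_vc")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
spk_encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb")

src_wav, _ = sf.read("path/to/src.wav")   # source content
ref_wav, _ = sf.read("path/to/ref.wav")   # target-speaker reference

# Speaker embedding (x-vector) from the reference clip -> shape (1, 512)
with torch.no_grad():
    emb = spk_encoder.encode_batch(torch.tensor(ref_wav, dtype=torch.float32).unsqueeze(0))
    speaker_embeddings = torch.nn.functional.normalize(emb, dim=-1).squeeze(1)

# Convert the source utterance into the target speaker's voice
inputs = processor(audio=src_wav, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    speech = model.generate_speech(inputs["input_values"], speaker_embeddings, vocoder=vocoder)

sf.write("converted.wav", speech.numpy(), samplerate=16000)
```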