---
library_name: transformers
pipeline_tag: audio-to-audio
tags:
- speech
- audio
- voice-conversion
- speecht5
- hifigan
- speechbrain
- xvector
- sagemaker
datasets:
- cmu-arctic
base_model:
- microsoft/speecht5_vc
- microsoft/speecht5_hifigan
- speechbrain/spkrec-ecapa-voxceleb
license: other
language: en
---

# Model Card for `speech-conversion`

> Any‑to‑any **voice conversion** (speech‑to‑speech) powered by Microsoft’s SpeechT5 voice‑conversion model. Convert a source utterance into the timbre of a target speaker using a short reference clip.

This model card documents the repository **amirhossein-yousefi/speech-conversion**, which wraps the Hugging Face implementation of **SpeechT5 (voice conversion)** and the matching **HiFiGAN** vocoder, with a lightweight training loop and optional AWS SageMaker entry points.

## Model Details

### Model Description
- **Developed by:** Amirhossein Yousefiramandi (repo author)
- **Shared by:** Amirhossein Yousefiramandi
- **Model type:** Speech-to-speech **voice conversion** using SpeechT5 (encoder–decoder) + HiFiGAN vocoder, with **speaker x‑vectors** (ECAPA) for target‑voice conditioning; see the embedding sketch after this list
- **Language(s):** English (training examples use CMU ARCTIC)
- **License:** The repository does not include a license file (treat as “other”); the underlying base models on Hugging Face are listed as **MIT** (see sources)
- **Finetuned from model:** `microsoft/speecht5_vc` (with `microsoft/speecht5_hifigan` as vocoder); speaker embeddings via `speechbrain/spkrec-ecapa-voxceleb`
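
For context, here is a minimal, hedged sketch of extracting a target‑speaker embedding with SpeechBrain’s ECAPA encoder. The class and checkpoint names below are SpeechBrain’s public API; `torchaudio` for file I/O is an assumption, and the repo’s own helpers may wrap this differently.

```python
import torch
import torchaudio  # assumption: torchaudio available for audio I/O
from speechbrain.inference.speaker import EncoderClassifier

# ECAPA-TDNN speaker encoder listed in this card's front matter.
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    run_opts={"device": "cuda" if torch.cuda.is_available() else "cpu"},
)

# Reference clip of the target speaker (mono 16 kHz, a few seconds).
wav, sr = torchaudio.load("ref.wav")  # shape: (channels, samples)
assert sr == 16000, "resample the reference to 16 kHz first"

with torch.no_grad():
    # encode_batch returns (batch, 1, emb_dim); ECAPA embeddings are 192-dim.
    emb = encoder.encode_batch(wav)
    speaker_embedding = torch.nn.functional.normalize(emb.squeeze(1), dim=-1)
```

Note that the stock `microsoft/speecht5_vc` examples condition on 512‑dimensional x‑vectors from `speechbrain/spkrec-xvect-voxceleb`; if you pass ECAPA’s 192‑dimensional embeddings directly, make sure the dimensionality matches what your checkpoint expects.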

### Model Sources
- **Repository:** https://github.com/amirhossein-yousefi/speech-conversion
- **Base model card:** https://huggingface.co/microsoft/speecht5_vc
- **Paper (SpeechT5):** https://arxiv.org/abs/2110.07205

## Uses

### Direct Use
- **Voice conversion (VC):** Given a **source** speech clip (content) and a short **reference** clip of the **target** speaker, synthesize the source content in the target’s voice. Intended for prototyping, research demos, and educational exploration.

### Downstream Use
- **Fine‑tuning VC** on paired speakers (e.g., CMU ARCTIC pairs) for improved similarity.
- Building **hosted inference** services (e.g., an AWS SageMaker real‑time endpoint) using the provided handlers; see the deployment sketch after this list.
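
For the hosted‑inference route, here is a hedged sketch using the SageMaker Hugging Face SDK. The S3 path, execution role, container versions, and instance type are all placeholder assumptions; the repo’s own entry points and handler layout may differ.

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

role = sagemaker.get_execution_role()  # assumes this runs inside SageMaker

# model.tar.gz is assumed to bundle the checkpoint plus the repo's inference handler.
model = HuggingFaceModel(
    model_data="s3://your-bucket/speech-conversion/model.tar.gz",  # placeholder
    role=role,
    transformers_version="4.37",  # pick a DLC combination available in your region
    pytorch_version="2.1",
    py_version="py310",
)

# A GPU instance keeps conversion near real time; CPU works for offline batches.
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.xlarge")
```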

### Out‑of‑Scope Use
- Any application that **impersonates** individuals without explicit consent (e.g., fraud, deepfakes).
- Safety‑critical domains where audio identity must not be spoofed (e.g., voice‑based authentication).

## Bias, Risks, and Limitations
- **Identity & consent:** Converting into a target speaker’s timbre can be misused for impersonation. Always obtain **informed consent** from target speakers.
- **Dataset coverage:** CMU ARCTIC is studio‑quality North American English; performance may degrade on other accents, languages, or noisy conditions.
- **Artifacts & intelligibility:** Conversion quality depends on the checkpoint and on the robustness of the speaker embeddings; artifacts may appear for out‑of‑domain inputs or poor reference audio.

### Recommendations
- Keep reference audio **mono 16 kHz**, clean, and at least a few seconds long; see the preparation sketch after this list.
- Use a **GPU** for real‑time or faster‑than‑real‑time conversion.
- Consider adding **post‑filters** (denoising, loudness normalization) for production use.
- Obtain explicit **consent** from target speakers and disclose synthesized audio where appropriate.
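
For the first and third recommendations, here is a small, hedged preparation sketch: it downmixes to mono, resamples to 16 kHz, and peak‑normalizes. It uses `soundfile` and `numpy` (already in this repo’s requirements) plus `scipy`, which is an assumption; the repo may ship its own preprocessing.

```python
import numpy as np
import soundfile as sf
from scipy.signal import resample_poly  # assumption: scipy is installed

def prepare_audio(in_path: str, out_path: str, target_sr: int = 16000) -> None:
    """Downmix to mono, resample to 16 kHz, and peak-normalize a clip."""
    audio, sr = sf.read(in_path, dtype="float32")
    if audio.ndim > 1:                  # (frames, channels) -> mono
        audio = audio.mean(axis=1)
    if sr != target_sr:                 # rational-factor polyphase resampling
        g = np.gcd(sr, target_sr)
        audio = resample_poly(audio, target_sr // g, sr // g)
    peak = float(np.max(np.abs(audio)))
    if peak > 0.0:                      # simple peak guard, not LUFS loudness
        audio = 0.95 * audio / peak
    sf.write(out_path, audio, target_sr)

prepare_audio("ref_raw.wav", "ref.wav")  # hypothetical file names
```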

## How to Get Started with the Model

### CLI (local inference)
```bash
# Requirements (Python 3.10+)
pip install "transformers>=4.42" "datasets>=2.20" "torch>=2.1" \
    "numpy>=1.24" "sentencepiece>=0.1.99" "protobuf>=4.23" \
    "speechbrain>=1.0.0" soundfile

# One‑shot conversion (mono 16 kHz WAVs)
python scripts/convert_once.py \
  --checkpoint microsoft/speecht5_vc \
  --src path/to/src.wav \
  --ref path/to/ref.wav \
  --out converted.wav
```
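
### Python (programmatic)

The same conversion can be run directly against the public `transformers` SpeechT5 API. This is a hedged sketch rather than the repo’s own script: the file names are placeholders, and it conditions on the 512‑dimensional x‑vectors from `speechbrain/spkrec-xvect-voxceleb` used in the stock SpeechT5 examples (this card’s front matter lists ECAPA, so adapt to the repo’s actual pipeline).

```python
import soundfile as sf
import torch
from speechbrain.inference.speaker import EncoderClassifier
from transformers import SpeechT5ForSpeechToSpeech, SpeechT5HifiGan, SpeechT5Processor

# Voice-conversion model, vocoder, and a 512-dim x-vector speaker encoder.
processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_vc")
model = SpeechT5ForSpeechToSpeech.from_pretrained("microsoft/speecht5_vc")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
spk_encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb")

src, _ = sf.read("src.wav", dtype="float32")  # source content, mono 16 kHz
ref, _ = sf.read("ref.wav", dtype="float32")  # target-speaker reference, mono 16 kHz

inputs = processor(audio=src, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    # (1, 1, 512) -> L2-normalized (1, 512) speaker embedding.
    emb = spk_encoder.encode_batch(torch.from_numpy(ref).unsqueeze(0))
    speaker_embeddings = torch.nn.functional.normalize(emb.squeeze(1), dim=-1)
    speech = model.generate_speech(
        inputs["input_values"], speaker_embeddings, vocoder=vocoder
    )

sf.write("converted.wav", speech.numpy(), 16000)
```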