---
library_name: transformers
pipeline_tag: audio-to-audio
tags:
- speech
- audio
- voice-conversion
- speecht5
- hifigan
- speechbrain
- xvector
- sagemaker
datasets:
- cmu-arctic
base_model:
- microsoft/speecht5_vc
- microsoft/speecht5_hifigan
- speechbrain/spkrec-ecapa-voxceleb
license: other
language: en
---

# Model Card for `speech-conversion`

> Any‑to‑any **voice conversion** (speech‑to‑speech) powered by Microsoft’s SpeechT5 voice‑conversion model. Convert a source utterance into the timbre of a target speaker using a short reference clip.

This model card documents the repository **amirhossein-yousefi/speech-conversion**, which wraps the Hugging Face implementation of **SpeechT5 (voice conversion)** and the matching **HiFiGAN** vocoder, with a lightweight training loop and optional AWS SageMaker entry points.

## Model Details

### Model Description
- **Developed by:** Amirhossein Yousefiramandi (repo author)
- **Shared by:** Amirhossein Yousefiramandi
- **Model type:** Speech‑to‑speech **voice conversion** using SpeechT5 (encoder–decoder) + HiFiGAN vocoder, with **speaker x‑vectors** (ECAPA) for target‑voice conditioning
- **Language(s):** English (training examples use CMU ARCTIC)
- **License:** The repository does not include a license file (treat as “other”); the underlying base models on Hugging Face are listed as **MIT** (see sources).
- **Finetuned from model:** `microsoft/speecht5_vc` (with `microsoft/speecht5_hifigan` as vocoder); speaker embeddings via `speechbrain/spkrec-ecapa-voxceleb`

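The speaker conditioning described above can be sketched in code. The snippet below is an illustration, not the repository's implementation: it assumes the encoder named in this card (`speechbrain/spkrec-ecapa-voxceleb`), `torchaudio` for loading, and a local WAV path; the helper name `extract_speaker_embedding` is ours.

```python
def extract_speaker_embedding(wav_path: str):
    """Compute an L2-normalized speaker x-vector from a reference clip.

    Heavy dependencies (torch, torchaudio, speechbrain) are imported lazily,
    so the sketch can be defined without them installed.
    """
    import torch
    import torchaudio
    from speechbrain.inference.speaker import EncoderClassifier  # speechbrain >= 1.0

    classifier = EncoderClassifier.from_hparams(
        source="speechbrain/spkrec-ecapa-voxceleb"
    )
    waveform, sr = torchaudio.load(wav_path)  # shape: (channels, samples)
    if sr != 16_000:
        waveform = torchaudio.functional.resample(waveform, sr, 16_000)
    waveform = waveform.mean(dim=0, keepdim=True)  # downmix to mono
    with torch.no_grad():
        embedding = classifier.encode_batch(waveform)  # (1, 1, emb_dim)
    # L2-normalize, as is conventional for x-vector conditioning.
    embedding = torch.nn.functional.normalize(embedding, dim=-1)
    return embedding.squeeze(0)  # (1, emb_dim)
```

Note that Microsoft's published SpeechT5 examples condition on 512‑dimensional x‑vectors from `speechbrain/spkrec-xvect-voxceleb`, while ECAPA embeddings are lower‑dimensional; verify that the embedding size matches what your checkpoint expects.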
### Model Sources
- **Repository:** https://github.com/amirhossein-yousefi/speech-conversion
- **Base model card:** https://huggingface.co/microsoft/speecht5_vc
- **Paper (SpeechT5):** https://arxiv.org/abs/2110.07205

## Uses

### Direct Use
- **Voice conversion (VC):** Given a **source** speech clip (content) and a short **reference** clip of the **target** speaker, synthesize the source content in the target’s voice. Intended for prototyping, research demos, and educational exploration.

### Downstream Use
- **Fine‑tuning VC** on paired speakers (e.g., CMU ARCTIC pairs) for improved similarity.
- Building **hosted inference** services (e.g., an AWS SageMaker real‑time endpoint) using the provided handlers.

### Out‑of‑Scope Use
- Any application that **impersonates** individuals without explicit consent (e.g., fraud, deepfakes).
- Safety‑critical domains where audio identity must not be spoofed (e.g., voice‑based authentication).

## Bias, Risks, and Limitations
- **Identity & consent:** Converting into a target speaker’s timbre can be misused for impersonation. Always obtain **informed consent** from target speakers.
- **Dataset coverage:** CMU ARCTIC is studio‑quality North American English; performance may degrade on other accents, languages, or noisy conditions.
- **Artifacts & intelligibility:** Conversion quality depends on checkpoint quality and speaker‑embedding robustness; artifacts may appear for out‑of‑domain inputs or poor reference audio.

### Recommendations
- Keep reference audio **mono 16 kHz**, clean, and at least a few seconds long.
- Use a **GPU** for real‑time or faster‑than‑real‑time conversion.
- Consider adding **post‑filters** (denoising, loudness normalization) for production use.
- Obtain explicit **consent** and disclose synthesized audio where appropriate.

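The first recommendation can be enforced programmatically. Below is a minimal, dependency‑light sketch (the `to_mono_16k` helper is ours, not the repository's) that downmixes and linearly resamples a waveform with NumPy; for file I/O you would pair it with `soundfile.read` / `soundfile.write`.

```python
import numpy as np

TARGET_SR = 16_000  # sample rate expected by the conversion pipeline


def to_mono_16k(wave: np.ndarray, sr: int) -> np.ndarray:
    """Downmix to mono and linearly resample to 16 kHz.

    `wave` is float audio shaped (samples,) or (samples, channels), as
    returned by soundfile.read. Linear interpolation is a rough stand-in
    for a proper polyphase resampler (e.g. torchaudio's resample).
    """
    wave = np.asarray(wave, dtype=np.float32)
    if wave.ndim == 2:  # (samples, channels) -> mono
        wave = wave.mean(axis=1)
    if sr == TARGET_SR:
        return wave
    n_out = int(round(len(wave) * TARGET_SR / sr))
    # Map output sample positions onto the input time axis.
    x_out = np.linspace(0.0, len(wave) - 1, num=n_out)
    x_in = np.arange(len(wave))
    return np.interp(x_out, x_in, wave).astype(np.float32)
```
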
## How to Get Started with the Model

### CLI (local inference)
```bash
# Requirements (Python 3.10+)
pip install "transformers>=4.42" "datasets>=2.20" "torch>=2.1" \
  "numpy>=1.24" "sentencepiece>=0.1.99" "protobuf>=4.23" \
  "speechbrain>=1.0.0" soundfile

# One‑shot conversion (mono 16 kHz WAVs)
python scripts/convert_once.py \
  --checkpoint microsoft/speecht5_vc \
  --src path/to/src.wav \
  --ref path/to/ref.wav \
  --out converted.wav
```
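
For programmatic use, the CLI above can be approximated with the public `transformers` classes for SpeechT5 (`SpeechT5Processor`, `SpeechT5ForSpeechToSpeech`, `SpeechT5HifiGan`). The `convert_voice` helper below is an illustrative sketch, not the repository's `scripts/convert_once.py`:

```python
def convert_voice(src_wav: str, ref_wav: str, out_wav: str) -> None:
    """Convert the content of `src_wav` into the voice of `ref_wav`.

    Both inputs should be mono 16 kHz WAV files. Heavy dependencies are
    imported lazily so the sketch can be defined without them installed.
    """
    import soundfile as sf
    import torch
    from speechbrain.inference.speaker import EncoderClassifier
    from transformers import (
        SpeechT5ForSpeechToSpeech,
        SpeechT5HifiGan,
        SpeechT5Processor,
    )

    processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_vc")
    model = SpeechT5ForSpeechToSpeech.from_pretrained("microsoft/speecht5_vc")
    vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
    # This card names the ECAPA encoder; Microsoft's own examples use
    # speechbrain/spkrec-xvect-voxceleb (512-dim) -- swap if dims mismatch.
    encoder = EncoderClassifier.from_hparams(
        source="speechbrain/spkrec-ecapa-voxceleb"
    )

    src, sr = sf.read(src_wav, dtype="float32")
    ref, _ = sf.read(ref_wav, dtype="float32")
    inputs = processor(audio=src, sampling_rate=sr, return_tensors="pt")

    with torch.no_grad():
        speaker_emb = encoder.encode_batch(torch.tensor(ref).unsqueeze(0))
        speaker_emb = torch.nn.functional.normalize(speaker_emb, dim=-1).squeeze(0)
        speech = model.generate_speech(
            inputs["input_values"], speaker_emb, vocoder=vocoder
        )

    sf.write(out_wav, speech.numpy(), samplerate=16_000)
```

Lazy imports keep the helper cheap to define; the model downloads happen only on first call.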