|
--- |
|
library_name: transformers |
|
pipeline_tag: audio-to-audio |
|
tags: |
|
- speech |
|
- audio |
|
- voice-conversion |
|
- speecht5 |
|
- hifigan |
|
- speechbrain |
|
- xvector |
|
- sagemaker |
|
datasets: |
|
- cmu-arctic |
|
base_model: |
|
- microsoft/speecht5_vc |
|
- microsoft/speecht5_hifigan |
|
- speechbrain/spkrec-ecapa-voxceleb |
|
license: other |
|
language: en |
|
--- |
|
|
|
# Model Card for `speech-conversion` |
|
|
|
> Any‑to‑any **voice conversion** (speech‑to‑speech) powered by Microsoft’s SpeechT5 voice‑conversion model. Convert a source utterance into the timbre of a target speaker using a short reference clip. |
|
|
|
This model card documents the repository **amirhossein-yousefi/speech-conversion**, which wraps the Hugging Face implementation of **SpeechT5 (voice conversion)** and the matching **HiFiGAN** vocoder, with a lightweight training loop and optional AWS SageMaker entry points. |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
- **Developed by:** Amirhossein Yousefiramandi (repo author) |
|
- **Shared by:** Amirhossein Yousefiramandi
|
- **Model type:** Speech-to-speech **voice conversion** using SpeechT5 (encoder–decoder) + HiFiGAN vocoder, with **speaker embeddings** (ECAPA) for target-voice conditioning (see the sketch below)
|
- **Language(s):** English (training examples use CMU ARCTIC) |
|
- **License:** The repository does not include a license file (treated here as “other”); the underlying base models on Hugging Face are listed as **MIT** (see sources).
|
- **Finetuned from model:** `microsoft/speecht5_vc` (with `microsoft/speecht5_hifigan` as vocoder); speaker embeddings via `speechbrain/spkrec-ecapa-voxceleb`
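
To make the speaker-conditioning step concrete, below is a minimal sketch (not code from the repository) of extracting an ECAPA embedding with SpeechBrain. Note that the stock `microsoft/speecht5_vc` checkpoint expects 512-dim x-vectors, so using 192-dim ECAPA embeddings assumes a checkpoint whose `speaker_embedding_dim` matches.

```python
# Minimal sketch (assumes speechbrain >= 1.0): extract a 192-dim ECAPA
# speaker embedding used to condition generation on the target voice.
import torch
from speechbrain.inference.speaker import EncoderClassifier

encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")
ref_wav = torch.randn(1, 16000 * 3)  # placeholder for 3 s of mono 16 kHz reference audio
with torch.no_grad():
    emb = encoder.encode_batch(ref_wav)               # shape: (1, 1, 192)
    emb = torch.nn.functional.normalize(emb, dim=-1)  # unit-norm speaker embedding
```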
|
|
|
### Model Sources |
|
- **Repository:** https://github.com/amirhossein-yousefi/speech-conversion |
|
- **Base model card:** https://huggingface.co/microsoft/speecht5_vc |
|
- **Paper (SpeechT5):** https://arxiv.org/abs/2110.07205 |
|
|
|
## Training Hardware & Environment |
|
|
|
- **Device:** Laptop (Windows, WDDM driver model) |
|
- **GPU:** NVIDIA GeForce **RTX 3080 Ti Laptop GPU** (16 GB VRAM) |
|
- **Driver:** 576.52 |
|
- **CUDA (driver):** 12.9 |
|
- **PyTorch:** 2.8.0+cu129 |
|
- **CUDA available:** ✅ |
|
|
|
## Training Logs & Metrics |
|
|
|
- **Total FLOPs (training):** `1,571,716,275,216,842,800` (≈1.57 × 10¹⁸)

- **Training runtime:** `1,688.2899` seconds (≈28 minutes)
|
- **Hardware Type:** Single NVIDIA GeForce RTX 3080 Ti Laptop GPU |
|
## Uses |
|
|
|
### Direct Use |
|
- **Voice conversion (VC):** Given a **source** speech clip (content) and a short **reference** clip of the **target** speaker, synthesize the source content in the target’s voice. Intended for prototyping, research demos, and educational exploration. |
|
|
|
### Downstream Use |
|
- **Fine‑tuning VC** on paired speakers (e.g., CMU ARCTIC pairs) for improved similarity. |
|
- Building **hosted inference** services (e.g., AWS SageMaker real‑time endpoint) using the provided handlers. |
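
For the hosted-inference path, the sketch below shows one way to deploy a packaged checkpoint as a SageMaker real-time endpoint with the `sagemaker` Python SDK. The S3 URI, instance type, and container versions are placeholder assumptions; the repository's own entry points and handlers may differ.

```python
# Hedged sketch: deploy a packaged model as a SageMaker real-time endpoint.
# The S3 URI and DLC versions are placeholders, not values from the repository.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

model = HuggingFaceModel(
    model_data="s3://my-bucket/speech-conversion/model.tar.gz",  # placeholder
    role=sagemaker.get_execution_role(),  # requires a SageMaker execution context
    transformers_version="4.37",
    pytorch_version="2.1",
    py_version="py310",
)
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.xlarge")
```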
|
|
|
### Out‑of‑Scope Use |
|
- Any application that **impersonates** individuals without explicit consent (e.g., fraud, deepfakes). |
|
- Safety‑critical domains where audio identity must not be spoofed (e.g., voice‑based authentication). |
|
|
|
## Bias, Risks, and Limitations |
|
- **Identity & consent:** Converting into a target speaker’s timbre can be misused for impersonation. Always obtain **informed consent** from target speakers. |
|
- **Dataset coverage:** CMU ARCTIC is studio-quality, primarily North American English speech; performance may degrade on other accents, languages, or noisy recording conditions.
|
- **Artifacts & intelligibility:** Conversion quality depends on model checkpoint quality and speaker embedding robustness; artifacts may appear for out‑of‑domain inputs or poor reference audio. |
|
|
|
### Recommendations |
|
- Keep reference audio **mono 16 kHz**, clean, and at least a few seconds long (a resampling helper is sketched after this list).
|
- Use **GPU** for real‑time or faster‑than‑real‑time conversion. |
|
- Consider adding **post‑filters** (denoising, loudness normalization) for production use. |
|
- Obtain explicit **consent** and disclose synthesized audio where appropriate. |
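
Since both the source and reference clips are expected as mono 16 kHz, a small helper can normalize arbitrary input audio first. This is a sketch using `torchaudio`, which is not among the repository's pinned dependencies:

```python
# Sketch: coerce arbitrary audio files to the mono 16 kHz format expected here.
# torchaudio is an extra dependency, not pinned by this repository.
import torchaudio

def load_mono_16k(path: str, target_sr: int = 16000):
    wav, sr = torchaudio.load(path)      # (channels, time)
    wav = wav.mean(dim=0, keepdim=True)  # downmix to mono
    if sr != target_sr:
        wav = torchaudio.functional.resample(wav, sr, target_sr)
    return wav.squeeze(0), target_sr
```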
|
|
|
## How to Get Started with the Model |
|
|
|
### CLI (local inference) |
|
```bash
# Requirements (Python 3.10+)
pip install "transformers>=4.42" "datasets>=2.20" "torch>=2.1" \
  "numpy>=1.24" "sentencepiece>=0.1.99" "protobuf>=4.23" \
  "speechbrain>=1.0.0" soundfile

# One-shot conversion (mono 16 kHz WAVs)
python scripts/convert_once.py \
  --checkpoint microsoft/speecht5_vc \
  --src path/to/src.wav \
  --ref path/to/ref.wav \
  --out converted.wav
```
|
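For programmatic use without the CLI script, here is a minimal end-to-end sketch using the stock Hugging Face checkpoints. It mirrors the documented `transformers` SpeechT5 voice-conversion example rather than the repository's `convert_once.py`, and it uses the 512-dim `speechbrain/spkrec-xvect-voxceleb` x-vectors that `microsoft/speecht5_vc` was trained with; file paths are placeholders.

```python
# End-to-end voice-conversion sketch with the stock checkpoints.
import torch
import soundfile as sf
from transformers import SpeechT5Processor, SpeechT5ForSpeechToSpeech, SpeechT5HifiGan
from speechbrain.inference.speaker import EncoderClassifier

processor = SpeechT5Processor.from_pretrained("microsoft/speecht5_vc")
model = SpeechT5ForSpeechToSpeech.from_pretrained("microsoft/speecht5_vc")
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Source utterance (the content to convert), mono 16 kHz.
src, _ = sf.read("path/to/src.wav", dtype="float32")
inputs = processor(audio=src, sampling_rate=16000, return_tensors="pt")

# Target-speaker embedding from the reference clip (512-dim x-vector,
# matching the stock checkpoint's speaker_embedding_dim).
ref, _ = sf.read("path/to/ref.wav", dtype="float32")
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-xvect-voxceleb")
with torch.no_grad():
    emb = encoder.encode_batch(torch.from_numpy(ref).unsqueeze(0))
    emb = torch.nn.functional.normalize(emb, dim=-1).squeeze(1)  # (1, 512)
    speech = model.generate_speech(inputs["input_values"], emb, vocoder=vocoder)

sf.write("converted.wav", speech.cpu().numpy(), samplerate=16000)
```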