---
title: SingingSDS
emoji: 🎶
colorFrom: pink
colorTo: yellow
sdk: gradio
sdk_version: 5.4.0
app_file: app.py
pinned: false
---
# SingingSDS: Role-Playing Singing Spoken Dialogue System

A role-playing singing dialogue system that converts speech input into character-based singing output.
## Installation

### Requirements

- Python 3.11+
- CUDA (optional, for GPU acceleration)
### Install Dependencies

#### Option 1: Using Conda (Recommended)

```bash
conda create -n singingsds python=3.11
conda activate singingsds
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
```

#### Option 2: Using pip only

```bash
pip install -r requirements.txt
```

#### Option 3: Using pip with a virtual environment

```bash
python -m venv singingsds_env
# On Windows:
singingsds_env\Scripts\activate
# On macOS/Linux:
source singingsds_env/bin/activate
pip install -r requirements.txt
```
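After installing, you can quickly confirm that the key dependencies resolve in the active environment. The snippet below is a minimal sketch; the package names `torch` and `gradio` are assumed from the install steps above, and `requirements.txt` may pin additional packages.

```python
import importlib.util

# Check that key dependencies (assumed from the install steps above)
# resolve in the current Python environment without importing them.
for pkg in ("torch", "gradio"):
    status = "ok" if importlib.util.find_spec(pkg) else "MISSING"
    print(f"{pkg}: {status}")
```

`find_spec` only locates the package without importing it, so the check is fast and will not fail even if a package is broken or missing.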
## Usage

### Command Line Interface (CLI)

#### Example Usage

```bash
python cli.py \
    --query_audio tests/audio/hello.wav \
    --config_path config/cli/yaoyin_default.yaml \
    --output_audio outputs/yaoyin_hello.wav \
    --eval_results_csv outputs/yaoyin_test.csv
```
#### Inference-Only Mode

Run minimal inference without evaluation:

```bash
python cli.py \
    --query_audio tests/audio/hello.wav \
    --config_path config/cli/yaoyin_default_infer_only.yaml \
    --output_audio outputs/yaoyin_hello.wav
```
#### Parameter Description

- `--query_audio`: Input audio file path (required)
- `--config_path`: Configuration file path (default: `config/cli/yaoyin_default.yaml`)
- `--output_audio`: Output audio file path (required)
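For reference, the flags above could be parsed with `argparse` as sketched below. This is a hypothetical reconstruction, not the actual contents of `cli.py`, which may define additional options.

```python
import argparse

# Hypothetical sketch of cli.py's documented flags (the real cli.py
# may differ and define extra options such as evaluation settings).
parser = argparse.ArgumentParser(description="SingingSDS CLI")
parser.add_argument("--query_audio", required=True,
                    help="Input audio file path")
parser.add_argument("--config_path", default="config/cli/yaoyin_default.yaml",
                    help="Configuration file path")
parser.add_argument("--output_audio", required=True,
                    help="Output audio file path")
parser.add_argument("--eval_results_csv", default=None,
                    help="Optional CSV file for evaluation results")

# Parse the example invocation from the Usage section above.
args = parser.parse_args([
    "--query_audio", "tests/audio/hello.wav",
    "--output_audio", "outputs/yaoyin_hello.wav",
])
print(args.config_path)  # falls back to the documented default
```

Omitting `--config_path` falls back to the default configuration, while omitting either required flag makes `argparse` exit with a usage error.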
### Web Interface (Gradio)

Start the web interface:

```bash
python app.py
```

Then visit the displayed address in your browser to use the graphical interface.
## Configuration

### Character Configuration

The system supports multiple preset characters:
- Yaoyin (遥音): Default timbre is `timbre2`
- Limei (丽梅): Default timbre is `timbre1`
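For illustration, a character entry in a CLI config might look like the fragment below. The keys shown are hypothetical; consult `config/cli/yaoyin_default.yaml` for the actual schema.

```yaml
# Hypothetical character configuration (actual keys may differ)
character: yaoyin
timbre: timbre2   # override the character's default timbre here
```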
### Model Configuration

#### ASR Models

- `openai/whisper-large-v3-turbo`
- `openai/whisper-large-v3`
- `openai/whisper-medium`
- `openai/whisper-small`
- `funasr/paraformer-zh`

#### LLM Models

- `gemini-2.5-flash`
- `google/gemma-2-2b`
- `meta-llama/Llama-3.2-3B-Instruct`
- `meta-llama/Llama-3.1-8B-Instruct`
- `Qwen/Qwen3-8B`
- `Qwen/Qwen3-30B-A3B`
- `MiniMaxAI/MiniMax-Text-01`

#### SVS Models

- `espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg` (bilingual)
- `espnet/aceopencpop_svs_visinger2_40singer_pretrain` (Chinese)
## Project Structure

```
SingingSDS/
├── app.py, cli.py    # Entry points (demo app & CLI)
├── pipeline.py       # Main orchestration pipeline
├── interface.py      # Gradio interface
├── characters/       # Virtual character definitions
├── modules/          # Core modules
│   ├── asr/          # ASR models (Whisper, Paraformer)
│   ├── llm/          # LLMs (Gemini, LLaMA, etc.)
│   ├── svs/          # Singing voice synthesis (ESPnet)
│   └── utils/        # G2P, text normalization, resources
├── config/           # YAML configuration files
├── data/             # Dataset metadata and length info
├── data_handlers/    # Parsers for KiSing, Touhou, etc.
├── evaluation/       # Evaluation metrics
├── resources/        # Singer embeddings, phoneme dicts, MIDI
├── assets/           # Character visuals
├── tests/            # Unit tests and sample audios
└── README.md, requirements.txt
```
## Contributing

Issues and pull requests are welcome!