metadata
title: SingingSDS
emoji: 🎢
colorFrom: pink
colorTo: yellow
sdk: gradio
sdk_version: 5.4.0
app_file: app.py
pinned: false

SingingSDS: Role-Playing Singing Spoken Dialogue System

A role-playing singing dialogue system that converts speech input into character-based singing output.

Installation

Requirements

  • Python 3.11+
  • CUDA (optional, for GPU acceleration)
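Before installing, it can help to confirm that the interpreter meets the version requirement and to see whether a GPU is visible. The following is a minimal sketch, not part of the project: the helper names `python_version_ok` and `cuda_status` are hypothetical, and the GPU check assumes PyTorch is the library that will use CUDA (as the Conda instructions suggest).

```python
import importlib.util
import sys

MIN_PYTHON = (3, 11)  # minimum version taken from the Requirements list


def python_version_ok(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if (major, minor) of version_info is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


def cuda_status():
    """Describe GPU availability without requiring PyTorch to be installed."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch  # imported lazily so the check also works before installation

    return "cuda available" if torch.cuda.is_available() else "cpu only"


if __name__ == "__main__":
    print("Python OK:", python_version_ok())
    print("GPU:", cuda_status())
```

Since CUDA is optional, a "cpu only" result does not block installation; it only indicates that inference will run without GPU acceleration.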

Install Dependencies

Option 1: Using Conda (Recommended)

conda create -n singingsds python=3.11

conda activate singingsds
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt

Option 2: Using pip only

pip install -r requirements.txt

Option 3: Using pip with virtual environment

python -m venv singingsds_env

# On Windows:
singingsds_env\Scripts\activate
# On macOS/Linux:
source singingsds_env/bin/activate

pip install -r requirements.txt

Usage

Command Line Interface (CLI)

Example Usage

python cli.py \
  --query_audio tests/audio/hello.wav \
  --config_path config/cli/yaoyin_default.yaml \
  --output_audio outputs/yaoyin_hello.wav \
  --eval_results_csv outputs/yaoyin_test.csv

Inference-Only Mode

Run minimal inference without evaluation.

python cli.py \
  --query_audio tests/audio/hello.wav \
  --config_path config/cli/yaoyin_default_infer_only.yaml \
  --output_audio outputs/yaoyin_hello.wav

Parameter Description

  • --query_audio: Input audio file path (required)
  • --config_path: Configuration file path (default: config/cli/yaoyin_default.yaml)
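  • --output_audio: Output audio file path (required)
  • --eval_results_csv: Output CSV file path for evaluation results (not used in inference-only mode)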
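To process several recordings, the CLI can be driven from a short Python script. This is a sketch under stated assumptions: `build_cli_command` and `run_batch` are hypothetical helpers, not part of the repository; the script is assumed to run from the repository root so that `cli.py` resolves; and only flags documented above are used. The output file prefix `yaoyin_` simply mirrors the example above.

```python
import subprocess
import sys
from pathlib import Path

# Default config path as stated in the parameter list above.
DEFAULT_CONFIG = "config/cli/yaoyin_default.yaml"


def build_cli_command(query_audio, output_audio, config_path=DEFAULT_CONFIG):
    """Build the argument list for one cli.py run, using only documented flags."""
    return [
        sys.executable, "cli.py",
        "--query_audio", str(query_audio),
        "--config_path", str(config_path),
        "--output_audio", str(output_audio),
    ]


def run_batch(input_dir, output_dir, config_path=DEFAULT_CONFIG):
    """Run cli.py once per .wav file in input_dir (hypothetical helper)."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    for wav in sorted(Path(input_dir).glob("*.wav")):
        cmd = build_cli_command(wav, output_dir / f"yaoyin_{wav.name}", config_path)
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure


if __name__ == "__main__":
    run_batch("tests/audio", "outputs")
```

Passing `config/cli/yaoyin_default_infer_only.yaml` as `config_path` would run the batch in inference-only mode.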

Web Interface (Gradio)

Start the web interface:

python app.py

Then open the address printed in the terminal in your browser to use the graphical interface.

Configuration

Character Configuration

The system supports multiple preset characters:

  • Yaoyin (遥音): Default timbre is timbre2
  • Limei (丽梅): Default timbre is timbre1

Model Configuration

ASR Models

  • openai/whisper-large-v3-turbo
  • openai/whisper-large-v3
  • openai/whisper-medium
  • openai/whisper-small
  • funasr/paraformer-zh

LLM Models

  • gemini-2.5-flash
  • google/gemma-2-2b
  • meta-llama/Llama-3.2-3B-Instruct
  • meta-llama/Llama-3.1-8B-Instruct
  • Qwen/Qwen3-8B
  • Qwen/Qwen3-30B-A3B
  • MiniMaxAI/MiniMax-Text-01

SVS Models

  • espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg (Bilingual)
  • espnet/aceopencpop_svs_visinger2_40singer_pretrain (Chinese)
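When editing a configuration file, a typo in a model identifier is easy to miss. The sketch below checks a chosen ASR/LLM/SVS combination against the lists above. It is illustrative only: `SUPPORTED_MODELS` and `validate_selection` are hypothetical names, and the project's actual configuration keys and model loaders are not shown in this README.

```python
# Model identifiers copied from the lists above.
SUPPORTED_MODELS = {
    "asr": [
        "openai/whisper-large-v3-turbo",
        "openai/whisper-large-v3",
        "openai/whisper-medium",
        "openai/whisper-small",
        "funasr/paraformer-zh",
    ],
    "llm": [
        "gemini-2.5-flash",
        "google/gemma-2-2b",
        "meta-llama/Llama-3.2-3B-Instruct",
        "meta-llama/Llama-3.1-8B-Instruct",
        "Qwen/Qwen3-8B",
        "Qwen/Qwen3-30B-A3B",
        "MiniMaxAI/MiniMax-Text-01",
    ],
    "svs": [
        "espnet/mixdata_svs_visinger2_spkemb_lang_pretrained_avg",
        "espnet/aceopencpop_svs_visinger2_40singer_pretrain",
    ],
}


def validate_selection(asr, llm, svs):
    """Return the selection as a dict, or raise ValueError for an unlisted id."""
    chosen = {"asr": asr, "llm": llm, "svs": svs}
    for kind, model_id in chosen.items():
        if model_id not in SUPPORTED_MODELS[kind]:
            raise ValueError(f"unsupported {kind} model: {model_id!r}")
    return chosen


if __name__ == "__main__":
    print(validate_selection(
        "openai/whisper-small",
        "Qwen/Qwen3-8B",
        "espnet/aceopencpop_svs_visinger2_40singer_pretrain",
    ))
```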

Project Structure

SingingSDS/
├── app.py, cli.py               # Entry points (demo app & CLI)
├── pipeline.py                  # Main orchestration pipeline
├── interface.py                 # Gradio interface
├── characters/                  # Virtual character definitions
├── modules/                     # Core modules
│   ├── asr/                     # ASR models (Whisper, Paraformer)
│   ├── llm/                     # LLMs (Gemini, LLaMA, etc.)
│   ├── svs/                     # Singing voice synthesis (ESPnet)
│   └── utils/                   # G2P, text normalization, resources
├── config/                      # YAML configuration files
├── data/                        # Dataset metadata and length info
├── data_handlers/               # Parsers for KiSing, Touhou, etc.
├── evaluation/                  # Evaluation metrics
├── resources/                   # Singer embeddings, phoneme dicts, MIDI
├── assets/                      # Character visuals
├── tests/                       # Unit tests and sample audios
└── README.md, requirements.txt

Contributing

Issues and Pull Requests are welcome!

License