---
license: cc-by-nc-sa-4.0
language:
- en
- pt
- es
- zh
- nl
- fr
- de
- it
- ja
- pl
pipeline_tag: audio-to-audio
tags:
- audio
- voice
- voice conversion
- singing voice conversion
- vc
- svc
- multilingual
---
# FreeSVC: Zero-shot Multilingual Singing Voice Conversion
**FreeSVC** is a multilingual zero-shot singing voice conversion model. It converts singing voices across languages without requiring extensive language-specific training. The code is available in the [GitHub repository](https://github.com/freds0/free-svc), and the method is described in the [arXiv pre-print](https://arxiv.org/abs/2501.05586).
## Supported Languages
| Language   | ID  | Status       | Speech Data | Singing Data |
|------------|-----|--------------|-------------|--------------|
| Chinese    | 0   | ✅ Full      | 255h        | 70h          |
| Dutch      | 1   | ✅ Full      | Part of CML | -            |
| English    | 2   | ✅ Full      | 921h        | 47h          |
| French     | 3   | ✅ Full      | Part of CML | -            |
| German     | 4   | ✅ Full      | Part of CML | -            |
| Italian    | 5   | ✅ Full      | Part of CML | -            |
| Japanese   | 6   | ✅ Full      | 30h         | -            |
| Other*     | 7   | ⚠️ Partial   | -           | 10h          |
| Polish     | 8   | ✅ Full      | Part of CML | -            |
| Portuguese | 9   | ✅ Full      | Part of CML | -            |
| Spanish    | 10  | ✅ Full      | Part of CML | -            |
*Note: The "Other" category covers recordings of vocal techniques that carry no linguistic content.
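
The ID column above can be kept as a small lookup table when preparing inputs. The following is a minimal sketch, assuming that downstream code expects the integer ID from the table; the names `LANGUAGE_IDS` and `language_id` are hypothetical and are not taken from the FreeSVC repository.

```python
# Hypothetical helper: maps the language names from the table above to their IDs.
# The names used here are illustrative and not part of the FreeSVC codebase.
LANGUAGE_IDS = {
    "chinese": 0,
    "dutch": 1,
    "english": 2,
    "french": 3,
    "german": 4,
    "italian": 5,
    "japanese": 6,
    "other": 7,  # vocal techniques, partial support
    "polish": 8,
    "portuguese": 9,
    "spanish": 10,
}


def language_id(name: str) -> int:
    """Return the table ID for a language name, ignoring case and outer spaces."""
    key = name.strip().lower()
    if key not in LANGUAGE_IDS:
        known = ", ".join(sorted(LANGUAGE_IDS))
        raise ValueError(f"Unknown language {name!r}; expected one of: {known}")
    return LANGUAGE_IDS[key]


if __name__ == "__main__":
    print(language_id("Portuguese"))
```

Raising on unknown names rather than falling back to "Other" avoids silently conditioning the model on the wrong language.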
## Model Overview
FreeSVC builds on an enhanced VITS architecture combined with Speaker-invariant Clustering (SPIN) and the ECAPA2 speaker encoder. This combination is designed to separate speaker characteristics from linguistic content, which supports natural-sounding voice conversion across multiple languages.
## Training Datasets
FreeSVC was trained on a diverse set of speech and singing datasets covering multiple languages:
| **Dataset** | **Hours** | **Language** | **Type** |
|----------------------|------------|--------------|--------------|
| AISHELL-1 | 170h | Chinese | Speech |
| AISHELL-3 | 85h | Chinese | Speech |
| CML-TTS | 3.1k h | 7 Languages | Speech |
| HiFiTTS | 292h | English | Speech |
| JVS | 30h | Japanese | Speech |
| LibriTTS-R | 585h | English | Speech |
| NUS (NHSS) | 7h | English | Speech, Singing |
| OpenSinger | 50h | Chinese | Singing |
| Opencpop | 5h | Chinese | Singing |
| PopBuTFy | 10h, 40h | Chinese, English | Singing |
| POPCS | 5h | Chinese | Singing |
| VCTK | 44h | English | Speech |
| VocalSet | 10h | Other | Singing |
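
The per-language totals in the Supported Languages table can be cross-checked against the dataset hours listed above. The sketch below rests on three assumptions that the tables imply but do not state outright: the PopBuTFy hours are listed in the same order as its languages (10h Chinese, 40h English), NUS (NHSS) is counted under singing, and CML-TTS is excluded because it is reported only as "Part of CML". The names `DATASETS` and `hours_by_language` are illustrative.

```python
from collections import defaultdict

# (dataset, hours, language, type) rows copied from the table above.
# CML-TTS is omitted because its hours are not broken down per language.
DATASETS = [
    ("AISHELL-1", 170, "Chinese", "speech"),
    ("AISHELL-3", 85, "Chinese", "speech"),
    ("HiFiTTS", 292, "English", "speech"),
    ("JVS", 30, "Japanese", "speech"),
    ("LibriTTS-R", 585, "English", "speech"),
    ("NUS (NHSS)", 7, "English", "singing"),  # assumption: counted as singing
    ("OpenSinger", 50, "Chinese", "singing"),
    ("Opencpop", 5, "Chinese", "singing"),
    ("PopBuTFy", 10, "Chinese", "singing"),   # assumption: 10h is the Chinese part
    ("PopBuTFy", 40, "English", "singing"),   # assumption: 40h is the English part
    ("POPCS", 5, "Chinese", "singing"),
    ("VCTK", 44, "English", "speech"),
    ("VocalSet", 10, "Other", "singing"),
]


def hours_by_language(rows):
    """Sum hours per (language, type) pair."""
    totals = defaultdict(int)
    for _name, hours, language, kind in rows:
        totals[(language, kind)] += hours
    return dict(totals)


totals = hours_by_language(DATASETS)

if __name__ == "__main__":
    for (language, kind), hours in sorted(totals.items()):
        print(f"{language:<9} {kind:<8} {hours}h")
```

Under these assumptions the sums match the Supported Languages table: 255h speech and 70h singing for Chinese, 921h speech and 47h singing for English, 30h speech for Japanese, and 10h singing for "Other".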
## License
FreeSVC is released under the **Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)** license. This means:
- The model **can only be used for research and non-commercial purposes**. Any commercial use is strictly prohibited.
- Any derivative works must be **shared under the same license**.
- Proper attribution must be given when using the model.
Users must also **comply with the licenses of the original datasets** used for training. Some datasets may have additional restrictions beyond CC BY-NC-SA 4.0. Ensure you review and adhere to their terms before using the model.
For full details, refer to the [CC BY-NC-SA 4.0 License](https://creativecommons.org/licenses/by-nc-sa/4.0/).
## Citation
```
@INPROCEEDINGS{10890068,
author={Ferreira, Alef Iury and Gris, Lucas Rafael and Da Rosa, Augusto and Oliveira, Frederico and Casanova, Edresson and Sousa, Rafael and Junior, Arnaldo and Soares, Anderson and Filho, Arlindo Galvão},
booktitle={ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={FreeSVC: Towards Zero-shot Multilingual Singing Voice Conversion},
year={2025},
volume={},
number={},
pages={1-5},
keywords={Training;Source coding;Zero shot learning;Refining;Signal processing;Data models;Acoustics;Multilingual;Data mining;Speech synthesis;Singing Voice Conversion;Synthesis of Singing Voices;Cross-lingual and multilingual aspects in speech synthesis},
doi={10.1109/ICASSP49660.2025.10890068}}
```