Update README.md

---
license: apache-2.0
datasets:
- amphion/Emilia-Dataset
language:
- en
base_model:
- Marvis-AI/marvis-tts-250m-v0.1-base
library_name: transformers
tags:
- mlx
- mlx-audio
- transformers
---

# Introduction

[[code](https://github.com/Marvis-Labs/marvis-tts)]

Marvis is a cutting-edge conversational speech model designed for real-time streaming text-to-speech synthesis. Built with efficiency and accessibility in mind, Marvis addresses the growing need for high-quality, real-time voice synthesis that can run on consumer devices such as Apple Silicon Macs, iPhones, and iPads.

## Key Features

- **Real-time Streaming**: Streams audio chunks as text is processed, enabling natural conversational flow
- **Compact Size**: Only 500MB when quantized, enabling on-device inference
- **Edge Deployment**: Optimized for real-time speech-to-speech (STS) on mobile devices such as iPhone and iPad
- **Natural Audio Flow**: Processes the entire text context for coherent speech synthesis without chunking artifacts
- **Multimodal Architecture**: Seamlessly handles interleaved text and audio tokens

## Supported Languages

Currently optimized for English, including expressive speech synthesis. Additional languages such as German, Portuguese, French, and Mandarin are coming soon.

# Quick Start

## Using MLX

```bash
pip install -U mlx-audio
python -m mlx_audio.tts.generate --model Marvis-AI/marvis-tts-250m-v0.1 --stream \
  --text "Marvis TTS is a new text-to-speech model that provides fast streaming on edge devices."
```
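
The same generation can be driven from Python. A minimal sketch, assuming mlx-audio's `generate_audio` helper in `mlx_audio.tts.generate`; verify the exact signature against your installed mlx-audio version:

```python
# Sketch: mlx-audio's Python API instead of the CLI. `generate_audio` and its
# keyword arguments are assumed from mlx-audio; check your installed version.
from mlx_audio.tts.generate import generate_audio

generate_audio(
    text="Marvis TTS is a new text-to-speech model that provides fast streaming on edge devices.",
    model_path="Marvis-AI/marvis-tts-250m-v0.1",  # same checkpoint as the CLI example
    file_prefix="marvis_example",                 # hypothetical output file prefix
    audio_format="wav",
    sample_rate=24000,
    verbose=True,
)
```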

## Using transformers

**Without Voice Cloning**
```python
import torch
from transformers import AutoProcessor, CsmForConditionalGeneration
import soundfile as sf

model_id = "Marvis-AI/marvis-tts-250m-v0.1-transformers"
device = "cuda" if torch.cuda.is_available() else "cpu"

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# prepare the inputs; the `[0]` prefix selects speaker id 0
text = "[0]Marvis TTS is a new text-to-speech model that provides fast streaming on edge devices."
inputs = processor(text, add_special_tokens=True, return_tensors="pt").to(device)
inputs.pop("token_type_ids", None)  # generate() does not accept token_type_ids

# infer the model
audio = model.generate(**inputs, output_audio=True)
sf.write("example_without_context.wav", audio[0].cpu().numpy(), samplerate=24_000, subtype="PCM_16")
```
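
**With Voice Cloning (sketch)**

Voice cloning conditions generation on a short reference recording plus its transcript. Below is a hedged sketch following the Hugging Face CSM processor API (`apply_chat_template` with an audio segment, then `processor.save_audio`); it reuses `processor`, `model`, and `device` from the snippet above, and `reference.wav` and its transcript are placeholders you must supply:

```python
# Sketch: condition generation on a ~10-second reference clip and its
# transcript, then synthesize new text in that voice. Paths are placeholders.
conversation = [
    {
        "role": "0",  # speaker id 0
        "content": [
            {"type": "text", "text": "Transcript of the reference clip."},
            {"type": "audio", "path": "reference.wav"},
        ],
    },
    {
        "role": "0",
        "content": [{"type": "text", "text": "Text to speak in the cloned voice."}],
    },
]
inputs = processor.apply_chat_template(conversation, tokenize=True, return_dict=True).to(device)

audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "example_with_context.wav")
```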

# Model Description

Marvis is built on the [Sesame CSM-1B](https://huggingface.co/sesame/csm-1b) (Conversational Speech Model) architecture, a multimodal transformer that operates directly on Residual Vector Quantization (RVQ) tokens and uses [Kyutai's mimi codec](https://huggingface.co/kyutai/mimi). The architecture enables end-to-end training while maintaining low-latency generation, and it employs a dual-transformer approach (sketched below):

- **Multimodal Backbone (250M parameters)**: Processes interleaved text and audio sequences to model the zeroth codebook level, providing semantic understanding and context.
- **Audio Decoder (60M parameters)**: A smaller, specialized transformer that models the remaining 31 codebook levels to reconstruct high-quality speech from the backbone's representations.
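
To make the division of labor concrete, here is an illustrative schematic of one generation step; the names and shapes are hypothetical, not Marvis's actual implementation:

```python
# Illustrative schematic of one dual-transformer RVQ generation step.
# All names are hypothetical; this is not the Marvis implementation.
def generate_frame(backbone, decoder, history):
    # The backbone reads the interleaved text/audio history and predicts
    # the zeroth (semantic) codebook token of the next audio frame.
    h = backbone(history)                  # contextual hidden state
    codes = [backbone.sample_codebook0(h)]
    # The small audio decoder fills in the remaining 31 acoustic codebook
    # levels, conditioned on the backbone state and the levels so far.
    for level in range(1, 32):
        codes.append(decoder.sample_level(h, codes, level))
    return codes  # one 32-level RVQ frame; the mimi codec decodes frames to audio
```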

Unlike models that require text chunking based on regex patterns, Marvis processes entire text sequences contextually, resulting in more natural speech flow and intonation.

# Training Details

**Pretraining**:
- Dataset: Emilia-YODAS
- Training Steps: 2M
- Hardware: 1x NVIDIA GH200 96GB
- Precision: bfloat16
- Learning Rate: 3e-4
- Batch Size: 64

**Post-training**:
- Dataset: Expressive Speech
- Training Steps: 200K
- Expressiveness Setting: 0.5
- Hardware: 1x NVIDIA GH200 96GB
- Precision: bfloat16
- Learning Rate: 1e-4
- Batch Size: 64

**Total Training Cost**: ~$2,000
- Pretraining and fine-tuning: $246.69 (1x GH200)
- Post-training data generation: $167.94 (RTX 6000 Ada)
- Additional experimentation: ~$1,500 across various GPU configurations
- Platforms: Prime-Intellect and Jarvis-Labs

## Use Cases

- **Real-time Voice Assistants**: Deploy natural-sounding voice interfaces with custom voices
- **Content Creation**: Generate voiceovers and narration with personalized voices
- **Accessibility Tools**: Create personalized speech synthesis for communication aids
- **Interactive Applications**: Build conversational AI with consistent voice identity
- **Podcast & Media**: Generate natural-sounding speech for automated content

### Local & Cloud Deployment

**Local Deployment:**
- Minimum Requirements: 1GB RAM; GPU recommended for real-time inference
- Quantized Model: 500MB download
- Platforms: iOS, Android, Windows, macOS, Linux

**Cloud Deployment:**
- API-ready architecture (a minimal serving sketch follows this list)
- Scalable inference pipeline
- Low-latency streaming support
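
As a rough illustration of cloud serving, a minimal HTTP endpoint can wrap the transformers snippet above. This sketch uses FastAPI (our choice for illustration, not part of Marvis) and assumes `processor`, `model`, and `device` are already loaded as in the "Using transformers" example; it returns a complete WAV file rather than a chunked stream:

```python
# Sketch: minimal TTS endpoint around the transformers example above.
# Assumes `processor`, `model`, and `device` are already initialized.
import io

import soundfile as sf
from fastapi import FastAPI
from fastapi.responses import Response

app = FastAPI()

@app.get("/tts")
def tts(text: str):
    inputs = processor(f"[0]{text}", add_special_tokens=True, return_tensors="pt").to(device)
    inputs.pop("token_type_ids", None)
    audio = model.generate(**inputs, output_audio=True)
    buf = io.BytesIO()
    sf.write(buf, audio[0].cpu().numpy(), samplerate=24_000, format="WAV")
    return Response(content=buf.getvalue(), media_type="audio/wav")
```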

### Technical Limitations

- Language Support: Currently optimized primarily for English; performance on other languages may be suboptimal
- Audio Quality Dependency: Voice cloning quality depends on the clarity and quality of the 10-second reference audio
- Background Noise: Performance degrades with noisy reference audio or noisy inference environments
- Hallucinations: The model may hallucinate words, especially for novel words or short sentences

### Legal and Ethical Considerations

- Users are responsible for complying with local laws regarding voice synthesis and impersonation
- Consider intellectual property rights when cloning voices of public figures
- Respect privacy laws and regulations in your jurisdiction
- Obtain appropriate consent and permissions before deployment

## License & Agreement

* Apache 2.0

## Citation

If you use Marvis in your research or applications, please cite:

```bibtex
@misc{marvis-tts-2025,
  title={Marvis-TTS: Efficient Real-time Voice Cloning with Streaming Speech Synthesis},
  author={Prince Canuma and Lucas Newman},
  year={2025}
}
```

## Acknowledgments

Special thanks to Sesame and Kyutai for their groundbreaking open-source contributions that inspired our work, and to the broader open-source community for their unwavering support and collaboration.

---

**Version**: 0.1

**Release Date**: 26/08/2025

**Creators**: Prince Canuma & Lucas Newman