prince-canuma committed (verified)
Commit 35689a3 · 1 parent: 4b596e8

Update README.md

Files changed (1): README.md (+165 −3)
README.md CHANGED
@@ -1,3 +1,165 @@
- ---
- license: mit
- ---
---
license: apache-2.0
datasets:
- amphion/Emilia-Dataset
language:
- en
base_model:
- Marvis-AI/marvis-tts-250m-v0.1-base
library_name: transformers
tags:
- mlx
- mlx-audio
- transformers
---

# Introduction

[[code](https://github.com/Marvis-Labs/marvis-tts)]

Marvis is a conversational speech model designed for real-time streaming text-to-speech synthesis. Built with efficiency and accessibility in mind, it addresses the growing need for high-quality, real-time voice synthesis that can run on consumer devices such as iPhones, iPads, and Apple Silicon Macs.

## Key Features

- **Real-time Streaming**: Streams audio chunks as text is processed, enabling natural conversational flow
- **Compact Size**: Only 500MB when quantized, enabling on-device inference
- **Edge Deployment**: Optimized for real-time speech-to-speech (STS) on mobile devices such as iPhone and iPad
- **Natural Audio Flow**: Processes the entire text context for coherent speech synthesis without chunking artifacts
- **Multimodal Architecture**: Seamlessly handles interleaved text and audio tokens

## Supported Languages

Currently optimized for English, with support for expressive speech synthesis. Additional languages such as German, Portuguese, French, and Mandarin are coming soon.

# Quick Start

## Using MLX

```bash
pip install -U mlx-audio
python -m mlx_audio.tts.generate --model Marvis-AI/marvis-tts-250m-v0.1 --stream \
  --text "Marvis TTS is a new text-to-speech model that provides fast streaming on edge devices."
```
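If you would rather call mlx-audio from Python than through the CLI, a minimal sketch is below. It assumes mlx-audio exposes a `generate_audio` helper; the function name and parameters evolve with the library, so verify them against the mlx-audio README before relying on this.

```python
# Minimal sketch of a Python-side equivalent of the CLI call above.
# Assumes mlx-audio exposes `generate_audio`; argument names may differ
# across versions, so check the mlx-audio README.
from mlx_audio.tts.generate import generate_audio

generate_audio(
    text=(
        "Marvis TTS is a new text-to-speech model that provides "
        "fast streaming on edge devices."
    ),
    model_path="Marvis-AI/marvis-tts-250m-v0.1",
    stream=True,                 # stream chunks as they are generated (assumed flag)
    file_prefix="marvis_demo",   # output file name prefix (assumed parameter)
    audio_format="wav",
    verbose=True,
)
```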
## Using transformers

**Without Voice Cloning**
```python
import torch
from transformers import AutoProcessor, CsmForConditionalGeneration
import soundfile as sf

model_id = "Marvis-AI/marvis-tts-250m-v0.1-transformers"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and the processor.
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# Prepare the inputs; the "[0]" prefix selects speaker id 0.
text = "[0]Marvis TTS is a new text-to-speech model that provides fast streaming on edge devices."
inputs = processor(text, add_special_tokens=True, return_tensors="pt").to(device)
inputs.pop("token_type_ids", None)  # generate() does not accept token_type_ids

# Run inference and write the generated audio to disk.
audio = model.generate(**inputs, output_audio=True)
sf.write("example_without_context.wav", audio[0].cpu().numpy(), samplerate=24_000, subtype="PCM_16")
```
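**With Voice Cloning**

As noted under Technical Limitations below, cloning conditions on a short (roughly 10-second) reference clip. The sketch that follows reuses `processor`, `model`, and `device` from the previous snippet and follows the Transformers CSM conversation format; `reference.wav` and its transcript are placeholders you must supply, and the exact processor calls are worth double-checking against the Transformers CSM documentation.

```python
# Sketch: voice cloning by conditioning generation on a reference clip,
# using the Transformers CSM conversation format. "reference.wav" (a clean
# ~10 s recording) and its transcript are placeholders.
conversation = [
    {
        "role": "0",  # speaker id 0
        "content": [
            {"type": "text", "text": "Transcript of the reference clip."},
            {"type": "audio", "path": "reference.wav"},
        ],
    },
    {
        "role": "0",
        "content": [{"type": "text", "text": "New text to speak in the cloned voice."}],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "example_with_context.wav")
```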
# Model Description

Marvis is built on the [Sesame CSM-1B](https://huggingface.co/sesame/csm-1b) (Conversational Speech Model) architecture, a multimodal transformer that operates directly on Residual Vector Quantization (RVQ) tokens and uses [Kyutai's Mimi codec](https://huggingface.co/kyutai/mimi). The architecture enables end-to-end training while maintaining low-latency generation, and it employs a dual-transformer approach:

- **Multimodal Backbone (250M parameters)**: Processes interleaved text and audio sequences to model the zeroth codebook level, providing semantic understanding and context.

- **Audio Decoder (60M parameters)**: A smaller, specialized transformer that models the remaining 31 codebook levels to reconstruct high-quality speech from the backbone's representations.

Unlike models that require text chunking based on regex patterns, Marvis processes entire text sequences contextually, resulting in more natural speech flow and intonation.
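As a rough illustration of that split, the toy sketch below mimics the data flow only; it is not the Marvis implementation, and all module names, layer counts, and dimensions are invented. The backbone consumes interleaved tokens and predicts codebook level 0, and the audio decoder predicts the remaining 31 levels from the backbone's last hidden state.

```python
# Illustrative sketch of the dual-transformer RVQ flow described above.
# All names and sizes are placeholders, not the real implementation.
import torch
import torch.nn as nn

NUM_CODEBOOKS = 32    # Mimi RVQ levels: 1 (backbone) + 31 (decoder)
CODEBOOK_SIZE = 2048  # placeholder codebook size
D_MODEL = 512         # placeholder hidden size

class ToyBackbone(nn.Module):
    """Stands in for the 250M multimodal backbone (interleaved tokens -> level 0)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(CODEBOOK_SIZE, D_MODEL)
        self.blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(D_MODEL, CODEBOOK_SIZE)  # logits for codebook 0

    def forward(self, tokens):               # tokens: (batch, seq)
        h = self.blocks(self.embed(tokens))  # (batch, seq, d_model)
        return h, self.head(h[:, -1])        # hidden states + level-0 logits

class ToyAudioDecoder(nn.Module):
    """Stands in for the 60M decoder (backbone state -> levels 1..31)."""
    def __init__(self):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(D_MODEL, CODEBOOK_SIZE) for _ in range(NUM_CODEBOOKS - 1)
        )

    def forward(self, h_last):               # h_last: (batch, d_model)
        return [head(h_last) for head in self.heads]

backbone, decoder = ToyBackbone(), ToyAudioDecoder()
tokens = torch.randint(0, CODEBOOK_SIZE, (1, 16))  # interleaved text/audio ids
h, level0_logits = backbone(tokens)
residual_logits = decoder(h[:, -1])
print(level0_logits.shape, len(residual_logits))   # torch.Size([1, 2048]) 31
```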
# Training Details

**Pretraining**:
- Dataset: Emilia-YODAS
- Training Steps: 2M
- Hardware: 1x NVIDIA GH200 96GB
- Precision: bfloat16
- Learning Rate: 3e-4
- Batch Size: 64

**Post-training**:
- Dataset: Expressive Speech
- Training Steps: 200K
- Expressiveness Setting: 0.5
- Hardware: 1x NVIDIA GH200 96GB
- Precision: bfloat16
- Learning Rate: 1e-4
- Batch Size: 64
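For readers who script their runs, the two stages above can be restated as a plain config; this is only a restatement of the lists, and the field names are illustrative rather than taken from the Marvis training code.

```python
# Restatement of the hyperparameters above as a config dict.
# Field names are illustrative; they do not come from the Marvis training code.
TRAINING_STAGES = {
    "pretraining": {
        "dataset": "Emilia-YODAS",
        "steps": 2_000_000,
        "hardware": "1x NVIDIA GH200 96GB",
        "precision": "bfloat16",
        "learning_rate": 3e-4,
        "batch_size": 64,
    },
    "post_training": {
        "dataset": "Expressive Speech",
        "steps": 200_000,
        "expressiveness": 0.5,
        "hardware": "1x NVIDIA GH200 96GB",
        "precision": "bfloat16",
        "learning_rate": 1e-4,
        "batch_size": 64,
    },
}
```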
**Total Training Cost**: ~$2,000
- Pretraining and fine-tuning: $246.69 (1x GH200)
- Post-training data generation: $167.94 (RTX6000 Ada)
- Additional experimentation: ~$1,500 across various GPU configurations
- Platforms: Prime-Intellect and Jarvis-Labs
## Use Cases

- **Real-time Voice Assistants**: Deploy natural-sounding voice interfaces with custom voices
- **Content Creation**: Generate voiceovers and narration with personalized voices
- **Accessibility Tools**: Create personalized speech synthesis for communication aids
- **Interactive Applications**: Build conversational AI with consistent voice identity
- **Podcast & Media**: Generate natural-sounding speech for automated content

### Local & Cloud Deployment

**Local Deployment:**
- Minimum Requirements: 1GB RAM; GPU recommended for real-time inference
- Quantized Model: 500MB download
- Platforms: iOS, Android, Windows, macOS, Linux

**Cloud Deployment:**
- API-ready architecture
- Scalable inference pipeline
- Low-latency streaming support

### Technical Limitations

- Language Support: Currently optimized primarily for English; performance on other languages may be suboptimal
- Audio Quality Dependency: Voice cloning quality depends on the clarity and quality of the 10-second reference audio
- Background Noise: Performance degrades with noisy reference audio or inference environments
- Hallucinations: The model may hallucinate words, especially for novel words or short sentences
### Legal and Ethical Considerations

- Users are responsible for complying with local laws regarding voice synthesis and impersonation
- Consider intellectual property rights when cloning voices of public figures
- Respect privacy laws and regulations in your jurisdiction
- Obtain appropriate consent and permissions before deployment

## License & Agreement

* Apache 2.0

## Citation

If you use Marvis in your research or applications, please cite:

```bibtex
@misc{marvis-tts-2025,
  title={Marvis-TTS: Efficient Real-time Voice Cloning with Streaming Speech Synthesis},
  author={Prince Canuma and Lucas Newman},
  year={2025}
}
```

## Acknowledgments

Special thanks to Sesame and Kyutai for their groundbreaking open-source contributions that inspired our work, and to the broader open-source community for their unwavering support and collaboration.

---

**Version**: 0.1

**Release Date**: August 26, 2025

**Creators**: Prince Canuma & Lucas Newman