File size: 6,274 Bytes
35689a3
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
---
license: apache-2.0
datasets:
- amphion/Emilia-Dataset
language:
- en
base_model:
- Marvis-AI/marvis-tts-250m-v0.1-base
library_name: transformers
tags:
- mlx
- mlx-audio
- transformers
---

# Introduction
[[code](https://github.com/Marvis-Labs/marvis-tts)]

Marvis is a cutting-edge conversational speech model designed to enable real-time streaming text-to-speech synthesis. Built with efficiency and accessibility in mind, Marvis addresses the growing need for high-quality, real-time voice synthesis that can run on consumer devices such as Apple Silicon, iPhones, iPads, Macs and others.

## Key Features

- **Real-time Streaming**: Stream audio chunks as text is processed, enabling natural conversational flow
- **Compact Size**: Only 500MB when quantized, enabling on-device inference
- **Edge deployment**: Optimized for real-time Speech-to-Speech (STS) on mobile devices (i.e., iPad, iPhone and etc)
- **Natural Audio Flow**: Process entire text context for coherent speech synthesis without chunking artifacts
- **Multimodal Architecture**: Seamlessly handles interleaved text and audio tokens

## Supported Languages

Currently optimized for English with support for expressive speech synthesis with additional languages such as German, Portuguese, French and Mandarin coming soon.

# Quick Start

## Using MLX

```bash
pip install -U mlx-audio
python -m mlx_audio.tts.generate --model Marvis-AI/marvis-tts-250m-v0.1  --stream \
 --text "Marvis TTS is a new text-to-speech model that provides fast streaming on edge devices."
```

## Using transformers

**Without Voice Cloning**
```python
import torch
from transformers import AutoTokenizer, AutoProcessor, CsmForConditionalGeneration
from tokenizers.processors import TemplateProcessing
import soundfile as sf

model_id = "Marvis-AI/marvis-tts-250m-v0.1-transformers"
device = "cuda"if torch.cuda.is_available() else "cpu"

# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# prepare the inputs
text = "[0]Marvis TTS is a new text-to-speech model that provides fast streaming on edge devices." # `[0]` for speaker id 0
inputs = processor(text, add_special_tokens=True, return_tensors="pt").to(device).pop("token_type_ids")
# infer the model
audio = model.generate(**inputs, output_audio=True)
sf.write("example_without_context.wav", audio[0].cpu(), samplerate=24_000, subtype="PCM_16")

```


# Model Description

Marvis is built on the [Sesame CSM-1B](https://huggingface.co/sesame/csm-1b) (Conversational Speech Model) architecture, a multimodal transformer that operates directly on Residual Vector Quantization (RVQ) tokens and uses [Kyutai's mimi codec](https://huggingface.co/kyutai/mimi). The architecture enables end-to-end training while maintaining low-latency generation and employs a dual-transformer approach:

- **Multimodal Backbone (250M parameters)**: Processes interleaved text and audio sequences to model the zeroth codebook level, providing semantic understanding and context.

- **Audio Decoder (60M parameters)**: A smaller, specialized transformer that models the remaining 31 codebook levels to reconstruct high-quality speech from the backbone's representations.


Unlike models that require text chunking based on regex patterns, Marvis processes entire text sequences contextually, resulting in more natural speech flow and intonation.

# Training Details

**Pretraining**: 
- Dataset: Emilia-YODAS 
- Training Steps: 2M steps
- Hardware: 1x NVIDIA GH200 96GB
- Precision: bfloat16
- Learning Rate: 3e-4
- Batch Size: 64

**Post-training**: 
- Dataset: Expressive Speech
- Training Steps: 200K steps
- Expressiveness Setting: 0.5
- Hardware: 1x NVIDIA GH200 96GB
- Precision: bfloat16
- Learning Rate: 1e-4
- Batch Size: 64

**Total Training Cost**: ~$2,000 
- Pretraining and fine-tuning: $246.69 (1x GH200)
- Post-training data generation: $167.94 (RTX6000 Ada)
- Additional experimentation: ~$1,500 across various GPU configurations
- Platforms: Prime-Intellect and Jarvis-Labs

## Use Cases

- **Real-time Voice Assistants**: Deploy natural-sounding voice interfaces with custom voices
- **Content Creation**: Generate voiceovers and narration with personalized voices
- **Accessibility Tools**: Create personalized speech synthesis for communication aids
- **Interactive Applications**: Build conversational AI with consistent voice identity
- **Podcast & Media**: Generate natural-sounding speech for automated content

### Local & Cloud Deployment

**Local Deployment:**
- Minimum Requirements: 1GB RAM, GPU recommended for real-time inference
- Quantized Model: 500MB download
- Platforms: iOS, Android, Windows, macOS, Linux

**Cloud Deployment:**
- API-ready architecture
- Scalable inference pipeline
- Low-latency streaming support

### Technical Limitations

- Language Support: Currently optimized primarily for English. Performance on other languages may be suboptimal
- Audio Quality Dependency: Voice cloning quality is dependent on the clarity and quality of the 10-second reference audio
- Background Noise: Performance degrades with noisy reference audio or inference environments
- Hallucinations: The model might hallucinate words specially for new words or short sentences.

### Legal and Ethical Considerations:

- Users are responsible for complying with local laws regarding voice synthesis and impersonation
- Consider intellectual property rights when cloning voices of public figures
- Respect privacy laws and regulations in your jurisdiction
- Obtain appropriate consent and permissions before deployment

## License & Agreement

* Apache 2.0

## Citation

If you use Marvis in your research or applications, please cite:

```bibtex
@misc{marvis-tts-2025,
  title={Marvis-TTS: Efficient Real-time Voice Cloning with Streaming Speech Synthesis},
  author={Prince Canuma and Lucas Newman},
  year={2025}
}
```

## Acknowledgments

Special thanks to Sesame and Kyutai for their groundbreaking open-source contributions that inspired our work, and to the broader open-source community for their unwavering support and collaboration.

---

**Version**: 0.1 

**Release Date**: 26/08/2025  

**Creators**: Prince Canuma & Lucas Newman