prince-canuma committed (verified)
Commit 35689a3 · 1 parent: 4b596e8

Update README.md

Files changed (1): README.md (+165 −3)
README.md CHANGED
@@ -1,3 +1,165 @@
- ---
- license: mit
- ---
---
license: apache-2.0
datasets:
- amphion/Emilia-Dataset
language:
- en
base_model:
- Marvis-AI/marvis-tts-250m-v0.1-base
library_name: transformers
tags:
- mlx
- mlx-audio
- transformers
---

# Introduction

[[code](https://github.com/Marvis-Labs/marvis-tts)]

Marvis is a conversational speech model designed for real-time streaming text-to-speech synthesis. Built with efficiency and accessibility in mind, it addresses the growing need for high-quality, real-time voice synthesis that can run on consumer devices such as iPhones, iPads, and Apple Silicon Macs.

## Key Features

- **Real-time Streaming**: Streams audio chunks as text is processed, enabling natural conversational flow
- **Compact Size**: Only 500MB when quantized, enabling on-device inference
- **Edge Deployment**: Optimized for real-time speech-to-speech (STS) on mobile devices such as iPhone and iPad
- **Natural Audio Flow**: Processes the entire text context for coherent speech synthesis without chunking artifacts
- **Multimodal Architecture**: Seamlessly handles interleaved text and audio tokens

## Supported Languages

Currently optimized for English, with support for expressive speech synthesis. Additional languages such as German, Portuguese, French, and Mandarin are coming soon.

# Quick Start

## Using MLX

```bash
pip install -U mlx-audio
python -m mlx_audio.tts.generate --model Marvis-AI/marvis-tts-250m-v0.1 --stream \
  --text "Marvis TTS is a new text-to-speech model that provides fast streaming on edge devices."
```
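If you would rather call mlx-audio from Python than through the CLI, a minimal sketch is below. It assumes mlx-audio exposes a `generate_audio` helper; the function name and parameters evolve with the library, so verify them against the mlx-audio README before relying on this.

```python
# Minimal sketch of a Python-side equivalent of the CLI call above.
# Assumes mlx-audio exposes `generate_audio`; argument names may differ
# across versions, so check the mlx-audio README.
from mlx_audio.tts.generate import generate_audio

generate_audio(
    text=(
        "Marvis TTS is a new text-to-speech model that provides "
        "fast streaming on edge devices."
    ),
    model_path="Marvis-AI/marvis-tts-250m-v0.1",
    stream=True,                 # stream chunks as they are generated (assumed flag)
    file_prefix="marvis_demo",   # output file name prefix (assumed parameter)
    audio_format="wav",
    verbose=True,
)
```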
## Using transformers

**Without Voice Cloning**
```python
import torch
from transformers import AutoProcessor, CsmForConditionalGeneration
import soundfile as sf

model_id = "Marvis-AI/marvis-tts-250m-v0.1-transformers"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model and the processor.
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)

# Prepare the inputs; the "[0]" prefix selects speaker id 0.
text = "[0]Marvis TTS is a new text-to-speech model that provides fast streaming on edge devices."
inputs = processor(text, add_special_tokens=True, return_tensors="pt").to(device)
inputs.pop("token_type_ids", None)  # generate() does not accept token_type_ids

# Run inference and write the generated audio to disk.
audio = model.generate(**inputs, output_audio=True)
sf.write("example_without_context.wav", audio[0].cpu().numpy(), samplerate=24_000, subtype="PCM_16")
```
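**With Voice Cloning**

As noted under Technical Limitations below, cloning conditions on a short (roughly 10-second) reference clip. The sketch that follows reuses `processor`, `model`, and `device` from the previous snippet and follows the Transformers CSM conversation format; `reference.wav` and its transcript are placeholders you must supply, and the exact processor calls are worth double-checking against the Transformers CSM documentation.

```python
# Sketch: voice cloning by conditioning generation on a reference clip,
# using the Transformers CSM conversation format. "reference.wav" (a clean
# ~10 s recording) and its transcript are placeholders.
conversation = [
    {
        "role": "0",  # speaker id 0
        "content": [
            {"type": "text", "text": "Transcript of the reference clip."},
            {"type": "audio", "path": "reference.wav"},
        ],
    },
    {
        "role": "0",
        "content": [{"type": "text", "text": "New text to speak in the cloned voice."}],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "example_with_context.wav")
```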
# Model Description

Marvis is built on the [Sesame CSM-1B](https://huggingface.co/sesame/csm-1b) (Conversational Speech Model) architecture, a multimodal transformer that operates directly on Residual Vector Quantization (RVQ) tokens and uses [Kyutai's Mimi codec](https://huggingface.co/kyutai/mimi). The architecture enables end-to-end training while maintaining low-latency generation, and it employs a dual-transformer approach:

- **Multimodal Backbone (250M parameters)**: Processes interleaved text and audio sequences to model the zeroth codebook level, providing semantic understanding and context.

- **Audio Decoder (60M parameters)**: A smaller, specialized transformer that models the remaining 31 codebook levels to reconstruct high-quality speech from the backbone's representations.

Unlike models that require text chunking based on regex patterns, Marvis processes entire text sequences contextually, resulting in more natural speech flow and intonation.
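As a rough illustration of that split, the toy sketch below mimics the data flow only; it is not the Marvis implementation, and all module names, layer counts, and dimensions are invented. The backbone consumes interleaved tokens and predicts codebook level 0, and the audio decoder predicts the remaining 31 levels from the backbone's last hidden state.

```python
# Illustrative sketch of the dual-transformer RVQ flow described above.
# All names and sizes are placeholders, not the real implementation.
import torch
import torch.nn as nn

NUM_CODEBOOKS = 32    # Mimi RVQ levels: 1 (backbone) + 31 (decoder)
CODEBOOK_SIZE = 2048  # placeholder codebook size
D_MODEL = 512         # placeholder hidden size

class ToyBackbone(nn.Module):
    """Stands in for the 250M multimodal backbone (interleaved tokens -> level 0)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(CODEBOOK_SIZE, D_MODEL)
        self.blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(D_MODEL, CODEBOOK_SIZE)  # logits for codebook 0

    def forward(self, tokens):               # tokens: (batch, seq)
        h = self.blocks(self.embed(tokens))  # (batch, seq, d_model)
        return h, self.head(h[:, -1])        # hidden states + level-0 logits

class ToyAudioDecoder(nn.Module):
    """Stands in for the 60M decoder (backbone state -> levels 1..31)."""
    def __init__(self):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(D_MODEL, CODEBOOK_SIZE) for _ in range(NUM_CODEBOOKS - 1)
        )

    def forward(self, h_last):               # h_last: (batch, d_model)
        return [head(h_last) for head in self.heads]

backbone, decoder = ToyBackbone(), ToyAudioDecoder()
tokens = torch.randint(0, CODEBOOK_SIZE, (1, 16))  # interleaved text/audio ids
h, level0_logits = backbone(tokens)
residual_logits = decoder(h[:, -1])
print(level0_logits.shape, len(residual_logits))   # torch.Size([1, 2048]) 31
```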
# Training Details

**Pretraining**:
- Dataset: Emilia-YODAS
- Training Steps: 2M
- Hardware: 1x NVIDIA GH200 96GB
- Precision: bfloat16
- Learning Rate: 3e-4
- Batch Size: 64

**Post-training**:
- Dataset: Expressive Speech
- Training Steps: 200K
- Expressiveness Setting: 0.5
- Hardware: 1x NVIDIA GH200 96GB
- Precision: bfloat16
- Learning Rate: 1e-4
- Batch Size: 64
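For readers who script their runs, the two stages above can be restated as a plain config; this is only a restatement of the lists, and the field names are illustrative rather than taken from the Marvis training code.

```python
# Restatement of the hyperparameters above as a config dict.
# Field names are illustrative; they do not come from the Marvis training code.
TRAINING_STAGES = {
    "pretraining": {
        "dataset": "Emilia-YODAS",
        "steps": 2_000_000,
        "hardware": "1x NVIDIA GH200 96GB",
        "precision": "bfloat16",
        "learning_rate": 3e-4,
        "batch_size": 64,
    },
    "post_training": {
        "dataset": "Expressive Speech",
        "steps": 200_000,
        "expressiveness": 0.5,
        "hardware": "1x NVIDIA GH200 96GB",
        "precision": "bfloat16",
        "learning_rate": 1e-4,
        "batch_size": 64,
    },
}
```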
**Total Training Cost**: ~$2,000
- Pretraining and fine-tuning: $246.69 (1x GH200)
- Post-training data generation: $167.94 (RTX6000 Ada)
- Additional experimentation: ~$1,500 across various GPU configurations
- Platforms: Prime-Intellect and Jarvis-Labs
## Use Cases

- **Real-time Voice Assistants**: Deploy natural-sounding voice interfaces with custom voices
- **Content Creation**: Generate voiceovers and narration with personalized voices
- **Accessibility Tools**: Create personalized speech synthesis for communication aids
- **Interactive Applications**: Build conversational AI with consistent voice identity
- **Podcast & Media**: Generate natural-sounding speech for automated content

### Local & Cloud Deployment

**Local Deployment:**
- Minimum Requirements: 1GB RAM; GPU recommended for real-time inference
- Quantized Model: 500MB download
- Platforms: iOS, Android, Windows, macOS, Linux

**Cloud Deployment:**
- API-ready architecture
- Scalable inference pipeline
- Low-latency streaming support

### Technical Limitations

- Language Support: Currently optimized primarily for English; performance on other languages may be suboptimal
- Audio Quality Dependency: Voice cloning quality depends on the clarity and quality of the 10-second reference audio
- Background Noise: Performance degrades with noisy reference audio or inference environments
- Hallucinations: The model may hallucinate words, especially for novel words or short sentences
### Legal and Ethical Considerations

- Users are responsible for complying with local laws regarding voice synthesis and impersonation
- Consider intellectual property rights when cloning voices of public figures
- Respect privacy laws and regulations in your jurisdiction
- Obtain appropriate consent and permissions before deployment

## License & Agreement

* Apache 2.0

## Citation

If you use Marvis in your research or applications, please cite:

```bibtex
@misc{marvis-tts-2025,
  title={Marvis-TTS: Efficient Real-time Voice Cloning with Streaming Speech Synthesis},
  author={Prince Canuma and Lucas Newman},
  year={2025}
}
```

## Acknowledgments

Special thanks to Sesame and Kyutai for their groundbreaking open-source contributions that inspired our work, and to the broader open-source community for their unwavering support and collaboration.

---

**Version**: 0.1

**Release Date**: August 26, 2025

**Creators**: Prince Canuma & Lucas Newman