|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- amphion/Emilia-Dataset |
|
language: |
|
- en |
|
base_model: |
|
- Marvis-AI/marvis-tts-250m-v0.1-base |
|
library_name: transformers |
|
tags: |
|
- mlx |
|
- mlx-audio |
|
- transformers |
|
--- |
|
|
|
# Introduction |
|
[[code](https://github.com/Marvis-Labs/marvis-tts)] |
|
|
|
Marvis is a cutting-edge conversational speech model designed for real-time streaming text-to-speech synthesis. Built with efficiency and accessibility in mind, Marvis addresses the growing need for high-quality, real-time voice synthesis that can run on consumer devices such as Apple Silicon Macs, iPhones, and iPads.
|
|
|
## Key Features |
|
|
|
- **Real-time Streaming**: Stream audio chunks as text is processed, enabling natural conversational flow |
|
- **Compact Size**: Only 500MB when quantized, enabling on-device inference |
|
- **Edge deployment**: Optimized for real-time Speech-to-Speech (STS) on mobile devices (e.g., iPhone and iPad)
|
- **Natural Audio Flow**: Process entire text context for coherent speech synthesis without chunking artifacts |
|
- **Multimodal Architecture**: Seamlessly handles interleaved text and audio tokens |
|
|
|
## Supported Languages |
|
|
|
Currently optimized for English, with support for expressive speech synthesis. Additional languages such as German, Portuguese, French, and Mandarin are coming soon.
|
|
|
# Quick Start |
|
|
|
## Using MLX |
|
|
|
```bash |
|
pip install -U mlx-audio |
|
python -m mlx_audio.tts.generate --model Marvis-AI/marvis-tts-250m-v0.1 --stream \ |
|
--text "Marvis TTS is a new text-to-speech model that provides fast streaming on edge devices." |
|
``` |
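You can also call mlx-audio from Python instead of the CLI. The snippet below is a minimal sketch based on the `generate_audio` helper that mlx-audio exposes in `mlx_audio.tts.generate`; keyword arguments may differ slightly between mlx-audio versions, so check your installed release.

```python
from mlx_audio.tts.generate import generate_audio

# Generate speech with Marvis and write it to marvis_example.wav.
generate_audio(
    text="Marvis TTS is a new text-to-speech model that provides fast streaming on edge devices.",
    model_path="Marvis-AI/marvis-tts-250m-v0.1",
    file_prefix="marvis_example",
    audio_format="wav",
    sample_rate=24000,
    verbose=True,
)
```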
|
|
|
## Using transformers |
|
|
|
**Without Voice Cloning** ([Colab Notebook](https://colab.research.google.com/drive/1m9pdNFGlWMZW8gyXwkN9MNgbBEWP5lfO?usp=sharing))
|
```python |
|
import torch |
|
from transformers import AutoProcessor, CsmForConditionalGeneration

import soundfile as sf
|
|
|
model_id = "Marvis-AI/marvis-tts-250m-v0.1-transformers"
|
device = "cuda"if torch.cuda.is_available() else "cpu" |
|
|
|
# load the model and the processor |
|
processor = AutoProcessor.from_pretrained(model_id) |
|
model = CsmForConditionalGeneration.from_pretrained(model_id).to(device) |
|
|
|
# prepare the inputs |
|
text = "[0]Marvis TTS is a new text-to-speech model that provides fast streaming on edge devices." # `[0]` for speaker id 0 |
|
inputs = processor(text, add_special_tokens=True, return_tensors="pt").to(device) |
|
# generate speech
|
audio = model.generate(input_ids=inputs['input_ids'], output_audio=True) |
|
sf.write("example_without_context.wav", audio[0].cpu(), samplerate=24_000, subtype="PCM_16") |
|
|
|
``` |
|
|
|
**Output:** |
|
|
|
<audio controls> |
|
<source src="https://audio.jukehost.co.uk/gqWAk28VaBoRaX3UPdnMBedGWgXLJ8Mt" type="audio/mpeg"> |
|
</audio> |
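**With Voice Cloning (sketch)**: voice cloning conditions generation on a short reference clip (around 10 seconds) plus its transcript. The snippet below continues from the example above (it reuses `processor`, `model`, `device`, and `sf`) and follows the chat-template pattern that `transformers` documents for CSM-style models; the reference clip path and transcript are placeholders, and the exact behaviour may vary with your `transformers` version.

```python
# Placeholder reference clip and transcript -- replace with ~10 s of clean audio.
reference_audio_path = "reference.wav"
reference_transcript = "Transcript of the reference clip."

conversation = [
    {
        # Reference turn: pairs the transcript with the reference audio.
        "role": "0",
        "content": [
            {"type": "text", "text": reference_transcript},
            {"type": "audio", "path": reference_audio_path},
        ],
    },
    {
        # New text to be spoken in the cloned voice.
        "role": "0",
        "content": [{"type": "text", "text": "This sentence should come out in the cloned voice."}],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

audio = model.generate(**inputs, output_audio=True)

sf.write("example_with_context.wav", audio[0].cpu(), samplerate=24_000, subtype="PCM_16")
```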
|
|
|
--- |
|
|
|
# Model Description |
|
|
|
Marvis is built on the [Sesame CSM-1B](https://huggingface.co/sesame/csm-1b) (Conversational Speech Model) architecture, a multimodal transformer that operates directly on Residual Vector Quantization (RVQ) tokens and uses [Kyutai's mimi codec](https://huggingface.co/kyutai/mimi). The architecture enables end-to-end training while maintaining low-latency generation and employs a dual-transformer approach: |
|
|
|
- **Multimodal Backbone (250M parameters)**: Processes interleaved text and audio sequences to model the zeroth codebook level, providing semantic understanding and context. |
|
|
|
- **Audio Decoder (60M parameters)**: A smaller, specialized transformer that models the remaining 31 codebook levels to reconstruct high-quality speech from the backbone's representations. |
|
|
|
|
|
Unlike models that require text chunking based on regex patterns, Marvis processes entire text sequences contextually, resulting in more natural speech flow and intonation. |
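To make the dual-transformer split concrete, the sketch below illustrates the per-frame generation loop with stubbed components: the backbone predicts the zeroth codebook token for the next frame from the full text/audio context, the audio decoder fills in the remaining 31 levels, and each completed 32-token RVQ frame can be handed to the Mimi codec and streamed. This is an illustrative simplification, not the actual implementation.

```python
import random

NUM_CODEBOOKS = 32    # Mimi RVQ levels: 1 semantic + 31 acoustic
CODEBOOK_SIZE = 2048  # placeholder per-codebook vocabulary size

def backbone_predict_level0(text_tokens, prior_frames):
    """Stub for the 250M multimodal backbone: predicts the zeroth
    (semantic) codebook token for the next frame from the full
    interleaved text/audio context."""
    return random.randrange(CODEBOOK_SIZE)

def decoder_predict_residuals(level0_token):
    """Stub for the 60M audio decoder: autoregressively predicts the
    remaining 31 acoustic codebook tokens for the current frame."""
    return [random.randrange(CODEBOOK_SIZE) for _ in range(NUM_CODEBOOKS - 1)]

def generate_rvq_frames(text_tokens, num_frames):
    """Simplified generation loop: one 32-token RVQ frame per step.
    Each completed frame could be decoded by Mimi and streamed immediately."""
    frames = []
    for _ in range(num_frames):
        level0 = backbone_predict_level0(text_tokens, frames)
        frames.append([level0] + decoder_predict_residuals(level0))
    return frames

frames = generate_rvq_frames(text_tokens=[1, 2, 3], num_frames=5)
print(len(frames), "frames x", len(frames[0]), "codebook tokens each")
```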
|
|
|
# Training Details |
|
|
|
**Pretraining**: |
|
- Dataset: Emilia-YODAS |
|
- Training Steps: 2M steps |
|
- Hardware: 1x NVIDIA GH200 96GB |
|
- Precision: bfloat16 |
|
- Learning Rate: 3e-4 |
|
- Batch Size: 64 |
|
|
|
**Post-training**: |
|
- Dataset: Expressive Speech |
|
- Training Steps: 200K steps |
|
- Expressiveness Setting: 0.5 |
|
- Hardware: 1x NVIDIA GH200 96GB |
|
- Precision: bfloat16 |
|
- Learning Rate: 1e-4 |
|
- Batch Size: 64 |
|
|
|
**Total Training Cost**: ~$2,000 |
|
- Pretraining and fine-tuning: $246.69 (1x GH200) |
|
- Post-training data generation: $167.94 (RTX6000 Ada) |
|
- Additional experimentation: ~$1,500 across various GPU configurations |
|
- Platforms: Prime-Intellect and Jarvis-Labs |
|
|
|
## Use Cases |
|
|
|
- **Real-time Voice Assistants**: Deploy natural-sounding voice interfaces with custom voices |
|
- **Content Creation**: Generate voiceovers and narration with personalized voices |
|
- **Accessibility Tools**: Create personalized speech synthesis for communication aids |
|
- **Interactive Applications**: Build conversational AI with consistent voice identity |
|
- **Podcast & Media**: Generate natural-sounding speech for automated content |
|
|
|
### Local & Cloud Deployment |
|
|
|
**Local Deployment:** |
|
- Minimum Requirements: 1GB RAM, GPU recommended for real-time inference |
|
- Quantized Model: 500MB download |
|
- Platforms: iOS, Android, Windows, macOS, Linux |
|
|
|
**Cloud Deployment:** |
|
- API-ready architecture |
|
- Scalable inference pipeline |
|
- Low-latency streaming support |
|
|
|
### Technical Limitations |
|
|
|
- Language Support: Currently optimized primarily for English. Performance on other languages may be suboptimal |
|
- Audio Quality Dependency: Voice cloning quality is dependent on the clarity and quality of the 10-second reference audio |
|
- Background Noise: Performance degrades with noisy reference audio or inference environments |
|
- Hallucinations: The model may hallucinate words, especially for novel words or very short sentences
|
|
|
### Legal and Ethical Considerations
|
|
|
- Users are responsible for complying with local laws regarding voice synthesis and impersonation |
|
- Consider intellectual property rights when cloning voices of public figures |
|
- Respect privacy laws and regulations in your jurisdiction |
|
- Obtain appropriate consent and permissions before deployment |
|
|
|
## License & Agreement |
|
|
|
* Apache 2.0 |
|
|
|
## Citation |
|
|
|
If you use Marvis in your research or applications, please cite: |
|
|
|
```bibtex |
|
@misc{marvis-tts-2025, |
|
title={Marvis-TTS: Efficient Real-time Voice Cloning with Streaming Speech Synthesis}, |
|
author={Prince Canuma and Lucas Newman}, |
|
year={2025} |
|
} |
|
``` |
|
|
|
## Acknowledgments |
|
|
|
Special thanks to Sesame and Kyutai for their groundbreaking open-source contributions that inspired our work, and to the broader open-source community for their unwavering support and collaboration. |
|
|
|
--- |
|
|
|
**Version**: 0.1 |
|
|
|
**Release Date**: 26/08/2025 |
|
|
|
**Creators**: Prince Canuma & Lucas Newman |