Pretrained models

Model	Download link	File size
Speech synthesis model, based on MMAudio small 16kHz	v2c_s16.pt	1.3G
Speech synthesis model, based on MMAudio small 44.1kHz	v2c_s44.pt	1.3G
Speech synthesis model, based on MMAudio medium 44.1kHz	v2c_m44.pt	1.3G
Speech synthesis model, based on MMAudio large 44.1kHz	v2c_l44.pt	1.3G
MMAudio, small 16kHz	mmaudio_small_16k.pth	601M
MMAudio, small 44.1kHz	mmaudio_small_44k.pth	601M
MMAudio, medium 44.1kHz	mmaudio_medium_44k.pth	2.4G
MMAudio, large 44.1kHz	mmaudio_large_44k.pth	3.9G
MMAudio, large 44.1kHz, v2	mmaudio_large_44k_v2.pth	3.9G
16kHz VAE	v1-16.pth	655M
16kHz BigVGAN vocoder (from Make-An-Audio 2)	best_netG.pt	429M
44.1kHz VAE	v1-44.pth	1.2G
Synchformer visual encoder	synchformer_state_dict.pth	907M
Whisper model for WER evaluation	faster-whisper-large-v3	2.9G
WavLM model for SIM-O evaluation	wavlm_large_finetune.pth	1.2G

The expected directory structure:

F5-TTS
├── ckpts
│   ├── v2c
│   │   ├── v2c_s16.pt
│   │   ├── v2c_s44.pt
│   │   ├── v2c_m44.pt
│   │   └── v2c_l44.pt
│   ├── faster-whisper-large-v3
│   └── wavlm_large_finetune.pth
└── ...
MMAudio
├── ext_weights
│   ├── best_netG.pt
│   ├── synchformer_state_dict.pth
│   ├── v1-16.pth
│   └── v1-44.pth
├── weights
│   ├── mmaudio_small_16k.pth
│   ├── mmaudio_small_44k.pth
│   ├── mmaudio_medium_44k.pth
│   ├── mmaudio_large_44k.pth
│   └── mmaudio_large_44k_v2.pth
└── ...
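
To confirm that the downloads ended up where the tree above expects them, a small check along the following lines may help. This is only a sketch and not part of the repository: the helper name `missing_checkpoints` and the `EXPECTED` mapping are hypothetical, and it assumes that `F5-TTS` and `MMAudio` sit side by side under one base directory. The paths themselves are copied from the tree above.

```python
from pathlib import Path

# Paths copied from the directory tree above, relative to each top-level folder.
# Assumption: F5-TTS and MMAudio are sibling folders under one base directory.
EXPECTED = {
    "F5-TTS": [
        "ckpts/v2c/v2c_s16.pt",
        "ckpts/v2c/v2c_s44.pt",
        "ckpts/v2c/v2c_m44.pt",
        "ckpts/v2c/v2c_l44.pt",
        "ckpts/faster-whisper-large-v3",
        "ckpts/wavlm_large_finetune.pth",
    ],
    "MMAudio": [
        "ext_weights/best_netG.pt",
        "ext_weights/synchformer_state_dict.pth",
        "ext_weights/v1-16.pth",
        "ext_weights/v1-44.pth",
        "weights/mmaudio_small_16k.pth",
        "weights/mmaudio_small_44k.pth",
        "weights/mmaudio_medium_44k.pth",
        "weights/mmaudio_large_44k.pth",
        "weights/mmaudio_large_44k_v2.pth",
    ],
}


def missing_checkpoints(base_dir=".", expected=EXPECTED):
    """Return the expected paths (relative to base_dir) that do not exist."""
    base = Path(base_dir)
    missing = []
    for top_level, rel_paths in expected.items():
        for rel in rel_paths:
            path = base / top_level / rel
            # exists() accepts both files and directories, so it does not
            # matter how faster-whisper-large-v3 is stored on disk.
            if not path.exists():
                missing.append(path.relative_to(base).as_posix())
    return missing


if __name__ == "__main__":
    for rel in missing_checkpoints("."):
        print(f"missing: {rel}")
```

Run it from the directory that contains both `F5-TTS` and `MMAudio`. It only checks that each path exists and does not verify file sizes or contents. If you only need a subset of the models (for example, a single `v2c_*.pt` variant and its matching MMAudio weights), trim `EXPECTED` accordingly.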