# Pretrained models

| Model | Download link | File size |
| -------- | ------- | ------- |
| Speech synthesis model, based on MMAudio small 16kHz | v2c_s16.pt | 1.3G |
| Speech synthesis model, based on MMAudio small 44.1kHz | v2c_s44.pt | 1.3G |
| Speech synthesis model, based on MMAudio medium 44.1kHz | v2c_m44.pt | 1.3G |
| Speech synthesis model, based on MMAudio large 44.1kHz | v2c_l44.pt | 1.3G |
| MMAudio, small 16kHz | mmaudio_small_16k.pth | 601M |
| MMAudio, small 44.1kHz | mmaudio_small_44k.pth | 601M |
| MMAudio, medium 44.1kHz | mmaudio_medium_44k.pth | 2.4G |
| MMAudio, large 44.1kHz | mmaudio_large_44k.pth | 3.9G |
| MMAudio, large 44.1kHz, v2 | mmaudio_large_44k_v2.pth | 3.9G |
| 16kHz VAE | v1-16.pth | 655M |
| 16kHz BigVGAN vocoder (from Make-An-Audio 2) | best_netG.pt | 429M |
| 44.1kHz VAE | v1-44.pth | 1.2G |
| Synchformer visual encoder | synchformer_state_dict.pth | 907M |
| Whisper model for WER evaluation | faster-whisper-large-v3 | 2.9G |
| WavLM model for SIM-O evaluation | wavlm_large_finetune.pth | 1.2G |

The expected directory structure:
```bash
F5-TTS
├── ckpts
│ ├── v2c
│ │ ├── v2c_s16.pt
│ │ ├── v2c_s44.pt
│ │ ├── v2c_m44.pt
│ │ └── v2c_l44.pt
│ ├── faster-whisper-large-v3
│ └── wavlm_large_finetune.pth
└── ...
MMAudio
├── ext_weights
│ ├── best_netG.pt
│ ├── synchformer_state_dict.pth
│ ├── v1-16.pth
│ └── v1-44.pth
├── weights
│ ├── mmaudio_small_16k.pth
│ ├── mmaudio_small_44k.pth
│ ├── mmaudio_medium_44k.pth
│ ├── mmaudio_large_44k.pth
│ └── mmaudio_large_44k_v2.pth
└── ...
```
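
To confirm that every file landed in the right place, a small check along the following lines may help. This is only a sketch: the helper name `missing_checkpoints` is hypothetical, and it assumes that `F5-TTS` and `MMAudio` sit side by side under a common root directory, which the tree above suggests but does not state. The paths themselves are copied from the tree.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
# Assumption: F5-TTS and MMAudio share one parent directory.
EXPECTED = [
    "F5-TTS/ckpts/v2c/v2c_s16.pt",
    "F5-TTS/ckpts/v2c/v2c_s44.pt",
    "F5-TTS/ckpts/v2c/v2c_m44.pt",
    "F5-TTS/ckpts/v2c/v2c_l44.pt",
    "F5-TTS/ckpts/faster-whisper-large-v3",  # a directory, not a single file
    "F5-TTS/ckpts/wavlm_large_finetune.pth",
    "MMAudio/ext_weights/best_netG.pt",
    "MMAudio/ext_weights/synchformer_state_dict.pth",
    "MMAudio/ext_weights/v1-16.pth",
    "MMAudio/ext_weights/v1-44.pth",
    "MMAudio/weights/mmaudio_small_16k.pth",
    "MMAudio/weights/mmaudio_small_44k.pth",
    "MMAudio/weights/mmaudio_medium_44k.pth",
    "MMAudio/weights/mmaudio_large_44k.pth",
    "MMAudio/weights/mmaudio_large_44k_v2.pth",
]


def missing_checkpoints(root="."):
    """Return the expected relative paths that do not exist under `root`."""
    base = Path(root)
    return [rel for rel in EXPECTED if not (base / rel).exists()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing files or directories:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected checkpoints found.")
```

Run it from the directory that contains both `F5-TTS` and `MMAudio`, or pass a different root to `missing_checkpoints`. The check only tests that each path exists; it does not verify file sizes or contents.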