MingTok: A Unified Tokenizer for Visual Understanding and Generation without Vector Quantization

📑 Technical Report | 📖 Project Page | 🤗 Hugging Face | 🤖 ModelScope | 💾 GitHub

Key Features

  • πŸ–ΌοΈ First Continuous Unified Vision Tokenizer: MingTok enables unified vision understanding and generation via a continuous latent space, eliminating quantization while preserving semantic and perceptual fidelity.
  • 🎯 High-Fidelity Image Reconstruction: A three-stage architecture (encoding, expansion, reconstruction) captures fine details and global structure for accurate, high-quality image recovery.
  • ⚑ Accelerated Autoregressive Convergence: Masked modeling with multi-level supervision shapes a compact, semantically rich latent space, enabling faster and more stable autoregressive training.
Model Architecture

Figure 1: Conceptual comparison and qualitative examples of MingTok.

Usage

# build MingTok
import torch
import torchvision.transforms as T
from PIL import Image

# assumption: CenterCropProcessor ships alongside MingTok in the mingtok package;
# adjust the import path to wherever it lives in your checkout
from mingtok.modeling_mingtok import MingTok, CenterCropProcessor

mingtok_model = MingTok.from_pretrained("inclusionAI/MingTok-Vision")
mingtok_model = mingtok_model.cuda().eval()

img_path = "mingtok/asset/mingtok.png"
save_path = "mingtok/asset/mingtok_recon.png"

# load the original image, center-crop it to 512x512, and normalize to [-1, 1]
image = Image.open(img_path).convert("RGB")
processor = CenterCropProcessor(image_size=512, mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
image = processor(image).cuda().unsqueeze(0)

# reconstruct in a single call ...
with torch.no_grad():
    image_recon = mingtok_model.forward_enc_dec(image)
    # ... or run the three stages explicitly:
    # latent = mingtok_model.low_level_encoder(image)
    # semantic_feat = mingtok_model.semantic_decoder(latent)['x_norm_patchtokens']
    # image_recon = mingtok_model.forward_pixel_decoder(semantic_feat)

# de-normalize from [-1, 1] back to [0, 1], clamp, and save
output_mean = torch.tensor([0.5, 0.5, 0.5]).view(1, -1, 1, 1).cuda()
output_std = torch.tensor([0.5, 0.5, 0.5]).view(1, -1, 1, 1).cuda()
output_image = (image_recon * output_std + output_mean).clamp(0, 1)[0]
output_image = T.ToPILImage()(output_image)
output_image.save(save_path)
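
For understanding tasks, the continuous semantic tokens can be used directly instead of being decoded back to pixels; a minimal sketch built from the stage-wise calls commented out above (the shape noted is an assumption: 256 tokens for a 512x512 input, per the table below, with a model-dependent channel width):

with torch.no_grad():
    latent = mingtok_model.low_level_encoder(image)
    semantic_feat = mingtok_model.semantic_decoder(latent)['x_norm_patchtokens']

print(semantic_feat.shape)  # expected (1, 256, C); C is model-dependent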

Performance

Image Reconstruction

| Tokenizer | Res. | # Tokens | rFID ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|---|---|---|
| *Specialized tokenizers* | | | | | | |
| SD-VAE | 256 | 1024 | 1.06 | 28.62 | 0.86 | - |
| GigaTok | 256 | 256 | 0.51 | 21.32 | 0.69 | 0.21 |
| VA-VAE | 256 | 256 | 0.26 | 28.59 | 0.80 | 0.09 |
| HieraTok | 256 | 256 | 1.04 | 23.90 | 0.72 | 0.09 |
| DC-AE | 512 | 64 | 0.22 | 26.15 | 0.71 | 0.08 |
| MAE-Tok | 512 | 128 | 0.62 | - | - | - |
| TexTok | 512 | 256 | 0.73 | 24.45 | 0.66 | 0.19 |
| *Unified tokenizers* | | | | | | |
| UniTok | 256 | 256 | 0.38 | - | - | - |
| TokenFlow | 384 | 729 | 0.63 | 22.77 | 0.73 | - |
| MingTok-Vision | 512 | 256 | 0.54 | 30.77 | 0.62 | 0.14 |
| MingTok-Vision † | 512 | 256 | 0.38 | 31.09 | 0.64 | 0.12 |
† denotes using the semantic decoder after joint pre-training.
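
As a quick sanity check on the reconstruction metrics, PSNR can be computed directly from the tensors in the usage snippet; a minimal sketch (the psnr helper is illustrative, not part of the MingTok API):

import torch
import torch.nn.functional as F

def psnr(x: torch.Tensor, y: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    # x, y: image tensors in [0, max_val] with identical shapes
    mse = F.mse_loss(x, y)
    return 10.0 * torch.log10(max_val ** 2 / mse)

# de-normalize both the input and the reconstruction to [0, 1] first
orig = (image * output_std + output_mean).clamp(0, 1)
recon = (image_recon * output_std + output_mean).clamp(0, 1)
print(psnr(orig, recon).item())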

Reference

@article{huang2025mingunivision,
  title={Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer},
  author={Huang, Ziyuan and Zheng, DanDan and Zou, Cheng and Liu, Rui and Wang, Xiaolong and Ji, Kaixiang and Chai, Weilong and Sun, Jianxin and Wang, Libin and Lv, Yongjie and Huang, Taozhi and Liu, Jiajia and Guo, Qingpei and Yang, Ming and Chen, Jingdong and Zhou, Jun},
  journal={arXiv preprint arXiv:2510.06590},
  year={2025}
}