Ming-Lite-Omni
📑 Technical Report | 📖 Project Page | 🤗 Hugging Face | 🤖 ModelScope
Introduction
Ming-lite-omni is a light version of Ming-omni, derived from Ling-lite and featuring 2.8 billion activated parameters. It is a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-lite-omni employs dedicated encoders to extract tokens from different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single model to efficiently process and fuse multimodal inputs within a unified framework, thereby facilitating diverse tasks without requiring separate models, task-specific fine-tuning, or structural redesign. Importantly, Ming-lite-omni extends beyond conventional multimodal models by supporting audio and image generation. This is achieved through the integration of an advanced audio decoder for natural-sounding speech and Ming-Lite-Uni for high-quality image generation, which also allows the model to engage in context-aware chatting, perform text-to-speech conversion, and conduct versatile image editing. Our experimental results show that Ming-lite-omni offers a powerful solution for unified perception and generation across all modalities. Notably, Ming-lite-omni is the first open-source model we are aware of to match GPT-4o in modality support, and we release all code and model weights to encourage further research and development in the community.
📌 Updates
- [2025.05.28] 🔥 The official version of Ming-lite-omni is released, with better performance and image generation support.
- [2025.05.04] 🔥 We release the test version of Ming-lite-omni: Ming-lite-omni-Preview.
Key Features
- Unified Omni-Modality Perception: Built on Ling, an MoE-architecture LLM, Ming-lite-omni resolves task conflicts and ensures coherent integration of tokens from different modalities through modality-specific routers (see the sketch after this list).
- Unified Perception and Generation: Ming-lite-omni achieves unified understanding and generation, enabling the model to interpret multimodal instructions and user intent during generation, which enhances generation quality and improves usability across multiple tasks.
- Innovative Generation Capabilities: Ming-lite-omni can perceive all modalities and generate high-quality text, real-time speech, and vivid images simultaneously, delivering exceptional cross-modal performance across diverse tasks including image perception, audio-visual interaction, and image generation.
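The modality-specific routing idea can be pictured with a small, self-contained PyTorch sketch: each modality gets its own gating network over a shared pool of experts, so tokens from different modalities are dispatched differently while sharing expert capacity. This is an illustrative toy under our own naming (ModalitySpecificRouter and all hyperparameters are hypothetical), not the released Ling/Ming-lite-omni implementation:

```python
import torch
import torch.nn as nn

class ModalitySpecificRouter(nn.Module):
    """Toy MoE layer with one gate per modality over a shared pool of experts."""

    def __init__(self, hidden_size: int, num_experts: int,
                 modalities=("text", "image", "audio", "video")):
        super().__init__()
        # One lightweight gating network per modality.
        self.gates = nn.ModuleDict({m: nn.Linear(hidden_size, num_experts) for m in modalities})
        # A shared pool of feed-forward experts.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden_size, 4 * hidden_size), nn.GELU(),
                          nn.Linear(4 * hidden_size, hidden_size))
            for _ in range(num_experts)
        ])

    def forward(self, tokens: torch.Tensor, modality: str, top_k: int = 2) -> torch.Tensor:
        # tokens: (num_tokens, hidden_size); the gate is selected by the tokens' modality.
        scores = self.gates[modality](tokens).softmax(dim=-1)   # (N, num_experts)
        weights, indices = scores.topk(top_k, dim=-1)           # (N, top_k)
        out = torch.zeros_like(tokens)
        for k in range(top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(tokens[mask])
        return out

# Example: route a batch of image tokens through the image-specific gate.
router = ModalitySpecificRouter(hidden_size=64, num_experts=8)
fused = router(torch.randn(16, 64), modality="image")
```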
Evaluation
Ming-lite-omni delivers exceptional cross-modal performance, as validated across image perception, audio-visual interaction, and image generation tasks. Specifically, on image perception tasks, Ming-lite-omni attains performance comparable to Qwen2.5-VL-7B while activating only 2.8B parameters. It delivers superior performance in end-to-end speech understanding and instruction following, surpassing Qwen2.5-Omni and Kimi-Audio. It also supports native-resolution image generation, editing, and style transfer, achieving a GenEval score of 0.64 and outperforming mainstream models such as SDXL. In terms of FID, Ming-lite-omni reaches 4.85, setting a new state of the art among existing methods.
Image benchmark
Benchmarks | Ming-lite-omni | Qwen2.5-VL-7B-Instruct | InternVL2.5-8B-MPO |
---|---|---|---|
AI2D | 83.1 | 84.4 | 84.5 |
HallusionBench | 55.0 | 55.8 | 51.7 |
MMBench_TEST_V11 | 80.8 | 82.8 | 82.0 |
MMMU | 56.3 | 56.6 | 54.8 |
MMStar | 64.7 | 65.3 | 65.2 |
MMVet | 71.3 | 71.6 | 68.1 |
MathVista | 71.6 | 68.1 | 67.9 |
OCRBench | 88.4 | 87.8 | 88.2 |
Average | 71.4 | 71.5 | 70.3 |
Encyclopedia Benchmarks
Object Recognition | Ming-lite-omni | Qwen2.5-VL-7B-Instruct |
---|---|---|
Plants | 54.96 | 47.8 |
Animals | 56.7 | 50.85 |
Vehicles | 41.91 | 42.29 |
Food & Ingredients | 62.28 | 54.09 |
Dishes | 44.3 | 39.07 |
General | 91.08 | 92.42 |
Average | 58.54 | 54.43 |
Video benchmark
Benchmarks | Ming-lite-omni | Qwen2.5VL-7B-Instruct |
---|---|---|
VideoMME | 67.0 | 67.3 |
MVBench | 67.7 | 67.4 |
Video-MMMU | 46.3 | 47.4 |
LongVideoBench | 56.6 | 54.7 |
Average | 59.4 | 59.2 |
Audio benchmark
SpeechQA
Model | Average | AlpacaEval | CommonEval | SD-QA | MMSU | OpenBookQA | IFEval | AdvBench |
---|---|---|---|---|---|---|---|---|
Qwen2-Audio-chat | 3.545 | 3.69 | 3.40 | 35.35 | 35.43 | 49.01 | 22.57 | 98.85 |
Baichuan-Audio | 3.695 | 4.00 | 3.39 | 49.64 | 48.80 | 63.30 | 41.32 | 86.73 |
GLM-4-Voice | 3.77 | 4.06 | 3.48 | 43.31 | 40.11 | 52.97 | 24.91 | 88.08 |
Kimi-Audio | 4.215 | 4.46 | 3.97 | 63.12 | 62.17 | 83.52 | 61.10 | 100.00 |
Qwen2.5-Omni | 4.21 | 4.49 | 3.93 | 55.71 | 61.32 | 81.10 | 52.87 | 99.42 |
Ming-lite-omni | 4.34 | 4.63 | 4.06 | 58.84 | 47.53 | 61.98 | 58.36 | 99.04 |
ASR
Model | aishell1 | aishell2_android | aishell2_ios | cv15_zh | fleurs_zh | wenetspeech_meeting | wenetspeech_net | librispeech_test_clean | librispeech_test_other | multilingual_librispeech | cv15_en | fleurs_en | voxpopuli_v1.0_en |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Ming-lite-omni | 1.47 | 2.55 | 2.52 | 6.31 | 2.96 | 5.95 | 5.46 | 1.44 | 2.80 | 4.15 | 6.89 | 3.39 | 5.80 |
Qwen2.5-Omni | 1.18 | 2.75 | 2.63 | 5.20 | 3.00 | 5.90 | 7.70 | 1.80 | 3.40 | 7.56 | 7.60 | 4.10 | 5.80 |
Qwen2-Audio | 1.53 | 2.92 | 2.92 | 6.90 | 7.50 | 7.16 | 8.42 | 1.60 | 3.60 | 5.40 | 8.60 | 6.90 | 6.84 |
Kimi-Audio | 0.60 | 2.64 | 2.56 | 7.21 | 2.69 | 6.28 | 5.37 | 1.28 | 2.42 | 5.88 | 10.31 | 4.44 | 7.97 |
Information-Seeking Benchmark
Model | InfoSeek_H-mean | InfoSeek_unseen_question | InfoSeek_unseen_entity |
---|---|---|---|
GPT-4o | 36.05 | - | - |
PaLI-X | 22.06 | 23.5 | 20.8 |
Qwen2.5-VL-32B | 19.35 | 20.55 | 18.28 |
Ming-lite-omni | 27.7 | 30.4 | 25.4 |
OCR
Benchmarks | Ming-lite-omni | Qwen2.5-VL-7B-Instruct |
---|---|---|
ChartQA_TEST | 85.1 | 87.3 |
DocVQA_TEST | 93.0 | 95.7 |
OCRBenchV2_en/zh | 53.3/52 | 56.3/57.2 |
OmniDocBench↓ | 34/34.4 | 30.8/39.8 |
TextVQA_VAL | 82.8 | 84.9 |
GUI
Benchmarks | Ming-lite-omni | InternVL3-8B | Qwen2.5-VL-7B-Instruct |
---|---|---|---|
ScreenSpot | 82.1 | 79.5 | 78.9* |
ScreenSpot-V2 | 84.1 | 81.4 | - |
AITZ(EM) | 66.6 | - | 57.6* |
Unified Generation Benchmark
Model | single_object | two_object | counting | colors | position | color_attr | GENEVAL | DPGBench | FID↓ |
---|---|---|---|---|---|---|---|---|---|
Ming-lite-omni | 0.9875 | 0.7727 | 0.6812 | 0.7872 | 0.31 | 0.29 | 0.64 | 81.72 | 4.85 |
Metaquery-XL | - | - | - | - | - | - | 0.61 | 82.05 | 6.02 |
SDv2.1 | 0.98 | 0.51 | 0.44 | 0.85 | 0.07 | 0.17 | 0.50 | 68.09 | 26.96 |
Emu3-Gen | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 | 0.54 | 80.60 | - |
SDXL | 0.98 | 0.74 | 0.39 | 0.85 | 0.15 | 0.23 | 0.55 | 74.65 | 8.76 |
Janus | 0.97 | 0.68 | 0.30 | 0.84 | 0.46 | 0.42 | 0.61 | 79.68 | 10.10 |
JanusFlow | - | - | - | - | - | - | 0.63 | 80.09 | 9.51 |
Please refer to our technical report for more comprehensive evaluation results.
Model Downloads
You can download the model from both Hugging Face and ModelScope; a programmatic download sketch follows the table below.
Model | Input modality | Output modality | Download |
---|---|---|---|
Ming-Lite-Omni | Image, text, video, audio | Image, text, audio | 🤗 HuggingFace 🤖 ModelScope |
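As a minimal sketch, the Hugging Face weights can also be fetched programmatically with huggingface_hub (assuming that package is installed; ModelScope provides an analogous download API):

```python
from huggingface_hub import snapshot_download

# Downloads all files of the repo into the local cache (or a directory of your choice)
# and returns the local path.
local_dir = snapshot_download(repo_id="inclusionAI/Ming-Lite-Omni")
print(f"Model files are available at: {local_dir}")
```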
Use Cases
Additional demonstration cases are available on our project page.
Example Usage
Please download our model following Model Downloads, then refer to the following code to run the Ming-lite-omni model.
Install the Python environment dependencies:
pip install -r requirements.txt
pip install data/matcha_tts-0.0.5.1-cp38-cp38-linux_x86_64.whl
pip install diffusers==0.33.0
pip install nvidia-cublas-cu12==12.4.5.8 # for H20
Note: We tested the following examples on NVIDIA H800-80GB hardware with CUDA 12.2. Loading inclusionAI/Ming-Lite-Omni in bfloat16 takes about 40,890 MB of GPU memory.
import os
import torch
from transformers import AutoProcessor, GenerationConfig
from modeling_bailingmm import BailingMMNativeForConditionalGeneration
# build model
model = BailingMMNativeForConditionalGeneration.from_pretrained(
"inclusionAI/Ming-Lite-Omni",
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True
).to("cuda")
assets_path = YOUR_ASSETS_PATH
# build processor
processor = AutoProcessor.from_pretrained("inclusionAI/Ming-Lite-Omni", trust_remote_code=True)
# qa
messages = [
{
"role": "HUMAN",
"content": [
{"type": "text", "text": "请详细介绍鹦鹉的生活习性。"}
],
},
]
# Output:
# 鹦鹉是一种非常聪明和社交性强的鸟类,它们的生活习性非常丰富和有趣。以下是一些关于鹦鹉生活习性的详细介绍:
# ### 1. **栖息地**
# 鹦鹉主要分布在热带和亚热带地区,包括非洲、亚洲、澳大利亚和南美洲。它们通常生活在森林、草原、沙漠和城市环境中。不同种类的鹦鹉对栖息地的要求有所不同,但大多数鹦鹉喜欢有丰富植被和水源的地方。
# ### 2. **饮食**
# 鹦鹉是杂食性动物,它们的饮食非常多样化。它们的食物包括种子、坚果、水果、蔬菜、花蜜和昆虫。鹦鹉的喙非常强壮,能够轻松地打开坚硬的果壳和坚果。一些鹦鹉还会吃泥土或沙子,以帮助消化和补充矿物质。
# ......
# image qa
messages = [
{
"role": "HUMAN",
"content": [
{"type": "image", "image": os.path.join(assets_path, "flowers.jpg")},
{"type": "text", "text": "What kind of flower is this?"},
],
},
]
# Output:
# The flowers in this image are forget-me-nots. These delicate blooms are known for their small, five-petaled flowers that come in various shades of blue, pink, and white.
To enable thinking before responding, add the following system prompt before your question:
cot_prompt = "SYSTEM: You are a helpful assistant. When the user asks a question, your response must include two parts: first, the reasoning process enclosed in <thinking>...</thinking> tags, then the final answer enclosed in <answer>...</answer> tags. The critical answer or key result should be placed within \\boxed{}.\n"
# And your input message should be like this:
messages = [
{
"role": "HUMAN",
"content": [
{"type": "image", "image": os.path.join(assets_path, "reasoning.png")},
{"type": "text", "text": cot_prompt + "In the rectangle $A B C D$ pictured, $M_{1}$ is the midpoint of $D C, M_{2}$ the midpoint of $A M_{1}, M_{3}$ the midpoint of $B M_{2}$ and $M_{4}$ the midpoint of $C M_{3}$. Determine the ratio of the area of the quadrilateral $M_{1} M_{2} M_{3} M_{4}$ to the area of the rectangle $A B C D$.\nChoices:\n(A) $\frac{7}{16}$\n(B) $\frac{3}{16}$\n(C) $\frac{7}{32}$\n(D) $\frac{9}{32}$\n(E) $\frac{1}{5}$"},
],
},
]
# Output:
# <think>\nOkay, so I have this problem about a rectangle ABCD ... (thinking process omitted) ... So, the correct answer is C.\n</think>\n<answer>\boxed{C}</answer>\n\n
# video qa
messages = [
{
"role": "HUMAN",
"content": [
{"type": "video", "video": os.path.join(assets_path, "yoga.mp4")},
{"type": "text", "text": "What is the woman doing?"},
],
},
]
# Output:
# The image shows a woman performing a yoga pose on a rooftop. She's in a dynamic yoga pose, with her arms and legs extended in various positions.
# multi-turn chat
messages = [
{
"role": "HUMAN",
"content": [
{"type": "text", "text": "中国的首都是哪里?"},
],
},
{
"role": "ASSISTANT",
"content": [
{"type": "text", "text": "北京"},
],
},
{
"role": "HUMAN",
"content": [
{"type": "text", "text": "它的占地面积是多少?有多少常住人口?"},
],
},
]
# Output:
# 北京市的总面积约为16,410.54平方公里,常住人口约为21,542,000人。 (Beijing covers a total area of about 16,410.54 square kilometers and has a permanent population of about 21,542,000.)
# Preparation for inference
text = processor.apply_chat_template(messages, add_generation_prompt=True)
image_inputs, video_inputs, audio_inputs = processor.process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
audios=audio_inputs,
return_tensors="pt",
)
inputs = inputs.to(model.device)
for k in inputs.keys():
if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats":
inputs[k] = inputs[k].to(dtype=torch.bfloat16)
# call generate
generation_config = GenerationConfig.from_dict({'no_repeat_ngram_size': 10})
generated_ids = model.generate(
**inputs,
max_new_tokens=512,
use_cache=True,
eos_token_id=processor.gen_terminator,
generation_config=generation_config,
)
generated_ids_trimmed = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(output_text)
Audio tasks
# ASR
messages = [
{
"role": "HUMAN",
"content": [
{"type": "text", "text": "Please recognize the language of this speech and transcribe it. Format: oral."},
{"type": "audio", "audio": 'data/wavs/BAC009S0915W0292.wav'},
],
},
]
# We use the Whisper encoder for the ASR task, so the processor and generate calls above need to be modified as follows.
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
audios=audio_inputs,
return_tensors="pt",
audio_kwargs={'use_whisper_encoder': True}
)
outputs = model.generate(
**inputs,
max_new_tokens=512,
use_cache=True,
eos_token_id=processor.gen_terminator,
generation_config=generation_config,
use_whisper_encoder=True
)
# speech2speech
messages = [
{
"role": "HUMAN",
"content": [
{"type": "audio", "audio": 'data/wavs/speechQA_sample.wav'},
],
},
]
generation_config = GenerationConfig.from_dict({
'output_hidden_states': True,
'return_dict_in_generate': True,
'no_repeat_ngram_size': 10}
)
outputs = model.generate(
**inputs,
max_new_tokens=512,
use_cache=True,
eos_token_id=processor.gen_terminator,
generation_config=generation_config,
use_whisper_encoder=False
)
generated_ids = outputs.sequences
generated_ids_trimmed = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
# speechQA result
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
# for TTS
from modeling_bailing_talker import AudioDetokenizer
model_name_or_path = model.config._name_or_path
audio_detokenizer = AudioDetokenizer(
f'{model_name_or_path}/talker/audio_detokenizer.yaml',
flow_model_path=f'{model_name_or_path}/talker/flow.pt',
hifigan_model_path=f'{model_name_or_path}/talker/hift.pt'
)
spk_input = torch.load('data/spks/luna.pt')
thinker_reply_part = outputs.hidden_states[0][0] + outputs.hidden_states[0][-1]
# Setting thinker_reply_part to None allows the talker to operate as a standalone TTS model, independent of the language model.
audio_tokens = model.talker.omni_audio_generation(
output_text,
thinker_reply_part=thinker_reply_part, **spk_input)
waveform = audio_detokenizer.token2wav(audio_tokens, save_path='out.wav', **spk_input)
For detailed usage of the ASR, SpeechQA, and TTS tasks, please refer to test_audio_tasks.py.
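As the comment above notes, setting thinker_reply_part to None lets the talker run as a standalone TTS model. A minimal sketch, reusing the audio_detokenizer and spk_input objects built above (the input sentence and output filename here are arbitrary placeholders):

```python
# Standalone TTS: synthesize arbitrary text without conditioning on the LLM's hidden states.
tts_text = "Hello, this is a text-to-speech test."
audio_tokens = model.talker.omni_audio_generation(
    tts_text,
    thinker_reply_part=None,  # decouple the talker from the language model
    **spk_input,
)
waveform = audio_detokenizer.token2wav(audio_tokens, save_path='tts_only.wav', **spk_input)
```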
Image Generation & Edit
Ming-lite-omni natively supports image generation and image editing. To use these capabilities, you only need to pass the corresponding parameters to the generate function.
# Image generation mode currently limits the range of input pixels.
gen_input_pixels = 451584
processor.max_pixels = gen_input_pixels
processor.min_pixels = gen_input_pixels
def generate(messages, processor, model, **image_gen_param):
text = processor.apply_chat_template(messages, add_generation_prompt=True)
image_inputs, video_inputs, audio_inputs = processor.process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
audios=audio_inputs,
return_tensors="pt",
).to(model.device)
for k in inputs.keys():
if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats":
inputs[k] = inputs[k].to(dtype=torch.bfloat16)
print(image_gen_param)
image = model.generate(
**inputs,
image_gen=True,
**image_gen_param,
)
return image
Text-to-image
messages = [
{
"role": "HUMAN",
"content": [
{"type": "text", "text": "Draw a girl with short hair."},
],
}
]
image = generate(
messages=messages, processor=processor, model=model,
image_gen_cfg=6.0, image_gen_steps=20, image_gen_width=480, image_gen_height=544
)
image.save("./t2i.jpg")
Edit
messages = [
{
"role": "HUMAN",
"content": [
{"type": "image", "image": "samples/cake.jpg"},
{"type": "text", "text": "add a candle on top of the cake"},
],
}
]
image = generate(
messages=messages, processor=processor, model=model,
image_gen_cfg=6.0, image_gen_steps=20, image_gen_width=512, image_gen_height=512
)
image.save("./edit.jpg")
License and Legal Disclaimer
This code repository is licensed under the MIT License, and the Legal Disclaimer is located in the LEGAL.md file under the project's root directory.
Citation
If you find our work helpful, feel free to cite us.
@article{Mingomni2025,
title = {Ming-Omni: A Unified Multimodal Model for Perception and Generation},
author = {Inclusion AI},
journal = {arXiv preprint},
year = {2025}
}