Ming-Lite-Omni-Preview

Model Description

Ming-Lite-Omni-Preview employs a unified Mixture-of-Experts (MoE) framework for multimodal sequence modeling, which equips the Ling LLMs with comprehensive cross-modal understanding and generation capabilities. Specifically, Ming-Lite-Omni-Preview can process arbitrary combinations of audio, video, image, and text modalities as input and generate multimodal sequences interleaving audio, image, or text outputs, thereby enabling an advanced, interactive real-time experience. To handle the diverse modalities naturally, we enhanced Ling-Lite-MoE by incorporating modality-specific routers for each modality. As a result, Ming-Lite-Omni-Preview excels at handling information from diverse modalities and is highly scalable.
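
The modality-specific routing can be pictured with a minimal sketch. The toy MoE layer below is purely illustrative (hypothetical class name, expert count, and dimensions; not the actual Ming/Ling implementation): a single pool of experts is shared across modalities while each modality selects experts through its own router.

import torch
import torch.nn as nn

class ModalityRoutedMoE(nn.Module):
    def __init__(self, dim=512, num_experts=8, top_k=2,
                 modalities=("text", "image", "video", "audio")):
        super().__init__()
        self.top_k = top_k
        # Experts are shared across all modalities.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])
        # Each modality routes over the shared experts with its own router.
        self.routers = nn.ModuleDict({m: nn.Linear(dim, num_experts) for m in modalities})

    def forward(self, x, modality):
        # x: (batch, seq, dim); all tokens in this call belong to one modality.
        logits = self.routers[modality](x)                     # (B, S, num_experts)
        weights, idx = logits.softmax(dim=-1).topk(self.top_k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize top-k weights
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e).unsqueeze(-1)        # (B, S, 1)
                out = out + mask * weights[..., k:k + 1] * expert(x)
        return out

# Image and text tokens share the same experts but are routed independently.
layer = ModalityRoutedMoE()
image_tokens, text_tokens = torch.randn(1, 16, 512), torch.randn(1, 16, 512)
print(layer(image_tokens, "image").shape, layer(text_tokens, "text").shape)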

Key Features

  • Omni and Novel MoE Architecture: An innovative Omni architecture based on Mixture of Experts (MoE) that achieves competitive performance across multiple modality benchmarks.

  • Video understanding: Supports dynamic KV-cache compression of visual tokens, enabling understanding of hours-long videos while still providing detailed understanding of short videos of a few seconds (see the illustrative sketch after this list).

  • Natural Speech Generation and Fine-grained Voice Dialogue: Supports dialect understanding and generation in end-to-end conversations, enables one-shot voice cloning, and enhances prosody through audio tokenizer compression.
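
As a purely illustrative picture of the idea behind compressing cached visual tokens (hypothetical function name, threshold, and shapes; this is not Ming-Lite-Omni's actual algorithm), near-duplicate neighboring key/value entries can be merged so that hour-long videos do not accumulate an unbounded cache:

import torch

def compress_visual_kv(keys, values, sim_threshold=0.95):
    # keys, values: (num_tokens, dim) cached entries for visual tokens.
    kept_k, kept_v = [keys[0]], [values[0]]
    for k, v in zip(keys[1:], values[1:]):
        # Merge a token into its predecessor when their keys are nearly identical.
        if torch.cosine_similarity(k, kept_k[-1], dim=0) > sim_threshold:
            kept_k[-1] = (kept_k[-1] + k) / 2
            kept_v[-1] = (kept_v[-1] + v) / 2
        else:
            kept_k.append(k)
            kept_v.append(v)
    return torch.stack(kept_k), torch.stack(kept_v)

keys, values = torch.randn(1024, 128), torch.randn(1024, 128)
ck, cv = compress_visual_kv(keys, values)
print(f"{keys.shape[0]} cached visual tokens -> {ck.shape[0]} after merging")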

Model Downloads

You can download the model from both Hugging Face and ModelScope.

Model | Input modality | Output modality | Download
Ming-Lite-Omni-Preview | Image, text, video, audio | Image, text, audio | 🤗 HuggingFace · 🤖 ModelScope
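
For example, the weights can be fetched programmatically. This is a minimal sketch: the Hugging Face repo id is taken from the table above, and the ModelScope repo id is assumed to mirror it.

# Download from Hugging Face
from huggingface_hub import snapshot_download
local_dir = snapshot_download(repo_id="inclusionAI/Ming-Lite-Omni-Preview")

# Or download from ModelScope (assumes the same repo id is mirrored there)
from modelscope import snapshot_download as ms_snapshot_download
local_dir = ms_snapshot_download("inclusionAI/Ming-Lite-Omni-Preview")

print(local_dir)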

Quickstart

Please download our model following Model Downloads, then you can refer to the following code to run the Ming-Lite-Omni-Preview model.

import os
import torch
from transformers import AutoProcessor
from modeling_bailingmm import BailingMMNativeForConditionalGeneration

# build model
model = BailingMMNativeForConditionalGeneration.from_pretrained(
    "inclusionAI/Ming-Lite-Omni-Preview",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True
).to("cuda")

assets_path = YOUR_ASSETS_PATH  # local directory containing the example assets (e.g. flowers.jpg, yoga.mp4)

# build processor
processor = AutoProcessor.from_pretrained("inclusionAI/Ming-Lite-Omni-Preview", trust_remote_code=True)
# text qa
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "请详细介绍鹦鹉的生活习性。"}
        ],
    },
]
# Output:

# 鹦鹉是一种非常聪明和社交性强的鸟类,它们的生活习性非常丰富和有趣。以下是一些关于鹦鹉生活习性的详细介绍:
# ### 1. **栖息地**
# 鹦鹉主要分布在热带和亚热带地区,包括非洲、亚洲、澳大利亚和南美洲。它们通常生活在森林、草原、沙漠和城市环境中。不同种类的鹦鹉对栖息地的要求有所不同,但大多数鹦鹉喜欢有丰富植被和水源的地方。
# ### 2. **饮食**
# 鹦鹉是杂食性动物,它们的饮食非常多样化。它们的食物包括种子、坚果、水果、蔬菜、花蜜和昆虫。鹦鹉的喙非常强壮,能够轻松地打开坚硬的果壳和坚果。一些鹦鹉还会吃泥土或沙子,以帮助消化和补充矿物质。
# ......
# image qa
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "image", "image": os.path.join(assets_path, "flowers.jpg")},
            {"type": "text", "text": "What kind of flower is this?"},
        ],
    },
]
# Output:

# The flowers in this image are forget-me-nots. These delicate blooms are known for their small, five-petaled flowers that come in various shades of blue, pink, and white. 

To enable thinking before responding, add the following system prompt before your question:

cot_prompt = "SYSTEM: You are a helpful assistant. When the user asks a question, your response must include two parts: first, the reasoning process enclosed in <thinking>...</thinking> tags, then the final answer enclosed in <answer>...</answer> tags. The critical answer or key result should be placed within \\boxed{}.\n"
# And your input message should be like this:
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "image", "image": os.path.join(assets_path, "reasoning.png")},
            {"type": "text", "text": cot_prompt + "In the rectangle $A B C D$ pictured, $M_{1}$ is the midpoint of $D C, M_{2}$ the midpoint of $A M_{1}, M_{3}$ the midpoint of $B M_{2}$ and $M_{4}$ the midpoint of $C M_{3}$. Determine the ratio of the area of the quadrilateral $M_{1} M_{2} M_{3} M_{4}$ to the area of the rectangle $A B C D$.\nChoices:\n(A) $\frac{7}{16}$\n(B) $\frac{3}{16}$\n(C) $\frac{7}{32}$\n(D) $\frac{9}{32}$\n(E) $\frac{1}{5}$"},
        ],
    },
]
# Output:
# <think>\nOkay, so I have this problem about a rectangle ABCD ... (thinking process omitted) ... So, the correct answer is C.\n</think>\n<answer>\\boxed{C}</answer>\n\n
# video qa
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "video", "video": os.path.join(assets_path, "yoga.mp4")},
            {"type": "text", "text": "What is the woman doing?"},
        ],
    },
]
# Output:

# The image shows a woman performing a yoga pose on a rooftop. She's in a dynamic yoga pose, with her arms and legs extended in various positions.
# multi-turn chat
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "中国的首都是哪里?"},
        ],
    },
    {
        "role": "ASSISTANT",
        "content": [
            {"type": "text", "text": "北京"},
        ],
    },
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "它的占地面积是多少?有多少常住人口?"},
        ],
    },
]
# Output:

# 北京市的总面积约为16,410.54平方公里,常住人口约为21,542,000人。
# Preparation for inference
text = processor.apply_chat_template(messages, add_generation_prompt=True)
image_inputs, video_inputs, audio_inputs = processor.process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    audios=audio_inputs,
    return_tensors="pt",
)
inputs = inputs.to(model.device)
for k in inputs.keys():
    if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats":
        inputs[k] = inputs[k].to(dtype=torch.bfloat16)

# call generate
generated_ids = model.generate(
    **inputs,
    max_new_tokens=512,
    use_cache=False,
    eos_token_id=processor.gen_terminator,
)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(output_text)
# ASR
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "text", "text": "Please recognize the language of this speech and transcribe it. Format: oral."},
            {"type": "audio", "audio": 'data/wavs/BAC009S0915W0292.wav'},
        ],
    },
]
outputs = model.generate(messages, max_new_tokens=512)
print(outputs)
# speech2speech
messages = [
    {
        "role": "HUMAN",
        "content": [
            {"type": "audio", "audio": 'data/wavs/BAC009S0915W0292.wav'},
        ],
    },
]
outputs = model.generate(messages, max_new_tokens=512, speaker='luna', output_audio_path='out.wav', output_audio=True)
print(outputs)
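
As a quick sanity check on the generated file (a minimal sketch using only the standard library; it just assumes out.wav was written by the call above):

import wave

with wave.open("out.wav", "rb") as f:
    duration = f.getnframes() / f.getframerate()
    print(f"Generated {duration:.2f}s of audio at {f.getframerate()} Hz")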

Evaluation

Image benchmark

Benchmarks Ming-Lite-Omni-Preview Qwen2.5-VL-7B-Instruct InternVL2.5-8B-MPO
AI2D 83.84 83.9 84.5
HallusionBench 54.68 51.9 51.7
MMBench_TEST_V11 79.63 84.3 82.0
MMMU 57.0 58.6 54.8
MMStar 62.0 63.9 65.2
MMVet 73.6 67.1 68.1
MathVista 69.0 68.2 67.9
OCRBench 87.9 86.4 88.2
Average 70.96 70.5 70.3

Object Recognition

Object Recognition Ming-Lite-Omni-Preview Qwen2.5-VL-7B InternVL-2.5-8B
Plants 52.1 55.3 32.8
Animals 52.6 54.8 36.5
Home appliances & furniture 93.5 97.4 90.9
Personal Electronics 96.1 95.1 93.2
Food & Ingredients 57.5 60.0 48.7
Tableware 96.6 94.9 88.1
Vehicles 31.9 40.9 31.9
Average 68.6 71.2 60.3

Video benchmark

Benchmarks Ming-Lite-Omni-Preview Qwen2.5-VL-7B
VideoMME (w/o sub. / w/ sub.) 63.9/67.6 65.1/71.6
MVBench 67.0 72.0
Video-MMMU 45.4 47.44
LongVideoBench 53.7 60.0

Audio benchmark

SpeechQA

Model AlpacaEval CommonEval SD-QA MMSU OpenBookQA IFEval AdvBench
Qwen2-Audio-chat 3.69 3.40 35.35 35.43 49.01 22.57 98.85
Baichuan-Audio 4.00 3.39 49.64 48.80 63.30 41.32 86.73
GLM-4-Voice 4.06 3.48 43.31 40.11 52.97 24.91 88.08
Kimi-Audio 4.46 3.97 63.12 62.17 83.52 61.10 100.00
Qwen2.5-Omni 4.49 3.93 55.71 61.32 81.10 52.87 99.42
Ming-Lite-Omni-Preview 4.25 3.88 58.95 46.06 60.00 46.71 96.53

ASR

Model Aishell-1 Aishell-2 iOS Wenetspeech test-net Wenetspeech test-meeting Librispeech test-clean Librispeech test-other
Whisper Large-v3 5.14 4.76 9.68 18.54 1.9 3.65
Qwen2-Audio 1.53 3.06 7.72 8.4 1.6 3.6
GLM-4-voice Base 2.46 - - - 2.82 7.66
Baichuan-Omni-1.5 - - 6.9 8.4 - -
Qwen2.5-Omni 1.18 2.36 5.9 7.7 1.8 3.4
Ming-Lite-Omni-Preview 1.62 2.82 6.23 6.9 2.34 5.74

Knowledge

Model InfoSeek_H-mean InfoSeek_unseen_question InfoSeek_unseen_entity
GPT-4o 36.05 - -
PaLI-X 22.06 23.5 20.8
Qwen2.5-VL-32B 19.35 20.55 18.28
Ming-Lite-Omni-Preview 27.3 28.9 25.9

OCR&GUI

Benchmarks Ming-Lite-Omni-Preview Qwen2.5-VL-7B-Instruct
ChartQA_TEST 85.2 87.3
DocVQA_TEST 93.2 95.7
OCRBenchV2_en/zh 52.2/51.6 56.3/57.2
OmniDocBench↓ 34.7/34.5 30.8/39.8
TextVQA_VAL 82.36 84.9
ScreenSpot 79.3 84.7

Model Sources
