Ming-Lite-Omni-Preview
Model Description
Ming-Lite-Omni-Preview employs a unified Mixture-of-Experts (MoE) framework for multimodal sequence modeling, which equips Ling LLMs with comprehensive cross-modal understanding and generation capabilities. Specifically, Ming-Lite-Omni-Preview can process arbitrary combinations of audio, video, image, and text modalities as input and generate multimodal sequences interleaving audio, image, or text outputs, enabling an advanced, interactive real-time experience. To handle the diverse modalities naturally, we enhance Ling-Lite-MoE with modality-specific routers for each modality. As a result, Ming-Lite-Omni-Preview excels at handling information from diverse modalities and is highly scalable.
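The modality-specific routing can be pictured with a minimal sketch: each modality gets its own gating network while the expert FFNs are shared across modalities. The class, dimensions, and dispatch loop below are illustrative assumptions only and do not reflect the actual Ling-Lite-MoE implementation.

import torch
import torch.nn as nn

class ModalityAwareMoE(nn.Module):
    """Toy MoE layer with one router (gating network) per modality (illustrative only)."""

    def __init__(self, hidden=512, num_experts=8, top_k=2,
                 modalities=("text", "image", "video", "audio")):
        super().__init__()
        self.top_k = top_k
        # Shared expert FFNs.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden, 4 * hidden), nn.GELU(), nn.Linear(4 * hidden, hidden))
            for _ in range(num_experts)
        ])
        # One lightweight gating network per modality.
        self.routers = nn.ModuleDict({m: nn.Linear(hidden, num_experts) for m in modalities})

    def forward(self, x, modality):
        # x: (num_tokens, hidden) tokens belonging to a single modality.
        weights, idx = torch.topk(self.routers[modality](x).softmax(dim=-1), self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

layer = ModalityAwareMoE()
print(layer(torch.randn(16, 512), "text").shape)   # torch.Size([16, 512])
print(layer(torch.randn(64, 512), "image").shape)  # torch.Size([64, 512])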
Key Features
Omni and Novel MoE Architecture: An innovative Omni architecture based on Mixture of Experts (MoE) that achieves competitive performance across multiple modality benchmarks.
Video understanding: Supports dynamic KV-Cache compression of visual tokens, enabling the understanding of hours-long videos while still providing fine-grained understanding of short clips lasting only a few seconds (a toy illustration of the compression idea follows this list).
Natural Speech Generation and Fine-grained Voice Dialogue: Supports dialect understanding and generation in end-to-end conversations, enables one-shot voice cloning, and enhances prosody through audio tokenizer compression.
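As a rough illustration of the visual-token KV-Cache compression idea referenced above, the sketch below average-pools cached keys/values of visual tokens along the sequence axis, so hour-long videos remain tractable while short clips keep a full-resolution cache. The function, shapes, and fixed pooling ratio are assumptions for illustration, not the model's actual dynamic compression policy.

import torch

def compress_visual_kv(keys, values, ratio=4):
    """Illustrative KV compression: average-pool cached visual-token keys/values.

    keys/values: (num_heads, seq_len, head_dim); `ratio` is a hypothetical fixed factor.
    """
    if ratio <= 1:
        return keys, values  # short clips: keep the full-resolution cache
    h, t, d = keys.shape
    t_trim = (t // ratio) * ratio  # drop any remainder for simplicity
    k = keys[:, :t_trim].reshape(h, t_trim // ratio, ratio, d).mean(dim=2)
    v = values[:, :t_trim].reshape(h, t_trim // ratio, ratio, d).mean(dim=2)
    return k, v

# A long video produces many visual tokens; compress their cache 4x.
keys, values = torch.randn(16, 8192, 128), torch.randn(16, 8192, 128)
k, v = compress_visual_kv(keys, values, ratio=4)
print(k.shape)  # torch.Size([16, 2048, 128])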
Model Downloads
You can download the model from both Hugging Face and ModelScope.
Model | Input modality | Output modality | Download |
---|---|---|---|
Ming-Lite-Omni-Preview | Image, text, video, audio | Image, text, audio | 🤗 HuggingFace 🤖 ModelScope |
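If you prefer to fetch the weights programmatically, the standard `huggingface_hub` API works; the snippet below is generic Hugging Face tooling rather than anything specific to this repository (ModelScope offers an analogous `snapshot_download`).

from huggingface_hub import snapshot_download

# Download weights, processor files, and remote code to a local directory.
local_dir = snapshot_download(repo_id="inclusionAI/Ming-Lite-Omni-Preview")
print(local_dir)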
Quickstart
Please download our model following Model Downloads, then refer to the following code to run the Ming-Lite-Omni-Preview model.
import os
import torch
from transformers import AutoProcessor
from modeling_bailingmm import BailingMMNativeForConditionalGeneration
# build model
model = BailingMMNativeForConditionalGeneration.from_pretrained(
"inclusionAI/Ming-Lite-Omni-Preview",
torch_dtype=torch.bfloat16,
low_cpu_mem_usage=True
).to("cuda")
assets_path = YOUR_ASSETS_PATH
# build processor
processor = AutoProcessor.from_pretrained("inclusionAI/Ming-Lite-Omni-Preview", trust_remote_code=True)
# text qa
messages = [
{
"role": "HUMAN",
"content": [
{"type": "text", "text": "请详细介绍鹦鹉的生活习性。"}
],
},
]
# Output:
# 鹦鹉是一种非常聪明和社交性强的鸟类,它们的生活习性非常丰富和有趣。以下是一些关于鹦鹉生活习性的详细介绍:
# ### 1. **栖息地**
# 鹦鹉主要分布在热带和亚热带地区,包括非洲、亚洲、澳大利亚和南美洲。它们通常生活在森林、草原、沙漠和城市环境中。不同种类的鹦鹉对栖息地的要求有所不同,但大多数鹦鹉喜欢有丰富植被和水源的地方。
# ### 2. **饮食**
# 鹦鹉是杂食性动物,它们的饮食非常多样化。它们的食物包括种子、坚果、水果、蔬菜、花蜜和昆虫。鹦鹉的喙非常强壮,能够轻松地打开坚硬的果壳和坚果。一些鹦鹉还会吃泥土或沙子,以帮助消化和补充矿物质。
# ......
# image qa
messages = [
{
"role": "HUMAN",
"content": [
{"type": "image", "image": os.path.join(assets_path, "flowers.jpg")},
{"type": "text", "text": "What kind of flower is this?"},
],
},
]
# Output:
# The flowers in this image are forget-me-nots. These delicate blooms are known for their small, five-petaled flowers that come in various shades of blue, pink, and white.
To enable thinking before responding, add the following system prompt before your question:
cot_prompt = "SYSTEM: You are a helpful assistant. When the user asks a question, your response must include two parts: first, the reasoning process enclosed in <thinking>...</thinking> tags, then the final answer enclosed in <answer>...</answer> tags. The critical answer or key result should be placed within \\boxed{}.\n"
# Your input message should then look like this:
messages = [
{
"role": "HUMAN",
"content": [
{"type": "image", "image": os.path.join(assets_path, "reasoning.png")},
{"type": "text", "text": cot_prompt + "In the rectangle $A B C D$ pictured, $M_{1}$ is the midpoint of $D C, M_{2}$ the midpoint of $A M_{1}, M_{3}$ the midpoint of $B M_{2}$ and $M_{4}$ the midpoint of $C M_{3}$. Determine the ratio of the area of the quadrilateral $M_{1} M_{2} M_{3} M_{4}$ to the area of the rectangle $A B C D$.\nChoices:\n(A) $\frac{7}{16}$\n(B) $\frac{3}{16}$\n(C) $\frac{7}{32}$\n(D) $\frac{9}{32}$\n(E) $\frac{1}{5}$"},
],
},
]
# Output:
# \<think\>\nOkay, so I have this problem about a rectangle ABCD ... (thinking process omitted) ... So, the correct answer is C.\n\</think\>\n\<answer\>\\boxed{C}\</answer\>\n\n
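To split the reasoning from the final answer programmatically, a small regex helper over the decoded text is enough. The helper below is an illustrative convenience, not part of the model's API; note that the sample output above uses <think> tags even though the prompt asks for <thinking>, so the pattern accepts both.

import re

def split_cot(text):
    """Illustrative parser for outputs produced with cot_prompt (not part of the model API)."""
    think = re.search(r"<think(?:ing)?>(.*?)</think(?:ing)?>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    boxed = re.search(r"\\boxed\{(.*?)\}", text)
    return (
        think.group(1).strip() if think else "",
        answer.group(1).strip() if answer else text.strip(),
        boxed.group(1).strip() if boxed else None,
    )

# e.g. split_cot(output_text) -> (reasoning, "\\boxed{C}", "C")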
# video qa
messages = [
{
"role": "HUMAN",
"content": [
{"type": "video", "video": os.path.join(assets_path, "yoga.mp4")},
{"type": "text", "text": "What is the woman doing?"},
],
},
]
# Output:
# The image shows a woman performing a yoga pose on a rooftop. She's in a dynamic yoga pose, with her arms and legs extended in various positions.
# multi-turn chat
messages = [
{
"role": "HUMAN",
"content": [
{"type": "text", "text": "中国的首都是哪里?"},
],
},
{
"role": "ASSISTANT",
"content": [
{"type": "text", "text": "北京"},
],
},
{
"role": "HUMAN",
"content": [
{"type": "text", "text": "它的占地面积是多少?有多少常住人口?"},
],
},
]
# Output:
# 北京市的总面积约为16,410.54平方公里,常住人口约为21,542,000人。
# Preparation for inference
text = processor.apply_chat_template(messages, add_generation_prompt=True)
image_inputs, video_inputs, audio_inputs = processor.process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
audios=audio_inputs,
return_tensors="pt",
)
inputs = inputs.to(model.device)
for k in inputs.keys():
    if k in ("pixel_values", "pixel_values_videos", "audio_feats"):
        inputs[k] = inputs[k].to(dtype=torch.bfloat16)
# call generate
generated_ids = model.generate(
**inputs,
max_new_tokens=512,
use_cache=False,
eos_token_id=processor.gen_terminator,
)
generated_ids_trimmed = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(output_text)
# ASR
messages = [
{
"role": "HUMAN",
"content": [
{"type": "text", "text": "Please recognize the language of this speech and transcribe it. Format: oral."},
{"type": "audio", "audio": 'data/wavs/BAC009S0915W0292.wav'},
],
},
]
outputs = model.generate(messages, max_new_tokens=512)
print(outputs)
# speech2speech
messages = [
{
"role": "HUMAN",
"content": [
{"type": "audio", "audio": 'data/wavs/BAC009S0915W0292.wav'},
],
},
]
outputs = model.generate(messages, max_new_tokens=512, speaker='luna', output_audio_path='out.wav', output_audio=True)
print(outputs)
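To inspect the synthesized speech, load the generated out.wav with any standard audio library; `soundfile` below is just one option and is not a dependency of the model itself.

import soundfile as sf

# Read the waveform written by the speech2speech call above.
audio, sample_rate = sf.read("out.wav")
print(audio.shape, sample_rate)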
Evaluation
Image benchmark
Benchmarks | Ming-Lite-Omni-Preview | Qwen2.5-VL-7B-Instruct | InternVL2.5-8B-MPO |
---|---|---|---|
AI2D | 83.84 | 83.9 | 84.5 |
HallusionBench | 54.68 | 51.9 | 51.7 |
MMBench_TEST_V11 | 79.63 | 84.3 | 82.0 |
MMMU | 57.0 | 58.6 | 54.8 |
MMStar | 62.0 | 63.9 | 65.2 |
MMVet | 73.6 | 67.1 | 68.1 |
MathVista | 69.0 | 68.2 | 67.9 |
OCRBench | 87.9 | 86.4 | 88.2 |
Average | 70.96 | 70.5 | 70.3 |
Object Recognition
Object Recognition | Ming-Lite-Omni-Preview | Qwen2.5-VL-7B | InternVL-2.5-8B |
---|---|---|---|
Plants | 52.1 | 55.3 | 32.8 |
Animals | 52.6 | 54.8 | 36.5 |
Home appliances & furniture | 93.5 | 97.4 | 90.9 |
Personal Electronics | 96.1 | 95.1 | 93.2 |
Food & Ingredients | 57.5 | 60.0 | 48.7 |
Tableware | 96.6 | 94.9 | 88.1 |
Vehicles | 31.9 | 40.9 | 31.9 |
Average | 68.6 | 71.2 | 60.3 |
Video benchmark
Benchmarks | Ming-Lite-Omni-Preview | Qwen2.5-VL-7B |
---|---|---|
VideoMME wo/w sub. | 63.9/67.6 | 65.1/71.6 |
MVBench | 67.0 | 72.0 |
Video-MMMU | 45.4 | 47.44 |
LongVideoBench | 53.7 | 60.0 |
Audio benchmark
SpeechQA
Model | AlpacaEval | CommonEval | SD-QA | MMSU | OpenBookQA | IFEval | AdvBench |
---|---|---|---|---|---|---|---|
Qwen2-Audio-chat | 3.69 | 3.40 | 35.35 | 35.43 | 49.01 | 22.57 | 98.85 |
Baichuan-Audio | 4.00 | 3.39 | 49.64 | 48.80 | 63.30 | 41.32 | 86.73 |
GLM-4-Voice | 4.06 | 3.48 | 43.31 | 40.11 | 52.97 | 24.91 | 88.08 |
Kimi-Audio | 4.46 | 3.97 | 63.12 | 62.17 | 83.52 | 61.10 | 100.00 |
Qwen2.5-Omni | 4.49 | 3.93 | 55.71 | 61.32 | 81.10 | 52.87 | 99.42 |
Ming-Lite-Omni-Preview | 4.25 | 3.88 | 58.95 | 46.06 | 60.00 | 46.71 | 96.53 |
ASR
Model | Aishell-1 | Aishell-2 ios | Wenetspeech test-net | Wenetspeech test-meeting | Librispeech test-clean | Librispeech test-other |
---|---|---|---|---|---|---|
Whisper Large-v3 | 5.14 | 4.76 | 9.68 | 18.54 | 1.9 | 3.65 |
Qwen2-Audio | 1.53 | 3.06 | 7.72 | 8.4 | 1.6 | 3.6 |
GLM-4-voice Base | 2.46 | - | - | - | 2.82 | 7.66 |
Baichuan-Omni-1.5 | - | - | 6.9 | 8.4 | - | - |
Qwen2.5-Omni | 1.18 | 2.36 | 5.9 | 7.7 | 1.8 | 3.4 |
Ming-Lite-Omni-Preview | 1.62 | 2.82 | 6.23 | 6.9 | 2.34 | 5.74 |
Knowledge
Model | InfoSeek_H-mean | InfoSeek_unseen_question | InfoSeek_unseen_entity |
---|---|---|---|
GPT-4o | 36.05 | - | - |
PaLI-X | 22.06 | 23.5 | 20.8 |
Qwen2.5-VL-32B | 19.35 | 20.55 | 18.28 |
Ming-Lite-Omni-Preview | 27.3 | 28.9 | 25.9 |
OCR&GUI
Benchmarks | Ming-Lite-Omni-Preview | Qwen2.5-VL-7B-Instruct |
---|---|---|
ChartQA_TEST | 85.2 | 87.3 |
DocVQA_TEST | 93.2 | 95.7 |
OCRBenchV2_en/zh | 52.2/51.6 | 56.3/57.2 |
OmniDocBench↓ | 34.7/34.5 | 30.8/39.8 |
TextVQA_VAL | 82.36 | 84.9 |
ScreenSpot | 79.3 | 84.7 |
Model Sources
- GitHub Repository: https://github.com/inclusionAI/Ming
- Base model: inclusionAI/Ling-lite