|
--- |
|
license: mit |
|
base_model: |
|
- inclusionAI/Ling-lite |
|
--- |
|
|
|
# Ming-Lite-Omni |
|
|
|
<p align="center"> |
|
<img src="./figures/ant-bailing.png" width="100"/> |
|
</p>
|
|
|
<p align="center">📑 <a href="https://arxiv.org/abs/2506.09344">Technical Report</a>|📖<a href="https://lucaria-academy.github.io/Ming-Omni/">Project Page</a> |🤗 <a href="https://huggingface.co/inclusionAI/Ming-Lite-Omni">Hugging Face</a>| 🤖 <a href="https://www.modelscope.cn/models/inclusionAI/Ming-Lite-Omni">ModelScope</a> |
|
|
|
|
|
|
|
## Introduction |
|
|
|
Ming-lite-omni is a light version of Ming-omni, derived from [Ling-lite](https://github.com/inclusionAI/Ling) and featuring 2.8 billion activated parameters. It is a unified multimodal model capable of processing images, text, audio, and video, while demonstrating strong proficiency in both speech and image generation. Ming-lite-omni employs dedicated encoders to extract tokens from the different modalities, which are then processed by Ling, an MoE architecture equipped with newly proposed modality-specific routers. This design enables a single model to efficiently process and fuse multimodal inputs within a unified framework, supporting diverse tasks without separate models, task-specific fine-tuning, or structural redesign. Importantly, Ming-lite-omni extends beyond conventional multimodal models by supporting audio and image generation. This is achieved through an advanced audio decoder for natural-sounding speech and Ming-Lite-Uni for high-quality image generation, which also allow the model to engage in context-aware chat, perform text-to-speech conversion, and carry out versatile image editing. Our experimental results show that Ming-lite-omni offers a powerful solution for unified perception and generation across all modalities.
|
Notably, Ming-lite-omni is the first open-source model we are aware of to match GPT-4o in modality support, and we release all code and model weights to encourage further research and development in the community. |
|
|
|
|
|
<p align="center"> |
|
<img src="./figures/ming.png" width="800"/> |
|
</p>
|
|
|
## 📌 Updates |
|
|
|
* [2025.06.12] 🔥 Our [Technical Report](https://arxiv.org/abs/2506.09344) is now publicly available on arXiv.
|
* [2025.05.28] 🔥 The official version of Ming-lite-omni is released, with better performance and image generation support. |
|
* [2025.05.04] 🔥 We released the preview version of Ming-lite-omni: [Ming-lite-omni-Preview](https://github.com/inclusionAI/Ming/tree/Ming-Lite-Omni-Preview).
|
|
|
|
|
## Key Features |
|
|
|
- **Unified Omni-Modality Perception**: Ming-lite-omni, built on [Ling](https://github.com/inclusionAI/Ling), an MoE architecture LLM, resolves task conflicts and ensures coherent integration of tokens from different modalities through modality-specific routers. |
|
|
|
- **Unified Perception and Generation**: Ming-lite-omni achieves unified understanding and generation, enabling the model to interpret multimodal instructions and user intent during generation, which improves generation quality and usability across multiple tasks.
|
|
|
- **Innovative Generation Capabilities**: Ming-lite-omni can perceive all modalities and generate high-quality text, real-time speech, and vivid images simultaneously, delivering exceptional cross-modal performance across diverse tasks including image perception, audio-visual interaction, and image generation. |
|
|
|
|
|
## Evaluation |
|
Ming-lite-omni delivers exceptional cross-modal performance, as validated across image perception, audio-visual interaction, and image generation tasks. In image perception, Ming-lite-omni attains performance comparable to Qwen2.5-VL-7B while activating only 2.8B parameters. It delivers superior performance in end-to-end speech understanding and instruction following, surpassing Qwen2.5-Omni and Kimi-Audio. It also supports native-resolution image generation, editing, and style transfer, achieving a GenEval score of 0.64 and outperforming mainstream models such as SDXL. In terms of FID, Ming-lite-omni reaches 4.85, setting a new state of the art among existing methods.
|
<p align="center"> |
|
<img src="./figures/performance.png" width="800"/> |
|
</p>
|
|
|
|
|
### Image benchmark |
|
<div align="center"> |
|
|
|
| Benchmarks | Ming-lite-omni | Qwen2.5-VL-7B-Instruct | InternVL2.5-8B-MPO | |
|
|:------------------|:--------------:|:----------------------------:|:------------------:| |
|
| AI2D | 83.1 | 84.4 | <b>84.5</b> | |
|
| HallusionBench | 55.0 | <b>55.8</b> | 51.7 |
|
| MMBench_TEST_V11 | 80.8 | <b>82.8</b> | 82.0 | |
|
| MMMU | 56.3 | <b>56.6</b> | 54.8 | |
|
| MMStar | 64.7 | <b>65.3</b> | 65.2 |
|
| MMVet | 71.3 | 71.6 | 68.1 | |
|
| MathVista | <b>71.6</b> | 68.1 | 67.9 | |
|
| OCRBench | <b>88.4</b> | 87.8 | 88.2 | |
|
| Average | 71.4 | <b>71.5</b> | 70.3 | |
|
|
|
</div> |
|
|
|
|
|
#### Encyclopedia Benchmarks |
|
<div align="center"> |
|
|
|
| Object Recognition | Ming-lite-omni | Qwen2.5-VL-7B-Instruct | |
|
|:---------------------|:--------------:|:------------------------:| |
|
| Plants | **54.96** | 47.8 | |
|
| Animals | **56.7** | 50.85 | |
|
| Vehicles | 41.91 | **42.29** | |
|
| Food & Ingredients | **62.28** | 54.09 | |
|
| Dishes | **44.3** | 39.07 | |
|
| General | 91.08 | **92.42** | |
|
| Average | **58.54** | 54.43 | |
|
|
|
</div> |
|
|
|
### Video benchmark |
|
|
|
<div align="center"> |
|
|
|
| Benchmarks | Ming-lite-omni | Qwen2.5-VL-7B-Instruct |
|
|:------------------------|:--------------:|:---------------------:| |
|
| VideoMME | 67.0 | <b>67.3</b> | |
|
| MVBench | <b>67.7</b> | 67.4 |
|
| Video-MMMU | 46.3 | <b>47.4</b> | |
|
| LongVideoBench | 56.6 | 54.7 | |
|
| Average | <b>59.4</b> | 59.2 | |
|
|
|
</div> |
|
Note: All models are evaluated based on 128 uniformly sampled frames. |
|
|
|
### Audio benchmark |
|
#### SpeechQA |
|
|
|
<div align="center"> |
|
|
|
| Model | Average | AlpacaEval | CommonEval | SD-QA | MMSU | OpenBookQA | IFEval | AdvBench | |
|
|:-----------------|:-------------:|:-----------:|:-----------:|:------------:|:------------:|:------------:|:------------:|:-------------:| |
|
| Qwen2-Audio-chat | 3.545 | 3.69 | 3.40 | 35.35 | 35.43 | 49.01 | 22.57 | 98.85 | |
|
| Baichuan-Audio | 3.695 | 4.00 | 3.39 | 49.64 | 48.80 | 63.30 | 41.32 | 86.73 | |
|
| GLM-4-Voice | 3.77 | 4.06 | 3.48 | 43.31 | 40.11 | 52.97 | 24.91 | 88.08 | |
|
| Kimi-Audio | 4.215 | 4.46 | 3.97 | <b>63.12</b> | <b>62.17</b> | <b>83.52</b> | <b>61.10</b> | <b>100.00</b> | |
|
| Qwen2.5-Omni | 4.21 | 4.49 | 3.93 | 55.71 | 61.32 | 81.10 | 52.87 | 99.42 | |
|
| Ming-lite-omni | <b>4.34</b> | <b>4.63</b> | <b>4.06</b> | 58.84 | 47.53 | 61.98 | 58.36 | 99.04 | |
|
</div> |
|
|
|
#### ASR |
|
|
|
<div align="center"> |
|
|
|
| Model | aishell1 | aishell2_android | aishell2_ios | cv15_zh | fleurs_zh | wenetspeech_meeting | wenetspeech_net | librispeech_test_clean | librispeech_test_other | multilingual_librispeech | cv15_en | fleurs_en | voxpopuli_v1.0_en | |
|
|:--------------:|:--------:|:----------------:|:------------:|:--------:|:---------:|:-------------------:|:---------------:|:----------------------:|:----------------------:|:------------------------:|:--------:|:---------:|:--------------------:| |
|
| Ming-lite-omni | 1.47 | **2.55** | **2.52** | 6.31 | 2.96 | 5.95 | 5.46 | 1.44 | 2.80 | **4.15** | **6.89** | **3.39** | **5.80** | |
|
| Qwen2.5-Omni | 1.18 | 2.75 | 2.63 | **5.20** | 3.00 | **5.90** | 7.70 | 1.80 | 3.40 | 7.56 | 7.60 | 4.10 | **5.80** |
|
| Qwen2-Audio | 1.53 | 2.92 | 2.92 | 6.90 | 7.50 | 7.16 | 8.42 | 1.60 | 3.60 | 5.40 | 8.60 | 6.90 | 6.84 | |
|
| Kimi-Audio | **0.60** | 2.64 | 2.56 | 7.21 | **2.69** | 6.28 | **5.37** | **1.28** | **2.42** | 5.88 | 10.31 | 4.44 | 7.97 | |
|
|
|
</div> |
|
|
|
|
|
|
|
### Information-Seeking Benchmark |
|
<div align="center"> |
|
|
|
| Model | InfoSeek_H-mean | InfoSeek_unseen_question | InfoSeek_unseen_entity | |
|
|:---------------|:---------------:|:------------------------:|:----------------------:| |
|
| GPT-4o | <b>36.05</b> | - | - | |
|
| PaLI-X | 22.06 | 23.5 | 20.8 | |
|
| Qwen2.5-VL-32B | 19.35 | 20.55 | 18.28 |
|
| Ming-lite-omni | 27.7 | **30.4** | **25.4** | |
|
</div> |
|
|
|
|
|
|
|
### OCR |
|
<div align="center"> |
|
|
|
| Model | Ming-lite-omni | Qwen2.5-VL-7B-Instruct | |
|
|:-------------------|:--------------:|:-----------------------:| |
|
| ChartQA_TEST | 85.1 | <b>87.3</b> | |
|
| DocVQA_TEST | 93 | <b>95.7</b> | |
|
| OCRBenchV2_en/zh | 53.3/52 | <b>56.3/57.2</b> | |
|
| OmniDocBench↓ | 34/<b>34.4</b> | <b>30.8</b>/39.8 | |
|
| TextVQA_VAL | 82.8 | <b>84.9</b> | |
|
</div> |
|
|
|
### GUI |
|
<div align="center"> |
|
|
|
| Model | Ming-lite-omni | InternVL3 8B | Qwen2.5-VL-7B-Instruct | |
|
|:---------------------------|:--------------:|:------------:|:----------------------:| |
|
| ScreenSpot | <b>82.1</b> | 79.5 | 78.9* | |
|
| ScreenSpot-V2 | <b>84.1</b> | 81.4 | - | |
|
| AITZ(EM) | <b>66.6</b> | - | 57.6* | |
|
</div> |
|
Note: * denotes reproduced results.
|
|
|
|
|
|
|
### Unified Generation Benchmark |
|
|
|
<div align="center"> |
|
|
|
| Model | single_object | two_object | counting | colors | position | color_attr | GENEVAL | DPGBench | FID↓ | |
|
|:---------------|:-------------:|:----------:|:----------:|:--------:|:--------:|:----------:|:--------:|:---------:|:-------------:| |
|
| Ming-lite-omni | **0.9875** | **0.7727** | **0.6812** | 0.7872 | 0.31 | 0.29 | **0.64** | 81.72 | **4.85** | |
|
| Metaquery-XL | - | - | - | - | - | - | 0.61 | **82.05** | 6.02 | |
|
| SDv2.1 | 0.98 | 0.51 | 0.44 | **0.85** | 0.07 | 0.17 | 0.50 | 68.09 | 26.96 | |
|
| Emu3-Gen | 0.98 | 0.71 | 0.34 | 0.81 | 0.17 | 0.21 | 0.54 | 80.60 | - | |
|
| SDXL | 0.98 | 0.74 | 0.39 | **0.85** | 0.15 | 0.23 | 0.55 | 74.65 | 8.76 | |
|
| Janus | 0.97 | 0.68 | 0.30 | 0.84 | **0.46** | **0.42** | 0.61 | 79.68 | 10.10 | |
|
| JanusFlow | - | - | - | - | - | - | 0.63 | 80.09 | 9.51 | |
|
|
|
</div> |
|
|
|
Please refer to our technical report for more comprehensive evaluation results. |
|
|
|
|
|
## Model Downloads |
|
|
|
You can download the model from both Hugging Face and ModelScope.
|
|
|
<div align="center"> |
|
|
|
| **Model** | **Input modality** | **Output modality** | **Download** |
|
|:---------------| :---------------------: | :---------------: |:----------------------------------------------------------------------------------------------------------------------------------------------------:| |
|
| Ming-Lite-Omni | Image, text, video, audio | Image, text, audio | [🤗 HuggingFace](https://huggingface.co/inclusionAI/Ming-Lite-Omni) <br>[🤖 ModelScope](https://www.modelscope.cn/models/inclusionAI/Ming-Lite-Omni) |
|
</div> |
|
If you are in mainland China, we strongly recommend downloading our model from 🤖 <a href="https://www.modelscope.cn/models/inclusionAI/Ming-Lite-Omni">ModelScope</a>.
|
|
|
|
|
## Use Cases |
|
|
|
Additional demonstration cases are available on our project [page](https://lucaria-academy.github.io/Ming-Omni/). |
|
|
|
|
|
|
|
|
|
## Example Usage |
|
|
|
Please download the model as described in [Model Downloads](#model-downloads), then refer to the following code to run the Ming-lite-omni model.
|
|
|
Install the Python environment dependencies:
|
```shell |
|
pip install -r requirements.txt |
|
pip install data/matcha_tts-0.0.5.1-cp38-cp38-linux_x86_64.whl |
|
pip install diffusers==0.33.0 |
|
pip install nvidia-cublas-cu12==12.4.5.8 # for H20 |
|
``` |
|
Note: We tested the following examples on NVIDIA H800-80GB hardware with CUDA 12.2. Loading inclusionAI/Ming-Lite-Omni in bfloat16 takes about 40,890 MB of GPU memory.
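Before loading the checkpoint, you can optionally confirm that a suitable GPU is visible. This is a minimal sanity check of ours, not part of the official setup:

```python
import torch

# Optional sanity check (ours, not part of the official setup): confirm a CUDA
# device is visible and report free memory before loading the ~40 GB checkpoint.
assert torch.cuda.is_available(), "The examples below expect a CUDA-capable GPU."
free_bytes, total_bytes = torch.cuda.mem_get_info()
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"Free GPU memory: {free_bytes / 1024**3:.1f} GiB of {total_bytes / 1024**3:.1f} GiB")
```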
|
|
|
|
|
```python |
|
import os |
|
import torch |
|
from transformers import AutoProcessor, GenerationConfig |
|
from modeling_bailingmm import BailingMMNativeForConditionalGeneration |
|
|
|
# build model |
|
model = BailingMMNativeForConditionalGeneration.from_pretrained( |
|
"inclusionAI/Ming-Lite-Omni", |
|
torch_dtype=torch.bfloat16, |
|
low_cpu_mem_usage=True |
|
).to("cuda") |
|
|
|
assets_path = YOUR_ASSETS_PATH  # path to your local assets (images / videos / audio) used in the examples below
|
|
|
# build processor |
|
processor = AutoProcessor.from_pretrained("inclusionAI/Ming-Lite-Omni", trust_remote_code=True) |
|
``` |
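As a quick check that the weights loaded as expected, you can print the parameter count and dtype. This is an illustrative snippet of ours; the total count includes all MoE experts, while only about 2.8B parameters are activated per token:

```python
# Illustrative check (ours): report total parameters and the loaded dtype.
num_params = sum(p.numel() for p in model.parameters())
print(f"Loaded {num_params / 1e9:.1f}B total parameters in {next(model.parameters()).dtype}")
```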
|
|
|
```python |
|
# qa |
|
messages = [ |
|
{ |
|
"role": "HUMAN", |
|
"content": [ |
|
{"type": "text", "text": "请详细介绍鹦鹉的生活习性。"} |
|
], |
|
}, |
|
] |
|
# Output: |
|
|
|
# 鹦鹉是一种非常聪明和社交性强的鸟类,它们的生活习性非常丰富和有趣。以下是一些关于鹦鹉生活习性的详细介绍: |
|
# ### 1. **栖息地** |
|
# 鹦鹉主要分布在热带和亚热带地区,包括非洲、亚洲、澳大利亚和南美洲。它们通常生活在森林、草原、沙漠和城市环境中。不同种类的鹦鹉对栖息地的要求有所不同,但大多数鹦鹉喜欢有丰富植被和水源的地方。 |
|
# ### 2. **饮食** |
|
# 鹦鹉是杂食性动物,它们的饮食非常多样化。它们的食物包括种子、坚果、水果、蔬菜、花蜜和昆虫。鹦鹉的喙非常强壮,能够轻松地打开坚硬的果壳和坚果。一些鹦鹉还会吃泥土或沙子,以帮助消化和补充矿物质。 |
|
# ...... |
|
|
|
``` |
|
|
|
```python |
|
# image qa |
|
messages = [ |
|
{ |
|
"role": "HUMAN", |
|
"content": [ |
|
{"type": "image", "image": os.path.join(assets_path, "flowers.jpg")}, |
|
{"type": "text", "text": "What kind of flower is this?"}, |
|
], |
|
}, |
|
] |
|
# Output: |
|
|
|
# The flowers in this image are forget-me-nots. These delicate blooms are known for their small, five-petaled flowers that come in various shades of blue, pink, and white. |
|
``` |
|
|
|
To enable thinking before responding, add the following system prompt before your question:
|
|
|
```python |
|
cot_prompt = "SYSTEM: You are a helpful assistant. When the user asks a question, your response must include two parts: first, the reasoning process enclosed in <thinking>...</thinking> tags, then the final answer enclosed in <answer>...</answer> tags. The critical answer or key result should be placed within \\boxed{}.\n" |
|
# And your input message should be like this: |
|
messages = [ |
|
{ |
|
"role": "HUMAN", |
|
"content": [ |
|
{"type": "image", "image": os.path.join(assets_path, "reasoning.png")}, |
|
{"type": "text", "text": cot_prompt + "In the rectangle $A B C D$ pictured, $M_{1}$ is the midpoint of $D C, M_{2}$ the midpoint of $A M_{1}, M_{3}$ the midpoint of $B M_{2}$ and $M_{4}$ the midpoint of $C M_{3}$. Determine the ratio of the area of the quadrilateral $M_{1} M_{2} M_{3} M_{4}$ to the area of the rectangle $A B C D$.\nChoices:\n(A) $\frac{7}{16}$\n(B) $\frac{3}{16}$\n(C) $\frac{7}{32}$\n(D) $\frac{9}{32}$\n(E) $\frac{1}{5}$"}, |
|
], |
|
}, |
|
] |
|
# Output: |
|
# <think>\nOkay, so I have this problem about a rectangle ABCD ... (thinking process omitted) ... So, the correct answer is C.\n</think>\n<answer>\\boxed{C}</answer>\n\n
|
``` |
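If you need the final answer programmatically, a small parsing helper like the one below can strip the reasoning away. This is our own sketch, not part of the repository; it looks only for the `<answer>...</answer>` block and the `\boxed{}` value, so it works whether the reasoning is wrapped in `<thinking>` or `<think>` tags (the sample output above uses the latter).

```python
import re

def extract_answer(response: str) -> str:
    """Pull the boxed answer out of a thinking-mode response (illustrative helper)."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    answer = match.group(1) if match else response
    boxed = re.search(r"\\boxed\{(.*?)\}", answer)
    return (boxed.group(1) if boxed else answer).strip()

# e.g. extract_answer(output_text) -> "C" for the example above
```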
|
|
|
```python |
|
# video qa |
|
messages = [ |
|
{ |
|
"role": "HUMAN", |
|
"content": [ |
|
{"type": "video", "video": os.path.join(assets_path, "yoga.mp4")}, |
|
{"type": "text", "text": "What is the woman doing?"}, |
|
], |
|
}, |
|
] |
|
# Output: |
|
|
|
# The image shows a woman performing a yoga pose on a rooftop. She's in a dynamic yoga pose, with her arms and legs extended in various positions. |
|
|
|
``` |
|
|
|
```python |
|
# multi-turn chat |
|
messages = [ |
|
{ |
|
"role": "HUMAN", |
|
"content": [ |
|
{"type": "text", "text": "中国的首都是哪里?"}, |
|
], |
|
}, |
|
{ |
|
"role": "ASSISTANT", |
|
"content": [ |
|
{"type": "text", "text": "北京"}, |
|
], |
|
}, |
|
{ |
|
"role": "HUMAN", |
|
"content": [ |
|
{"type": "text", "text": "它的占地面积是多少?有多少常住人口?"}, |
|
], |
|
}, |
|
] |
|
# Output: |
|
|
|
# 北京市的总面积约为16,410.54平方公里,常住人口约为21,542,000人。
# (Beijing covers a total area of about 16,410.54 square kilometers and has a permanent population of about 21,542,000.)
|
``` |
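To continue a conversation, append the model's reply as an `ASSISTANT` turn and then add the next `HUMAN` turn, following the message format above. A small convenience helper of ours (hypothetical, not part of the repository):

```python
def append_turn(messages, assistant_text, next_question):
    """Append the previous reply and the next user question to a conversation (illustrative)."""
    messages.append({"role": "ASSISTANT", "content": [{"type": "text", "text": assistant_text}]})
    messages.append({"role": "HUMAN", "content": [{"type": "text", "text": next_question}]})
    return messages
```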
|
|
|
```python |
|
# Preparation for inference |
|
text = processor.apply_chat_template(messages, add_generation_prompt=True) |
|
image_inputs, video_inputs, audio_inputs = processor.process_vision_info(messages) |
|
inputs = processor( |
|
text=[text], |
|
images=image_inputs, |
|
videos=video_inputs, |
|
audios=audio_inputs, |
|
return_tensors="pt", |
|
) |
|
inputs = inputs.to(model.device) |
|
for k in inputs.keys(): |
|
if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats": |
|
inputs[k] = inputs[k].to(dtype=torch.bfloat16) |
|
|
|
# call generate |
|
generation_config = GenerationConfig.from_dict({'no_repeat_ngram_size': 10}) |
|
generated_ids = model.generate( |
|
**inputs, |
|
max_new_tokens=512, |
|
use_cache=True, |
|
eos_token_id=processor.gen_terminator, |
|
generation_config=generation_config, |
|
) |
|
generated_ids_trimmed = [ |
|
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids) |
|
] |
|
output_text = processor.batch_decode( |
|
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False |
|
)[0] |
|
print(output_text) |
|
``` |
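For convenience, the template, processing, generation, and decoding steps above can be bundled into a single function so each of the text, image, and video examples runs with one call. This is an optional wrapper of ours, not part of the repository:

```python
def chat(messages, max_new_tokens=512):
    """Run one round of generation for the given messages (convenience wrapper, ours)."""
    text = processor.apply_chat_template(messages, add_generation_prompt=True)
    image_inputs, video_inputs, audio_inputs = processor.process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        audios=audio_inputs,
        return_tensors="pt",
    ).to(model.device)
    # Cast visual/audio features to bfloat16, matching the model weights.
    for k in inputs.keys():
        if k in ("pixel_values", "pixel_values_videos", "audio_feats"):
            inputs[k] = inputs[k].to(dtype=torch.bfloat16)
    generated_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        use_cache=True,
        eos_token_id=processor.gen_terminator,
        generation_config=GenerationConfig.from_dict({"no_repeat_ngram_size": 10}),
    )
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
    return processor.batch_decode(
        trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]

# Example: print(chat(messages))
```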
|
|
|
### Audio tasks |
|
|
|
```python |
|
# ASR |
|
messages = [ |
|
{ |
|
"role": "HUMAN", |
|
"content": [ |
|
{"type": "text", "text": "Please recognize the language of this speech and transcribe it. Format: oral."}, |
|
{"type": "audio", "audio": 'data/wavs/BAC009S0915W0292.wav'}, |
|
], |
|
}, |
|
] |
|
# The ASR task uses the whisper encoder, so the input construction shown above needs the extra arguments below:
|
inputs = processor( |
|
text=[text], |
|
images=image_inputs, |
|
videos=video_inputs, |
|
audios=audio_inputs, |
|
return_tensors="pt", |
|
audio_kwargs={'use_whisper_encoder': True} |
|
) |
|
|
|
outputs = model.generate( |
|
**inputs, |
|
max_new_tokens=512, |
|
use_cache=True, |
|
eos_token_id=processor.gen_terminator, |
|
generation_config=generation_config, |
|
use_whisper_encoder=True |
|
) |
|
|
|
``` |
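Decoding the transcript then follows the same pattern as the earlier text examples (a short sketch; here `outputs` is the id tensor returned by `model.generate`):

```python
# Decode the ASR transcript (same trimming/decoding pattern as above).
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, outputs)
]
transcript = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(transcript)
```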
|
|
|
```python |
|
# speech2speech |
|
messages = [ |
|
{ |
|
"role": "HUMAN", |
|
"content": [ |
|
{"type": "audio", "audio": 'data/wavs/speechQA_sample.wav'}, |
|
], |
|
}, |
|
] |
|
generation_config = GenerationConfig.from_dict({ |
|
'output_hidden_states': True, |
|
'return_dict_in_generate': True, |
|
'no_repeat_ngram_size': 10} |
|
) |
|
|
|
outputs = model.generate( |
|
**inputs, |
|
max_new_tokens=512, |
|
use_cache=True, |
|
eos_token_id=processor.gen_terminator, |
|
generation_config=generation_config, |
|
use_whisper_encoder=False |
|
) |
|
|
|
generated_ids = outputs.sequences |
|
generated_ids_trimmed = [ |
|
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids) |
|
] |
|
|
|
# speechQA result |
|
output_text = processor.batch_decode( |
|
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False |
|
)[0] |
|
|
|
# for TTS |
|
from modeling_bailing_talker import AudioDetokenizer |
|
|
|
model_name_or_path = model.config._name_or_path |
|
audio_detokenizer = AudioDetokenizer( |
|
f'{model_name_or_path}/talker/audio_detokenizer.yaml', |
|
flow_model_path=f'{model_name_or_path}/talker/flow.pt', |
|
hifigan_model_path=f'{model_name_or_path}/talker/hift.pt' |
|
) |
|
spk_input = torch.load('data/spks/luna.pt') |
|
thinker_reply_part = outputs.hidden_states[0][0] + outputs.hidden_states[0][-1] |
|
# Setting thinker_reply_part to None allows the talker to operate as a standalone TTS model, independent of the language model. |
|
audio_tokens = model.talker.omni_audio_generation( |
|
output_text, |
|
thinker_reply_part=thinker_reply_part, **spk_input) |
|
waveform = audio_detokenizer.token2wav(audio_tokens, save_path='out.wav', **spk_input) |
|
|
|
``` |
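As noted in the comment above, the talker can also run as a standalone TTS model by passing `thinker_reply_part=None`. A minimal sketch under that assumption (the input text is our own example):

```python
# Standalone TTS (illustrative): synthesize arbitrary text without conditioning
# on the language model's hidden states, per the note above.
tts_text = "Hello, this is a test of the standalone TTS mode."  # hypothetical input text
audio_tokens = model.talker.omni_audio_generation(
    tts_text, thinker_reply_part=None, **spk_input
)
waveform = audio_detokenizer.token2wav(audio_tokens, save_path='tts_out.wav', **spk_input)
```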
|
For detailed usage of the ASR, SpeechQA, and TTS tasks, please refer to `test_audio_tasks.py`.
|
|
|
### Image Generation & Edit |
|
|
|
Ming-lite-omni natively supports image generation and image editing. To use these capabilities, you only need to add the corresponding parameters to the `generate` function.
|
|
|
```python |
|
# Image generation mode currently limits the range of input pixels. |
|
gen_input_pixels = 451584 |
|
processor.max_pixels = gen_input_pixels |
|
processor.min_pixels = gen_input_pixels |
|
|
|
def generate(messages, processor, model, **image_gen_param): |
|
text = processor.apply_chat_template(messages, add_generation_prompt=True) |
|
image_inputs, video_inputs, audio_inputs = processor.process_vision_info(messages) |
|
|
|
inputs = processor( |
|
text=[text], |
|
images=image_inputs, |
|
videos=video_inputs, |
|
audios=audio_inputs, |
|
return_tensors="pt", |
|
).to(model.device) |
|
|
|
for k in inputs.keys(): |
|
if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats": |
|
inputs[k] = inputs[k].to(dtype=torch.bfloat16) |
|
|
|
print(image_gen_param) |
|
image = model.generate( |
|
**inputs, |
|
image_gen=True, |
|
**image_gen_param, |
|
) |
|
return image |
|
|
|
``` |
|
|
|
Text-to-image |
|
```python |
|
messages = [ |
|
{ |
|
"role": "HUMAN", |
|
"content": [ |
|
{"type": "text", "text": "Draw a girl with short hair."}, |
|
], |
|
} |
|
] |
|
image = generate( |
|
messages=messages, processor=processor, model=model, |
|
image_gen_cfg=6.0, image_gen_steps=20, image_gen_width=480, image_gen_height=544 |
|
) |
|
image.save("./t2i.jpg") |
|
``` |
|
|
|
Edit |
|
```python |
|
messages = [ |
|
{ |
|
"role": "HUMAN", |
|
"content": [ |
|
{"type": "image", "image": "samples/cake.jpg"}, |
|
{"type": "text", "text": "add a candle on top of the cake"}, |
|
], |
|
} |
|
] |
|
image = generate( |
|
messages=messages, processor=processor, model=model, |
|
image_gen_cfg=6.0, image_gen_steps=20, image_gen_width=512, image_gen_height=512 |
|
) |
|
image.save("./edit.jpg") |
|
``` |
|
|
|
|
|
## License and Legal Disclaimer |
|
|
|
This code repository is licensed under the [MIT License](../LICENSE); the legal disclaimer is located in the [LEGAL.md](../LEGAL.md) file in the project's root directory.
|
|
|
## Citation |
|
|
|
If you find our work helpful, please consider citing it:
|
|
|
```bibtex |
|
|
|
@misc{Mingomni2025, |
|
title = {Ming-Omni: A Unified Multimodal Model for Perception and Generation}, |
|
author = {Inclusion AI}, |
|
year = {2025}, |
|
eprint = {2506.09344}, |
|
archivePrefix = {arXiv}, |
|
url = {https://arxiv.org/abs/2506.09344} |
|
} |
|
``` |
|
|
|
|