Ming-UniAudio
Technical Report ｜ Project Page ｜ Hugging Face ｜ ModelScope
Introduction
Ming-UniAudio is a novel framework that unifies speech understanding, generation, and editing. At its core is a unified continuous speech tokenizer that effectively fuses semantic and acoustic features within an end-to-end model. On top of this unified continuous audio tokenizer, we developed a speech language model that strikes a balance between generation and understanding capabilities. Leveraging this foundational model, which exhibits robust performance in both domains, we further trained a dedicated speech editing model built upon Ming-Lite-Omni. Crucially, Ming-UniAudio is the first to enable universal, free-form speech editing guided solely by natural language instructions, handling complex semantic and acoustic modifications without manual region specification.
- 🔥 First unified continuous speech tokenizer for both understanding and generation tasks: MingTok-Audio
- 🔥 First Speech LLM with a unified continuous tokenizer for both understanding and generation: Ming-UniAudio
- 🔥 First universal free-form speech editing model for various semantic and acoustic editing tasks without any temporal region specification: Ming-UniAudio-Edit
- 🔥 First benchmark for free-form speech editing: Ming-Freeform-Audio-Edit-Benchmark
Updates
- [2025.09.30] 🔥 We release Ming-UniAudio with significant improvements across speech understanding, generation, and free-form editing tasks.
Key Features
Compared to other audio-assisted LLMs, Ming-UniAudio offers the following key optimizations:
- Unified Continuous Speech Tokenizer: Ming-UniAudio introduces MingTok-Audio, a unified continuous speech tokenizer built on a VAE framework with a causal Transformer architecture. It is the first continuous speech tokenizer to effectively integrate semantic and acoustic features, and its hierarchical feature representations enable a closed-loop system with LLMs, making it suitable for both understanding and generation tasks (a conceptual sketch follows this feature list).
- Unified Speech Language Model for Generation and Understanding: We pretrain an end-to-end unified speech language model with a single LLM backbone for both understanding and generation tasks, enhanced with a Diffusion Head to ensure high-fidelity speech synthesis.
- Instruction-Guided Free-Form Speech Editing: We introduce the first instruction-guided, free-form speech editing framework that supports comprehensive semantic and acoustic edits without requiring explicit edit regions, along with Ming-Freeform-Audio-Edit, the first open-source evaluation set for such tasks.
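The closed loop described above can be pictured as follows. This is a conceptual sketch only: every class and method name here is an illustrative placeholder, not the released MingTok-Audio / Ming-UniAudio API. It shows how continuous latents from the tokenizer feed the LLM for understanding, while the diffusion head predicts latents that the same tokenizer decodes back into a waveform.

```python
# Conceptual sketch only: all names below are hypothetical placeholders,
# NOT the released Ming-UniAudio / MingTok-Audio API.
import torch


class ContinuousTokenizerSketch:
    """Stand-in for MingTok-Audio: a VAE-style tokenizer with a causal Transformer."""

    def encode(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform [T] -> continuous latent patches [N, D] (placeholder output)
        return torch.zeros(waveform.shape[-1] // 1600, 128)

    def decode(self, latents: torch.Tensor) -> torch.Tensor:
        # continuous latent patches [N, D] -> waveform [T] (placeholder output)
        return torch.zeros(latents.shape[0] * 1600)


class SpeechLMSketch:
    """Stand-in for the unified LLM backbone with a diffusion head."""

    def understand(self, latents: torch.Tensor) -> str:
        return "transcribed text"           # understanding path: latents -> text

    def generate(self, text: str) -> torch.Tensor:
        return torch.zeros(len(text), 128)  # generation path: text -> latents via diffusion head


# Closed loop: the same continuous latent space serves both directions.
tokenizer, lm = ContinuousTokenizerSketch(), SpeechLMSketch()
transcript = lm.understand(tokenizer.encode(torch.zeros(16000)))  # speech -> text
speech = tokenizer.decode(lm.generate("hello"))                   # text -> speech
```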
Evaluation
In various benchmark tests, Ming-UniAudio demonstrates highly competitive results compared to industry-leading models of similar scale.
Speech Understanding
All values are percentages; lower is better.

| Datasets | Model | Speech-English (WER / NE-WER / NE-FNR) | Dialogue-English (WER / NE-WER / NE-FNR) | Speech-Mandarin (WER / NE-WER / NE-FNR) | Dialogue-Mandarin (WER / NE-WER / NE-FNR) |
|---|---|---|---|---|---|
| Understanding (ContextASR) | Qwen2-Audio | 11.49 / 27.27 / 35.08 | 13.99 / 33.02 / 32.92 | 9.92 / 24.10 / 30.02 | 7.00 / 22.76 / 26.17 |
| | Baichuan-Audio | 7.52 / 5.87 / 4.55 | 5.66 / 10.01 / 3.64 | 2.16 / 6.65 / 2.35 | 2.96 / 11.48 / 3.94 |
| | Kimi-Audio | 2.90 / 6.68 / 8.01 | 4.67 / 13.50 / 11.31 | 1.95 / 11.13 / 15.28 | 2.90 / 15.91 / 16.68 |
| | Baichuan-Omni-1.5 | 8.16 / 7.69 / 6.53 | 9.91 / 14.40 / 5.54 | 2.98 / 8.39 / 4.71 | 5.00 / 16.83 / 7.84 |
| | Qwen2.5-Omni-3B | 3.99 / 7.80 / 9.69 | 4.83 / 14.36 / 12.85 | 2.13 / 10.55 / 14.11 | 3.12 / 15.07 / 15.17 |
| | Qwen2.5-Omni-7B | 3.96 / 7.38 / 8.72 | 5.32 / 11.83 / 9.24 | 1.84 / 9.80 / 12.19 | 2.40 / 14.06 / 13.17 |
| | Ming-UniAudio-16B-A3B-Edit (ours) | 4.00 / 3.56 / 3.69 | 5.34 / 8.73 / 2.53 | 1.58 / 5.98 / 2.40 | 3.04 / 9.50 / 1.48 |
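In the table above, WER is the standard word error rate between the reference transcript and the model output, while NE-WER and NE-FNR are the benchmark's named-entity variants. If you want to compute the basic WER for your own transcriptions, a minimal sketch using the third-party `jiwer` package (an assumption here; it is not a dependency of this repo) looks like this:

```python
# Minimal WER sketch with the third-party `jiwer` package (pip install jiwer).
# Illustrative only; the numbers above come from the benchmark's own pipeline.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# WER = (substitutions + deletions + insertions) / reference word count
wer = jiwer.wer(reference, hypothesis)
print(f"WER: {wer * 100:.2f}%")
```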
Speech Editing
All results below are for Ming-UniAudio-16B-A3B-Edit; "zh / en" cells report Mandarin and English results, respectively.

Semantic editing:

| Task | WER(%) zh / en | ACC zh / en | SIM zh / en | no-edit WER(%) zh / en |
|---|---|---|---|---|
| Deletion-basic | 11.89 / 14.85 | 100 / 82.22 | 0.78 / 0.76 | 11.49 / 24.26 |
| Deletion | 22.92 / 27.60 | 82.92 / 85 | 0.81 / 0.74 | 17.50 / 35.21 |
| Insertion-basic | 3.42 / 6.63 | 80 / 71.43 | 0.83 / 0.79 | 3.52 / 17.70 |
| Insertion | 3.89 / 7.592 | 79.31 / 62.31 | 0.83 / 0.79 | 4.10 / 18.84 |
| Substitution-basic | 4.52 / 8.99 | 78.62 / 59.78 | 0.82 / 0.78 | 4.63 / 19.28 |
| Substitution | 4.56 / 7.64 | 76.62 / 65.62 | 0.83 / 0.77 | 4.75 / 18.39 |
| Dialect Conversion | 8.93 | 0.50 | 0.66 | - |

Acoustic editing:

| Task | WER(%) zh / en | SIM zh / en | RDE(%) zh / en | RAE(%) zh / en |
|---|---|---|---|---|
| Speed changing | 5.88 / 17.53 | 0.66 / 0.57 | 6.36 / 5.92 | - |
| Pitch changing | 7.45 / 13.37 | 0.36 / 0.24 | - | - |
| Volume changing | 1.71 / 1.35 | 0.86 / 0.80 | - | 14.9 / 11.7 |
Denoise
| Datasets | Model | Model Type | DNSMOS OVRL | DNSMOS SIG | DNSMOS BAK |
|---|---|---|---|---|---|
| Denoise | FullSubNet | specialized | 2.93 | 3.05 | 3.51 |
| | Inter-Subnet | specialized | 2.98 | 3.17 | 3.15 |
| | CDiffuSE | specialized | 2.84 | 3.37 | 3.52 |
| | SGMSE | specialized | 3.11 | 3.47 | 3.41 |
| | StoRM | specialized | 3.15 | 3.54 | 3.69 |
| | GenSE | specialized | 3.43 | 3.65 | 4.18 |
| | MiMo-Audio | general | 3.30 | 3.56 | 4.10 |
| | Ming-UniAudio-16B-A3B-Edit (ours) | general | 3.26 | 3.59 | 3.97 |
Model & Benchmark Downloads
You can download our latest models and benchmark from both Hugging Face and ModelScope.
| Type | Model | Input modality | Output modality | Download |
|---|---|---|---|---|
| Tokenizer | MingTok-Audio | audio | audio | HuggingFace / ModelScope |
| SpeechLLM | Ming-UniAudio-16B-A3B | audio | audio | HuggingFace / ModelScope |
| SpeechLLM | Ming-UniAudio-16B-A3B-Edit | text, audio | text, audio | HuggingFace / ModelScope |
| Benchmark | Ming-Freeform-Audio-Edit | - | - | HuggingFace / ModelScope / Eval tools |
```shell
pip install modelscope
modelscope download --model inclusionAI/Ming-UniAudio-16B-A3B-Edit --local_dir inclusionAI/Ming-UniAudio-16B-A3B-Edit --revision master
```
Note: This download process will take several minutes to several hours, depending on your network conditions.
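If you would rather pull the weights from Hugging Face, a minimal sketch using the `huggingface_hub` package is shown below; the repo id is assumed to mirror the ModelScope id above.

```python
# Sketch: download the checkpoint from Hugging Face instead of ModelScope.
# Requires `pip install huggingface_hub`; the repo id is assumed to match
# the ModelScope id used above. Adjust local_dir as needed.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="inclusionAI/Ming-UniAudio-16B-A3B-Edit",
    local_dir="inclusionAI/Ming-UniAudio-16B-A3B-Edit",
)
```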
Use Cases
Additional demonstration cases are available on our project page.
Environment Preparation
Installation with pip
```shell
pip install -r requirements.txt
```
Installation with docker
You can also initialize the environment by building the docker image. First clone this repository:
```shell
git clone --depth 1 https://github.com/inclusionAI/Ming-UniAudio
cd Ming-UniAudio
```
Then build the docker image with the provided Dockerfile in `docker/docker-py310-cu121`. This step might take a while:
```shell
docker build -t ming:py310-cu121 docker/docker-py310-cu121
```
At last, start the container with the current repo directory mounted:
```shell
docker run -it --gpus all -v "$(pwd)":/workspace/Ming-UniAudio ming:py310-cu121 /bin/bash
```
You can run the model with the Python interface. You may download the Hugging Face model into the repo directory first (.../Ming-UniAudio/) or mount the downloaded model path when starting the container.
Example Usage
We provide a step-by-step running example:
Step 1 - Download the source code
```shell
git clone https://github.com/inclusionAI/Ming-UniAudio
cd Ming-UniAudio
```
Step 2 - Download the Ming-UniAudio model weights and create a soft link to the source code directory
Download our model following Model & Benchmark Downloads
```shell
mkdir inclusionAI
ln -s /path/to/inclusionAI/Ming-UniAudio-16B-A3B-Edit inclusionAI/Ming-UniAudio-16B-A3B-Edit
```
Step 3 - Enter the code directory and refer to the following code to run the Ming-UniAudio model.
```shell
jupyter notebook cookbooks/demo.ipynb
```
We also provide a simple example of how to use this repo below. For detailed usage, please refer to demobook.ipynb.
```python
import warnings
import torch
from transformers import AutoProcessor
from modeling_bailingmm import BailingMMNativeForConditionalGeneration
import random
import numpy as np
from loguru import logger


def seed_everything(seed=1895):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


seed_everything()
warnings.filterwarnings("ignore")


class MingAudio:
    def __init__(self, model_path, device="cuda:0"):
        self.device = device
        self.model = BailingMMNativeForConditionalGeneration.from_pretrained(
            model_path,
            torch_dtype=torch.bfloat16,
            low_cpu_mem_usage=True,
        ).eval().to(torch.bfloat16).to(self.device)
        self.processor = AutoProcessor.from_pretrained(".", trust_remote_code=True)
        self.tokenizer = self.processor.tokenizer
        self.sample_rate = self.processor.audio_processor.sample_rate
        self.patch_size = self.processor.audio_processor.patch_size

    def speech_understanding(self, messages):
        text = self.processor.apply_chat_template(messages, add_generation_prompt=True)
        image_inputs, video_inputs, audio_inputs = self.processor.process_vision_info(messages)
        inputs = self.processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            audios=audio_inputs,
            return_tensors="pt",
        ).to(self.device)

        for k in inputs.keys():
            if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats":
                inputs[k] = inputs[k].to(dtype=torch.bfloat16)
        logger.info(f"input: {self.tokenizer.decode(inputs['input_ids'].cpu().numpy().tolist()[0])}")

        generated_ids = self.model.generate(
            **inputs,
            max_new_tokens=512,
            eos_token_id=self.processor.gen_terminator,
        )
        generated_ids_trimmed = [
            out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
        ]
        output_text = self.processor.batch_decode(
            generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )[0]

        return output_text

    def speech_generation(
        self,
        text,
        prompt_wav_path,
        prompt_text,
        lang='zh',
        output_wav_path='out.wav'
    ):
        waveform = self.model.generate_tts(
            text=text,
            prompt_wav_path=prompt_wav_path,
            prompt_text=prompt_text,
            patch_size=self.patch_size,
            tokenizer=self.tokenizer,
            lang=lang,
            output_wav_path=output_wav_path,
            sample_rate=self.sample_rate,
            device=self.device
        )

        return waveform

    def speech_edit(
        self,
        messages,
        output_wav_path='out.wav'
    ):
        text = self.processor.apply_chat_template(messages, add_generation_prompt=True)
        image_inputs, video_inputs, audio_inputs = self.processor.process_vision_info(messages)
        inputs = self.processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            audios=audio_inputs,
            return_tensors="pt",
        ).to(self.device)

        ans = torch.tensor([self.tokenizer.encode('<answer>')]).to(inputs['input_ids'].device)
        inputs['input_ids'] = torch.cat([inputs['input_ids'], ans], dim=1)
        attention_mask = inputs['attention_mask']
        inputs['attention_mask'] = torch.cat((attention_mask, attention_mask[:, :1]), dim=-1)

        for k in inputs.keys():
            if k == "pixel_values" or k == "pixel_values_videos" or k == "audio_feats":
                inputs[k] = inputs[k].to(dtype=torch.bfloat16)
        logger.info(f"input: {self.tokenizer.decode(inputs['input_ids'].cpu().numpy().tolist()[0])}")

        edited_speech, edited_text = self.model.generate_edit(
            **inputs,
            tokenizer=self.tokenizer,
            output_wav_path=output_wav_path
        )

        return edited_speech, edited_text


if __name__ == "__main__":
    model = MingAudio("inclusionAI/Ming-UniAudio-16B-A3B-Edit")

    # Edit
    messages = [
        {
            "role": "HUMAN",
            "content": [
                {"type": "audio", "audio": "data/wavs/00004768-00000024.wav", "target_sample_rate": 16000},
                {
                    "type": "text",
                    "text": "<prompt>Please recognize the language of this speech and transcribe it. And insert '实现' before the character or word at index 3.\n</prompt>",
                },
            ],
        },
    ]

    response = model.speech_edit(messages=messages)
    logger.info(f"Generated Response: {response}")
```
Note: We tested the examples on NVIDIA H800-80GB / H20-96G hardware with CUDA 12.4.
Citation
If you find our work helpful, please consider citing it.