Putting Qwen2.5-Omni to Work: Practical Examples
Qwen2.5-Omni stands out for its ability to understand and generate content across text, images, audio, and video. But how does this translate into practical use cases? The official Qwen2.5-Omni cookbooks provide excellent, hands-on demonstrations. This article walks through several key examples, showing step-by-step how to leverage this powerful multimodal model.
Core Setup: Installation and Model Loading
Before diving into specific examples, let's cover the essential setup required for all scenarios.
Installation: You'll need the transformers library (specifically the preview version supporting Qwen2.5-Omni), qwen-omni-utils for handling multimodal inputs easily, torch, accelerate for efficient loading, and soundfile for audio handling. Using the [decord] extra for qwen-omni-utils is recommended for faster video loading.

```bash
# Recommended installation steps
pip install transformers accelerate torch soundfile "qwen-omni-utils[decord]" -U

# Install the specific transformers preview build with Qwen2.5-Omni support
pip install git+https://github.com/huggingface/transformers@v4.51.3-Qwen2.5-Omni-preview
```
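Before loading the model, it can help to confirm the environment imports cleanly. This quick check is not part of the cookbooks; it just verifies that the packages above are visible to Python.

```python
# Optional sanity check (not from the cookbooks): confirm the key packages import cleanly
import torch
import transformers
import soundfile
import qwen_omni_utils  # provides process_mm_info

print("transformers:", transformers.__version__)
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("soundfile:", soundfile.__version__)
```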
Model Loading: Load the Qwen2.5-Omni model (we'll use the 7B variant here) and its processor. For efficiency, especially with the 7B model, use bfloat16 precision and enable Flash Attention 2 if your hardware supports it (NVIDIA Ampere or newer). device_map="auto" helps distribute the model across available GPUs.

```python
import torch
import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

# Define model identifier
model_path = "Qwen/Qwen2.5-Omni-7B"

print("Loading model and processor...")

# Load with optimizations
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,              # Use BF16
    device_map="auto",                       # Auto-distribute across GPUs
    attn_implementation="flash_attention_2"  # Use Flash Attention 2
)
processor = Qwen2_5OmniProcessor.from_pretrained(model_path)

print("Setup complete. Model and processor ready.")
```
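If your GPU is older than Ampere or flash-attn isn't installed, the same call works with PyTorch's built-in attention instead. The sketch below assumes the same model_path as above; the commented-out disable_talker() line is optional and only relevant if you never need audio output (it frees the speech-generation weights).

```python
# Fallback loading sketch for hardware without Flash Attention 2
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa",  # PyTorch scaled-dot-product attention instead of flash-attn
)
processor = Qwen2_5OmniProcessor.from_pretrained(model_path)

# Optional: if you only ever need text responses, freeing the talker saves GPU memory.
# model.disable_talker()
```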
Now, let's explore specific cookbook examples.
Example 1: Engaging in Voice Chat (voice_chatting.ipynb)
This cookbook demonstrates how Qwen2.5-Omni can simulate a natural voice conversation. The model needs to understand spoken input and generate both a text reply and a spoken audio response.
Prepare Input: We'll use a sample audio file representing the user's spoken input. The conversation structure includes the standard system prompt (crucial for enabling audio output) and the user's turn containing the audio and a text prompt asking the model to respond.
```python
# Assume user's voice input is in 'user_voice_input.wav'
# Or use a sample URL:
user_audio_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/hello.wav"  # Example input

conversation_voice_chat = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team..."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": user_audio_path},
            {"type": "text", "text": "Please listen to this audio and provide a spoken response."}
        ]
    }
]
```
Process Input: Use process_mm_info and the processor to prepare the model inputs. use_audio_in_video is False here since the input contains no video.

```python
print("Processing voice chat input...")
USE_AUDIO_IN_VIDEO_FLAG = False

text_prompt_vc = processor.apply_chat_template(conversation_voice_chat, add_generation_prompt=True, tokenize=False)
audios_vc, _, _ = process_mm_info(conversation_voice_chat, use_audio_in_video=USE_AUDIO_IN_VIDEO_FLAG)

inputs_vc = processor(
    text=text_prompt_vc,
    audio=audios_vc,
    return_tensors="pt",
    padding=True,
    use_audio_in_video=USE_AUDIO_IN_VIDEO_FLAG
)
inputs_vc = inputs_vc.to(model.device).to(model.dtype)
print("Voice chat input ready.")
```
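If you're curious what the processor actually produced, you can peek at the batch before generating. This inspection is purely optional and not part of the cookbook.

```python
# Optional: inspect the processed batch (key names may vary across transformers versions)
for name, value in inputs_vc.items():
    shape = tuple(value.shape) if hasattr(value, "shape") else None
    print(f"{name}: shape={shape}")
```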
Generate Response (Text and Audio): Call model.generate, ensuring return_audio=True is set to get the synthesized speech. Choose a speaker voice like 'Chelsie' or 'Ethan'.

```python
print("Generating voice chat response...")
with torch.no_grad():
    text_ids_vc, audio_output_vc = model.generate(
        **inputs_vc,
        use_audio_in_video=USE_AUDIO_IN_VIDEO_FLAG,
        return_audio=True,   # Request audio output
        speaker="Chelsie",   # Specify voice
        max_new_tokens=256
    )

# Decode text response
text_response_vc = processor.batch_decode(text_ids_vc, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print("\n--- Voice Chat Response (Text) ---")
print(text_response_vc)

# Save audio response
if audio_output_vc is not None:
    output_audio_path_vc = "qwen_voice_response.wav"
    sf.write(
        output_audio_path_vc,
        audio_output_vc.reshape(-1).detach().cpu().numpy(),
        samplerate=24000,
    )
    print(f"--- Voice Chat Response (Audio) Saved to: {output_audio_path_vc} ---")
else:
    print("--- No audio response generated. ---")
```
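To keep the conversation going over several turns, append the assistant's reply and the next user turn to the same conversation list and repeat the processing and generation steps. Below is a minimal sketch with a hypothetical follow-up audio path; note that the decoded text above may still include the prompt, so in practice you may want to strip that portion before appending.

```python
# Multi-turn sketch (illustrative; 'next_user_turn.wav' is a hypothetical path)
conversation_voice_chat.append(
    {"role": "assistant", "content": [{"type": "text", "text": text_response_vc}]}
)
conversation_voice_chat.append(
    {"role": "user", "content": [{"type": "audio", "audio": "next_user_turn.wav"}]}
)

text_prompt_vc2 = processor.apply_chat_template(conversation_voice_chat, add_generation_prompt=True, tokenize=False)
audios_vc2, _, _ = process_mm_info(conversation_voice_chat, use_audio_in_video=False)
inputs_vc2 = processor(text=text_prompt_vc2, audio=audios_vc2, return_tensors="pt",
                       padding=True, use_audio_in_video=False)
inputs_vc2 = inputs_vc2.to(model.device).to(model.dtype)

with torch.no_grad():
    text_ids_vc2, audio_vc2 = model.generate(**inputs_vc2, use_audio_in_video=False,
                                             return_audio=True, speaker="Chelsie", max_new_tokens=256)
```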
Example 2: Analyzing Screen Recordings (screen_recording_interaction.ipynb)
This use case involves understanding the content of a screen recording video and answering questions about it.
Prepare Input: Provide the video file (e.g., a tutorial recording) and a text question. For many screen recordings without narration, the audio track might not be relevant, so use_audio_in_video could be False.

```python
# Assume screen recording is at 'screen_recording.mp4'
# Or use a sample URL if available (replace with an actual screen recording URL if possible)
# Using a generic video URL as a placeholder:
screen_rec_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4"  # Placeholder URL

conversation_screen_rec = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant analyzing screen recordings."}]},  # Modified prompt
    {
        "role": "user",
        "content": [
            {"type": "video", "video": screen_rec_path},
            {"type": "text", "text": "What is the main application being demonstrated in this recording?"}
        ]
    }
]
```
Process Input: Decide whether to include the audio track. Let's assume no relevant audio for this example (use_audio_in_video=False).

```python
print("\nProcessing screen recording input...")
# Set based on whether screen recording audio is needed
USE_AUDIO_IN_VIDEO_FLAG = False

text_prompt_sr = processor.apply_chat_template(conversation_screen_rec, add_generation_prompt=True, tokenize=False)
audios_sr, images_sr, videos_sr = process_mm_info(conversation_screen_rec, use_audio_in_video=USE_AUDIO_IN_VIDEO_FLAG)

inputs_sr = processor(
    text=text_prompt_sr,
    audio=audios_sr,
    images=images_sr,
    videos=videos_sr,
    return_tensors="pt",
    padding=True,
    use_audio_in_video=USE_AUDIO_IN_VIDEO_FLAG
)
inputs_sr = inputs_sr.to(model.device).to(model.dtype)
print("Screen recording input ready.")
```
Generate Response (Text): Typically, a textual answer is sufficient for this task, so return_audio=False.

```python
print("Generating screen recording analysis...")
with torch.no_grad():
    text_ids_sr = model.generate(
        **inputs_sr,
        use_audio_in_video=USE_AUDIO_IN_VIDEO_FLAG,
        return_audio=False,  # Text response is usually sufficient
        max_new_tokens=512
    )

# Decode text response
text_response_sr = processor.batch_decode(text_ids_sr, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print("\n--- Screen Recording Analysis ---")
print(f"Video Source: {screen_rec_path}")
print(f"Response: {text_response_sr}")
```
Example 3: Solving Math Problems Multimodally (omni_chatting_for_math.ipynb)
Qwen2.5-Omni can tackle math problems presented visually, perhaps in an image containing an equation or diagram.
Prepare Input: Use an image containing the math problem. An accompanying audio explanation could also be included if relevant.
```python
# Assume math problem image is at 'math_problem.png'
# Using a sample image URL from Qwen resources as placeholder:
math_image_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"  # Placeholder image

conversation_math = [
    {"role": "system", "content": [{"type": "text", "text": "You are a math assistant capable of understanding visual problems."}]},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": math_image_path},
            # {"type": "audio", "audio": "explanation.wav"},  # Optional audio input
            {"type": "text", "text": "Please solve the problem shown in the image."}
        ]
    }
]
```
Process Input: Handle the image input using process_mm_info and the processor.

```python
print("\nProcessing math problem input...")
USE_AUDIO_IN_VIDEO_FLAG = False  # No video involved

text_prompt_math = processor.apply_chat_template(conversation_math, add_generation_prompt=True, tokenize=False)
# Assuming only image input for simplicity here
audios_math, images_math, videos_math = process_mm_info(conversation_math, use_audio_in_video=USE_AUDIO_IN_VIDEO_FLAG)

inputs_math = processor(
    text=text_prompt_math,
    audio=audios_math,
    images=images_math,
    videos=videos_math,
    return_tensors="pt",
    padding=True,
    use_audio_in_video=USE_AUDIO_IN_VIDEO_FLAG
)
inputs_math = inputs_math.to(model.device).to(model.dtype)
print("Math problem input ready.")
```
Generate Response (Text): Generate the solution or explanation as text.
print("Generating math solution...") with torch.no_grad(): text_ids_math = model.generate( **inputs_math, use_audio_in_video=USE_AUDIO_IN_VIDEO_FLAG, return_audio=False, max_new_tokens=1024 # Allow longer solutions ) # Decode text response text_response_math = processor.batch_decode(text_ids_math, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0] print(f"\n--- Math Problem Solution ---") print(f"Image Source: {math_image_path}") print(f"Solution/Explanation: {text_response_math}")
Conclusion
These examples, derived from the official Qwen2.5-Omni cookbooks, highlight the model's remarkable flexibility. Whether engaging in spoken dialogue, analyzing visual recordings, or interpreting multimodal problem statements, Qwen2.5-Omni provides a powerful toolkit. By following these patterns (structuring the conversation, processing inputs with qwen-omni-utils, and calling generate with the appropriate flags), developers can begin building sophisticated applications that truly embrace the richness of multimodal interaction. Don't hesitate to explore the other cookbooks (like omni_chatting_for_music.ipynb or multi_round_omni_chatting.ipynb) for even more advanced use cases.
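As a closing sketch, the common pattern running through all three examples can be wrapped in one small helper. The function name and defaults below are this article's own choices, not an official API:

```python
# Reusable sketch of the structure -> process -> generate pattern (hypothetical helper)
def omni_chat(conversation, use_audio_in_video=False, return_audio=False,
              speaker="Chelsie", max_new_tokens=512):
    prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    audios, images, videos = process_mm_info(conversation, use_audio_in_video=use_audio_in_video)
    inputs = processor(text=prompt, audio=audios, images=images, videos=videos,
                       return_tensors="pt", padding=True, use_audio_in_video=use_audio_in_video)
    inputs = inputs.to(model.device).to(model.dtype)

    with torch.no_grad():
        if return_audio:
            text_ids, audio = model.generate(**inputs, use_audio_in_video=use_audio_in_video,
                                             return_audio=True, speaker=speaker,
                                             max_new_tokens=max_new_tokens)
        else:
            text_ids = model.generate(**inputs, use_audio_in_video=use_audio_in_video,
                                      return_audio=False, max_new_tokens=max_new_tokens)
            audio = None

    text = processor.batch_decode(text_ids, skip_special_tokens=True,
                                  clean_up_tokenization_spaces=False)[0]
    return text, audio
```

With a wrapper like this, each example reduces to building a conversation list and calling omni_chat(conversation) for text-only answers or omni_chat(conversation, return_audio=True) when a spoken reply is wanted.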