Putting Qwen2.5-Omni to Work: Practical Examples

Community Article · Published May 1, 2025

Qwen2.5-Omni stands out for its ability to understand and generate content across text, images, audio, and video. But how does this translate into practical use cases? The official Qwen2.5-Omni cookbooks provide excellent, hands-on demonstrations. This article walks through several key examples, showing step-by-step how to leverage this powerful multimodal model.

Core Setup: Installation and Model Loading

Before diving into specific examples, let's cover the essential setup required for all scenarios.

  1. Installation: You'll need the transformers library (specifically the preview version supporting Qwen2.5-Omni), qwen-omni-utils for handling multimodal inputs easily, torch, accelerate for efficient loading, and soundfile for audio handling. Using the [decord] extra for qwen-omni-utils is recommended for faster video loading.

    # Install the core dependencies (quote the extra so the brackets survive your shell)
    pip install accelerate torch soundfile "qwen-omni-utils[decord]" -U
    # Install the transformers preview build with Qwen2.5-Omni support
    pip install git+https://github.com/huggingface/transformers@v4.51.3-Qwen2.5-Omni-preview
    
  2. Model Loading: Load the Qwen2.5-Omni model (we'll use the 7B variant here) and its processor. For efficiency, especially with the 7B model, use bfloat16 precision and enable Flash Attention 2 if your hardware supports it (NVIDIA Ampere or newer); an optional fallback sketch for machines without flash-attn appears right after the loading code below. device_map="auto" distributes the model across available GPUs.

    import torch
    import soundfile as sf
    from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
    from qwen_omni_utils import process_mm_info
    
    # Define model identifier
    model_path = "Qwen/Qwen2.5-Omni-7B"
    print("Loading model and processor...")
    
    # Load with optimizations
    model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16, # Use BF16
        device_map="auto",         # Auto-distribute across GPUs
        attn_implementation="flash_attention_2" # Use Flash Attention 2
    )
    processor = Qwen2_5OmniProcessor.from_pretrained(model_path)
    
    print("Setup complete. Model and processor ready.")
    

Now, let's explore specific cookbook examples.

Example 1: Engaging in Voice Chat (voice_chatting.ipynb)

This cookbook demonstrates how Qwen2.5-Omni can simulate a natural voice conversation. The model needs to understand spoken input and generate both a text reply and a spoken audio response.

  1. Prepare Input: We'll use a sample audio file representing the user's spoken input. The conversation structure includes the standard system prompt (crucial for enabling audio output) and the user's turn containing the audio and a text prompt asking the model to respond.

    # Assume user's voice input is in 'user_voice_input.wav'
    # Or use a sample URL:
    user_audio_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/hello.wav" # Example input
    
    conversation_voice_chat = [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team..."}]
        },
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio": user_audio_path},
                {"type": "text", "text": "Please listen to this audio and provide a spoken response."}
            ]
        }
    ]
    
  2. Process Input: Use process_mm_info and the processor to prepare the model inputs. use_audio_in_video is False here because the input is a standalone audio clip, not the soundtrack of a video.

    print("Processing voice chat input...")
    USE_AUDIO_IN_VIDEO_FLAG = False
    text_prompt_vc = processor.apply_chat_template(conversation_voice_chat, add_generation_prompt=True, tokenize=False)
    audios_vc, _, _ = process_mm_info(conversation_voice_chat, use_audio_in_video=USE_AUDIO_IN_VIDEO_FLAG)
    
    inputs_vc = processor(
        text=text_prompt_vc,
        audio=audios_vc,
        return_tensors="pt", padding=True,
        use_audio_in_video=USE_AUDIO_IN_VIDEO_FLAG
    )
    inputs_vc = inputs_vc.to(model.device).to(model.dtype)
    print("Voice chat input ready.")
    
  3. Generate Response (Text and Audio): Call model.generate with return_audio=True to get the synthesized speech, and pick a speaker voice such as 'Chelsie' or 'Ethan'. A short follow-up snippet after the code below shows how to keep only the assistant's reply in the decoded text.

    print("Generating voice chat response...")
    with torch.no_grad():
        text_ids_vc, audio_output_vc = model.generate(
            **inputs_vc,
            use_audio_in_video=USE_AUDIO_IN_VIDEO_FLAG,
            return_audio=True,  # Request audio output
            speaker="Chelsie", # Specify voice
            max_new_tokens=256
        )
    
    # Decode text response
    text_response_vc = processor.batch_decode(text_ids_vc, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    print(f"\n--- Voice Chat Response (Text) ---")
    print(text_response_vc)
    
    # Save audio response
    if audio_output_vc is not None:
        output_audio_path_vc = "qwen_voice_response.wav"
        sf.write(
            output_audio_path_vc,
            audio_output_vc.reshape(-1).detach().cpu().numpy(),
            samplerate=24000,
        )
        print(f"--- Voice Chat Response (Audio) Saved to: {output_audio_path_vc} ---")
    else:
        print("--- No audio response generated. ---")
    

Example 2: Analyzing Screen Recordings (screen_recording_interaction.ipynb)

This use case involves understanding the content of a screen recording video and answering questions about it.

  1. Prepare Input: Provide the video file (e.g., a tutorial recording) and a text question. For many screen recordings without narration, the audio track might not be relevant, so use_audio_in_video could be False.

    # Assume screen recording is at 'screen_recording.mp4'
    # Or use a sample URL if available (replace with an actual screen recording URL if possible)
    # Using a generic video URL as a placeholder:
    screen_rec_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4" # Placeholder URL
    
    conversation_screen_rec = [
        {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant analyzing screen recordings."}]}, # Modified prompt
        {
            "role": "user",
            "content": [
                {"type": "video", "video": screen_rec_path},
                {"type": "text", "text": "What is the main application being demonstrated in this recording?"}
            ]
        }
    ]
    
  2. Process Input: Decide whether to include the audio track. For this example we assume there is no relevant audio (use_audio_in_video=False); a variant for narrated recordings appears at the end of this example.

    print("\nProcessing screen recording input...")
    # Set based on whether screen recording audio is needed
    USE_AUDIO_IN_VIDEO_FLAG = False
    text_prompt_sr = processor.apply_chat_template(conversation_screen_rec, add_generation_prompt=True, tokenize=False)
    audios_sr, images_sr, videos_sr = process_mm_info(conversation_screen_rec, use_audio_in_video=USE_AUDIO_IN_VIDEO_FLAG)
    
    inputs_sr = processor(
        text=text_prompt_sr,
        audio=audios_sr, images=images_sr, videos=videos_sr,
        return_tensors="pt", padding=True,
        use_audio_in_video=USE_AUDIO_IN_VIDEO_FLAG
    )
    inputs_sr = inputs_sr.to(model.device).to(model.dtype)
    print("Screen recording input ready.")
    
  3. Generate Response (Text): Typically, a textual answer is sufficient for this task, so return_audio=False.

    print("Generating screen recording analysis...")
    with torch.no_grad():
        text_ids_sr = model.generate(
            **inputs_sr,
            use_audio_in_video=USE_AUDIO_IN_VIDEO_FLAG,
            return_audio=False, # Text response is usually sufficient
            max_new_tokens=512
        )
    
    # Decode text response
    text_response_sr = processor.batch_decode(text_ids_sr, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    print(f"\n--- Screen Recording Analysis ---")
    print(f"Video Source: {screen_rec_path}")
    print(f"Response: {text_response_sr}")
    

Example 3: Solving Math Problems Multimodally (omni_chatting_for_math.ipynb)

Qwen2.5-Omni can tackle math problems presented visually, perhaps in an image containing an equation or diagram.

  1. Prepare Input: Use an image containing the math problem. An accompanying audio explanation can also be included if relevant; a sketch of that combined input appears at the end of this example.

    # Assume the math problem image is at 'math_problem.png'
    # Using a sample image URL from Qwen resources as a placeholder
    # (not an actual math problem; substitute your own image):
    math_image_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg" # Placeholder image
    
    conversation_math = [
        {"role": "system", "content": [{"type": "text", "text": "You are a math assistant capable of understanding visual problems."}]},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": math_image_path},
                # {"type": "audio", "audio": "explanation.wav"}, # Optional audio input
                {"type": "text", "text": "Please solve the problem shown in the image."}
            ]
        }
    ]
    
  2. Process Input: Handle the image input using process_mm_info and the processor.

    print("\nProcessing math problem input...")
    USE_AUDIO_IN_VIDEO_FLAG = False # No video involved
    text_prompt_math = processor.apply_chat_template(conversation_math, add_generation_prompt=True, tokenize=False)
    # Assuming only image input for simplicity here
    audios_math, images_math, videos_math = process_mm_info(conversation_math, use_audio_in_video=USE_AUDIO_IN_VIDEO_FLAG)
    
    inputs_math = processor(
        text=text_prompt_math,
        audio=audios_math, images=images_math, videos=videos_math,
        return_tensors="pt", padding=True,
        use_audio_in_video=USE_AUDIO_IN_VIDEO_FLAG
    )
    inputs_math = inputs_math.to(model.device).to(model.dtype)
    print("Math problem input ready.")
    
  3. Generate Response (Text): Generate the solution or explanation as text.

    print("Generating math solution...")
    with torch.no_grad():
        text_ids_math = model.generate(
            **inputs_math,
            use_audio_in_video=USE_AUDIO_IN_VIDEO_FLAG,
            return_audio=False,
            max_new_tokens=1024 # Allow longer solutions
        )
    
    # Decode text response
    text_response_math = processor.batch_decode(text_ids_math, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    print(f"\n--- Math Problem Solution ---")
    print(f"Image Source: {math_image_path}")
    print(f"Solution/Explanation: {text_response_math}")
    

Conclusion

These examples, derived from the official Qwen2.5-Omni cookbooks, highlight the model's remarkable flexibility. Whether engaging in spoken dialogue, analyzing visual recordings, or interpreting multimodal problem statements, Qwen2.5-Omni provides a powerful toolkit. By following these patterns – structuring the conversation, processing inputs with qwen-omni-utils, and utilizing the generate function with appropriate flags – developers can begin building sophisticated applications that truly embrace the richness of multimodal interaction. Don't hesitate to explore the other cookbooks (like omni_chatting_for_music.ipynb or multi_round_omni_chatting.ipynb) for even more advanced use cases.
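
The three walkthroughs share the same skeleton, which is easy to fold into a small helper. Below is a minimal sketch that assumes the model, processor, and process_mm_info from the setup section are already in scope; the name run_omni and its defaults are purely illustrative:

    def run_omni(conversation, use_audio_in_video=False, return_audio=False,
                 speaker="Chelsie", max_new_tokens=512):
        """One multimodal turn: chat template -> media processing -> generate -> decode."""
        text_prompt = processor.apply_chat_template(
            conversation, add_generation_prompt=True, tokenize=False
        )
        audios, images, videos = process_mm_info(
            conversation, use_audio_in_video=use_audio_in_video
        )
        inputs = processor(
            text=text_prompt, audio=audios, images=images, videos=videos,
            return_tensors="pt", padding=True, use_audio_in_video=use_audio_in_video
        ).to(model.device).to(model.dtype)

        gen_kwargs = {
            "use_audio_in_video": use_audio_in_video,
            "return_audio": return_audio,
            "max_new_tokens": max_new_tokens,
        }
        if return_audio:
            gen_kwargs["speaker"] = speaker  # the voice only matters when speech is requested

        with torch.no_grad():
            output = model.generate(**inputs, **gen_kwargs)

        # return_audio=True yields (text_ids, audio); otherwise only text_ids is returned
        text_ids, audio = output if return_audio else (output, None)
        text = processor.batch_decode(
            text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )[0]
        return text, audio

    # Example usage: text-only analysis of the screen recording from Example 2
    answer, _ = run_omni(conversation_screen_rec, max_new_tokens=512)
    print(answer)

Wrapping the flow this way also keeps the use_audio_in_video flag consistent across the three places it appears, which is easy to get wrong when copying snippets around.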
