Putting Qwen2.5-Omni to Work: Practical Examples

Community Article · Published May 1, 2025

Qwen2.5-Omni stands out for its ability to understand and generate content across text, images, audio, and video. But how does this translate into practical use cases? The official Qwen2.5-Omni cookbooks provide excellent, hands-on demonstrations. This article walks through several key examples, showing step-by-step how to leverage this powerful multimodal model.

Core Setup: Installation and Model Loading

Before diving into specific examples, let's cover the essential setup required for all scenarios.

  1. Installation: You'll need the transformers library (specifically the preview version supporting Qwen2.5-Omni), qwen-omni-utils for handling multimodal inputs easily, torch, accelerate for efficient loading, and soundfile for audio handling. Using the [decord] extra for qwen-omni-utils is recommended for faster video loading.

    # Install the core dependencies (quote the extra so the brackets survive your shell)
    pip install accelerate torch soundfile "qwen-omni-utils[decord]" -U
    # Install the transformers preview build with Qwen2.5-Omni support
    pip install git+https://github.com/huggingface/transformers@v4.51.3-Qwen2.5-Omni-preview
    
  2. Model Loading: Load the Qwen2.5-Omni model (we'll use the 7B variant here) and its processor. For efficiency, especially with the 7B model, use bfloat16 precision and enable Flash Attention 2 if your hardware supports it (NVIDIA Ampere or newer); an optional fallback sketch for machines without flash-attn appears right after the loading code below. device_map="auto" distributes the model across available GPUs.

    import torch
    import soundfile as sf
    from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
    from qwen_omni_utils import process_mm_info
    
    # Define model identifier
    model_path = "Qwen/Qwen2.5-Omni-7B"
    print("Loading model and processor...")
    
    # Load with optimizations
    model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16, # Use BF16
        device_map="auto",         # Auto-distribute across GPUs
        attn_implementation="flash_attention_2" # Use Flash Attention 2
    )
    processor = Qwen2_5OmniProcessor.from_pretrained(model_path)
    
    print("Setup complete. Model and processor ready.")
    

Now, let's explore specific cookbook examples.

Example 1: Engaging in Voice Chat (voice_chatting.ipynb)

This cookbook demonstrates how Qwen2.5-Omni can simulate a natural voice conversation. The model needs to understand spoken input and generate both a text reply and a spoken audio response.

  1. Prepare Input: We'll use a sample audio file representing the user's spoken input. The conversation structure includes the standard system prompt (crucial for enabling audio output) and the user's turn containing the audio and a text prompt asking the model to respond.

    # Assume user's voice input is in 'user_voice_input.wav'
    # Or use a sample URL:
    user_audio_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/hello.wav" # Example input
    
    conversation_voice_chat = [
        {
            "role": "system",
            "content": [{"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team..."}]
        },
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio": user_audio_path},
                {"type": "text", "text": "Please listen to this audio and provide a spoken response."}
            ]
        }
    ]
    
  2. Process Input: Use process_mm_info and the processor to prepare the model inputs. use_audio_in_video is False here because the input is a standalone audio clip, not the soundtrack of a video.

    print("Processing voice chat input...")
    USE_AUDIO_IN_VIDEO_FLAG = False
    text_prompt_vc = processor.apply_chat_template(conversation_voice_chat, add_generation_prompt=True, tokenize=False)
    audios_vc, _, _ = process_mm_info(conversation_voice_chat, use_audio_in_video=USE_AUDIO_IN_VIDEO_FLAG)
    
    inputs_vc = processor(
        text=text_prompt_vc,
        audio=audios_vc,
        return_tensors="pt", padding=True,
        use_audio_in_video=USE_AUDIO_IN_VIDEO_FLAG
    )
    inputs_vc = inputs_vc.to(model.device).to(model.dtype)
    print("Voice chat input ready.")
    
  3. Generate Response (Text and Audio): Call model.generate with return_audio=True to get the synthesized speech, and pick a speaker voice such as 'Chelsie' or 'Ethan'. A short follow-up snippet after the code below shows how to keep only the assistant's reply in the decoded text.

    print("Generating voice chat response...")
    with torch.no_grad():
        text_ids_vc, audio_output_vc = model.generate(
            **inputs_vc,
            use_audio_in_video=USE_AUDIO_IN_VIDEO_FLAG,
            return_audio=True,  # Request audio output
            speaker="Chelsie", # Specify voice
            max_new_tokens=256
        )
    
    # Decode text response
    text_response_vc = processor.batch_decode(text_ids_vc, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    print(f"\n--- Voice Chat Response (Text) ---")
    print(text_response_vc)
    
    # Save audio response
    if audio_output_vc is not None:
        output_audio_path_vc = "qwen_voice_response.wav"
        sf.write(
            output_audio_path_vc,
            audio_output_vc.reshape(-1).detach().cpu().numpy(),
            samplerate=24000,
        )
        print(f"--- Voice Chat Response (Audio) Saved to: {output_audio_path_vc} ---")
    else:
        print("--- No audio response generated. ---")
    

Example 2: Analyzing Screen Recordings (screen_recording_interaction.ipynb)

This use case involves understanding the content of a screen recording video and answering questions about it.

  1. Prepare Input: Provide the video file (e.g., a tutorial recording) and a text question. For many screen recordings without narration, the audio track might not be relevant, so use_audio_in_video could be False.

    # Assume screen recording is at 'screen_recording.mp4'
    # Or use a sample URL if available (replace with an actual screen recording URL if possible)
    # Using a generic video URL as a placeholder:
    screen_rec_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-Omni/draw.mp4" # Placeholder URL
    
    conversation_screen_rec = [
        {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant analyzing screen recordings."}]}, # Modified prompt
        {
            "role": "user",
            "content": [
                {"type": "video", "video": screen_rec_path},
                {"type": "text", "text": "What is the main application being demonstrated in this recording?"}
            ]
        }
    ]
    
  2. Process Input: Decide whether to include the audio track. For this example we assume there is no relevant audio (use_audio_in_video=False); a variant for narrated recordings appears at the end of this example.

    print("\nProcessing screen recording input...")
    # Set based on whether screen recording audio is needed
    USE_AUDIO_IN_VIDEO_FLAG = False
    text_prompt_sr = processor.apply_chat_template(conversation_screen_rec, add_generation_prompt=True, tokenize=False)
    audios_sr, images_sr, videos_sr = process_mm_info(conversation_screen_rec, use_audio_in_video=USE_AUDIO_IN_VIDEO_FLAG)
    
    inputs_sr = processor(
        text=text_prompt_sr,
        audio=audios_sr, images=images_sr, videos=videos_sr,
        return_tensors="pt", padding=True,
        use_audio_in_video=USE_AUDIO_IN_VIDEO_FLAG
    )
    inputs_sr = inputs_sr.to(model.device).to(model.dtype)
    print("Screen recording input ready.")
    
  3. Generate Response (Text): Typically, a textual answer is sufficient for this task, so return_audio=False.

    print("Generating screen recording analysis...")
    with torch.no_grad():
        text_ids_sr = model.generate(
            **inputs_sr,
            use_audio_in_video=USE_AUDIO_IN_VIDEO_FLAG,
            return_audio=False, # Text response is usually sufficient
            max_new_tokens=512
        )
    
    # Decode text response
    text_response_sr = processor.batch_decode(text_ids_sr, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    print(f"\n--- Screen Recording Analysis ---")
    print(f"Video Source: {screen_rec_path}")
    print(f"Response: {text_response_sr}")
    

Example 3: Solving Math Problems Multimodally (omni_chatting_for_math.ipynb)

Qwen2.5-Omni can tackle math problems presented visually, perhaps in an image containing an equation or diagram.

  1. Prepare Input: Use an image containing the math problem. An accompanying audio explanation can also be included if relevant; a sketch of that combined input appears at the end of this example.

    # Assume the math problem image is at 'math_problem.png'
    # Using a sample image URL from Qwen resources as a placeholder
    # (not an actual math problem; substitute your own image):
    math_image_path = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg" # Placeholder image
    
    conversation_math = [
        {"role": "system", "content": [{"type": "text", "text": "You are a math assistant capable of understanding visual problems."}]},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": math_image_path},
                # {"type": "audio", "audio": "explanation.wav"}, # Optional audio input
                {"type": "text", "text": "Please solve the problem shown in the image."}
            ]
        }
    ]
    
  2. Process Input: Handle the image input using process_mm_info and the processor.

    print("\nProcessing math problem input...")
    USE_AUDIO_IN_VIDEO_FLAG = False # No video involved
    text_prompt_math = processor.apply_chat_template(conversation_math, add_generation_prompt=True, tokenize=False)
    # Assuming only image input for simplicity here
    audios_math, images_math, videos_math = process_mm_info(conversation_math, use_audio_in_video=USE_AUDIO_IN_VIDEO_FLAG)
    
    inputs_math = processor(
        text=text_prompt_math,
        audio=audios_math, images=images_math, videos=videos_math,
        return_tensors="pt", padding=True,
        use_audio_in_video=USE_AUDIO_IN_VIDEO_FLAG
    )
    inputs_math = inputs_math.to(model.device).to(model.dtype)
    print("Math problem input ready.")
    
  3. Generate Response (Text): Generate the solution or explanation as text.

    print("Generating math solution...")
    with torch.no_grad():
        text_ids_math = model.generate(
            **inputs_math,
            use_audio_in_video=USE_AUDIO_IN_VIDEO_FLAG,
            return_audio=False,
            max_new_tokens=1024 # Allow longer solutions
        )
    
    # Decode text response
    text_response_math = processor.batch_decode(text_ids_math, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    print(f"\n--- Math Problem Solution ---")
    print(f"Image Source: {math_image_path}")
    print(f"Solution/Explanation: {text_response_math}")
    

Conclusion

These examples, derived from the official Qwen2.5-Omni cookbooks, highlight the model's remarkable flexibility. Whether engaging in spoken dialogue, analyzing visual recordings, or interpreting multimodal problem statements, Qwen2.5-Omni provides a powerful toolkit. By following these patterns – structuring the conversation, processing inputs with qwen-omni-utils, and utilizing the generate function with appropriate flags – developers can begin building sophisticated applications that truly embrace the richness of multimodal interaction. Don't hesitate to explore the other cookbooks (like omni_chatting_for_music.ipynb or multi_round_omni_chatting.ipynb) for even more advanced use cases.
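
The three walkthroughs share the same skeleton, which is easy to fold into a small helper. Below is a minimal sketch that assumes the model, processor, and process_mm_info from the setup section are already in scope; the name run_omni and its defaults are purely illustrative:

    def run_omni(conversation, use_audio_in_video=False, return_audio=False,
                 speaker="Chelsie", max_new_tokens=512):
        """One multimodal turn: chat template -> media processing -> generate -> decode."""
        text_prompt = processor.apply_chat_template(
            conversation, add_generation_prompt=True, tokenize=False
        )
        audios, images, videos = process_mm_info(
            conversation, use_audio_in_video=use_audio_in_video
        )
        inputs = processor(
            text=text_prompt, audio=audios, images=images, videos=videos,
            return_tensors="pt", padding=True, use_audio_in_video=use_audio_in_video
        ).to(model.device).to(model.dtype)

        gen_kwargs = {
            "use_audio_in_video": use_audio_in_video,
            "return_audio": return_audio,
            "max_new_tokens": max_new_tokens,
        }
        if return_audio:
            gen_kwargs["speaker"] = speaker  # the voice only matters when speech is requested

        with torch.no_grad():
            output = model.generate(**inputs, **gen_kwargs)

        # return_audio=True yields (text_ids, audio); otherwise only text_ids is returned
        text_ids, audio = output if return_audio else (output, None)
        text = processor.batch_decode(
            text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )[0]
        return text, audio

    # Example usage: text-only analysis of the screen recording from Example 2
    answer, _ = run_omni(conversation_screen_rec, max_new_tokens=512)
    print(answer)

Wrapping the flow this way also keeps the use_audio_in_video flag consistent across the three places it appears, which is easy to get wrong when copying snippets around.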
