---
title: Qwen2.5 Omni 3B ASR
emoji: 
colorFrom: gray
colorTo: indigo
sdk: gradio
sdk_version: 5.32.1
app_file: app.py
pinned: false
license: mit
short_description: Qwen2.5 Omni 3B ASR DEMO
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

Qwen2.5-Omni ASR (ZeroGPU) Gradio App

A lightweight Gradio application that uses Qwen2.5-Omni's audio-to-text capabilities to perform automatic speech recognition (ASR) on uploaded audio files, then converts the Simplified Chinese output to Traditional Chinese. The app runs under ZeroGPU (the spaces package) for CPU/GPU placement, enabling deployment on Hugging Face Spaces without a dedicated GPU.


Overview

  • Model: Qwen2.5-Omni-3B
  • Processor: Qwen2.5-Omni processor (handles tokenization and chat-template formatting)
  • Audio/Video Preprocessing: qwen-omni-utils (handles loading and resampling)
  • Simplified→Traditional Conversion: opencc
  • Web UI: Gradio v5 (blocks API)
  • ZeroGPU: Hugging Face’s spaces package, whose @spaces.GPU decorator transparently handles CPU/GPU placement (GPU when one is available, CPU otherwise)

When a user uploads an audio file and provides a (customizable) user prompt such as “Transcribe the attached audio to text with punctuation,” the app builds exactly the chat messages that Qwen2.5-Omni expects (including a system prompt under the hood), runs inference via ZeroGPU, strips the internal “system / user / assistant” markers, and returns only the ASR transcript, converted to Traditional Chinese.
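
In app.py, these components correspond to a handful of imports. The following is only an orientation sketch (the authoritative list lives in app.py itself, and the exact OpenCC import can vary slightly between opencc builds):

    import spaces                      # ZeroGPU decorator (@spaces.GPU)
    import torch
    import gradio as gr                # Gradio v5 Blocks UI
    from transformers import (
        Qwen2_5OmniForConditionalGeneration,  # the 3B Omni model
        Qwen2_5OmniProcessor,                 # tokenization + chat-template formatting
    )
    from qwen_omni_utils import process_mm_info  # audio/video loading & resampling
    from opencc import OpenCC          # Simplified→Traditional conversion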


Features

  1. Audio-to-Text with Qwen2.5-Omni

    • Uses the official Qwen2.5-Omni model (3B parameters) to generate a punctuated transcript from arbitrary audio formats (WAV, MP3, etc.).
  2. ZeroGPU Acceleration

    • Automatically offloads model weights and activations between CPU and GPU, allowing low-resource deployment on Hugging Face Spaces without requiring a full-sized GPU.
  3. Simplified→Traditional Chinese Conversion

    • Applies OpenCC (“s2t”) to convert simplified Chinese output into Traditional Chinese in a single step.
  4. Clean Transcript Output

    • Internal “system”, “user”, and “assistant” prefixes are stripped before display, so end users see only the actual ASR text.
  5. Gradio Blocks UI (v5)

    • Simple two-column layout: upload your audio and enter a prompt on the left, click Transcribe, and read the Traditional Chinese transcript on the right.

Demo

App Screenshot

  1. Upload Audio: Click “Browse” or drag & drop a WAV/MP3/… file.

  2. User Prompt: By default, it is set to

    Transcribe the attached audio to text with punctuation.
    

    You can customize this if you want a different style of transcription (e.g., “Add speaker labels,” “Transcribe and summarize,” etc.).

  3. Transcribe: Hit “Transcribe” (ZeroGPU handles device placement automatically).

  4. Output: The Traditional Chinese transcript appears in the right textbox—cleaned of any system/user/assistant markers.


Installation & Local Run

  1. Clone the Repository

    git clone https://github.com/<your-username>/qwen2-omni-asr-zerogpu.git
    cd qwen2-omni-asr-zerogpu
    
  2. Create a Python Virtual Environment (recommended)

    python3 -m venv venv
    source venv/bin/activate
    
  3. Install Dependencies

    pip install --upgrade pip
    pip install -r requirements.txt
    
  4. Run the App Locally

    python app.py
    
    • This starts a Gradio server on http://127.0.0.1:7860/ (by default).
    • ZeroGPU will automatically detect whether a CUDA device is available and fall back to CPU if not.

Deployment on Hugging Face Spaces

  1. Create a new Space on Hugging Face (select the Gradio SDK).

  2. Ensure you select “Hardware Accelerator: None” (Spaces will use ZeroGPU to offload automatically).

  3. Push (or upload) the repository contents, including:

    • app.py
    • requirements.txt
    • Any other config files (e.g., README.md itself).
  4. Spaces will install dependencies via requirements.txt, and automatically launch app.py under ZeroGPU.

  5. Visit your Space’s URL to try it out.

No explicit Dockerfile or server config is needed; ZeroGPU handles the backend. Just ensure spaces is in requirements.txt.


File Structure

├── app.py
├── requirements.txt
├── README.md
└── LICENSE  (optional)
  • app.py

    • Entry point for the Gradio app.
    • Defines run_asr(...) decorated with @spaces.GPU to enable ZeroGPU offload.
    • Loads the Qwen2.5-Omni model & processor, runs audio preprocessing, inference, decoding, prompt stripping, and Simplified→Traditional conversion.
    • Builds a Gradio Blocks UI (two-column layout); a condensed skeleton of the whole file is sketched after this list.
  • requirements.txt

    # ZeroGPU for CPU/GPU offload acceleration
    spaces
    
    # PyTorch + Transformers
    torch
    transformers
    
    # Qwen Omni utilities (for audio preprocessing)
    qwen-omni-utils
    
    # OpenCC (simplified→traditional conversion)
    opencc
    
    # Gradio v5
    gradio>=5.0.0
    
  • README.md

    • (You’re reading it.)
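
Putting these pieces together, app.py has roughly the following shape. This is a condensed sketch, not the actual file: component names and layout details are illustrative, and the body of run_asr is spelled out step by step in “How It Works” below.

    import spaces
    import gradio as gr
    from opencc import OpenCC
    from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

    MODEL_ID = "Qwen/Qwen2.5-Omni-3B"
    model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    model.disable_talker()
    model.eval()
    processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)
    cc = OpenCC("s2t")  # some opencc builds expect "s2t.json"

    @spaces.GPU  # ZeroGPU: on Spaces, a GPU is attached while this function runs
    def run_asr(audio_path, user_prompt):
        if not audio_path:
            return "Please upload an audio file first."
        # Build the chat messages, preprocess with qwen_omni_utils.process_mm_info,
        # generate, strip the system/user/assistant markers, and convert with
        # cc.convert() (see the step-by-step code in "How It Works" below).
        ...

    with gr.Blocks(title="Qwen2.5-Omni ASR (ZeroGPU)") as demo:
        with gr.Row():
            with gr.Column():
                audio_in = gr.Audio(type="filepath", label="Audio")
                prompt_in = gr.Textbox(
                    value="Transcribe the attached audio to text with punctuation.",
                    label="User Prompt",
                )
                btn = gr.Button("Transcribe")
            with gr.Column():
                text_out = gr.Textbox(label="Traditional Chinese Transcript")
        btn.click(run_asr, inputs=[audio_in, prompt_in], outputs=text_out)

    if __name__ == "__main__":
        demo.launch()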

How It Works

  1. Model & Processor Loading

    MODEL_ID = "Qwen/Qwen2.5-Omni-3B"
    model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    model.disable_talker()
    processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)
    model.eval()
    
    • device_map="auto" + @spaces.GPU (the ZeroGPU decorator) ensure that, if a GPU is present, the weights are placed on it; otherwise they stay on the CPU.
    • disable_talker() disables the speech-generation (“talker”) head, since only text output is needed for ASR.
  2. Message Construction for ASR

    sys_prompt = (
        "You are Qwen, a virtual human developed by the Qwen Team, "
        "Alibaba Group, capable of perceiving auditory and visual inputs, "
        "as well as generating text and speech."
    )
    messages = [
        {"role": "system", "content": [{"type": "text", "text": sys_prompt}]},
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio": audio_path},
                {"type": "text", "text": user_prompt}
            ],
        },
    ]
    
    • This mirrors the Qwen chat template: first a system message, then a user message containing an uploaded audio file + a textual instruction.
  3. Apply Chat Template & Preprocess

    text_input = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    audios, images, videos = process_mm_info(messages, use_audio_in_video=True)
    inputs = processor(
        text=text_input,
        audio=audios,
        images=images,
        videos=videos,
        return_tensors="pt",
        padding=True,
        use_audio_in_video=True
    ).to(model.device).to(model.dtype)
    
    • apply_chat_template(...) formats the messages into a single input string.
    • process_mm_info(...) handles loading & resampling of audio (and potentially extracting video frames, if video files are provided).
    • The final inputs tensor dict is ready for model.generate().
  4. Inference & Post-Processing

    output_tokens = model.generate(
        **inputs,
        use_audio_in_video=True,
        return_audio=False,
        thinker_max_new_tokens=512,
        thinker_do_sample=False
    )
    full_decoded = processor.batch_decode(
        output_tokens, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0].strip()
    asr_only = _strip_prompts(full_decoded)
    return cc.convert(asr_only)
    
    • model.generate(...) runs a greedy (no sampling) decoding over up to 512 new tokens.
    • batch_decode(...) yields a single string that includes all “system … user … assistant” markers.
    • _strip_prompts(...) finds the first occurrence of the “assistant” marker in that output and returns only the substring after it, so the UI shows just the raw transcript (a minimal sketch of this helper follows below).
    • Finally, opencc converts that transcript from simplified to Traditional Chinese.
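
The two post-processing pieces referenced above are small. A minimal sketch of what they can look like (the real helpers in app.py may differ in detail, and some opencc builds expect the config name "s2t.json" rather than "s2t"):

    from opencc import OpenCC

    cc = OpenCC("s2t")  # Simplified → Traditional converter ("s2t" config)

    def _strip_prompts(decoded: str) -> str:
        """Return only the text after the first 'assistant' marker, if present."""
        marker = "assistant"
        idx = decoded.find(marker)
        return decoded[idx + len(marker):].strip() if idx != -1 else decoded.strip()

    # As used at the end of run_asr:
    # asr_only = _strip_prompts(full_decoded)
    # return cc.convert(asr_only)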

Dependencies

All required dependencies are listed in requirements.txt. Briefly:

  • spaces: Offload wrapper (ZeroGPU) to auto-dispatch tensors between CPU/GPU.
  • torch & transformers: Core PyTorch framework and Hugging Face Transformers (to load Qwen2.5-Omni).
  • qwen-omni-utils: Utility functions to preprocess audio/video for Qwen2.5-Omni.
  • opencc: Simplified→Traditional Chinese converter (uses the “s2t” config).
  • gradio >= 5.0.0: For building the web UI.

When you run pip install -r requirements.txt, all dependencies will be pulled from PyPI.


Configuration

  • Model ID

    • Defined in app.py as MODEL_ID = "Qwen/Qwen2.5-Omni-3B".
    • If you want to try a different Qwen2.5-Omni checkpoint, simply update that string to another HF model repository (e.g., "Qwen/Qwen2.5-Omni-7B"), then re-deploy.
  • ZeroGPU Offload

    • The @spaces.GPU decorator on run_asr(...) is all you need to enable transparent offloading.
    • No extra config or environment variables are required. Spaces will detect this, install spaces, and manage CPU/GPU placement.
  • Prompt Customization

    • By default, the textbox placeholder is

      “Transcribe the attached audio to text with punctuation.”

    • You can customize this string directly in the Gradio component (the sketch after this list shows where these knobs live). If you omit the prompt entirely, run_asr will still run but may not add punctuation; it’s highly recommended to always provide a user prompt.
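
For reference, these knobs typically boil down to a few lines near the top of app.py. The names below are illustrative (only MODEL_ID is confirmed by the code above), and @spaces.GPU(duration=...) is an optional ZeroGPU argument you can use if long clips need more than the default time slot:

    MODEL_ID = "Qwen/Qwen2.5-Omni-3B"   # swap for another Qwen2.5-Omni checkpoint
    DEFAULT_PROMPT = "Transcribe the attached audio to text with punctuation."
    OPENCC_CONFIG = "s2t"               # e.g. "s2tw" for Taiwan-standard Traditional Chinese

    # Optional: request a longer ZeroGPU slot for long recordings
    # @spaces.GPU(duration=120)
    # def run_asr(audio_path, user_prompt): ...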


Project Structure

qwen2-omni-asr-zerogpu/
├── app.py            # Main application code (Gradio + inference logic)
├── requirements.txt  # All Python dependencies
├── README.md         # This file
└── LICENSE           # (Optional) License, if you wish to open-source
  • app.py

    • Imports: spaces, torch, transformers, qwen_omni_utils, opencc, gradio.
    • Defines a helper _strip_prompts() to remove system/user/assistant markers.
    • Implements run_asr(...) decorated with @spaces.GPU.
    • Builds Gradio Blocks UI (with gr.Row(), gr.Column(), etc.).
  • requirements.txt

    • Must include exactly what’s needed to run on Spaces (and locally).
    • ZeroGPU (the spaces package) should be listed first, so that the Spaces auto-offload wrapper is installed.

Usage Examples

  1. Local Testing

    python app.py
    
    • Open your browser to http://127.0.0.1:7860/
    • Upload a short .wav or .mp3 file (in Chinese) and click “Transcribe.”
    • Verify that the output is properly punctuated, in Traditional Chinese, and free of system/user prefixes.
  2. Command-Line Invocation: Although the main interface is Gradio, you can also import run_asr directly in a Python shell to transcribe a single file (a batch variant is sketched after this list):

    from app import run_asr
    
    transcript = run_asr("path/to/audio.wav", "Transcribe the audio with punctuation.")
    print(transcript)  # → Traditional Chinese transcript
    
  3. Hugging Face Spaces

    • Ensure the repo is pushed to a Space (no special hardware required).
    • The web UI will appear under your Space’s URL (e.g., https://huggingface.co/spaces/your-username/qwen2-omni-asr-zerogpu).
    • End users simply upload audio and click “Transcribe.”
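
Building on the command-line example in item 2, here is a small batch-processing sketch (the recordings/ folder and output format are just examples):

    from pathlib import Path
    from app import run_asr

    PROMPT = "Transcribe the attached audio to text with punctuation."
    for audio_file in sorted(Path("recordings").glob("*.wav")):
        transcript = run_asr(str(audio_file), PROMPT)
        print(f"{audio_file.name}: {transcript}")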

Troubleshooting

  • “Please upload an audio file first.”

    • This warning is returned if you click “Transcribe” without uploading a valid audio path.
  • Model-not-registered / FunASR Errors

    • If you see errors about “model not registered,” make sure you have the latest qwen-omni-utils version and check your internet connectivity (HF model downloads).
  • ZeroGPU Fallback

    • If no GPU is detected, ZeroGPU will automatically run inference on CPU. Performance will be slower, but functionality remains identical.
  • Output Contains “system … user … assistant”

    • If you still see system/user/assistant text, check that _strip_prompts() is present in app.py and is being applied to full_decoded.

Contributing

  1. Fork the Repository

  2. Create a New Branch

    git checkout -b feature/my-enhancement
    
  3. Make Your Changes

    • Improve prompt-stripping logic, add new model IDs, or enhance the UI.
    • If you add new Python dependencies, remember to update requirements.txt.
  4. Test Locally

    python app.py
    
  5. Push & Open a Pull Request

    • Describe your changes in detail.
    • Ensure the README is updated if new features are added.

License

This project is open-source; the Space metadata above declares the MIT license. If you fork it, you can choose a license of your preference (MIT / Apache 2.0 / etc.). If no license file is provided, the default is “all rights reserved by the author.”


Acknowledgments

  • Qwen Team (Alibaba) for the Qwen2.5-Omni model.
  • Hugging Face for Transformers, Gradio, and ZeroGPU infrastructure (spaces package).
  • OpenCC for reliable Simplified→Traditional Chinese conversion.
  • qwen-omni-utils for audio/video preprocessing helpers.

Thank you for trying out the Qwen2.5-Omni ASR (ZeroGPU) Gradio App! If you run into any issues or have suggestions, feel free to open an Issue or Pull Request on GitHub.