---
title: Qwen2.5 Omni 3B ASR
emoji: 
colorFrom: gray
colorTo: indigo
sdk: gradio
sdk_version: 5.32.1
app_file: app.py
pinned: false
license: mit
short_description: Qwen2.5 Omni 3B ASR DEMO
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Qwen2.5-Omni ASR (ZeroGPU) Gradio App

A lightweight Gradio application that uses Qwen2.5-Omni’s audio-to-text capability to perform automatic speech recognition (ASR) on uploaded audio files, then converts the Simplified Chinese output to Traditional Chinese. The app targets Hugging Face ZeroGPU Spaces, where a GPU is allocated on demand for each inference call, so it can be deployed without reserving a dedicated GPU.

---

## Overview

* **Model:** Qwen2.5-Omni-3B
* **Processor:** Qwen2.5-Omni processor (handles tokenization and chat-template formatting)
* **Audio/Video Preprocessing:** `qwen-omni-utils` (handles loading and resampling)
* **Simplified→Traditional Conversion:** `opencc`
* **Web UI:** Gradio v5 (blocks API)
* **ZeroGPU:** Hugging Face’s on-demand GPU allocation for Spaces (the `spaces` package); a GPU is attached only while a `@spaces.GPU`-decorated function runs

When a user uploads an audio file and provides a (customizable) prompt such as “Transcribe the attached audio to text with punctuation,” the app builds the chat messages that Qwen2.5-Omni expects (including a system prompt under the hood), runs inference on a ZeroGPU-allocated device, strips the internal “system … user … assistant” markers, and returns only the ASR transcript, converted into Traditional Chinese.

---

## Features

1. **Audio-to-Text with Qwen2.5-Omni**

   * Uses the official Qwen2.5-Omni model (3B parameters) to generate a punctuated transcript from arbitrary audio formats (WAV, MP3, etc.).
2. **ZeroGPU Acceleration**

   * Requests a GPU only for the duration of each inference call, enabling low-resource deployment on Hugging Face Spaces without reserving a dedicated GPU.
3. **Simplified→Traditional Chinese Conversion**

   * Applies OpenCC (“s2t”) to convert simplified Chinese output into Traditional Chinese in a single step.
4. **Clean Transcript Output**

   * Internal “system”, “user”, and “assistant” prefixes are stripped before display, so end users see only the actual ASR text.
5. **Gradio Blocks UI (v5)**

   * Simple two-column layout: upload your audio and enter a prompt on the left, click Transcribe, and view the Traditional Chinese transcript on the right.

---

## Demo

![App Screenshot](https://user-provide-your-own-screenshot-url) <!-- Optional: insert a screenshot link or remove this line -->

1. **Upload Audio**: Click “Browse” or drag & drop a WAV/MP3/… file.
2. **User Prompt**: By default, it is set to

   ```
   Transcribe the attached audio to text with punctuation.
   ```

   You can customize this if you want a different style of transcription (e.g., “Add speaker labels,” “Transcribe and summarize,” etc.).
3. **Transcribe**: Hit “Transcribe” (ZeroGPU handles device placement automatically).
4. **Output**: The Traditional Chinese transcript appears in the right textbox—cleaned of any system/user/assistant markers.

---

## Installation & Local Run

1. **Clone the Repository**

   ```bash
   git clone https://github.com/<your-username>/qwen2-omni-asr-zerogpu.git
   cd qwen2-omni-asr-zerogpu
   ```

2. **Create a Python Virtual Environment** (recommended)

   ```bash
   python3 -m venv venv
   source venv/bin/activate
   ```

3. **Install Dependencies**

   ```bash
   pip install --upgrade pip
   pip install -r requirements.txt
   ```

4. **Run the App Locally**

   ```bash
   python app.py
   ```

   * This starts a Gradio server on `http://127.0.0.1:7860/` (by default).
   * Outside Spaces, the `spaces` decorator is a no-op; `device_map="auto"` places the model on your CUDA device if one is available and falls back to CPU otherwise.

---

## Deployment on Hugging Face Spaces

1. Create a new Space on Hugging Face with the **Gradio** SDK (matching the `sdk: gradio` front matter above).
2. Select **ZeroGPU** hardware so that functions decorated with `@spaces.GPU` can request a GPU on demand.
3. Push (or upload) the repository contents, including:

   * `app.py`
   * `requirements.txt`
   * Any other config files (e.g., `README.md` itself).
4. Spaces installs dependencies from `requirements.txt` and launches `app.py` automatically.
5. Visit your Space’s URL to try it out.

*No `Dockerfile` or server config is needed; the Gradio SDK runs `app.py` for you. Just ensure `spaces` is in `requirements.txt`.*

---

## File Structure

```
├── app.py
├── requirements.txt
├── README.md
└── LICENSE  (optional)
```

* **app.py**

  * Entry point for the Gradio app.
  * Defines `run_asr(...)` decorated with `@spaces.GPU`, so ZeroGPU attaches a GPU for the duration of each call.
  * Loads the Qwen2.5-Omni model & processor, runs audio preprocessing, inference, decoding, prompt stripping, and Simplified→Traditional conversion.
  * Builds a Gradio Blocks UI (two-column layout).

* **requirements.txt**

  ```text
  # ZeroGPU support on Hugging Face Spaces (provides @spaces.GPU)
  spaces

  # PyTorch + Transformers
  torch
  transformers

  # Required by device_map="auto" when loading the model
  accelerate

  # Qwen Omni utilities (audio/video preprocessing)
  qwen-omni-utils

  # OpenCC (Simplified→Traditional conversion)
  opencc

  # Gradio v5
  gradio>=5.0.0
  ```

* **README.md**

  * (You’re reading it.)

---

## How It Works

1. **Model & Processor Loading**

   ```python
   MODEL_ID = "Qwen/Qwen2.5-Omni-3B"
   model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
       MODEL_ID, torch_dtype="auto", device_map="auto"
   )
   model.disable_talker()
   processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)
   model.eval()
   ```

   * `device_map="auto"` + `@spaces.GPU` (ZeroGPU decorator) ensure that, if a GPU is present, weights are offloaded to GPU; otherwise stay on CPU.
   * `disable_talker()` removes any “talker” head to focus purely on ASR.

2. **Message Construction for ASR**

   ```python
   sys_prompt = (
       "You are Qwen, a virtual human developed by the Qwen Team, "
       "Alibaba Group, capable of perceiving auditory and visual inputs, "
       "as well as generating text and speech."
   )
   messages = [
       {"role": "system", "content": [{"type": "text", "text": sys_prompt}]},
       {
           "role": "user",
           "content": [
               {"type": "audio", "audio": audio_path},
               {"type": "text", "text": user_prompt}
           ],
       },
   ]
   ```

   * This mirrors the Qwen chat template: first a system message, then a user message containing an uploaded audio file + a textual instruction.

3. **Apply Chat Template & Preprocess**

   ```python
   from qwen_omni_utils import process_mm_info

   text_input = processor.apply_chat_template(
       messages, tokenize=False, add_generation_prompt=True
   )
   audios, images, videos = process_mm_info(messages, use_audio_in_video=True)
   inputs = processor(
       text=text_input,
       audio=audios,
       images=images,
       videos=videos,
       return_tensors="pt",
       padding=True,
       use_audio_in_video=True
   ).to(model.device).to(model.dtype)
   ```

   * `apply_chat_template(...)` formats the messages into a single input string.
   * `process_mm_info(...)` handles loading & resampling of audio (and potentially extracting video frames, if video files are provided).
   * The final `inputs` tensor dict is ready for `model.generate()`.

4. **Inference & Post-Processing**

   ```python
   from opencc import OpenCC

   cc = OpenCC("s2t")  # module-level Simplified→Traditional converter

   output_tokens = model.generate(
       **inputs,
       use_audio_in_video=True,
       return_audio=False,
       thinker_max_new_tokens=512,
       thinker_do_sample=False
   )
   full_decoded = processor.batch_decode(
       output_tokens, skip_special_tokens=True, clean_up_tokenization_spaces=False
   )[0].strip()
   asr_only = _strip_prompts(full_decoded)
   return cc.convert(asr_only)
   ```

   * `model.generate(...)` runs a greedy (no sampling) decoding over up to 512 new tokens.
   * `batch_decode(...)` yields a single string that includes all “system … user … assistant” markers.
   * `_strip_prompts(...)` finds the first occurrence of `assistant` in that output and returns only the substring after it, so that the UI sees just the raw transcript.
   * Finally, `opencc` converts that transcript from simplified to Traditional Chinese.
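For reference, a minimal sketch of what `_strip_prompts()` might look like, given the behavior described above (the exact marker handling in `app.py` may differ):

```python
def _strip_prompts(decoded: str) -> str:
    """Keep only the text after the first 'assistant' marker."""
    marker = "assistant"
    idx = decoded.find(marker)
    if idx == -1:
        return decoded.strip()  # no marker found; return the whole output
    return decoded[idx + len(marker):].strip()
```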

---

## Dependencies

All required dependencies are listed in `requirements.txt`. Briefly:

* **spaces**: ZeroGPU support; provides the `@spaces.GPU` decorator that requests a GPU per call on Spaces.
* **torch** & **transformers**: Core PyTorch framework and Hugging Face Transformers (to load Qwen2.5-Omni).
* **accelerate**: Required by `device_map="auto"` when loading the model.
* **qwen-omni-utils**: Utility functions to preprocess audio/video for Qwen2.5-Omni.
* **opencc**: Simplified→Traditional Chinese converter (uses the “s2t” config).
* **gradio >= 5.0.0**: For building the web UI.

When you run `pip install -r requirements.txt`, all dependencies will be pulled from PyPI.

---

## Configuration

* **Model ID**

  * Defined in `app.py` as `MODEL_ID = "Qwen/Qwen2.5-Omni-3B"`.
  * To try the larger variant, update that string to another HF model repository (e.g., `"Qwen/Qwen2.5-Omni-7B"`), then re-deploy.

* **ZeroGPU Offload**

  * The `@spaces.GPU` decorator on `run_asr(...)` is all you need: on Spaces, a GPU is attached for the duration of each decorated call (see the sketch after this list).
  * No extra config or environment variables are required; just keep `spaces` in `requirements.txt`.

* **Prompt Customization**

  * By default, the textbox placeholder is

    > “Transcribe the attached audio to text with punctuation.”
  * You can customize this string directly in the Gradio component. If you omit the prompt entirely, `run_asr` will still run but may not add punctuation; it’s highly recommended to always provide a user prompt.
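As referenced under **ZeroGPU Offload** above, a minimal sketch of the decorated entry point; the `duration` value and stubbed body are illustrative, and the full pipeline is shown in “How It Works”:

```python
import spaces  # import before torch so ZeroGPU can patch CUDA initialization

@spaces.GPU(duration=120)  # GPU is attached only while this call runs; no-op locally
def run_asr(audio_path: str, user_prompt: str) -> str:
    ...  # build messages, preprocess, generate, strip markers, convert to Traditional Chinese
```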

---

## Project Structure

```text
qwen2-omni-asr-zerogpu/
├── app.py            # Main application code (Gradio + inference logic)
├── requirements.txt  # All Python dependencies
├── README.md         # This file
└── LICENSE           # (Optional) License, if you wish to open-source
```

* **app.py**

  * Imports: `spaces`, `torch`, `transformers`, `qwen_omni_utils`, `opencc`, `gradio`.
  * Defines a helper `_strip_prompts()` to remove system/user/assistant markers.
  * Implements `run_asr(...)` decorated with `@spaces.GPU`.
  * Builds the Gradio Blocks UI with `gr.Row()`, `gr.Column()`, etc. (a sketch follows this list).

* **requirements.txt**

  * Must include exactly what’s needed to run on Spaces (and locally).
  * `spaces` must be listed so the ZeroGPU decorator is available; in `app.py` itself, import `spaces` before `torch`.
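The two-column Blocks layout mentioned above can be sketched as follows (component labels and defaults are illustrative, not the app’s exact code):

```python
import gradio as gr

def run_asr(audio_path: str, user_prompt: str) -> str:
    ...  # inference pipeline from "How It Works"; stubbed here for brevity

with gr.Blocks(title="Qwen2.5-Omni ASR (ZeroGPU)") as demo:
    with gr.Row():
        with gr.Column():  # left column: inputs
            audio_in = gr.Audio(type="filepath", label="Upload audio")
            prompt_in = gr.Textbox(
                value="Transcribe the attached audio to text with punctuation.",
                label="User prompt",
            )
            go = gr.Button("Transcribe")
        with gr.Column():  # right column: output
            text_out = gr.Textbox(label="Traditional Chinese transcript")
    go.click(run_asr, inputs=[audio_in, prompt_in], outputs=text_out)

demo.launch()
```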

---

## Usage Examples

1. **Local Testing**

   ```bash
   python app.py
   ```

   * Open your browser to `http://127.0.0.1:7860/`
   * Upload a short `.wav` or `.mp3` file (in Chinese) and click “Transcribe.”
   * Verify that the output is properly punctuated, in Traditional Chinese, and free of system/user prefixes.

2. **Command-Line Invocation**
   Although the main interface is Gradio, you can also import `run_asr` directly in a Python shell to transcribe a single file (a batch-processing sketch follows this list):

   ```python
   from app import run_asr

   transcript = run_asr("path/to/audio.wav", "Transcribe the audio with punctuation.")
   print(transcript)  # → Traditional Chinese transcript
   ```

3. **Hugging Face Spaces**

   * Ensure the repo is pushed to a Space (no special hardware required).
   * The web UI will appear under your Space’s URL (e.g., `https://huggingface.co/spaces/your-username/qwen2-omni-asr-zerogpu`).
   * End users simply upload audio and click “Transcribe.”
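Building on example 2 above, a small batch-processing sketch (the `samples/` folder is hypothetical):

```python
from pathlib import Path

from app import run_asr

# Transcribe every .wav file in a folder and print the results.
for wav in sorted(Path("samples").glob("*.wav")):
    transcript = run_asr(str(wav), "Transcribe the attached audio to text with punctuation.")
    print(f"{wav.name}: {transcript}")
```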

---

## Troubleshooting

* **“Please upload an audio file first.”**

  * Returned if you click “Transcribe” before uploading an audio file.
* **Model-not-registered / FunASR Errors**

  * If you see errors about “model not registered,” make sure `qwen-omni-utils` and `transformers` are up to date and check your internet connectivity (the model weights are downloaded from the Hugging Face Hub).
* **ZeroGPU Fallback**

  * When running locally without a GPU, the `spaces` decorator is a no-op and inference falls back to CPU. Performance will be slower, but functionality remains identical.
* **Output Contains “system … user … assistant”**

  * If you still see system/user/assistant text, check that `_strip_prompts()` is present in `app.py` and is being applied to `full_decoded`.

---

## Contributing

1. **Fork the Repository**
2. **Create a New Branch**

   ```bash
   git checkout -b feature/my-enhancement
   ```
3. **Make Your Changes**

   * Improve prompt-stripping logic, add new model IDs, or enhance the UI.
   * If you add new Python dependencies, remember to update `requirements.txt`.
4. **Test Locally**

   ```bash
   python app.py
   ```
5. **Push & Open a Pull Request**

   * Describe your changes in detail.
   * Ensure the README is updated if new features are added.

---

## License

This project is released under the MIT license, as declared in the `license` field of the front matter above. If you fork it, feel free to add a `LICENSE` file or relicense your changes as you prefer.

---

## Acknowledgments

* **Qwen Team (Alibaba)** for the Qwen2.5-Omni model.
* **Hugging Face** for Transformers, Gradio, and ZeroGPU infrastructure (`spaces` package).
* **OpenCC** for reliable Simplified→Traditional Chinese conversion.
* **qwen-omni-utils** for audio/video preprocessing helpers.

---

Thank you for trying out the Qwen2.5-Omni ASR (ZeroGPU) Gradio App! If you run into any issues or have suggestions, feel free to open an Issue or Pull Request on GitHub.