---
title: Qwen2.5 Omni 3B ASR
emoji: ⚡
colorFrom: gray
colorTo: indigo
sdk: gradio
sdk_version: 5.32.1
app_file: app.py
pinned: false
license: mit
short_description: Qwen2.5 Omni 3B ASR DEMO
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# Qwen2.5-Omni ASR (ZeroGPU) Gradio App
A lightweight Gradio application that uses Qwen2.5-Omni's audio-to-text capabilities to perform automatic speech recognition (ASR) on uploaded audio files, then converts the Simplified Chinese output to Traditional Chinese. The app runs on Hugging Face ZeroGPU, which allocates a GPU on demand for each inference call, enabling efficient deployment on Hugging Face Spaces without a dedicated GPU.
## Overview
- Model: Qwen2.5-Omni-3B
- Processor: Qwen2.5-Omni processor (handles tokenization and chat-template formatting)
- Audio/Video Preprocessing: `qwen-omni-utils` (handles loading and resampling)
- Simplified→Traditional Conversion: `opencc`
- Web UI: Gradio v5 (Blocks API)
- ZeroGPU: Hugging Face's `spaces` package, which allocates a GPU on demand for decorated functions and falls back to CPU when none is available
When a user uploads an audio file and provides a (customizable) user prompt like “Transcribe the attached audio to text with punctuation,” the app builds the exact same chat messages that Qwen2.5-Omni expects (including a system prompt under the hood), runs inference via ZeroGPU, and returns only the ASR transcript—stripped of internal “system … user … assistant” markers—converted into Traditional Chinese.
## Features
### Audio-to-Text with Qwen2.5-Omni
- Uses the official Qwen2.5-Omni model (3B parameters) to generate a punctuated transcript from common audio formats (WAV, MP3, etc.).

### ZeroGPU Acceleration
- Requests a GPU on demand via the `@spaces.GPU` decorator, allowing low-resource deployment on Hugging Face Spaces without a dedicated GPU.

### Simplified→Traditional Chinese Conversion
- Applies OpenCC ("s2t") to convert the Simplified Chinese output into Traditional Chinese in a single step.

### Clean Transcript Output
- Internal "system", "user", and "assistant" prefixes are stripped before display, so end users see only the actual ASR text.

### Gradio Blocks UI (v5)
- Simple two-column layout: upload your audio and enter a prompt on the left, click Transcribe, and view the Traditional Chinese transcript on the right.
## Demo
1. Upload Audio: Click "Browse" or drag & drop a WAV/MP3/… file.
2. User Prompt: By default, it is set to "Transcribe the attached audio to text with punctuation." You can customize this if you want a different style of transcription (e.g., "Add speaker labels," "Transcribe and summarize," etc.).
3. Transcribe: Hit "Transcribe" (ZeroGPU handles device placement automatically).
4. Output: The Traditional Chinese transcript appears in the right textbox, cleaned of any system/user/assistant markers.
## Installation & Local Run
1. Clone the Repository

   ```bash
   git clone https://github.com/<your-username>/qwen2-omni-asr-zerogpu.git
   cd qwen2-omni-asr-zerogpu
   ```

2. Create a Python Virtual Environment (recommended)

   ```bash
   python3 -m venv venv
   source venv/bin/activate
   ```

3. Install Dependencies

   ```bash
   pip install --upgrade pip
   pip install -r requirements.txt
   ```

4. Run the App Locally

   ```bash
   python app.py
   ```

   - This starts a Gradio server on `http://127.0.0.1:7860/` by default.
   - ZeroGPU will automatically use a CUDA device if one is present, or fall back to CPU if not.
## Deployment on Hugging Face Spaces
1. Create a new Space on Hugging Face using the Gradio SDK.
2. Select the ZeroGPU hardware option for the Space (no dedicated GPU is reserved; a GPU is allocated on demand at inference time).
3. Push (or upload) the repository contents, including:
   - `app.py`
   - `requirements.txt`
   - Any other config files (e.g., this `README.md` itself).
4. Spaces will install dependencies from `requirements.txt` and automatically launch `app.py` under ZeroGPU.
5. Visit your Space's URL to try it out.

No explicit `Dockerfile` or server config is needed; ZeroGPU handles the backend. Just ensure `spaces` is in `requirements.txt`.
## File Structure
```
├── app.py
├── requirements.txt
├── README.md
└── LICENSE (optional)
```
### `app.py`

- Entry point for the Gradio app.
- Defines `run_asr(...)`, decorated with `@spaces.GPU` to enable ZeroGPU.
- Loads the Qwen2.5-Omni model & processor, then runs audio preprocessing, inference, decoding, prompt stripping, and Simplified→Traditional conversion.
- Builds a Gradio Blocks UI (two-column layout).
### `requirements.txt`

```
# ZeroGPU support (spaces package)
spaces
# PyTorch + Transformers
torch
transformers
# Qwen Omni utilities (for audio preprocessing)
qwen-omni-utils
# OpenCC (Simplified→Traditional conversion)
opencc
# Gradio v5
gradio>=5.0.0
```

### `README.md`

- (You're reading it.)
## How It Works
### Model & Processor Loading

```python
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor

MODEL_ID = "Qwen/Qwen2.5-Omni-3B"

model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)
model.disable_talker()

processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)
model.eval()
```

- `device_map="auto"` together with the `@spaces.GPU` decorator (ZeroGPU) ensures that, when a GPU is available, weights are placed on it; otherwise they stay on CPU.
- `disable_talker()` drops the speech-generation ("talker") head so the model is used purely for ASR.
### Message Construction for ASR

```python
sys_prompt = (
    "You are Qwen, a virtual human developed by the Qwen Team, "
    "Alibaba Group, capable of perceiving auditory and visual inputs, "
    "as well as generating text and speech."
)

messages = [
    {"role": "system", "content": [{"type": "text", "text": sys_prompt}]},
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": audio_path},
            {"type": "text", "text": user_prompt},
        ],
    },
]
```

- This mirrors the Qwen chat template: first a system message, then a user message containing the uploaded audio file plus a textual instruction.
### Apply Chat Template & Preprocess

```python
from qwen_omni_utils import process_mm_info

text_input = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
audios, images, videos = process_mm_info(messages, use_audio_in_video=True)
inputs = processor(
    text=text_input,
    audio=audios,
    images=images,
    videos=videos,
    return_tensors="pt",
    padding=True,
    use_audio_in_video=True,
).to(model.device).to(model.dtype)
```

- `apply_chat_template(...)` formats the messages into a single input string.
- `process_mm_info(...)` handles loading and resampling of the audio (and extracting video frames, if video files are provided).
- The resulting `inputs` tensor dict is ready for `model.generate()`.
### Inference & Post-Processing

```python
output_tokens = model.generate(
    **inputs,
    use_audio_in_video=True,
    return_audio=False,
    thinker_max_new_tokens=512,
    thinker_do_sample=False,
)
full_decoded = processor.batch_decode(
    output_tokens, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0].strip()

asr_only = _strip_prompts(full_decoded)
return cc.convert(asr_only)
```

- `model.generate(...)` runs greedy (no sampling) decoding for up to 512 new tokens.
- `batch_decode(...)` yields a single string that still contains the "system … user … assistant" markers.
- `_strip_prompts(...)` finds the first occurrence of `assistant` in that output and returns only the substring after it, so the UI sees just the raw transcript.
- Finally, `opencc` converts that transcript from Simplified to Traditional Chinese.
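The helpers `_strip_prompts` and `cc` are referenced above but not shown. Based on the behavior described in this section, they might look roughly like the following sketch (the exact definitions in `app.py` may differ):

```python
from opencc import OpenCC

# OpenCC converter using the "s2t" config (Simplified -> Traditional), as noted in Features.
cc = OpenCC("s2t")


def _strip_prompts(decoded: str) -> str:
    """Return only the text after the first 'assistant' marker in the decoded output."""
    marker = "assistant"
    idx = decoded.find(marker)
    if idx == -1:
        # No marker found; return the full decode unchanged.
        return decoded.strip()
    return decoded[idx + len(marker):].strip()
```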
## Dependencies

All required dependencies are listed in `requirements.txt`. Briefly:

- `spaces`: Hugging Face ZeroGPU support (the `@spaces.GPU` decorator that requests a GPU on demand).
- `torch` & `transformers`: Core PyTorch framework and Hugging Face Transformers (to load Qwen2.5-Omni).
- `qwen-omni-utils`: Utility functions to preprocess audio/video for Qwen2.5-Omni.
- `opencc`: Simplified→Traditional Chinese converter (uses the "s2t" config).
- `gradio>=5.0.0`: For building the web UI.

When you run `pip install -r requirements.txt`, all dependencies are pulled from PyPI.
## Configuration

### Model ID

- Defined in `app.py` as `MODEL_ID = "Qwen/Qwen2.5-Omni-3B"`.
- If you want to try a larger Qwen2.5-Omni model, simply update that string to another HF model repository (e.g., `"Qwen/Qwen2.5-Omni-7B"`), then re-deploy.
### ZeroGPU Offload

- The `@spaces.GPU` decorator on `run_asr(...)` is all you need to enable ZeroGPU (see the sketch below).
- No extra config or environment variables are required; Spaces detects the decorator, installs the `spaces` package, and allocates a GPU for each decorated call.
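For context, the wiring can be as small as this sketch (the actual signature of `run_asr` in `app.py` may differ):

```python
import spaces


@spaces.GPU  # On Spaces, a GPU is allocated for the duration of each call.
def run_asr(audio_path: str, user_prompt: str) -> str:
    # Preprocessing, generation, and post-processing as shown in "How It Works".
    ...
```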
### Prompt Customization

By default, the textbox placeholder is "Transcribe the attached audio to text with punctuation." You can customize this string directly in the Gradio component (see the sketch below). If you omit the prompt entirely, `run_asr` will still run but may not add punctuation; it's highly recommended to always provide a user prompt.
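For illustration, changing the default prompt might look like this in `app.py` (the component name is an assumption, and the real code may use `placeholder=` rather than `value=`):

```python
import gradio as gr

# Illustrative component; adjust the default text to change the transcription style.
user_prompt = gr.Textbox(
    label="User Prompt",
    value="Transcribe the attached audio to text with punctuation.",
    lines=2,
)
```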
## Project Structure

```
qwen2-omni-asr-zerogpu/
├── app.py            # Main application code (Gradio + inference logic)
├── requirements.txt  # All Python dependencies
├── README.md         # This file
└── LICENSE           # (Optional) License, if you wish to open-source
```
### `app.py`

- Imports: `spaces`, `torch`, `transformers`, `qwen_omni_utils`, `opencc`, `gradio`.
- Defines a helper `_strip_prompts()` to remove system/user/assistant markers.
- Implements `run_asr(...)`, decorated with `@spaces.GPU`.
- Builds the Gradio Blocks UI (with `gr.Row()`, `gr.Column()`, etc.); a minimal layout sketch follows below.
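A rough sketch of how that Blocks layout could be wired up (component names and labels are illustrative, and the stub below stands in for the real ZeroGPU-decorated `run_asr`):

```python
import gradio as gr


def run_asr(audio_path: str, user_prompt: str) -> str:
    """Stub standing in for the real inference function described in "How It Works"."""
    return "(transcript)"


with gr.Blocks(title="Qwen2.5-Omni ASR") as demo:
    with gr.Row():
        with gr.Column():
            audio_in = gr.Audio(type="filepath", label="Upload Audio")
            prompt_in = gr.Textbox(
                label="User Prompt",
                value="Transcribe the attached audio to text with punctuation.",
            )
            transcribe_btn = gr.Button("Transcribe")
        with gr.Column():
            transcript_out = gr.Textbox(label="Traditional Chinese Transcript", lines=10)

    # Wire the button to the inference function: audio + prompt in, transcript out.
    transcribe_btn.click(run_asr, inputs=[audio_in, prompt_in], outputs=transcript_out)

if __name__ == "__main__":
    demo.launch()
```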
### `requirements.txt`

- Must include exactly what's needed to run on Spaces (and locally).
- The `spaces` package (ZeroGPU) should be listed first, so the ZeroGPU wrapper is installed before the rest.
## Usage Examples
### Local Testing

```bash
python app.py
```

- Open your browser to `http://127.0.0.1:7860/`.
- Upload a short `.wav` or `.mp3` file (in Chinese) and click "Transcribe."
- Verify that the output is properly punctuated, in Traditional Chinese, and free of system/user prefixes.
### Command-Line Invocation

Although the main interface is Gradio, you can also import `run_asr` directly in a Python shell to run a single file:

```python
from app import run_asr

transcript = run_asr("path/to/audio.wav", "Transcribe the audio with punctuation.")
print(transcript)  # → Traditional Chinese transcript
```
### Hugging Face Spaces

- Ensure the repo is pushed to a Space (no dedicated GPU required; ZeroGPU allocates one on demand).
- The web UI will appear under your Space's URL (e.g., `https://huggingface.co/spaces/your-username/qwen2-omni-asr-zerogpu`).
- End users simply upload audio and click "Transcribe."
## Troubleshooting
### "Please upload an audio file first."

- This warning is returned if you click "Transcribe" without uploading a valid audio file.
### Model-not-registered / FunASR Errors

- If you see errors about "model not registered," make sure you have the latest `qwen-omni-utils` version and check your internet connectivity (HF model downloads).
### ZeroGPU Fallback

- If no GPU is detected (e.g., when running locally), inference runs on CPU. Performance will be slower, but functionality remains identical.
### Output Contains "system … user … assistant"

- If you still see system/user/assistant text, check that `_strip_prompts()` is present in `app.py` and is being applied to `full_decoded`.
## Contributing
1. Fork the Repository
2. Create a New Branch

   ```bash
   git checkout -b feature/my-enhancement
   ```

3. Make Your Changes
   - Improve prompt-stripping logic, add new model IDs, or enhance the UI.
   - If you add new Python dependencies, remember to update `requirements.txt`.
4. Test Locally

   ```bash
   python app.py
   ```

5. Push & Open a Pull Request
   - Describe your changes in detail.
   - Ensure the README is updated if new features are added.
## License

This Space is configured with `license: mit` (see the front matter above). If you fork the project, you can choose a license of your preference (MIT / Apache 2.0, etc.); if no license file is provided, the default is "all rights reserved by the author."
## Acknowledgments
- Qwen Team (Alibaba) for the Qwen2.5-Omni model.
- Hugging Face for Transformers, Gradio, and the ZeroGPU infrastructure (`spaces` package).
- OpenCC for reliable Simplified→Traditional Chinese conversion.
- qwen-omni-utils for audio/video preprocessing helpers.
Thank you for trying out the Qwen2.5-Omni ASR (ZeroGPU) Gradio App! If you run into any issues or have suggestions, feel free to open an Issue or Pull Request on GitHub.