---
title: Qwen2.5 Omni 3B ASR
emoji: ⚡
colorFrom: gray
colorTo: indigo
sdk: gradio
sdk_version: 5.32.1
app_file: app.py
pinned: false
license: mit
short_description: Qwen2.5 Omni 3B ASR DEMO
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Qwen2.5-Omni ASR (ZeroGPU) Gradio App

A lightweight Gradio application that leverages Qwen2.5-Omni's audio-to-text capabilities to perform automatic speech recognition (ASR) on uploaded audio files, then converts the Simplified Chinese output to Traditional Chinese. The project uses ZeroGPU for CPU/GPU offload acceleration, enabling efficient deployment on Hugging Face Spaces without requiring a dedicated GPU.

---

## Overview

* **Model:** Qwen2.5-Omni-3B
* **Processor:** Qwen2.5-Omni processor (handles tokenization and chat-template formatting)
* **Audio/Video Preprocessing:** `qwen-omni-utils` (handles loading and resampling)
* **Simplified→Traditional Conversion:** `opencc`
* **Web UI:** Gradio v5 (Blocks API)
* **ZeroGPU:** Hugging Face's offload wrapper (`spaces` package) that transparently dispatches tensors between the CPU and an available GPU (if any)

When a user uploads an audio file and provides a (customizable) user prompt such as "Transcribe the attached audio to text with punctuation," the app builds the exact chat messages that Qwen2.5-Omni expects (including a system prompt under the hood), runs inference via ZeroGPU, and returns only the ASR transcript, stripped of internal "system … user … assistant" markers and converted into Traditional Chinese.

---

## Features

1. **Audio-to-Text with Qwen2.5-Omni**
   * Uses the official Qwen2.5-Omni model (3B parameters) to generate a punctuated transcript from common audio formats (WAV, MP3, etc.).
2. **ZeroGPU Acceleration**
   * Automatically offloads model weights and activations between CPU and GPU, allowing low-resource deployment on Hugging Face Spaces without requiring a full-sized GPU.
3. **Simplified→Traditional Chinese Conversion**
   * Applies OpenCC ("s2t") to convert Simplified Chinese output into Traditional Chinese in a single step (see the sketch after this list).
4. **Clean Transcript Output**
   * Internal "system", "user", and "assistant" prefixes are stripped before display, so end users see only the actual ASR text.
5. **Gradio Blocks UI (v5)**
   * Simple two-column layout: upload your audio and enter a prompt on the left, click Transcribe, and view the Traditional Chinese transcript on the right.
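For reference, the Simplified→Traditional step in feature 3 boils down to a single converter call. A minimal sketch, assuming the `opencc` package from `requirements.txt` exposes the `OpenCC` class used in `app.py`:

```python
from opencc import OpenCC

# "s2t" = Simplified Chinese -> Traditional Chinese
cc = OpenCC("s2t")

print(cc.convert("自动语音识别"))  # -> 自動語音識別
```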
---

## Demo

![App Screenshot](https://user-provide-your-own-screenshot-url)

1. **Upload Audio**: Click "Browse" or drag & drop a WAV/MP3/… file.
2. **User Prompt**: By default, it is set to

   ```
   Transcribe the attached audio to text with punctuation.
   ```

   You can customize this if you want a different style of transcription (e.g., "Add speaker labels," "Transcribe and summarize," etc.).
3. **Transcribe**: Hit "Transcribe" (ZeroGPU handles device placement automatically).
4. **Output**: The Traditional Chinese transcript appears in the right textbox, cleaned of any system/user/assistant markers.

---

## Installation & Local Run

1. **Clone the Repository**

   ```bash
   git clone https://github.com/your-username/qwen2-omni-asr-zerogpu.git
   cd qwen2-omni-asr-zerogpu
   ```

2. **Create a Python Virtual Environment** (recommended)

   ```bash
   python3 -m venv venv
   source venv/bin/activate
   ```

3. **Install Dependencies**

   ```bash
   pip install --upgrade pip
   pip install -r requirements.txt
   ```

4. **Run the App Locally**

   ```bash
   python app.py
   ```

   * This starts a Gradio server on `http://127.0.0.1:7860/` (by default).
   * ZeroGPU will automatically detect whether a CUDA device is available and fall back to CPU if not.

---

## Deployment on Hugging Face Spaces

1. Create a new Space on Hugging Face (use the Python/Jupyter template).
2. Ensure you select **"Hardware Accelerator: None"** (Spaces will use ZeroGPU to offload automatically).
3. Push (or upload) the repository contents, including:
   * `app.py`
   * `requirements.txt`
   * Any other config files (e.g., `README.md` itself).
4. Spaces will install dependencies via `requirements.txt` and automatically launch `app.py` under ZeroGPU.
5. Visit your Space's URL to try it out.

*No explicit `Dockerfile` or server config is needed; ZeroGPU handles the backend. Just ensure `spaces` is in `requirements.txt`.*

---

## File Structure

```
├── app.py
├── requirements.txt
├── README.md
└── LICENSE (optional)
```

* **app.py**
  * Entry point for the Gradio app.
  * Defines `run_asr(...)` decorated with `@spaces.GPU` to enable ZeroGPU offload.
  * Loads the Qwen2.5-Omni model & processor, then runs audio preprocessing, inference, decoding, prompt stripping, and Simplified→Traditional conversion.
  * Builds a Gradio Blocks UI (two-column layout).
* **requirements.txt**

  ```text
  # ZeroGPU for CPU/GPU offload acceleration
  spaces

  # PyTorch + Transformers
  torch
  transformers

  # Qwen Omni utilities (for audio preprocessing)
  qwen-omni-utils

  # OpenCC (simplified→traditional conversion)
  opencc

  # Gradio v5
  gradio>=5.0.0
  ```

* **README.md**
  * (You're reading it.)

---

## How It Works

1. **Model & Processor Loading**

   ```python
   MODEL_ID = "Qwen/Qwen2.5-Omni-3B"
   model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
       MODEL_ID, torch_dtype="auto", device_map="auto"
   )
   model.disable_talker()
   processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)
   model.eval()
   ```

   * `device_map="auto"` + `@spaces.GPU` (the ZeroGPU decorator) ensure that, if a GPU is present, weights are offloaded to it; otherwise they stay on CPU.
   * `disable_talker()` removes the "talker" (speech-generation) head to focus purely on ASR.

2. **Message Construction for ASR**

   ```python
   sys_prompt = (
       "You are Qwen, a virtual human developed by the Qwen Team, "
       "Alibaba Group, capable of perceiving auditory and visual inputs, "
       "as well as generating text and speech."
   )
   messages = [
       {"role": "system", "content": [{"type": "text", "text": sys_prompt}]},
       {
           "role": "user",
           "content": [
               {"type": "audio", "audio": audio_path},
               {"type": "text", "text": user_prompt}
           ],
       },
   ]
   ```

   * This mirrors the Qwen chat template: first a system message, then a user message containing the uploaded audio file plus a textual instruction.

3. **Apply Chat Template & Preprocess**

   ```python
   text_input = processor.apply_chat_template(
       messages, tokenize=False, add_generation_prompt=True
   )
   audios, images, videos = process_mm_info(messages, use_audio_in_video=True)
   inputs = processor(
       text=text_input,
       audio=audios,
       images=images,
       videos=videos,
       return_tensors="pt",
       padding=True,
       use_audio_in_video=True
   ).to(model.device).to(model.dtype)
   ```

   * `apply_chat_template(...)` formats the messages into a single input string.
   * `process_mm_info(...)` handles loading & resampling of audio (and, if video files are provided, frame extraction).
   * The final `inputs` tensor dict is ready for `model.generate()`.

4. **Inference & Post-Processing**

   ```python
   output_tokens = model.generate(
       **inputs,
       use_audio_in_video=True,
       return_audio=False,
       thinker_max_new_tokens=512,
       thinker_do_sample=False
   )
   full_decoded = processor.batch_decode(
       output_tokens,
       skip_special_tokens=True,
       clean_up_tokenization_spaces=False
   )[0].strip()
   asr_only = _strip_prompts(full_decoded)
   return cc.convert(asr_only)
   ```

   * `model.generate(...)` runs greedy (no-sampling) decoding over up to 512 new tokens.
   * `batch_decode(...)` yields a single string that still includes the "system … user … assistant" markers.
   * `_strip_prompts(...)` finds the first occurrence of `assistant` in that output and returns only the substring after it, so the UI sees just the raw transcript (see the sketch after this list).
   * Finally, `opencc` converts the transcript from Simplified to Traditional Chinese.
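The exact `_strip_prompts()` helper lives in `app.py`; a minimal sketch of the behavior described above (keeping only what follows the first `assistant` marker) could look like this:

```python
def _strip_prompts(decoded: str) -> str:
    """Return only the text after the first 'assistant' marker.

    The decoded generation still contains the chat-template roles
    ("system ... user ... assistant ..."); everything up to and
    including the first "assistant" is prompt scaffolding, so only
    the remainder (the actual transcript) is kept.
    """
    marker = "assistant"
    idx = decoded.find(marker)
    if idx == -1:
        # No marker found: return the decoded text unchanged.
        return decoded.strip()
    return decoded[idx + len(marker):].strip()
```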
---

## Dependencies

All required dependencies are listed in `requirements.txt`. Briefly:

* **spaces**: Offload wrapper (ZeroGPU) that auto-dispatches tensors between CPU and GPU.
* **torch** & **transformers**: Core PyTorch framework and Hugging Face Transformers (to load Qwen2.5-Omni).
* **qwen-omni-utils**: Utility functions to preprocess audio/video for Qwen2.5-Omni.
* **opencc**: Simplified→Traditional Chinese converter (uses the "s2t" config).
* **gradio >= 5.0.0**: For building the web UI.

When you run `pip install -r requirements.txt`, all dependencies are pulled from PyPI.

---

## Configuration

* **Model ID**
  * Defined in `app.py` as `MODEL_ID = "Qwen/Qwen2.5-Omni-3B"`.
  * If you want to try a larger Qwen2.5-Omni model, simply update that string to another HF model repository (e.g., `"Qwen/Qwen2.5-Omni-7B"`), then re-deploy.
* **ZeroGPU Offload**
  * The `@spaces.GPU` decorator on `run_asr(...)` is all you need to enable transparent offloading.
  * No extra config or environment variables are required. Spaces will detect this, install `spaces`, and manage CPU/GPU placement.
* **Prompt Customization**
  * By default, the textbox placeholder is

    > "Transcribe the attached audio to text with punctuation."

  * You can customize this string directly in the Gradio component. If you omit the prompt entirely, `run_asr` will still run but may not add punctuation; always providing a user prompt is strongly recommended.

---

## Project Structure

```text
qwen2-omni-asr-zerogpu/
├── app.py              # Main application code (Gradio + inference logic)
├── requirements.txt    # All Python dependencies
├── README.md           # This file
└── LICENSE             # (Optional) License, if you wish to open-source
```

* **app.py**
  * Imports: `spaces`, `torch`, `transformers`, `qwen_omni_utils`, `opencc`, `gradio`.
  * Defines a helper `_strip_prompts()` to remove system/user/assistant markers.
  * Implements `run_asr(...)` decorated with `@spaces.GPU`.
  * Builds the Gradio Blocks UI (with `gr.Row()`, `gr.Column()`, etc.); a minimal layout sketch follows this list.
* **requirements.txt**
  * Must include exactly what's needed to run on Spaces (and locally).
  * ZeroGPU (the `spaces` package) should come first, so that Spaces's auto-offload wrapper is installed.
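The two-column Blocks layout mentioned above takes only a few lines of Gradio code. A minimal sketch (component labels and defaults are illustrative; the actual layout lives in `app.py`):

```python
import gradio as gr

from app import run_asr  # the @spaces.GPU-decorated inference function

with gr.Blocks(title="Qwen2.5-Omni ASR (ZeroGPU)") as demo:
    with gr.Row():
        # Left column: audio upload, prompt, and the Transcribe button
        with gr.Column():
            audio_in = gr.Audio(type="filepath", label="Audio file")
            prompt_in = gr.Textbox(
                value="Transcribe the attached audio to text with punctuation.",
                label="User prompt",
            )
            btn = gr.Button("Transcribe")
        # Right column: Traditional Chinese transcript output
        with gr.Column():
            transcript_out = gr.Textbox(label="Traditional Chinese transcript")

    btn.click(fn=run_asr, inputs=[audio_in, prompt_in], outputs=transcript_out)

if __name__ == "__main__":
    demo.launch()
```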
---

## Usage Examples

1. **Local Testing**

   ```bash
   python app.py
   ```

   * Open your browser to `http://127.0.0.1:7860/`
   * Upload a short `.wav` or `.mp3` file (in Chinese) and click "Transcribe."
   * Verify that the output is properly punctuated, in Traditional Chinese, and free of system/user prefixes.

2. **Command-Line Invocation**

   Although the main interface is Gradio, you can also import `run_asr` directly in a Python shell to transcribe a single file:

   ```python
   from app import run_asr

   transcript = run_asr("path/to/audio.wav", "Transcribe the audio with punctuation.")
   print(transcript)  # → Traditional Chinese transcript
   ```

3. **Hugging Face Spaces**

   * Ensure the repo is pushed to a Space (no special hardware required).
   * The web UI will appear under your Space's URL (e.g., `https://huggingface.co/spaces/your-username/qwen2-omni-asr-zerogpu`).
   * End users simply upload audio and click "Transcribe."

---

## Troubleshooting

* **"Please upload an audio file first."**
  * This warning is returned if you click "Transcribe" without uploading a valid audio file.
* **Model-not-registered / FunASR Errors**
  * If you see errors about a "model not registered," make sure you have the latest `qwen-omni-utils` version and check your internet connectivity (HF model downloads).
* **ZeroGPU Fallback**
  * If no GPU is detected, ZeroGPU will automatically run inference on CPU. Performance will be slower, but functionality remains identical.
* **Output Contains "system … user … assistant"**
  * If you still see system/user/assistant text, check that `_strip_prompts()` is present in `app.py` and is applied to `full_decoded`.

---

## Contributing

1. **Fork the Repository**
2. **Create a New Branch**

   ```bash
   git checkout -b feature/my-enhancement
   ```

3. **Make Your Changes**
   * Improve prompt-stripping logic, add new model IDs, or enhance the UI.
   * If you add new Python dependencies, remember to update `requirements.txt`.
4. **Test Locally**

   ```bash
   python app.py
   ```

5. **Push & Open a Pull Request**
   * Describe your changes in detail.
   * Ensure the README is updated if new features are added.

---

## License

This project is open-source. You can choose a license of your preference (MIT / Apache 2.0 / etc.). If no license file is provided, the default is "All rights reserved by the author."

---

## Acknowledgments

* **Qwen Team (Alibaba)** for the Qwen2.5-Omni model.
* **Hugging Face** for Transformers, Gradio, and the ZeroGPU infrastructure (`spaces` package).
* **OpenCC** for reliable Simplified→Traditional Chinese conversion.
* **qwen-omni-utils** for audio/video preprocessing helpers.

---

Thank you for trying out the Qwen2.5-Omni ASR (ZeroGPU) Gradio App! If you run into any issues or have suggestions, feel free to open an Issue or Pull Request on GitHub.