---
title: Qwen2.5 Omni 3B ASR
emoji: 
colorFrom: gray
colorTo: indigo
sdk: gradio
sdk_version: 5.32.1
app_file: app.py
pinned: false
license: mit
short_description: Qwen2.5 Omni 3B ASR DEMO
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Qwen2.5-Omni ASR (ZeroGPU) Gradio App

A lightweight Gradio application that uses Qwen2.5-Omni’s audio-to-text capability to perform automatic speech recognition (ASR) on uploaded audio files, then converts the Simplified Chinese output to Traditional Chinese. The app targets Hugging Face ZeroGPU Spaces, where a GPU is allocated on demand for each inference call, so it can be deployed without reserving a dedicated GPU.

---

## Overview

* **Model:** Qwen2.5-Omni-3B
* **Processor:** Qwen2.5-Omni processor (handles tokenization and chat-template formatting)
* **Audio/Video Preprocessing:** `qwen-omni-utils` (handles loading and resampling)
* **Simplified→Traditional Conversion:** `opencc`
* **Web UI:** Gradio v5 (blocks API)
* **ZeroGPU:** Hugging Face’s on-demand GPU allocation for Spaces (the `spaces` package); a GPU is attached only while a `@spaces.GPU`-decorated function runs

When a user uploads an audio file and provides a (customizable) prompt such as “Transcribe the attached audio to text with punctuation,” the app builds the chat messages that Qwen2.5-Omni expects (including a system prompt under the hood), runs inference on a ZeroGPU-allocated device, strips the internal “system … user … assistant” markers, and returns only the ASR transcript, converted into Traditional Chinese.

---

## Features

1. **Audio-to-Text with Qwen2.5-Omni**

   * Uses the official Qwen2.5-Omni model (3B parameters) to generate a punctuated transcript from arbitrary audio formats (WAV, MP3, etc.).
2. **ZeroGPU Acceleration**

   * Requests a GPU only for the duration of each inference call, enabling low-resource deployment on Hugging Face Spaces without reserving a dedicated GPU.
3. **Simplified→Traditional Chinese Conversion**

   * Applies OpenCC (“s2t”) to convert simplified Chinese output into Traditional Chinese in a single step.
4. **Clean Transcript Output**

   * Internal “system”, “user”, and “assistant” prefixes are stripped before display, so end users see only the actual ASR text.
5. **Gradio Blocks UI (v5)**

   * Simple two-column layout: upload your audio and enter a prompt on the left, click Transcribe, and view the Traditional Chinese transcript on the right.

---

## Demo

![App Screenshot](https://user-provide-your-own-screenshot-url) <!-- Optional: insert a screenshot link or remove this line -->

1. **Upload Audio**: Click “Browse” or drag & drop a WAV/MP3/… file.
2. **User Prompt**: By default, it is set to

   ```
   Transcribe the attached audio to text with punctuation.
   ```

   You can customize this if you want a different style of transcription (e.g., “Add speaker labels,” “Transcribe and summarize,” etc.).
3. **Transcribe**: Hit “Transcribe” (ZeroGPU handles device placement automatically).
4. **Output**: The Traditional Chinese transcript appears in the right textbox—cleaned of any system/user/assistant markers.

---

## Installation & Local Run

1. **Clone the Repository**

   ```bash
   git clone https://github.com/<your-username>/qwen2-omni-asr-zerogpu.git
   cd qwen2-omni-asr-zerogpu
   ```

2. **Create a Python Virtual Environment** (recommended)

   ```bash
   python3 -m venv venv
   source venv/bin/activate
   ```

3. **Install Dependencies**

   ```bash
   pip install --upgrade pip
   pip install -r requirements.txt
   ```

4. **Run the App Locally**

   ```bash
   python app.py
   ```

   * This starts a Gradio server on `http://127.0.0.1:7860/` (by default).
   * Outside Spaces, the `spaces` decorator is a no-op; `device_map="auto"` places the model on your CUDA device if one is available and falls back to CPU otherwise.

---

## Deployment on Hugging Face Spaces

1. Create a new Space on Hugging Face with the **Gradio** SDK (matching the `sdk: gradio` front matter above).
2. Select **ZeroGPU** hardware so that functions decorated with `@spaces.GPU` can request a GPU on demand.
3. Push (or upload) the repository contents, including:

   * `app.py`
   * `requirements.txt`
   * Any other config files (e.g., `README.md` itself).
4. Spaces installs dependencies from `requirements.txt` and launches `app.py` automatically.
5. Visit your Space’s URL to try it out.

*No `Dockerfile` or server config is needed; the Gradio SDK runs `app.py` for you. Just ensure `spaces` is in `requirements.txt`.*

---

## File Structure

```
├── app.py
├── requirements.txt
├── README.md
└── LICENSE  (optional)
```

* **app.py**

  * Entry point for the Gradio app.
  * Defines `run_asr(...)` decorated with `@spaces.GPU`, so ZeroGPU attaches a GPU for the duration of each call.
  * Loads the Qwen2.5-Omni model & processor, runs audio preprocessing, inference, decoding, prompt stripping, and Simplified→Traditional conversion.
  * Builds a Gradio Blocks UI (two-column layout).

* **requirements.txt**

  ```text
  # ZeroGPU support on Hugging Face Spaces (provides @spaces.GPU)
  spaces

  # PyTorch + Transformers
  torch
  transformers

  # Required by device_map="auto" when loading the model
  accelerate

  # Qwen Omni utilities (audio/video preprocessing)
  qwen-omni-utils

  # OpenCC (Simplified→Traditional conversion)
  opencc

  # Gradio v5
  gradio>=5.0.0
  ```

* **README.md**

  * (You’re reading it.)

---

## How It Works

1. **Model & Processor Loading**

   ```python
   MODEL_ID = "Qwen/Qwen2.5-Omni-3B"
   model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
       MODEL_ID, torch_dtype="auto", device_map="auto"
   )
   model.disable_talker()
   processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)
   model.eval()
   ```

   * `device_map="auto"` + `@spaces.GPU` (ZeroGPU decorator) ensure that, if a GPU is present, weights are offloaded to GPU; otherwise stay on CPU.
   * `disable_talker()` removes any “talker” head to focus purely on ASR.

2. **Message Construction for ASR**

   ```python
   sys_prompt = (
       "You are Qwen, a virtual human developed by the Qwen Team, "
       "Alibaba Group, capable of perceiving auditory and visual inputs, "
       "as well as generating text and speech."
   )
   messages = [
       {"role": "system", "content": [{"type": "text", "text": sys_prompt}]},
       {
           "role": "user",
           "content": [
               {"type": "audio", "audio": audio_path},
               {"type": "text", "text": user_prompt}
           ],
       },
   ]
   ```

   * This mirrors the Qwen chat template: first a system message, then a user message containing an uploaded audio file + a textual instruction.

3. **Apply Chat Template & Preprocess**

   ```python
   from qwen_omni_utils import process_mm_info

   text_input = processor.apply_chat_template(
       messages, tokenize=False, add_generation_prompt=True
   )
   audios, images, videos = process_mm_info(messages, use_audio_in_video=True)
   inputs = processor(
       text=text_input,
       audio=audios,
       images=images,
       videos=videos,
       return_tensors="pt",
       padding=True,
       use_audio_in_video=True
   ).to(model.device).to(model.dtype)
   ```

   * `apply_chat_template(...)` formats the messages into a single input string.
   * `process_mm_info(...)` handles loading & resampling of audio (and potentially extracting video frames, if video files are provided).
   * The final `inputs` tensor dict is ready for `model.generate()`.

4. **Inference & Post-Processing**

   ```python
   from opencc import OpenCC

   cc = OpenCC("s2t")  # module-level Simplified→Traditional converter

   output_tokens = model.generate(
       **inputs,
       use_audio_in_video=True,
       return_audio=False,
       thinker_max_new_tokens=512,
       thinker_do_sample=False
   )
   full_decoded = processor.batch_decode(
       output_tokens, skip_special_tokens=True, clean_up_tokenization_spaces=False
   )[0].strip()
   asr_only = _strip_prompts(full_decoded)
   return cc.convert(asr_only)
   ```

   * `model.generate(...)` runs a greedy (no sampling) decoding over up to 512 new tokens.
   * `batch_decode(...)` yields a single string that includes all “system … user … assistant” markers.
   * `_strip_prompts(...)` finds the first occurrence of `assistant` in that output and returns only the substring after it, so that the UI sees just the raw transcript.
   * Finally, `opencc` converts that transcript from simplified to Traditional Chinese.
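For reference, a minimal sketch of what `_strip_prompts()` might look like, given the behavior described above (the exact marker handling in `app.py` may differ):

```python
def _strip_prompts(decoded: str) -> str:
    """Keep only the text after the first 'assistant' marker."""
    marker = "assistant"
    idx = decoded.find(marker)
    if idx == -1:
        return decoded.strip()  # no marker found; return the whole output
    return decoded[idx + len(marker):].strip()
```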

---

## Dependencies

All required dependencies are listed in `requirements.txt`. Briefly:

* **spaces**: ZeroGPU support; provides the `@spaces.GPU` decorator that requests a GPU per call on Spaces.
* **torch** & **transformers**: Core PyTorch framework and Hugging Face Transformers (to load Qwen2.5-Omni).
* **accelerate**: Required by `device_map="auto"` when loading the model.
* **qwen-omni-utils**: Utility functions to preprocess audio/video for Qwen2.5-Omni.
* **opencc**: Simplified→Traditional Chinese converter (uses the “s2t” config).
* **gradio >= 5.0.0**: For building the web UI.

When you run `pip install -r requirements.txt`, all dependencies will be pulled from PyPI.

---

## Configuration

* **Model ID**

  * Defined in `app.py` as `MODEL_ID = "Qwen/Qwen2.5-Omni-3B"`.
  * To try the larger variant, update that string to another HF model repository (e.g., `"Qwen/Qwen2.5-Omni-7B"`), then re-deploy.

* **ZeroGPU Offload**

  * The `@spaces.GPU` decorator on `run_asr(...)` is all you need: on Spaces, a GPU is attached for the duration of each decorated call (see the sketch after this list).
  * No extra config or environment variables are required; just keep `spaces` in `requirements.txt`.

* **Prompt Customization**

  * By default, the textbox placeholder is

    > “Transcribe the attached audio to text with punctuation.”
  * You can customize this string directly in the Gradio component. If you omit the prompt entirely, `run_asr` will still run but may not add punctuation; it’s highly recommended to always provide a user prompt.
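As referenced under **ZeroGPU Offload** above, a minimal sketch of the decorated entry point; the `duration` value and stubbed body are illustrative, and the full pipeline is shown in “How It Works”:

```python
import spaces  # import before torch so ZeroGPU can patch CUDA initialization

@spaces.GPU(duration=120)  # GPU is attached only while this call runs; no-op locally
def run_asr(audio_path: str, user_prompt: str) -> str:
    ...  # build messages, preprocess, generate, strip markers, convert to Traditional Chinese
```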

---

## Project Structure

```text
qwen2-omni-asr-zerogpu/
├── app.py            # Main application code (Gradio + inference logic)
├── requirements.txt  # All Python dependencies
├── README.md         # This file
└── LICENSE           # (Optional) License, if you wish to open-source
```

* **app.py**

  * Imports: `spaces`, `torch`, `transformers`, `qwen_omni_utils`, `opencc`, `gradio`.
  * Defines a helper `_strip_prompts()` to remove system/user/assistant markers.
  * Implements `run_asr(...)` decorated with `@spaces.GPU`.
  * Builds the Gradio Blocks UI with `gr.Row()`, `gr.Column()`, etc. (a sketch follows this list).

* **requirements.txt**

  * Must include exactly what’s needed to run on Spaces (and locally).
  * `spaces` must be listed so the ZeroGPU decorator is available; in `app.py` itself, import `spaces` before `torch`.
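The two-column Blocks layout mentioned above can be sketched as follows (component labels and defaults are illustrative, not the app’s exact code):

```python
import gradio as gr

def run_asr(audio_path: str, user_prompt: str) -> str:
    ...  # inference pipeline from "How It Works"; stubbed here for brevity

with gr.Blocks(title="Qwen2.5-Omni ASR (ZeroGPU)") as demo:
    with gr.Row():
        with gr.Column():  # left column: inputs
            audio_in = gr.Audio(type="filepath", label="Upload audio")
            prompt_in = gr.Textbox(
                value="Transcribe the attached audio to text with punctuation.",
                label="User prompt",
            )
            go = gr.Button("Transcribe")
        with gr.Column():  # right column: output
            text_out = gr.Textbox(label="Traditional Chinese transcript")
    go.click(run_asr, inputs=[audio_in, prompt_in], outputs=text_out)

demo.launch()
```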

---

## Usage Examples

1. **Local Testing**

   ```bash
   python app.py
   ```

   * Open your browser to `http://127.0.0.1:7860/`
   * Upload a short `.wav` or `.mp3` file (in Chinese) and click “Transcribe.”
   * Verify that the output is properly punctuated, in Traditional Chinese, and free of system/user prefixes.

2. **Command-Line Invocation**
   Although the main interface is Gradio, you can also import `run_asr` directly in a Python shell to transcribe a single file (a batch-processing sketch follows this list):

   ```python
   from app import run_asr

   transcript = run_asr("path/to/audio.wav", "Transcribe the audio with punctuation.")
   print(transcript)  # → Traditional Chinese transcript
   ```

3. **Hugging Face Spaces**

   * Ensure the repo is pushed to a Space (no special hardware required).
   * The web UI will appear under your Space’s URL (e.g., `https://huggingface.co/spaces/your-username/qwen2-omni-asr-zerogpu`).
   * End users simply upload audio and click “Transcribe.”
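Building on example 2 above, a small batch-processing sketch (the `samples/` folder is hypothetical):

```python
from pathlib import Path

from app import run_asr

# Transcribe every .wav file in a folder and print the results.
for wav in sorted(Path("samples").glob("*.wav")):
    transcript = run_asr(str(wav), "Transcribe the attached audio to text with punctuation.")
    print(f"{wav.name}: {transcript}")
```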

---

## Troubleshooting

* **“Please upload an audio file first.”**

  * Returned if you click “Transcribe” before uploading an audio file.
* **Model-not-registered / FunASR Errors**

  * If you see errors about “model not registered,” make sure `qwen-omni-utils` and `transformers` are up to date and check your internet connectivity (the model weights are downloaded from the Hugging Face Hub).
* **ZeroGPU Fallback**

  * When running locally without a GPU, the `spaces` decorator is a no-op and inference falls back to CPU. Performance will be slower, but functionality remains identical.
* **Output Contains “system … user … assistant”**

  * If you still see system/user/assistant text, check that `_strip_prompts()` is present in `app.py` and is being applied to `full_decoded`.

---

## Contributing

1. **Fork the Repository**
2. **Create a New Branch**

   ```bash
   git checkout -b feature/my-enhancement
   ```
3. **Make Your Changes**

   * Improve prompt-stripping logic, add new model IDs, or enhance the UI.
   * If you add new Python dependencies, remember to update `requirements.txt`.
4. **Test Locally**

   ```bash
   python app.py
   ```
5. **Push & Open a Pull Request**

   * Describe your changes in detail.
   * Ensure the README is updated if new features are added.

---

## License

This project is released under the MIT license, as declared in the `license` field of the front matter above. If you fork it, feel free to add a `LICENSE` file or relicense your changes as you prefer.

---

## Acknowledgments

* **Qwen Team (Alibaba)** for the Qwen2.5-Omni model.
* **Hugging Face** for Transformers, Gradio, and ZeroGPU infrastructure (`spaces` package).
* **OpenCC** for reliable Simplified→Traditional Chinese conversion.
* **qwen-omni-utils** for audio/video preprocessing helpers.

---

Thank you for trying out the Qwen2.5-Omni ASR (ZeroGPU) Gradio App! If you run into any issues or have suggestions, feel free to open an Issue or Pull Request on GitHub.