---
language:
- zh
license: cc-by-nc-sa-4.0
library_name: transformers
tags:
- audio
- automatic-speech-recognition
widget:
- example_title: Model Introduction
  src: https://huggingface.co/andybi7676/cool-whisper-hf/resolve/main/sample1.weba
pipeline_tag: automatic-speech-recognition
---

# Cool-Whisper

### Leave No Knowledge Behind During Knowledge Distillation: Towards Practical and Effective Knowledge Distillation for Code-Switching ASR Using Realistic Data

<span style="font-size: 0.95em;">Liang-Hsuan Tseng, Zih-Ching Chen, Wei-Shun Chang, Cheng-Kuang Lee, Tsung-Ren Huang, Hung-yi Lee</span>

[Paper](https://arxiv.org/abs/2407.10603) [Colab Demo](https://colab.research.google.com/drive/1ZikUWKch78Jv3Yw7LtUKUn4wMrFCx6lD?usp=sharing)

> ⚠️ Due to privacy and security concerns, this model will be temporarily taken offline. We are sorry for the inconvenience.

## Introduction

* Cool-Whisper is a distilled version of Whisper, mainly focused on **Mandarin-English** code-switching ASR for people in Taiwan.
* We use 60,000 hours of **unlabeled** audio to train the model.
* In practice, we utilize *knowledge* not only from the large model (Whisper-large-v2) but also from the small model (Whisper-base).

## Basic Usage

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

# Use a GPU with half precision when available; otherwise fall back to CPU/fp32.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "andybi7676/cool-whisper-hf"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, use_safetensors=True
)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=256,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

# Load a sample from our evaluation set...
dataset = load_dataset("andybi7676/ntuml2021_long", "default", split="test")
sample = dataset[0]["audio"]
# ...or point to your own audio file instead:
# sample = "/your/path/to/audio.wav"

result = pipe(sample)
print("Basic Result: ")
print(result["text"])

# Result with timestamps, one chunk per decoded segment
print("\nResult with timestamps: ")
for chunk in result["chunks"]:
    print(chunk)
```
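
For long recordings, the same pipeline can also run chunked, batched inference, which is usually faster than decoding one window at a time. The sketch below reuses the `model`, `processor`, `torch_dtype`, `device`, and `sample` objects from above; the `chunk_length_s=30` and `batch_size=8` values are illustrative assumptions, not settings tuned or published for this model.

```python
# Chunked, batched long-form inference (illustrative settings, not tuned):
batched_pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=256,
    chunk_length_s=30,  # split the audio into 30-second windows
    batch_size=8,       # decode several windows in parallel
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

result = batched_pipe(sample)
print(result["text"])
```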

## Faster-Whisper Support

[Faster-Whisper](https://github.com/SYSTRAN/faster-whisper) is a widely used reimplementation of Whisper built on [CTranslate2](https://github.com/OpenNMT/CTranslate2/) that significantly speeds up transcription.
We also release our model in CTranslate2 format so that it can be used with faster-whisper.
Please visit [cool-whisper](https://huggingface.co/andybi7676/cool-whisper) for more details.
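
A minimal sketch of how that CTranslate2 checkpoint could be loaded through faster-whisper (assuming `pip install faster-whisper`, a CUDA-capable GPU, and that the `andybi7676/cool-whisper` repository linked above is reachable; the audio path is a placeholder):

```python
from faster_whisper import WhisperModel

# Load the CTranslate2 checkpoint (repo id taken from the link above).
model = WhisperModel("andybi7676/cool-whisper", device="cuda", compute_type="float16")

# Transcribe a local file; segments is a generator of timestamped results.
segments, info = model.transcribe("/your/path/to/audio.wav", beam_size=5)

print(f"Detected language: {info.language} (probability {info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```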
|