Commit 3264a14 · Parent(s): f364131
Speculative Decoding doesn't work yet with Whisper-v3
In the example above we advertise using tiny in a ForCausalLM class, which can't work since tiny does **not** share the same encoder as large-v3. We can advertise it as soon as distil-v3 is out.
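For context, a minimal sketch of the constraint described above, assuming the standard `transformers` assisted-generation API: an assistant checkpoint that does not reuse the main model's encoder (such as `openai/whisper-tiny` with `whisper-large-v3`) would have to be loaded as a full encoder-decoder via `AutoModelForSpeechSeq2Seq`, whereas loading only a decoder through `AutoModelForCausalLM` is valid only when the two models share an encoder. The snippet below is an illustrative assumption, not the example this commit removes.

```python
# Sketch only: how an assistant that does NOT share the main model's encoder
# could be wired up for speculative decoding. Model ids/API usage are assumptions.
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# whisper-tiny has its own encoder, so it is loaded as a full encoder-decoder model,
# not with AutoModelForCausalLM (that path requires a shared encoder).
assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-tiny", torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
).to(device)

model_id = "openai/whisper-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    generate_kwargs={"assistant_model": assistant_model},  # enables assisted generation
    torch_dtype=torch_dtype,
    device=device,
)

dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
result = pipe(dataset[0]["audio"])
print(result["text"])
```

Once a distil-v3 checkpoint that shares the large-v3 encoder is available, the assistant could again be reduced to its decoder and loaded with `AutoModelForCausalLM`, as the removed example did.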
README.md CHANGED
@@ -258,57 +258,6 @@ result = pipe(sample, return_timestamps=True, generate_kwargs={"language": "fren
 print(result["chunks"])
 ```

-## Speculative Decoding
-
-Whisper `tiny` can be used as an assistant model to Whisper for speculative decoding. Speculative decoding mathematically
-ensures the exact same outputs as Whisper are obtained while being 2 times faster. This makes it the perfect drop-in
-replacement for existing Whisper pipelines, since the same outputs are guaranteed.
-
-In the following code-snippet, we load the assistant Distil-Whisper model standalone to the main Whisper pipeline. We then
-specify it as the "assistant model" for generation:
-
-```python
-from transformers import pipeline, AutoModelForCausalLM, AutoModelForSpeechSeq2Seq, AutoProcessor
-import torch
-from datasets import load_dataset
-
-device = "cuda:0" if torch.cuda.is_available() else "cpu"
-torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
-
-assistant_model_id = "openai/whisper-tiny"
-
-assistant_model = AutoModelForCausalLM.from_pretrained(
-    assistant_model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
-)
-assistant_model.to(device)
-
-model_id = "openai/whisper-large-v3"
-
-model = AutoModelForSpeechSeq2Seq.from_pretrained(
-    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
-)
-model.to(device)
-
-processor = AutoProcessor.from_pretrained(model_id)
-
-pipe = pipeline(
-    "automatic-speech-recognition",
-    model=model,
-    tokenizer=processor.tokenizer,
-    feature_extractor=processor.feature_extractor,
-    max_new_tokens=128,
-    generate_kwargs={"assistant_model": assistant_model},
-    torch_dtype=torch_dtype,
-    device=device,
-)
-
-dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
-sample = dataset[0]["audio"]
-
-result = pipe(sample)
-print(result["text"])
-```
-
 ## Additional Speed & Memory Improvements

 You can apply additional speed and memory improvements to Whisper-large-v3 which we cover in the following.