---
license: apache-2.0
tags:
- ctranslate2
---

# Fast Inference with CTranslate2

Speed up inference by 2x-8x using int8 inference in C++.

Quantized version of [google/flan-ul2](https://huggingface.co/google/flan-ul2).

```bash
pip install "hf_hub_ctranslate2>=2.0.6" "ctranslate2>=3.13.0"
```
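
For context, `ct2fast` checkpoints like this one are produced by converting the original Transformers weights into the CTranslate2 format with weight quantization. A minimal sketch of such a conversion, assuming CTranslate2's `TransformersConverter` Python API (the output directory name is illustrative, not from this card):

```python
from ctranslate2.converters import TransformersConverter

# Convert google/flan-ul2 to the CTranslate2 format with int8/float16 quantization.
converter = TransformersConverter("google/flan-ul2")
converter.convert("ct2fast-flan-ul2", quantization="int8_float16")
```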

Checkpoint compatible with [ctranslate2](https://github.com/OpenNMT/CTranslate2) and [hf-hub-ctranslate2](https://github.com/michaelfeil/hf-hub-ctranslate2); pick the `compute_type` that matches your device (see the selection sketch below):

- `compute_type=int8_float16` for `device="cuda"`
- `compute_type=int8` for `device="cpu"`
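If you want that choice made at runtime, here is a small sketch (not part of the original card) that falls back to CPU when no GPU is visible, using CTranslate2's `get_cuda_device_count`:

```python
import ctranslate2

# Prefer int8_float16 on GPU; fall back to plain int8 on CPU.
if ctranslate2.get_cuda_device_count() > 0:
    device, compute_type = "cuda", "int8_float16"
else:
    device, compute_type = "cpu", "int8"
```

The higher-level `hf_hub_ctranslate2` wrapper below downloads the checkpoint from the Hub and handles tokenization for you: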

```python
from hf_hub_ctranslate2 import TranslatorCT2fromHfHub

model_name = "michaelfeil/ct2fast-flan-ul2"

# Load the converted checkpoint in int8/float16 on CUDA.
model = TranslatorCT2fromHfHub(
    model_name_or_path=model_name,
    device="cuda",
    compute_type="int8_float16",
)

outputs = model.generate(
    text=["How do you call a fast Flan-ingo?", "Translate to german: How are you doing?"],
    min_decoding_length=24,
    max_decoding_length=32,
    max_input_length=512,
    beam_size=5,
)
print(outputs)
```
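
The same checkpoint can also be driven through the `ctranslate2` API directly. A hedged sketch, assuming the checkpoint follows the standard CTranslate2 encoder-decoder layout and reusing the original google/flan-ul2 tokenizer (both assumptions, not stated in this card):

```python
import ctranslate2
import transformers
from huggingface_hub import snapshot_download

# Download the converted checkpoint and load it as a CTranslate2 Translator.
model_path = snapshot_download("michaelfeil/ct2fast-flan-ul2")
translator = ctranslate2.Translator(model_path, device="cuda", compute_type="int8_float16")

# Tokenize with the original flan-ul2 tokenizer (CTranslate2 expects token strings, not ids).
tokenizer = transformers.AutoTokenizer.from_pretrained("google/flan-ul2")
tokens = tokenizer.convert_ids_to_tokens(
    tokenizer.encode("Translate to german: How are you doing?")
)

results = translator.translate_batch([tokens], beam_size=5, max_decoding_length=32)
output_tokens = results[0].hypotheses[0]
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(output_tokens)))
```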

# Licence and other remarks:

This is just a quantized version. Licence conditions are intended to be identical to those of the original Hugging Face repo.