Fast Inference with CTranslate2

Speed up inference by 2x-8x using int8 inference in C++

Quantized version of google/flan-ul2

pip install "hf_hub_ctranslate2>=2.0.6" "ctranslate2>=3.13.0"

Checkpoint compatible with ctranslate2 and hf-hub-ctranslate2

  • compute_type=int8_float16 for device="cuda"
  • compute_type=int8 for device="cpu"

from hf_hub_ctranslate2 import TranslatorCT2fromHfHub

model_name = "michaelfeil/ct2fast-flan-ul2"
# Load the quantized checkpoint on CUDA with int8 weights and float16 activations.
# On CPU, use device="cpu" and compute_type="int8" instead.
model = TranslatorCT2fromHfHub(
    model_name_or_path=model_name,
    device="cuda",
    compute_type="int8_float16",
)
# Generate one output per input prompt using beam search.
outputs = model.generate(
    text=["How do you call a fast Flan-ingo?", "Translate to german: How are you doing?"],
    min_decoding_length=24,
    max_decoding_length=32,
    max_input_length=512,
    beam_size=5,
)
print(outputs)
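
The device-to-compute_type pairing listed above can be captured in a small helper, so the same script runs on GPU and CPU machines. This is only a sketch: `RECOMMENDED_COMPUTE_TYPES`, `pick_compute_type`, and `loader_kwargs` are hypothetical names introduced here for illustration and are not part of hf_hub_ctranslate2 or ctranslate2. The only values taken from this README are the two device/compute_type pairs and the constructor's keyword names.

```python
# Hypothetical helper, not part of hf_hub_ctranslate2: it only encodes the
# device -> compute_type pairs recommended in this README.
RECOMMENDED_COMPUTE_TYPES = {
    "cuda": "int8_float16",
    "cpu": "int8",
}


def pick_compute_type(device: str) -> str:
    """Return the compute_type this README recommends for `device`."""
    key = device.strip().lower()
    if key not in RECOMMENDED_COMPUTE_TYPES:
        raise ValueError(f"no recommendation for device {device!r}")
    return RECOMMENDED_COMPUTE_TYPES[key]


def loader_kwargs(model_name: str, device: str) -> dict:
    """Build keyword arguments matching the constructor call shown above."""
    return {
        "model_name_or_path": model_name,
        "device": device,
        "compute_type": pick_compute_type(device),
    }


# Build the arguments for a CPU-only machine.
print(loader_kwargs("michaelfeil/ct2fast-flan-ul2", "cpu"))
```

The resulting dictionary can be unpacked into the loader, for example `TranslatorCT2fromHfHub(**loader_kwargs(model_name, "cpu"))`, which keeps the device choice in one place.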

Licence and other remarks:

This is just a quantized version. Licence conditions are intended to be identical to those of the original Hugging Face repo.
