---
license: llama3
---

# Meta-Llama-3-8B-Instruct-ct2-int8

This is a [ctranslate2](https://github.com/OpenNMT/CTranslate2) v4.5.0 int8 conversion of [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/tree/main) created with:

```
ct2-transformers-converter --model meta-llama/Meta-Llama-3-8B-Instruct --output_dir Meta-Llama-3-8B-Instruct-ct2-int8 --quantization int8
```

## Downloading

CTranslate2 does not have Hugging Face Hub integration, so you'll need to download the model files manually:

```
huggingface-cli download mike-ravkine/Meta-Llama-3-8B-Instruct-ct2-int8 --local-dir Meta-Llama-3-8B-Instruct-ct2-int8/
```

## Using

Install dependencies:

```
pip install transformers[torch] ctranslate2
```

Sample inference code:

```python
import sys

import ctranslate2
from transformers import AutoTokenizer

model_dir = sys.argv[1]  # directory the model was downloaded to
tokenizer_dir = "meta-llama/Meta-Llama-3-8B-Instruct"

print("Loading the model...")
generator = ctranslate2.Generator(model_dir, device="cuda")
tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)

dialog = [{"role": "user", "content": "What is the meaning of life, the universe and everything?"}]
max_generation_length = 512

prompt_string = tokenizer.apply_chat_template(dialog, add_generation_prompt=True, tokenize=False)

# It seems silly to use tokenize=False and then call tokenize(), but tokenize=True
# returns only ids, while generate_tokens needs the actual token strings.
prompt_tokens = tokenizer.tokenize(prompt_string)

step_results = generator.generate_tokens(
    prompt_tokens,
    max_length=max_generation_length,
    sampling_temperature=0.6,
    sampling_topk=20,
    sampling_topp=1,
)

# Stream the generated tokens as they are produced.
for step_result in step_results:
    word = tokenizer.decode([step_result.token_id])
    print(word, end="", flush=True)
```
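
The loop above streams tokens until `max_length` is reached. Llama 3 chat turns normally end with the `<|eot_id|>` token, so as a minimal sketch (not part of the original sample, and assuming the Hugging Face tokenizer exposes that special token), you can stop streaming as soon as it appears:

```python
# Hypothetical variant of the streaming loop above: stop at Llama 3's
# end-of-turn token instead of always generating up to max_length tokens.
eot_id = tokenizer.convert_tokens_to_ids("<|eot_id|>")

for step_result in step_results:
    if step_result.token_id == eot_id:
        break
    print(tokenizer.decode([step_result.token_id]), end="", flush=True)
print()
```

Run the script with the download directory as its only argument, e.g. `python generate.py Meta-Llama-3-8B-Instruct-ct2-int8/` (assuming you saved the sample as `generate.py`).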