AL40b-dev-Q8-gguf Model Card
AL40b-dev Q8_0 provides 8-bit GGUF weights for the multilingual 40-billion-parameter model originally released as BSC-LT/AL40b-dev.
Quantization was performed with llama.cpp using the Q8_0 method, reducing disk size and RAM usage while keeping model quality very close to the FP16/BF16 original.
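For reference, a Q8_0 conversion of this kind can typically be reproduced with llama.cpp's conversion script. The snippet below is an illustrative sketch only; the local paths are hypothetical and this is not necessarily the exact invocation used to produce this checkpoint.
# Illustrative sketch: converting the original HF checkpoint to a Q8_0 GGUF
# with llama.cpp's convert_hf_to_gguf.py. Paths are hypothetical examples.
import subprocess

subprocess.run(
    [
        "python", "llama.cpp/convert_hf_to_gguf.py",
        "path/to/BSC-LT-AL40b-dev",          # local copy of the original checkpoint (hypothetical path)
        "--outtype", "q8_0",                  # quantize to Q8_0 during conversion
        "--outfile", "AL40b-dev-Q8_0.gguf",   # name matching the file shipped in this repo
    ],
    check=True,
)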
DISCLAIMER: This model is an experimental version and is provided for research purposes only. Its use is subject to the terms of the research-only license governing the data used in its post-training, which prohibits commercial use. Access is not public and is currently restricted to selected partners. To request access, send an email to carlos.rodriguez1(at)bsc.es explaining your case.
How to use
The instruction-following models use the commonly adopted ChatML template:
Chat template
{%- if messages[0]['role'] == 'system' %}{%- set system_message = messages[0]['content'] %}{%- set loop_messages = messages[1:] %}{%- else %}{%- set system_message = 'SYSTEM MESSAGE' %}{%- set loop_messages = messages %}{%- endif %}{%- if not date_string is defined %}{%- set date_string = '2024-09-30' %}{%- endif %}{{ '<|im_start|>system\n' + system_message + '<|im_end|>\n' }}{% for message in loop_messages %}{%- if (message['role'] != 'user') and (message['role'] != 'assistant')%}{{ raise_exception('Only user and assistant roles are supported after the initial optional system message.') }}{% endif %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('After the optional system message, conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}
Here, system_message is used to guide the model during generation, and date_string can be set to allow the model to respond with the current date.
This exact chat template should be used for the best conversational experience. The easiest way to apply it is with the tokenizer's built-in functions, as shown in the following snippet.
Transformers (>= 4.41, required for GGUF loading)
from datetime import datetime
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL = "langtech-innovation/AL40b-dev-Q8-gguf"
GGUF_FILE = "AL40b-dev-Q8_0.gguf"

# Load the tokenizer and the model from the GGUF file (dequantized on load)
tokenizer = AutoTokenizer.from_pretrained(MODEL, gguf_file=GGUF_FILE, token=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    gguf_file=GGUF_FILE,
    device_map="auto",
    torch_dtype="auto",
    token=True
)

# Build a single-turn conversation and render it with the chat template
text = "At what temperature does water boil?"
message = [{"role": "user", "content": text}]
date_string = datetime.today().strftime('%Y-%m-%d')

prompt = tokenizer.apply_chat_template(
    message,
    tokenize=False,
    add_generation_prompt=True,
    date_string=date_string
)

# Tokenize the rendered prompt and generate a response
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
outputs = model.generate(input_ids=inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
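The decode above prints the full sequence, prompt included. If you only want the assistant's reply, a common pattern is to slice off the prompt tokens before decoding; this is a small sketch, not part of the original snippet:
# Decode only the newly generated tokens, dropping the echoed prompt
reply = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
print(reply)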
Using this template, each turn is preceded by a <|im_start|> delimiter and the role of the entity (either user, for content supplied by the user, or assistant, for LLM responses), and is finished with the <|im_end|> token.
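Since the template enforces strict user/assistant alternation after the optional system message, multi-turn conversations are passed as a list in that order. A minimal sketch continuing the snippet above (the conversation content is invented for illustration):
# Multi-turn conversation: roles must alternate user/assistant/user/...
messages = [
    {"role": "system", "content": "SYSTEM MESSAGE"},
    {"role": "user", "content": "At what temperature does water boil?"},
    {"role": "assistant", "content": "At sea level, water boils at 100 °C (212 °F)."},
    {"role": "user", "content": "And on top of Mount Everest?"},
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    date_string=date_string
)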
llama-cpp-python
from llama_cpp import Llama

# Download the GGUF file from the Hub and load it with llama.cpp
llm = Llama.from_pretrained(
    repo_id="langtech-innovation/AL40b-dev-Q8-gguf",
    filename="AL40b-dev-Q8_0.gguf",
    n_ctx=4096,
    n_threads=8
)

# Build the ChatML prompt manually
system = "SYSTEM MESSAGE"
user = "At what temperature does water boil?"
prompt = (
    f"<|im_start|>system\n{system}<|im_end|>\n"
    f"<|im_start|>user\n{user}<|im_end|>\n"
    f"<|im_start|>assistant\n"
)

print(llm(prompt, max_tokens=128)["choices"][0]["text"])
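Alternatively, llama-cpp-python exposes a chat-completion interface that applies the chat template stored in the GGUF metadata, so the ChatML prompt does not have to be built by hand. A brief sketch, assuming the template metadata is present in this GGUF file:
# Let llama-cpp-python apply the embedded chat template itself
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "SYSTEM MESSAGE"},
        {"role": "user", "content": "At what temperature does water boil?"},
    ],
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])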
Base model: BSC-LT/AL40b-dev