---
license: llama3.1
language:
- en
pipeline_tag: text-generation
datasets:
- allenai/RLVR-MATH
base_model:
- allenai/Llama-3.1-Tulu-3-405B
tags:
- quant
---
This is an FP8 Dynamic quantization of allenai/Llama-3.1-Tulu-3-405B, produced with llmcompressor v0.4.0.
You can refer to llmcompressor's CPU offloading example, but for quantizing on an H100 node we used the following setup to avoid OOM errors:
```python
from transformers import AutoConfig, AutoModelForCausalLM
from accelerate import init_empty_weights, infer_auto_device_map

model_name = "allenai/Llama-3.1-Tulu-3-405B"

# Instantiate the model on the meta device so no weights are materialized yet.
config = AutoConfig.from_pretrained(model_name)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Cap each of the 8 GPUs at 60 GiB and spill the remainder to CPU RAM.
max_memory = {i: "60GiB" for i in range(8)}
max_memory["cpu"] = "1500GiB"

# Place whole decoder layers on a single device; never split one across devices.
device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    no_split_module_classes=["LlamaDecoderLayer"],
)
```
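
For context, here is a minimal sketch of how a device map like this feeds into an FP8 Dynamic oneshot run. It follows llmcompressor's standard `FP8_DYNAMIC` example; the save directory is a placeholder, and this is not a verbatim copy of our exact run:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Load the real weights using the device map computed above.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=device_map,
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# FP8 Dynamic needs no calibration data: weights are quantized to FP8
# offline, while activation scales are computed dynamically at runtime.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

oneshot(model=model, recipe=recipe)

# Placeholder output path for the quantized checkpoint.
SAVE_DIR = "Llama-3.1-Tulu-3-405B-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```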
Original model: https://huggingface.co/allenai/Llama-3.1-Tulu-3-405B
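
FP8 Dynamic checkpoints produced by llmcompressor are typically served with vLLM. A minimal sketch, assuming an 8-GPU node; `MODEL_ID` below is a placeholder for this repository's id:

```python
from vllm import LLM, SamplingParams

# Placeholder for this repository's id.
MODEL_ID = "Llama-3.1-Tulu-3-405B-FP8-Dynamic"

# A 405B model in FP8 still requires tensor parallelism across the node.
llm = LLM(model=MODEL_ID, tensor_parallel_size=8)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["What is 7 * 6?"], params)
print(outputs[0].outputs[0].text)
```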