---
license: llama3.1
language:
- en
pipeline_tag: text-generation
datasets:
- allenai/RLVR-MATH
base_model:
- allenai/Llama-3.1-Tulu-3-405B
tags:
- quant
---
This is an FP8 Dynamic quantization of allenai/Llama-3.1-Tulu-3-405B, produced with llmcompressor v0.4.0.
You can refer to llmcompressor's CPU offloading example, but for quantizing on an H100 node we used the following setup to avoid OOM errors:
```python
from transformers import AutoConfig, AutoModelForCausalLM
from accelerate import init_empty_weights, infer_auto_device_map

model_name = "allenai/Llama-3.1-Tulu-3-405B"

# Instantiate the model on the meta device so no weights are materialized yet.
config = AutoConfig.from_pretrained(model_name)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Cap each of the 8 GPUs at 60 GiB and spill the remainder to CPU RAM.
max_memory = {i: "60GiB" for i in range(8)}
max_memory["cpu"] = "1500GiB"

# Place whole decoder layers on a single device; never split one across devices.
device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    no_split_module_classes=["LlamaDecoderLayer"],
)
```
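
For context, here is a minimal sketch of how a device map like this feeds into an FP8 Dynamic oneshot run. It follows llmcompressor's standard `FP8_DYNAMIC` example; the save directory is a placeholder, and this is not a verbatim copy of our exact run:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Load the real weights using the device map computed above.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=device_map,
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# FP8 Dynamic needs no calibration data: weights are quantized to FP8
# offline, while activation scales are computed dynamically at runtime.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)

oneshot(model=model, recipe=recipe)

# Placeholder output path for the quantized checkpoint.
SAVE_DIR = "Llama-3.1-Tulu-3-405B-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```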
Original model: https://huggingface.co/allenai/Llama-3.1-Tulu-3-405B
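
FP8 Dynamic checkpoints produced by llmcompressor are typically served with vLLM. A minimal sketch, assuming an 8-GPU node; `MODEL_ID` below is a placeholder for this repository's id:

```python
from vllm import LLM, SamplingParams

# Placeholder for this repository's id.
MODEL_ID = "Llama-3.1-Tulu-3-405B-FP8-Dynamic"

# A 405B model in FP8 still requires tensor parallelism across the node.
llm = LLM(model=MODEL_ID, tensor_parallel_size=8)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["What is 7 * 6?"], params)
print(outputs[0].outputs[0].text)
```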