---
license: llama3.1
language:
- en
pipeline_tag: text-generation
datasets:
- allenai/RLVR-MATH
base_model:
- allenai/Llama-3.1-Tulu-3-405B
tags:
- quant
---
This is an [llmcompressor](https://github.com/vllm-project/llm-compressor) v0.4.0 [FP8 Dynamic](https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_w8a8_fp8) quant.
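The quantization step itself presumably mirrors the linked example; a minimal sketch, assuming `model` and `tokenizer` are already loaded (see the device-map setup below) and using a hypothetical `SAVE_DIR`:
```
# Sketch based on the linked FP8 Dynamic example; not a verbatim copy of our script.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

SAVE_DIR = "Llama-3.1-Tulu-3-405B-FP8-Dynamic"  # hypothetical output path

# FP8_DYNAMIC quantizes Linear weights to FP8 with dynamic per-token
# activation scales, so no calibration dataset is required.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(model=model, recipe=recipe)

# Save in compressed-tensors format so vLLM can load it directly.
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```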
You can refer to the [CPU offloading example](https://github.com/vllm-project/llm-compressor/tree/main/examples/big_models_with_accelerate), but for quantizing on an H100 node we used the following setup to avoid OOM errors:
```
from transformers import AutoConfig, AutoModelForCausalLM
from accelerate import init_empty_weights, infer_auto_device_map

model_name = "allenai/Llama-3.1-Tulu-3-405B"

# Build the model on the meta device so no weights are materialized yet.
config = AutoConfig.from_pretrained(model_name)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Cap each of the 8 H100s at 60GiB and spill the remainder to CPU RAM.
max_memory = {
    0: "60GiB",
    1: "60GiB",
    2: "60GiB",
    3: "60GiB",
    4: "60GiB",
    5: "60GiB",
    6: "60GiB",
    7: "60GiB",
    "cpu": "1500GiB",
}

# Keep each decoder layer on a single device.
device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    no_split_module_classes=["LlamaDecoderLayer"],
)
```
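The resulting `device_map` is then passed to `from_pretrained` to materialize the weights across the GPUs and CPU; a sketch (the `torch_dtype` choice is an assumption):
```
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the real weights according to the inferred placement.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=device_map,
    torch_dtype="auto",  # assumption: keep the checkpoint's native dtype
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```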
Original model: https://huggingface.co/allenai/Llama-3.1-Tulu-3-405B