---
license: llama3.1
language:
- en
pipeline_tag: text-generation
datasets:
- allenai/RLVR-MATH
base_model:
- allenai/Llama-3.1-Tulu-3-405B
tags:
- quant
---
|
|
|
This is an [llmcompressor](https://github.com/vllm-project/llm-compressor) v0.4.0 [FP8 Dynamic](https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_w8a8_fp8) quant.
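
For reference, the FP8 Dynamic scheme in the linked example boils down to a data-free recipe along these lines (a minimal sketch of the upstream example, not necessarily the exact script used for this checkpoint):

```python
from llmcompressor.modifiers.quantization import QuantizationModifier

# FP8 weights with dynamic per-token FP8 activations on all Linear layers;
# the output head is left in higher precision.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)
```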
|
|
|
You can refer to the [CPU offloading example](https://github.com/vllm-project/llm-compressor/tree/main/examples/big_models_with_accelerate), but for quantizing on an H100 node we used the following setup to avoid OOM errors:
|
|
|
```python
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "allenai/Llama-3.1-Tulu-3-405B"

# Build a meta (weight-free) model so the device map can be computed
# without materializing the 405B parameters.
config = AutoConfig.from_pretrained(model_name)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Leave headroom on each of the node's 8 GPUs and offload the rest to CPU RAM.
max_memory = {
    0: "60GiB",
    1: "60GiB",
    2: "60GiB",
    3: "60GiB",
    4: "60GiB",
    5: "60GiB",
    6: "60GiB",
    7: "60GiB",
    "cpu": "1500GiB",
}

# Keep each decoder layer whole on a single device.
device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    no_split_module_classes=["LlamaDecoderLayer"],
)
```
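
From there, loading the checkpoint onto that device map and applying the recipe follows the upstream examples. A hedged sketch (the output directory name is illustrative, and the `oneshot` import path may differ between llmcompressor versions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot

# Load the real weights, spread across GPUs and CPU per the device map above.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=device_map,
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# FP8 dynamic quantization is data-free, so no calibration set is needed.
oneshot(model=model, recipe=recipe)

model.save_pretrained("Llama-3.1-Tulu-3-405B-FP8-Dynamic")
tokenizer.save_pretrained("Llama-3.1-Tulu-3-405B-FP8-Dynamic")
```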
|
|
|
Original model: [allenai/Llama-3.1-Tulu-3-405B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-405B)