---
license: llama3.1
language:
- en
pipeline_tag: text-generation
datasets:
- allenai/RLVR-MATH
base_model:
- allenai/Llama-3.1-Tulu-3-405B
tags:
- quant
---
|
|
|
This is an [llmcompressor](https://github.com/vllm-project/llm-compressor) v0.4.0 [FP8 Dynamic](https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_w8a8_fp8) quant.
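
For reference, the FP8 Dynamic scheme in the linked example boils down to a data-free recipe along these lines (a minimal sketch of the upstream example, not necessarily the exact script used for this checkpoint):

```python
from llmcompressor.modifiers.quantization import QuantizationModifier

# FP8 weights with dynamic per-token FP8 activations on all Linear layers;
# the output head is left in higher precision.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head"],
)
```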
|
|
|
You can refer to the [CPU offloading example](https://github.com/vllm-project/llm-compressor/tree/main/examples/big_models_with_accelerate), but for quantizing on an H100 node we used the following setup to avoid OOM errors:
|
|
|
```python
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

model_name = "allenai/Llama-3.1-Tulu-3-405B"

# Build a meta (weight-free) model so the device map can be computed
# without materializing the 405B parameters.
config = AutoConfig.from_pretrained(model_name)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# Leave headroom on each of the node's 8 GPUs and offload the rest to CPU RAM.
max_memory = {
    0: "60GiB",
    1: "60GiB",
    2: "60GiB",
    3: "60GiB",
    4: "60GiB",
    5: "60GiB",
    6: "60GiB",
    7: "60GiB",
    "cpu": "1500GiB",
}

# Keep each decoder layer whole on a single device.
device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    no_split_module_classes=["LlamaDecoderLayer"],
)
```
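
From there, loading the checkpoint onto that device map and applying the recipe follows the upstream examples. A hedged sketch (the output directory name is illustrative, and the `oneshot` import path may differ between llmcompressor versions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.transformers import oneshot

# Load the real weights, spread across GPUs and CPU per the device map above.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map=device_map,
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# FP8 dynamic quantization is data-free, so no calibration set is needed.
oneshot(model=model, recipe=recipe)

model.save_pretrained("Llama-3.1-Tulu-3-405B-FP8-Dynamic")
tokenizer.save_pretrained("Llama-3.1-Tulu-3-405B-FP8-Dynamic")
```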
|
|
|
Original model: [allenai/Llama-3.1-Tulu-3-405B](https://huggingface.co/allenai/Llama-3.1-Tulu-3-405B)