---
license: llama3.1
language:
- en
tags:
- llama
- nvidia
- nemotron
- w8a8
- vllm
base_model:
- nvidia/Llama-3.1-Nemotron-70B-Instruct-HF
library_name: transformers
datasets:
- neuralmagic/LLM_compression_calibration
---

# Llama-3.1-Nemotron-70B-Instruct-W8A8-dynamic

## Model Overview
- **Model Architecture:** Llama-3.1-Nemotron-70B-Instruct-HF
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** INT8
  - **Activation quantization:** INT8
- **Release Date:** 2/12/2025
- **Version:** 1.0
- **Model Developers:** Elias Oenal

Quantized version of [Llama-3.1-Nemotron-70B-Instruct-HF](https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF).

### Model Optimizations

This model was obtained by quantizing the weights and activations of Llama-3.1-Nemotron-70B-Instruct-HF to the INT8 (W8A8) data type, ready for inference with vLLM. This optimization reduces the number of bits per parameter from 16 to 8, cutting disk size and GPU memory requirements by approximately 50%. Only the weights and activations of the linear operators within the transformer blocks are quantized.

## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend.

## Creation

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor), using the [neuralmagic/LLM_compression_calibration](https://huggingface.co/datasets/neuralmagic/LLM_compression_calibration) dataset for calibration.