Assistant Llama 2 7B Chat AWQ

This model is a 4-bit quantized export of wasertech/assistant-llama2-7b-chat using AWQ.

AWQ is an efficient, accurate, and fast low-bit weight quantization method, currently supporting 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference.
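To illustrate what group-wise 4-bit weight quantization means in practice, here is a minimal NumPy sketch. Note this is a generic round-to-nearest group quantizer for intuition only; AWQ itself additionally searches for activation-aware per-channel scales before quantizing, which this sketch does not implement.

```python
import numpy as np

def quantize_4bit(w, group_size=128):
    """Generic group-wise 4-bit quantization of a flat weight vector (illustrative)."""
    groups = w.reshape(-1, group_size)
    # One scale per group maps values into the signed 4-bit range [-8, 7].
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale):
    """Recover approximate float weights from 4-bit codes and group scales."""
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=512).astype(np.float32)
q, scale = quantize_4bit(w)
w_hat = dequantize_4bit(q, scale)
# Per-element reconstruction error is bounded by half a quantization step.
print(np.abs(w - w_hat).max())
```

Storing 4-bit codes plus one scale per group of 128 weights is what shrinks a 7B model's weight memory to roughly a quarter of FP16.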

AWQ is also supported by vLLM, a continuous-batching inference server, allowing Llama AWQ models to be used for high-throughput concurrent inference in multi-user server scenarios.
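As a sketch, this checkpoint could be served with vLLM's OpenAI-compatible server roughly as follows (a CUDA GPU is required; flag names follow vLLM's documentation at the time of writing and may change between releases):

```shell
# Serve the AWQ checkpoint with vLLM's OpenAI-compatible API server.
python -m vllm.entrypoints.openai.api_server \
    --model wasertech/assistant-llama2-7b-chat-awq \
    --quantization awq
```

The `--quantization awq` flag tells vLLM to load the 4-bit AWQ weights and use its AWQ dequantization kernels instead of expecting FP16 weights.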

As of September 25th, 2023, preliminary Llama-only AWQ support has also been added to Hugging Face Text Generation Inference (TGI).
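A hedged sketch of serving this model with TGI's Docker image, assuming a recent image with the preliminary AWQ support mentioned above (volume mounts and shared-memory settings from the TGI README are omitted for brevity):

```shell
# Launch TGI with AWQ quantization enabled for this checkpoint.
docker run --gpus all -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id wasertech/assistant-llama2-7b-chat-awq \
    --quantize awq
```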

