mit-han-lab
/

VILA1.5-8B-QServe-W8A8

@@ -1,38 +1,168 @@
 ---
-license: cc-by-nc-4.0
 library_name: transformers
 pipeline_tag: text-generation
-tags:
-- VILA
-- VLM
 ---
-# VILA Model Card
-## Model details
-**Model type:**
-VILA is a visual language model (VLM) pretrained with interleaved image-text data at scale, enabling multi-image VLM. VILA is deployable on the edge, including Jetson Orin and laptop by AWQ 4bit quantization through TinyChat framework. We find: (1) image-text pairs are not enough, interleaved image-text is essential; (2) unfreezing LLM during interleaved image-text pre-training enables in-context learning; (3)re-blending text-only instruction data is crucial to boost both VLM and text-only performance. VILA unveils appealing capabilities, including: multi-image reasoning, in-context learning, visual chain-of-thought, and better world knowledge.
-**Model date:**
-VILA1.5-13b was trained in May 2024.
-**Paper or resources for more information:**
-https://github.com/NVLabs/VILA
 ```
-@misc{lin2023vila,
-      title={VILA: On Pre-training for Visual Language Models},
-      author={Ji Lin and Hongxu Yin and Wei Ping and Yao Lu and Pavlo Molchanov and Andrew Tao and Huizi Mao and Jan Kautz and Mohammad Shoeybi and Song Han},
-      year={2023},
-      eprint={2312.07533},
-      archivePrefix={arXiv},
-      primaryClass={cs.CV}
-}
 ```
-https://github.com/mit-han-lab/qserve
 ```
 @article{lin2024qserve,
   title={QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving},
   author={Lin*, Yujun and Tang*, Haotian and Yang*, Shang and Zhang, Zhekai and Xiao, Guangxuan and Gan, Chuang and Han, Song},
@@ -41,84 +171,4 @@ https://github.com/mit-han-lab/qserve
 }
 ```
-## License
-- The code is released under the Apache 2.0 license as found in the [LICENSE](./LICENSE) file.
-- The pretrained weights are released under the [CC-BY-NC-SA-4.0 license](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en).
-- The service is a research preview intended for non-commercial use only, and is subject to the following licenses and terms:
-    - [Model License](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md) of LLaMA
-    - [Terms of Use](https://openai.com/policies/terms-of-use) of the data generated by OpenAI
-    - [Dataset Licenses](https://github.com/Efficient-Large-Model/VILA/blob/main/data_prepare/LICENSE) for each one used during training.
-**Where to send questions or comments about the model:**
-https://github.com/NVLabs/VILA/issues
-## Intended use
-**Primary intended uses:**
-The primary use of VILA is research on large multimodal models and chatbots.
-**Primary intended users:**
-The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.
-## Model Architecture:
-**Architecture Type:** Transformer
-**Network Architecture:** siglip, vicuna1.5
-## Input:
-**Input Type:** Image, Video, Text
-**Input Format:** Red, Green, Blue; MP4 ;String
-**Input Parameters:** 2D, 3D
-## Output:
-**Output Type:** Text
-**Output Format:** String
-**Supported Hardware Microarchitecture Compatibility:**
-* Ampere
-* Jetson
-* Hopper
-* Lovelace
-**[Preferred/Supported] Operating System(s):** <br>
-Linux
-## Model Version(s):
-* VILA1.5-3B
-* VILA1.5-3B-s2
-* Llama-3-VILA1.5-8B
-* VILA1.5-13B
-* VILA1.5-40B
-* VILA1.5-3B-AWQ
-* VILA1.5-3B-s2-AWQ
-* Llama-3-VILA1.5-8B-AWQ
-* VILA1.5-13B-AWQ
-* VILA1.5-40B-AWQ
-## Training dataset
-See [Dataset Preparation](https://github.com/NVLabs/VILA/blob/main/data_prepare/README.md) for more details.
-** Data Collection Method by dataset
-* [Hybrid: Automated, Human]
-** Labeling Method by dataset
-* [Hybrid: Automated, Human]
-**Properties (Quantity, Dataset Descriptions, Sensor(s)):**
-53 million image-text pairs or interleaved image text content.
-## Evaluation dataset
-A collection of 12 benchmarks, including 5 academic VQA benchmarks and 7 recent benchmarks specifically proposed for instruction-following LMMs.
-## Inference:
-**Engine:** [Tensor(RT), Triton, Or List Other Here]
-* PyTorch
-* TensorRT-LLM
-* TinyChat
-**Test Hardware:**
-* A100
-* Jetson Orin
-* RTX 4090
-## Ethical Considerations
-NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications.  When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

 ---
 library_name: transformers
+license: cc-by-nc-4.0
 pipeline_tag: text-generation
 ---
+# QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
+**[Paper](https://arxiv.org/abs/2405.04532) | [Website](https://hanlab.mit.edu/projects/qserve) | [DeepCompressor Library](https://github.com/mit-han-lab/deepcompressor/tree/lmquant-v0.0.0-deprecated)**
+**QServe: Efficient and accurate LLM serving system** on GPUs with W4A8KV4 quantization (4-bit weights, 8-bit activations, and 4-bit KV cache). Compared with the leading industry solution, TensorRT-LLM, QServe achieves **1.2x-1.4x higher throughput** when serving Llama-3-8B, and **2.4x-3.5x higher throughput** when serving Qwen1.5-72B, on L40S and A100 GPUs. QServe also allows users to achieve A100-level throughput on **3x cheaper** L40S GPUs.
+QServe is suitable for **large-scale synthetic data generation** with both LLMs and VLMs. Check out our latest [QServe-VLM](#qserve-vlm) release!
+![teaser](assets/qserve_figures/teaser.png)
+![efficiency](assets/qserve_figures/efficiency.png)
+## Introduction
+Quantization can accelerate large language model (LLM) inference. Going beyond INT8 quantization, the research community is actively exploring even lower precision, such as INT4. Nonetheless, state-of-the-art INT4 quantization techniques only accelerate low-batch, edge LLM inference, failing to deliver performance gains in large-batch, cloud-based LLM serving. We uncover a critical issue: existing INT4 quantization methods suffer from significant runtime overhead (20-90%) when **dequantizing either weights or partial sums** on GPUs. To address this challenge, we introduce **QoQ**, a W4A8KV4 quantization algorithm with 4-bit weight, 8-bit activation, and 4-bit KV cache. QoQ stands for **quattuor-octo-quattuor**, which represents 4-8-4 in Latin. QoQ is implemented by the **QServe** inference library, which achieves measured speedup. The key insight driving QServe is that the efficiency of LLM serving on GPUs is critically influenced by **operations on low-throughput CUDA cores**. Building upon this insight, in the QoQ algorithm, we introduce progressive quantization that allows low dequantization overhead in W4A8 GEMM. Additionally, we develop SmoothAttention to effectively mitigate the accuracy degradation incurred by 4-bit KV quantization. In the QServe system, we perform compute-aware weight reordering and take advantage of register-level parallelism to reduce dequantization latency. We also make fused attention memory-bound, harnessing the performance gain brought by KV4 quantization. As a result, QServe improves the maximum achievable serving throughput of Llama-3-8B by **1.2×** on A100, **1.4×** on L40S; and Qwen1.5-72B by **2.4×** on A100, **3.5×** on L40S, compared to TensorRT-LLM. Remarkably, QServe on L40S GPU can achieve even higher throughput than TensorRT-LLM on A100. Thus, QServe effectively reduces the dollar cost of LLM serving by **3×**.
+**The current release supports:**
+- Blazingly fast system support for **QoQ W4A8KV4** quantization (Algorithm release: [DeepCompressor Library](https://github.com/mit-han-lab/deepcompressor/tree/lmquant-v0.0.0-deprecated));
+- Pre-quantized QServe model zoo with **W4A8KV4 QoQ** for mainstream LLMs;
+- **Fully PyTorch-based** runtime and user interface for LLM serving, with **TensorRT-LLM-level efficiency** and **PyTorch-level flexibility**;
+- Full support for **in-flight batching** and **paged attention**;
+- Efficient **fused** CUDA kernels for **W4A8**/W8A8 GEMM and **KV4**/KV8 attention;
+- Easy-to-use examples on speed benchmarking and **large-scale end-to-end content generation** (with W4A8KV4, in-flight batching and paged attention).
+## Usage and Examples
+We support both offline benchmarking and online generation (in-flight-batching) in QServe.
+1. Offline speed benchmarking (Batched input sequences, fixed context length = 1024 and generation length = 512). We take Llama-3-8B (per-channel quant) as an example here. Please make sure that you have already downloaded the QoQ-quantized QServe model.
+```bash
+export MODEL_PATH=./qserve_checkpoints/Llama-3-8B-QServe # Please set the path accordingly
+GLOBAL_BATCH_SIZE=128 \
+python qserve_benchmark.py \
+  --model $MODEL_PATH \
+  --benchmarking \
+  --precision w4a8kv4 \
+  --group-size -1
 ```
+To use larger batch sizes such as 256 on A100, you may need to set `NUM_GPU_PAGE_BLOCKS` to a value larger than the automatically determined one. For example:
+```bash
+export MODEL_PATH=./qserve_checkpoints/Llama-3-8B-QServe # Please set the path accordingly
+GLOBAL_BATCH_SIZE=256 \
+NUM_GPU_PAGE_BLOCKS=6400 \
+python qserve_benchmark.py \
+  --model $MODEL_PATH \
+  --benchmarking \
+  --precision w4a8kv4 \
+  --group-size -1
 ```
+2. Online demonstration of batched generation, showcasing in-flight batching and paged attention with W4A8KV4 QoQ LLMs. We randomly sample a set of safety-moderated conversations from the [WildChat](https://huggingface.co/datasets/allenai/WildChat) dataset and process them efficiently through in-flight batching.
+```bash
+export MODEL_PATH=./qserve_checkpoints/Llama-3-8B-Instruct-QServe # Please set the path accordingly
+python qserve_e2e_generation.py \
+  --model $MODEL_PATH \
+  --ifb-mode \
+  --precision w4a8kv4 \
+  --quant-path $MODEL_PATH \
+  --group-size -1
 ```
+3. Argument list in QServe
+   Below are some frequently used arguments in the QServe interface:
+- `--model`: Path to the folder containing the Hugging Face model configs. Can be the same as `--quant-path` if you download the models directly from the QServe model zoo.
+- `--quant-path`: Path to the folder containing quantized LLM checkpoints.
+- `--precision`: The precision for GEMM in QServe. Choose from the following values: `w4a8kv4`, `w4a8kv8`, `w4a8` (means `w4a8kv8`), `w8a8kv4`, `w8a8kv8`, `w8a8` (means `w8a8kv8`). Default: `w4a8kv4`.
+- `--group-size`: Group size for weight quantization, -1 means per-channel quantization. QServe only supports -1 or 128. Please make sure your group size matches the checkpoint.
+- `--max-num-batched-tokens`: Maximum number of batched tokens per iteration. Default: 262144.
+- `--max-num-seqs`: Maximum number of sequences per iteration. Default: 256. Remember to increase it if you want larger batch sizes.
+- `--ifb-mode`: Enable in-flight batching mode. Recommended for end-to-end generation.
+- `--benchmarking`: Enable speed profiling mode. Benchmark settings aligned with TensorRT-LLM.
+   Environment variables in QServe:
+- `GLOBAL_BATCH_SIZE`: Batch size used in offline speed benchmarking.
+- `NUM_GPU_PAGE_BLOCKS`: Number of pages to be allocated on GPU. If not specified, it will be automatically determined based on available GPU memory. Note that the current automatic GPU page allocation algorithm is very conservative. It is recommended to manually set this value to a larger number if you observe that GPU memory utilization is relatively low.
+4. One-line scripts:
+We also provide sample scripts in QServe.
+- End to end generation: `./scripts/qserve_e2e.sh`;
+- Speed benchmarking: `./scripts/qserve_benchmark/benchmark_a100.sh` or `./scripts/qserve_benchmark/benchmark_l40s.sh`.
+These scripts are expected to be executed in the QServe project folder (not in the `scripts` folder). Please note that `git-lfs` is needed for downloading QServe benchmark config files from huggingface before running the benchmark scripts.
+## Results
+We evaluate QServe W4A8KV4 quantization on a wide range of mainstream LLMs. QServe consistently outperforms existing W4A4 or W4A8 solutions in accuracy, while providing state-of-the-art LLM serving efficiency.
+### Efficiency Benchmarks
+When serving the large language models Llama-3-8B and Qwen1.5-72B on L40S and A100 GPUs, QServe demonstrates superior performance, achieving **1.2x-1.4x higher throughput** compared to the leading industry solution, TensorRT-LLM, for Llama-3-8B, and **2.4x-3.5x higher throughput** for Qwen1.5-72B. It is also able to **deliver higher throughput** and **accommodate the same batch size** on **L40S** compared with TensorRT-LLM on **A100** for six of eight models benchmarked, effectively reducing the dollar cost of LLM serving by around 3x.
+Benchmarking setting: the criterion is maximum achievable throughput on NVIDIA GPUs, with an input context length of 1024 tokens and an output generation length of 512 tokens. For all systems that support paged attention, we enable this feature. In-flight batching is turned off in the efficiency benchmarks.
+| L40S (48G)     | Llama-3-8B | Llama-2-7B | Mistral-7B | Llama-2-13B | Llama-30B | Yi-34B | Llama-2-70B | Qwen-1.5-72B |
+|----------------|------------ | ------------|------------|-------------|-----------|--------|-------------|--------------|
+| TRT-LLM-FP16    | 1326 | 444        | 1566       | 92          | OOM       | OOM    | OOM         | OOM          |
+| TRT-LLM-W4A16   | 1431 | 681        | 1457       | 368         | 148       | 313    | 119         | 17           |
+| TRT-LLM-W8A8    | 2634 | 1271       | 2569       | 440         | 123       | 364    | OOM         | OOM          |
+| Atom-W4A4      | -- | 2120       | --          | --           | --         | --      | --           | --            |
+| QuaRot-W4A4    | -- | 805        | --          | 413         | 133       | --      | --           | 15           |
+| QServe-W4A8KV4 | **3656** | **2394**       | **3774**       | **1327**        | **504**       | **869**    | **286**         | **59**           |
+| Throughput Increase*     | **1.39x** | **1.13x**      | **1.47x**      | **3.02x**       | **3.41x**     | **2.39x**  | **2.40x**       | **3.47x**        |
+| A100 (80G)   | Llama-3-8B | Llama-2-7B | Mistral-7B | Llama-2-13B | Llama-30B | Yi-34B | Llama-2-70B | Qwen-1.5-72B |
+|----------------| ------------| ------------|------------|-------------|-----------|--------|-------------|--------------|
+| TRT-LLM-FP16    | 2503 | 1549       | 2371       | 488         | 80        | 145    | OOM         | OOM          |
+| TRT-LLM-W4A16   | 2370 | 1549       | 2403       | 871         | 352       | 569    | 358         | 143          |
+| TRT-LLM-W8A8    | 2396 | 2334       | 2427       | 1277        | 361       | 649    | 235         | 53           |
+| Atom-W4A4      | -- | 1160       | --          | --           | --         | --      | --           | --            |
+| QuaRot-W4A4    | -- | 1370       | --         | 289         | 267       | --      | --           | 68           |
+| QServe-W4A8KV4 | **3005** | **2908**       | **2970**       | **1741**        | **749**       | **803**    | **419**         | **340**          |
+| Throughput Increase*     | **1.20x** | **1.25x**      | **1.22x**      | **1.36x**       | **2.07x**     | **1.23x**  | **1.17x**       | **2.38x**        |
+The absolute token generation throughputs of QServe and baseline systems (Unit: tokens/second. `--` means unsupported). All experiments were conducted under the same device memory budget. The throughput increase of QServe is calculated with regard to the best baseline in each column. We recommend QServe-per-channel on high-end datacenter GPUs like A100 and QServe-per-group on inference GPUs like L40S.
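+The "Throughput Increase" row can be reproduced from the table itself. The following is a minimal sketch; the `speedup` helper is an illustrative name and not part of QServe. It divides the QServe throughput by the best numeric baseline in the same column and ignores `OOM` and `--` entries.
```bash
# Hypothetical helper: speedup QSERVE_THROUGHPUT BASELINE...
speedup() {
  local qserve=$1
  shift
  # Keep the largest numeric baseline; non-numeric entries (OOM, --) are skipped.
  printf '%s\n' "$@" | LC_ALL=C awk -v q="$qserve" '
    $1 ~ /^[0-9]+$/ && $1 + 0 > best + 0 { best = $1 }
    END { if (best > 0) printf "%.2fx\n", q / best }'
}

# Llama-3-8B column of the L40S table: FP16, W4A16, W8A8, Atom, QuaRot
speedup 3656 1326 1431 2634 -- --
# Llama-2-70B column of the L40S table
speedup 286 OOM 119 OOM -- --
```
+With the values above, the two calls give 1.39x and 2.40x, matching the corresponding table cells.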
+Max throughput batch sizes used by QServe:
+| Device  | Llama-3-8B | Llama-2-7B | Mistral-7B | Llama-2-13B | Llama-30B | Yi-34B | Llama-2-70B | Qwen-1.5-72B |
+|----------------| ------------| ------------|------------|-------------|-----------|--------|-------------|--------------|
+| L40S    | 128 | 128       | 128       | 75         | 32        | 64    | 24         | 4          |
+| A100   | 256 | 190       | 256       | 128         | 64       | 196    | 96         | 32          |
+We recommend directly setting the `NUM_GPU_PAGE_BLOCKS` environment variable to `25 * batch size`. In our benchmarking setting, the context length is 1024 and the generation length is 512, which corresponds to 24 pages per sequence (each page contains 64 tokens). We leave some buffer by allocating one more page for each sequence.
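+The rule above can be checked with a little shell arithmetic. This is a minimal sketch: `PAGE_SIZE`, `CONTEXT_LEN`, `GEN_LEN`, and the `pages_for_batch` helper are illustrative names, not QServe options. Only `NUM_GPU_PAGE_BLOCKS` is an actual QServe environment variable.
```bash
PAGE_SIZE=64        # tokens per page, as stated in the text
CONTEXT_LEN=1024    # input context length in the benchmark setting
GEN_LEN=512         # output generation length in the benchmark setting

# Hypothetical helper: pages_for_batch BATCH_SIZE
pages_for_batch() {
  local batch_size=$1
  # Pages per sequence (rounded up) plus one spare page as buffer.
  local per_seq=$(( (CONTEXT_LEN + GEN_LEN + PAGE_SIZE - 1) / PAGE_SIZE + 1 ))
  echo $(( per_seq * batch_size ))
}

pages_for_batch 128   # 25 pages * 128 sequences
pages_for_batch 256   # 25 pages * 256 sequences
```
+For example, `NUM_GPU_PAGE_BLOCKS=$(pages_for_batch 256)` yields 6400, the value used in the batch-size-256 benchmarking command.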
+### Accuracy Evaluation
+QServe also maintains high accuracy thanks to the QoQ algorithm provided in our [DeepCompressor](https://github.com/mit-han-lab/deepcompressor/tree/lmquant-v0.0.0-deprecated) quantization library.
+Below is the WikiText2 perplexity evaluated with a sequence length of 2048. Lower is better.
+| Models      | Precision | Llama-3 8B | Llama-2 7B | Llama-2 13B | Llama-2 70B | Llama 7B | Llama 13B | Llama 30B | Mistral 7B | Yi 34B |
+|-------------|-----------|------------|------------|-------------|-------------|----------|-----------|-----------|------------|--------|
+| FP16        |              | 6.14       | 5.47       | 4.88        | 3.32        | 5.68     | 5.09      | 4.10      | 5.25       | 4.60   |
+| SmoothQuant | W8A8         | 6.28       | 5.54       | 4.95        | 3.36        | 5.73     | 5.13      | 4.23      | 5.29       | 4.69   |
+| GPTQ-R      | W4A16 g128   | 6.56       | 5.63       | 4.99        | 3.43        | 5.83     | 5.20      | 4.22      | 5.39       | 4.68   |
+| AWQ         | W4A16 g128   | 6.54       | 5.60       | 4.97        | 3.41        | 5.78     | 5.19      | 4.21      | 5.37       | 4.67   |
+| QuaRot      | W4A4         | 8.33       | 6.19       | 5.45        | 3.83        | 6.34     | 5.58      | 4.64      | 5.77       | NaN    |
+| Atom        | W4A4 g128    | 7.76       | 6.12       | 5.31        | 3.73        | 6.25     | 5.52      | 4.61      | 5.76       | 4.97   |
+| QoQ         | W4A8KV4      | 6.89       | 5.75       | 5.12        | 3.52        | 5.93     | 5.28      | 4.34      | 5.45       | 4.74   |
+| QoQ         | W4A8KV4 g128 | 6.76       | 5.70       | 5.08        | 3.47        | 5.89     | 5.25      | 4.28      | 5.42       | 4.76   |
+\* SmoothQuant is evaluated with per-tensor static KV cache quantization.
+## Citation
+```bibtex
 @article{lin2024qserve,
   title={QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving},
   author={Lin*, Yujun and Tang*, Haotian and Yang*, Shang and Zhang, Zhekai and Xiao, Guangxuan and Gan, Chuang and Han, Song},
 }
 ```
+Code: https://github.com/mit-han-lab/qserve