AXERA-TECH/Qwen2.5-1.5B-Instruct

@@ -4,12 +4,14 @@ language:
 - zh
 - en
 base_model:
-- Qwen/Qwen2.5-1.5B-Instruct
 pipeline_tag: text-generation
 library_name: transformers
 tags:
 - Context
-- Qwen2.5-1.5B
 ---
 # Qwen2.5-1.5B-Instruct-CTX-Int8
@@ -18,7 +20,7 @@ This version of Qwen2.5-1.5B-Instruct-CTX-Int8 has been converted to run on the
 This model has been optimized with the following LoRA:
-Compatible with Pulsar2 version: 4.0(Not released yet)
 ## Feature
@@ -36,6 +38,23 @@ For those who are interested in model conversion, you can try to export axmodel
 [AXera NPU AXCL LLM Runtime](https://github.com/ZHEQIUSHUI/ax-llm/tree/axcl-context-kvcache)
 ## Support Platform
 - AX650
@@ -47,7 +66,7 @@ For those who are interested in model conversion, you can try to export axmodel
 |Chips|w8a16|w4a16| DDR | Flash |
 |--|--|--|--|--|
-|AX650| 11 tokens/sec| *TBD* | 2.3GB | 2.3GB |
 ## How to use
@@ -56,17 +75,20 @@ Download all files from this repository to the device
 ```
 root@ax650:/mnt/qtang/llm-test/Qwen2.5-1.5B-Instruct-CTX-Int8# tree -L 1
 .
-├── kvcache
-├── main
 ├── main_axcl_aarch64
 ├── main_axcl_x86
 ├── post_config.json
 ├── qwen2.5-1.5b-ctx-ax650
 ├── qwen2.5_tokenizer
 ├── qwen2.5_tokenizer_uid.py
 ├── run_qwen2.5_1.5b_ctx_ax650.sh
 ├── run_qwen2.5_1.5b_ctx_axcl_aarch64.sh
-└── run_qwen2.5_1.5b_ctx_axcl_x86.sh
 ```
 #### Start the Tokenizer service

 - zh
 - en
 base_model:
+- Qwen/Qwen2.5-1.5B-Instruct-GPTQ-INT8
+- Qwen/Qwen2.5-1.5B-Instruct-GPTQ-INT4
 pipeline_tag: text-generation
 library_name: transformers
 tags:
 - Context
+- Qwen2.5-1.5B-Instruct-GPTQ-INT8
+- Qwen2.5-1.5B-Instruct-GPTQ-INT4
 ---
 # Qwen2.5-1.5B-Instruct-CTX-Int8
 This model has been optimized with the following LoRA:
+Compatible with Pulsar2 version: 4.1
 ## Feature
 [AXera NPU AXCL LLM Runtime](https://github.com/ZHEQIUSHUI/ax-llm/tree/axcl-context-kvcache)
+### Convert script
+```
+pulsar2 llm_build --input_path Qwen/Qwen2.5-1.5B-Instruct-GPTQ-Int8  \
+                  --output_path Qwen/Qwen2.5-1.5B-Instruct-GPTQ-Int8-ctx-ax650 \
+                  --hidden_state_type bf16 --kv_cache_len 2047 --prefill_len 128 \
+                  --last_kv_cache_len 128 \
+                  --last_kv_cache_len 256 \
+                  --last_kv_cache_len 384 \
+                  --last_kv_cache_len 512 \
+                  --last_kv_cache_len 640 \
+                  --last_kv_cache_len 768 \
+                  --last_kv_cache_len 896 \
+                  --last_kv_cache_len 1024 \
+                  --chip AX650 -c 1 --parallel 8
+```
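
The convert script added above repeats `--last_kv_cache_len` once per multiple of 128 up to 1024. As a minimal sketch, the same argument list could be generated rather than typed by hand. The helper name `build_llm_build_cmd` is hypothetical and not part of Pulsar2; every flag and value is copied from the script in this diff, and the snippet only prints the command instead of running it.

```python
import shlex


def build_llm_build_cmd(input_path, output_path, last_kv_cache_lens):
    """Assemble the pulsar2 llm_build argument list shown in the diff.

    Hypothetical helper: only the flags that appear in the convert
    script are used, with the same fixed values.
    """
    cmd = [
        "pulsar2", "llm_build",
        "--input_path", input_path,
        "--output_path", output_path,
        "--hidden_state_type", "bf16",
        "--kv_cache_len", "2047",
        "--prefill_len", "128",
    ]
    # One --last_kv_cache_len flag per requested length, in order.
    for length in last_kv_cache_lens:
        cmd += ["--last_kv_cache_len", str(length)]
    cmd += ["--chip", "AX650", "-c", "1", "--parallel", "8"]
    return cmd


# 128, 256, ..., 1024: the eight values listed in the convert script.
cmd = build_llm_build_cmd(
    "Qwen/Qwen2.5-1.5B-Instruct-GPTQ-Int8",
    "Qwen/Qwen2.5-1.5B-Instruct-GPTQ-Int8-ctx-ax650",
    range(128, 1025, 128),
)
print(shlex.join(cmd))
```

Building the command as a list keeps each flag and its value as separate arguments, so the result can be passed to `subprocess.run(cmd, check=True)` on a machine where Pulsar2 is installed, without shell quoting concerns.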
 ## Support Platform
 - AX650
 |Chips|w8a16|w4a16| DDR | Flash |
 |--|--|--|--|--|
+|AX650| 12 tokens/sec| 17 tokens/sec | 2.3GB | 2.3GB |
 ## How to use
 ```
 root@ax650:/mnt/qtang/llm-test/Qwen2.5-1.5B-Instruct-CTX-Int8# tree -L 1
 .
+├── main_api
+├── main_ax650
 ├── main_axcl_aarch64
 ├── main_axcl_x86
 ├── post_config.json
 ├── qwen2.5-1.5b-ctx-ax650
+├── qwen2.5-1.5b-ctx-int4-ax650
 ├── qwen2.5_tokenizer
 ├── qwen2.5_tokenizer_uid.py
+├── run_qwen2.5_1.5b_ctx_ax650_api.sh
 ├── run_qwen2.5_1.5b_ctx_ax650.sh
 ├── run_qwen2.5_1.5b_ctx_axcl_aarch64.sh
+├── run_qwen2.5_1.5b_ctx_axcl_x86.sh
+└── run_qwen2.5_1.5b_ctx_int4_ax650.sh
 ```
 #### Start the Tokenizer service