feihu.hf committed
Commit · f376580
Parent(s): 35d5786
update README
README.md
CHANGED
@@ -32,8 +32,8 @@ Qwen3 is the latest generation of large language models in Qwen series, offering
 - Number of Attention Heads (GQA): 32 for Q and 4 for KV
 - Number of Experts: 128
 - Number of Activated Experts: 8
-- Context Length: 32,768.
+- Context Length: 32,768 natively and [131,072 tokens with YaRN](#processing-long-texts).
 
 - Quantization: q4_K_M, q5_0, q5_K_M, q6_K, q8_0
 
 For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our [blog](https://qwenlm.github.io/blog/qwen3/), [GitHub](https://github.com/QwenLM/Qwen3), and [Documentation](https://qwen.readthedocs.io/en/latest/).
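For orientation, a minimal sketch of running one of the quantizations listed above with the same `llama-cli` tool used later in this README. The GGUF filename is a hypothetical placeholder, not a file named in the diff:

```shell
# Minimal sketch: run a q4_K_M quant with llama.cpp's CLI.
# <model>-q4_K_M.gguf is a placeholder; substitute the actual GGUF file from this repo.
./llama-cli -m <model>-q4_K_M.gguf -c 32768 -p "Introduce large language models briefly."
```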
@@ -86,6 +86,24 @@ The word strawberries contains 3 instances of the letter r. [...]
 ```
 
 
+## Processing Long Texts
+
+Qwen3 natively supports context lengths of up to 32,768 tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend using RoPE scaling techniques to handle long texts effectively. We have validated the model's performance on context lengths of up to 131,072 tokens using the [YaRN](https://arxiv.org/abs/2309.00071) method.
+
+To enable YaRN in ``llama.cpp``:
+
+```shell
+./llama-cli ... -c 131072 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
+```
+
+> [!NOTE]
+> All notable open-source frameworks implement static YaRN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts.**
+> We advise adding the `rope_scaling` configuration only when processing long contexts is required.
+> It is also recommended to modify the `factor` as needed. For example, if the typical context length for your application is 65,536 tokens, it would be better to set `factor` to 2.0.
+
+> [!TIP]
+> The endpoint provided by Alibaba Model Studio supports dynamic YaRN by default and no extra configuration is needed.
+
 ## Best Practices
 
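As a concrete instance of the `factor` guidance in the note above: the scale factor is the target context length divided by the native 32,768, so a 65,536-token workload takes a factor of 2. A sketch reusing the same flags, where `...` stands in for the other arguments exactly as in the original command:

```shell
# Hypothetical variant of the command above for ~65,536-token contexts:
# rope scale = target context / native context = 65536 / 32768 = 2
./llama-cli ... -c 65536 --rope-scaling yarn --rope-scale 2 --yarn-orig-ctx 32768
```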