feihu.hf committed on
Commit f376580 · 1 Parent(s): 35d5786

update README

Files changed (1)
  1. README.md +19 -1
README.md CHANGED
@@ -32,8 +32,8 @@ Qwen3 is the latest generation of large language models in Qwen series, offering
  - Number of Attention Heads (GQA): 32 for Q and 4 for KV
  - Number of Experts: 128
  - Number of Activated Experts: 8
- - Context Length: 32,768.
+ - Context Length: 32,768 natively and [131,072 tokens with YaRN](#processing-long-texts).
 
  - Quantization: q4_K_M, q5_0, q5_K_M, q6_K, q8_0
 
  For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our [blog](https://qwenlm.github.io/blog/qwen3/), [GitHub](https://github.com/QwenLM/Qwen3), and [Documentation](https://qwen.readthedocs.io/en/latest/).
@@ -86,6 +86,24 @@ The word strawberries contains 3 instances of the letter r. [...]
  ```
 
 
+ ## Processing Long Texts
+
+ Qwen3 natively supports context lengths of up to 32,768 tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend using RoPE scaling techniques to handle long texts effectively. We have validated the model's performance on context lengths of up to 131,072 tokens using the [YaRN](https://arxiv.org/abs/2309.00071) method.
+
+ To enable YaRN in ``llama.cpp``:
+
+ ```shell
+ ./llama-cli ... -c 131072 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
+ ```
+
+ > [!NOTE]
+ > All the notable open-source frameworks implement static YaRN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts.**
+ > We advise adding the `rope_scaling` configuration only when processing long contexts is required.
+ > It is also recommended to modify the `factor` as needed. For example, if the typical context length for your application is 65,536 tokens, it would be better to set `factor` as 2.0.
+
+ > [!TIP]
+ > The endpoint provided by Alibaba Model Studio supports dynamic YaRN by default and no extra configuration is needed.
+

  ## Best Practices

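The `factor` guidance in the note maps directly onto the `llama.cpp` flags used in the commit: `--rope-scale` is the YaRN factor, `--yarn-orig-ctx` is the native window, and `-c` is the scaled context, so a factor of 2.0 over the native 32,768 tokens yields a 65,536-token window. A minimal sketch of that adjusted invocation (the `...` stands for your usual model and prompt arguments, as in the commit's own example):

```shell
# YaRN sized for ~65,536-token contexts: factor 2 x native 32,768.
./llama-cli ... -c 65536 --rope-scaling yarn --rope-scale 2 --yarn-orig-ctx 32768
```

`llama-server` accepts the same rope-scaling options, so the pairing of `-c` with `--rope-scale` carries over unchanged when serving the model.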