feihu.hf committed
Commit · f376580
Parent(s): 35d5786
update README
README.md
CHANGED
@@ -32,8 +32,8 @@ Qwen3 is the latest generation of large language models in Qwen series, offering
 - Number of Attention Heads (GQA): 32 for Q and 4 for KV
 - Number of Experts: 128
 - Number of Activated Experts: 8
-- Context Length: 32,768.
+- Context Length: 32,768 natively and [131,072 tokens with YaRN](#processing-long-texts).
 
 - Quantization: q4_K_M, q5_0, q5_K_M, q6_K, q8_0
 
 For more details, including benchmark evaluation, hardware requirements, and inference performance, please refer to our [blog](https://qwenlm.github.io/blog/qwen3/), [GitHub](https://github.com/QwenLM/Qwen3), and [Documentation](https://qwen.readthedocs.io/en/latest/).
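For orientation, a minimal sketch of running one of the quantizations listed above with the same `llama-cli` tool used later in this README. The GGUF filename is a hypothetical placeholder, not a file named in the diff:

```shell
# Minimal sketch: run a q4_K_M quant with llama.cpp's CLI.
# <model>-q4_K_M.gguf is a placeholder; substitute the actual GGUF file from this repo.
./llama-cli -m <model>-q4_K_M.gguf -c 32768 -p "Introduce large language models briefly."
```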
@@ -86,6 +86,24 @@ The word strawberries contains 3 instances of the letter r. [...]
 ```
 
 
+## Processing Long Texts
+
+Qwen3 natively supports context lengths of up to 32,768 tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend using RoPE scaling techniques to handle long texts effectively. We have validated the model's performance on context lengths of up to 131,072 tokens using the [YaRN](https://arxiv.org/abs/2309.00071) method.
+
+To enable YaRN in ``llama.cpp``:
+
+```shell
+./llama-cli ... -c 131072 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768
+```
+
+> [!NOTE]
+> All notable open-source frameworks implement static YaRN, which means the scaling factor remains constant regardless of input length, **potentially impacting performance on shorter texts.**
+> We advise adding the `rope_scaling` configuration only when processing long contexts is required.
+> It is also recommended to modify the `factor` as needed. For example, if the typical context length for your application is 65,536 tokens, it would be better to set `factor` to 2.0.
+
+> [!TIP]
+> The endpoint provided by Alibaba Model Studio supports dynamic YaRN by default and no extra configuration is needed.
+
 ## Best Practices
 
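As a concrete instance of the `factor` guidance in the note above: the scale factor is the target context length divided by the native 32,768, so a 65,536-token workload takes a factor of 2. A sketch reusing the same flags, where `...` stands in for the other arguments exactly as in the original command:

```shell
# Hypothetical variant of the command above for ~65,536-token contexts:
# rope scale = target context / native context = 65536 / 32768 = 2
./llama-cli ... -c 65536 --rope-scaling yarn --rope-scale 2 --yarn-orig-ctx 32768
```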