curryandsun committed · verified
Commit 302e55e · 1 Parent(s): 0e36ba9

Update README.md

Files changed (1): README.md (+53 −10)
README.md CHANGED
@@ -32,21 +32,19 @@ When it comes to benchmarks, Ring-flash-linear-2.0 not only holds its own agains
  </div>
 
  ## Evaluation
- <!-- To properly evaluate the model's reasoning capabilities, we compared it against 3 other models—Ring-mini-2.0, Qwen3-8B-thinking, and GPT-OSS-20B-Medium—on 6 challenging reasoning benchmarks spanning mathematics, coding, and science. The results demonstrate that the performance of the hybrid linear architecture is by no means inferior to that of standard softmax attention; in fact, it even outperforms the other models on 3 of the benchmarks.
  <div style="display: flex; justify-content: center;">
  <div style="text-align: center;">
- <img src="https://cdn-uploads.huggingface.co/production/uploads/68d20104a6f8ea66da0cb447/_tjjgBEBlankfrWUY0N9i.png" width="800">
+ <img src="https://mdn.alipayobjects.com/huamei_t783ie/afts/img/mc1wSo7zHV4AAAAARHAAAAgADgCDAQFr/original" width="800">
  <p style="margin-top: 8px; font-size: 14px;"><strong>Figure 2:</strong> Model Performance Comparison</p>
  </div>
  </div>
 
- Here is a demo of a small Snake game, with the code generated by our model.
  <div style="display: flex; justify-content: center;">
  <div style="text-align: center;">
- <img src="https://mdn.alipayobjects.com/huamei_jcuiuk/afts/img/tqfCQoTqRdAAAAAAgZAAAAgADr6CAQFr/original" width="800">
- <p style="margin-top: 8px; font-size: 14px;"><strong>Figure 3:</strong> Snake Game</p>
+ <img src="https://mdn.alipayobjects.com/huamei_t783ie/afts/img/N5xMTq4KouMAAAAARHAAAAgADgCDAQFr/original" width="800">
+ <p style="margin-top: 8px; font-size: 14px;"><strong>Figure 3:</strong> Model Performance Comparison</p>
  </div>
- </div> -->
+ </div>
 
 
  ## Linear Attention, Highly Sparse, High-Speed Generation
@@ -70,14 +68,14 @@ What is truly exciting is that in the comparison with Qwen3-32B, Ring-flash-line
  </div>
 
 
- ## Model Downloads
+ <!-- ## Model Downloads
 
  <div align="center">
 
  | **Model** | **Context Length** | **Download** |
  | :----------------: | :----------------: | :----------: |
  | Ring-flash-linear-2.0 | 128K | [🤗 HuggingFace](https://huggingface.co/inclusionAI/Ring-flash-linear-2.0) <br>[🤖 Modelscope](https://modelscope.cn/models/inclusionAI/Ring-flash-linear-2.0)|
- </div>
+ </div> -->
 
  ## Quickstart
 
@@ -178,6 +176,51 @@ curl -s http://localhost:${PORT}/v1/chat/completions \
 
  More usage can be found [here](https://docs.sglang.ai/basic_usage/send_request.html)
 
- ### vLLM
- TODO
+ ### 🚀 vLLM
+
+ #### Environment Preparation
+
+ Since our PR has not yet been submitted to the vLLM community, please prepare the environment by following the steps below. First install the pinned PyTorch build:
+ ```shell
+ pip install torch==2.7.0 torchvision==0.22.0
+ ```
+
+ Then install our vLLM wheel package:
+ ```shell
+ pip install https://github.com/inclusionAI/Ring-V2/blob/main/hybrid_linear/whls/vllm-0.8.5%2Bcuda12_8_gcc10_2_1-cp310-cp310-linux_x86_64.whl --no-deps --force-reinstall
+ ```
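+ A quick way to verify that the custom wheel is the one in use is to print the installed version, which should report a 0.8.5 build:
+ ```shell
+ python -c "import vllm; print(vllm.__version__)"
+ ```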
+
+ #### Offline Inference
+
+ ```python
+ from transformers import AutoTokenizer
+ from vllm import LLM, SamplingParams
+
+ # Use the same repo for the tokenizer and the model weights.
+ tokenizer = AutoTokenizer.from_pretrained("inclusionAI/Ring-flash-linear-2.0")
+
+ sampling_params = SamplingParams(temperature=0.6, max_tokens=8192)
+
+ # Prefix caching is disabled, matching the serving flags below.
+ llm = LLM(model="inclusionAI/Ring-flash-linear-2.0", dtype='bfloat16', enable_prefix_caching=False)
+ prompt = "Give me a short introduction to large language models."
+ messages = [
+     {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
+     {"role": "user", "content": prompt}
+ ]
+
+ # Render the chat template to a plain prompt string, then generate.
+ text = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True
+ )
+ outputs = llm.generate([text], sampling_params)
+ ```
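+ Each returned `RequestOutput` carries the generated continuation; a minimal way to read it back, using the standard vLLM result objects:
+ ```python
+ # First (and here only) completion for the first prompt.
+ print(outputs[0].outputs[0].text)
+ ```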
+
+ #### Online Inference
+ ```shell
+ vllm serve inclusionAI/Ring-flash-linear-2.0 \
+     --tensor-parallel-size 4 \
+     --gpu-memory-utilization 0.90 \
+     --max-num-seqs 512 \
+     --no-enable-prefix-caching
+ ```
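+ Once the server is up, it exposes the standard OpenAI-compatible API, just like the SGLang server above; a minimal request, assuming the default port 8000:
+ ```shell
+ curl -s http://localhost:8000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "inclusionAI/Ring-flash-linear-2.0",
+     "messages": [{"role": "user", "content": "Give me a short introduction to large language models."}],
+     "temperature": 0.6
+   }'
+ ```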
  ## Citation