inclusionAI/Ring-flash-linear-2.0

@@ -32,21 +32,19 @@ When it comes to benchmarks, Ring-flash-linear-2.0 not only holds its own agains
 </div>
 ## Evaluation
-<!-- To properly evaluate the model's reasoning capabilities, we compared it against 3 other models—Ring-mini-2.0, Qwen3-8B-thinking, and GPT-OSS-20B-Medium—on 6 challenging reasoning benchmarks spanning mathematics, coding, and science. The results demonstrate that the performance of the hybrid linear architecture is by no means inferior to that of standard softmax attention; in fact, it even outperforms the other models on 3 of the benchmarks.
 <div style="display: flex; justify-content: center;">
   <div style="text-align: center;">
-    <img src="https://cdn-uploads.huggingface.co/production/uploads/68d20104a6f8ea66da0cb447/_tjjgBEBlankfrWUY0N9i.png" width="800">
     <p style="margin-top: 8px; font-size: 14px;"><strong>Figure 2:</strong> Model Performance Comparison </p>
   </div>
 </div>
-Here is a demo of a small Snake game, with the code generated by our model.
 <div style="display: flex; justify-content: center;">
   <div style="text-align: center;">
-    <img src="https://mdn.alipayobjects.com/huamei_jcuiuk/afts/img/tqfCQoTqRdAAAAAAgZAAAAgADr6CAQFr/original" width="800">
-    <p style="margin-top: 8px; font-size: 14px;"><strong>Figure 3:</strong> Snake Game </p>
   </div>
-</div> -->
 ## Linear Attention, Highly Sparse, High-Speed Generation
@@ -70,14 +68,14 @@ What is truly exciting is that in the comparison with Qwen3-32B, Ring-flash-line
 </div>
-## Model Downloads
 <div align="center">
 |     **Model**     | **Context Length** | **Download** |
 | :----------------: | :----------------: | :----------: |
 | Ring-flash-linear-2.0 |        128K         |      [🤗 HuggingFace](https://huggingface.co/inclusionAI/Ring-flash-linear-2.0) <br>[🤖 Modelscope](https://modelscope.cn/models/inclusionAI/Ring-flash-linear-2.0)|
-</div>
 ## Quickstart
@@ -178,6 +176,51 @@ curl -s http://localhost:${PORT}/v1/chat/completions \
 More usage can be found [here](https://docs.sglang.ai/basic_usage/send_request.html)
-### vLLM
-TODO
 ## Citation

 </div>
 ## Evaluation
 <div style="display: flex; justify-content: center;">
   <div style="text-align: center;">
+    <img src="https://mdn.alipayobjects.com/huamei_t783ie/afts/img/mc1wSo7zHV4AAAAARHAAAAgADgCDAQFr/original" width="800">
     <p style="margin-top: 8px; font-size: 14px;"><strong>Figure 2:</strong> Model Performance Comparison </p>
   </div>
 </div>
 <div style="display: flex; justify-content: center;">
   <div style="text-align: center;">
+    <img src="https://mdn.alipayobjects.com/huamei_t783ie/afts/img/N5xMTq4KouMAAAAARHAAAAgADgCDAQFr/original" width="800">
+    <p style="margin-top: 8px; font-size: 14px;"><strong>Figure 3:</strong> Model Performance Comparison </p>
   </div>
+</div>
 ## Linear Attention, Highly Sparse, High-Speed Generation
 </div>
+<!-- ## Model Downloads
 <div align="center">
 |     **Model**     | **Context Length** | **Download** |
 | :----------------: | :----------------: | :----------: |
 | Ring-flash-linear-2.0 |        128K         |      [🤗 HuggingFace](https://huggingface.co/inclusionAI/Ring-flash-linear-2.0) <br>[🤖 Modelscope](https://modelscope.cn/models/inclusionAI/Ring-flash-linear-2.0)|
+</div> -->
 ## Quickstart
 More usage can be found [here](https://docs.sglang.ai/basic_usage/send_request.html)
+### 🚀 vLLM
+#### Environment Preparation
+The pull request that adds support for this model has not yet been submitted to the vLLM project, so prepare the environment with the following steps:
+```shell
+pip install torch==2.7.0 torchvision==0.22.0
+```
+Then install our prebuilt vLLM wheel (per its filename, built for CPython 3.10 on Linux x86_64 with CUDA 12.8):
+```shell
+pip install https://github.com/inclusionAI/Ring-V2/raw/main/hybrid_linear/whls/vllm-0.8.5%2Bcuda12_8_gcc10_2_1-cp310-cp310-linux_x86_64.whl --no-deps --force-reinstall
+```
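The wheel filename carries standard compatibility tags (interpreter, ABI, platform), so a quick pre-install check can catch a mismatched environment. The following is a minimal sketch, not part of the project: `wheel_tags` and `interpreter_matches` are hypothetical helper names, and the comparison is a simplified one that only understands CPython tags and `system_machine` platform strings rather than the full tag-matching logic pip uses.

```python
import platform
import sys
from urllib.parse import unquote

# Filename taken from the install command above ("%2B" is a URL-encoded "+").
WHEEL = "vllm-0.8.5%2Bcuda12_8_gcc10_2_1-cp310-cp310-linux_x86_64.whl"


def wheel_tags(filename):
    """Return (python_tag, abi_tag, platform_tag) parsed from a wheel filename."""
    stem = unquote(filename)[: -len(".whl")]
    # The last three dash-separated fields are the compatibility tags.
    python_tag, abi_tag, platform_tag = stem.split("-")[-3:]
    return python_tag, abi_tag, platform_tag


def interpreter_matches(python_tag, platform_tag, version_info=None, system=None, machine=None):
    """Simplified check: does this CPython version and OS/arch match the tags?"""
    version_info = version_info or sys.version_info
    system = system or platform.system()
    machine = machine or platform.machine()
    expected_python = f"cp{version_info[0]}{version_info[1]}"
    expected_platform = f"{system.lower()}_{machine}"
    return python_tag == expected_python and platform_tag == expected_platform
```

Calling `interpreter_matches(wheel_tags(WHEEL)[0], wheel_tags(WHEEL)[2])` returns `True` only on a CPython 3.10 interpreter running on Linux x86_64, which is what the `cp310` and `linux_x86_64` tags require.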
+#### Offline Inference
+```python
+from transformers import AutoTokenizer
+from vllm import LLM, SamplingParams
+# Load the tokenizer from the same checkpoint that vLLM runs.
+tokenizer = AutoTokenizer.from_pretrained("inclusionAI/Ring-flash-linear-2.0")
+sampling_params = SamplingParams(temperature=0.6, max_tokens=8192)
+# Prefix caching is disabled, matching --no-enable-prefix-caching in online mode.
+llm = LLM(model="inclusionAI/Ring-flash-linear-2.0", dtype='bfloat16', enable_prefix_caching=False)
+prompt = "Give me a short introduction to large language models."
+messages = [
+    {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
+    {"role": "user", "content": prompt}
+]
+# Render the chat messages into one prompt string with the model's chat template.
+text = tokenizer.apply_chat_template(
+    messages,
+    tokenize=False,
+    add_generation_prompt=True
+)
+# generate() returns one result object per input prompt.
+outputs = llm.generate([text], sampling_params)
+```
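The offline snippet stops at `llm.generate`, which returns one result object per prompt; each result holds a list of completions under `.outputs`, and each completion exposes its text as `.text`. A minimal sketch of reading the results follows. The helper name `completion_texts` is hypothetical, and the stand-in objects exist only so the sketch runs without a GPU or vLLM installed.

```python
from types import SimpleNamespace


def completion_texts(request_outputs):
    """Return the first generated completion text for each request."""
    return [request.outputs[0].text for request in request_outputs]


# Stand-in shaped like vLLM's result objects (assumption: for illustration only).
fake_outputs = [
    SimpleNamespace(prompt="Hi", outputs=[SimpleNamespace(text="Hello")]),
]
print(completion_texts(fake_outputs))  # prints ['Hello']
```

With the real objects from the offline example, `completion_texts(outputs)` would return a one-element list containing the model's reply to the single prompt.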
+#### Online Inference
+```shell
+vllm serve inclusionAI/Ring-flash-linear-2.0 \
+              --tensor-parallel-size 4 \
+              --gpu-memory-utilization 0.90 \
+              --max-num-seqs 512 \
+              --no-enable-prefix-caching
+```
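Once `vllm serve` is running, it exposes an OpenAI-compatible HTTP API. The sketch below shows one way a client might query it using only the Python standard library. It assumes the server listens on vLLM's default port 8000 on localhost; `build_chat_payload` and `chat` are hypothetical helper names, and the sampling values simply mirror the offline example.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: default vLLM port, local host
MODEL = "inclusionAI/Ring-flash-linear-2.0"  # must match the name passed to `vllm serve`


def build_chat_payload(prompt, temperature=0.6, max_tokens=8192):
    """Build the JSON body for a /chat/completions request."""
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": "You are Ling, an assistant created by inclusionAI"},
            {"role": "user", "content": prompt},
        ],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


def chat(prompt, base_url=BASE_URL):
    """POST the prompt to the server and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, `chat("Give me a short introduction to large language models.")` returns the reply as a string. If the server was started with a different host or port, pass a matching `base_url`.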
 ## Citation