heylimon committed
Commit e8209fc · verified · 1 Parent(s): 3c90d8b

Update README.md

Files changed (1)
  1. README.md +55 -42
README.md CHANGED
@@ -51,9 +51,59 @@ For more details, see:
| Think mode (standard requests) | ≈ 0.6 | 1.0 |
| Complex reasoning requests | ≥ 0.8 | 1.0 |

+ - Hybrid reasoning models need careful tuning of sampling hyperparameters, which vary by domain.
+ - Use a lower temperature for straightforward queries and a higher temperature for complex 'think-mode' tasks.
+ - A presence_penalty between 0 and 2 can help avoid repetitive outputs.
+

## 👨‍💻 Examples of usage

+
+
+ ## SGLang Usage
+ For better quality and stable performance, we recommend SGLang as your inference framework.
+
+ To run an inference server for **T-pro IT 2.0**, start by launching the SGLang server:
+
+ ```bash
+ python -m sglang.launch_server \
+     --model-path t-tech/T-pro-it-2.0 \
+     --reasoning-parser qwen3
+ ```
+
+ Once the server is up and listening on `localhost:30000`, you can send chat-based requests via the OpenAI Python client.
+
+ ```python
+ import openai
+
+ client = openai.OpenAI(
+     base_url="http://127.0.0.1:30000/v1",
+     api_key="ANY"  # the server ignores the API key
+ )
+
+ prompt = (
+     "Пожалуйста, вычисли определённый интеграл ∫_0^1 x² eˣ dx, "
+     "пошагово объясни решение и укажи окончательный результат."
+ )
+
+ completion = client.chat.completions.create(
+     model="ANY",  # the server ignores the model name
+     messages=[
+         {"role": "system", "content": "Ты T-pro, виртуальный ассистент в Т-Технологии. Твоя задача - быть полезным диалоговым ассистентом."},
+         {"role": "user", "content": prompt}
+     ],
+     # REQUIRED: sampling params from the "Recommended Generation Parameters" table
+     temperature=0.6,
+     presence_penalty=1.0,
+ )
+
+ # The generated reply is in `completion.choices[0].message.content`
+ print(completion.choices[0].message.content)
+ ```
+
+ **Note:** It is **obligatory** to include both `temperature` and `presence_penalty` in every completion call.
+
+
### HF Usage

```python

@@ -242,46 +292,9 @@ generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)
```

-
- ## SGLang Usage
-
- To run an inference server for **T-pro IT 2.0**, start by launching the SGLang server:
-
- ```bash
- python -m sglang.launch_server \
-     --model-path t-tech/T-pro-it-2.0 \
-     --reasoning-parser qwen3
- ```
-
- Once the server is up and listening on `localhost:30000`, you can send chat-based requests via the OpenAI Python client.
-
- ```python
- import openai
-
- client = openai.OpenAI(
-     base_url="http://127.0.0.1:30000/v1",
-     api_key="ANY"  # the server ignores the API key
- )
-
- prompt = (
-     "Пожалуйста, вычисли определённый интеграл ∫_0^1 x² eˣ dx, "
-     "пошагово объясни решение и укажи окончательный результат."
- )
-
- completion = client.chat.completions.create(
-     model="ANY",  # the server ignores the model name
-     messages=[
-         {"role": "system", "content": "Ты T-pro, виртуальный ассистент в Т-Технологии. Твоя задача - быть полезным диалоговым ассистентом."},
-         {"role": "user", "content": prompt}
-     ],
-     # REQUIRED: sampling params from the "Recommended Generation Parameters" table
-     temperature=0.6,
-     presence_penalty=1.0,
- )
-
- # The generated reply is in `completion.choices[0].message.content`
- print(completion.choices[0].message.content)
- ```
-
- **Note:** It is **obligatory** to include both `temperature` and `presence_penalty` in every completion call.

+ ## Long Context Usage
+ T-pro-it-2.0 natively supports a context length of 32,768 tokens.
+ For conversations where the input significantly exceeds this limit, follow the recommendations from the [Qwen3 model card](https://huggingface.co/Qwen/Qwen3-235B-A22B#processing-long-texts) on processing long texts.
+
+ For example, with llama.cpp (`llama-server`), you can enable 128K context support with the following command:
+ `llama-server ... --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768`
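
The `llama-server` command above uses llama.cpp's YaRN flags. For the SGLang server that the README recommends, a rough equivalent is sketched below; it is not part of this commit, the `--json-model-override-args` payload follows the Qwen3 model card's YaRN example, and the exact flags are an assumption to verify against your installed SGLang version.

```bash
# Sketch: launch SGLang with YaRN rope scaling (factor 4 x 32,768 ~= 131K tokens).
# Flag names and the rope_scaling payload are assumed from Qwen3-style configs; verify before use.
python -m sglang.launch_server \
    --model-path t-tech/T-pro-it-2.0 \
    --reasoning-parser qwen3 \
    --context-length 131072 \
    --json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}'
```

As with the standard launch, requests against this server should still set the recommended `temperature` and `presence_penalty`.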