Update README.md
README.md
| Think mode (standard requests) | ≈ 0.6 | 1.0 |
| Complex reasoning requests | ≥ 0.8 | 1.0 |

- Hybrid reasoning models need careful tuning of sampling hyperparameters, which vary by domain.
- Use a lower temperature for straightforward queries and a higher temperature for complex 'think-mode' tasks (see the sketch after this list).
- A `presence_penalty` between 0 and 2 can help avoid repetitive outputs.
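As a rough sketch only (the names and helper below are illustrative and not part of the README; the values come from the table above and should still be tuned per domain), the two rows translate into OpenAI-style sampling arguments like this:

```python
# Illustrative presets mirroring the "Recommended Generation Parameters" table above.
# The names and the helper are hypothetical; tune the values for your domain.
THINK_MODE = {"temperature": 0.6, "presence_penalty": 1.0}         # standard requests
COMPLEX_REASONING = {"temperature": 0.8, "presence_penalty": 1.0}  # complex reasoning (temperature >= 0.8)

def sampling_params(complex_task: bool = False) -> dict:
    """Return the sampling kwargs to pass to a chat.completions.create call."""
    return COMPLEX_REASONING if complex_task else THINK_MODE
```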
## 👨💻 Examples of usage

## SGLang Usage

For better quality and stable performance, we recommend SGLang as your inference framework.

To run an inference server for **T-pro IT 2.0**, start by launching the SGLang server:

```bash
python -m sglang.launch_server \
    --model-path t-tech/T-pro-it-2.0 \
    --reasoning-parser qwen3
```

Once the server is up and listening on `localhost:30000`, you can send chat-based requests via the OpenAI Python client.

```python
import openai

client = openai.OpenAI(
    base_url="http://127.0.0.1:30000/v1",
    api_key="ANY"  # the server ignores the API key
)

# User prompt (Russian): "Please compute the definite integral ∫_0^1 x² eˣ dx,
# explain the solution step by step, and give the final result."
prompt = (
    "Пожалуйста, вычисли определённый интеграл ∫_0^1 x² eˣ dx, "
    "пошагово объясни решение и укажи окончательный результат."
)

completion = client.chat.completions.create(
    model="ANY",  # the server ignores the model name
    messages=[
        # System prompt (Russian): "You are T-pro, a virtual assistant at T-Technologies.
        # Your task is to be a helpful dialogue assistant."
        {"role": "system", "content": "Ты T-pro, виртуальный ассистент в Т-Технологии. Твоя задача - быть полезным диалоговым ассистентом."},
        {"role": "user", "content": prompt}
    ],
    # REQUIRED: sampling params from the "Recommended Generation Parameters" table
    temperature=0.6,
    presence_penalty=1.0,
)

# The generated reply is in `completion.choices[0].message.content`
print(completion.choices[0].message.content)
```

**Note:** It is **obligatory** to include both `temperature` and `presence_penalty` in every completion call.
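One minimal way to keep that requirement from being dropped by accident is to route every request through a small wrapper; the helper below is a hypothetical sketch rather than part of the README:

```python
# Hypothetical wrapper: every call goes through one place, so the required
# sampling parameters are always set (defaults follow the think-mode row above).
def tpro_chat(client, messages, temperature=0.6, presence_penalty=1.0, **kwargs):
    return client.chat.completions.create(
        model="ANY",                         # the SGLang server ignores the model name
        messages=messages,
        temperature=temperature,             # required
        presence_penalty=presence_penalty,   # required
        **kwargs,
    )
```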
### HF Usage

```python
# ...
generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)
```

## Long Context Usage

T-pro-it-2.0 natively supports a context length of 32,768 tokens.
For conversations where the input significantly exceeds this limit, follow the recommendations from the [Qwen3 model card](https://huggingface.co/Qwen/Qwen3-235B-A22B#processing-long-texts) on processing long texts.

For example, in llama.cpp you can enable 128K context support with the following command:

`llama-server ... --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 32768`
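Since SGLang is the recommended framework above, the same Qwen3 guidance applies there as well. As an assumed equivalent taken from that model card (verify the flags against your SGLang version), the server can be launched with a YaRN rope-scaling override, e.g. `python -m sglang.launch_server --model-path t-tech/T-pro-it-2.0 --reasoning-parser qwen3 --context-length 131072 --json-model-override-args '{"rope_scaling":{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}}'`.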