---
license: gemma
pipeline_tag: text-generation
---

**Typhoon2.1-Gemma3-12B**: Thai Large Language Model (Instruct)

**Typhoon2.1-Gemma3-12B** is an instruct Thai 🇹🇭 large language model with 12 billion parameters, a 128K context length, and function-calling capabilities. It is based on Gemma3 12B.

**This repo contains an FP8 quantization of the original Typhoon2.1 12B model for more efficient deployment on NVIDIA Hopper and newer architectures.**

Remark: This is a text-only model.

## **Performance**

![12b model performance](https://storage.googleapis.com/typhoon-public/assets/typhoon-21/performance12b_table.png)

## **Model Description**

- **Model type**: A 12B instruct decoder-only model based on the Gemma3 architecture.
- **Requirement**: transformers 4.50.0 or newer.
- **Primary Language(s)**: Thai 🇹🇭 and English 🇬🇧
- **Context Length**: 128K
- **License**: [Gemma License](https://github.com/google-deepmind/gemma/blob/main/LICENSE)

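For quick experimentation with transformers (the requirement above), here is a minimal sketch. It is not the official quickstart: it assumes an available CUDA GPU, loads the unquantized scb10x/typhoon2.1-gemma3-12b checkpoint (the FP8 weights in this repo are primarily intended for vLLM serving, as shown in the next section), and uses an illustrative prompt.

```python
# Minimal sketch: chat-style generation with transformers >= 4.50.0.
# Assumes the unquantized checkpoint and a CUDA GPU; the prompt is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "scb10x/typhoon2.1-gemma3-12b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
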
## Deploy as Server

This section shows how to run Typhoon2.1 as an OpenAI-compatible API server using vLLM.

```bash
pip install vllm
vllm serve scb10x/typhoon2.1-gemma3-12b-fp8 --max-model-len 16000 --dtype bfloat16 --tool-call-parser pythonic --enable-auto-tool-choice
# adjust --max-model-len based on your available memory
```

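Once the server is up, any OpenAI-compatible client can query it. A minimal sketch using the openai Python package (the endpoint and dummy API key follow vLLM's defaults; the prompt and sampling settings are illustrative):

```python
# Minimal sketch: plain chat completion against the server started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="scb10x/typhoon2.1-gemma3-12b-fp8",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    max_tokens=512,
    temperature=0.6,
)
print(response.choices[0].message.content)
```
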
## Using Tools

You can provide tools to the vLLM-powered OpenAI-compatible API to enable function calling.

```python
from openai import OpenAI
import json

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

def get_weather(location: str, unit: str):
    return f"Getting the weather for {location} in {unit}..."

tool_functions = {"get_weather": get_weather}

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City and state, e.g., 'San Francisco, CA'"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["location", "unit"]
        }
    }
}]

response = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=[{"role": "user", "content": "What's the weather like in San Francisco?"}],
    tools=tools,
    tool_choice="auto"
)

tool_call = response.choices[0].message.tool_calls[0].function
print(f"Function called: {tool_call.name}")
print(f"Arguments: {tool_call.arguments}")
print(f"Result: {get_weather(**json.loads(tool_call.arguments))}")
```

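The call above only returns the tool call the model wants to make; executing the function and getting a final natural-language answer is up to the client. A sketch of the usual follow-up round trip, continuing from the variables defined in the example above (the message flow follows the standard OpenAI tool-calling pattern, not a Typhoon-specific API):

```python
# Sketch: execute the proposed tool call, then ask the model for a final answer.
# Continues from `client`, `response`, `tools`, `tool_functions`, and `json` above.
messages = [{"role": "user", "content": "What's the weather like in San Francisco?"}]
assistant_message = response.choices[0].message
messages.append(assistant_message)  # the assistant turn that contains the tool call

for call in assistant_message.tool_calls:
    result = tool_functions[call.function.name](**json.loads(call.function.arguments))
    messages.append({
        "role": "tool",
        "tool_call_id": call.id,
        "content": result,
    })

final = client.chat.completions.create(
    model=client.models.list().data[0].id,
    messages=messages,
    tools=tools,
)
print(final.choices[0].message.content)
```
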
## Switching Between Thinking and Non-Thinking Mode

Typhoon supports two modes:
- Non-thinking mode (default): fast response generation without extra reasoning steps.
- Thinking mode: the model first reasons internally inside <think>...</think> tags, then provides a clearer and potentially more accurate final answer.

You can turn on thinking mode in any of the following ways:
- by adding enable_thinking=True to apply_chat_template

```python
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    enable_thinking=True  # Switches between thinking and non-thinking modes. Default is False.
).to(model.device)
```

- manually, by supplying a thinking-mode system prompt

```
You are a helpful assistant. First, think through the reasoning internally, then present the reasoning within <think>...</think>. After thinking, clearly state a response that addresses the user's request and aligns with their preferences, not just providing a direct answer.
```

- in a vLLM-powered OpenAI-compatible client, by adding chat_template_kwargs to the POST payload

```json
{
  "model": "scb10x/typhoon2.1-gemma3-12b-fp8",
  "messages": [
    {"role": "user", "content": "Give me a short introduction to large language models."}
  ],
  "chat_template_kwargs": {"enable_thinking": true}
}
```
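
The same flag can also be passed from the OpenAI Python client via extra_body, which vLLM forwards as chat template kwargs (a sketch; the endpoint and model name follow the serve command above):

```python
# Sketch: enable thinking mode through the vLLM OpenAI-compatible server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="scb10x/typhoon2.1-gemma3-12b-fp8",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)
print(response.choices[0].message.content)
```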

## Budget forcing

This section introduces budget forcing, an advanced technique that lets the model spend more time and tokens reasoning before producing a final answer, which is great for improving performance on complex questions.

```python
from typing import List

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer


class BudgetForcingHandler:
    def __init__(self, model_name: str, max_think_token: int, max_ignore=5, temperature=0.6, seed=32):
        self.temperature = temperature
        self.seed = seed
        self.max_think_token = max_think_token
        self.max_ignore = max_ignore
        self.model = LLM(model_name, dtype='bfloat16', enforce_eager=True)
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.alternative_str = '\nAlternatively'
        self.system = """You are a reasoning assistant. First, think through the reasoning internally, then present the reasoning within <think>...</think>. After thinking, clearly state the final answer."""

    def __call__(self, prompts: List[str]):
        count_prompt = len(prompts)
        prompts = [self.tokenizer.apply_chat_template([{'role': 'system', 'content': self.system}, {'role': 'user', 'content': f'Please solve this math question, and put your final answer within \\boxed{{}}.\n{p}'}], add_generation_prompt=True, tokenize=False) for p in prompts]
        sampling_params = SamplingParams(
            max_tokens=self.max_think_token,
            seed=self.seed,
            stop=["</think>"],
            skip_special_tokens=False,
            temperature=self.temperature,
        )
        o = self.model.generate(
            prompts,
            sampling_params=sampling_params
        )

        outputs = [output.outputs[0].text for output in o]
        token_count = [len(output.outputs[0].token_ids) for output in o]
        for i in range(len(prompts)):
            prompts[i] = prompts[i] + outputs[i]

        for _ in range(self.max_ignore):  # Num of times to skip stop token
            inference_loop_prompts = []
            inference_idx = []
            max_inference_token = 0
            print('current token count: ', token_count)
            for i in range(len(prompts)):
                left_budget = self.max_think_token - token_count[i]
                if left_budget > 0:
                    prompts[i] = prompts[i] + self.alternative_str
                    inference_loop_prompts.append(prompts[i])
                    inference_idx.append(i)
                    if left_budget > max_inference_token:
                        max_inference_token = left_budget

            outputs = ['' for _ in range(len(prompts))]
            if max_inference_token == 0 or len(inference_loop_prompts) == 0:
                break
            sampling_params = SamplingParams(
                max_tokens=max_inference_token,
                min_tokens=1,
                seed=self.seed,
                stop=["</think>"],
                skip_special_tokens=False,
                temperature=self.temperature,
            )
            o = self.model.generate(
                inference_loop_prompts,
                sampling_params=sampling_params
            )
            assert len(inference_idx) == len(inference_loop_prompts)
            assert len(inference_idx) == len(o)
            for i, output in zip(inference_idx, o):
                outputs[i] = output.outputs[0].text

            for i, idx in enumerate(inference_idx):
                token_count[idx] = token_count[idx] + len(o[i].outputs[0].token_ids)

            for i in range(len(prompts)):
                prompts[i] = prompts[i] + outputs[i]

        print('generating answer...')
        prompts = [p + '\nTime\'s up. End of thinking process. Will answer immediately.\n</think>' for p in prompts]
        sampling_params = SamplingParams(
            max_tokens=2048,
            min_tokens=0,
            seed=self.seed,
            skip_special_tokens=False,
            temperature=self.temperature,
        )
        o = self.model.generate(
            prompts,
            sampling_params=sampling_params,
        )
        for i in range(len(prompts)):
            prompts[i] = prompts[i] + o[i].outputs[0].text
        assert len(prompts) == count_prompt
        return prompts


handler = BudgetForcingHandler("scb10x/typhoon2.1-gemma3-12b", max_think_token=2048)
handler(["How many r in raspberry?"])
```

## **Intended Uses & Limitations**

This model is an instruct model; however, it is still under development. It incorporates some level of guardrails, but it may still produce answers that are inaccurate, biased, or otherwise objectionable in response to user prompts. We recommend that developers assess these risks in the context of their use case.

## **Follow us**

**https://twitter.com/opentyphoon**

## **Support**

**https://discord.gg/us5gAYmrxw**

## **Citation**

- If you find Typhoon2 useful for your work, please cite it using:
```
@misc{typhoon2,
      title={Typhoon 2: A Family of Open Text and Multimodal Thai Large Language Models},
      author={Kunat Pipatanakul and Potsawee Manakul and Natapong Nitarach and Warit Sirichotedumrong and Surapon Nonesung and Teetouch Jaknamon and Parinthapat Pengpun and Pittawat Taveekitworachai and Adisai Na-Thalang and Sittipong Sripaisarnmongkol and Krisanapong Jirayoot and Kasima Tharnpipitchai},
      year={2024},
      eprint={2412.13702},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2412.13702},
}
```