Darkhn committed on
Commit e4604b6 · verified · 1 Parent(s): 274e504

Update README.md

Files changed (1):
  1. README.md +242 -241

README.md CHANGED
@@ -1,241 +1,242 @@

Removed front matter (only the YAML front matter changed in this commit; the README body is otherwise identical):

- ---
- license: llama3.1
- library_name: transformers
- base_model:
- - meta-llama/Llama-3.1-70B
- pipeline_tag: text-generation
- ---
Added front matter:

+ ---
+ base_model_relation: quantized
+ license: llama3.1
+ library_name: transformers
+ pipeline_tag: text-generation
+ base_model:
+ - deepcogito/cogito-v1-preview-llama-70B
+ ---

The updated README.md in full:

<p align="center">
  <img src="images/deep-cogito-logo.png" alt="Logo" width="40%">
</p>


# Cogito v1 preview - 70B

[Blog Post](https://www.deepcogito.com/research/cogito-v1-preview)

The Cogito LLMs are instruction-tuned generative models (text in/text out). All models are released under an open license for commercial use.

- Cogito models are hybrid reasoning models. Each model can answer directly (standard LLM), or self-reflect before answering (like reasoning models).
- The LLMs are trained using **Iterated Distillation and Amplification (IDA)** - a scalable and efficient alignment strategy for superintelligence using iterative self-improvement.
- The models have been optimized for coding, STEM, instruction following and general helpfulness, and have significantly higher multilingual, coding and tool calling capabilities than size-equivalent counterparts.
- In both standard and reasoning modes, Cogito v1-preview models outperform their size-equivalent counterparts on common industry benchmarks.
- Each model is trained in over 30 languages and supports a context length of 128k.

# Evaluations
We compare our models against state-of-the-art, size-equivalent models in both direct mode and reasoning mode. For direct mode, we compare against the Llama / Qwen instruct counterparts. For reasoning, we use DeepSeek's R1-distilled counterparts / Qwen's QwQ model.

<p align="left">
  <img src="images/70b_benchmarks.png" alt="70B benchmarks" width="90%">
</p>

**LiveBench Global Average:**
<p align="left">
  <img src="images/livebench_global_average.png" alt="LiveBench global average" width="80%">
</p>

For detailed evaluations, please refer to the [Blog Post](https://www.deepcogito.com/research/cogito-v1-preview).


# Usage
Here is a snippet for usage with Transformers:

```python
import transformers
import torch

model_id = "deepcogito/cogito-v1-preview-llama-70B"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Give me a short introduction to LLMs."},
]

outputs = pipeline(
    messages,
    max_new_tokens=512,
)

print(outputs[0]["generated_text"][-1])
```



## Implementing extended thinking
- By default, the model answers in standard mode.
- To enable thinking, use either of the following two methods:
  - Add a specific system prompt, or
  - Set `enable_thinking=True` while applying the chat template.


### Method 1 - Add a specific system prompt
To enable thinking, simply use this as the system prompt: `system_instruction = 'Enable deep thinking subroutine.'`

If you already have a `system_instruction`, then use `system_instruction = 'Enable deep thinking subroutine.' + '\n\n' + system_instruction`.

Here is an example -

```python
import transformers
import torch

model_id = "deepcogito/cogito-v1-preview-llama-70B"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

DEEP_THINKING_INSTRUCTION = "Enable deep thinking subroutine."

messages = [
    {"role": "system", "content": DEEP_THINKING_INSTRUCTION},
    {"role": "user", "content": "Write a bash script that takes a matrix represented as a string with format '[1,2],[3,4],[5,6]' and prints the transpose in the same format."},
]

outputs = pipeline(
    messages,
    max_new_tokens=512,
)

print(outputs[0]["generated_text"][-1])
```


Similarly, if you already have a system prompt, you can prepend `DEEP_THINKING_INSTRUCTION` to it in this way -

```python
DEEP_THINKING_INSTRUCTION = "Enable deep thinking subroutine."

system_prompt = "Reply to each prompt with only the actual code - no explanations."
prompt = "Write a bash script that takes a matrix represented as a string with format '[1,2],[3,4],[5,6]' and prints the transpose in the same format."

messages = [
    {"role": "system", "content": DEEP_THINKING_INSTRUCTION + '\n\n' + system_prompt},
    {"role": "user", "content": prompt}
]
```
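
These `messages` can then be passed to the same pipeline call as in the earlier examples; a minimal sketch, reusing the `pipeline` object defined above:

```python
# Reuse the pipeline defined earlier; the prepended instruction enables thinking.
outputs = pipeline(
    messages,
    max_new_tokens=512,
)
print(outputs[0]["generated_text"][-1])
```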

### Method 2 - Set enable_thinking=True in the tokenizer
If you are using Hugging Face tokenizers, you can simply add the argument `enable_thinking=True` when applying the chat template (this option is built into the chat template).

Here is an example -
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepcogito/cogito-v1-preview-llama-70B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = "Give me a short introduction to LLMs."
messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
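
Conversely, for standard (non-thinking) mode with this same workflow, simply leave out `enable_thinking` when applying the chat template; only the templating call changes (a minimal sketch):

```python
# Standard mode: omit enable_thinking, and the model answers directly by default.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
```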

# Tool Calling
Cogito models support tool calling (single, parallel, multiple and parallel_multiple) in both standard and extended thinking modes.

Here is a snippet -

```python
# This example reuses the `model` and `tokenizer` loaded in the previous section.

# First, define a tool
def get_current_temperature(location: str) -> float:
    """
    Get the current temperature at a location.

    Args:
        location: The location to get the temperature for, in the format "City, Country"
    Returns:
        The current temperature at the specified location in the specified units, as a float.
    """
    return 22.  # A real function should probably actually get the temperature!

# Next, create a chat and apply the chat template
messages = [
    {"role": "user", "content": "Hey, what's the temperature in Paris right now?"}
]

text = tokenizer.apply_chat_template(messages, tools=[get_current_temperature], add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
output_text = tokenizer.batch_decode(outputs)[0][len(text):]
print(output_text)
```

This will result in the output -
```
<tool_call>
{"name": "get_current_temperature", "arguments": {"location": "Paris, France"}}
</tool_call><|eot_id|>
```

If the model generates a tool call, as it does here, you should add it to the chat like so:

```python
tool_call = {"name": "get_current_temperature", "arguments": {"location": "Paris, France"}}
messages.append({"role": "assistant", "tool_calls": [{"type": "function", "function": tool_call}]})
```

and then call the tool and append the result, with the `tool` role, like so:

```python
messages.append({"role": "tool", "name": "get_current_temperature", "content": "22.0"})
```

After that, you can `generate()` again to let the model use the tool result in the chat:

```python
text = tokenizer.apply_chat_template(messages, tools=[get_current_temperature], add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
output_text = tokenizer.batch_decode(outputs)[0][len(text):]
```

This should result in the string -
```
'The current temperature in Paris is 22.0 degrees.<|eot_id|>'
```
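
Since tool calling also works in extended thinking mode, the two features can be combined. Below is a minimal sketch that reuses `model`, `tokenizer` and `get_current_temperature` from above and simply adds the deep-thinking system prompt; the generation budget is raised to leave room for the reasoning trace, and the exact output will differ from the standard-mode example.

```python
DEEP_THINKING_INSTRUCTION = "Enable deep thinking subroutine."

messages = [
    {"role": "system", "content": DEEP_THINKING_INSTRUCTION},
    {"role": "user", "content": "Hey, what's the temperature in Paris right now?"},
]

# Same templating and generation flow as above, with thinking enabled via the system prompt.
text = tokenizer.apply_chat_template(messages, tools=[get_current_temperature], add_generation_prompt=True, tokenize=False)
inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.batch_decode(outputs)[0][len(text):])
```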

## License
This repository and the model weights are licensed under the [Llama 3.1 Community License Agreement](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE) (Llama models' default license agreement).

## Contact
If you would like to reach out to our team, send an email to [[email protected]]([email protected]).