---
base_model: Qwen/Qwen2.5-3B-Instruct
library_name: transformers
model_name: qwen-2.5-3b-r1-countdown
tags:
- generated_from_trainer
- trl
- grpo
- r1
- rl
license: qwen-research
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
---

# Model Card for `qwen-2.5-3b-r1-countdown`, a mini R1 experiment

This model is a fine-tuned version of [Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct).
It has been trained using [TRL](https://github.com/huggingface/trl) and GRPO on the Countdown game.

If you want to learn how to replicate this model and reproduce your own DeepSeek R1 "aha" moment, check out my [blog post](https://www.philschmid.com/mini-deepseek-r1).

## Quick start

```python
from vllm import LLM, SamplingParams
from datasets import load_dataset
from random import randint

sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=512)

# use a revision without checkpoint folders, as vLLM downloads the full repository snapshot
llm = LLM(model="philschmid/qwen-2.5-3b-r1-countdown", revision="099c0f8cbfc522e7c3a476edfb749f576b164539")

# load the Countdown dataset from the Hugging Face Hub and pick a random sample
dataset_id = "Jiayi-Pan/Countdown-Tasks-3to4"
dataset = load_dataset(dataset_id, split="train")
sample = dataset[randint(0, len(dataset) - 1)]

# create the conversation, pre-filling the assistant turn so generation continues inside <think>
messages = [
    {"role": "system", "content": "You are a helpful assistant. You first thinks about the reasoning process in the mind and then provides the user with the answer."},
    {"role": "user", "content": f"Using the numbers {sample['nums']}, create an equation that equals {sample['target']}. You can use basic arithmetic operations (+, -, *, /) one or multiple times but each number can only be used once. Show your work in <think> </think> tags. And return the final equation in <answer> </answer> tags, for example <answer> (1 + 2) / 3 </answer>. Think step by step inside <think> tags."},
    {"role": "assistant", "content": "Let me solve this step by step.\n<think>"},
]
# generate a response and prepend the opening <think> tag from the pre-filled assistant turn
res = llm.generate(llm.get_tokenizer().apply_chat_template(messages, tokenize=False, continue_final_message=True), sampling_params)
res = "<think>" + res[0].outputs[0].text
print(res)

# Example output:
# <think> We need to use the numbers 37, 15, 4, and 13 with basic arithmetic operations to make 16. Let's try different combinations:
# - 37 - 15 - 4 - 13 = 6 (too low)
# - 37 - 15 + 4 - 13 = 13 (too low)
# - 37 + 15 - 4 - 13 = 35 (too high)
# - 37 - 15 + 4 + 13 = 39 (too high)
# - 15 + 4 + 13 - 37 = -1 (too low)
# - 37 + 15 + 4 - 13 = 43 (too high)
# - 15 + 4 * 13 / 37 = 15 + 52 / 37 (not an integer)
# - 15 * 4 / 37 - 37 = -28.24 (not a whole number)
# - 4 * 13 / 15 - 37 = 41.3333 (not a whole number)
# After all combinations, I got not any integer result as 16.
# </think>
# <answer> 37 - 15 + 4 + 13 </answer>
```
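
To check whether a completion actually solves the task, you can score it along the lines of the equation reward described in the blog post: extract the expression from the `<answer>` tags, verify that every provided number is used exactly once, and compare the result to the target. The helper below is a minimal sketch that reuses the `res` and `sample` variables from the snippet above; the name `check_answer` is illustrative and not part of the original code.

```python
import re

def check_answer(completion: str, nums: list[int], target: int) -> bool:
    # pull the equation out of the <answer> ... </answer> tags
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return False
    equation = match.group(1).strip()
    # allow only digits, basic operators, parentheses and whitespace before evaluating
    if not re.fullmatch(r"[\d+\-*/().\s]+", equation):
        return False
    # every provided number must be used exactly once
    if sorted(int(n) for n in re.findall(r"\d+", equation)) != sorted(nums):
        return False
    try:
        return abs(eval(equation) - target) < 1e-5
    except (SyntaxError, ZeroDivisionError):
        return False

print(check_answer(res, sample["nums"], sample["target"]))
```

During training, a reward of this kind is combined with a format reward for the `<think>`/`<answer>` structure; see the blog post for the exact functions.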

## Training procedure

This model was trained with GRPO (Group Relative Policy Optimization), a method introduced in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300).
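
The blog post linked above walks through the full setup. For orientation only, a heavily simplified GRPO run with TRL's `GRPOTrainer` could look like the sketch below; the prompt construction, the `format_reward` function, and the hyperparameter values are illustrative placeholders, not the configuration actually used for this model.

```python
# Illustrative sketch only -- not the original training script for this model.
import re

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# GRPOTrainer expects a "prompt" column, so build one from the Countdown samples
def make_prompt(example):
    return {
        "prompt": f"Using the numbers {example['nums']}, create an equation that equals "
                  f"{example['target']}. Show your work in <think> </think> tags and return "
                  "the final equation in <answer> </answer> tags."
    }

dataset = load_dataset("Jiayi-Pan/Countdown-Tasks-3to4", split="train").map(make_prompt)

def format_reward(completions, **kwargs):
    # reward completions that follow the <think> ... </think> <answer> ... </answer> format
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return [1.0 if re.search(pattern, c, re.DOTALL) else 0.0 for c in completions]

training_args = GRPOConfig(
    output_dir="qwen-2.5-3b-r1-countdown",
    num_generations=8,          # completions sampled per prompt (placeholder value)
    max_completion_length=512,  # placeholder value
    logging_steps=10,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",
    reward_funcs=[format_reward],  # a real run would also reward correct equations
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

GRPO samples a group of completions per prompt, scores them with the reward functions, and updates the policy towards completions that score above the group average, which is what eventually produces the `<think>`-style reasoning traces.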
82
+
83
+ ### Framework versions
84
+
85
+ - TRL: 0.14.0
86
+ - Transformers: 4.48.1
87
+ - Pytorch: 2.5.1+cu121
88
+ - Datasets: 3.1.0
89
+ - Tokenizers: 0.21.0