luckeciano committed (verified)

Commit f4fe0dd · 1 parent: 80ecb39

Model save

Files changed (4):
  1. README.md +5 -7
  2. all_results.json +4 -4
  3. train_results.json +4 -4
  4. trainer_state.json +0 -0
README.md CHANGED
@@ -1,19 +1,17 @@
 ---
 base_model: Qwen/Qwen2.5-Math-7B
-datasets: DigitalLearningGmbH/MATH-lighteval
 library_name: transformers
-model_name: Qwen-2.5-7B-GRPO-Base-32Action_647
+model_name: Qwen-2.5-7B-GRPO-Base-32Action_173
 tags:
 - generated_from_trainer
-- open-r1
 - trl
 - grpo
 licence: license
 ---
 
-# Model Card for Qwen-2.5-7B-GRPO-Base-32Action_647
+# Model Card for Qwen-2.5-7B-GRPO-Base-32Action_173
 
-This model is a fine-tuned version of [Qwen/Qwen2.5-Math-7B](https://huggingface.co/Qwen/Qwen2.5-Math-7B) on the [DigitalLearningGmbH/MATH-lighteval](https://huggingface.co/datasets/DigitalLearningGmbH/MATH-lighteval) dataset.
+This model is a fine-tuned version of [Qwen/Qwen2.5-Math-7B](https://huggingface.co/Qwen/Qwen2.5-Math-7B).
 It has been trained using [TRL](https://github.com/huggingface/trl).
 
 ## Quick start
@@ -22,14 +20,14 @@ It has been trained using [TRL](https://github.com/huggingface/trl).
 from transformers import pipeline
 
 question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
-generator = pipeline("text-generation", model="luckeciano/Qwen-2.5-7B-GRPO-Base-32Action_647", device="cuda")
+generator = pipeline("text-generation", model="luckeciano/Qwen-2.5-7B-GRPO-Base-32Action_173", device="cuda")
 output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
 print(output["generated_text"])
 ```
 
 ## Training procedure
 
-[<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/max-ent-llms/PolicyGradientStability/runs/qj1xvx50)
+[<img src="https://raw.githubusercontent.com/wandb/assets/main/wandb-github-badge-28.svg" alt="Visualize in Weights & Biases" width="150" height="24"/>](https://wandb.ai/max-ent-llms/PolicyGradientStability/runs/34t8qyp3)
 
 
 This model was trained with GRPO, a method introduced in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300).
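For context on the method named in the card: the snippet below is a minimal sketch of a GRPO run with TRL's `GRPOTrainer`. Only the base model name comes from the card; the dataset and the length-based reward function are illustrative placeholders, since this commit does not show the actual training script, reward, or hyperparameters.

```python
# Minimal GRPO sketch with TRL (illustrative; not the training script behind this commit).
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder dataset with a "prompt" column, as GRPOTrainer expects.
dataset = load_dataset("trl-lib/tldr", split="train")

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions close to 50 characters.
    # The real run's reward function is not shown in this commit.
    return [-abs(50 - len(c)) for c in completions]

training_args = GRPOConfig(output_dir="Qwen-2.5-7B-GRPO-sketch")
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-Math-7B",  # base model named in the card
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```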
all_results.json CHANGED
@@ -1,8 +1,8 @@
 {
     "total_flos": 0.0,
-    "train_loss": 2.261561673488899e-09,
-    "train_runtime": 16912.9255,
+    "train_loss": -5.137796188492417e-10,
+    "train_runtime": 18587.2058,
     "train_samples": 7500,
-    "train_samples_per_second": 0.568,
-    "train_steps_per_second": 0.006
+    "train_samples_per_second": 0.516,
+    "train_steps_per_second": 0.005
 }
train_results.json CHANGED
@@ -1,8 +1,8 @@
 {
     "total_flos": 0.0,
-    "train_loss": 2.261561673488899e-09,
-    "train_runtime": 16912.9255,
+    "train_loss": -5.137796188492417e-10,
+    "train_runtime": 18587.2058,
     "train_samples": 7500,
-    "train_samples_per_second": 0.568,
-    "train_steps_per_second": 0.006
+    "train_samples_per_second": 0.516,
+    "train_steps_per_second": 0.005
 }
trainer_state.json CHANGED
The diff for this file is too large to render. See raw diff
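For reference, `train_results.json`, `all_results.json`, and `trainer_state.json` are the standard artifacts written at the end of a `transformers` `Trainer` run, which is why all three change together in this commit. A typical end-of-training sequence (a sketch, not code from this repo) looks like:

```python
# Sketch of how these result files are typically produced (not code from this repo);
# `trainer` is assumed to be an already-configured transformers Trainer.
train_result = trainer.train()
metrics = train_result.metrics

trainer.log_metrics("train", metrics)   # logs the metrics to the console
trainer.save_metrics("train", metrics)  # writes train_results.json and updates all_results.json
trainer.save_state()                    # writes trainer_state.json
```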