---
library_name: transformers
pipeline_tag: text-generation
license: cc-by-nc-4.0
---

This repository contains the Guru-32B model (based on Qwen2.5-32B) presented in [Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective](https://huggingface.co/papers/2506.14965).

The leaderboard below is produced with our evaluation [code](https://github.com/LLM360/Reasoning360/tree/main/scripts/offline_eval). All models are evaluated with temperature=1.0 and top_p=0.7; avg@k scores are averaged over k sampled completions per problem (see the sketch after the table).

| **Domain** | **Benchmark** | **GURU 7B** | **General Reasoner 7B** | **ORZ 7B◇** | **SimpleRL 7B** | **GURU 32B** | **ORZ 32B◇** | **SimpleRL 32B** |
|----------------|--------------------------------|------------:|-------------------------:|------------:|---------------:|-------------:|--------------:|-----------------:|
| **Math** | AIME24 (avg@32) | 17.50 | 17.08 | 16.25 | 15.60 | 34.89 | 47.50 | 27.20 |
| | MATH500 | 77.25 | 70.40 | 80.80 | 87.00 | 86.00 | 89.80 | 89.60 |
| **Code** | LiveCodeBench (avg@4) | 16.49 | 8.51 | 5.47 | 6.72 | 29.30 | 22.04 | 19.80 |
| | HumanEval (avg@4) | 82.62 | 61.12 | 67.38 | 58.08 | 90.85 | 84.30 | 81.25 |
| | MBPP | 70.00 | 39.80 | 48.40 | 49.60 | 78.80 | 74.20 | 76.75 |
| **Science** | GPQA-diamond (avg@4) | 40.78 | 38.64 | 37.63 | 35.98 | 50.63 | 55.67 | 46.46 |
| | SuperGPQA | 31.80 | 30.64 | 29.75 | 27.29 | 43.60 | 46.05 | 37.73 |
| **Logic** | ARC-AGI (avg@4) | 3.31 | 0.75 | 0.00 | 0.50 | 7.63 | 2.31 | 5.25 |
| | Zebra Puzzle (avg@4) | 39.40 | 0.07 | 1.00 | 0.62 | 45.21 | 0.54 | 1.16 |
| **Simulation** | CodeI/O (avg@4) | 15.63 | 7.13 | 5.13 | 6.63 | 12.63 | 3.75 | 9.75 |
| | CruxEval-I | 61.72 | 63.63 | 69.38 | 56.25 | 80.63 | 71.13 | 72.63 |
| | CruxEval-O | 71.28 | 56.50 | 65.88 | 58.31 | 88.75 | 82.38 | 67.75 |
| **Tabular** | FinQA | 34.70 | 34.33 | 37.60 | 35.10 | 46.14 | 45.20 | 45.41 |
| | HiTab | 74.20 | 54.40 | 54.10 | 50.40 | 82.00 | 63.30 | 69.00 |
| | MultiHiertt (avg@4) | 44.94 | 31.62 | 38.10 | 37.57 | 55.28 | 52.83 | 52.83 |
| **Others** | IFEval | 35.81 | 39.56 | 32.72 | 36.69 | 55.45 | 38.26 | 55.27 |
| | LiveBench | 18.57 | 19.76 | 12.64 | 15.20 | 34.30 | 28.78 | 28.33 |
| | **Average Score** | **43.29** | **33.76** | **35.42** | **33.97** | **54.24** | **47.53** | **46.25** |
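
A minimal sketch of the avg@k metric reported above (e.g. avg@32 for AIME24): k completions are sampled per problem and per-sample accuracy is averaged. `generate_answer` and `is_correct` are hypothetical stand-ins for the model call and the benchmark's answer checker; the linked evaluation scripts are the authoritative implementation.

```python
def avg_at_k(problems, generate_answer, is_correct, k=32):
    """Average per-sample accuracy over k sampled completions per problem."""
    scores = []
    for problem in problems:
        # Sample k completions and count how many pass the checker.
        hits = sum(is_correct(problem, generate_answer(problem)) for _ in range(k))
        scores.append(hits / k)
    # Report as a percentage, matching the leaderboard table.
    return 100.0 * sum(scores) / len(scores)
```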

Example usage:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "LLM360/Guru-32B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")

# Build a chat prompt; add_generation_prompt appends the assistant turn header.
messages = [{"role": "user", "content": "What is reinforcement learning?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Sampling is enabled so temperature/top_p take effect (matching the evaluation settings).
outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=1.0, top_p=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
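
The sampling settings mirror the evaluation configuration above. For deterministic output, omit `do_sample=True` to use greedy decoding (the transformers default), in which case `temperature` and `top_p` are ignored.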

Please refer to the [paper](https://arxiv.org/abs/2506.14965) for more details.