---
library_name: transformers
pipeline_tag: text-generation
license: cc-by-nc-4.0
---

This repository contains the Guru-32B model (based on Qwen2.5-32B) presented in [Revisiting Reinforcement Learning for LLM Reasoning from A Cross-Domain Perspective](https://huggingface.co/papers/2506.14965).

The leaderboard below is produced with our evaluation [code](https://github.com/LLM360/Reasoning360/tree/main/scripts/offline_eval). All models are evaluated with temperature=1.0 and top_p=0.7; avg@k scores are averaged over k sampled completions per problem (see the sketch after the table).

| **Domain** | **Benchmark** | **GURU 7B** | **General Reasoner 7B** | **ORZ 7B◇** | **SimpleRL 7B** | **GURU 32B** | **ORZ 32B◇** | **SimpleRL 32B** |
|----------------|--------------------------------|------------:|-------------------------:|------------:|---------------:|-------------:|--------------:|-----------------:|
| **Math** | AIME24 (avg@32) | 17.50 | 17.08 | 16.25 | 15.60 | 34.89 | 47.50 | 27.20 |
| | MATH500 | 77.25 | 70.40 | 80.80 | 87.00 | 86.00 | 89.80 | 89.60 |
| **Code** | LiveCodeBench (avg@4) | 16.49 | 8.51 | 5.47 | 6.72 | 29.30 | 22.04 | 19.80 |
| | HumanEval (avg@4) | 82.62 | 61.12 | 67.38 | 58.08 | 90.85 | 84.30 | 81.25 |
| | MBPP | 70.00 | 39.80 | 48.40 | 49.60 | 78.80 | 74.20 | 76.75 |
| **Science** | GPQA-diamond (avg@4) | 40.78 | 38.64 | 37.63 | 35.98 | 50.63 | 55.67 | 46.46 |
| | SuperGPQA | 31.80 | 30.64 | 29.75 | 27.29 | 43.60 | 46.05 | 37.73 |
| **Logic** | ARC-AGI (avg@4) | 3.31 | 0.75 | 0.00 | 0.50 | 7.63 | 2.31 | 5.25 |
| | Zebra Puzzle (avg@4) | 39.40 | 0.07 | 1.00 | 0.62 | 45.21 | 0.54 | 1.16 |
| **Simulation** | CodeI/O (avg@4) | 15.63 | 7.13 | 5.13 | 6.63 | 12.63 | 3.75 | 9.75 |
| | CruxEval-I | 61.72 | 63.63 | 69.38 | 56.25 | 80.63 | 71.13 | 72.63 |
| | CruxEval-O | 71.28 | 56.50 | 65.88 | 58.31 | 88.75 | 82.38 | 67.75 |
| **Tabular** | FinQA | 34.70 | 34.33 | 37.60 | 35.10 | 46.14 | 45.20 | 45.41 |
| | HiTab | 74.20 | 54.40 | 54.10 | 50.40 | 82.00 | 63.30 | 69.00 |
| | MultiHiertt (avg@4) | 44.94 | 31.62 | 38.10 | 37.57 | 55.28 | 52.83 | 52.83 |
| **Others** | IFEval | 35.81 | 39.56 | 32.72 | 36.69 | 55.45 | 38.26 | 55.27 |
| | LiveBench | 18.57 | 19.76 | 12.64 | 15.20 | 34.30 | 28.78 | 28.33 |
| | **Average Score** | **43.29** | **33.76** | **35.42** | **33.97** | **54.24** | **47.53** | **46.25** |
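
A minimal sketch of the avg@k metric reported above (e.g. avg@32 for AIME24): k completions are sampled per problem and per-sample accuracy is averaged. `generate_answer` and `is_correct` are hypothetical stand-ins for the model call and the benchmark's answer checker; the linked evaluation scripts are the authoritative implementation.

```python
def avg_at_k(problems, generate_answer, is_correct, k=32):
    """Average per-sample accuracy over k sampled completions per problem."""
    scores = []
    for problem in problems:
        # Sample k completions and count how many pass the checker.
        hits = sum(is_correct(problem, generate_answer(problem)) for _ in range(k))
        scores.append(hits / k)
    # Report as a percentage, matching the leaderboard table.
    return 100.0 * sum(scores) / len(scores)
```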

Example usage:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "LLM360/Guru-32B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")

# Build a chat prompt; add_generation_prompt appends the assistant turn header.
messages = [{"role": "user", "content": "What is reinforcement learning?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Sampling is enabled so temperature/top_p take effect (matching the evaluation settings).
outputs = model.generate(inputs, max_new_tokens=256, do_sample=True, temperature=1.0, top_p=0.7)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```
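
The sampling settings mirror the evaluation configuration above. For deterministic output, omit `do_sample=True` to use greedy decoding (the transformers default), in which case `temperature` and `top_p` are ignored.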

Please refer to the [paper](https://arxiv.org/abs/2506.14965) for more details.