vwxyzjn committed · Commit b266600 · verified · 1 Parent(s): c0b7960

Update README.md

Files changed (1)
  1. README.md +81 -0
README.md CHANGED
@@ -103,6 +103,87 @@ See the Falcon 180B model card for an example of this.
 | **OLMo-2-32B-0325-Instruct** | 68.8 | 42.8 | 70.6 | 78.0 | 87.6 | 85.6 | 49.7 | 77.3 | 85.9 | 37.5 | 73.2 |
 
 
+
+
+ ## Learning curves
+
+ Below are the training curves for `allenai/OLMo-2-0325-32B-Instruct`. The model was trained on five 8xH100 nodes.
+
+ ![](olmo-32b-instruct-learning-curve.png)
+
+ ![](olmo-32b-instruct-learning-curve-time)
+
+ Below are the core eval scores over steps for `allenai/OLMo-2-0325-32B-Instruct` (note we took step `320` as the final checkpoint, corresponding to episode `573,440`):
+
+ ![](olmo-32b-instruct-eval-curve.png)
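+
+ As a back-of-the-envelope check on the step/episode bookkeeping (a sketch only, assuming each step gathers `local_rollout_batch_size` prompts per actor GPU and `number_samples_per_prompt` completions per prompt; the flag values come from the reproduction command below):
+
+ ```bash
+ # assumed accounting, not an official formula: 28 actor GPUs (8+8+8+4),
+ # 4 prompts per GPU per step, 16 sampled completions per prompt
+ echo $(( (8 + 8 + 8 + 4) * 4 * 16 ))        # 1792 episodes per step
+ echo $(( (8 + 8 + 8 + 4) * 4 * 16 * 320 ))  # 573440 episodes after 320 steps
+ ```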
+
+ Below are the other eval scores over steps for `allenai/OLMo-2-0325-32B-Instruct`:
+
+ ![](olmo-32b-instruct-full-eval-curve.png)
+
+
+ ## Reproduction command
+
+ The command below is copied directly from the tracked training job:
+
+ ```bash
+ # clone and check out commit
+ git clone https://github.com/allenai/open-instruct.git
+ # this should be the correct commit, the main thing is to have the vllm monkey patch for
+ # 32b olmo https://github.com/allenai/open-instruct/blob/894ffa236319bc6c26c346240a7e4ee04ba0bd31/open_instruct/vllm_utils2.py#L37-L59
+ git checkout a51dc98525eec01de6e8a24c071f42dce407d738
+ uv sync
+ uv sync --extra compile
+
+ # note that you may need 5 8xH100 nodes for the training.
+ # so please setup ray properly, e.g., https://github.com/allenai/open-instruct/blob/main/docs/tulu3.md#llama-31-tulu-3-70b-reproduction
+ python open_instruct/grpo_vllm_thread_ray_gtrl.py \
+ --exp_name 0310_olmo2_32b_grpo_12818 \
+ --beta 0.01 \
+ --local_mini_batch_size 32 \
+ --number_samples_per_prompt 16 \
+ --output_dir output \
+ --local_rollout_batch_size 4 \
+ --kl_estimator kl3 \
+ --learning_rate 5e-7 \
+ --dataset_mixer_list allenai/RLVR-GSM-MATH-IF-Mixed-Constraints 1.0 \
+ --dataset_mixer_list_splits train \
+ --dataset_mixer_eval_list allenai/RLVR-GSM-MATH-IF-Mixed-Constraints 16 \
+ --dataset_mixer_eval_list_splits train \
+ --max_token_length 2048 \
+ --max_prompt_token_length 2048 \
+ --response_length 2048 \
+ --model_name_or_path allenai/OLMo-2-0325-32B-DPO \
+ --non_stop_penalty \
+ --stop_token eos \
+ --temperature 1.0 \
+ --ground_truths_key ground_truth \
+ --chat_template_name tulu \
+ --sft_messages_key messages \
+ --eval_max_length 4096 \
+ --total_episodes 10000000 \
+ --penalty_reward_value 0.0 \
+ --deepspeed_stage 3 \
+ --no_gather_whole_model \
+ --per_device_train_batch_size 2 \
+ --local_rollout_forward_batch_size 2 \
+ --actor_num_gpus_per_node 8 8 8 4 \
+ --num_epochs 1 \
+ --vllm_tensor_parallel_size 1 \
+ --vllm_num_engines 12 \
+ --lr_scheduler_type constant \
+ --apply_verifiable_reward true \
+ --seed 1 \
+ --num_evals 30 \
+ --save_freq 20 \
+ --reward_model_multiplier 0.0 \
+ --no_try_launch_beaker_eval_jobs \
+ --try_launch_beaker_eval_jobs_on_weka \
+ --gradient_checkpointing \
+ --with_tracking
+ ```
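+
+ The linked tulu3 doc describes the multi-node Ray setup in detail; purely as an orientation sketch (the port and address below are illustrative placeholders, not taken from the tracked job), wiring the five nodes together before launching the command above could look like:
+
+ ```bash
+ # on the head node (placeholder port; pick one reachable from the other nodes)
+ ray start --head --port=6379
+
+ # on each of the remaining four nodes, join the cluster
+ ray start --address=<head_node_ip>:6379
+
+ # confirm that all 5 nodes (40 GPUs) are registered before launching training
+ ray status
+ ```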
+
+
 ## License and use
 
 OLMo 2 is licensed under the Apache 2.0 license.