lordChipotle/Llama3GRPOReasoning

Reinforcement Learning

lordChipotle committed on Jul 13

Commit e2a1961 · verified · 1 Parent(s): 7f5ec28

Update README.md

Files changed (1)

README.md +9 -10

README.md CHANGED

@@ -196,19 +196,23 @@ $$\text{GRPO} = -\mathbb{E}_{(s,a)}\left[\frac{\pi(a|s)}{\pi_{\text{old}}(a|s)}
 1. **Concatenate** `prompt_ids + completion_ids`
-2. **Run forward pass through old policy** to compute $$\pi_{\text{old}}(a|s)$$
    - This actually happens only once at the first iteration when we create the rollout
-3. **Run forward pass through ref policy** to compute $$\pi_{\text{ref}}(a|s)$$
    - This actually happens only once at the first iteration when we create the rollout
    - Ref model is the original model without LoRA adapters
-4. **Run forward pass through current policy** to compute $$\pi(a|s)$$
    - Needed only if `num_iterations > 1`; otherwise the same as old policy
-5. **Compute KL loss** between $$\pi(a|s)$$ and $$\pi_{\text{ref}}(a|s)$$
-6. **Compute advantage-weighted logprobs:** $$\frac{\pi(a|s)}{\pi_{\text{old}}(a|s)} \times A(s,a)$$
 ## Workflow Summary
@@ -305,11 +309,6 @@ Mark wants 12 total pieces of fruit. He already has 3 apples and 4 bananas, whic
 </answer>
 ```
-## Performance
-- **Dataset**: GSM8K test split (1,319 examples)
-- **Evaluation Metric**: Exact match accuracy on final numerical answers
-- **Performance**: [Insert actual accuracy from your evaluation]
 ## Technical Details

 1. **Concatenate** `prompt_ids + completion_ids`
+2. **Run forward pass through old policy** to compute
+   $$\pi_{\text{old}}(a|s)$$
    - This actually happens only once at the first iteration when we create the rollout
+4. **Run forward pass through ref policy** to compute $$\pi_{\text{ref}}(a|s)$$
    - This actually happens only once at the first iteration when we create the rollout
    - Ref model is the original model without LoRA adapters
+5. **Run forward pass through current policy** to compute $$\pi(a|s)$$
    - Needed only if `num_iterations > 1`; otherwise the same as old policy
+6. **Compute KL loss** between $$\pi(a|s)$$ and $$\pi_{\text{ref}}(a|s)$$
+7. **Compute advantage-weighted logprobs:** $$\frac{\pi(a|s)}{\pi_{\text{old}}(a|s)} \times A(s,a)$$
 ## Workflow Summary
 </answer>
 ```
 ## Technical Details
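
The loss steps listed in this diff (old-policy, reference-policy and current-policy log-probabilities, a KL term, and the advantage-weighted ratio) can be sketched in plain Python. This is a minimal illustration, not the repository's training code. The function names, the `beta` coefficient and its default, and the choice of KL estimator are all assumptions made for the example. Ratio clipping is left out because the README lines shown here do not mention it.

```python
import math


def grpo_token_loss(logp, logp_old, logp_ref, advantage, beta=0.04):
    """Per-token loss sketch: -(ratio * A - beta * KL).

    All names and the beta default are hypothetical, chosen for illustration.
    """
    # Importance ratio pi(a|s) / pi_old(a|s), computed from log-probabilities.
    ratio = math.exp(logp - logp_old)
    # Assumed KL estimator between pi and pi_ref: exp(d) - d - 1 with
    # d = logp_ref - logp. It is non-negative and zero when the two agree.
    delta = logp_ref - logp
    kl = math.exp(delta) - delta - 1.0
    return -(ratio * advantage - beta * kl)


def grpo_loss(logps, logps_old, logps_ref, advantages, beta=0.04):
    """Mean of the per-token losses over one completion."""
    losses = [
        grpo_token_loss(lp, lo, lr, adv, beta)
        for lp, lo, lr, adv in zip(logps, logps_old, logps_ref, advantages)
    ]
    return sum(losses) / len(losses)
```

With a single optimization iteration per rollout, the current and old policies coincide, so `logps_old` can simply be the same values as `logps`. The ratio is then 1 and the policy term reduces to the advantage itself.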