Update README.md
README.md (CHANGED)
@@ -137,7 +137,7 @@ Each rank performs the following operations:
 1. **Decodes completions**
 2. **Computes reward** for (prompt, completion) pairs
 3. **Gathers rewards** from other ranks (because it's possible for a given prompt to have its replica across GPUs)
-4. **Normalizes rewards** by mean/std ⟹ This gives us advantages
+4. **Normalizes rewards** by mean/std ⟹ This gives us advantages \\(A(s,a)\\)
 5. **Discards completions** for prompts it doesn't own (called alien prompts)

 ### Concrete Example: Multi-GPU Setup
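The normalization in step 4 is the core of GRPO's advantage estimate: the rewards of one prompt's completions are compared only against each other. A minimal sketch, assuming rewards have already been gathered across ranks and ordered so that each prompt's `num_generations` completions are contiguous (`compute_advantages` and the shapes below are illustrative, not code from this repo):

```python
import torch

def compute_advantages(rewards: torch.Tensor, num_generations: int, eps: float = 1e-4) -> torch.Tensor:
    """Group-normalize rewards so each completion is scored relative to the
    other completions sampled for the same prompt."""
    grouped = rewards.view(-1, num_generations)        # (num_prompts, G)
    mean = grouped.mean(dim=1, keepdim=True)           # per-prompt mean reward
    std = grouped.std(dim=1, keepdim=True)             # per-prompt reward std
    advantages = (grouped - mean) / (std + eps)        # A(s, a) for every completion
    return advantages.view(-1)

# Toy example: 2 prompts x 4 completions, e.g. gathered from 2 GPUs
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0,   # prompt 0
                        0.5, 0.5, 1.0, 0.0])  # prompt 1
print(compute_advantages(rewards, num_generations=4))
```

After this, each rank keeps only the advantage slice for the prompts it owns and drops the alien ones (step 5).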
@@ -201,16 +201,16 @@ $$\text{GRPO} = -\mathbb{E}_{(s,a)}\left[\frac{\pi(a|s)}{\pi_{\text{old}}(a|s)}

    - This actually happens only once at the first iteration when we create the rollout

-4. **Run forward pass through ref policy** to compute
+4. **Run forward pass through ref policy** to compute \\(\pi_{\text{ref}}(a|s)\\)
    - This actually happens only once at the first iteration when we create the rollout
    - Ref model is the original model without LoRA adapters

-5. **Run forward pass through current policy** to compute
+5. **Run forward pass through current policy** to compute \\(\pi(a|s)\\)
    - Needed only if `num_iterations > 1`; otherwise the same as old policy

-6. **Compute KL loss** between
+6. **Compute KL loss** between \\(\pi(a|s)\\) and \\(\pi_{\text{ref}}(a|s)\\)

-7. **Compute advantage-weighted logprobs:**
+7. **Compute advantage-weighted logprobs:** \\(\frac{\pi(a|s)}{\pi_{\text{old}}(a|s)} \times A(s,a)\\)

 ## Workflow Summary

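To make steps 4-6 in that hunk concrete, here is a hedged sketch of the per-token log-probabilities and the per-token KL estimate. It ignores padding, attention masks, and batching details, and the estimator shown (exp(d) - d - 1) is one common low-variance choice rather than necessarily the exact one used in this repo:

```python
import torch
import torch.nn.functional as F

def per_token_logps(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Log-probability of each realized token under the policy that produced `logits`."""
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)   # position t predicts token t+1
    targets = input_ids[:, 1:].unsqueeze(-1)            # realized next tokens
    return torch.gather(log_probs, dim=-1, index=targets).squeeze(-1)

def per_token_kl(cur_logps: torch.Tensor, ref_logps: torch.Tensor) -> torch.Tensor:
    """Non-negative per-token estimate of KL(pi || pi_ref): exp(d) - d - 1, with d = ref - cur."""
    delta = ref_logps - cur_logps
    return torch.exp(delta) - delta - 1

# Toy shapes standing in for the forward passes through the ref and current policies
B, T, V = 2, 6, 11
ids = torch.randint(0, V, (B, T))
cur_lp = per_token_logps(torch.randn(B, T, V), ids)   # current policy (LoRA adapters active)
ref_lp = per_token_logps(torch.randn(B, T, V), ids)   # ref policy (adapters disabled)
print(per_token_kl(cur_lp, ref_lp).shape)             # (B, T-1), every entry >= 0
```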
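Building on the helpers in the previous sketch, step 7 and the loss implied by the formula above combine as follows. Clipping, completion masks, and the exact averaging convention vary between implementations and are omitted; `beta` is an illustrative name for the KL coefficient:

```python
import torch

def grpo_per_token_loss(cur_lp, old_lp, ref_lp, advantages, beta=0.04):
    """-(ratio * A) + beta * KL, computed per token.
    cur_lp / old_lp / ref_lp: (batch, seq) per-token logprobs; advantages: (batch,)."""
    ratio = torch.exp(cur_lp - old_lp)            # pi(a|s) / pi_old(a|s), per token
    delta = ref_lp - cur_lp
    kl = torch.exp(delta) - delta - 1             # same KL estimator as above
    return -(ratio * advantages.unsqueeze(1) - beta * kl)

# With num_iterations == 1 the old policy equals the current one, so the ratio is exactly 1
logps = torch.log(torch.rand(2, 5))
adv = torch.tensor([0.7, -0.7])
print(grpo_per_token_loss(logps, logps, torch.log(torch.rand(2, 5)), adv).mean())
```

A mean over the valid completion tokens then gives the scalar loss that gets backpropagated.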