Update README.md
README.md CHANGED
@@ -137,7 +137,7 @@ Each rank performs the following operations:
1. **Decodes completions**
2. **Computes reward** for (prompt, completion) pairs
3. **Gathers rewards** from other ranks (because a given prompt may have replicas on other GPUs)
-4. **Normalizes rewards** by mean/std ⟹ This gives us advantages
+4. **Normalizes rewards** by mean/std ⟹ This gives us advantages $$A(s,a)$$
5. **Discards completions** for prompts it doesn't own (called alien prompts)
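
Steps 3-5 can be made concrete with a small PyTorch sketch. This is an illustrative sketch only, assuming an initialized `torch.distributed` process group and that each prompt's `num_generations` completions end up contiguous after the gather; `compute_advantages` and its arguments are hypothetical names, not this repo's actual API.

```python
import torch
import torch.distributed as dist

def compute_advantages(local_rewards: torch.Tensor, num_generations: int) -> torch.Tensor:
    """Illustrative sketch of steps 3-5 (not the repo's actual implementation)."""
    world_size = dist.get_world_size()
    rank = dist.get_rank()

    # 3. Gather rewards from all ranks so every rank sees the full group of
    #    completions for each prompt (a prompt's replicas may live on other GPUs).
    gathered = [torch.zeros_like(local_rewards) for _ in range(world_size)]
    dist.all_gather(gathered, local_rewards)
    rewards = torch.cat(gathered)  # shape: (world_size * local_batch,)

    # 4. Normalize each prompt's group of rewards by its mean/std -> advantages A(s, a).
    #    Assumes each prompt's num_generations completions are contiguous here.
    grouped = rewards.view(-1, num_generations)
    advantages = (grouped - grouped.mean(dim=1, keepdim=True)) / (
        grouped.std(dim=1, keepdim=True) + 1e-4
    )
    advantages = advantages.flatten()

    # 5. Keep only the slice this rank owns; advantages for "alien" prompts are discarded.
    local_batch = local_rewards.numel()
    return advantages[rank * local_batch : (rank + 1) * local_batch]
```
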
### Concrete Example: Multi-GPU Setup

@@ -196,19 +196,19 @@ $$\text{GRPO} = -\mathbb{E}_{(s,a)}\left[\frac{\pi(a|s)}{\pi_{\text{old}}(a|s)} A(s,a)\right]$$

1. **Concatenate** `prompt_ids + completion_ids`

-2. **Run forward pass through old policy** to compute
+2. **Run forward pass through old policy** to compute $$\pi_{\text{old}}(a|s)$$
   - This actually happens only once, at the first iteration, when we create the rollout

-3. **Run forward pass through ref policy** to compute
+3. **Run forward pass through ref policy** to compute $$\pi_{\text{ref}}(a|s)$$
   - This actually happens only once, at the first iteration, when we create the rollout
   - The ref model is the original model without LoRA adapters

-4. **Run forward pass through current policy** to compute
+4. **Run forward pass through current policy** to compute $$\pi(a|s)$$
   - Needed only if `num_iterations > 1`; otherwise it is the same as the old policy

-5. **Compute KL loss** between
+5. **Compute KL loss** between $$\pi(a|s)$$ and $$\pi_{\text{ref}}(a|s)$$

-6. **Compute advantage-weighted logprobs:**
+6. **Compute advantage-weighted logprobs:** $$\frac{\pi(a|s)}{\pi_{\text{old}}(a|s)} \times A(s,a)$$
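
The sketch below strings steps 1-6 together in simplified PyTorch (padding, masking, and importance-ratio clipping are omitted; `completion_logps`, `grpo_loss`, and the `beta` coefficient are illustrative names under those assumptions, not this repo's actual API):

```python
import torch
import torch.nn.functional as F

def completion_logps(model, input_ids, prompt_len):
    """Per-token log-probs of the completion portion of input_ids (illustrative helper)."""
    logits = model(input_ids).logits[:, :-1, :]           # logits that predict token t+1
    logps = F.log_softmax(logits, dim=-1)
    targets = input_ids[:, 1:].unsqueeze(-1)
    token_logps = torch.gather(logps, dim=-1, index=targets).squeeze(-1)
    return token_logps[:, prompt_len - 1:]                # keep only completion positions

def grpo_loss(policy, ref_policy, prompt_ids, completion_ids, old_logps, advantages, beta=0.04):
    # 1. Concatenate prompt and completion tokens.
    input_ids = torch.cat([prompt_ids, completion_ids], dim=1)
    prompt_len = prompt_ids.size(1)

    # 4. pi(a|s): forward pass through the current policy. With num_iterations == 1
    #    this equals the old-policy pass (step 2), which is cached from rollout time.
    logps = completion_logps(policy, input_ids, prompt_len)

    # 3. pi_ref(a|s): reference model, i.e. the base weights with LoRA adapters disabled.
    with torch.no_grad():
        ref_logps = completion_logps(ref_policy, input_ids, prompt_len)

    # 5. Per-token KL penalty between pi and pi_ref (k3 estimator).
    kl = torch.exp(ref_logps - logps) - (ref_logps - logps) - 1

    # 6. Advantage-weighted ratio pi / pi_old, negated so we can minimize.
    ratio = torch.exp(logps - old_logps)  # old_logps cached from the rollout (step 2)
    per_token_loss = -(ratio * advantages.unsqueeze(1) - beta * kl)
    return per_token_loss.mean()
```
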
## Workflow Summary