lordChipotle/Llama3GRPOReasoning

Reinforcement Learning


lordChipotle committed on Jul 13

Commit df619ee · verified · 1 Parent(s): e2a1961

Update README.md

Files changed (1)

README.md +1 -3


@@ -196,9 +196,7 @@ $$\text{GRPO} = -\mathbb{E}_{(s,a)}\left[\frac{\pi(a|s)}{\pi_{\text{old}}(a|s)}
 1. **Concatenate** `prompt_ids + completion_ids`
-2. **Run forward pass through old policy** to compute
-   $$\pi_{\text{old}}(a|s)$$
+2. **Run forward pass through old policy** to compute \\(\pi_{\text{old}}(a|s)\\)
    - This actually happens only once at the first iteration when we create the rollout
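
The two README steps touched by this hunk can be sketched in plain Python. This is a minimal illustration, not the repository's implementation: `toy_old_policy`, `completion_logprobs`, and the 4-token vocabulary are hypothetical stand-ins, and a real setup would call the frozen model's forward pass instead. It assumes a non-empty prompt and a policy that returns one row of vocabulary logits per input position.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one row of vocabulary logits.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def toy_old_policy(input_ids, vocab_size=4):
    # Hypothetical stand-in for the old policy's forward pass:
    # at each position it slightly favors the token (current + 1) % vocab_size.
    rows = []
    for tok in input_ids:
        row = [0.0] * vocab_size
        row[(tok + 1) % vocab_size] = 1.0
        rows.append(row)
    return rows

def completion_logprobs(prompt_ids, completion_ids, policy):
    # Step 1: concatenate prompt_ids + completion_ids.
    input_ids = prompt_ids + completion_ids
    # Step 2: a single forward pass through the old policy.
    logits = policy(input_ids)
    logps = []
    for i, tok in enumerate(completion_ids):
        # Logits at position p score the token at position p + 1.
        pos = len(prompt_ids) + i - 1
        logps.append(log_softmax(logits[pos])[tok])
    return logps

# Computed once when the rollout is created, then reused as log pi_old(a|s).
old_logps = completion_logprobs([1, 2], [3, 0, 1], toy_old_policy)
```

Storing `old_logps` alongside the rollout matches the README's note that this pass happens only once, at the first iteration.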