Update README.md
README.md
CHANGED
@@ -196,9 +196,7 @@ $$\text{GRPO} = -\mathbb{E}_{(s,a)}\left[\frac{\pi(a|s)}{\pi_{\text{old}}(a|s)}
 
 1. **Concatenate** `prompt_ids + completion_ids`
 
-2. **Run forward pass through old policy** to compute
-
-$$\pi_{\text{old}}(a|s)$$
+2. **Run forward pass through old policy** to compute \\(\pi_{\text{old}}(a|s)\\)
 
 
 - This actually happens only once at the first iteration when we create the rollout
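For readers of this change, the two steps in the edited list can be sketched roughly as below. This is a minimal pure-Python illustration, not the README's or any library's implementation: `toy_old_policy`, `log_softmax`, and `completion_logprobs` are hypothetical names, the toy policy returns made-up deterministic logits in place of a real model forward pass, and the sketch assumes the usual causal-LM convention that the logits at position t score the token at position t + 1.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one row of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def toy_old_policy(input_ids, vocab_size=5):
    """Hypothetical stand-in for the old policy.

    Returns one row of logits per input position. A real policy would be
    a language-model forward pass; these values are made up.
    """
    return [
        [float((tok + j) % vocab_size) for j in range(vocab_size)]
        for tok in input_ids
    ]


def completion_logprobs(policy, prompt_ids, completion_ids):
    """Per-token log pi_old(a|s) for the completion tokens only."""
    assert prompt_ids, "sketch assumes a non-empty prompt"
    # Step 1: concatenate prompt_ids + completion_ids.
    input_ids = prompt_ids + completion_ids
    # Step 2: a single forward pass through the (old) policy.
    logits = policy(input_ids)
    logps = []
    for i, tok in enumerate(completion_ids):
        # Assumed convention: logits at position t predict token t + 1.
        pos = len(prompt_ids) + i - 1
        logps.append(log_softmax(logits[pos])[tok])
    return logps


# Computed once when the rollout is created, then reused as a constant.
old_logps = completion_logprobs(toy_old_policy, [1, 2, 3], [4, 0])
```

Because the old-policy values are fixed at rollout time, a training loop under these assumptions would store `old_logps` alongside the rollout rather than recompute them at every optimization step.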