lordChipotle committed on
Commit df619ee · verified · 1 Parent(s): e2a1961

Update README.md

Files changed (1)
  1. README.md +1 -3
README.md CHANGED
@@ -196,9 +196,7 @@ $$\text{GRPO} = -\mathbb{E}_{(s,a)}\left[\frac{\pi(a|s)}{\pi_{\text{old}}(a|s)}
 
  1. **Concatenate** `prompt_ids + completion_ids`
 
- 2. **Run forward pass through old policy** to compute
-
- $$\pi_{\text{old}}(a|s)$$
+ 2. **Run forward pass through old policy** to compute \\(\pi_{\text{old}}(a|s)\\)
 
 
  - This actually happens only once at the first iteration when we create the rollout
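
For readers of this hunk, the two README steps it touches (concatenate, then one forward pass through the old policy) can be sketched in plain Python. This is only an illustration under stated assumptions, not the README's implementation: `toy_policy`, `log_softmax`, and `completion_logprobs` are hypothetical names, the "policy" is a deterministic stand-in for a language model, and the prompt is assumed to contain at least one token.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one vocabulary-sized list.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def completion_logprobs(policy, prompt_ids, completion_ids):
    """Per-token log pi(a|s) for the completion tokens only."""
    input_ids = prompt_ids + completion_ids   # step 1: concatenate
    logits = policy(input_ids)                # step 2: a single forward pass
    logps = []
    for t in range(len(prompt_ids), len(input_ids)):
        # The logits at position t - 1 are the prediction for token t.
        step_logp = log_softmax(logits[t - 1])
        logps.append(step_logp[input_ids[t]])
    return logps

def toy_policy(input_ids, vocab_size=4):
    # Hypothetical stand-in for a model: one row of logits per input position.
    return [[((tok + 1) * (v + 1) + pos) % 5 * 0.5 for v in range(vocab_size)]
            for pos, tok in enumerate(input_ids)]

prompt_ids = [0, 1]
completion_ids = [2, 3, 1]

# Computed once when the rollout is created, then reused as pi_old.
old_logps = completion_logprobs(toy_policy, prompt_ids, completion_ids)
```

Because the old-policy log-probabilities are cached at rollout time, later iterations only need a forward pass through the current policy; the per-token ratio is then `exp(new_logp - old_logp)`.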