Harsh Nilesh Pathak

harsh306

AI & ML interests

AI and ML

Recent Activity

replied to Kseniase's post 5 days ago

11 Fascinating new Policy Optimization techniques

Policy optimization (PO) algorithms are central to training AI models with preference-based feedback. In recent weeks, numerous new PO methods have emerged that build on or replace the popular PPO and GRPO and aim to address their shortcomings. Here are 11 of them:

1. BAlanced Policy Optimization (BAPO) → https://huggingface.co/papers/2510.18927
   Dynamically adjusts the clipping bounds in PPO-style updates to balance positive and negative gradients and prevent entropy collapse.

2. Training-Free GRPO → https://huggingface.co/papers/2510.08191
   Instead of using numeric rewards, it compares rollouts semantically to distill useful knowledge as a token prior, which is then applied during inference to guide the model's behavior.

3. Asymmetric Importance Sampling Policy Optimization (ASPO) → https://huggingface.co/papers/2510.06062
   Fixes imbalanced token weighting in LLM training. It flips the importance sampling ratios for positive tokens to correct over- and under-updates, and adds a soft dual-clipping step to keep gradients stable.

4. In-Context Steered Policy Optimization (ICPO) → https://arxiv.org/abs/2510.26519
   Uses a model's own in-context learning ability to guide training with existing data. It combines Mixed-Policy GRPO with Implicit Expert Forcing to expand exploration, and adds Expert Region Reject Sampling and Annealed Expert-Bonus Reward Shaping to ensure stability and balanced expert influence.

5. Graph-Enhanced Policy Optimization (GEPO) → https://arxiv.org/abs/2510.26270
   Builds a graph of an agent's experiences to understand how different states connect, guide exploration, and assign rewards more effectively.

6. Information Gain-based Policy Optimization (IGPO) → https://huggingface.co/papers/2510.14967
   Uses the model's own belief updates to create dense, informative feedback for smoother multi-turn learning.

Read further below ⬇️

If you like this, also subscribe to the Turing Post: https://www.turingpost.com/subscribe
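
As background for item 1 in the post above, the standard PPO clipped surrogate can be sketched with separate lower and upper clipping bounds. This is a minimal illustration, not BAPO itself: the function name `clipped_surrogate` and the parameters `eps_low` and `eps_high` are assumptions made for this sketch, and the bounds here are fixed constants, whereas the post describes BAPO as adjusting them dynamically.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """PPO-style clipped surrogate for one sample (to be maximized).

    ratio     : new-policy probability divided by old-policy probability
    advantage : advantage estimate for the sampled action or token
    eps_low   : how far the ratio may fall below 1 before clipping (assumed name)
    eps_high  : how far the ratio may rise above 1 before clipping (assumed name)
    """
    # Restrict the ratio to the interval [1 - eps_low, 1 + eps_high].
    clipped_ratio = max(1.0 - eps_low, min(ratio, 1.0 + eps_high))
    # Take the more pessimistic of the unclipped and clipped objectives.
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Positive advantage: the upper bound limits how much the sample contributes.
    print(clipped_surrogate(1.5, 1.0))
    print(clipped_surrogate(1.5, 1.0, eps_high=0.4))
    # Negative advantage: the lower bound is the one that matters.
    print(clipped_surrogate(0.5, -1.0))
```

Widening `eps_high` lets positive-advantage samples contribute more before the objective flattens, while `eps_low` plays the same role for negative-advantage samples, which is the kind of trade-off that separate bounds make adjustable.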

Organizations

None yet

models 0

None public yet

datasets 0

None public yet