EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning • Paper • 2509.22576 • Published 4 days ago • 109
FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning • Paper • 2504.00487 • Published Apr 1 • 18
SimNPO-Unlearned Models • Collection • This collection hosts the SimNPO-unlearned models on the TOFU, MUSE, and WMDP unlearning benchmarks. • 7 items • Updated Aug 8 • 2