zzfive's Collections: RL + reason model
RL + Transformer = A General-Purpose Problem Solver • arXiv:2501.14176 • 28 upvotes
Towards General-Purpose Model-Free Reinforcement Learning • arXiv:2501.16142 • 30 upvotes
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training • arXiv:2501.17161 • 122 upvotes
MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization • arXiv:2412.12098 • 5 upvotes
RLDG: Robotic Generalist Policy Distillation via Reinforcement Learning • arXiv:2412.09858 • 1 upvote
Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs • arXiv:2501.18585 • 61 upvotes
o3-mini vs DeepSeek-R1: Which One is Safer? • arXiv:2501.18438 • 24 upvotes
s1: Simple test-time scaling • arXiv:2501.19393 • 123 upvotes
Process Reinforcement through Implicit Rewards • arXiv:2502.01456 • 62 upvotes
The Jumping Reasoning Curve? Tracking the Evolution of Reasoning Performance in GPT-[n] and o-[n] Models on Multimodal Puzzles • arXiv:2502.01081 • 14 upvotes
Improving Transformer World Models for Data-Efficient RL • arXiv:2502.01591 • 9 upvotes
Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search • arXiv:2502.02508 • 23 upvotes
Demystifying Long Chain-of-Thought Reasoning in LLMs • arXiv:2502.03373 • 59 upvotes
Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking • arXiv:2502.02339 • 22 upvotes
A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods • arXiv:2502.01618 • 10 upvotes
BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation • arXiv:2502.03860 • 24 upvotes
Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models • arXiv:2502.04404 • 24 upvotes
Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling • arXiv:2502.06703 • 153 upvotes
ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates • arXiv:2502.06772 • 21 upvotes
LLMs Can Easily Learn to Reason from Demonstrations: Structure, not content, is what matters! • arXiv:2502.07374 • 41 upvotes
Teaching Language Models to Critique via Reinforcement Learning • arXiv:2502.03492 • 24 upvotes
An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging • arXiv:2502.09056 • 32 upvotes
Logical Reasoning in Large Language Models: A Survey • arXiv:2502.09100 • 23 upvotes
The Danger of Overthinking: Examining the Reasoning-Action Dilemma in Agentic Tasks • arXiv:2502.08235 • 58 upvotes
video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model • arXiv:2502.11775 • 8 upvotes
Soundwave: Less is More for Speech-Text Alignment in LLMs • arXiv:2502.12900 • 86 upvotes
Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? • arXiv:2502.12215 • 16 upvotes
Small Models Struggle to Learn from Strong Reasoners • arXiv:2502.12143 • 38 upvotes
Thinking Preference Optimization • arXiv:2502.13173 • 17 upvotes
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning • arXiv:2502.14768 • 48 upvotes
LightThinker: Thinking Step-by-Step Compression • arXiv:2502.15589 • 29 upvotes
The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer • arXiv:2502.15631 • 9 upvotes
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models • arXiv:2502.16033 • 18 upvotes
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution • arXiv:2502.18449 • 74 upvotes
Self-rewarding correction for mathematical reasoning • arXiv:2502.19613 • 84 upvotes
MedVLM-R1: Incentivizing Medical Reasoning Capability of Vision-Language Models (VLMs) via Reinforcement Learning • arXiv:2502.19634 • 63 upvotes
FINEREASON: Evaluating and Improving LLMs' Deliberate Reasoning through Reflective Puzzle Solving • arXiv:2502.20238 • 24 upvotes
Visual-RFT: Visual Reinforcement Fine-Tuning • arXiv:2503.01785 • 79 upvotes
SoS1: O1 and R1-Like Reasoning LLMs are Sum-of-Square Solvers • arXiv:2502.20545 • 22 upvotes
Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs • arXiv:2503.01307 • 39 upvotes
Efficient Test-Time Scaling via Self-Calibration • arXiv:2503.00031 • 15 upvotes
LADDER: Self-Improving LLMs Through Recursive Problem Decomposition • arXiv:2503.00735 • 21 upvotes
START: Self-taught Reasoner with Tools • arXiv:2503.04625 • 114 upvotes
Sketch-of-Thought: Efficient LLM Reasoning with Adaptive Cognitive-Inspired Sketching • arXiv:2503.05179 • 46 upvotes
R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model • arXiv:2503.05132 • 58 upvotes
R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning • arXiv:2503.05592 • 27 upvotes
Learning from Failures in Multi-Attempt Reinforcement Learning • arXiv:2503.04808 • 18 upvotes
An Empirical Study on Eliciting and Improving R1-like Reasoning Models • arXiv:2503.04548 • 8 upvotes
MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning • arXiv:2503.07365 • 62 upvotes
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models • arXiv:2503.06749 • 30 upvotes
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL • arXiv:2503.07536 • 86 upvotes
GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training • arXiv:2503.08525 • 17 upvotes
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning • arXiv:2503.09516 • 31 upvotes
GoT: Unleashing Reasoning Capability of Multimodal Large Language Model for Visual Generation and Editing • arXiv:2503.10639 • 50 upvotes
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning • arXiv:2503.10291 • 36 upvotes
Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond • arXiv:2503.10460 • 29 upvotes
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization • arXiv:2503.10615 • 17 upvotes
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey • arXiv:2503.12605 • 34 upvotes
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization • arXiv:2503.12937 • 29 upvotes
reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs • arXiv:2503.11751 • 16 upvotes
DAPO: An Open-Source LLM Reinforcement Learning System at Scale • arXiv:2503.14476 • 128 upvotes
Temporal Consistency for LLM Reasoning Process Error Identification • arXiv:2503.14495 • 10 upvotes
Towards Self-Improving Systematic Cognition for Next-Generation Foundation MLLMs • arXiv:2503.12303 • 7 upvotes
SWEET-RL: Training Multi-Turn LLM Agents on Collaborative Reasoning Tasks • arXiv:2503.15478 • 11 upvotes
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models • arXiv:2503.16419 • 74 upvotes
OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement • arXiv:2503.17352 • 23 upvotes
I Have Covered All the Bases Here: Interpreting Reasoning Features in Large Language Models via Sparse Autoencoders • arXiv:2503.18878 • 118 upvotes
SimpleRL-Zoo: Investigating and Taming Zero Reinforcement Learning for Open Base Models in the Wild • arXiv:2503.18892 • 30 upvotes
Vision-R1: Evolving Human-Free Alignment in Large Vision-Language Models via Vision-Guided Reinforcement Learning • arXiv:2503.18013 • 19 upvotes
Mind with Eyes: from Language Reasoning to Multimodal Reasoning • arXiv:2503.18071 • 3 upvotes
Inference-Time Scaling for Flow Models via Stochastic Generation and Rollover Budget Forcing • arXiv:2503.19385 • 33 upvotes
Think Twice: Enhancing LLM Reasoning by Scaling Multi-round Test-time Thinking • arXiv:2503.19855 • 28 upvotes
ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning • arXiv:2503.19470 • 18 upvotes
ViLBench: A Suite for Vision-Language Process Reward Modeling • arXiv:2503.20271 • 7 upvotes
Video-R1: Reinforcing Video Reasoning in MLLMs • arXiv:2503.21776 • 78 upvotes
OThink-MR1: Stimulating multimodal generalized reasoning capabilities via dynamic reinforcement learning • arXiv:2503.16081 • 26 upvotes
A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond • arXiv:2503.21614 • 39 upvotes
Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback • arXiv:2503.22230 • 44 upvotes
Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model • arXiv:2503.24290 • 62 upvotes
What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models • arXiv:2503.24235 • 54 upvotes
Efficient Inference for Large Reasoning Models: A Survey • arXiv:2503.23077 • 46 upvotes
Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 • arXiv:2503.24376 • 38 upvotes
Z1: Efficient Test-time Scaling with Code • arXiv:2504.00810 • 26 upvotes
Improved Visual-Spatial Reasoning via R1-Zero-Like Training • arXiv:2504.00883 • 64 upvotes
Understanding R1-Zero-Like Training: A Critical Perspective • arXiv:2503.20783 • 50 upvotes
Inference-Time Scaling for Generalist Reward Modeling • arXiv:2504.02495 • 54 upvotes
Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme • arXiv:2504.02587 • 30 upvotes
Rethinking Reflection in Pre-Training • arXiv:2504.04022 • 79 upvotes
Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models • arXiv:2504.04823 • 30 upvotes
VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks • arXiv:2504.05118 • 25 upvotes
Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1) • arXiv:2504.03151 • 14 upvotes
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning • arXiv:2504.06958 • 11 upvotes
DeepSeek-R1 Thoughtology: Let's <think> about LLM Reasoning • arXiv:2504.07128 • 85 upvotes
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model • arXiv:2504.07615 • 32 upvotes
VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning • arXiv:2504.08837 • 43 upvotes
DUMP: Automated Distribution-Level Curriculum Learning for RL-based LLM Post-training • arXiv:2504.09710 • 19 upvotes
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning • arXiv:2504.09641 • 16 upvotes
Reasoning Models Can Be Effective Without Thinking • arXiv:2504.09858 • 11 upvotes
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations • arXiv:2504.10481 • 84 upvotes
Efficient Reasoning Models: A Survey • arXiv:2504.10903 • 18 upvotes
Efficient Process Reward Model Training via Active Learning • arXiv:2504.10559 • 13 upvotes
A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce • arXiv:2504.11343 • 17 upvotes
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs • arXiv:2504.11536 • 61 upvotes
SFT or RL? An Early Investigation into Training R1-Like Reasoning Large Vision-Language Models • arXiv:2504.11468 • 28 upvotes
NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation • arXiv:2504.13055 • 19 upvotes
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? • arXiv:2504.13837 • 127 upvotes
Learning to Reason under Off-Policy Guidance • arXiv:2504.14945 • 85 upvotes
FlowReasoner: Reinforcing Query-Level Meta-Agents • arXiv:2504.15257 • 46 upvotes
ToolRL: Reward is All Tool Learning Needs • arXiv:2504.13958 • 44 upvotes
OTC: Optimal Tool Calls via Reinforcement Learning • arXiv:2504.14870 • 33 upvotes
TTRL: Test-Time Reinforcement Learning • arXiv:2504.16084 • 112 upvotes
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models • arXiv:2504.15279 • 74 upvotes
Process Reward Models That Think • arXiv:2504.16828 • 16 upvotes
Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning • arXiv:2504.16656 • 57 upvotes
Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency • arXiv:2504.18589 • 11 upvotes
Reinforcement Learning for Reasoning in Large Language Models with One Training Example • arXiv:2504.20571 • 93 upvotes
WebThinker: Empowering Large Reasoning Models with Deep Research Capability • arXiv:2504.21776 • 56 upvotes
Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math • arXiv:2504.21233 • 45 upvotes
Phi-4-reasoning Technical Report • arXiv:2504.21318 • 47 upvotes
100 Days After DeepSeek-R1: A Survey on Replication Studies and More Directions for Reasoning Language Models • arXiv:2505.00551 • 37 upvotes
AdaR1: From Long-CoT to Hybrid-CoT via Bi-Level Adaptive Reasoning Optimization • arXiv:2504.21659 • 12 upvotes
Llama-Nemotron: Efficient Reasoning Models • arXiv:2505.00949 • 35 upvotes
RM-R1: Reward Modeling as Reasoning • arXiv:2505.02387 • 75 upvotes
Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL • arXiv:2505.02391 • 24 upvotes
R1-Reward: Training Multimodal Reward Model Through Stable Reinforcement Learning • arXiv:2505.02835 • 25 upvotes
Unified Multimodal Chain-of-Thought Reward Model through Reinforcement Fine-Tuning • arXiv:2505.03318 • 92 upvotes
Absolute Zero: Reinforced Self-play Reasoning with Zero Data • arXiv:2505.03335 • 168 upvotes
ZeroSearch: Incentivize the Search Capability of LLMs without Searching • arXiv:2505.04588 • 64 upvotes
Scalable Chain of Thoughts via Elastic Reasoning • arXiv:2505.05315 • 24 upvotes
X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains • arXiv:2505.03981 • 14 upvotes
MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining • arXiv:2505.07608 • 78 upvotes
DanceGRPO: Unleashing GRPO on Visual Generation • arXiv:2505.07818 • 29 upvotes
Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning • arXiv:2505.07263 • 29 upvotes
AM-Thinking-v1: Advancing the Frontier of Reasoning at 32B Scale • arXiv:2505.08311 • 16 upvotes
Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging • arXiv:2505.05464 • 10 upvotes
Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM? • arXiv:2505.09439 • 8 upvotes
Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models • arXiv:2505.10554 • 118 upvotes
WorldPM: Scaling Human Preference Modeling • arXiv:2505.10527 • 33 upvotes
AdaptThink: Reasoning Models Can Learn When to Think • arXiv:2505.13417 • 78 upvotes
Thinkless: LLM Learns When to Think • arXiv:2505.13379 • 50 upvotes
VisionReasoner: Unified Visual Perception and Reasoning via Reinforcement Learning • arXiv:2505.12081 • 17 upvotes
ExTrans: Multilingual Deep Reasoning Translation via Exemplar-Enhanced Reinforcement Learning • arXiv:2505.12996 • 3 upvotes
Optimizing Anytime Reasoning via Budget Relative Policy Optimization • arXiv:2505.13438 • 35 upvotes
VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank • arXiv:2505.14460 • 30 upvotes
arXiv:2505.14674 • 34 upvotes
Web-Shepherd: Advancing PRMs for Reinforcing Web Agents • arXiv:2505.15277 • 98 upvotes
Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning • arXiv:2505.16410 • 55 upvotes
Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning • arXiv:2505.15966 • 51 upvotes
GRIT: Teaching MLLMs to Think with Images • arXiv:2505.15879 • 12 upvotes
VeriThinker: Learning to Verify Makes Reasoning Model Efficient • arXiv:2505.17941 • 24 upvotes
Synthetic Data RL: Task Definition Is All You Need • arXiv:2505.17063 • 10 upvotes
One RL to See Them All: Visual Triple Unified Reinforcement Learning • arXiv:2505.18129 • 59 upvotes
G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning • arXiv:2505.13426 • 12 upvotes
Active-O3: Empowering Multimodal Large Language Models with Active Perception via GRPO • arXiv:2505.21457 • 14 upvotes
The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models • arXiv:2505.22617 • 114 upvotes
Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO • arXiv:2505.22453 • 45 upvotes
Sherlock: Self-Correcting Reasoning in Vision-Language Models • arXiv:2505.22651 • 50 upvotes
Skywork Open Reasoner 1 Technical Report • arXiv:2505.22312 • 52 upvotes
Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start • arXiv:2505.22334 • 36 upvotes
Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs • arXiv:2505.19075 • 21 upvotes
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning • arXiv:2505.14362 • 1 upvote
Table-R1: Inference-Time Scaling for Table Reasoning • arXiv:2505.23621 • 88 upvotes
GoT-R1: Unleashing Reasoning Capability of MLLM for Visual Generation with Reinforcement Learning • arXiv:2505.17022 • 26 upvotes
LLMs for Engineering: Teaching Models to Design High Powered Rockets • arXiv:2504.19394 • 13 upvotes
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models • arXiv:2505.24864 • 112 upvotes
DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models • arXiv:2505.24025 • 23 upvotes
MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning • arXiv:2505.24871 • 18 upvotes
Harnessing Negative Signals: Reinforcement Distillation from Teacher Data for LLM Reasoning • arXiv:2505.24850 • 9 upvotes
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning • arXiv:2506.01939 • 123 upvotes
SRPO: Enhancing Multimodal LLM Reasoning via Reflection-Aware Reinforcement Learning • arXiv:2506.01713 • 30 upvotes
AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning • arXiv:2505.24298 • 20 upvotes
Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning • arXiv:2505.24726 • 162 upvotes
OThink-R1: Intrinsic Fast/Slow Thinking Mode Switching for Over-Reasoning Mitigation • arXiv:2506.02397 • 33 upvotes
OpenThoughts: Data Recipes for Reasoning Models • arXiv:2506.04178 • 20 upvotes
Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback • arXiv:2506.03106 • 6 upvotes