HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs

This work is a companion to our earlier report KAT-V1: Kwai-AutoThink Technical Report, where we first introduced the AutoThink paradigm for controllable reasoning. While KAT-V1 outlined the overall framework of SFT + RL for adaptive reasoning, this paper provides the detailed algorithmic design of that training recipe.

Overview

We introduce HiPO (Hybrid Policy Optimization for Dynamic Reasoning in LLMs), an RL framework that enables models to decide when to “think” (i.e., Think-on) and when to skip reasoning (i.e., Think-off), thereby striking a balance between correctness and efficiency.

HiPO has two main components:

Hybrid Data Pipeline – Collects both Think-on and Think-off responses, categorizes queries by difficulty, and uses a strong model (e.g., DeepSeek-V3) to generate explanations that justify mode choices.
Hybrid Reward System – Combines rewards for both modes, with bias adjustment to prevent overuse of long reasoning and mode-aware advantage functions to align decisions with performance gains.
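
To make the reward idea more concrete, here is a minimal sketch of how a bias adjustment could be combined with a group-relative advantage. It is an illustration only, not the formulation used in the paper: the function name `bias_adjusted_advantages`, the `bias_scale` parameter, the margin rule, and the GRPO-style normalization are all assumptions introduced for this example.

```python
from statistics import mean, pstdev


def bias_adjusted_advantages(samples, bias_scale=0.1):
    """Compute one advantage per sampled response for a single query.

    samples: list of (mode, reward) pairs, with mode in {"on", "off"}.
    bias_scale: hypothetical knob controlling the size of the Think-off bonus.
    """
    on = [r for m, r in samples if m == "on"]
    off = [r for m, r in samples if m == "off"]

    # Assumed bias adjustment: when Think-on is only marginally better than
    # Think-off, give Think-off responses a small bonus so that long
    # reasoning is not favored by default.
    bonus = 0.0
    if on and off:
        margin = bias_scale * mean(on)
        if 0.0 <= mean(on) - mean(off) <= margin:
            bonus = margin
    adjusted = [r + bonus if m == "off" else r for m, r in samples]

    # Group-relative normalization (GRPO-style): center and scale the
    # adjusted rewards within the group; fall back to 1.0 if they are equal.
    mu = mean(adjusted)
    sigma = pstdev(adjusted) or 1.0
    return [(r - mu) / sigma for r in adjusted]


group = [("on", 1.0), ("on", 1.0), ("off", 1.0), ("off", 0.9)]
print(bias_adjusted_advantages(group))
```

In this toy group, the correct Think-off response ends up with the largest advantage, because it matches the Think-on reward without the extra reasoning.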

Experimental Findings

Think-on Only (Overthinking).
Training only on Think-on data makes the model reason on all problems, causing inefficiency.

GRPO.
Improves accuracy by +3.1%, but increases token length on simple tasks.

Think-on/Think-off Mix.
Yields higher accuracy (+4.0%) while reducing token length (–10.8%) and thinking rate (–22%).

HiPO Advantage.
Achieves the best results: +6.2% accuracy, –30% token length, –39% thinking rate, outperforming existing methods in both efficiency and accuracy.
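
The efficiency figures above refer to token length and thinking rate. As a rough sketch of how such quantities could be computed from a set of generations, consider the helper below. The record fields `used_thinking` and `num_tokens`, both function names, and the toy numbers are assumptions for illustration; this section does not state whether the reported changes are relative or absolute, so the relative form shown here is only one possible reading.

```python
def efficiency_metrics(responses):
    """Return (thinking_rate, average_token_length) for a list of records.

    Each record is a dict with hypothetical keys:
      'used_thinking' (bool): whether the response was produced in Think-on mode
      'num_tokens' (int): number of generated tokens
    """
    n = len(responses)
    thinking_rate = sum(r["used_thinking"] for r in responses) / n
    avg_tokens = sum(r["num_tokens"] for r in responses) / n
    return thinking_rate, avg_tokens


def relative_change(new, base):
    """Percentage change of `new` with respect to `base`."""
    return (new - base) / base * 100.0


# Toy records, not data from the paper.
toy = [
    {"used_thinking": True, "num_tokens": 500},
    {"used_thinking": False, "num_tokens": 100},
]
rate, avg_len = efficiency_metrics(toy)
print(rate, avg_len)
```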

Data Format

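HiPO produces responses in a structured template that makes the reasoning path explicit and machine-parsable. Two modes are supported: Think-on, in which the model reasons before answering, and Think-off, in which it answers directly.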
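
Since the template is meant to be machine-parsable, a small parser can separate the chosen mode from the reasoning and the final answer. The sketch below shows the general approach only. The exact template is not reproduced in this section, so the tag names `<think_on>`, `<think_off>`, `<think>`, and `<answer>` are placeholders assumed for illustration and should be replaced with the tags the model actually emits.

```python
import re

# Placeholder tag names, assumed for this example; not confirmed by this document.
MODE_TAGS = {"think_on": "Think-on", "think_off": "Think-off"}


def parse_response(text):
    """Split a structured response into mode, reasoning, and answer."""
    mode = None
    for tag, label in MODE_TAGS.items():
        if f"<{tag}>" in text:
            mode = label
            break

    reasoning = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return {
        "mode": mode,
        "reasoning": reasoning.group(1).strip() if reasoning else None,
        # Fall back to the raw text if no answer tag is found.
        "answer": answer.group(1).strip() if answer else text.strip(),
    }


print(parse_response("<think_on><think>2 + 2 = 4</think><answer>4</answer>"))
```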

Quick Start

from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Kwaipilot/HiPO-8B"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768,
    do_sample=True,  # temperature and top_p only take effect when sampling is enabled
    temperature=0.6,
    top_p=0.95,
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 
content = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
print("prompt:\n", prompt)
print("content:\n", content)

Citation

@article{Zhan2025HiPO,
  title={HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs},
  author={Deng, Ken and Zhan, Zizheng and Xiang, Wen and Zhu, Wenqiang and others},
  year={2025},
  journal={arXiv preprint arXiv:2509.23967},
  number={arXiv:2509.23967},
  url={https://arxiv.org/abs/2509.23967}
}