Questions about the Question Solver Agent, Judge Agent, and SFT & RL data

#13
by XinglinZhao - opened

After reading the WebResearcher paper https://arxiv.org/pdf/2509.13309, I have a few questions:

  • Which model do the Question Solver Agent and the Judge Agent use?

I assume DeepResearch-30B-A3B is trained on data generated by WebFrontier. So I wonder whether the Question Solver Agent / Judge Agent is a larger, stronger model, in which case DeepResearch-30B-A3B effectively learns from these models.

My second question: do SFT and RL use the same dataset? This is unclear in the paper.
