Using and Training on Test Time Scaling approaches in Non-Verifiable Domains
#14
by blattimer
In reference to point 6, Expanding Beyond Verifiable Domains: our paper "SPARSE REWARDS CAN SELF-TRAIN DIALOGUE AGENTS" creates training data using a test-time scaling methodology and then trains on that data to self-improve models in the dialogue domain. This domain is much looser in its verifiability and involves multi-turn dialogue rather than single-turn math. Let me know if this was helpful in exploring some alternative domains!