Argunauts Training Phase II: Selfplay Finetuning Line-By-Line

Community Article Published February 19, 2025

Drill dispenses with intelligence, training develops it.

― Gilbert Ryle, The Concept of Mind

Introduction

This article documents how we've been creating DebateLabKIT/Llama-3.1-Argunaut-1-8B-SPIN through Selfplay-Finetuning of Argunaut-1-8B-SFT.

The goal is to increase the model's fluency and flexibility in using Argdown effectively, and its ability to follow meta-reasoning instructions. Moreover, we seek to re-establish, as far as possible, the skills that were lost during the preceding continual pretraining on the Argunauts SFT mix. The challenge is to achieve this with a strictly limited number of training examples and demonstrations.

Or, with Ryle, we're now moving beyond "drill" and starting with proper "training."

Training Llama-3.1-Argunaut-1-8B-SPIN is part of the Argunauts Project:

☑️ Argunauts: Motivation and Goals ↪
☑️ Phase I: SFT Training ↪
Phase II: ✨Selfplay Finetuning Line-By-Line✨

What is Selfplay Finetuning (SPIN)?

Selfplay Finetuning (SPIN) is a method introduced in Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models roughly one year ago. The authors show that SPIN lets a model learn much more efficiently than SFT: it reaches the same proficiency while training on only ~10% of the data.

Why is that?

Here's my idiosyncratic, metaphorical and non-technical explanation.

With SFT training, we show the model a lot of examples, each containing both question and solution. The model is ultimately trying to learn all these examples "by heart." (Cross-entropy loss!) That is, of course, not what we want it to do, and that's why we throw so many diverse examples at the model, "overwhelming" it with correct solutions far beyond its capacity to memorize them in detail. This forces it to learn the abstract, more general structure behind the examples, which ultimately allows it to answer further questions correctly without simply recalling the answer.

Not so with SPIN.

With SPIN training, we reveal, in a first step, just the question and instruct the model to answer it on its own. We select a candidate answer generated by the model and inform the model that the correct answer is preferred over the self-generated one. The model then learns from this information, comparing its own answer with the ground truth, without trying to memorize the correct solution. In another round, we can instruct the model to answer the same question again; since the model hasn't memorized the answer, it might come up with a solution that still differs from the correct one. We present the correct answer once more, and the model learns from this pairwise comparison.

This non-technical analogy explains, to a certain extent, why SPIN training should be more efficient than SFT training. Moreover, from an educational perspective, SPIN clearly corresponds to a much more attractive learning paradigm than SFT.
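In code, one SPIN round can be sketched as follows. This is a minimal illustration, not the paper's implementation; `generate` stands in for any sampling call against the current checkpoint:

```python
def make_spin_pair(prompt, gold_answer, generate):
    """Build one DPO-style preference pair for SPIN.

    The ground-truth answer is 'chosen'; a self-generated candidate
    is 'rejected'. If the model already reproduces the gold answer,
    the pair carries no learning signal and is dropped.
    """
    candidate = generate(prompt)
    if candidate.strip() == gold_answer.strip():
        return None  # identical answers: nothing to prefer
    return {"prompt": prompt, "chosen": gold_answer, "rejected": candidate}


# Toy usage with a stubbed sampling call:
pair = make_spin_pair("2 + 2 = ?", "4", lambda p: "5")
```

In later rounds, the same prompt can be replayed against the newer checkpoint, yielding fresh rejected candidates as long as the model's answer still deviates from the ground truth.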

Training Data

A DPO dataset contains two answers for each prompt: a chosen and a rejected one. A selfplay dataset, in contrast, just needs one (correct) answer for each prompt, like a classical instruction-following SFT dataset. In SPIN, the model generates further candidate answers on its own (online), which will figure as rejected answers, thus allowing for DPO training. This is also the reason why we can mix selfplay datasets with classical preference datasets for training Llama-3.1-Argunaut-1-8B-SPIN (see also training curriculum below).

train_mixture:
    - path: DebateLabKIT/argdown_line-by-line
      config_name: critical-thinking-ten-arguments
      type: selfplay
    - path: DebateLabKIT/argdown_line-by-line
      config_name: argument-analysis-textbook
      type: selfplay
    - path: DebateLabKIT/argdown_line-by-line
      config_name: classic-philosophical-arguments
      type: selfplay
    - path: DebateLabKIT/argdown_line-by-line
      config_name: synthetic_corpus-001
      type: selfplay
    - path: DebateLabKIT/argdown_line-by-line
      config_name: argdown-docs-guide
      type: selfplay
    - path: DebateLabKIT/argdown_line-by-line
      config_name: z3-docs-logic
      type: selfplay
    - path: DebateLabKIT/argument_mapping_dpo_pairs
      type: preference
    - path: allenai/llama-3.1-tulu-3-70b-preference-mixture
      type: preference

Argdown line-by-line

I've assembled teaching material and preprints of mine containing Argdown argument reconstructions; some of these are book-length. Moreover, we have re-reconstructed and annotated analyses of classical philosophical arguments from "Just the Arguments" (Bruce/Barbone). In sum, however, this amounts to significantly fewer than 1,000 argument reconstructions. That is probably not enough, not even for SPIN.

So we came up with the idea of splitting each Argdown code snippet (which typically contains one argument reconstruction) line by line. Rather than asking the model to generate an entire Argdown snippet with a complete argument reconstruction, we ask it to generate just the next line of a given, incomplete snippet. This multiplies the number of examples, and with the Argdown documentation as an additional source we easily have more than 5k training items. On top of this, I'm adding line-by-line training examples from deep-argmap-conversations.
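The splitting itself is straightforward. Here is a minimal sketch; the function name, prompt template, and field names are illustrative, not the actual dataset schema:

```python
def split_line_by_line(instruction, snippet):
    """Turn one Argdown snippet into next-line prediction examples.

    Example i shows the task plus the snippet's first i lines as the
    prompt and asks for line i+1 as the (single-line) answer.
    """
    lines = snippet.splitlines()
    examples = []
    for i in range(len(lines)):
        prefix = "\n".join(lines[:i])
        examples.append({
            "prompt": f"{instruction}\n\nPartial Argdown snippet:\n{prefix}\n\nNext line:",
            "answer": lines[i],
        })
    return examples


# A two-line snippet yields two training items:
snippet = "[Thesis]: We should act.\n  <+ <Reason>: Acting helps."
items = split_line_by_line("Continue the argument map.", snippet)
```

One Argdown reconstruction of n lines thus yields n training items, which is where the multiplication of the dataset comes from.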

Splitting Argdown code snippets line by line not only has the pragmatic benefit of enlarging the training dataset; it also pays tribute to the insight that training on intermediate steps is highly advisable. The Phi-4 team has recently perfected this approach by identifying intermediary "pivotal tokens" in reasoning traces, allowing them to do DPO on pairs of individual tokens. It's still a long way from "all lines" (us) to "pivotal tokens" (Phi-4), but the idea is the same.

Z3 line-by-line

Argunauts should also master Z3 in connection with Argdown, e.g. for checking deductive validity. To increase their proficiency with Z3, we collect all Z3 code snippets from the official documentation's logic tutorial, including their context, and build a line-by-line dataset suitable for SPIN training.

Logikon argument mapping pairs

Logikon AI has released its pipeline for reconstructing arbitrary texts as fuzzy argument maps with the help of general-purpose causal LLMs and classifiers.

We're collecting diverse argumentative texts ranging from layperson drug reviews and BBC news to supreme court decisions.

On each argumentative text, we run the Logikon pipeline with a strong (70B) and a weak (8B) model, thus obtaining two alternative argument maps.

To create preference pairs, we assume that the strong model's argument map is preferred to the weak one's.

Tülu 3 preference data

We generously mix in Tülu 3 preference data to avoid forgetting, unlearning, and over-specialization.

Training Curriculum

Rather than throwing all the training data at the model at once, we've set up a training curriculum, loosely inspired by insights from educational research.

The training is split into consecutive epochs. By switching between phases where the model is trained on a specific sub-dataset and phases where it is not seeing that sub-dataset at all, we further hope to avoid indirect memorization effects and allow the model to revisit tasks it hasn't solved in the past equipped with new skills it might have acquired in the meantime.

Our SPIN training curriculum for Llama-3.1-Argunaut-1-8B-SPIN comprises 12 epochs, grouped in 4 roughly similar quarters (numbers represent the relative weight of each dataset in the respective epoch):

[Figure: training curriculum, relative dataset weights per epoch]

Epochs 1–3 and 7–9 focus on generating and revising detailed logical reconstructions of individual arguments. Epochs 4–6 and 10–12, in contrast, emphasize informal argument mapping. An out-of-rhythm distribution of the Argdown and Z3 docs further increases the training mix variance between the different epochs.

To avoid training the model on examples it already masters, we have added dynamic task filtering:

For each example question Q (with correct answer A):
- model generates k candidate answers
- l of these k answers are identical with A
- with probability (1-l/k) model trains on this example (and skips it otherwise)
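The filtering rule above can be sketched in a few lines of Python (only the filtering logic is shown; generating the k candidates happens elsewhere):

```python
import random

def keep_for_training(gold, candidates, rng=random):
    """Dynamic task filtering: keep an example with probability 1 - l/k,
    where l of the k candidate answers already match the gold answer."""
    k = len(candidates)
    l = sum(c.strip() == gold.strip() for c in candidates)
    return rng.random() < 1 - l / k


# A fully mastered problem is always skipped, an unsolved one always kept:
keep_for_training("4", ["4", "4", "4"])   # always False
keep_for_training("4", ["5", "6", "7"])   # always True
```

Problems the model solves in some but not all of the k attempts are kept with a probability proportional to the failure rate, so the training signal concentrates on the frontier of the model's current abilities.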

Technical Setup

Each epoch consists of two phases:

  • a generation phase and
  • a DPO training phase.

In the generation phase, the current model checkpoint produces candidate answers for all selfplay questions. Problems the model masters are dropped according to dynamic task filtering. Selfplay preference pairs are constructed and mixed with the standard preference datasets according to the weights specified in the curriculum. The result is the DPO training set for this epoch.

To generate the selfplay candidate answers, we spin up a vLLM inference server serving the current model checkpoint, and use the following sampling parameters for candidate generation:

| Parameter | Value |
|---|---|
| k (number of candidates) | 5 |
| temperature | 0.7 |

Generating 5 candidates for ~4k problems takes no more than a few minutes.
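Since vLLM exposes an OpenAI-compatible API, assembling the generation request is simple. A sketch of the request payload; the model name and endpoint are placeholders, not the actual checkpoint path:

```python
def build_request(messages, model="current-checkpoint", k=5, temperature=0.7):
    """Payload for vLLM's OpenAI-compatible /v1/chat/completions endpoint:
    k candidate answers per prompt, sampled at the given temperature."""
    return {
        "model": model,
        "messages": messages,
        "n": k,                      # number of candidates per prompt
        "temperature": temperature,  # diversity of the candidates
    }


payload = build_request([{"role": "user", "content": "Map this argument ..."}])
# POST the payload as JSON to e.g. http://localhost:8000/v1/chat/completions
```

A temperature of 0.7 keeps the five candidates diverse enough that at least some of them differ from the gold answer, which is what dynamic task filtering and pair construction rely on.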

In the DPO training phase, we use the training dataset produced in the preceding generation phase for standard DPO training, with the following parameters and set-up (identical for each epoch):

| Parameter | Value |
|---|---|
| DPO loss | sigmoid |
| learning rate | 2e-7 |
| lr scheduler | linear |
| effective batch size | 64 |
| DeepSpeed | ZeRO stage 3 |
| DeepSpeed CPU offload | false |
| Spectrum: top % SNR ratios targeted | 30% |

Training one epoch takes roughly 30 minutes on 2 H100 GPUs.
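The effective batch size of 64 on 2 GPUs decomposes as per-device batch size times gradient accumulation steps times number of GPUs. The per-device and accumulation values below are an assumed split for illustration; the article only reports the effective size:

```python
num_gpus = 2                  # 2x H100, as reported
per_device_batch_size = 8     # assumed, not reported
gradient_accumulation = 4     # assumed, not reported

effective_batch_size = num_gpus * per_device_batch_size * gradient_accumulation
assert effective_batch_size == 64
```

Any split with the same product works; smaller per-device batches traded against more accumulation steps reduce peak memory at the cost of throughput.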

Training metrics (one curve for each epoch):

[Figure: per-epoch training metric curves]

Evaluation

Chat Experience

Let me share the first chat I've had with debatelabkit/llama-3.1-argunaut-1-8b-spin-q4_k_m.gguf.

👤 You
Hi, are you familiar with Argdown?

🤖 Assistant
Yes, I'm familiar with Argdown (Argument Markup Language). It's a syntax for representing and structuring arguments using a markdown-like approach. With Argdown, you can clearly define claims, arguments, premises, conclusions, and other components of an argumentative text in a way that is both human-readable and machine-processable. If you have some content you'd like to be organized or structured using Argdown, feel free to provide it!

👤 You
Can you please map the following text?

School uniforms deter crime and increase student safety. In Long Beach, California, after two years of a district-wide K-8 mandatory uniform policy, reports of assault and battery in the district’s schools decreased by 34%, assault with a deadly weapon dropped by 50%, fighting incidents went down by 51%, sex offenses were cut by 74%, robbery dropped by 65%, possession of weapons (or weapon “look-alikes”) decreased by 52%, possession of drugs went down by 69%, and vandalism was lowered by 18%. [64]

One year after Sparks Middle School in Nevada instituted a uniform policy, school police data showed a 63% drop in police log reports, and decreases were also noted in gang activity, student fights, graffiti, property damage, and battery. A peer-reviewed study found that schools with uniform policies had 12% fewer firearm-related incidents and 15% fewer drug-related incidents than schools without uniforms. [25][69]

School uniforms also prevent students from concealing weapons under baggy clothing, make it easier to keep track of students on field trips, and make intruders on campus more visible. Frank Quatrone, superintendent in the Lodi school district of New Jersey, states, “When you have students dressed alike, you make them safer. If someone were to come into a building, the intruder could easily be recognized.” [6][38]

Further, school uniforms create a level playing field among students, reducing peer pressure and bullying. When all students are dressed alike, competition between students over clothing choices and the teasing of those who are dressed in less expensive or less fashionable outfits can be eliminated. Research by the Schoolwear Association found that 83% of teachers thought “a good school uniform…could prevent bullying based on appearance or economic background.” Arminta Jacobson, founder and director of the Center for Parent Education at the University of North Texas, states that uniforms put “all kids on the same playing field in terms of their appearance. I think it probably gives them a sense of belonging and a feeling of being socially accepted.” [5][91]

🤖 Assistant
Let's create an Argdown argument map to structure this text:

School uniforms deter crime and increase student safety.
  <+ One year after Sparks Middle School in Nevada instituted a uniform policy, school police data showed a 63% drop in police log reports, and decreases were also noted in gang activity, student fights, graffiti, property damage, and battery. A peer-reviewed study found that schools with uniform policies had 12% fewer firearm-related incidents and 15% fewer drug-related incidents than schools without uniforms.
    <+ School uniforms create a level playing field among students, reducing peer pressure and bullying. When all students are dressed alike, competition between students over clothing choices and the teasing of those who are dressed in less expensive or less fashionable outfits can be eliminated.
      <+ Research by the Schoolwear Association found that 83% of teachers thought “a good school uniform…could prevent bullying based on appearance or economic background.”
    <+ In Long Beach, California, after two years of a district-wide K-8 mandatory uniform policy, reports of assault and battery in the district’s schools decreased by 34%, assault with a deadly weapon dropped by 50%, fighting incidents went down by 51%, sex offenses were cut by 74%, robbery dropped by 65%, possession of weapons (or weapon “look-alikes”) decreased by 52%, possession of drugs went down by 69%, and vandalism was lowered by 18%.
      <+ School uniforms prevent students from concealing weapons under baggy clothing, make it easier to keep track of students on field trips, and make intruders on campus more visible.
        <+ Frank Quatrone, superintendent in the Lodi school district of New Jersey, states, “When you have students dressed alike, you make them safer. If someone were to come into a building, the intruder could easily be recognized.”
      <+ Arminta Jacobson, founder and director of the Center for Parent Education at the University of North Texas, states that uniforms put “all kids on the same playing field in terms of their appearance. I think it probably gives them a sense of belonging and a feeling of being socially accepted.”

👤 You
Please summarize, in your own words, the main argument presented in the text. No more than 2-3 sentences, please.

🤖 Assistant
The main argument is that school uniforms lead to a decrease in various types of incidents and crimes at schools, making students safer due to increased visibility and reduced peer pressure. This conclusion is supported by data from Sparks Middle School and Long Beach, California, as well as expert opinions.

👤 You
Let's focus on this argument. Just for the protocol, what's the conclusion?

🤖 Assistant
The text ultimately argues that: School uniforms deter crime and increase student safety.

👤 You
Can you reconstruct this argument as premise conclusion structure, using argdown syntax?

🤖 Assistant

(1) By instituting a uniform policy at Sparks Middle School, the school's police data showed a 63% drop in police log reports, along with decreases in gang activity, student fights, graffiti, property damage, and battery.
(2) A peer-reviewed study found that schools with uniform policies had 12% fewer firearm-related incidents and 15% fewer drug-related incidents than schools without uniforms.
(3) Implementing a mandatory uniform policy at Long Beach's K-8 schools led to significant decreases in various types of incidents and crimes, such as assault and battery, possession of weapons or drugs, robbery, and vandalism.
-- with ?? from (1) (2) --
(4) Schools with uniform policies tend to have fewer incidents related to weapons and drugs compared to those without uniforms.
(5) School uniforms prevent students from concealing weapons under baggy clothing.
(6) Implementing a uniform policy makes it easier to keep track of students on field trips.
(7) A uniform policy makes intruders on campus more visible.
-- with ?? from (5) (6) (7) --
(8) School uniforms are an effective measure for maintaining student safety in various scenarios, such as preventing the concealment of weapons and ensuring easier identification of students.
(9) When all students are dressed alike, competition between students over clothing choices can be eliminated.
-- with ?? from (9) --
(10) The elimination of competition due to uniform policies reduces peer pressure and bullying among students.
-- with ?? from (4) (8) (10) --
(11) School uniforms deter crime and increase student safety.

👤 You
That's way too complicated. Just start with 2 or 3 premises.

🤖 Assistant
Let me simplify the argument:

Schools that implement uniform policies tend to have fewer incidents related to weapons and drugs compared to those without uniforms. Implementing a mandatory uniform policy at Long Beach's K-8 schools led to significant decreases in various types of incidents and crimes, such as assault and battery, possession of weapons or drugs, robbery, and vandalism. When all students are dressed alike, competition between students over clothing choices can be eliminated. These premises lead to the conclusion that school uniforms deter crime and increase student safety.

👤 You
Render this in argdown, please.

🤖 Assistant

(1) Schools that implement uniform policies tend to have fewer incidents related to weapons and drugs compared to those without uniforms.
(2) Implementing a mandatory uniform policy at Long Beach's K-8 schools led to significant decreases in various types of incidents and crimes, such as assault and battery, possession of weapons or drugs, robbery, and vandalism.
(3) When all students are dressed alike, competition between students over clothing choices can be eliminated.
-----
(4) School uniforms deter crime and increase student safety.

👤 You
Do these three premises work together, or do we have three distinct arguments?

🤖 Assistant
The given premises don't directly connect to form a single argument. They represent separate premises that could support different conclusions or be parts of various arguments.

We can see that:

Premise (1) could be part of an argument about the effectiveness of uniform policies in reducing incidents related to weapons and drugs. Premise (2) supports the idea that implementing uniform policies leads to a decrease in various types of incidents and crimes. Premise (3) suggests that eliminating competition over clothing choices can reduce peer pressure and bullying. To combine these into a single argument, we'd need additional premises or intermediary conclusions to link them together. For now, they are distinct premises that could be used separately in different arguments.

👤 You
Let's focus on premise (1) and drop (2) and (3) from the argument. Please render the resulting 1-premise argument.

🤖 Assistant

[Conclusion 1]: Schools with uniform policies tend to have fewer incidents related to weapons and drugs compared to those without uniforms.
    <+ <Subargument 1>
        <+ [Premise]: Schools that implement uniform policies tend to have fewer incidents related to weapons and drugs compared to those without uniforms.

👤 You
Fine, and how does this look in standard form (premise-conclusion structure)?

🤖 Assistant

(1) Schools that implement uniform policies tend to have fewer incidents related to weapons and drugs compared to those without uniforms.
-- with ?? from (1) --
(2) Schools with uniform policies tend to have fewer incidents related to weapons and drugs compared to those without uniforms.

This simplified argument focuses on the premise regarding schools with uniform policies having fewer incidents related to weapons and drugs, leading directly to the conclusion.

The model appears more flexible, less stubborn, and less formalistic than the Argunaut-1-8B-SFT base model.

Metrics

We assess the performance of DebateLabKIT/Llama-3.1-Argunaut-1-8B-SPIN on three benchmarks:

| | Llama-3.1-8B-Instruct | Argunaut-1-8B-SFT | Argunaut-1-8B-SPIN |
|---|---|---|---|
| ⬇️ Argdown Bench | | | |
| pass@1 | 80.8 | 98.9 | 79.5 |
| pass@5 | 98.9 | 99.8 | 94.5 |
| graph sim | 8.4 | 65.5 | 34.3 |
| isomorphic | 1.3 | 44.0 | 17.1 |
| 🤗 HF Leaderboard | | | |
| MMLU Pro | 37.6 | 34.5 | 35.9 |
| MUSR | 40.1 | 41.6 | 41.6 |
| GPQA Diamond | 32.8 | 30.8 | 31.8 |
| GPQA Main | 28.5 | 32.1 | 30.8 |
| MATH | 12.5 | 9.1 | 11.1 |
| BBH | 54.7 | 48.2 | 49.8 |
| ⛓️ COT Leaderboard | | | |
| LogiQA | 5.9 | 1.4 | 5.8 |
| LogiQA2 | 15.5 | 0.8 | 9.2 |
| LSAT-ar | 11.7 | 3.0 | 6.5 |
| LSAT-lr | 20.8 | 3.9 | 12.9 |
| LSAT-rc | 27.5 | 13.8 | 21.2 |

↘️ The Argdown Bench results are not as brilliant as for the Argunaut-1-8B-SFT base model, but they still confirm the chat impression that Argunaut-1-8B-SPIN, all in all, masters Argdown syntax and semantics.

↗️ On the HF Leaderboard tasks, the Argunaut-1-8B-SPIN model performs at least as well as the SFT model, or as well as the original Llama-3.1-8B-Instruct: a slight improvement.

↗️ The COT Leaderboard results highlight the most substantial measurable effect of SPIN training: Argunaut-1-8B-SPIN recovers (though not fully) its ability to generate effective CoT reasoning traces.

Next Steps

The desiderata for further training phases are: make sure that Argunauts write 100% legal Argdown code; test and improve ability to use logic tools; further refine Argdown fluency and instruction following. All this will have to be achieved with minimal additional training data (🚀RLVR).

So there'll be even less drill, and more training. 😉
