Argunauts Training Phase I: Continual Pretraining on Synthetic Data

Community Article Published February 18, 2025

Introduction

This article documents how we've been creating DebateLabKIT/Llama-3.1-Argunaut-1-8B-SFT through continual pretraining of Llama-3.1-8B-Instruct.

The goal of this is to familiarize Llama-3.1-8B-Instruct with Argdown syntax and semantics, without diminishing the model's other capabilities.

Training Llama-3.1-Argunaut-1-8B-SFT is part of the Argunauts Project.

SFT Dataset Mix

The training data contains 1M examples and is composed of the following datasets:

Dataset Weight (examples) Weight (tokens)
DebateLabKIT/deepa2-conversations 25% 49%
DebateLabKIT/deep-argmap-conversations 25% 18%
allenai/tulu-3-sft-mixture 50% 33%

Ingredient #1: deepa2-conversations

The deepa2-conversations dataset contains multi-turn conversations with user/assistant roles (length: 2x2–20x2 turns, bell-shaped distribution). The user instructs an assistant, step by step, to reconstruct and analyse an initially given argumentative text.

It is a fully synthetic dataset, which upcycles the deepa2 data.

To build deepa2-conversations, we've defined 30 different kinds of artifacts that can be referred to, or produced in the course of an analysis, some of which annotate and analyse the source text, reconstruct the argument in standard form, or formalize premises and conclusions, while still others use the reconstruction for additional tasks (like writing a Z3 program to check deductive validity, or augmenting the argument with function-calling results).

The features the user may refer to / the assistant may produce in the course of the argumentation analysis are:

Source Text Artifacts

N = "narrative"  # free form text that presents the argument as narrative, dialogque, letter, or email  
ST = "source_text"  # original and more schematic "source text" containing the (incomplete) argument and potential distractors
SP = "source_paraphrase"  # paraphrase of source text which leaves out extraneous material and streamlines the presentation

Informal Analysis Artifacts

T = "title"  # short title of the argument 
G = "gist"  # key point summarizing the argument
TG = "title_gist"  # mixes title and gist
TGC = "title_gist_conclusion"  # mixes title, gist and conclusion in argdown snippet
X = "context"  # multi-purpose characterization of context

Argumentative Text Analysis Artifacts

R = "reasons"  # list of quotes from `ST` that justify claims
J = "conjectures"  # list of quotes from `ST` that are argued for
AT = "annotated_source_text"  # argumentative text annotation of source text
ATR = "annotated_source_text_ref"  # like AT, with reference to argdown reconstruction
IG = "inference_graph"  # renders all reasons and conjectures as mini argument map

Argument Reconstruction Artifacts

A = "argdown reconstruction"  # premise-conclusion structure
E = "erroneous_argdown"  # flawed reconstruction, to-be revised at later step
P = "premises"  # list of premises
I = "intermediary_conclusions"  # list of intermediary conclusions
C = "conclusion"  # final conclusion of argument
AM = "argument_map"  # map with subarguments, premises and conclusions 
ATG = "argdown_with_title_gist"  # argdown argument with header

Logical Analysis Artifacts

FP = "premises_formalized"  # formalizations of premises
FI = "intermediary_conclusions_formalized"  # formalizations of intermediary conclusions
FC = "conclusion_formalized"  # formalizations of final conclusion
PP = "predicate_placeholders"  # list of predicate placeholders
PE = "entity_placeholders"  # list of entity names
PM = "misc_placeholders"  # further placeholders (e.g., propositional constants)
K = "plchd_substitutions"  # keys, mapping placeholders to natural-language terms
AF = "argdown_with_formalizations"  # argdown argument with inline formalization info

Misc Artifacts

SMTL = "SMT_program_local"  # smtlib code to check validity of each subargument with Z3
SMTG = "SMT_program_global"  # smtlib code to check global validity with Z3
AY = "argdown_augmented_yaml"  # argdown argument augmented with fake inline yaml from tool calls
QA = "evaluation_question_and_answer"  # argument analysis Q&A

Not all artifacts figure in all conversations.

Every conversation in deepa2-conversations unfolds a meaningful piecemeal reconstruction process, where the user guides the assistant through the analysis, instructing it to produce one artifact after the other.

But what is a meaningful reconstruction process, and how do we simulate it?

There are two types of basic constraints for meaningful reconstruction processes:

  1. Redundancy. If an artifact is entailed by a previously generated artifact, it would be redundant to produce it. For example: If the assistant has already generated the entire (correct) premise-conclusion structure, it doesn't make sense to ask the assistant to produce just the (correct) conclusion, for it is contained in the argument we already have.
  2. Dependency. Some artifacts depend on others. For example, to identify all reasons in a source text through annotation and to link them to their corresponding premises in the Argdown argument requires that the argument has been reconstructed before.

In addition, the experience from teaching argumentation analysis suggests that, at a given state of reconstruction, some steps are more sensible than others. For example, it's more promising to work with the text before you reconstruct and formalize the entire argument.

We model these ideas through

  • a weighted directed graph on the node set of all artifact types, plus a start- and a stop-node (conversational graph), and
  • a set of additional constraints for legal paths.

The following diagram depicts a subgraph of the entire conversational graph.

image/png

An abstract deepa2 conversation is a directed path from the start- to the stop-node which satisfies the additional constraints for legal paths. We can now randomly sample abstract legal conversational paths. Unfold for two examples:

In this conversation, the user presents a source text (ST) before asking the assistant ...
  1. to summarize the argument (title and gist),
  2. to informally map all reason and conjecture statements contained in the source text,
  3. to reconstruct the argument as premise-conclusion structure,
  4. to draft a SMTLIB code snippet which checks whether the premises of the argument deductively entail its conclusion.

image/png

In this conversation, the user presents the source text and instructs the assistant ...

  1. to reconstruct the entire argument as premise-conclusion structure,
  2. to paraphrase the source text (given the reconstruction),
  3. to summarize the argument's title and key point,
  4. to annotate the source text by marking reasons and conjectures.

image/png

With all this in place, the construction of the deepa2-conversations dataset is straightforward. For each split:

  1. We set the split size.
  2. We iterate through the upcycled deepa2 data until the desired number of examples has been generated:
    • randomly sample an abstract conversation path
    • render argumentation artifacts from current deepa2 record with templates
    • replace abstract artifact types with argumentation artifacts in conversation

Ingredient #2: deep-argmap-conversations

The second ingredient of the Argunauts-SFT mix is based on the recently released syncIALO dataset — a corpus of 1k+ argument maps with several hundreds of claims each.

The conversations in deep-argmap-conversations have typically 2 or 4 turns and involve user and assistant. The user instructs the assistant to create an Argdown argument map. While the inputs and the specific task may vary, the general idea for building this dataset is this:

  1. Sample a submap (not more than 20 arguments) from a randomly chosen syncIALO debate.
  2. Render the submap as neat and consistent Argdown argument map (full_argdown_map). This will serve as ANSWER.
  3. Transform the submap into inputs that contain less information than the submap.
  4. Insert the inputs into a suitable instruction template, which gives the INSTRUCTION.
  5. INSTRUCTION - ANSWER pairs are rendered as conversations.

Expand for a list of all subtasks and example templates (stylized and shortened for illustrative purposes).

Using the shuffled list of argument claims as inputs:

- role: user
  content: >
    Organize the following sentences into an informal argument map (argdown syntax)
    according to their pairwise dialectic relations.
    sentences:
    {{ sentence_list }}
- role: assistant
  content: >
    ```argdown\n{{ full_argdown_map }}\n```

Assistant is asked to add sentences to a partial / incomplete argument map:

- role: user
  content: >
    Organize the following sentences into an informal argument map according to
    their pairwise dialectic relations. Some of these dialectic relations are
    shown below (partialmap) and should be taken into account.
    sentences:
    {{ sentence_list }}
    partialmap:
    ```argdown
    {{ partialmap }}
    ```
- role: assistant
  content: >
    ```argdown\n{{ full_argdown_map }}\n```\n
- role: user
  content: >
    Complete and revise the following argument map by adding the missing sentences.
    incomplete argument map:
    ```argdown\n{{ partialmap }}\n```
    missing sentences:
    {{} sentence_list }}
- role: assistant
  content: >
    ```argdown\n{{ full_argdown_map }}\n```

Merge two or more maps into a single one:

- role: user
  content: >
    Merge the following argument maps into a single map. Streamline and add missing
    arguments as required.{% if hints | title %} In doing so, pay attention to the
    given hints.
    {% for partialmap in partialmaps %}
    argument map {{loop.index}}:
    ```argdown
    {{ partialmap }}
    ```
    {% endfor %}
- role: assistant
  content: >
    ```argdown\n{{ full_argdown_map }}\n```

The brokenmap is obtained by randomly distorting the correct map:

- role: user
  content: >
    Improve the following argument map, if possible. In doing so, pay attention to the given hints.
    argument map:
    ```argdown\n{{ brokenmap }}\n```
    {% if hints %}
    hints:\n {{ hints }}
    {% endif %}
- role: assistant
  content: >
    ```argdown\n{{ full_argdown_map }}\n```

Take a flat pros and cons list as starting point:

- role: user
  content: >
    Reconstruct the following pros and cons list as an argument map (using argdown syntax).
    pros and cons:
    ```argdown\n{{ proscons }}\n```
- role: assistant
  content: >
    ```argdown\n{{ full_argdown_map }}\n```

Using a free source text as input:

- role: user
  content: >
    Lay out the arguments presented in the following material as an informal argument map (argdown syntax).
    <material>
    {{ source_text }}
    </material>
    <context>
    {{ context }}
    </context>
    Just provide the argument map in your answer.
- role: assistant
  content: >
    ```argdown\n{{ full_argdown_map }}\n```

Asking the assistant to add individual claims (premises, conclusion) to an argument map:

- role: user
  content: >
    Your task is to expand the following argument map by adding the
    premises / conclusion(s) of the arguments mentioned below as additional
    claims.
    argument map:
    ```argdown\n{{ partialmap }}\n```
    Insert the premises / conclusion(s) of 
    the following arguments:
    {{ unfolded_nodes_labels }}
- role: assistant
  content: >
    ```argdown\n{{ full_argdown_map }}\n```

Finally, a task where the full argdown map figures as input:

- role: user
  content: >
    Your task is to reconstruct a single argument from the following argument map and render
    it in standard form (premise-conclusion structure).
    argument map:\n\n
    ```argdown\n{{ full_argdown_map }}\n```\n\n
    argument to reconstruct:\n\n
    ```argdown\n{{ argument_label }}\n```\n
- role: assistant
  content: >
    ```argdown\n{{ argument_argdown_snippet }}\n```

By applying these different pipelines to syncIALO, we create a diverse conversational dataset with demanding argument mapping tasks.

Ingredient #3: Tülu 3 SFT mix

We generously add AI2's Tülu 3 SFT mix to our training data in order to prevent unlearning valuable skills (aka "catastrophic forgetting").

Training

Conceptually, we conceive of this SFT training phase as continual pretraining, rather than finetuning. The main purpose is to make the model familiar with Argdown syntax and semantics. That's also why we use packing, and gradient accumulation to increase the effective batch size. With some chats in deepa2-conversations having up to 20x2 turns, we chose to train with a 8192 context window.

🤔 Why don't we train on completions only?

Every user query is a reasonable instruction at the given state of analysis and should be learned by model. (But I think testing the effect of masking is worthwhile.)

🤔 Chat templates and continual pretraining -- isn't that odd?

Maybe it's not common, but I suggest chat templates should be used extensively during pretraining (and maybe are so in textbooks-are-all-you-need projects). In future iterations I'd use a variety of different chat templates and formats to increase diversity during continual pretraining.

Technically, the training stack is built with:

Base model: meta-llama/Llama-3.1-8B-Instruct

Training parameters and set-up:

Parameter Value
learning rate 5e-6
lr scheduler cosine
context window size 8192
effective batch size 32
deepspeed ZeRO stage 3
deepspeed CPU offload false
spectrum top % snr ratios targeted 30%

Training took 35h on 2 H100.

Evaluation

Chat Experience

Let me share two of the first chats I've been having with ggbetz/Llama-3.1-Argunaut-1-8B-SFT-Q4-mlx.

👤 You
Hi, are you familiar with Argdown?

🤖 Assistant
Yes, I'm familiar with Argdown. It's a markup language for reasoning that allows you to write structured text in a natural and readable way. With Argdown, you can create argument maps where sentences are connected by lines that represent different relations (e.g., support, reject, reason). This makes it easy to visualize complex argumentative structures.

Would you like to create an Argdown argument map? Feel free to provide a topic and/or some sentences, and I'll help you set up the map!

👤 You
Can you please map the following text?

School uniforms deter crime and increase student safety. In Long Beach, California, after two years of a district-wide K-8 mandatory uniform policy, reports of assault and battery in the district’s schools decreased by 34%, assault with a deadly weapon dropped by 50%, fighting incidents went down by 51%, sex offenses were cut by 74%, robbery dropped by 65%, possession of weapons (or weapon “look-alikes”) decreased by 52%, possession of drugs went down by 69%, and vandalism was lowered by 18%. [64]

One year after Sparks Middle School in Nevada instituted a uniform policy, school police data showed a 63% drop in police log reports, and decreases were also noted in gang activity, student fights, graffiti, property damage, and battery. A peer-reviewed study found that schools with uniform policies had 12% fewer firearm-related incidents and 15% fewer drug-related incidents than schools without uniforms. [25][69]

School uniforms also prevent students from concealing weapons under baggy clothing, make it easier to keep track of students on field trips, and make intruders on campus more visible. Frank Quatrone, superintendent in the Lodi school district of New Jersey, states, “When you have students dressed alike, you make them safer. If someone were to come into a building, the intruder could easily be recognized.” [6][38]

Further, school uniforms create a level playing field among students, reducing peer pressure and bullying. When all students are dressed alike, competition between students over clothing choices and the teasing of those who are dressed in less expensive or less fashionable outfits can be eliminated. Research by the Schoolwear Association found that 83% of teachers thought “a good school uniform…could prevent bullying based on appearance or economic background.” Arminta Jacobson, founder and director of the Center for Parent Education at the University of North Texas, states that uniforms put “all kids on the same playing field in terms of their appearance. I think it probably gives them a sense of belonging and a feeling of being socially accepted.” [5][91]

🤖 Assistant

School uniforms deter crime and increase student safety.
    <+ In Long Beach, California, after two years of a district-wide K-8 mandatory uniform policy, reports of assault and battery in the district’s schools decreased by 34%, assault with a deadly weapon dropped by 50%, fighting incidents went down by 51%, sex offenses were cut by 74%, robbery dropped by 65%, possession of weapons (or weapon “look-alikes”) decreased by 52%, possession of drugs went down by 69%, and vandalism was lowered by 18%. [64]
    <+ One year after Sparks Middle School in Nevada instituted a uniform policy, school police data showed a 63% drop in police log reports, and decreases were also noted in gang activity, student fights, graffiti, property damage, and battery. A peer-reviewed study found that schools with uniform policies had 12% fewer firearm-related incidents and 15% fewer drug-related incidents than schools without uniforms. [25][69]
    <+ School uniforms also prevent students from concealing weapons under baggy clothing, make it easier to keep track of students on field trips, and make intruders on campus more visible. Frank Quatrone, superintendent in the Lodi school district of New Jersey, states, “When you have students dressed alike, you make them safer. If someone were to come into a building, the intruder could easily be recognized.” [6][38]
    <+ Further, school uniforms create a level playing field among students, reducing peer pressure and bullying. When all students are dressed alike, competition between students over clothing choices and the teasing of those who are dressed in less expensive or less fashionable outfits can be eliminated. Research by the Schoolwear Association found that 83% of teachers thought “a good school uniform…could prevent bullying based on appearance or economic background.” Arminta Jacobson, founder and director of the Center for Parent Education at the University of North Texas, states that uniforms put “all kids on the same playing field in terms of their appearance. I think it probably gives them a sense of belonging and a feeling of being socially accepted.” [5][91]

👤 You
Please summarize, in your own words, the main argument presented in the text. No more than 2-3 sentemces, please.

🤖 Assistant
School uniforms are an effective solution to crime and safety issues in schools, as evidenced by the reduction of incidents in various institutions that implemented uniform policies.

👤 You
Let's focus on this argument. Just for the protocol, what's the conclusion?

🤖 Assistant
The text argues that:
School uniforms deter crime and increase student safety.

By examining this argument, we can now proceed to reconstruct it. Please provide the next step in this inference guide:

👤 You
Can you reconstruct this argument as premise conclusion structure, using argdown syntax?

🤖 Assistant
The argument can be reconstructed as follows:

School uniforms deter crime and increase student safety.
    <+ In Long Beach, California, after two years of a district-wide K-8 mandatory uniform policy, reports of assault and battery in the district’s schools decreased by 34%, assault with a deadly weapon dropped by 50%, fighting incidents went down by 51%, sex offenses were cut by 74%, robbery dropped by 65%, possession of weapons (or weapon “look-alikes”) decreased by 52%, possession of drugs went down by 69%, and vandalism was lowered by 18%. [64]
    <+ One year after Sparks Middle School in Nevada instituted a uniform policy, school police data showed a 63% drop in police log reports, and decreases were also noted in gang activity, student fights, graffiti, property damage, and battery. A peer-reviewed study found that schools with uniform policies had 12% fewer firearm-related incidents and 15% fewer drug-related incidents than schools without uniforms. [25][69]
    <+ School uniforms also prevent students from concealing weapons under baggy clothing, make it easier to keep track of students on field trips, and make intruders on campus more visible. Frank Quatrone, superintendent in the Lodi school district of New Jersey, states, “When you have students dressed alike, you make them safer. If someone were to come into a building, the intruder could easily be recognized.” [6][38]
    <+ Further, school uniforms create a level playing field among students, reducing peer pressure and bullying. When all students are dressed alike, competition between students over clothing choices and the teasing of those who are dressed in less expensive or less fashionable outfits can be eliminated. Research by the Schoolwear Association found that 83% of teachers thought “a good school uniform…could prevent bullying based on appearance or economic background.” Arminta Jacobson, founder and director of the Center for Parent Education at the University of North Texas, states that uniforms put “all kids on the same playing field in terms of their appearance. I think it probably gives them a sense of belonging and a feeling of being socially accepted.” [5][91]

🛑 Poor answer

That's NOT a reconstruction in standard form, but just a repetition of the previously given map.

👤 You
No, I really want you to reconstruct the argument in standard form, listing premises and conclusions.

🤖 Assistant

School uniforms deter crime and increase student safety.
    <+ <Subargument 1>
        <+ In Long Beach, California, after two years of a district-wide K-8 mandatory uniform policy, reports of assault and battery in the district’s schools decreased by 34%, assault with a deadly weapon dropped by 50%, fighting incidents went down by 51%, sex offenses were cut by 74%, robbery dropped by 65%, possession of weapons (or weapon “look-alikes”) decreased by 52%, possession of drugs went down by 69%, and vandalism was lowered by 18%. [64]
        <+ One year after Sparks Middle School in Nevada instituted a uniform policy, school police data showed a 63% drop in police log reports, and decreases were also noted in gang activity, student fights, graffiti, property damage, and battery. A peer-reviewed study found that schools with uniform policies had 12% fewer firearm-related incidents and 15% fewer drug-related incidents than schools without uniforms. [25][69]
        <+ School uniforms also prevent students from concealing weapons under baggy clothing, make it easier to keep track of students on field trips, and make intruders on campus more visible. Frank Quatrone, superintendent in the Lodi school district of New Jersey, states, “When you have students dressed alike, you make them safer. If someone were to come into a building, the intruder could easily be recognized.” [6][38]
        <+ Further, school uniforms create a level playing field among students, reducing peer pressure and bullying. When all students are dressed alike, competition between students over clothing choices and the teasing of those who are dressed in less expensive or less fashionable outfits can be eliminated. Research by the Schoolwear Association found that 83% of teachers thought “a good school uniform…could prevent bullying based on appearance or economic background.” Arminta Jacobson, founder and director of the Center for Parent Education at the University of North Texas, states that uniforms put “all kids on the same playing field in terms of their appearance. I think it probably gives them a sense of belonging and a feeling of being socially accepted.” [5][91]

🛑 Poor answer

The assistant fails to correct the answer, despite the user's directions.

I'm probing the skill to analyse premise-conclusion structures in the second chat: expand to read.

👤 You
Hi, are you familiar with Argdown and rendering arguments as premise conclusion structures?

🤖 Assistant
Yes, I am familiar with Argdown and can render arguments as premise-conclusion structures. What would you like to discuss or have me demonstrate?

👤 You
Can you please describe the main argument presented in the following text?

School uniforms deter crime and increase student safety. In Long Beach, California, after two years of a district-wide K-8 mandatory uniform policy, reports of assault and battery in the district’s schools decreased by 34%, assault with a deadly weapon dropped by 50%, fighting incidents went down by 51%, sex offenses were cut by 74%, robbery dropped by 65%, possession of weapons (or weapon “look-alikes”) decreased by 52%, possession of drugs went down by 69%, and vandalism was lowered by 18%. [64]

One year after Sparks Middle School in Nevada instituted a uniform policy, school police data showed a 63% drop in police log reports, and decreases were also noted in gang activity, student fights, graffiti, property damage, and battery. A peer-reviewed study found that schools with uniform policies had 12% fewer firearm-related incidents and 15% fewer drug-related incidents than schools without uniforms. [25][69]

School uniforms also prevent students from concealing weapons under baggy clothing, make it easier to keep track of students on field trips, and make intruders on campus more visible. Frank Quatrone, superintendent in the Lodi school district of New Jersey, states, “When you have students dressed alike, you make them safer. If someone were to come into a building, the intruder could easily be recognized.” [6][38]

Further, school uniforms create a level playing field among students, reducing peer pressure and bullying. When all students are dressed alike, competition between students over clothing choices and the teasing of those who are dressed in less expensive or less fashionable outfits can be eliminated. Research by the Schoolwear Association found that 83% of teachers thought “a good school uniform…could prevent bullying based on appearance or economic background.” Arminta Jacobson, founder and director of the Center for Parent Education at the University of North Texas, states that uniforms put “all kids on the same playing field in terms of their appearance. I think it probably gives them a sense of belonging and a feeling of being socially accepted.” [5][91]

🤖 Assistant
School uniforms are not only a traditional and respectful way for institutions to create an environment of unity, but they also benefit students' safety and performance. The implementation of school uniforms can lead to a significant reduction in crime rates, improved student behavior, and enhanced safety on school campuses.

👤 You
Can you please reconstruct this argument in standard form (premise-conclusion structure)?

🤖 Assistant

(1) Implementing school uniforms has led to a significant reduction in crime rates in the Long Beach, California district.
(2) The reduction in crime rates is a result of the uniform policy and not just a coincidence.
(3) The reduction in crime rates is a desirable outcome that contributes to the overall safety and well-being of students.
(4) The reduction in crime rates is a result of the uniform policy and not just a coincidence.
-----
(5) Implementing school uniforms is an effective way to deter crime and increase student safety.

🛑 Poor answer

Mechanically mimics the argument reconstructions in training data (deep-argmap): senselessly copying a proposition to get to 4 premises.

👤 You
Can you please add intermediary conclusions and add inference rule?

🤖 Assistant
Sure, here are the intermediate steps leading to the final conclusion:

Implementing school uniforms has led to a significant reduction in crime rates in the Long Beach, California district.
The reduction in crime rates is a result of the uniform policy and not just a coincidence.
The reduction in crime rates is a desirable outcome that contributes to the overall safety and well-being of students.
Implementing school uniforms is an effective way to deter crime and increase student safety.

👤 You
Please render the refined reconstruction in Argdown.

🤖 Assistant
The argument can be represented as follows:

[Conclusion 1]: Implementing school uniforms is an effective way to deter crime and increase student safety.
    <+ <Subargument 1>
        <+ [Premise]: Implementing school uniforms has led to a significant reduction in crime rates in the Long Beach, California district.
        <+ [Premise]: The reduction in crime rates is a result of the uniform policy and not just a coincidence.
        <+ [Premise]: The reduction in crime rates is a desirable outcome that contributes to the overall safety and well-being of students.

👤 You
And as premise-conclusion structure, with inference info added?

🤖 Assistant

[Conclusion 1]: Implementing school uniforms is an effective way to deter crime and increase student safety.
    <+ <Subargument 1>
        <+ [Premise]: Implementing school uniforms has led to a significant reduction in crime rates in the Long Beach, California district.
        <+ [Premise]: The reduction in crime rates is a result of the uniform policy and not just a coincidence.
        <+ [Premise]: The reduction in crime rates is a desirable outcome that contributes to the overall safety and well-being of students.

🛑 Poor answers

Model fails to follow user's directions, doesn't refine the premise-conclusion analysis.

Argdown-syntax-wise all this looks fine, but my subjective impression is that the model is sticking to the patterns it has seen during continual pretraining, and is a bit stubborn when being instructed to do Argdown stuff that deviates from what is in the synthetic training corpus.

Metrics

We assess the performance of DebateLabKIT/Llama-3.1-Argunaut-1-8B-SFTon three benchmarks:

Argdown Bench is a newly created test suite. It contains argument mapping tasks from deep-argmap-conversations' test split. We parse model completions with our own Argdown parser and assess whether the model's answer

  1. is syntactically valid Argdown code (pass@x),
  2. is close to the gold answer (as measured by networkx's graph distance measure),
  3. is fully isomorphic to the gold answer (zero graph distance).

We run all the evals on our own set-up (vLLM inference server), obtaining:

Llama-3.1-8B-Instruct Argunaut-1-8B-SFT
⬇️ Argdown Bench
pass@1 80.8 98.9
pass@5 98.9 99.8
graph sim 8.4 65.5
isomorphic 1.3 44.0
🤗 HF Leaderboard
MMLU pro 37.6 34.5
MUSR 40.1 41.6
GPQA Diamond 32.8 30.8
GPQA Main 28.5 32.1
MATH 12.5 9.1
BBH 54.7 48.2
⛓️ COT Leaderboard
LogiQA 5.9 1.4
LogiQA2 15.5 0.8
LSAT-ar 11.7 3.0
LSAT-lr 20.8 3.9
LSAT-rc 27.5 13.8

✅ The Argdown Bench results strongly confirm that Argunaut-1-8B-SFT has actually learned Argdown syntax and semantics. This is consistent with the legal argdown snippets we observe in the chats above.

✅ The HF Leaderboard results suggest that the performance in these generic multiple choice tasks has slightly, but certainly not dramatically decreased. No catastrophic loss of skills here.

🛑 The COT Leaderboard results, however, identify a disastrous drop in CoT effectiveness: Argunaut-1-8B-SFT seems to have lost the base model's ability to increase the accuracy of its answers through CoT reasoning.

Next Steps

  • Identify training data that goes beyond our synthetic Argunauts SFT mix: diverse, high quality demonstrations of logical analyses, exemplifying, in particular, consecutive revisions of previous reconstructions as well as Argdown's syntactic flexibility;
  • Train Argunaut-1-8B-SFT repeatedly with online DPO.

Community

Sign up or log in to comment