# Model Information

This model was fine-tuned from the base meta-llama/Llama-3.1-8B-Instruct LLM on a multi-task dataset covering the following tasks:
- Argument Component classification (ACC)
- Argument Quality assessment (AQ)
- Argument Relation classification (AR)
- Claim Detection (CD)
- Evidence Detection (ED)
- Evidence Type classification (ET)
- Fallacy Detection (FD)
- Stance Detection (SD)
- **Developed by:** Henri Savigny
- **Funded by:** University Claude Bernard Lyon 1 - Project AMELIA
## Model Sources

- GitHub repository: https://github.com/brunoyun/amelia
- Paper: TBC
## How to Get Started with the Model

The model uses a temperature of 1.5 and min-p sampling of 0.1.

### Using Unsloth
```python
from unsloth import FastLanguageModel
from transformers import TextStreamer

# Load the model and tokenizer.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name='brunoyun/Llama-3.1-Amelia-MTFT-8B-v1',
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=False,
    gpu_memory_utilization=0.6,
)
FastLanguageModel.for_inference(model)

# Example prompt for the Argument Relation (AR) task.
messages = [
    {'role': 'system', 'content': 'You are an expert in argumentation. Your task is to determine the type of relation between [SOURCE] and [TARGET]. The type of relation would be in the [RELATION] set. Utilize the [TOPIC] as context to support your decision\nYour answer must be in the following format with only the type of the relation in the answer section:\n<|ANSWER|><answer><|ANSWER|>.'},
    {'role': 'user', 'content': "[RELATION]: {'no relation', 'attack', 'support'}\n[TOPIC]: unknown\n[SOURCE]: WHO grade 3 or 4 neutropenia and diarrhoea affected 46% and 16%, respectively, of patients treated with CPT-11 + LFA 5-FU.\n[TARGET]: The TOM + LFA 5-FU regimen showed a RR and a toxicity profile very close to the MTX + LFA 5-FU combination, and does not deserve further evaluation in advanced colorectal cancer patients.\n"},
]

# Stream the generated answer, using the recommended sampling parameters.
txt_streamer = TextStreamer(tokenizer, skip_prompt=True)
txt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors='pt',
).to('cuda')
_ = model.generate(
    txt,
    streamer=txt_streamer,
    max_new_tokens=128,
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    temperature=1.5,
    min_p=0.1,
)
```
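As required by the system prompt, the model answers with only the predicted label between answer markers, e.g. `<|ANSWER|>support<|ANSWER|>` (the actual label depends on sampling).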
### Using LM Studio

To use the GGUF model with LM Studio, set the following sampling parameters:
- temperature: 1.5
- min_p sampling: 0.1

or use the following preset:
```json
{
  "name": "Llama 3 V2",
  "inference_params": {
    "input_prefix": "<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n",
    "input_suffix": "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
    "pre_prompt": "You are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.",
    "pre_prompt_prefix": "<|start_header_id|>system<|end_header_id|>\n\n",
    "pre_prompt_suffix": "",
    "antiprompt": [
      "<|start_header_id|>",
      "<|eot_id|>"
    ]
  }
}
```
Moreover, to obtain the best performance on each task, prepend to your prompt the system prompt used for that task (see the model cards of the individual models used in the merge).
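As an illustration, once the GGUF model is loaded into LM Studio's local server (an OpenAI-compatible endpoint, served at http://localhost:1234/v1 by default), it can be queried as in the sketch below. The model identifier, port, and prompt contents are assumptions that depend on your local setup; min_p (0.1) is taken from the preset configured in the LM Studio UI.

```python
# Hypothetical sketch: querying the GGUF model through LM Studio's
# OpenAI-compatible local server. base_url, api_key, and the model
# identifier depend on your local LM Studio configuration.
from openai import OpenAI

client = OpenAI(base_url='http://localhost:1234/v1', api_key='lm-studio')
response = client.chat.completions.create(
    model='llama-3.1-amelia-mtft-8b-v1',  # placeholder: use the name shown in LM Studio
    messages=[
        # Prepend the task-specific system prompt, as explained above.
        {'role': 'system', 'content': '<task-specific system prompt>'},
        {'role': 'user', 'content': '<task input>'},
    ],
    temperature=1.5,  # min_p (0.1) is set in the LM Studio preset itself
    max_tokens=128,
)
print(response.choices[0].message.content)
```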
## Training Details

### Training Data

This model was trained on 32,000 examples (4,000 × 8 tasks). The training dataset for each task was extracted from the dataset of the corresponding single-task model:
- brunoyun/Llama-3.1-Amelia-ACC-8B-v1 (ACC)
- brunoyun/Llama-3.1-Amelia-AQ-8B-v1 (AQ)
- brunoyun/Llama-3.1-Amelia-AR-8B-v1 (AR)
- brunoyun/Llama-3.1-Amelia-CD-8B-v1 (CD)
- brunoyun/Llama-3.1-Amelia-ED-8B-v1 (ED)
- brunoyun/Llama-3.1-Amelia-ET-8B-v1 (ET)
- brunoyun/Llama-3.1-Amelia-FD-8B-v1 (FD)
- brunoyun/Llama-3.1-Amelia-SD-8B-v1 (SD)
The samples used for training can be accessed from the GitHub repository.
### Training Procedure

We used LoRA with the Unsloth library.

#### Training Hyperparameters

- **Training regime:** model loaded in 4-bit by Unsloth, LoRA r=16, LoRA alpha=16, batch_size=32, epochs=2. The full training code can be viewed in the GitHub repository; a minimal sketch of this configuration is given below.
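The following is a minimal sketch of the configuration above using Unsloth with TRL's `SFTTrainer`, not the exact training script. The dataset path, target modules, and the split of the batch size into per-device size and gradient accumulation are assumptions; see the GitHub repository for the actual code.

```python
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load the base model in 4-bit, as described above.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name='meta-llama/Llama-3.1-8B-Instruct',
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach the LoRA adapters (r=16, alpha=16); target_modules is an assumption.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj',
                    'gate_proj', 'up_proj', 'down_proj'],
)

# Hypothetical path: the 32,000-example multi-task training sample.
train_dataset = load_dataset('json', data_files='train.jsonl', split='train')

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field='text',
    args=TrainingArguments(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,  # effective batch size of 32
        num_train_epochs=2,
        output_dir='outputs',
    ),
)
trainer.train()
```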
## Evaluation

### Testing Data & Metrics

#### Testing Data

This model was tested on 6,400 examples (800 × 8 tasks). The testing dataset for each task was extracted from the dataset of the corresponding single-task model:
- brunoyun/Llama-3.1-Amelia-ACC-8B-v1 (ACC)
- brunoyun/Llama-3.1-Amelia-AQ-8B-v1 (AQ)
- brunoyun/Llama-3.1-Amelia-AR-8B-v1 (AR)
- brunoyun/Llama-3.1-Amelia-CD-8B-v1 (CD)
- brunoyun/Llama-3.1-Amelia-ED-8B-v1 (ED)
- brunoyun/Llama-3.1-Amelia-ET-8B-v1 (ET)
- brunoyun/Llama-3.1-Amelia-FD-8B-v1 (FD)
- brunoyun/Llama-3.1-Amelia-SD-8B-v1 (SD)
The samples used for testing can be accessed from the GitHub repository.
#### Metrics

To evaluate the argument mining tasks, we used standard classification metrics: $F_1$ score, precision, and recall. In the case of the fallacy detection task, where a single sentence $s$ has a set of fallacies identified as true labels $F_s$, we adapted the precision and the recall. Given the sampled testing dataset $T$ with 800 elements (see previous section), we build the multiset $T'$ where sentences are repeated as many times as the number of corresponding true labels. Namely, identifying each occurrence with a (sentence, label) pair, we have:

$$T' = \{(s, f) \mid s \in T,\ f \in F_s\}$$

Given our non-deterministic model $\phi$, we obtain one prediction $\phi(s)$ for each occurrence $(s, f) \in T'$. The new precision is the fraction of correct predictions among all predictions, where a prediction is correct if the predicted fallacy belongs to the set of true labels:

$$\text{Precision} = \frac{|\{(s, f) \in T' \mid \phi(s) \in F_s\}|}{|T'|}$$

Recall was measured based on the consistency of the prediction distribution, i.e., over a series of instances annotated with the same fallacies, the model was expected to generate each corresponding fallacy label with similar frequency.
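As an illustration of the adapted precision, here is a minimal sketch; the data structures and label names are hypothetical, and one prediction is assumed to be sampled per occurrence in $T'$:

```python
# Minimal sketch of the adapted precision for fallacy detection.
# t_prime: list of (sentence, true_fallacy) pairs, one per true label,
#          i.e. the multiset T' built from the testing dataset.
# true_labels: dict mapping each sentence s to its true-label set F_s.
# predictions: one sampled model prediction per element of t_prime.
def adapted_precision(t_prime, true_labels, predictions):
    correct = sum(
        pred in true_labels[sentence]
        for (sentence, _), pred in zip(t_prime, predictions)
    )
    return correct / len(t_prime)

# Toy usage: sentence 's1' has two true fallacies, so it appears twice in T'.
F = {'s1': {'ad hominem', 'strawman'}, 's2': {'slippery slope'}}
T_prime = [('s1', 'ad hominem'), ('s1', 'strawman'), ('s2', 'slippery slope')]
preds = ['ad hominem', 'ad hominem', 'appeal to emotion']
print(adapted_precision(T_prime, F, preds))  # 2 of 3 predictions correct -> 0.67
```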
### Results
Model | ACC | CD | ED | AR | ET | SD | FD_Single | FD_Multi | AQ |
---|---|---|---|---|---|---|---|---|---|
Llama 3.1 8B zero-shot | 73.52% | 51.50% | 17.06% | 28.32% | 37.41% | 14.10% | 44.07% | 21.77% | 15.10% |
Llama 3.1 8B few-shot | 75.47% | 67.83% | 64.20% | 35.97% | 49.31% | 80.00% | 48.50% | 17.25% | 31.83% |
Llama 3.1 8B fine-tuned for ACC | 89.61% | 61.35% | 68.25% | 38.51% | 41.43% | 65.82% | 38.43% | 21.58% | 33.07% |
Llama 3.1 8B fine-tuned for CD | 50.18% | 85.16% | 68.91% | 38.29% | 33.91% | 66.97% | 38.90% | 22.67% | 31.24% |
Llama 3.1 8B fine-tuned for ED | 63.32% | 74.94% | 78.00% | 28.60% | 38.67% | 68.42% | 39.65% | 18.47% | 29.01% |
Llama 3.1 8B fine-tuned for AR | 50.81% | 59.98% | 67.00% | 87.20% | 35.07% | 76.00% | 35.14% | 25.86% | 27.97% |
Llama 3.1 8B fine-tuned for ET | 56.10% | 67.08% | 61.45% | 26.88% | 75.22% | 69.82% | 46.78% | 29.68% | 29.03% |
Llama 3.1 8B fine-tuned for SD | 50.93% | 48.88% | 57.62% | 38.26% | 39.17% | 94.63% | 43.23% | 20.99% | 20.39% |
Llama 3.1 8B fine-tuned for FD | 66.58% | 65.13% | 64.50% | 38.64% | 46.83% | 64.32% | 82.92% | 50.77% | 41.90% |
Llama 3.1 8B fine-tuned for AQ | 74.46% | 59.73% | 68.00% | 30.86% | 44.06% | 60.43% | 47.98% | 24.31% | 69.54% |
GGUF_ACC | 87.73% | 63.59% | 63.75% | 36.31% | 37.98% | 64.63% | 30.19% | 29.27% | 32.94% |
GGUF_CD | 54.10% | 81.92% | 60.70% | 36.43% | 31.99% | 63.82% | 30.00% | 31.21% | 33.20% |
GGUF_ED | 56.20% | 63.72% | 71.62% | 34.63% | 36.22% | 61.84% | 34.10% | 34.54% | 34.77% |
GGUF_AR | 55.19% | 60.25% | 63.70% | 84.57% | 31.71% | 76.50% | 29.94% | 34.18% | 32.15% |
GGUF_ET | 58.23% | 64.37% | 58.59% | 29.14% | 72.47% | 68.20% | 39.05% | 32.94% | 31.48% |
GGUF_SD | 56.70% | 50.75% | 57.75% | 38.27% | 33.67% | 93.75% | 34.66% | 30.32% | 21.43% |
GGUF_FD | 62.20% | 59.91% | 62.88% | 35.51% | 42.52% | 64.68% | 74.08% | 62.16% | 41.69% |
GGUF_AQ | 67.08% | 59.73% | 69.50% | 31.17% | 41.31% | 61.16% | 41.86% | 30.02% | 66.53% |
Llama 3.1 8B fine-tuned Multi-task | 90.74% | 84.71% | 77.75% | 88.33% | 73.84% | 95.75% | 82.53% | 50.22% | 69.80% |
Merged Model | 78.72% | 70.69% | 69.62% | 72.52% | 54.60% | 77.04% | 57.00% | 35.03% | 57.52% |
GGUF_Merged | 65.95% | 65.83% | 62.13% | 62.93% | 49.06% | 74.38% | 50.04% | 40.75% | 44.97% |
## Intended Use

### Intended Use Cases
Llama 3.1 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks. The Llama 3.1 model collection also supports the ability to leverage the outputs of its models to improve other models including synthetic data generation and distillation. The Llama 3.1 Community License allows for these use cases.
### Out-of-scope

Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.1 Community License. Use in languages beyond those explicitly referenced as supported in this model card.
## Bias, Risks, and Limitations

This model is a new technology, and like any new technology, there are risks associated with its use. Testing conducted to date has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, its potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased, or otherwise objectionable responses to user prompts. Therefore, before deploying any applications of this model, developers should perform safety testing and tuning tailored to their specific applications.