Model Information

This is a merge of pre-trained language models created using mergekit for the following argument mining tasks:

  • Argument Component classification (ACC)

  • Argument Quality assessment (AQ)

  • Argument Relation classification (AR)

  • Claim Detection (CD)

  • Evidence Detection (ED)

  • Evidence Type classification (ET)

  • Fallacies detection (FD)

  • Stance Detection (SD)

  • Developed by: Henri Savigny 

  • Funded by: University Claude Bernard, Lyon 1 - Project AMELIA

 

How to Get Started with the Model

Use the model with a temperature of 1.5 and min-p sampling of 0.1.
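To make these two settings concrete, here is a minimal pure-Python sketch of temperature scaling followed by min-p filtering. The helper name `min_p_filter` and the toy logits are hypothetical, and the order of the two steps is an assumption (inference backends may apply them in a different order); it illustrates the idea and is not the sampler used by any particular library.

```python
import math


def min_p_filter(logits, temperature=1.5, min_p=0.1):
    """Return renormalized token probabilities after temperature scaling
    and min-p filtering (hypothetical helper, for illustration only)."""
    # Temperature scaling: values above 1 flatten the distribution.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    top = max(scaled)
    exps = [math.exp(x - top) for x in scaled]
    norm = sum(exps)
    probs = [e / norm for e in exps]
    # Min-p: drop tokens whose probability is below min_p * (max probability).
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


# Toy logits for three candidate tokens; the last one is far less likely
# than the first and falls below the min-p threshold.
probs = min_p_filter([3.0, 2.0, -3.0])
print(probs)
```

A high temperature such as 1.5 makes unlikely tokens more probable, and the min-p threshold then removes the tokens that remain far below the most likely one.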

Using Unsloth

The following code snippet showcases how to use the merged model for the argument component classification task.

from unsloth import FastLanguageModel
from transformers import TextStreamer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name='brunoyun/Llama-3.1-Amelia-MTMERGED-8B-v1',
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=False,
    gpu_memory_utilization=0.6,
)

FastLanguageModel.for_inference(model)

messages = [{'role': 'system', 'content': 'You are an expert in argumentation. Your task is to determine whether the given [SENTENCE] is a Claim or a Premise. Utilize the [TOPIC] and the [FULL TEXT] as context to support your decision\nYour answer must be in the following format with only Claim or Premise in the answer section:\n<|ANSWER|><answer><|ANSWER|>.'}, {'role': 'user', 'content': '[TOPIC]: unknown\n[SENTENCE]: in most industrialized countries, there is a number of companies that holds heavy factories in their constitutions\n[FULL TEXT]: Air pollution\n\nAir pollution is an issue threatening the environment in many different ways such as causing holes in the ozone and affecting global heating in the negative way. I believe that more responsibilities should put on private individuals and companies such as paying to clean up air pollution. \nFirst, in most industrialized countries, there is a number of companies that holds heavy factories in their constitutions. The waste products and harmful gases produced by these factories cause a significant amount of air pollution. While these companies make a huge amount of money from their businesses, most of them are not considering to take precautions to reduce the amount of air pollution. If these companies were obliged to pay to clean up the air pollution, they would at least make an effort to reduce the amount of air pollution they cause.\nSecond, private individuals cause air pollution in several ways such as inessential use of cars and house heating. As house heating is a major need in the winter time, some households do not consider to use filters to reduce the amount of pollution that the heating causes. Instead of this, some expensive vehicles cause a huge amount of pollution. In my opinion, by charging private individuals for causing air pollution, that will at least contribute to reduce the amount of the pollution. '
'\nOn the other hand, I believe that governments should restrict the use of products that cause air pollution for both individuals and companies. Governments also should use an amount of money from their budgets to clean up the air, instead of putting all the responsibility to companies and private individuals. \nIn conclusion, due to the fact that governments are responsible to provide a healthy environment for their inhabitants and they should be responsible from cleaning up the air pollution, both companies and private individuals should take the main part to clean up the air pollution.\n'}]

txt_streamer = TextStreamer(tokenizer, skip_prompt=True)

txt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to('cuda')

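out = model.generate(
    txt,
    streamer=txt_streamer,
    max_new_tokens=128,
    do_sample=True,   # enable sampling so temperature and min_p take effect
    temperature=1.5,  # recommended sampling settings (see above)
    min_p=0.1,
    pad_token_id=tokenizer.eos_token_id
)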
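The system prompt in the snippet above asks the model to wrap its answer in `<|ANSWER|>` markers. A minimal sketch of how the label could be extracted from the decoded output is shown below. The helper `extract_answer` and the sample strings are hypothetical and not part of the released code; the decoded text could be obtained, for instance, with `tokenizer.decode(out[0][txt.shape[-1]:])`.

```python
import re

# Matches the text between two <|ANSWER|> markers, as requested in the system prompt.
ANSWER_PATTERN = re.compile(r"<\|ANSWER\|>(.*?)<\|ANSWER\|>", re.DOTALL)


def extract_answer(generated_text, allowed=("Claim", "Premise")):
    """Return the label found between the <|ANSWER|> markers,
    or None if the markers are missing or the label is not allowed."""
    match = ANSWER_PATTERN.search(generated_text)
    if match is None:
        return None
    answer = match.group(1).strip()
    return answer if answer in allowed else None


# Hypothetical decoded outputs, for illustration only.
print(extract_answer("<|ANSWER|>Premise<|ANSWER|><|eot_id|>"))
print(extract_answer("The sentence is a premise."))
```

Returning `None` for malformed outputs makes it easy to count format failures separately from wrong labels, which matters at a sampling temperature as high as 1.5.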

Using LM Studio

To use the GGUF model with LM Studio, set the following sampling parameters:

  • temperature: 1.5
  • min_p sampling: 0.1

or use the following preset:

{
  "name": "Llama 3 V2",
  "inference_params": {
    "input_prefix": "<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n",
    "input_suffix": "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
    "pre_prompt": "You are a helpful, smart, kind, and efficient AI assistant. You always fulfill the user's requests to the best of your ability.",
    "pre_prompt_prefix": "<|start_header_id|>system<|end_header_id|>\n\n",
    "pre_prompt_suffix": "",
    "antiprompt": [
      "<|start_header_id|>",
      "<|eot_id|>"
    ]
  }
}

Moreover, to obtain the best performance on a given task, prepend the system prompt used for that task to your prompt (see the model cards of the individual models used in the merge).
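As an illustration, the sketch below assembles a raw Llama 3 prompt from the prefix and suffix fields of the preset above, with the preset's generic `pre_prompt` replaced by a task-specific system prompt. The helper `assemble_prompt` is hypothetical, and the concatenation order is an assumption about how such presets are applied, so treat it as a sketch rather than a description of LM Studio's internals.

```python
# Prefix/suffix fields copied from the preset above.
PRESET = {
    "input_prefix": "<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n",
    "input_suffix": "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
    "pre_prompt_prefix": "<|start_header_id|>system<|end_header_id|>\n\n",
    "pre_prompt_suffix": "",
}

# Opening of the ACC system prompt from the Unsloth example, used in place
# of the preset's generic "pre_prompt".
SYSTEM_PROMPT = (
    "You are an expert in argumentation. Your task is to determine whether "
    "the given [SENTENCE] is a Claim or a Premise."
)


def assemble_prompt(system_prompt, user_input, preset=PRESET):
    """Concatenate the preset fields around the system prompt and user input
    (assumed order: system block, then user block, then assistant header)."""
    return (
        preset["pre_prompt_prefix"] + system_prompt + preset["pre_prompt_suffix"]
        + preset["input_prefix"] + user_input + preset["input_suffix"]
    )


prompt = assemble_prompt(SYSTEM_PROMPT, "[TOPIC]: unknown\n[SENTENCE]: ...")
print(prompt)
```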

Merge Details

Merge Method

This model was merged using the Linear DELLA merge method, with unsloth/Llama-3.1-8B-Instruct as the base model.

Models Merged

The following models were included in the merge:

  • brunoyun/Llama-3.1-Amelia-ED-8B-v1
  • brunoyun/Llama-3.1-Amelia-CD-8B-v1
  • brunoyun/Llama-3.1-Amelia-SD-8B-v1
  • brunoyun/Llama-3.1-Amelia-AQ-8B-v1
  • brunoyun/Llama-3.1-Amelia-ACC-8B-v1
  • brunoyun/Llama-3.1-Amelia-ET-8B-v1
  • brunoyun/Llama-3.1-Amelia-FD-8B-v1
  • brunoyun/Llama-3.1-Amelia-AR-8B-v1

Configuration

The following YAML configuration was used to produce this model:

models:
  - model: meta-llama/Llama-3.1-8B-Instruct # base model
  - model: brunoyun/Llama-3.1-Amelia-ACC-8B-v1 # ACC Model
    parameters:
      density: 0.7
      epsilon: 0.25
      weight: 0.15
  - model: brunoyun/Llama-3.1-Amelia-CD-8B-v1 # CD Model
    parameters:
      density: 0.4
      epsilon: 0.3
      weight: 0.03
  - model: brunoyun/Llama-3.1-Amelia-ED-8B-v1 # ED Model
    parameters:
      density: 0.4
      epsilon: 0.3
      weight: 0.03
  - model: brunoyun/Llama-3.1-Amelia-ET-8B-v1 # ET Model
    parameters:
      density: 0.9
      epsilon: 0.05
      weight: 0.2
  - model: brunoyun/Llama-3.1-Amelia-FD-8B-v1 # FD Model
    parameters:
      density: 0.8
      epsilon: 0.15
      weight: 0.16
  - model: brunoyun/Llama-3.1-Amelia-AR-8B-v1 # AR Model
    parameters:
      density: 0.9
      epsilon: 0.05
      weight: 0.2
  - model: brunoyun/Llama-3.1-Amelia-SD-8B-v1 # SD Model
    parameters:
      density: 0.4
      epsilon: 0.3
      weight: 0.03
  - model: brunoyun/Llama-3.1-Amelia-AQ-8B-v1 # AQ Model
    parameters:
      density: 0.9
      epsilon: 0.05
      weight: 0.2
merge_method: della_linear
base_model: meta-llama/Llama-3.1-8B-Instruct
parameters:
  int8_mask: true
dtype: bfloat16
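The per-model parameters above can be sanity-checked with a few lines of Python. This sketch rests on one assumption taken from mergekit's description of DELLA rather than from this card: `epsilon` is the maximum deviation of the magnitude-based probabilities around `density`, so `density - epsilon` and `density + epsilon` should both stay within [0, 1]. The helper `check_merge_params` is hypothetical; the values are copied from the configuration.

```python
# (density, epsilon, weight) for each task model, copied from the YAML above.
MERGE_PARAMS = {
    "ACC": (0.7, 0.25, 0.15),
    "CD": (0.4, 0.3, 0.03),
    "ED": (0.4, 0.3, 0.03),
    "ET": (0.9, 0.05, 0.2),
    "FD": (0.8, 0.15, 0.16),
    "AR": (0.9, 0.05, 0.2),
    "SD": (0.4, 0.3, 0.03),
    "AQ": (0.9, 0.05, 0.2),
}


def check_merge_params(params):
    """Return the total weight and the (low, high) probability range per task.
    Raises ValueError if a range leaves [0, 1] (assumed DELLA constraint)."""
    ranges = {}
    for task, (density, epsilon, _weight) in params.items():
        low, high = density - epsilon, density + epsilon
        if low < 0.0 or high > 1.0:
            raise ValueError(f"{task}: range [{low}, {high}] leaves [0, 1]")
        ranges[task] = (low, high)
    total_weight = sum(weight for _, _, weight in params.values())
    return total_weight, ranges


total_weight, ranges = check_merge_params(MERGE_PARAMS)
print(f"total weight: {total_weight:.2f}")
for task, (low, high) in ranges.items():
    print(f"{task}: {low:.2f} to {high:.2f}")
```

With the values above, the eight weights sum to 1.0, and the ET, AR, and AQ models receive both the largest weights and the highest densities.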

Evaluation

Testing Data & Metrics

Testing Data

This model was tested on 6,400 samples (800 for each of the 8 tasks). The testing dataset for each task was extracted from the corresponding model:

  • brunoyun/Llama-3.1-Amelia-ACC-8B-v1 (ACC)
  • brunoyun/Llama-3.1-Amelia-AQ-8B-v1 (AQ)
  • brunoyun/Llama-3.1-Amelia-AR-8B-v1 (AR)
  • brunoyun/Llama-3.1-Amelia-CD-8B-v1 (CD)
  • brunoyun/Llama-3.1-Amelia-ED-8B-v1 (ED)
  • brunoyun/Llama-3.1-Amelia-ET-8B-v1 (ET)
  • brunoyun/Llama-3.1-Amelia-FD-8B-v1 (FD)
  • brunoyun/Llama-3.1-Amelia-SD-8B-v1 (SD)

The samples used for testing can be accessed from the GitHub repository.

Metrics

To evaluate the argument mining tasks, we used standard classification metrics: $F1$ score, precision, and recall. For the fallacies detection task, where a single sentence $s$ has a set of fallacies $F_s$ identified as true labels, we adapted precision and recall. Given the sampled testing dataset $T$ with 800 elements (see previous section), we build the multiset $T'$, in which each sentence is repeated as many times as it has true labels. Namely, we have:

$$T' = \{ s_1, \dots, s_n \mid s_i = s, n = |F_s|, s \in T \}$$

Given our non-deterministic model $\phi$, we obtain the predictions:

$$Y = \{ \phi(s') \mid s' \in T' \}$$

The new precision is the fraction of correct predictions among all the predictions, where a prediction is correct if the predicted fallacy belongs to the set of true labels.

$$\mathrm{Precision} = \frac{\sum_{s \in T} \left( |\{ s' \in T' \mid s' = s, \phi(s') \in F_s \}| / |F_s| \right)}{|T|}$$

Recall was measured based on the consistency of prediction distribution, i.e., over a series of instances annotated with the same fallacies, the model was expected to generate the corresponding fallacy task with similar frequency.

$$\mathrm{Recall} = \frac{\sum_{s \in T} \sum_{f \in F_s} \min\left( \frac{|\{ s' \in T' \mid \phi(s') = f, s' = s \}|}{|F_s|}, \frac{1}{|F_s|} \right)}{|T|}$$
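The adapted precision and recall defined above can be sketched in a few lines of Python. The function names, the dictionary-based data layout, and the toy labels are assumptions made for illustration; this is not the evaluation script used to produce the results. Each sentence maps to its set of true fallacies $F_s$, and to the list of $|F_s|$ predictions obtained by querying the model once per copy of the sentence in $T'$.

```python
from collections import Counter


def adapted_precision(true_labels, predictions):
    """Average over sentences of the share of predictions that fall in F_s."""
    total = 0.0
    for sentence, labels in true_labels.items():
        correct = sum(1 for pred in predictions[sentence] if pred in labels)
        total += correct / len(labels)
    return total / len(true_labels)


def adapted_recall(true_labels, predictions):
    """Average over sentences of the per-fallacy coverage, where each
    fallacy contributes at most 1 / |F_s|."""
    total = 0.0
    for sentence, labels in true_labels.items():
        counts = Counter(predictions[sentence])
        n = len(labels)
        total += sum(min(counts[f] / n, 1 / n) for f in labels)
    return total / len(true_labels)


# Toy data (hypothetical): s1 has two true fallacies, s2 has one.
true_labels = {
    "s1": {"ad hominem", "strawman"},
    "s2": {"slippery slope"},
}
# One prediction per copy of each sentence in the multiset T'.
predictions = {
    "s1": ["ad hominem", "ad hominem"],
    "s2": ["slippery slope"],
}

precision = adapted_precision(true_labels, predictions)
recall = adapted_recall(true_labels, predictions)
print(precision, recall)
```

In this toy example every prediction is a true label, so precision is 1.0, but "strawman" is never predicted for `s1`, so recall drops to 0.75. This is the behavior the recall definition is meant to capture: repeating one correct fallacy does not compensate for missing another.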

Results

| Model | ACC | CD | ED | AR | ET | SD | FD_Single | FD_Multi | AQ |
|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B zero-shot | 73.52% | 51.50% | 17.06% | 28.32% | 37.41% | 14.10% | 44.07% | 21.77% | 15.10% |
| Llama 3.1 8B few-shot | 75.47% | 67.83% | 64.20% | 35.97% | 49.31% | 80.00% | 48.50% | 17.25% | 31.83% |
| Llama 3.1 8B fine-tuned for ACC | 89.61% | 61.35% | 68.25% | 38.51% | 41.43% | 65.82% | 38.43% | 21.58% | 33.07% |
| Llama 3.1 8B fine-tuned for CD | 50.18% | 85.16% | 68.91% | 38.29% | 33.91% | 66.97% | 38.90% | 22.67% | 31.24% |
| Llama 3.1 8B fine-tuned for ED | 63.32% | 74.94% | 78.00% | 28.60% | 38.67% | 68.42% | 39.65% | 18.47% | 29.01% |
| Llama 3.1 8B fine-tuned for AR | 50.81% | 59.98% | 67.00% | 87.20% | 35.07% | 76.00% | 35.14% | 25.86% | 27.97% |
| Llama 3.1 8B fine-tuned for ET | 56.10% | 67.08% | 61.45% | 26.88% | 75.22% | 69.82% | 46.78% | 29.68% | 29.03% |
| Llama 3.1 8B fine-tuned for SD | 50.93% | 48.88% | 57.62% | 38.26% | 39.17% | 94.63% | 43.23% | 20.99% | 20.39% |
| Llama 3.1 8B fine-tuned for FD | 66.58% | 65.13% | 64.50% | 38.64% | 46.83% | 64.32% | 82.92% | 50.77% | 41.90% |
| Llama 3.1 8B fine-tuned for AQ | 74.46% | 59.73% | 68.00% | 30.86% | 44.06% | 60.43% | 47.98% | 24.31% | 69.54% |
| GGUF_ACC | 87.73% | 63.59% | 63.75% | 36.31% | 37.98% | 64.63% | 30.19% | 29.27% | 32.94% |
| GGUF_CD | 54.10% | 81.92% | 60.70% | 36.43% | 31.99% | 63.82% | 30.00% | 31.21% | 33.20% |
| GGUF_ED | 56.20% | 63.72% | 71.62% | 34.63% | 36.22% | 61.84% | 34.10% | 34.54% | 34.77% |
| GGUF_AR | 55.19% | 60.25% | 63.70% | 84.57% | 31.71% | 76.50% | 29.94% | 34.18% | 32.15% |
| GGUF_ET | 58.23% | 64.37% | 58.59% | 29.14% | 72.47% | 68.20% | 39.05% | 32.94% | 31.48% |
| GGUF_SD | 56.70% | 50.75% | 57.75% | 38.27% | 33.67% | 93.75% | 34.66% | 30.32% | 21.43% |
| GGUF_FD | 62.20% | 59.91% | 62.88% | 35.51% | 42.52% | 64.68% | 74.08% | 62.16% | 41.69% |
| GGUF_AQ | 67.08% | 59.73% | 69.50% | 31.17% | 41.31% | 61.16% | 41.86% | 30.02% | 66.53% |
| Llama 3.1 8B fine-tuned Multi-task | 90.74% | 84.71% | 77.75% | 88.33% | 73.84% | 95.75% | 82.53% | 50.22% | 69.80% |
| Merged Model | 78.72% | 70.69% | 69.62% | 72.52% | 54.60% | 77.04% | 57.00% | 35.03% | 57.52% |
| GGUF_Merged | 65.95% | 65.83% | 62.13% | 62.93% | 49.06% | 74.38% | 50.04% | 40.75% | 44.97% |

Intended Use

Intended Use Cases

Llama 3.1 is intended for commercial and research use in multiple languages. Instruction tuned text only models are intended for assistant-like chat, whereas pretrained models can be adapted for a variety of natural language generation tasks. The Llama 3.1 model collection also supports the ability to leverage the outputs of its models to improve other models including synthetic data generation and distillation. The Llama 3.1 Community License allows for these use cases. 

Out-of-scope

Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in any other way that is prohibited by the Acceptable Use Policy and Llama 3.1 Community License. Use in languages beyond those explicitly referenced as supported in this model card.

Bias, Risks, and Limitations

This model is a new technology, and like any new technology, there are risks associated with its use. Testing conducted to date has not covered, nor could it cover, all scenarios. For these reasons, as with all LLMs, its potential outputs cannot be predicted in advance, and the model may in some instances produce inaccurate, biased or other objectionable responses to user prompts. Therefore, before deploying any applications of this model, developers should perform safety testing and tuning tailored to their specific applications of the model. 
