
Model Card for a63140nd-h10505jd-ED-Opt-C

A DeBERTa V3 Large model fine-tuned for Evidence Detection: determining whether a provided piece of evidence is relevant (1) or not relevant (0) to a given claim. The solution leverages data augmentation and focal loss for robust performance on imbalanced data.

Model Details

Model Description

This model addresses the Evidence Detection (ED) shared task: given a claim and a piece of evidence, determine whether the evidence is relevant to that claim (binary classification). Built on the DeBERTa V3 Large architecture from Microsoft (microsoft/deberta-v3-large), the system is fine-tuned on over 21K training instances (plus augmented data) derived from claim-evidence pairs, with ~6K instances in the validation/development set. Several data augmentation techniques (paraphrasing, backtranslation, synonym replacement, adversarial typos) increase data diversity. A custom WeightedTrainer class with FocalLoss handles class imbalance, and Platt scaling calibrates the predicted probabilities for better threshold-based decision-making.

  • Developed by: Nikolaos Douranos & James Deslandes
  • Language(s): English
  • Model type: Sequence Relation Classification
  • Model architecture: Transformer-based architecture from microsoft/deberta-v3-large
  • Finetuned from model: microsoft/deberta-v3-large
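
The checkpoint can be loaded like any Hugging Face sequence-classification model. The snippet below is a minimal usage sketch, not the authors' published code; the repository id and the claim/evidence sentence-pair encoding are assumptions:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical repo id; substitute the actual Hub path of this checkpoint.
model_id = "a63140nd-h10505jd-ED-Opt-C"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

claim = "Vitamin C prevents the common cold."
evidence = "A meta-analysis found no significant reduction in cold incidence."

# Assumption: claim and evidence are encoded as a sentence pair, truncated to 256 tokens.
inputs = tokenizer(claim, evidence, truncation=True, max_length=256, return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)

# 0.5 is the default cut-off; the card tunes this threshold on the dev set.
print("relevant" if probs[0, 1].item() > 0.5 else "not relevant")
```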

Training Details

Training Data

The model is trained on ~21K claim-evidence pairs augmented with paraphrasing, backtranslation, synonym replacement, and adversarial typos. A ~6K record development set is used for validation and probability calibration.
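
The augmentation code itself is not published; as an illustration of one of the listed techniques, synonym replacement could be implemented with NLTK's WordNet along these lines (the function name and word-selection heuristic are assumptions):

```python
import random
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)

def synonym_replace(text: str, n: int = 2) -> str:
    """Replace up to n words with a random WordNet synonym (illustrative only)."""
    words = text.split()
    candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
    random.shuffle(candidates)
    for i in candidates[:n]:
        lemmas = wordnet.synsets(words[i])[0].lemma_names()
        synonyms = [l.replace("_", " ") for l in lemmas if l.lower() != words[i].lower()]
        if synonyms:
            words[i] = random.choice(synonyms)
    return " ".join(words)

print(synonym_replace("The study shows a significant reduction in risk."))
```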

Training Procedure

Training Hyperparameters

  • num_train_epochs: 5
  • per_device_train_batch_size: 8
  • per_device_eval_batch_size: 8
  • learning_rate: 1e-05
  • weight_decay: 0.01
  • focal_loss_alpha: 2.0
  • focal_loss_gamma: 2.0
  • max_seq_length: 256
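
The card does not include the training script; the sketch below shows how the FocalLoss/WeightedTrainer combination described above and these hyperparameters might fit together with Hugging Face Transformers. The class structure is an assumption; the focal loss follows Lin et al. [3]:

```python
import torch
import torch.nn.functional as F
from transformers import Trainer, TrainingArguments

class FocalLoss(torch.nn.Module):
    """Focal loss: down-weights easy examples via the (1 - p_t)^gamma factor."""
    def __init__(self, alpha: float = 2.0, gamma: float = 2.0):
        super().__init__()
        self.alpha, self.gamma = alpha, gamma

    def forward(self, logits, labels):
        ce = F.cross_entropy(logits, labels, reduction="none")
        p_t = torch.exp(-ce)  # probability assigned to the true class
        return (self.alpha * (1 - p_t) ** self.gamma * ce).mean()

class WeightedTrainer(Trainer):
    """Trainer variant that swaps the default cross-entropy for focal loss."""
    def __init__(self, *args, focal_alpha=2.0, focal_gamma=2.0, **kwargs):
        super().__init__(*args, **kwargs)
        self.focal = FocalLoss(alpha=focal_alpha, gamma=focal_gamma)

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss = self.focal(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss

args = TrainingArguments(
    output_dir="ed-deberta-v3-large",  # hypothetical output path
    num_train_epochs=5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=1e-5,
    weight_decay=0.01,
    fp16=torch.cuda.is_available(),  # mixed precision when the GPU supports it
)
```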

Speeds, Sizes, Times

Training took roughly 3.33 hours on a single T4 GPU, running 5 epochs with a batch size of 8. The final model size is roughly 1.9 GB.

TrainOutput: global_step=17030, training_loss=0.07785630589981328

  • Train runtime: ~ 3.33 hours (11976.46 seconds)
  • Train samples/sec: 11.376
  • Train steps/sec: 1.422
  • Total FLOPs: 6.35e+16
  • Final training loss: 0.0779

Evaluation

Testing Data & Metrics

Testing Data

The ~6K dev set is used for final model selection and calibration.

Metrics

Reported metrics include Macro-F1, Accuracy, Precision, Recall, ROC-AUC, PR-AUC, Matthews Corrcoef (MCC), and Cohen’s Kappa. Platt scaling is applied, and the decision threshold is tuned on the dev set to maximize macro-F1.
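
The calibration code is not published; the following is a minimal sketch of Platt scaling plus macro-F1 threshold tuning, using scikit-learn's LogisticRegression as the Platt calibrator (function and variable names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

def calibrate_and_tune(dev_scores: np.ndarray, dev_labels: np.ndarray):
    """dev_scores: raw model scores for the positive class on the dev set;
    dev_labels: gold 0/1 labels. Both assumed to be 1-D arrays."""
    # Platt scaling: fit a 1-D logistic regression on the raw scores.
    platt = LogisticRegression()
    platt.fit(dev_scores.reshape(-1, 1), dev_labels)
    probs = platt.predict_proba(dev_scores.reshape(-1, 1))[:, 1]

    # Sweep thresholds and keep the one that maximizes macro-F1.
    best_t, best_f1 = 0.5, -1.0
    for t in np.linspace(0.05, 0.95, 91):
        f1 = f1_score(dev_labels, (probs >= t).astype(int), average="macro")
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return platt, best_t, best_f1
```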

Results

The final model demonstrates strong performance on the dev set. Exact metrics may vary between training runs, but typical results show improvements in F1, MCC, and Kappa.

Classification Report (Dev Calibrated):

              precision    recall  f1-score    support

           0     0.9319    0.9193    0.9255      4286
           1     0.7962    0.8244    0.8101      1640

    accuracy                         0.8930      5926
   macro avg     0.8641    0.8718    0.8678      5926
weighted avg     0.8943    0.8930    0.8936      5926

Confusion Matrix: [figure omitted; counts are recoverable from the classification report above]

Dev (Calibrated) Metrics:

  • Accuracy: 0.893
  • Matthews Corrcoef (MCC): 0.74
  • Cohen’s Kappa: 0.74
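
For reference, the figures above can be recomputed with scikit-learn. This sketch assumes y_true and y_pred hold the gold labels and calibrated predictions for the dev set:

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             cohen_kappa_score, confusion_matrix,
                             matthews_corrcoef)

def report(y_true, y_pred):
    """Print the metrics reported in this card for a set of predictions."""
    print(classification_report(y_true, y_pred, digits=4))
    print("Confusion matrix:\n", confusion_matrix(y_true, y_pred))
    print("Accuracy:", round(accuracy_score(y_true, y_pred), 3))
    print("MCC:     ", round(matthews_corrcoef(y_true, y_pred), 2))
    print("Kappa:   ", round(cohen_kappa_score(y_true, y_pred), 2))
```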

Technical Specifications

Hardware

Efficient training requires at least one modern GPU (e.g., NVIDIA V100 or RTX 3090) with ~24 GB of GPU memory, plus ~50 GB of system memory.

Software

Python 3.8+, PyTorch, Hugging Face Transformers, Datasets, Evaluate, scikit-learn, and NLTK are required. Mixed precision (fp16) training is activated if supported by the GPU.

Bias, Risks, and Limitations

The model may exhibit biases learned from its underlying training data, especially if claim-evidence pairs exhibit distributional shifts. Focal loss helps handle class imbalance but does not eliminate potential biases or domain gaps. Users should carefully validate the model on relevant real-world samples, since misclassification can lead to ignoring crucial evidence or wrongly including irrelevant data.

Additionally, Platt scaling was used for threshold calibration to improve accuracy on the development (dev) set. When the scaling was instead fitted on the training set and the resulting threshold applied to the dev set, it yielded poorer results; consequently, the final model design fits the calibration on the dev set to avoid this risk.

Resources

[1] P. He, J. Gao, and W. Chen, "DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing," arXiv:2111.09543 [cs.CL], 2023. [Online]. Available: https://arxiv.org/abs/2111.09543

[2] J. Chen, D. Tam, C. Raffel, M. Bansal, and D. Yang, "An Empirical Survey of Data Augmentation for Limited Data Learning in NLP," arXiv:2106.07499 [cs.CL], 2021. [Online]. Available: https://arxiv.org/abs/2106.07499

[3] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal Loss for Dense Object Detection," arXiv:1708.02002 [cs.CV], 2018. [Online]. Available: https://arxiv.org/abs/1708.02002

[4] B. Böken, "On the appropriateness of Platt scaling in classifier calibration," Information Systems, vol. 95, p. 101641, 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0306437920301083, doi: 10.1016/j.is.2020.101641

Use of Generative AI

Generative AI was used to conduct research across scientific papers and web resources, helping to generate ideas for improving on the microsoft/deberta-v3-large baseline. It also supported debugging and the writing of code comments.
