---
language: en
license: cc-by-4.0
tags:
- text-classification
repo: https://github.com/AAP9002/COMP34812-NLU-NLI
---
Model Card for z72819ap-e91802zc-NLI
This is a classification model trained to predict, via binary classification, whether a premise entails a given hypothesis.
Model Details
Model Description
This model is an ensemble of RoBERTa models fine-tuned on over 24K premise-hypothesis pairs from the shared task dataset for Natural Language Inference (NLI).
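The card does not spell out how the ensemble combines its base models, so the following is only an illustrative stacking sketch on synthetic data: a meta-classifier is trained on the entailment probabilities emitted by the two base classifiers. (In this card the meta-model is itself a fine-tuned model; the logistic regression below merely stands in for it.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Synthetic stand-ins for the two base models' entailment probabilities
# on 200 premise-hypothesis pairs (labels are the gold 0/1 decisions).
labels = rng.integers(0, 2, size=200)
nli_probs = np.clip(labels + rng.normal(0, 0.3, size=200), 0, 1)
sts_probs = np.clip(labels + rng.normal(0, 0.4, size=200), 0, 1)

# Stack both probability streams into a (n_samples, 2) feature matrix.
features = np.column_stack([nli_probs, sts_probs])

# Hypothetical meta-model: learns the final binary decision from the
# base models' outputs (the actual ensemble uses a trained meta network).
meta = LogisticRegression().fit(features, labels)
preds = meta.predict(features)
accuracy = (preds == labels).mean()
print(f"meta-model training accuracy: {accuracy:.2f}")
```

The design point is that the meta-model sees only the base models' confidences, so it can learn when to trust the NLI model versus the similarity model.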
- Developed by: Alan Prophett and Zac Curtis
- Language(s): English
- Model type: Supervised
- Model architecture: Transformers
- Finetuned from model: roberta-base
Model Resources
- Repository: https://huggingface.co/FacebookAI/roberta-base
- Paper or documentation: https://arxiv.org/abs/1907.11692
Training Details
Training Data
24K+ premise-hypothesis pairs from the shared task dataset provided for Natural Language Inference (NLI).
Training Procedure
Training Hyperparameters
All models and datasets
- seed: 42
RoBERTa Large NLI Binary Classification Model
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- num_epochs: 5
Semantic Textual Similarity Binary Classification Model
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- num_epochs: 5
Ensemble Meta Model
- learning_rate: 2e-05
- train_batch_size: 128
- eval_batch_size: 16
- num_epochs: 3
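The per-model settings above can be summarised as plain configuration dicts; the key names mirror Hugging Face `TrainingArguments` fields, but the dicts themselves are only an illustrative summary, not the authors' actual training script.

```python
# Hyperparameter sets from the card. Key names follow Hugging Face
# TrainingArguments conventions; the grouping itself is illustrative.
COMMON = {"seed": 42}

CONFIGS = {
    "nli_roberta_large": {
        **COMMON,
        "learning_rate": 2e-5,
        "per_device_train_batch_size": 16,
        "per_device_eval_batch_size": 16,
        "num_train_epochs": 5,
    },
    "sts_binary": {
        **COMMON,
        "learning_rate": 2e-5,
        "per_device_train_batch_size": 16,
        "per_device_eval_batch_size": 16,
        "num_train_epochs": 5,
    },
    "ensemble_meta": {
        **COMMON,
        "learning_rate": 2e-5,
        "per_device_train_batch_size": 128,
        "per_device_eval_batch_size": 16,
        "num_train_epochs": 3,
    },
}
```

Note the meta-model trains with a much larger batch (128) and fewer epochs (3) than the two base models.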
Speeds, Sizes, Times
- overall training time: 309 minutes 30 seconds
RoBERTa Large NLI Binary Classification Model
- duration per training epoch: 11 minutes
- model size: 1.42 GB
Semantic Textual Similarity Binary Classification Model
- duration per training epoch: 4 minutes 30 seconds
- model size: 501 MB
Ensemble Meta Model
- duration per training epoch: 4 minutes
- model size: 1.92 GB
Evaluation
Testing Data & Metrics
Testing Data
A subset of the provided development set, split into 5.3K+ pairs for validation and 1.3K+ pairs for testing.
Metrics
- Precision
- Recall
- F1-score
- Accuracy
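These metrics can be computed with scikit-learn; the snippet below is a minimal sketch on toy labels (the `y_true`/`y_pred` values are synthetic, not the card's actual predictions).

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy gold/predicted entailment labels, illustrative only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]

# Macro averages treat both classes equally; weighted averages scale
# each class's score by its support, as in the results tables below.
macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro"
)
weighted_p, weighted_r, weighted_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted"
)
acc = accuracy_score(y_true, y_pred)
print(f"macro F1: {macro_f1:.3f}, accuracy: {acc:.3f}")
```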
Results
On the held-out test set, the ensemble model obtained a macro F1-score of 91% and an accuracy of 91%.
Validation set
- Macro Precision: 91.0%
- Macro Recall: 91.0%
- Macro F1-score: 91.0%
- Weighted Precision: 91.0%
- Weighted Recall: 91.0%
- Weighted F1-score: 91.0%
- Accuracy: 91.0%
- Support: 5389
Test set
- Macro Precision: 91.0%
- Macro Recall: 91.0%
- Macro F1-score: 91.0%
- Weighted Precision: 91.0%
- Weighted Recall: 91.0%
- Weighted F1-score: 91.0%
- Accuracy: 91.0%
- Support: 1347
Technical Specifications
Hardware
- RAM: at least 10 GB
- Storage: at least 4 GB
- GPU: NVIDIA A100 40 GB
Software
- TensorFlow 2.18.0+cu12.4
- Transformers 4.50.3
- Pandas 2.2.2
- NumPy 2.0.2
- Seaborn 0.13.2
- Huggingface_hub 0.30.1
- Matplotlib 3.10.0
- Scikit-learn 1.6.1
Bias, Risks, and Limitations
Any input (the concatenation of the premise and hypothesis) longer than 512 subword tokens is truncated by the model, so information beyond that limit is ignored.
Additional Information
The hyperparameters were determined empirically by experimenting with a range of values.