---
language: en
license: cc-by-4.0
tags:
- text-classification
repo: https://github.com/AAP9002/COMP34812-NLU-NLI
---
Model Card for z72819ap-e91802zc-NLI
This is a classification model trained to predict, via binary classification, whether a premise entails a given hypothesis.
Model Details
Model Description
This model is an ensemble of RoBERTa models fine-tuned on over 24K premise-hypothesis pairs from the shared task dataset for Natural Language Inference (NLI).
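The card does not spell out how the ensemble combines its base models, so the following is only an illustrative stacking sketch on synthetic data: a meta-classifier is trained on the entailment probabilities emitted by the two base classifiers. (In this card the meta-model is itself a fine-tuned model; the logistic regression below merely stands in for it.)

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)

# Synthetic stand-ins for the two base models' entailment probabilities
# on 200 premise-hypothesis pairs (labels are the gold 0/1 decisions).
labels = rng.integers(0, 2, size=200)
nli_probs = np.clip(labels + rng.normal(0, 0.3, size=200), 0, 1)
sts_probs = np.clip(labels + rng.normal(0, 0.4, size=200), 0, 1)

# Stack both probability streams into a (n_samples, 2) feature matrix.
features = np.column_stack([nli_probs, sts_probs])

# Hypothetical meta-model: learns the final binary decision from the
# base models' outputs (the actual ensemble uses a trained meta network).
meta = LogisticRegression().fit(features, labels)
preds = meta.predict(features)
accuracy = (preds == labels).mean()
print(f"meta-model training accuracy: {accuracy:.2f}")
```

The design point is that the meta-model sees only the base models' confidences, so it can learn when to trust the NLI model versus the similarity model.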
- Developed by: Alan Prophett and Zac Curtis
- Language(s): English
- Model type: Supervised
- Model architecture: Transformers
- Finetuned from model: roberta-base
Model Resources
- Repository: https://huggingface.co/FacebookAI/roberta-base
- Paper or documentation: https://arxiv.org/abs/1907.11692
Training Details
Training Data
24K+ premise-hypothesis pairs from the shared task dataset provided for Natural Language Inference (NLI).
Training Procedure
Training Hyperparameters
All models and datasets
- seed: 42
RoBERTa Large NLI Binary Classification Model
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- num_epochs: 5
Semantic Textual Similarity Binary Classification Model
- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- num_epochs: 5
Ensemble Meta Model
- learning_rate: 2e-05
- train_batch_size: 128
- eval_batch_size: 16
- num_epochs: 3
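The per-model settings above can be summarised as plain configuration dicts; the key names mirror Hugging Face `TrainingArguments` fields, but the dicts themselves are only an illustrative summary, not the authors' actual training script.

```python
# Hyperparameter sets from the card. Key names follow Hugging Face
# TrainingArguments conventions; the grouping itself is illustrative.
COMMON = {"seed": 42}

CONFIGS = {
    "nli_roberta_large": {
        **COMMON,
        "learning_rate": 2e-5,
        "per_device_train_batch_size": 16,
        "per_device_eval_batch_size": 16,
        "num_train_epochs": 5,
    },
    "sts_binary": {
        **COMMON,
        "learning_rate": 2e-5,
        "per_device_train_batch_size": 16,
        "per_device_eval_batch_size": 16,
        "num_train_epochs": 5,
    },
    "ensemble_meta": {
        **COMMON,
        "learning_rate": 2e-5,
        "per_device_train_batch_size": 128,
        "per_device_eval_batch_size": 16,
        "num_train_epochs": 3,
    },
}
```

Note the meta-model trains with a much larger batch (128) and fewer epochs (3) than the two base models.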
Speeds, Sizes, Times
- overall training time: 309 minutes 30 seconds
RoBERTa Large NLI Binary Classification Model
- duration per training epoch: 11 minutes
- model size: 1.42 GB
Semantic Textual Similarity Binary Classification Model
- duration per training epoch: 4 minutes 30 seconds
- model size: 501 MB
Ensemble Meta Model
- duration per training epoch: 4 minutes
- model size: 1.92 GB
Evaluation
Testing Data & Metrics
Testing Data
A subset of the provided development set, split into 5.3K+ pairs for validation and 1.3K+ pairs for testing.
Metrics
- Precision
- Recall
- F1-score
- Accuracy
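These metrics can be computed with scikit-learn; the snippet below is a minimal sketch on toy labels (the `y_true`/`y_pred` values are synthetic, not the card's actual predictions).

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy gold/predicted entailment labels, illustrative only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]

# Macro averages treat both classes equally; weighted averages scale
# each class's score by its support, as in the results tables below.
macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro"
)
weighted_p, weighted_r, weighted_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted"
)
acc = accuracy_score(y_true, y_pred)
print(f"macro F1: {macro_f1:.3f}, accuracy: {acc:.3f}")
```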
Results
On the held-out test set, the ensemble model obtained a macro F1-score of 91% and an accuracy of 91%.
Validation set
- Macro Precision: 91.0%
- Macro Recall: 91.0%
- Macro F1-score: 91.0%
- Weighted Precision: 91.0%
- Weighted Recall: 91.0%
- Weighted F1-score: 91.0%
- Accuracy: 91.0%
- Support: 5389
Test set
- Macro Precision: 91.0%
- Macro Recall: 91.0%
- Macro F1-score: 91.0%
- Weighted Precision: 91.0%
- Weighted Recall: 91.0%
- Weighted F1-score: 91.0%
- Accuracy: 91.0%
- Support: 1347
Technical Specifications
Hardware
- RAM: at least 10 GB
- Storage: at least 4 GB
- GPU: NVIDIA A100 40 GB
Software
- TensorFlow 2.18.0+cu12.4
- Transformers 4.50.3
- Pandas 2.2.2
- NumPy 2.0.2
- Seaborn 0.13.2
- Huggingface_hub 0.30.1
- Matplotlib 3.10.0
- Scikit-learn 1.6.1
Bias, Risks, and Limitations
Any input (the concatenation of the premise and hypothesis) longer than 512 subword tokens is truncated by the model, so information beyond that limit is ignored.
Additional Information
The hyperparameters were determined empirically by experimenting with a range of values.