---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- text-classification
- distilbert
- fine-tuned
- pytorch
datasets:
- cassieli226/cities-text-dataset
base_model: distilbert-base-uncased
model-index:
- name: hw2-text-distilbert
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      type: cassieli226/cities-text-dataset
      name: Cities Text Dataset
      split: test
    metrics:
    - type: accuracy
      value: 99.5
      name: Test Accuracy
    - type: f1
      value: 99.5
      name: Test F1 Score (Macro)
---
# DistilBERT Text Classification Model
This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) for text classification tasks.
## Model Description
This model is a fine-tuned DistilBERT model for binary text classification: it labels English text as relating to either Pittsburgh or Shanghai. It reaches 99.5% accuracy and 99.5% macro F1 on the held-out test set.
- **Model type:** Text Classification (Binary)
- **Language(s) (NLP):** English
- **Base model:** distilbert-base-uncased
- **Classes:** Pittsburgh, Shanghai
## Intended Uses & Limitations
### Intended Uses
- Binary text classification between Pittsburgh and Shanghai-related content
- City-based text categorization tasks
- Research and educational purposes in NLP and text classification
### Limitations
- Limited to English language text
- Performance may vary on out-of-domain data
- Maximum input length of 256 tokens due to truncation
## Training and Evaluation Data
### Training Data
- **Base dataset:** [cassieli226/cities-text-dataset](https://huggingface.co/datasets/cassieli226/cities-text-dataset)
- **Original dataset:** 100 samples (50 Pittsburgh, 50 Shanghai)
- **Data augmentation:** applied to grow the dataset from 100 to 1,000 samples
- **Classes (augmented dataset):** Pittsburgh (507 samples) and Shanghai (493 samples)
- **Train/Test Split:** 80/20 split (800 train, 200 test) with stratified sampling
- **External validation:** Original 100 samples used for additional validation
### Preprocessing
- Text tokenization using DistilBERT tokenizer
- Maximum sequence length: 256 tokens
- Truncation applied to longer sequences
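A rough sketch of the data pipeline described above (assuming the published dataset exposes `text` and `label` columns, with `label` as a `ClassLabel` feature so stratified splitting works; the column names and `seed` are assumptions, not part of this card):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Load the augmented dataset (assumed to be the published 1,000-sample version).
dataset = load_dataset("cassieli226/cities-text-dataset", split="train")

# 80/20 stratified split; requires "label" to be a ClassLabel feature.
split = dataset.train_test_split(test_size=0.2, stratify_by_column="label", seed=42)
train_ds, test_ds = split["train"], split["test"]

# Tokenize with the DistilBERT tokenizer, truncating to 256 tokens.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

train_ds = train_ds.map(tokenize, batched=True)
test_ds = test_ds.map(tokenize, batched=True)
```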
## Training Procedure
### Training Hyperparameters
- **Learning rate:** 5e-5
- **Training batch size:** 16
- **Evaluation batch size:** 32
- **Number of epochs:** 4
- **Weight decay:** 0.01
- **Warmup ratio:** 0.1
- **LR scheduler:** Linear
- **Gradient accumulation steps:** 1
- **Mixed precision:** FP16 (if GPU available)
### Training Configuration
- **Optimizer:** AdamW (default)
- **Early stopping:** Enabled with patience of 2 epochs
- **Best model selection:** Based on F1 score (macro)
- **Evaluation strategy:** Every epoch
- **Save strategy:** Every epoch (best model only)
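A minimal `Trainer` sketch reflecting the hyperparameters and configuration above (it reuses `train_ds`, `test_ds`, and `tokenizer` from the data sketch in the previous section; `compute_metrics` is shown under Evaluation below; `output_dir` and the argument name `eval_strategy`, called `evaluation_strategy` on older `transformers` versions, are assumptions):

```python
import torch
from transformers import (
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
    EarlyStoppingCallback,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

training_args = TrainingArguments(
    output_dir="hw2-text-distilbert",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=4,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    gradient_accumulation_steps=1,
    fp16=torch.cuda.is_available(),  # mixed precision only when a GPU is present
    eval_strategy="epoch",           # "evaluation_strategy" on older versions
    save_strategy="epoch",
    load_best_model_at_end=True,     # keep only the best checkpoint
    metric_for_best_model="f1",
    save_total_limit=1,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    data_collator=DataCollatorWithPadding(tokenizer),  # dynamic padding per batch
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```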
## Evaluation
### Metrics
The model was evaluated using:
- **Accuracy:** Overall classification accuracy
- **F1 Score (Macro):** Macro-averaged F1 score across all classes
- **Per-class accuracy:** Individual class performance metrics
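A sketch of a `compute_metrics` function matching these metrics (scikit-learn is used here for convenience; the original training script may have computed them with `evaluate` or another library):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="macro"),  # macro-averaged F1
    }
```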
### Results
- **Test Set Performance:**
- Accuracy: 99.5%
- F1 Score (Macro): 99.5%
- **External Validation:**
- Accuracy: 100.0%
- F1 Score (Macro): 100.0%
### Detailed Performance
- **Pittsburgh Class:** 99.01% accuracy (101 samples)
- **Shanghai Class:** 100.0% accuracy (99 samples)
- **Confusion Matrix:** Only 1 misclassification out of 200 test samples
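The per-class numbers and the confusion matrix can be reproduced along these lines (a sketch only, reusing `trainer` and `test_ds` from the sketches above):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

pred_output = trainer.predict(test_ds)
preds = np.argmax(pred_output.predictions, axis=-1)
labels = pred_output.label_ids

cm = confusion_matrix(labels, preds)
per_class_accuracy = cm.diagonal() / cm.sum(axis=1)  # per-class recall
print(cm)
print(per_class_accuracy)
```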
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "Anyuhhh/hw2-text-distilbert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example usage
text = "Your input text here"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(predictions, dim=-1)
print(f"Predicted class: {predicted_class.item()}")
```