---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- text-classification
- distilbert
- fine-tuned
- pytorch
datasets:
- cassieli226/cities-text-dataset
base_model: distilbert-base-uncased

model-index:
- name: hw2-text-distilbert
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      type: cassieli226/cities-text-dataset
      name: Cities Text Dataset
      split: test
    metrics:
      - type: accuracy
        value: 99.5
        name: Test Accuracy
      - type: f1
        value: 99.5
        name: Test F1 Score (Macro)
---

# DistilBERT Text Classification Model

This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) for text classification tasks.

## Model Description

This model is a fine-tuned DistilBERT for binary text classification: it predicts whether an English text is about Pittsburgh or about Shanghai. On the held-out test set it reaches 99.5% accuracy and a 99.5% macro F1 score.

- **Model type:** Text Classification (Binary)
- **Language(s) (NLP):** English
- **Base model:** distilbert-base-uncased
- **Classes:** Pittsburgh, Shanghai

## Intended Uses & Limitations

### Intended Uses
- Binary classification of Pittsburgh-related vs. Shanghai-related text
- City-based text categorization tasks
- Research and educational purposes in NLP and text classification

### Limitations
- Limited to English language text
- Performance may vary on out-of-domain data
- Maximum input length of 256 tokens; longer inputs are truncated

## Training and Evaluation Data

### Training Data
- **Base dataset:** [cassieli226/cities-text-dataset](https://huggingface.co/datasets/cassieli226/cities-text-dataset)
- **Classes:** Pittsburgh (507 samples) and Shanghai (493 samples) in the augmented dataset
- **Original dataset:** 100 samples (50 Pittsburgh, 50 Shanghai)
- **Data augmentation:** Applied to increase dataset size from 100 to 1000 samples
- **Train/Test Split:** 80/20 (800 train, 200 test) with stratified sampling (see the sketch after this list)
- **External validation:** Original 100 samples used for additional validation
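
For reference, a split like the one above can be reproduced with the `datasets` library. This is a sketch, not the original script: the dataset id comes from this card, while the split name, seed, and `label` column name are assumptions.

```python
from datasets import load_dataset

# Augmented dataset from the Hub (id taken from this card);
# assumes everything lives in a single "train" split.
ds = load_dataset("cassieli226/cities-text-dataset", split="train")

# Stratified 80/20 split; stratify_by_column requires the class
# column (assumed to be "label") to be a ClassLabel feature.
split = ds.train_test_split(test_size=0.2, seed=42, stratify_by_column="label")
train_ds, test_ds = split["train"], split["test"]
print(len(train_ds), len(test_ds))  # expect roughly 800 / 200
```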

### Preprocessing
- Text tokenization using DistilBERT tokenizer
- Maximum sequence length: 256 tokens
- Truncation applied to longer sequences
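
A minimal sketch of this step, assuming the raw examples carry a `text` field:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize_batch(batch):
    # Truncate anything beyond 256 tokens, per the settings above
    return tokenizer(batch["text"], truncation=True, max_length=256)

# On a datasets.Dataset this would be dataset.map(tokenize_batch, batched=True);
# here it is shown on a plain dict for illustration.
enc = tokenize_batch({"text": ["A sample sentence about Pittsburgh."]})
print(len(enc["input_ids"][0]))
```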

## Training Procedure

### Training Hyperparameters
- **Learning rate:** 5e-5
- **Training batch size:** 16
- **Evaluation batch size:** 32
- **Number of epochs:** 4
- **Weight decay:** 0.01
- **Warmup ratio:** 0.1
- **LR scheduler:** Linear
- **Gradient accumulation steps:** 1
- **Mixed precision:** FP16 (if GPU available)
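
These settings map onto `transformers.TrainingArguments` roughly as follows. The output path is a placeholder, and the last block of fields anticipates the evaluation/checkpoint policy listed under Training Configuration below:

```python
import torch
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="hw2-text-distilbert",  # placeholder path
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=4,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    gradient_accumulation_steps=1,
    fp16=torch.cuda.is_available(),    # mixed precision only on GPU
    # Evaluation/checkpoint policy (see Training Configuration):
    evaluation_strategy="epoch",       # renamed eval_strategy in newer releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",        # macro F1 (see Metrics below)
    save_total_limit=1,                # keep only the best checkpoint
)
```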

### Training Configuration
- **Optimizer:** AdamW (default)
- **Early stopping:** Enabled with patience of 2 epochs
- **Best model selection:** Based on F1 score (macro)
- **Evaluation strategy:** Every epoch
- **Save strategy:** Every epoch (best model only)
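
Putting it together, the run might be wired up as in the sketch below, which reuses `training_args` from above and assumes the tokenized `train_ds`/`test_ds` splits and the `compute_metrics` function shown under Metrics:

```python
from transformers import (
    AutoModelForSequenceClassification,
    Trainer,
    EarlyStoppingCallback,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

trainer = Trainer(
    model=model,
    args=training_args,               # sketched above
    train_dataset=train_ds,           # tokenized splits, assumed prepared
    eval_dataset=test_ds,
    compute_metrics=compute_metrics,  # defined under Metrics below
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```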

## Evaluation

### Metrics
The model was evaluated using:
- **Accuracy:** Overall classification accuracy
- **F1 Score (Macro):** Macro-averaged F1 score across all classes
- **Per-class accuracy:** Individual class performance metrics
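
A `compute_metrics` function producing these numbers could look like the following (scikit-learn is an assumed, though standard, choice):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="macro"),
    }
```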

### Results
- **Test Set Performance:** 
  - Accuracy: 99.5%
  - F1 Score (Macro): 99.5%
- **External Validation:** 
  - Accuracy: 100.0%
  - F1 Score (Macro): 100.0%

### Detailed Performance
- **Pittsburgh Class:** 99.01% accuracy (101 samples)
- **Shanghai Class:** 100.0% accuracy (99 samples)
- **Confusion Matrix:** Only 1 misclassification out of 200 test samples
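
Numbers like these can be recomputed from raw predictions. The sketch below reuses the `trainer` and `test_ds` objects from the training sketches; the class-name order (id 0 = Pittsburgh) is an assumption:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

pred_output = trainer.predict(test_ds)  # PredictionOutput
preds = np.argmax(pred_output.predictions, axis=-1)
labels = pred_output.label_ids

print(confusion_matrix(labels, preds))
print(classification_report(labels, preds,
                            target_names=["Pittsburgh", "Shanghai"]))
```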

## Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "Anyuhhh/hw2-text-distilbert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference mode: disables dropout

# Example usage
text = "Your input text here"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=-1)

print(f"Predicted class: {predicted_class.item()}")