---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- text-classification
- distilbert
- fine-tuned
- pytorch
datasets:
- cassieli226/cities-text-dataset
base_model: distilbert-base-uncased
model-index:
- name: hw2-text-distilbert
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      type: cassieli226/cities-text-dataset
      name: Cities Text Dataset
      split: test
    metrics:
    - type: accuracy
      value: 99.5
      name: Test Accuracy
    - type: f1
      value: 99.5
      name: Test F1 Score (Macro)
---
# DistilBERT Text Classification Model
This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) for text classification tasks.
## Model Description
This model is a fine-tuned DistilBERT model for binary text classification: it labels English text as relating to either Pittsburgh or Shanghai. It reaches 99.5% accuracy and 99.5% macro F1 on the held-out test set.
- **Model type:** Text Classification (Binary)
- **Language(s) (NLP):** English
- **Base model:** distilbert-base-uncased
- **Classes:** Pittsburgh, Shanghai
## Intended Uses & Limitations
### Intended Uses
- Binary text classification between Pittsburgh and Shanghai-related content
- City-based text categorization tasks
- Research and educational purposes in NLP and text classification
### Limitations
- Limited to English language text
- Performance may vary on out-of-domain data
- Maximum input length of 256 tokens due to truncation
## Training and Evaluation Data
### Training Data
- **Base dataset:** [cassieli226/cities-text-dataset](https://huggingface.co/datasets/cassieli226/cities-text-dataset)
- **Original dataset:** 100 samples (50 Pittsburgh, 50 Shanghai)
- **Data augmentation:** applied to grow the dataset from 100 to 1,000 samples
- **Classes (augmented dataset):** Pittsburgh (507 samples) and Shanghai (493 samples)
- **Train/Test Split:** 80/20 split (800 train, 200 test) with stratified sampling
- **External validation:** Original 100 samples used for additional validation
### Preprocessing
- Text tokenization using DistilBERT tokenizer
- Maximum sequence length: 256 tokens
- Truncation applied to longer sequences
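A rough sketch of the data pipeline described above (assuming the published dataset exposes `text` and `label` columns, with `label` as a `ClassLabel` feature so stratified splitting works; the column names and `seed` are assumptions, not part of this card):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Load the augmented dataset (assumed to be the published 1,000-sample version).
dataset = load_dataset("cassieli226/cities-text-dataset", split="train")

# 80/20 stratified split; requires "label" to be a ClassLabel feature.
split = dataset.train_test_split(test_size=0.2, stratify_by_column="label", seed=42)
train_ds, test_ds = split["train"], split["test"]

# Tokenize with the DistilBERT tokenizer, truncating to 256 tokens.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

train_ds = train_ds.map(tokenize, batched=True)
test_ds = test_ds.map(tokenize, batched=True)
```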
## Training Procedure
### Training Hyperparameters
- **Learning rate:** 5e-5
- **Training batch size:** 16
- **Evaluation batch size:** 32
- **Number of epochs:** 4
- **Weight decay:** 0.01
- **Warmup ratio:** 0.1
- **LR scheduler:** Linear
- **Gradient accumulation steps:** 1
- **Mixed precision:** FP16 (if GPU available)
### Training Configuration
- **Optimizer:** AdamW (default)
- **Early stopping:** Enabled with patience of 2 epochs
- **Best model selection:** Based on F1 score (macro)
- **Evaluation strategy:** Every epoch
- **Save strategy:** Every epoch (best model only)
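A minimal `Trainer` sketch reflecting the hyperparameters and configuration above (it reuses `train_ds`, `test_ds`, and `tokenizer` from the data sketch in the previous section; `compute_metrics` is shown under Evaluation below; `output_dir` and the argument name `eval_strategy`, called `evaluation_strategy` on older `transformers` versions, are assumptions):

```python
import torch
from transformers import (
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
    EarlyStoppingCallback,
)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

training_args = TrainingArguments(
    output_dir="hw2-text-distilbert",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=4,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    gradient_accumulation_steps=1,
    fp16=torch.cuda.is_available(),  # mixed precision only when a GPU is present
    eval_strategy="epoch",           # "evaluation_strategy" on older versions
    save_strategy="epoch",
    load_best_model_at_end=True,     # keep only the best checkpoint
    metric_for_best_model="f1",
    save_total_limit=1,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    data_collator=DataCollatorWithPadding(tokenizer),  # dynamic padding per batch
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```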
## Evaluation
### Metrics
The model was evaluated using:
- **Accuracy:** Overall classification accuracy
- **F1 Score (Macro):** Macro-averaged F1 score across all classes
- **Per-class accuracy:** Individual class performance metrics
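A sketch of a `compute_metrics` function matching these metrics (scikit-learn is used here for convenience; the original training script may have computed them with `evaluate` or another library):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="macro"),  # macro-averaged F1
    }
```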
### Results
- **Test Set Performance:**
- Accuracy: 99.5%
- F1 Score (Macro): 99.5%
- **External Validation:**
- Accuracy: 100.0%
- F1 Score (Macro): 100.0%
### Detailed Performance
- **Pittsburgh Class:** 99.01% accuracy (101 samples)
- **Shanghai Class:** 100.0% accuracy (99 samples)
- **Confusion Matrix:** Only 1 misclassification out of 200 test samples
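The per-class numbers and the confusion matrix can be reproduced along these lines (a sketch only, reusing `trainer` and `test_ds` from the sketches above):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

pred_output = trainer.predict(test_ds)
preds = np.argmax(pred_output.predictions, axis=-1)
labels = pred_output.label_ids

cm = confusion_matrix(labels, preds)
per_class_accuracy = cm.diagonal() / cm.sum(axis=1)  # per-class recall
print(cm)
print(per_class_accuracy)
```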
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "Anyuhhh/hw2-text-distilbert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example usage
text = "Your input text here"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(predictions, dim=-1)
print(f"Predicted class: {predicted_class.item()}")
```