---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- text-classification
- distilbert
- fine-tuned
- pytorch
datasets:
- cassieli226/cities-text-dataset
base_model: distilbert-base-uncased
model-index:
- name: hw2-text-distilbert
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      type: cassieli226/cities-text-dataset
      name: Cities Text Dataset
      split: test
    metrics:
    - type: accuracy
      value: 99.5
      name: Test Accuracy
    - type: f1
      value: 99.5
      name: Test F1 Score (Macro)
---

# DistilBERT Text Classification Model

This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) for text classification.

## Model Description

This is a fine-tuned DistilBERT model for binary text classification, designed to classify text as related to either Pittsburgh or Shanghai. It reaches 99.5% accuracy on the test set.

- **Model type:** Text Classification (Binary)
- **Language(s) (NLP):** English
- **Base model:** distilbert-base-uncased
- **Classes:** Pittsburgh, Shanghai

## Intended Uses & Limitations

### Intended Uses

- Binary text classification between Pittsburgh- and Shanghai-related content
- City-based text categorization tasks
- Research and educational purposes in NLP and text classification

### Limitations

- Limited to English-language text
- Performance may vary on out-of-domain data
- Maximum input length of 256 tokens due to truncation

## Training and Evaluation Data

### Training Data

- **Base dataset:** [cassieli226/cities-text-dataset](https://huggingface.co/datasets/cassieli226/cities-text-dataset)
- **Classes:** Pittsburgh (507 samples) and Shanghai (493 samples) in the augmented dataset
- **Original dataset:** 100 samples (50 Pittsburgh, 50 Shanghai)
- **Data augmentation:** Applied to grow the dataset from 100 to 1,000 samples
- **Train/test split:** 80/20 (800 train, 200 test) with stratified sampling
- **External validation:** The original 100 samples were used for additional validation

### Preprocessing

- Text tokenization with the DistilBERT tokenizer
- Maximum sequence length: 256 tokens
- Truncation applied to longer sequences

## Training Procedure

### Training Hyperparameters

- **Learning rate:** 5e-5
- **Training batch size:** 16
- **Evaluation batch size:** 32
- **Number of epochs:** 4
- **Weight decay:** 0.01
- **Warmup ratio:** 0.1
- **LR scheduler:** Linear
- **Gradient accumulation steps:** 1
- **Mixed precision:** FP16 (if a GPU is available)

### Training Configuration

- **Optimizer:** AdamW (default)
- **Early stopping:** Enabled with a patience of 2 epochs
- **Best model selection:** Based on macro F1 score
- **Evaluation strategy:** Every epoch
- **Save strategy:** Every epoch (best model only)

## Evaluation

### Metrics

The model was evaluated using:

- **Accuracy:** Overall classification accuracy
- **F1 score (macro):** Macro-averaged F1 score across all classes
- **Per-class accuracy:** Individual class performance

### Results

- **Test set:**
  - Accuracy: 99.5%
  - F1 score (macro): 99.5%
- **External validation (original 100 samples):**
  - Accuracy: 100.0%
  - F1 score (macro): 100.0%

### Detailed Performance

- **Pittsburgh class:** 99.01% accuracy (101 samples)
- **Shanghai class:** 100.0% accuracy (99 samples)
- **Confusion matrix:** Only 1 misclassification out of 200 test samples
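
## Training Script Sketch

The snippet below is a minimal sketch of how the hyperparameters and metrics described above could be wired together with the Hugging Face `Trainer`. It is not the original training script: the dataset column names (`text`, `label`), the 80/20 split shown here, the output directory, and some argument names (which vary slightly across `transformers` versions) are assumptions.

```python
import numpy as np
import torch
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

# Assumed: the augmented dataset has a single "train" split with "text"/"label"
# columns; the card's stratified 80/20 split may have been produced differently.
raw = load_dataset("cassieli226/cities-text-dataset")
splits = raw["train"].train_test_split(test_size=0.2, seed=42)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

def tokenize(batch):
    # Preprocessing described above: truncate to at most 256 tokens.
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = splits.map(tokenize, batched=True)

def compute_metrics(eval_pred):
    # Accuracy and macro-averaged F1, the metrics reported in the Evaluation section.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="macro"),
    }

training_args = TrainingArguments(
    output_dir="hw2-text-distilbert",   # assumed output path
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=4,
    weight_decay=0.01,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    gradient_accumulation_steps=1,
    fp16=torch.cuda.is_available(),     # FP16 only when a GPU is available
    eval_strategy="epoch",              # "evaluation_strategy" on older versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",         # select the best model by macro F1
    save_total_limit=1,                 # keep the best checkpoint only
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],  # patience of 2 epochs
)

trainer.train()
```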
## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load model and tokenizer
model_name = "Anyuhhh/hw2-text-distilbert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example usage
text = "Your input text here"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    outputs = model(**inputs)

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(predictions, dim=-1)
print(f"Predicted class: {predicted_class.item()}")
```
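
The printed value is the raw class index. If the label names were stored in the model config when it was saved (the exact mapping is not documented on this card), the index can be mapped back to a city name, for example:

```python
# Assumes id2label (e.g. {0: "Pittsburgh", 1: "Shanghai"}) was saved in the config;
# falls back to the raw index if no mapping is present.
label = model.config.id2label.get(predicted_class.item(), str(predicted_class.item()))
print(f"Predicted label: {label}")
```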