---
library_name: transformers
tags:
  - text-classification
  - sentiment-analysis
  - distilbert
  - sequence-classification
  - huggingface
datasets:
  - imdb
---

Model Card for Fine-tuned DistilBERT on IMDB for Sentiment Analysis

Model Details

Model Description

This model is a fine-tuned version of distilbert-base-uncased for sentiment analysis, trained on the IMDB dataset. It classifies movie reviews as expressing either positive or negative sentiment.

  • Developed by: shogun-the-great
  • Model type: Sequence Classification (Sentiment Analysis)
  • Language(s): English
  • License: Apache-2.0
  • Finetuned from model: distilbert-base-uncased

Uses

Direct Use

This model can be used directly for sentiment analysis of movie reviews and similar text. Typical use cases include:

  • Review sentiment classification
  • Customer feedback analysis
  • Social media sentiment monitoring
  • Product review analysis

Downstream Use

This model can be further fine-tuned for sentiment analysis in other domains, such as product reviews or social media content; a minimal starting point is sketched below.
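A sketch of that starting point, assuming the checkpoint name used in the examples further down (shogun-the-great/distilbert-imdb-finetuned):

from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Start from this checkpoint rather than distilbert-base-uncased, then
# continue training on a labeled dataset from the target domain.
checkpoint = "shogun-the-great/distilbert-imdb-finetuned"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
# Pass `model` and the tokenized domain dataset to transformers.Trainer,
# as in the training sketch under "Training Procedure" below.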

Out-of-Scope Use

This model might not perform well for:

  • Non-English text
  • Text domains very different from movie reviews
  • Fine-grained sentiment analysis (beyond binary positive/negative classification)

Bias, Risks, and Limitations

Bias

The model's predictions are influenced by the IMDB dataset used during fine-tuning. If the dataset contains biases related to certain movie genres, directors, or actors, they may be reflected in the predictions.

Risks

  • False positives/negatives: Incorrectly classified sentiment, especially for reviews with complex or nuanced opinions
  • Limited generalization to non-entertainment domains
  • Potential reinforcement of existing biases in movie review data

Recommendations

  • Regularly update the model with diverse data for better generalization
  • Review and monitor predictions to ensure accuracy across different types of content
  • Consider using this model as part of a larger system with human oversight for critical applications (one possible pattern is sketched below)
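For the last recommendation, one simple pattern is to route low-confidence predictions to a human reviewer. A minimal sketch (the 0.75 threshold is illustrative, not tuned):

from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="shogun-the-great/distilbert-imdb-finetuned")

REVIEW_THRESHOLD = 0.75  # illustrative cutoff; tune on a held-out validation set

def classify_with_oversight(text: str) -> dict:
    """Classify text, flagging low-confidence predictions for human review."""
    result = classifier(text)[0]
    return {
        "text": text,
        "label": result["label"],
        "score": result["score"],
        "needs_human_review": result["score"] < REVIEW_THRESHOLD,
    }

print(classify_with_oversight("The film was fine, I suppose."))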

Training Details

Training Data

The model was trained on the IMDB dataset, which contains 25,000 movie reviews for training and 25,000 for testing, with balanced positive and negative labels.
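The dataset is hosted on the Hugging Face Hub and can be inspected directly (it also ships a third, unlabeled split that is not used for supervised fine-tuning):

from datasets import load_dataset

# IMDB: 25,000 labeled reviews each for train and test,
# plus an unlabeled "unsupervised" split
dataset = load_dataset("imdb")
print(dataset)
print(dataset["train"][0])  # {'text': '...', 'label': 0}  -- 0 = negative, 1 = positive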

Training Procedure

  • Preprocessing: Text was tokenized with the DistilBERT tokenizer, truncated to a maximum sequence length of 512 tokens
  • Training Hyperparameters:
    • Learning rate: 2e-5
    • Batch size: 16
    • Number of epochs: 3
    • Weight decay: 0.01
    • Optimizer: AdamW
  • Evaluation Results: Accuracy of approximately 92.5% on the IMDB test set (the full setup is sketched below)
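For reference, this setup maps onto the Hugging Face Trainer API roughly as follows. This is a reconstruction from the hyperparameters listed above, not the original training script:

from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

dataset = load_dataset("imdb")

def tokenize(batch):
    # Truncate to the 512-token maximum used during fine-tuning
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="distilbert-imdb-finetuned",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,  # AdamW is the Trainer's default optimizer
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)

trainer.train()
trainer.save_model("distilbert-imdb-finetuned")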

How to Get Started with the Model

You can load the fine-tuned model directly from the Hugging Face Hub:

from transformers import pipeline

# Load model directly
classifier = pipeline("sentiment-analysis", model="shogun-the-great/distilbert-imdb-finetuned")

# Example usage
texts = [
    "This movie was absolutely fantastic! I loved every minute of it.",
    "What a waste of time. The plot made no sense and the acting was terrible."
]

results = classifier(texts)
for text, result in zip(texts, results):
    # LABEL_1 maps to the positive class, assuming the checkpoint keeps the default id2label
    sentiment = "positive" if result["label"] == "LABEL_1" else "negative"
    print(f"Text: {text}")
    print(f"Sentiment: {sentiment} (Score: {result['score']:.4f})")
    print()

Alternatively, you can load the model and tokenizer separately:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "shogun-the-great/distilbert-imdb-finetuned"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # inference mode (disables dropout)

# Example usage
text = "This movie was absolutely fantastic! I loved every minute of it."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():  # no gradients needed at inference time
    outputs = model(**inputs)
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)[0]
# Index 1 is treated as the positive class, matching the pipeline example above
sentiment = "positive" if probs[1] > probs[0] else "negative"
print(f"Sentiment: {sentiment} (Score: {probs.max().item():.4f})")

Model Architecture

DistilBERT with a sequence classification head:

  • 6 layers
  • 768 hidden dimension
  • 12 attention heads
  • 66M parameters total
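These figures can be verified against the checkpoint's config (attribute names follow DistilBertConfig; the total is roughly 66-67M once the classification head is added):

from transformers import AutoConfig, AutoModelForSequenceClassification

model_name = "shogun-the-great/distilbert-imdb-finetuned"

config = AutoConfig.from_pretrained(model_name)
print(config.n_layers, config.dim, config.n_heads)  # 6 768 12

model = AutoModelForSequenceClassification.from_pretrained(model_name)
# ~66M backbone parameters plus ~0.6M for the classification head
print(f"{model.num_parameters():,} parameters")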