Kunalatmosoft committed on
Commit 3ca9351 · verified · 1 Parent(s): 700fbd8

Update README.md

Files changed (1)
  1. README.md +106 -3
README.md CHANGED
@@ -1,3 +1,106 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ ---
+ # IMDb Sentiment Analysis Model
+
+ ## Model Overview
+ This model is a fine-tuned **DistilBERT** (`distilbert-base-uncased`) for **sentiment analysis** on the IMDb dataset. It classifies movie reviews as **positive (1)** or **negative (0)**.
+
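+ For a quick smoke test, the fine-tuned weights can be wrapped in a `pipeline` (a minimal sketch; it assumes the model was saved locally to `my_model` as in the training script below, and since no `id2label` mapping is set, the pipeline reports the generic labels `LABEL_0`/`LABEL_1`):
+
+ ```python
+ from transformers import pipeline
+
+ # Loads both the model and the tokenizer from the saved checkpoint directory
+ classifier = pipeline("text-classification", model="my_model")
+
+ # LABEL_1 corresponds to positive, LABEL_0 to negative
+ print(classifier("A wonderful, heartfelt film."))
+ ```
+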
+ ## Dataset
+ - **Dataset Used**: IMDb Movie Reviews
+ - **Source**: Hugging Face's `datasets` library (`imdb`)
+ - **Training Samples**: 50 (subsampled for fast training; see the snippet below)
+ - **Test Samples**: 20
+
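+ The subsets are drawn from the full IMDb split, which ships with 25,000 labeled training reviews and 25,000 labeled test reviews. A minimal sketch for inspecting the raw data before tokenization:
+
+ ```python
+ from datasets import load_dataset
+
+ # Each example has a "text" field (the review) and a "label" field (0 = negative, 1 = positive)
+ dataset = load_dataset("imdb")
+ print(dataset["train"].num_rows, dataset["test"].num_rows)  # 25000 25000
+ print(dataset["train"][0]["label"], dataset["train"][0]["text"][:100])
+ ```
+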
+ ## Training Details
+ - **Model Architecture**: DistilBERT for Sequence Classification
+ - **Pretrained Model**: `distilbert-base-uncased`
+ - **Training Time**: ~1 minute
+ - **Number of Epochs**: 1
+ - **Batch Size**: 1 (for speed)
+ - **Evaluation Strategy**: Per epoch
+
+ ## Training Script
+ ```python
+ from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer, AutoTokenizer
+ from datasets import load_dataset
+
+ # Load IMDb dataset
+ dataset = load_dataset("imdb")
+
+ # Load tokenizer and model (a fresh 2-label classification head is initialized)
+ model_name = "distilbert-base-uncased"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
+
+ # Tokenize dataset
+ def tokenize_function(examples):
+     return tokenizer(examples["text"], padding="max_length", truncation=True)
+
+ tokenized_datasets = dataset.map(tokenize_function, batched=True)
+
+ # Reduce dataset size for fast training
+ train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(50))
+ test_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(20))
+
+ # Training arguments
+ training_args = TrainingArguments(
+     output_dir="./results",
+     evaluation_strategy="epoch",  # renamed to `eval_strategy` in recent transformers releases
+     per_device_train_batch_size=1,
+     per_device_eval_batch_size=1,
+     num_train_epochs=1,
+     save_strategy="epoch",
+     report_to="none"
+ )
+
+ # Trainer setup
+ trainer = Trainer(
+     model=model,
+     args=training_args,
+     train_dataset=train_dataset,
+     eval_dataset=test_dataset
+ )
+
+ # Train the model
+ trainer.train()
+
+ # Save trained model and tokenizer
+ model.save_pretrained("my_model")
+ tokenizer.save_pretrained("my_model")
+ ```
+
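+ As written, the per-epoch evaluation reports only the loss. To also track accuracy, a `compute_metrics` function can be passed to the `Trainer`; a minimal sketch (not part of the original script):
+
+ ```python
+ import numpy as np
+
+ # Receives (logits, labels) for the eval set; returns a dict of metric names to values
+ def compute_metrics(eval_pred):
+     logits, labels = eval_pred
+     predictions = np.argmax(logits, axis=-1)
+     return {"accuracy": float((predictions == labels).mean())}
+
+ # Then build the Trainer with the extra argument:
+ # trainer = Trainer(..., compute_metrics=compute_metrics)
+ ```
+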
+ ## How to Use the Model
+ You can load the trained model and use it for sentiment analysis as follows:
+
+ ```python
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+ import torch
+
+ # Load the trained model and tokenizer
+ tokenizer = AutoTokenizer.from_pretrained("my_model")
+ model = AutoModelForSequenceClassification.from_pretrained("my_model")
+ model.eval()  # inference mode: disables dropout
+
+ # Function to predict sentiment
+ def predict_sentiment(text):
+     inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
+     with torch.no_grad():  # no gradients needed at inference time
+         outputs = model(**inputs)
+     prediction = torch.argmax(outputs.logits, dim=1).item()
+     return "Positive" if prediction == 1 else "Negative"
+
+ # Example usage
+ print(predict_sentiment("This movie was amazing!"))   # Expected: Positive
+ print(predict_sentiment("I didn't like this movie.")) # Expected: Negative
+ ```
+
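+ When a confidence score is useful alongside the label, the logits can be normalized with a softmax (a small variation on the function above; it reuses the `model` and `tokenizer` already loaded there):
+
+ ```python
+ def predict_with_confidence(text):
+     inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
+     with torch.no_grad():
+         logits = model(**inputs).logits
+     probs = torch.softmax(logits, dim=1).squeeze()  # normalize logits to probabilities
+     prediction = int(torch.argmax(probs))
+     label = "Positive" if prediction == 1 else "Negative"
+     return label, probs[prediction].item()
+
+ print(predict_with_confidence("A slow start, but a fantastic ending."))
+ ```
+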
+ ## Deployment
+ The trained model can be uploaded to the Hugging Face Hub for hosted inference. The `transformers-cli upload` command has been retired; the current `huggingface-cli` equivalent is shown below (the repo id `your-hf-username/my_model` is a placeholder):
+
+ ```bash
+ huggingface-cli login
+ huggingface-cli upload your-hf-username/my_model ./my_model
+ ```
+
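+ Alternatively, the weights can be pushed directly from Python with the standard `push_to_hub` API (the repo id is again a placeholder):
+
+ ```python
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+ model = AutoModelForSequenceClassification.from_pretrained("my_model")
+ tokenizer = AutoTokenizer.from_pretrained("my_model")
+
+ # Creates the repo if it does not exist and uploads the model and tokenizer files
+ model.push_to_hub("your-hf-username/my_model")
+ tokenizer.push_to_hub("your-hf-username/my_model")
+ ```
+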
+ ## License
+ Apache 2.0, as declared in the `license: apache-2.0` metadata above.
+