---
library_name: transformers
license: apache-2.0
base_model: distilbert-base-uncased
pipeline_tag: fill-mask
tags:
- masked-language-modeling
- fill-mask
- distilbert
- imdb
- domain-adaptation
- nlp
- transformers
model-index:
- name: distilbert-imdb_mask_model
  results:
  - task:
      name: Masked Language Modeling
      type: fill-mask
    dataset:
      name: IMDB Movie Reviews (unsupervised text)
      type: imdb
      split: train
    metrics:
    - name: Loss
      type: loss
      value: 2.2271
    - name: Perplexity
      type: perplexity
      value: 9.27
---

# Masked Language Modeling

## 📌 Model Overview
This model is a fine-tuned version of **distilbert-base-uncased** on the **IMDb dataset** using the **Masked Language Modeling (MLM)** objective.
It is designed for **domain adaptation**, helping DistilBERT better understand the linguistic style of IMDb movie reviews.

---

## ✨ What this model does

- Learns to predict masked tokens in movie-review text (MLM / `fill-mask`).
- Helpful as a **domain-adapted backbone** (a minimal fine-tuning sketch follows this list) for:
  - Sentiment analysis on reviews
  - Topic classification / intent
  - Review-specific QA / RAG preprocessing
  - Any task that benefits from in-domain representations

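As an illustration, the checkpoint can be loaded as the encoder of a downstream classifier. This is only a sketch: the two-label sentiment setup is an example, not a head this checkpoint ships with.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "azherali/distilbert-imdb_mask_model"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Reuse the domain-adapted DistilBERT encoder; a fresh classification head is
# added on top (transformers will warn that its weights are newly initialized).
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# The model is then fine-tuned on labeled reviews as usual (e.g. with the Trainer API).
```
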
---

## 🚀 Quickstart

### Use with `pipeline` (Fill-Mask)

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="azherali/distilbert-imdb_mask_model")

text = "This movie was absolutely [MASK] and the performances were stunning."
predictions = fill_mask(text)
# Each prediction is a dict with 'sequence', 'score', 'token', and 'token_str'.

for prediction in predictions:
    print(prediction["sequence"])

# Output:
# this movie was absolutely fantastic and the performances were stunning.
# this movie was absolutely stunning and the performances were stunning.
# this movie was absolutely beautiful and the performances were stunning.
# this movie was absolutely brilliant and the performances were stunning.
# this movie was absolutely wonderful and the performances were stunning.
```
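By default the pipeline returns the five highest-scoring completions; if more candidates are needed, a `top_k` argument can be passed (available in recent `transformers` releases):

```python
# Request the ten best completions instead of the default five.
fill_mask(text, top_k=10)
```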
### Use with AutoModel (programmatic logits)

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_checkpoint = "azherali/distilbert-imdb_mask_model"

model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

text = "This movie was absolutely [MASK] and the performances were stunning."

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits

# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]

# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f">>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}")
```

---

## 📈 Training Results

The model was trained for **5 epochs** on the IMDb dataset using the **Masked Language Modeling (MLM)** objective.

**Loss Progression:**

| Epoch | Training Loss | Validation Loss | Perplexity |
|-------|---------------|-----------------|------------|
| 1     | 2.5249        | 2.3440          | 10.42      |
| 2     | 2.3985        | 2.2913          | 9.89       |
| 3     | 2.3441        | 2.2569          | 9.55       |
| 4     | 2.3079        | 2.2328          | 9.33       |
| 5     | 2.2869        | 2.2271          | 9.27       |

- ✔️ **Final Training Loss:** 2.28
- ✔️ **Final Validation Loss:** 2.22
- ✔️ **Final Perplexity:** 9.27

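The reported perplexity is simply the exponential of the cross-entropy loss, so it can be checked directly against the final validation loss in the table:

```python
import math

final_validation_loss = 2.2271
print(round(math.exp(final_validation_loss), 2))  # 9.27 — matches the reported perplexity
```
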
---

## ⚡ Training Configuration

- **Model:** distilbert-base-uncased
- **Dataset:** IMDb (unsupervised)
- **Epochs:** 5
- **Batch Size:** 32
- **Optimizer:** AdamW
- **Learning Rate Scheduler:** Linear warmup + decay
- **Total Steps:** 9,580
- **Total FLOPs:** 1.02e+16

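For reference, here is a minimal sketch of how a run with this configuration could be set up with the `transformers` Trainer. Everything not listed above (learning rate, warmup ratio, masking probability, sequence length, preprocessing scheme) is an illustrative assumption, not the exact recipe used for this checkpoint.

```python
# Illustrative sketch only: hyperparameters not listed in the card are assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForMaskedLM.from_pretrained(base_model)

# IMDb "unsupervised" split: 50k unlabeled reviews.
reviews = load_dataset("imdb", split="unsupervised")

def tokenize(batch):
    # Simple truncation for brevity; the original preprocessing is not documented in this card.
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = reviews.map(tokenize, batched=True, remove_columns=reviews.column_names)

# Randomly masks 15% of tokens at each step (standard MLM default, assumed here).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# Trainer uses AdamW by default, matching the optimizer listed above.
args = TrainingArguments(
    output_dir="distilbert-imdb_mask_model",
    num_train_epochs=5,              # from the card
    per_device_train_batch_size=32,  # from the card
    learning_rate=2e-5,              # assumed
    lr_scheduler_type="linear",      # linear decay, per the card
    warmup_ratio=0.1,                # linear warmup per the card; the ratio is assumed
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```
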
---