---
library_name: transformers
license: apache-2.0
base_model: distilbert-base-uncased
pipeline_tag: fill-mask
tags:
- masked-language-modeling
- fill-mask
- distilbert
- imdb
- domain-adaptation
- nlp
- transformers
model-index:
- name: distilbert-imdb_mask_model
  results:
  - task:
      name: Masked Language Modeling
      type: fill-mask
    dataset:
      name: IMDB Movie Reviews (unsupervised text)
      type: imdb
      split: train
    metrics:
    - name: Loss
      type: loss
      value: 2.2271
    - name: Perplexity
      type: perplexity
      value: 9.27
---
# Masked Language Modeling
## 📖 Model Overview
This model is a fine-tuned version of **distilbert-base-uncased** on the **IMDb dataset** using the **Masked Language Modeling (MLM)** objective.
It is designed for **domain adaptation**, helping DistilBERT better understand the linguistic style of IMDb movie reviews.
---
## ✨ What this model does
- Learns to predict masked tokens in movie-review text (MLM / `fill-mask`).
- Helpful as a **domain-adapted backbone** (see the fine-tuning sketch after this list) for:
- Sentiment analysis on reviews
- Topic classification / intent
- Review-specific QA / RAG preprocessing
- Any task that benefits from in-domain representations
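Because the checkpoint keeps the standard DistilBERT architecture, it can be loaded directly under a downstream head. A minimal fine-tuning sketch (the `num_labels` setting and the classification task are illustrative assumptions, not part of this repository):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "azherali/distilbert-imdb_mask_model"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# The MLM head is dropped and a freshly initialized classification head is
# placed on top of the domain-adapted encoder (num_labels=2 is an assumed
# setting for binary sentiment classification).
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
```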
---
## 🚀 Quickstart
### Use with `pipeline` (Fill-Mask)
```python
from transformers import pipeline

pipe = pipeline("fill-mask", model="azherali/distilbert-imdb_mask_model")

text = "This movie was absolutely [MASK] and the performances were stunning."
pipe(text)
# [{'sequence': 'this movie was absolutely fantastic ...', 'score': ...}, ...]

# Print only the completed sequences:
for x in pipe(text):
    print(x["sequence"])

# Output:
# this movie was absolutely fantastic and the performances were stunning.
# this movie was absolutely stunning and the performances were stunning.
# this movie was absolutely beautiful and the performances were stunning.
# this movie was absolutely brilliant and the performances were stunning.
# this movie was absolutely wonderful and the performances were stunning.
```
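The fill-mask pipeline also accepts a `top_k` argument to control how many completions are returned:

```python
# Return only the three highest-scoring completions.
pipe(text, top_k=3)
```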
### Use with `AutoModelForMaskedLM` (programmatic logits)
```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_checkpoint = "azherali/distilbert-imdb_mask_model"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

text = "This movie was absolutely [MASK] and the performances were stunning."
inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits

# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]

# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f"'>>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}'")
```
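To report probabilities instead of raw logits, a softmax over the vocabulary dimension can be added after the snippet above (a small follow-up sketch reusing the variables defined there):

```python
# Convert the [MASK] logits into a probability distribution over the vocabulary.
probs = torch.softmax(mask_token_logits, dim=-1)
top_5 = torch.topk(probs, 5, dim=-1)
for prob, token_id in zip(top_5.values[0].tolist(), top_5.indices[0].tolist()):
    print(f"{tokenizer.decode([token_id])}: {prob:.3f}")
```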
## 📊 Training Results
The model was trained for **5 epochs** on the IMDb dataset using the **Masked Language Modeling (MLM)** objective.
**Loss Progression:**
| Epoch | Training Loss | Validation Loss | Perplexity |
|-------|---------------|-----------------|-------------|
| 1 | 2.5249 | 2.3440 | 10.42 |
| 2 | 2.3985 | 2.2913 | 9.89 |
| 3 | 2.3441 | 2.2569 | 9.55 |
| 4 | 2.3079 | 2.2328 | 9.33 |
| 5 | 2.2869 | 2.2271 | 9.27 |
✔️ **Final Training Loss:** 2.2869
✔️ **Final Validation Loss:** 2.2271
✔️ **Final Perplexity:** 9.27
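Perplexity is simply the exponential of the validation loss, which is how the last column of the table follows from the one before it:

```python
import math

# exp(validation loss) = perplexity, e.g. for the final epoch:
math.exp(2.2271)  # ≈ 9.27
```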
---
## ⚡ Training Configuration
- **Model:** distilbert-base-uncased
- **Dataset:** IMDb (unsupervised)
- **Epochs:** 5
- **Batch Size:** 32
- **Optimizer:** AdamW
- **Learning Rate Scheduler:** Linear warmup + decay
- **Total Steps:** 9,580
- **Total FLOPs:** 1.02e+16
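
For reference, a training run matching this configuration typically looks like the sketch below, with `DataCollatorForLanguageModeling` applying random masking on the fly. The masking probability, sequence length, learning rate, and warmup fraction are assumptions about the usual recipe and are not reported above:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# The IMDb "unsupervised" split provides raw review text without labels.
raw = load_dataset("imdb", split="unsupervised")
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True,
    remove_columns=raw.column_names,
)

# Randomly mask 15% of tokens in each batch (assumed masking probability).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="distilbert-imdb_mask_model",
    num_train_epochs=5,
    per_device_train_batch_size=32,
    learning_rate=2e-5,          # assumed; not stated in this card
    lr_scheduler_type="linear",
    warmup_ratio=0.1,            # assumed warmup fraction
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```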
---