---
library_name: transformers
license: apache-2.0
base_model: distilbert-base-uncased
pipeline_tag: fill-mask
tags:
- masked-language-modeling
- fill-mask
- distilbert
- imdb
- domain-adaptation
- nlp
- transformers
model-index:
- name: distilbert-imdb_mask_model
  results:
  - task:
      name: Masked Language Modeling
      type: fill-mask
    dataset:
      name: IMDB Movie Reviews (unsupervised text)
      type: imdb
      split: train
    metrics:
    - name: Loss
      type: loss
      value: 2.2271
    - name: Perplexity
      type: perplexity
      value: 9.27
---

# Masked Language Modeling

## Model Overview

This model is a fine-tuned version of **distilbert-base-uncased** on the **IMDb dataset** using the **Masked Language Modeling (MLM)** objective.
It is designed for **domain adaptation**, helping DistilBERT better understand the linguistic style of IMDb movie reviews.

---

## What this model does

- Learns to predict masked tokens in movie-review text (MLM / `fill-mask`).
- Helpful as a **domain-adapted backbone** (see the sketch below) for:
  - Sentiment analysis on reviews
  - Topic / intent classification
  - Review-specific QA / RAG preprocessing
  - Any task that benefits from in-domain representations
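
As a sketch of the backbone use case, the checkpoint can be loaded into a sequence-classification model for supervised fine-tuning. The `num_labels=2` sentiment setup below is an illustrative assumption; the classification head is freshly initialized and still needs training on labeled reviews.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "azherali/distilbert-imdb_mask_model"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# The MLM head is dropped; a new classification head is placed on top of the
# domain-adapted DistilBERT encoder (illustrative 2-label sentiment setup).
# From here, fine-tune `model` on labeled reviews with Trainer or a custom loop.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
```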

---

## Quickstart

### Use with `pipeline` (Fill-Mask)

```python
from transformers import pipeline

pipe = pipeline("fill-mask", model="azherali/distilbert-imdb_mask_model")

text = "This movie was absolutely [MASK] and the performances were stunning."
pipe(text)
# [{'sequence': 'this movie was absolutely fantastic ...', 'score': ...}, ...]

for x in pipe(text):
    print(x["sequence"])

# Output:
# this movie was absolutely fantastic and the performances were stunning.
# this movie was absolutely stunning and the performances were stunning.
# this movie was absolutely beautiful and the performances were stunning.
# this movie was absolutely brilliant and the performances were stunning.
# this movie was absolutely wonderful and the performances were stunning.
```
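
By default the fill-mask pipeline returns its top 5 candidates; pass `top_k` (for example `pipe(text, top_k=10)`) to return more or fewer predictions.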

### Use with AutoModel (programmatic logits)

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_checkpoint = "azherali/distilbert-imdb_mask_model"

model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

text = "This movie was absolutely [MASK] and the performances were stunning."

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits

# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]

# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f">>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}")
```
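
If you want probabilities rather than raw logits, a softmax over the vocabulary dimension gives each candidate an interpretable score. A minimal sketch, reusing `mask_token_logits` and `tokenizer` from the block above:

```python
# Convert the [MASK]-position logits into probabilities and rank the candidates.
probs = torch.softmax(mask_token_logits, dim=-1)
top_probs, top_ids = torch.topk(probs, 5, dim=-1)

for p, token_id in zip(top_probs[0].tolist(), top_ids[0].tolist()):
    print(f"{tokenizer.decode([token_id]).strip():>12}  {p:.3f}")
```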

## Training Results

The model was trained for **5 epochs** on the IMDb dataset using the **Masked Language Modeling (MLM)** objective.

**Loss Progression:**

| Epoch | Training Loss | Validation Loss | Perplexity |
|-------|---------------|-----------------|------------|
| 1     | 2.5249        | 2.3440          | 10.42      |
| 2     | 2.3985        | 2.2913          | 9.89       |
| 3     | 2.3441        | 2.2569          | 9.55       |
| 4     | 2.3079        | 2.2328          | 9.33       |
| 5     | 2.2869        | 2.2271          | 9.27       |

- **Final Training Loss:** 2.2869
- **Final Validation Loss:** 2.2271
- **Final Perplexity:** 9.27
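
Perplexity here is simply the exponential of the validation cross-entropy loss, so the reported values can be checked directly:

```python
import math

# Perplexity of a language model = exp(mean cross-entropy loss).
final_validation_loss = 2.2271
print(round(math.exp(final_validation_loss), 2))  # 9.27
```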

---

## Training Configuration

- **Model:** distilbert-base-uncased
- **Dataset:** IMDb (unsupervised)
- **Epochs:** 5
- **Batch Size:** 32
- **Optimizer:** AdamW
- **Learning Rate Scheduler:** Linear warmup + decay
- **Total Steps:** 9,580
- **Total FLOPs:** 1.02e+16
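
A hedged sketch of how this configuration could be reproduced with the `Trainer` API. The masking probability, learning rate, warmup ratio, and sequence length below are illustrative assumptions, not values reported for this run:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Unlabeled IMDb reviews used for domain-adaptive MLM training.
dataset = load_dataset("imdb", split="unsupervised")

def tokenize(batch):
    # max_length=128 is an assumption, not a reported value.
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Dynamic random masking; the 15% probability is the library default, assumed here.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="distilbert-imdb_mask_model",
    num_train_epochs=5,              # as reported above
    per_device_train_batch_size=32,  # as reported above
    learning_rate=2e-5,              # assumption
    lr_scheduler_type="linear",      # linear decay; Trainer's default optimizer is AdamW
    warmup_ratio=0.1,                # assumption
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```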

---