---
library_name: transformers
license: apache-2.0
base_model: distilbert-base-uncased
pipeline_tag: fill-mask
tags:
- masked-language-modeling
- fill-mask
- distilbert
- imdb
- domain-adaptation
- nlp
- transformers
model-index:
- name: distilbert-imdb_mask_model
  results:
  - task:
      name: Masked Language Modeling
      type: fill-mask
    dataset:
      name: IMDB Movie Reviews (unsupervised text)
      type: imdb
      split: train
    metrics:
    - name: Loss
      type: loss
      value: 2.2271
    - name: Perplexity
      type: perplexity
      value: 9.27
---

# Masked Language Modeling

## 📌 Model Overview

This model is a fine-tuned version of **distilbert-base-uncased** on the **IMDb dataset** using the **Masked Language Modeling (MLM)** objective. It is designed for **domain adaptation**, helping DistilBERT better capture the linguistic style of IMDb movie reviews.

---

## ✨ What this model does

- Learns to predict masked tokens in movie-review text (MLM / `fill-mask`).
- Useful as a **domain-adapted backbone** for:
  - Sentiment analysis on reviews
  - Topic classification / intent detection
  - Review-specific QA / RAG preprocessing
  - Any task that benefits from in-domain representations

---

## 🚀 Quickstart

### Use with `pipeline` (Fill-Mask)

```python
from transformers import pipeline

pipe = pipeline("fill-mask", model="azherali/distilbert-imdb_mask_model")

text = "This movie was absolutely [MASK] and the performances were stunning."

# Each prediction is a dict with the filled sequence, its score, and the candidate token:
# [{'sequence': 'this movie was absolutely fantastic ...', 'score': ...}, ...]
for prediction in pipe(text):
    print(prediction["sequence"])

# Output:
# this movie was absolutely fantastic and the performances were stunning.
# this movie was absolutely stunning and the performances were stunning.
# this movie was absolutely beautiful and the performances were stunning.
# this movie was absolutely brilliant and the performances were stunning.
# this movie was absolutely wonderful and the performances were stunning.
```

### Use with `AutoModel` (programmatic logits)

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_checkpoint = "azherali/distilbert-imdb_mask_model"
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

text = "This movie was absolutely [MASK] and the performances were stunning."

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits

# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]

# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f">>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}")
```

## 📈 Training Results

The model was trained for **5 epochs** on the IMDb dataset using the **Masked Language Modeling (MLM)** objective.
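The reported perplexity is simply the exponential of the mean validation cross-entropy loss (standard for masked language models, not specific to this checkpoint). A quick sanity check of that conversion against the final epoch below:

```python
import math

# Perplexity for an MLM is exp(mean cross-entropy loss) on the evaluation set.
eval_loss = 2.2271            # final validation loss from the table below
perplexity = math.exp(eval_loss)
print(f"Perplexity: {perplexity:.2f}")  # -> 9.27
```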
**Loss Progression:**

| Epoch | Training Loss | Validation Loss | Perplexity |
|-------|---------------|-----------------|------------|
| 1     | 2.5249        | 2.3440          | 10.42      |
| 2     | 2.3985        | 2.2913          | 9.89       |
| 3     | 2.3441        | 2.2569          | 9.55       |
| 4     | 2.3079        | 2.2328          | 9.33       |
| 5     | 2.2869        | 2.2271          | 9.27       |

✔️ **Final Training Loss:** 2.2869
✔️ **Final Validation Loss:** 2.2271
✔️ **Final Perplexity:** 9.27

---

## ⚡ Training Configuration

- **Model:** distilbert-base-uncased
- **Dataset:** IMDb (unsupervised split)
- **Epochs:** 5
- **Batch Size:** 32
- **Optimizer:** AdamW
- **Learning Rate Scheduler:** Linear warmup + decay
- **Total Steps:** 9,580
- **Total FLOPs:** 1.02e+16

---
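For reference, a minimal sketch of how this configuration could be reproduced with the 🤗 `Trainer` API and the standard BERT-style masking collator. The learning rate, warmup ratio, masking probability, sequence length, and train/eval split below are illustrative assumptions, not the exact values used for this checkpoint:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# IMDb "unsupervised" split: unlabeled reviews used only for domain adaptation.
raw = load_dataset("imdb", split="unsupervised")

def tokenize(batch):
    # Truncation to a fixed length is an assumption; chunking long reviews also works.
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)
splits = tokenized.train_test_split(test_size=0.1, seed=42)  # held-out set for eval loss

# Randomly masks 15% of input tokens on the fly (the standard MLM objective).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="distilbert-imdb_mask_model",
    num_train_epochs=5,
    per_device_train_batch_size=32,
    learning_rate=2e-5,          # assumption; not stated on this card
    lr_scheduler_type="linear",
    warmup_ratio=0.1,            # assumption
    eval_strategy="epoch",       # recent transformers releases; older ones use evaluation_strategy
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    data_collator=collator,
    tokenizer=tokenizer,
)

trainer.train()
```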