---
library_name: transformers
license: apache-2.0
base_model: distilbert-base-uncased
pipeline_tag: fill-mask
tags:
- masked-language-modeling
- fill-mask
- distilbert
- imdb
- domain-adaptation
- nlp
- transformers
model-index:
- name: distilbert-imdb_mask_model
  results:
  - task:
      name: Masked Language Modeling
      type: fill-mask
    dataset:
      name: IMDB Movie Reviews (unsupervised text)
      type: imdb
      split: train
    metrics:
    - name: Loss
      type: loss
      value: 2.2271
    - name: Perplexity
      type: perplexity
      value: 9.27
---

# Masked Language Modeling

## Model Overview

This model is a fine-tuned version of **distilbert-base-uncased** on the **IMDb dataset** using the **Masked Language Modeling (MLM)** objective.
It is designed for **domain adaptation**, helping DistilBERT better understand the linguistic style of IMDb movie reviews.

---

## What this model does

- Learns to predict masked tokens in movie-review text (MLM / `fill-mask`).
- Helpful as a **domain-adapted backbone** (see the sketch below) for:
  - Sentiment analysis on reviews
  - Topic / intent classification
  - Review-specific QA / RAG preprocessing
  - Any task that benefits from in-domain representations
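
As a sketch of the backbone use case, the checkpoint can be loaded into a sequence-classification model for supervised fine-tuning. The `num_labels=2` sentiment setup below is an illustrative assumption; the classification head is freshly initialized and still needs training on labeled reviews.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "azherali/distilbert-imdb_mask_model"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# The MLM head is dropped; a new classification head is placed on top of the
# domain-adapted DistilBERT encoder (illustrative 2-label sentiment setup).
# From here, fine-tune `model` on labeled reviews with Trainer or a custom loop.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
```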

---

## Quickstart

### Use with `pipeline` (Fill-Mask)

```python
from transformers import pipeline

pipe = pipeline("fill-mask", model="azherali/distilbert-imdb_mask_model")

text = "This movie was absolutely [MASK] and the performances were stunning."
pipe(text)
# [{'sequence': 'this movie was absolutely fantastic ...', 'score': ...}, ...]

for x in pipe(text):
    print(x["sequence"])

# Output:
# this movie was absolutely fantastic and the performances were stunning.
# this movie was absolutely stunning and the performances were stunning.
# this movie was absolutely beautiful and the performances were stunning.
# this movie was absolutely brilliant and the performances were stunning.
# this movie was absolutely wonderful and the performances were stunning.
```
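
By default the fill-mask pipeline returns its top 5 candidates; pass `top_k` (for example `pipe(text, top_k=10)`) to return more or fewer predictions.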

### Use with AutoModel (programmatic logits)

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_checkpoint = "azherali/distilbert-imdb_mask_model"

model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

text = "This movie was absolutely [MASK] and the performances were stunning."

inputs = tokenizer(text, return_tensors="pt")
token_logits = model(**inputs).logits

# Find the location of [MASK] and extract its logits
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
mask_token_logits = token_logits[0, mask_token_index, :]

# Pick the [MASK] candidates with the highest logits
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(f">>> {text.replace(tokenizer.mask_token, tokenizer.decode([token]))}")
```
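
If you want probabilities rather than raw logits, a softmax over the vocabulary dimension gives each candidate an interpretable score. A minimal sketch, reusing `mask_token_logits` and `tokenizer` from the block above:

```python
# Convert the [MASK]-position logits into probabilities and rank the candidates.
probs = torch.softmax(mask_token_logits, dim=-1)
top_probs, top_ids = torch.topk(probs, 5, dim=-1)

for p, token_id in zip(top_probs[0].tolist(), top_ids[0].tolist()):
    print(f"{tokenizer.decode([token_id]).strip():>12}  {p:.3f}")
```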

## Training Results

The model was trained for **5 epochs** on the IMDb dataset using the **Masked Language Modeling (MLM)** objective.

**Loss Progression:**

| Epoch | Training Loss | Validation Loss | Perplexity |
|-------|---------------|-----------------|------------|
| 1     | 2.5249        | 2.3440          | 10.42      |
| 2     | 2.3985        | 2.2913          | 9.89       |
| 3     | 2.3441        | 2.2569          | 9.55       |
| 4     | 2.3079        | 2.2328          | 9.33       |
| 5     | 2.2869        | 2.2271          | 9.27       |

- **Final Training Loss:** 2.2869
- **Final Validation Loss:** 2.2271
- **Final Perplexity:** 9.27
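
Perplexity here is simply the exponential of the validation cross-entropy loss, so the reported values can be checked directly:

```python
import math

# Perplexity of a language model = exp(mean cross-entropy loss).
final_validation_loss = 2.2271
print(round(math.exp(final_validation_loss), 2))  # 9.27
```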

---

## Training Configuration

- **Model:** distilbert-base-uncased
- **Dataset:** IMDb (unsupervised)
- **Epochs:** 5
- **Batch Size:** 32
- **Optimizer:** AdamW
- **Learning Rate Scheduler:** Linear warmup + decay
- **Total Steps:** 9,580
- **Total FLOPs:** 1.02e+16
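
A hedged sketch of how this configuration could be reproduced with the `Trainer` API. The masking probability, learning rate, warmup ratio, and sequence length below are illustrative assumptions, not values reported for this run:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Unlabeled IMDb reviews used for domain-adaptive MLM training.
dataset = load_dataset("imdb", split="unsupervised")

def tokenize(batch):
    # max_length=128 is an assumption, not a reported value.
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Dynamic random masking; the 15% probability is the library default, assumed here.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="distilbert-imdb_mask_model",
    num_train_epochs=5,              # as reported above
    per_device_train_batch_size=32,  # as reported above
    learning_rate=2e-5,              # assumption
    lr_scheduler_type="linear",      # linear decay; Trainer's default optimizer is AdamW
    warmup_ratio=0.1,                # assumption
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```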

---