Model Information
- Base Model: HuggingFaceTB/SmolLM2-135M-Instruct
- Fine-tuned Model Name: jatinmehra/smolLM-fine-tuned-for-plagiarism-detection
- Language: English
- Task: Text Classification (Binary)
- Performance Metrics: Accuracy, F1 Score, Recall
- License: MIT
Dataset
The fine-tuning dataset, the MIT Plagiarism Detection Dataset, provides labeled sentence pairs, each marked as plagiarized or non-plagiarized. This binary label is the classification target, which makes the dataset well suited to sentence-level similarity detection. The data was split as follows (a loading sketch appears after the list):
- Train: 70%
- Validation: 10%
- Test: 20%
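A minimal sketch of how a pair-labeled CSV could be loaded and split into these proportions with pandas and scikit-learn. The file name and column names (`plagiarism_pairs.csv`, `sentence1`, `sentence2`, `label`) are illustrative assumptions, not taken from the model card:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file and column names; adjust to the actual dataset layout.
df = pd.read_csv("plagiarism_pairs.csv")  # columns: sentence1, sentence2, label

# 70% train / 30% holdout, then split the holdout into 10% validation
# and 20% test of the full dataset (i.e. 1/3 vs 2/3 of the holdout).
train_df, holdout_df = train_test_split(
    df, test_size=0.30, random_state=42, stratify=df["label"]
)
val_df, test_df = train_test_split(
    holdout_df, test_size=2 / 3, random_state=42, stratify=holdout_df["label"]
)

print(len(train_df), len(val_df), len(test_df))
```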
Training and Model Details
- Architecture: The model was modified for sequence classification with two labels.
- Optimizer: AdamW with a learning rate of 2e-5.
- Loss Function: Cross-Entropy Loss.
- Batch Size: 16
- Epochs: 3
- Padding: A custom padding token was set to satisfy SmolLM's tokenizer requirements (see the setup sketch after this list).
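A minimal setup sketch consistent with these details, using the Hugging Face `transformers` and PyTorch APIs. Reusing the EOS token as the padding token and using `AutoModelForSequenceClassification` are assumptions for illustration; the model card only states that a custom padding token and a two-label classification head were used:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

base_model = "HuggingFaceTB/SmolLM2-135M-Instruct"

tokenizer = AutoTokenizer.from_pretrained(base_model)
# Assumption: reuse EOS as the padding token if none is defined.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Two-label sequence-classification head on top of the base model.
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

# AdamW with the reported learning rate; cross-entropy is computed internally
# by the classification head when labels are passed to the forward call.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
```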
Results and Evaluation
Validation Set
Accuracy: 96.20%
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.96 | 0.97 | 0.96 | 36,586 |
| 1 | 0.97 | 0.96 | 0.96 | 36,888 |
Overall Metrics:
- Accuracy: 0.96
- Macro Average:
  - Precision: 0.96
  - Recall: 0.96
  - F1-Score: 0.96
- Weighted Average:
  - Precision: 0.96
  - Recall: 0.96
  - F1-Score: 0.96
- Total Support: 73,474
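For reference, a report in this format (per-class precision, recall, and F1 plus macro and weighted averages) can be produced with scikit-learn. The `y_true` and `y_pred` values below are placeholders for the held-out labels and the model's predictions, not the actual evaluation data:

```python
from sklearn.metrics import accuracy_score, classification_report

# Placeholder labels and predictions; substitute the real held-out split.
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]

print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")
print(classification_report(y_true, y_pred, digits=2))
```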
Hardware:
- GPU: 2× NVIDIA Tesla T4
- Training Time: 9 hours
Inference Script
To use the model for plagiarism detection, load the tokenizer and model and run the helper functions below:
```python
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2Tokenizer, LlamaForSequenceClassification

# Load the tokenizer and model
model_path = "jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM_Detection"
tokenizer = GPT2Tokenizer.from_pretrained(model_path)
model = LlamaForSequenceClassification.from_pretrained(model_path)
model.eval()

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Preprocess and tokenize a sentence pair
def preprocess_text(text1, text2):
    inputs = tokenizer(
        text1, text2,
        add_special_tokens=True,
        max_length=128,
        padding='max_length',
        truncation=True,
        return_tensors="pt"
    )
    return inputs

# Dataset wrapping lists of sentence pairs
class PlagiarismDataset(Dataset):
    def __init__(self, text1, text2, tokenizer):
        self.text1 = text1
        self.text2 = text2
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.text1)

    def __getitem__(self, idx):
        inputs = preprocess_text(self.text1[idx], self.text2[idx])
        return {
            'input_ids': inputs['input_ids'].squeeze(0),
            'attention_mask': inputs['attention_mask'].squeeze(0)
        }

# Detect plagiarism for a single pair of texts
def detect_plagiarism(text1, text2):
    # Wrap the two strings in lists so the dataset yields one pair
    dataset = PlagiarismDataset([text1], [text2], tokenizer)
    data_loader = DataLoader(dataset, batch_size=1, shuffle=False)
    predictions = []
    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            preds = torch.argmax(outputs.logits, dim=1)
            predictions.append(preds.item())
    return predictions[0]

# Usage
text1 = input("Text from the first document: ")
text2 = input("Text from the second document: ")
result = detect_plagiarism(text1, text2)

# Display the result
if result == 1:
    print("Plagiarism detected!")
else:
    print("No plagiarism detected.")
```
This script loads the fine-tuned model and tokenizer for detecting plagiarism between two text inputs.
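Because `PlagiarismDataset` and the `DataLoader` already operate on lists of sentence pairs, a batched variant is straightforward. The sketch below is an illustrative extension of the script above, not part of the original model card:

```python
# Batched variant (illustrative): score many sentence pairs at once.
def detect_plagiarism_batch(texts1, texts2, batch_size=16):
    dataset = PlagiarismDataset(texts1, texts2, tokenizer)
    data_loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    predictions = []
    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            predictions.extend(torch.argmax(outputs.logits, dim=1).tolist())
    return predictions

# Example: labels[i] is 1 if pair i is flagged as plagiarized.
labels = detect_plagiarism_batch(
    ["The cat sat on the mat.", "Water boils at 100 degrees Celsius."],
    ["A cat was sitting on the mat.", "The sky is blue."],
)
```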
License
This project is licensed under the MIT License, making it free for both personal and commercial use.
Connect with Me
I appreciate your interest!
GitHub | [email protected] | LinkedIn | Portfolio