
Model Information

  • Base Model: HuggingFaceTB/SmolLM2-135M-Instruct
  • Fine-tuned Model Name: jatinmehra/smolLM-fine-tuned-for-plagiarism-detection
  • Language: English
  • Task: Text Classification (Binary)
  • Performance Metrics: Accuracy, F1 Score, Recall
  • License: MIT

Dataset

The model was fine-tuned on the MIT Plagiarism Detection Dataset, which provides labeled sentence pairs, each marked as plagiarized or non-plagiarized. These binary labels make the dataset well suited to training a sentence-level plagiarism classifier. The data was split as follows:

  • Train: 70%
  • Validation: 10%
  • Test: 20%
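
The 70/10/20 split above can be sketched in plain Python as follows. This is only an illustration, not the author's actual preprocessing: the helper name `split_indices`, the fixed seed, and the shuffle-then-slice strategy are all assumptions.

```python
import random

def split_indices(n_examples, seed=42):
    """Shuffle example indices and slice them into 70% train, 10% validation, 20% test.

    Hypothetical helper: the seed and the shuffle-then-slice strategy are assumptions.
    """
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)

    n_train = n_examples * 70 // 100
    n_val = n_examples * 10 // 100

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder, roughly 20%
    return train, val, test

train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 700 100 200
```

Splitting by index rather than by copying rows keeps the three subsets disjoint and makes the split reproducible for a given seed.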

Training and Model Details

  • Architecture: The model was modified for sequence classification with two labels.
  • Optimizer: AdamW with a learning rate of 2e-5.
  • Loss Function: Cross-Entropy Loss.
  • Batch Size: 16
  • Epochs: 3
  • Padding: Custom padding token to align with SmolLM requirements.
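
A batch of 16 sequences must share one length, which is why a padding token is needed at all. The sketch below shows the idea in plain Python: shorter sequences are filled with a pad id, and an attention mask marks which positions hold real tokens. The helper `pad_batch`, the pad id `0`, the right-side padding, and the token ids are hypothetical placeholders, not values taken from the actual tokenizer.

```python
def pad_batch(sequences, pad_id, max_length):
    """Truncate or right-pad each token-id list to max_length.

    Returns (input_ids, attention_mask); the mask is 1 for real tokens, 0 for padding.
    Hypothetical helper: pad_id and right-side padding are assumptions.
    """
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:max_length]           # truncate sequences that are too long
        n_pad = max_length - len(seq)    # number of pad positions to append
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

ids, mask = pad_batch([[5, 8, 13], [7], [1, 2, 3, 4, 5, 6]], pad_id=0, max_length=4)
print(ids)   # [[5, 8, 13, 0], [7, 0, 0, 0], [1, 2, 3, 4]]
print(mask)  # [[1, 1, 1, 0], [1, 0, 0, 0], [1, 1, 1, 1]]
```

In practice the tokenizer performs this step when called with `padding` and `truncation` options, as the inference script in this card does with `max_length=128`.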

Results and Evaluation

Validation set:

  • Accuracy: 96.05%

Test set:

  • Accuracy: 96.20%

Classification Report (test set):

Class   Precision   Recall   F1-Score   Support
0       0.96        0.97     0.96       36,586
1       0.97        0.96     0.96       36,888

Overall Metrics:

  • Accuracy: 0.96
  • Macro Average:
    • Precision: 0.96
    • Recall: 0.96
    • F1-Score: 0.96
  • Weighted Average:
    • Precision: 0.96
    • Recall: 0.96
    • F1-Score: 0.96
  • Total Support: 73,474
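
For readers unfamiliar with the two averages, the sketch below shows how macro and weighted averages are derived from per-class values. The inputs are the rounded figures from the report, so the results only approximate the reported 0.96. The original report was presumably computed from unrounded predictions. The function names are illustrative, not part of any library.

```python
# Per-class test-set values copied from the classification report (already rounded).
report = {
    0: {"precision": 0.96, "recall": 0.97, "f1": 0.96, "support": 36586},
    1: {"precision": 0.97, "recall": 0.96, "f1": 0.96, "support": 36888},
}

def macro_average(report, metric):
    """Unweighted mean of a metric over all classes."""
    return sum(row[metric] for row in report.values()) / len(report)

def weighted_average(report, metric):
    """Mean of a metric over all classes, weighted by each class's support."""
    total = sum(row["support"] for row in report.values())
    return sum(row[metric] * row["support"] for row in report.values()) / total

total_support = sum(row["support"] for row in report.values())
print(total_support)  # 73474

for metric in ("precision", "recall", "f1"):
    print(metric,
          round(macro_average(report, metric), 3),
          round(weighted_average(report, metric), 3))
```

Because the two classes have nearly equal support, the macro and weighted averages are almost identical here.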

Hardware:

  • GPU: 2 × NVIDIA Tesla T4
  • Time: 9 Hours

Inference Script

To use the model for plagiarism detection, load the tokenizer and model and run the following script:

import torch
from torch.utils.data import Dataset
from transformers import GPT2Tokenizer, LlamaForSequenceClassification

# Load the tokenizer and model
model_path = "jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM_Detection"
tokenizer = GPT2Tokenizer.from_pretrained(model_path)
model = LlamaForSequenceClassification.from_pretrained(model_path)
model.eval()

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Function to preprocess and tokenize the input text
def preprocess_text(text1, text2):
    inputs = tokenizer(
        text1, text2,
        add_special_tokens=True,
        max_length=128,
        padding='max_length',
        truncation=True,
        return_tensors="pt"
    )
    return inputs

# Dataset class
class PlagiarismDataset(Dataset):
    def __init__(self, text1, text2, tokenizer):
        self.text1 = text1
        self.text2 = text2
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.text1)

    def __getitem__(self, idx):
        inputs = preprocess_text(self.text1[idx], self.text2[idx])
        return {
            'input_ids': inputs['input_ids'].squeeze(0),
            'attention_mask': inputs['attention_mask'].squeeze(0)
        }

# Function to detect plagiarism using the model
def detect_plagiarism(text1, text2):
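    # Wrap each string in a list so the dataset holds exactly one text pair;
    # bare strings would be indexed character by character.
    dataset = PlagiarismDataset([text1], [text2], tokenizer)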
    data_loader = torch.utils.data.DataLoader(dataset, batch_size=1, shuffle=False)

    predictions = []
    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            preds = torch.argmax(outputs.logits, dim=1)

            predictions.append(preds.item())

    return predictions[0]

# Usage
text1 = input("Text from the first document: ")
text2 = input("Text from the second document: ")

result = detect_plagiarism(text1, text2)

# Display the result
if result == 1:
    print("Plagiarism detected!")
else:
    print("No plagiarism detected.")

This script loads the fine-tuned model and tokenizer for detecting plagiarism between two text inputs.

License

This project is licensed under the MIT License, making it free for both personal and commercial use.

Connect with Me

I appreciate your interest!
GitHub | [email protected] | LinkedIn | Portfolio
