Model Information
- Base Model: HuggingFaceTB/SmolLM2-135M-Instruct
- Fine-tuned Model Name: jatinmehra/smolLM-fine-tuned-for-plagiarism-detection
- Language: English
- Task: Text Classification (Binary)
- Performance Metrics: Accuracy, F1 Score, Recall
- License: MIT
Dataset
The fine-tuning dataset, the MIT Plagiarism Detection Dataset, provides labeled sentence pairs, each marked as plagiarized or non-plagiarized. This binary label is the classification target, which makes the dataset well suited to sentence-level similarity detection. The data was split as follows (a loading sketch appears after the list):
- Train: 70%
- Validation: 10%
- Test: 20%
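A minimal sketch of how a pair-labeled CSV could be loaded and split into these proportions with pandas and scikit-learn. The file name and column names (`plagiarism_pairs.csv`, `sentence1`, `sentence2`, `label`) are illustrative assumptions, not taken from the model card:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file and column names; adjust to the actual dataset layout.
df = pd.read_csv("plagiarism_pairs.csv")  # columns: sentence1, sentence2, label

# 70% train / 30% holdout, then split the holdout into 10% validation
# and 20% test of the full dataset (i.e. 1/3 vs 2/3 of the holdout).
train_df, holdout_df = train_test_split(
    df, test_size=0.30, random_state=42, stratify=df["label"]
)
val_df, test_df = train_test_split(
    holdout_df, test_size=2 / 3, random_state=42, stratify=holdout_df["label"]
)

print(len(train_df), len(val_df), len(test_df))
```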
Training and Model Details
- Architecture: The model was modified for sequence classification with two labels.
- Optimizer: AdamW with a learning rate of 2e-5.
- Loss Function: Cross-Entropy Loss.
- Batch Size: 16
- Epochs: 3
- Padding: A custom padding token was set to satisfy SmolLM's tokenizer requirements (see the setup sketch after this list).
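A minimal setup sketch consistent with these details, using the Hugging Face `transformers` and PyTorch APIs. Reusing the EOS token as the padding token and using `AutoModelForSequenceClassification` are assumptions for illustration; the model card only states that a custom padding token and a two-label classification head were used:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

base_model = "HuggingFaceTB/SmolLM2-135M-Instruct"

tokenizer = AutoTokenizer.from_pretrained(base_model)
# Assumption: reuse EOS as the padding token if none is defined.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Two-label sequence-classification head on top of the base model.
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

# AdamW with the reported learning rate; cross-entropy is computed internally
# by the classification head when labels are passed to the forward call.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
```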
Results and Evaluation
Validation Set
Accuracy: 96.20%
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| 0 | 0.96 | 0.97 | 0.96 | 36,586 |
| 1 | 0.97 | 0.96 | 0.96 | 36,888 |
Overall Metrics:
- Accuracy: 0.96
- Macro Average:
  - Precision: 0.96
  - Recall: 0.96
  - F1-Score: 0.96
- Weighted Average:
  - Precision: 0.96
  - Recall: 0.96
  - F1-Score: 0.96
- Total Support: 73,474
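For reference, a report in this format (per-class precision, recall, and F1 plus macro and weighted averages) can be produced with scikit-learn. The `y_true` and `y_pred` values below are placeholders for the held-out labels and the model's predictions, not the actual evaluation data:

```python
from sklearn.metrics import accuracy_score, classification_report

# Placeholder labels and predictions; substitute the real held-out split.
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]

print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")
print(classification_report(y_true, y_pred, digits=2))
```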
Hardware:
- GPU: 2× NVIDIA Tesla T4
- Training Time: 9 hours
Inference Script
To use the model for plagiarism detection, load the tokenizer and model and run the helper functions below:
```python
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2Tokenizer, LlamaForSequenceClassification

# Load the tokenizer and model
model_path = "jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM_Detection"
tokenizer = GPT2Tokenizer.from_pretrained(model_path)
model = LlamaForSequenceClassification.from_pretrained(model_path)
model.eval()

# Set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Preprocess and tokenize a sentence pair
def preprocess_text(text1, text2):
    inputs = tokenizer(
        text1, text2,
        add_special_tokens=True,
        max_length=128,
        padding='max_length',
        truncation=True,
        return_tensors="pt"
    )
    return inputs

# Dataset wrapping lists of sentence pairs
class PlagiarismDataset(Dataset):
    def __init__(self, text1, text2, tokenizer):
        self.text1 = text1
        self.text2 = text2
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.text1)

    def __getitem__(self, idx):
        inputs = preprocess_text(self.text1[idx], self.text2[idx])
        return {
            'input_ids': inputs['input_ids'].squeeze(0),
            'attention_mask': inputs['attention_mask'].squeeze(0)
        }

# Detect plagiarism for a single pair of texts
def detect_plagiarism(text1, text2):
    # Wrap the two strings in lists so the dataset yields one pair
    dataset = PlagiarismDataset([text1], [text2], tokenizer)
    data_loader = DataLoader(dataset, batch_size=1, shuffle=False)
    predictions = []
    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            preds = torch.argmax(outputs.logits, dim=1)
            predictions.append(preds.item())
    return predictions[0]

# Usage
text1 = input("Text from the first document: ")
text2 = input("Text from the second document: ")
result = detect_plagiarism(text1, text2)

# Display the result
if result == 1:
    print("Plagiarism detected!")
else:
    print("No plagiarism detected.")
```
This script loads the fine-tuned model and tokenizer for detecting plagiarism between two text inputs.
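Because `PlagiarismDataset` and the `DataLoader` already operate on lists of sentence pairs, a batched variant is straightforward. The sketch below is an illustrative extension of the script above, not part of the original model card:

```python
# Batched variant (illustrative): score many sentence pairs at once.
def detect_plagiarism_batch(texts1, texts2, batch_size=16):
    dataset = PlagiarismDataset(texts1, texts2, tokenizer)
    data_loader = DataLoader(dataset, batch_size=batch_size, shuffle=False)
    predictions = []
    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            predictions.extend(torch.argmax(outputs.logits, dim=1).tolist())
    return predictions

# Example: labels[i] is 1 if pair i is flagged as plagiarized.
labels = detect_plagiarism_batch(
    ["The cat sat on the mat.", "Water boils at 100 degrees Celsius."],
    ["A cat was sitting on the mat.", "The sky is blue."],
)
```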
License
This project is licensed under the MIT License, making it free for both personal and commercial use.
Connect with Me
I appreciate your interest!
GitHub | [email protected] | LinkedIn | Portfolio