Commit fb131a3 by jatinmehra · verified · 1 Parent(s): 7f810dd

Update README.md

Files changed (1): README.md (+86 -1)
README.md CHANGED
@@ -71,7 +71,92 @@ The fine-tuning dataset, the MIT Plagiarism Detection Dataset, provides labeled
  - F1-Score: 0.96
  - **Total Support**: 73,474
 
-
+ ## Hardware
+ - GPU: 2 × NVIDIA Tesla T4
+ - Training time: 9 hours
+
+ ## Inference Script
+
+ To use the model for plagiarism detection, you can use the following script to load the tokenizer and model and classify a pair of texts:
+
+ ```python
+ import torch
+ from torch.utils.data import Dataset, DataLoader
+ from transformers import GPT2Tokenizer, LlamaForSequenceClassification
+
+ # Load the tokenizer and model
+ model_path = "jatinmehra/smolLM-fined-tuned-for-PLAGAIRISM_Detection"
+ tokenizer = GPT2Tokenizer.from_pretrained(model_path)
+ model = LlamaForSequenceClassification.from_pretrained(model_path)
+ model.eval()
+
+ # Padding to max_length requires a pad token; fall back to the EOS token if none is defined
+ if tokenizer.pad_token is None:
+     tokenizer.pad_token = tokenizer.eos_token
+ if model.config.pad_token_id is None:
+     model.config.pad_token_id = tokenizer.pad_token_id
+
+ # Set device
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ model = model.to(device)
+
+ # Preprocess and tokenize a pair of texts as a single sequence-pair input
+ def preprocess_text(text1, text2):
+     inputs = tokenizer(
+         text1, text2,
+         add_special_tokens=True,
+         max_length=128,
+         padding='max_length',
+         truncation=True,
+         return_tensors="pt"
+     )
+     return inputs
+
+ # Dataset wrapping parallel lists of text pairs
+ class PlagiarismDataset(Dataset):
+     def __init__(self, text1, text2, tokenizer):
+         self.text1 = text1
+         self.text2 = text2
+         self.tokenizer = tokenizer
+
+     def __len__(self):
+         return len(self.text1)
+
+     def __getitem__(self, idx):
+         inputs = preprocess_text(self.text1[idx], self.text2[idx])
+         return {
+             'input_ids': inputs['input_ids'].squeeze(0),
+             'attention_mask': inputs['attention_mask'].squeeze(0)
+         }
+
+ # Detect plagiarism between a single pair of texts
+ def detect_plagiarism(text1, text2):
+     # Wrap the strings in lists so the dataset yields exactly one pair
+     dataset = PlagiarismDataset([text1], [text2], tokenizer)
+     data_loader = DataLoader(dataset, batch_size=1, shuffle=False)
+
+     predictions = []
+     with torch.no_grad():
+         for batch in data_loader:
+             input_ids = batch['input_ids'].to(device)
+             attention_mask = batch['attention_mask'].to(device)
+
+             outputs = model(input_ids=input_ids, attention_mask=attention_mask)
+             preds = torch.argmax(outputs.logits, dim=1)
+
+             predictions.append(preds.item())
+
+     return predictions[0]
+
+ # Usage
+ text1 = input("Text from the first document: ")
+ text2 = input("Text from the second document: ")
+
+ result = detect_plagiarism(text1, text2)
+
+ # Display the result
+ if result == 1:
+     print("Plagiarism detected!")
+ else:
+     print("No plagiarism detected.")
+ ```
+
+ This script loads the fine-tuned model and tokenizer for detecting plagiarism between two text inputs.
+
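+ If you need a confidence score rather than a binary label, the same model can also return class probabilities. The snippet below is a minimal sketch, assuming the `model`, `tokenizer`, `device`, and `preprocess_text` objects defined above are already in scope, and assuming class index 1 is the plagiarism label (as in the `result == 1` check above):
+
+ ```python
+ import torch.nn.functional as F
+
+ def plagiarism_probability(text1, text2):
+     # Tokenize the sentence pair exactly as in the script above
+     inputs = preprocess_text(text1, text2)
+
+     with torch.no_grad():
+         outputs = model(
+             input_ids=inputs["input_ids"].to(device),
+             attention_mask=inputs["attention_mask"].to(device),
+         )
+
+     # Softmax over the two logits; index 1 is assumed to be the "plagiarism" class
+     probs = F.softmax(outputs.logits, dim=-1)
+     return probs[0, 1].item()
+
+ # Example: probability for the pair entered above
+ print(f"Plagiarism probability: {plagiarism_probability(text1, text2):.2%}")
+ ```
+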
+ ## License
  This project is licensed under the MIT License, making it free for both personal and commercial use.
 
  ## Connect with Me