# durs-llm-web-scanner
## Model Description
`durs-llm-web-scanner` is a language model (LLM) fine-tuned to classify various types of cybersecurity-related inputs. The model is trained to recognize and differentiate between:
- Injection Payloads: Such as XSS, SQLi, LFI, SSRF, etc.
- Contextual Data: Such as vulnerable parameter names (e.g., `user_id` for IDOR) or error patterns (e.g., SQL error messages).
- Scanner Logic: Textual descriptions of the workflow and decision-making processes of a security scanner.
- Crawler Logic: Descriptions of how to discover new endpoints, forms, and parameters.
The primary goal of this model is to act as the "brain" for an autonomous security scanning agent, enabling it to understand context and make strategic decisions.
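As a concrete illustration, an agent wrapper could branch on the model's predicted label to decide its next step. The routing rules and label names below are hypothetical, not part of the released model:

```python
# Hypothetical agent routing on top of the classifier; the label strings are
# illustrative and should be replaced with the model's actual id2label values.
def next_action(label: str, value: str) -> str:
    if label == "xss_payload":
        return f"queue a reflected-XSS test using {value!r}"
    if label == "idor_context":
        return f"enumerate object IDs on parameter {value!r}"
    if label == "scanner_logic":
        return "feed the description into the scan planner"
    return "no action"
```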
**Project Status: Beta.** This model is still in an early stage of development; its dataset will be updated and enriched periodically to improve accuracy and detection coverage.
## How to Use
This model is designed to be used with the `transformers` library in Python.
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Replace 'kangali/durs-llm-web-scanner' with your repo name if different
model_name = "kangali/durs-llm-web-scanner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example inputs
text_inputs = [
    "<script>alert(1)</script>",                          # XSS payload
    "user_id",                                            # IDOR context
    "A probe string is reflected inside an HTML tag...",  # Scanner logic
    "This is a normal comment.",                          # Benign
]

# Prediction
inputs = tokenizer(text_inputs, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class_ids = torch.argmax(logits, dim=1)

# To get the string labels, you need the label_encoder.joblib from the repository
# or you can map them manually from the model's config.json
for i, text in enumerate(text_inputs):
    predicted_id = predicted_class_ids[i].item()
    label = model.config.id2label[predicted_id]
    print(f"Input: '{text[:50]}...' -> Predicted Label: {label}")
```
## Training Data
This model was trained on a custom-built `master_training_dataset.csv` dataset containing over 2,700 samples extracted and synthesized from the codebase of Dursgo, an open-source web security scanner.
The dataset includes three main categories of data (see the loading sketch after this list):
- Injection Payloads: Concrete examples of attack payloads (XSS, SQLi, LFI, etc.) used by the scanners in Dursgo.
- Contextual Definitions: Keywords, parameter names, and error patterns that provide context for attacks (e.g., IDOR parameter names, SQL error messages).
- Scanner & Crawler Logic: Textual descriptions of the workflows and decision rules used by the scanner and crawler (e.g., "If the 'url' parameter is found, test for SSRF").
## Training Procedure
This model is a `distilbert-base-uncased` checkpoint fine-tuned for 50 epochs using the `Trainer` API from the Hugging Face Transformers library. The complete workflow for creating the dataset and retraining the model is available in the project's GitHub repository, Tunning-AI (repository currently private; to be opened in the future).
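For reference, a minimal fine-tuning sketch with `Trainer`; everything except the base model and the 50 epochs (the toy dataset, batch size, and tokenization settings) is an assumption for illustration:

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

# Toy stand-in for master_training_dataset.csv (the real set has 2,700+ rows)
ds = Dataset.from_dict({
    "text": ["<script>alert(1)</script>", "user_id"],
    "label": [0, 1],  # integer-encoded categories
})

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True),
            batched=True, remove_columns=["text"])

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)  # set num_labels to your label count

args = TrainingArguments(
    output_dir="durs-llm-web-scanner",
    num_train_epochs=50,             # as stated above
    per_device_train_batch_size=16,  # assumed value
)

trainer = Trainer(model=model, args=args, train_dataset=ds,
                  data_collator=DataCollatorWithPadding(tokenizer))
trainer.train()
```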