durs-llm-web-scanner


Model Description

durs-llm-web-scanner is a fine-tuned language model that classifies cybersecurity-related text inputs. It is trained to distinguish between:

  • Injection Payloads: Such as XSS, SQLi, LFI, SSRF, etc.
  • Contextual Data: Such as vulnerable parameter names (e.g., user_id for IDOR) or error patterns (e.g., SQL error messages).
  • Scanner Logic: Textual descriptions of the workflow and decision-making processes of a security scanner.
  • Crawler Logic: Descriptions of how to discover new endpoints, forms, and parameters.

The primary goal of this model is to act as the "brain" for an autonomous security scanning agent, enabling it to understand context and make strategic decisions.

Project Status: Beta. This model is at an early (beta) stage of development. Its dataset will be updated and enriched periodically to improve accuracy and detection coverage.

How to Use

This model is designed to be used with the transformers library in Python.

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Replace 'kangali/durs-llm-web-scanner' with your repo name if different
model_name = "kangali/durs-llm-web-scanner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example inputs
text_inputs = [
    "<script>alert(1)</script>",  # XSS Payload
    "user_id",                    # IDOR Context
    "A probe string is reflected inside an HTML tag...", # Scanner Logic
    "This is a normal comment."   # Benign
]

# Prediction
inputs = tokenizer(text_inputs, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_ids = torch.argmax(logits, dim=1)

# model.config.id2label maps class ids to string labels. If the config only holds
# generic names (LABEL_0, ...), use label_encoder.joblib from the repository instead.
for i, text in enumerate(text_inputs):
    predicted_id = predicted_class_ids[i].item()
    label = model.config.id2label[predicted_id]
    # Shorten long inputs for display; add "..." only when text was actually cut
    preview = text if len(text) <= 50 else text[:50] + "..."
    print(f"Input: '{preview}' -> Predicted Label: {label}")
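
An agent that acts on these predictions may also want a confidence score, not just the top class. The following is a minimal sketch, not part of the model's repository: it turns the logits for one input into probabilities with a softmax and falls back to an "uncertain" result below a threshold. The label names, the example logits, and the 0.8 threshold are all hypothetical. On a tensor of logits, torch.softmax(logits, dim=1) gives the same probabilities.

```python
import math

def softmax(logits):
    """Convert raw logits for one input into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_with_threshold(logits, id2label, threshold=0.8):
    """Return (label, confidence); the label is "uncertain" below the threshold."""
    probs = softmax(logits)
    best_id = max(range(len(probs)), key=probs.__getitem__)
    confidence = probs[best_id]
    label = id2label[best_id] if confidence >= threshold else "uncertain"
    return label, confidence

# Hypothetical label map and logits, for illustration only
example_id2label = {0: "benign", 1: "xss_payload", 2: "idor_context"}
label, confidence = classify_with_threshold([0.1, 4.2, 0.3], example_id2label)
print(label, round(confidence, 3))
```

The threshold is a design choice: a higher value makes the agent act only on confident predictions, at the cost of leaving more inputs unclassified.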

Training Data

This model was trained on a custom-built dataset, master_training_dataset.csv, which contains over 2,700 samples extracted and synthesized from the codebase of Dursgo, an open-source web security scanner.

The dataset includes three main categories of data:

  1. Injection Payloads: Concrete examples of attack payloads (XSS, SQLi, LFI, etc.) taken from the scanners in Dursgo.
  2. Contextual Definitions: Keywords, parameter names, and error patterns that provide context for attacks (e.g., IDOR parameter names, SQL error messages).
  3. Scanner & Crawler Logic: Textual descriptions of the workflows and decision rules used by the scanner and crawler (e.g., "If the 'url' parameter is found, test for SSRF").
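
To check how samples are spread across labels before training, something like the following sketch could be used. The layout of master_training_dataset.csv is not documented here, so the column names "text" and "label", the label names, and the sample rows are assumptions made for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical rows standing in for master_training_dataset.csv;
# the column names "text" and "label" are assumptions.
SAMPLE_CSV = '''text,label
"<script>alert(1)</script>",xss_payload
"' OR 1=1 --",sqli_payload
user_id,idor_context
"If the 'url' parameter is found, test for SSRF",scanner_logic
This is a normal comment.,benign
'''

def label_counts(csv_file):
    """Count how many rows carry each label."""
    reader = csv.DictReader(csv_file)
    return Counter(row["label"] for row in reader)

counts = label_counts(io.StringIO(SAMPLE_CSV))
for label, n in counts.most_common():
    print(f"{label}: {n}")
```

With the real file, the same function can be called on open("master_training_dataset.csv", newline=""), provided the columns match.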

Training Procedure

This model is distilbert-base-uncased, fine-tuned for 50 epochs using the Trainer from the Hugging Face Transformers library. The complete workflow for creating the dataset and retraining this model is kept in the project's GitHub repository, Tunning-AI (currently private; to be opened later).
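
Since the training repository is private, the following is only a rough sketch of what fine-tuning with Trainer typically looks like, not the author's actual script. The base model and the 50 epochs come from the description above; the function names, the output directory, the in-memory dataset class, and all other settings (left at library defaults) are assumptions.

```python
def build_label_maps(labels):
    """Map each distinct label string to an integer id, and back."""
    names = sorted(set(labels))
    label2id = {name: i for i, name in enumerate(names)}
    id2label = {i: name for name, i in label2id.items()}
    return label2id, id2label

def fine_tune(texts, labels, output_dir="durs-finetune-sketch", epochs=50):
    """Fine-tune distilbert-base-uncased on (text, label) pairs with Trainer."""
    # Imported lazily so build_label_maps works without these packages installed.
    import torch
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)

    label2id, id2label = build_label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    encodings = tokenizer(texts, padding=True, truncation=True)

    class TextDataset(torch.utils.data.Dataset):
        """Serves tokenized inputs together with integer labels."""
        def __len__(self):
            return len(labels)

        def __getitem__(self, idx):
            item = {k: torch.tensor(v[idx]) for k, v in encodings.items()}
            item["labels"] = torch.tensor(label2id[labels[idx]])
            return item

    # Storing the label maps in the config makes model.config.id2label usable later.
    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased",
        num_labels=len(label2id),
        id2label=id2label,
        label2id=label2id,
    )
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=epochs)
    trainer = Trainer(model=model, args=args, train_dataset=TextDataset())
    trainer.train()
    return trainer

# The label maps can be inspected without downloading anything (hypothetical labels)
label2id, id2label = build_label_maps(["xss_payload", "benign", "xss_payload"])
print(label2id, id2label)
```

Calling fine_tune(texts, labels) downloads the base model and starts training, so it is left uncalled here.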

Model format: Safetensors. Model size: 67M params. Tensor type: F32.