Model Card for gibberish-detection

Model ID: sunnysingh1011/gibberish-detection

This model is a set of LoRA (Low-Rank Adaptation) adapter weights, fine-tuned on top of distilbert-base-uncased, for detecting gibberish text. Its purpose is to classify user input as gibberish or non-gibberish.

Version History

v1.0.0: Initial release.

Model Details

Model Description

The LoRA adapter is applied to a distilbert-base-uncased sequence classifier. Rather than making a single gibberish/non-gibberish decision, the model assigns each input to one of five categories, ranging from clean, coherent text to several distinct kinds of gibberish.
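To illustrate why LoRA adapters are small, the sketch below compares the number of trainable parameters for one weight matrix when it is updated directly versus through a low-rank pair of matrices. The hidden size 768 is that of distilbert-base-uncased; the rank of 8 is an assumed value for illustration only, since this card does not state the rank actually used.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable-parameter counts for one weight matrix."""
    full = d_out * d_in            # fine-tuning W (d_out x d_in) directly
    lora = rank * (d_out + d_in)   # B is d_out x rank, A is rank x d_in
    return full, lora


# 768 is the DistilBERT hidden size; rank=8 is an assumption, not a documented setting.
full, lora = lora_param_counts(768, 768, rank=8)
print(full, lora, f"{lora / full:.2%}")  # 589824 12288 2.08%
```

Under these assumptions the adapter trains roughly 2% of the parameters of each adapted matrix, which is why the published weights are a small add-on to the base model rather than a full checkpoint.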

Classification Categories:

Clean sentence: Meaningful, well-formed, and grammatically correct sentences.
    Example: "The quick brown fox jumps over the lazy dog."
Out of dictionary words: Input containing words that are not found in a standard dictionary.
    Example: "There is a busafadkdb in the code."
Mild gibberish: Text that is somewhat incoherent but still contains recognizable words or phrases.
    Example: "there bug is in the code"
Word salad: A jumble of real words that lacks logical structure or coherent meaning.
    Example: "Jumped quick dog over lazy the fox brown."
Number gibberish: Text interrupted by random numbers or numerical sequences with no clear linguistic role.
    Example: "hello theree, 12345 67890 are you"

Developed by: Sunny Singh
Model type: DistilBERT sequence classifier with a LoRA adapter
Language(s) (NLP): English
License: MIT
Finetuned from model: distilbert-base-uncased

Validation Metrics

Evaluation Loss: 0.2458
Evaluation Accuracy: 0.8985
Evaluation F1 Score: 0.8972
Epochs: 5
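For readers who want to reproduce comparable numbers on their own labeled data, here is a minimal pure-Python sketch of accuracy and F1 for a multi-class classifier. The card does not say which F1 averaging was used; support-weighted averaging is an assumption here, and the function names are hypothetical.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged by class support (assumed averaging scheme)."""
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * n
    return total / len(y_true)


# Toy labels, not real evaluation data.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
print(accuracy(y_true, y_pred), weighted_f1(y_true, y_pred))
```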

How to Get Started with the Model

from peft import PeftModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
import torch.nn.functional as F

label_map = {0: "clean sentence", 1: "out of dictionary words", 2: "word salad", 3: "number gibberish", 4: "mild gibberish"}

def intlabel_to_strlabel(label):
    return label_map[label]

def get_prediction(model, tokenizer, text, label_fn):
    """Classify one string and return its label and softmax confidence."""
    # Tokenize, truncating inputs that exceed the model's maximum length
    infer_inputs = tokenizer(text, return_tensors="pt", truncation=True)

    # Move the input tensors to the same device as the model
    infer_device = model.device
    infer_inputs = {key: value.to(infer_device) for key, value in infer_inputs.items()}

    # Inference only, so gradients are not needed
    with torch.no_grad():
        outputs = model(**infer_inputs)

    # Convert logits to probabilities and pick the most likely class
    probabilities = F.softmax(outputs.logits, dim=-1)
    predicted_index = torch.argmax(probabilities, dim=-1).item()
    predicted_prob = probabilities[0][predicted_index].item()

    return {"label": label_fn(predicted_index), "score": predicted_prob}

lora_weights = "sunnysingh1011/gibberish-detection"
tokenizer_path = "sunnysingh1011/gibberish-detection"
base_model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=len(label_map))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

inference_model = PeftModel.from_pretrained(base_model, lora_weights).to(device)
inference_tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)

# Named "text" to avoid shadowing the built-in input()
text = "Jumped quick dog over lazy the fox brown."
print(get_prediction(inference_model, inference_tokenizer, text, intlabel_to_strlabel))
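Since the stated goal is a gibberish/non-gibberish decision, the five-way prediction can be collapsed into a boolean. The sketch below works on the dictionary returned by get_prediction. The helper name is_gibberish, the choice to treat only "clean sentence" as non-gibberish, and the 0.5 confidence threshold are all assumptions for illustration, not part of the published model.

```python
# Assumption: every label except "clean sentence" counts as gibberish.
NON_GIBBERISH_LABELS = {"clean sentence"}


def is_gibberish(prediction, min_score=0.5):
    """Return True if the prediction names a gibberish class with enough confidence.

    `prediction` is a dict shaped like {"label": str, "score": float}.
    Predictions below `min_score` are not flagged (an assumed policy).
    """
    confident = prediction["score"] >= min_score
    return confident and prediction["label"] not in NON_GIBBERISH_LABELS


# Hand-written example predictions, not real model output.
print(is_gibberish({"label": "word salad", "score": 0.93}))      # True
print(is_gibberish({"label": "clean sentence", "score": 0.88}))  # False
```

Depending on the application, it may be preferable to treat "out of dictionary words" as acceptable input (for example, product names or identifiers), in which case that label can be added to NON_GIBBERISH_LABELS.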
