Model Card for sunnysingh1011/gibberish-detection

Model ID: sunnysingh1011/gibberish-detection

This repository provides LoRA (Low-Rank Adaptation) adapter weights for distilbert-base-uncased, fine-tuned to detect gibberish text. The goal is to classify user input as gibberish or non-gibberish.

Version History

  • v1.0.0: Initial release.

Model Details

Model Description

The LoRA weights are applied on top of a DistilBERT sequence classifier. Rather than returning a single gibberish/non-gibberish flag, the model assigns each input to one of five categories that separate meaningful, coherent text from several kinds of gibberish.

Classification Categories:

  • Clean sentence: Meaningful, well-formed, and grammatically correct sentences.
    Example: "The quick brown fox jumps over the lazy dog."

  • Out of dictionary word: Input containing one or more words that are not found in a standard dictionary.
    Example: "There is a busafadkdb in the code."

  • Mild gibberish: Text that is somewhat incoherent but may still contain recognizable words or phrases.
    Example: "there bug is in the code"

  • Word salad: A jumble of words that lacks logical structure or coherent meaning.
    Example: "Jumped quick dog over lazy the fox brown."

  • Number gibberish: Text primarily composed of random numbers or numerical sequences with no clear linguistic pattern.
    Example: "hello theree, 12345 67890 are you"
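
The five categories can be collapsed into the binary gibberish/non-gibberish decision that the project aims for. The following is a minimal sketch of one way to do that. The helper `to_binary` and the choice of which categories count as non-gibberish are assumptions made for illustration and are not specified by this card; the label strings match those returned by the usage code further down.

```python
# Label strings as they appear in the card's label_map.
LABELS = {
    "clean sentence",
    "out of dictionary words",
    "word salad",
    "number gibberish",
    "mild gibberish",
}

# ASSUMPTION: only clean sentences count as non-gibberish by default.
# An application may prefer to also accept "out of dictionary words".
DEFAULT_NON_GIBBERISH = frozenset({"clean sentence"})


def to_binary(label, non_gibberish=DEFAULT_NON_GIBBERISH):
    """Map one of the five category labels to 'gibberish' or 'non-gibberish'."""
    if label not in LABELS:
        raise ValueError(f"unknown label: {label!r}")
    return "non-gibberish" if label in non_gibberish else "gibberish"


print(to_binary("clean sentence"))
print(to_binary("word salad"))
```

Keeping the policy in a single set makes it easy to loosen, for example by treating out-of-dictionary words as acceptable input in a code-related chat.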

  • Developed by: Sunny Singh

  • Model type: DistilBERT sequence classifier with a LoRA adapter

  • Language(s) (NLP): English

  • License: MIT

  • Finetuned from model: distilbert-base-uncased

Validation Metrics

  • Evaluation Loss: 0.24577048420906067
  • Evaluation Accuracy: 0.8985
  • Evaluation F1 Score: 0.8972105646238008
  • Epoch: 5.0
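
The card does not say how the F1 score was averaged across the five classes. As a rough illustration of how accuracy and a multi-class F1 are derived, the sketch below computes both macro- and support-weighted F1 in plain Python. The function names and the toy labels are made up for this example and are not the evaluation data or code behind the numbers above.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred, cls):
    """F1 for a single class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_score(y_true, y_pred, average="weighted"):
    """Macro F1 (plain mean) or weighted F1 (mean weighted by class support)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = {c: per_class_f1(y_true, y_pred, c) for c in classes}
    if average == "macro":
        return sum(scores.values()) / len(classes)
    support = Counter(y_true)
    return sum(scores[c] * support[c] for c in classes) / len(y_true)


# Toy labels for illustration only.
y_true = [0, 0, 0, 1, 2, 2]
y_pred = [0, 0, 0, 2, 2, 2]
print(accuracy(y_true, y_pred))
print(f1_score(y_true, y_pred, average="macro"))
print(f1_score(y_true, y_pred, average="weighted"))
```

With unbalanced classes the two averages can differ noticeably, so it helps to know which one a reported F1 refers to before comparing models.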

How to Get Started with the Model

from peft import PeftModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
import torch.nn.functional as F

# Maps class indices to label names.
label_map = {0: "clean sentence", 1: "out of dictionary words", 2: "word salad", 3: "number gibberish", 4: "mild gibberish"}

def intlabel_to_strlabel(label):
    return label_map[label]

def get_prediction(model, tokenizer, text, label_fn):
    # Tokenize the input and move the tensors to the model's device.
    infer_inputs = tokenizer(text, return_tensors="pt", truncation=True)
    infer_inputs = {key: value.to(model.device) for key, value in infer_inputs.items()}

    # Run a forward pass without tracking gradients.
    with torch.no_grad():
        outputs = model(**infer_inputs)

    # Convert logits to probabilities and pick the most likely class.
    probabilities = F.softmax(outputs.logits, dim=-1)
    predicted_index = torch.argmax(probabilities, dim=-1).item()
    predicted_prob = probabilities[0][predicted_index].item()

    return {"label": label_fn(predicted_index), "score": predicted_prob}

lora_weights = "sunnysingh1011/gibberish-detection"
tokenizer_path = "sunnysingh1011/gibberish-detection"
base_model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=len(label_map))

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Attach the LoRA adapter to the base model and switch to inference mode.
inference_model = PeftModel.from_pretrained(base_model, lora_weights).to(device)
inference_model.eval()
inference_tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)

text = "Jumped quick dog over lazy the fox brown."
print(get_prediction(inference_model, inference_tokenizer, text, intlabel_to_strlabel))
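
The helper above reports only the top label and its score. When it is useful to see how probability is spread across all five categories, the raw logits can be ranked directly. The sketch below does this in plain Python; `rank_labels` is a hypothetical helper and the example logits are made-up values. In practice the logits would come from the model output, for instance via `outputs.logits[0].tolist()`.

```python
import math

# Same index-to-label mapping as in the usage code.
LABEL_MAP = {
    0: "clean sentence",
    1: "out of dictionary words",
    2: "word salad",
    3: "number gibberish",
    4: "mild gibberish",
}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_labels(logits, label_map=LABEL_MAP):
    """Return (label, probability) pairs sorted from most to least likely."""
    probs = softmax(logits)
    pairs = [(label_map[i], p) for i, p in enumerate(probs)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Made-up logits for illustration; not real model output.
example_logits = [0.1, -1.2, 3.4, 0.0, 1.5]
for label, prob in rank_labels(example_logits):
    print(f"{label}: {prob:.3f}")
```

A small gap between the first and second entries suggests an ambiguous input, which an application might route to a fallback instead of acting on the top label alone.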