SecureBERT+
SecureBERT+ is an enhanced version of SecureBERT, trained on a corpus eight times larger than its predecessor's using 8×A100 GPUs.
The model achieves an average 9% improvement in Masked Language Modeling (MLM) performance over SecureBERT, strengthening language understanding and representation in the cybersecurity domain.
Dataset
SecureBERT+ was trained on a large-scale corpus of cybersecurity-related text, substantially expanding the coverage and depth of the original SecureBERT training data.
Using SecureBERT+
SecureBERT+ is available on the Hugging Face Hub.
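For a quick check without writing any loading code, the model can also be used through the Hugging Face fill-mask pipeline. This is a minimal sketch that assumes the Hub checkpoint ships with its masked-LM head; the example sentence is illustrative.

from transformers import pipeline

# Quick sketch: run the fill-mask pipeline directly on the SecureBERT+ checkpoint
unmasker = pipeline("fill-mask", model="ehsanaghaei/SecureBERT_Plus")
print(unmasker("Ransomware encrypts files and demands a <mask> for the decryption key."))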
Load the Model
from transformers import RobertaTokenizer, RobertaModel
import torch

# Load the SecureBERT+ tokenizer and encoder from the Hugging Face Hub
tokenizer = RobertaTokenizer.from_pretrained("ehsanaghaei/SecureBERT_Plus")
model = RobertaModel.from_pretrained("ehsanaghaei/SecureBERT_Plus")

# Tokenize an input sentence and run a forward pass
inputs = tokenizer("This is SecureBERT Plus!", return_tensors="pt")
outputs = model(**inputs)

# Token-level embeddings: shape (batch_size, sequence_length, hidden_size)
last_hidden_states = outputs.last_hidden_state
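The hidden states above are token-level embeddings. If you need a single fixed-size vector per input (for example, for similarity search or clustering over cybersecurity text), one common approach is attention-mask-aware mean pooling. The sketch below reuses the tokenizer and model loaded above; the mean_pool helper is illustrative and not part of the SecureBERT+ API.

import torch

def mean_pool(last_hidden_state, attention_mask):
    # Expand the attention mask so padded positions do not contribute to the average
    mask = attention_mask.unsqueeze(-1).type_as(last_hidden_state)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

sentence_embedding = mean_pool(last_hidden_states, inputs["attention_mask"])
print(sentence_embedding.shape)  # expected torch.Size([1, 768]) for a RoBERTa-base-sized model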
Masked Language Modeling Example
Use the code below to predict masked words in text:
#!pip install transformers torch tokenizers
import torch
import transformers
from transformers import RobertaTokenizerFast

# Load the fast tokenizer and the masked-language-modeling head
tokenizer = RobertaTokenizerFast.from_pretrained("ehsanaghaei/SecureBERT_Plus")
model = transformers.RobertaForMaskedLM.from_pretrained("ehsanaghaei/SecureBERT_Plus")

def predict_mask(sent, tokenizer, model, topk=10, print_results=True):
    """Predict the top-k candidate tokens for each <mask> in `sent`."""
    token_ids = tokenizer.encode(sent, return_tensors="pt")
    # Positions of every <mask> token in the encoded sequence
    masked_pos = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].tolist()
    words = []
    with torch.no_grad():
        output = model(token_ids)
    for pos in masked_pos:
        logits = output.logits[0, pos]
        top_tokens = torch.topk(logits, k=topk).indices.tolist()
        predictions = [tokenizer.decode(i).strip() for i in top_tokens]
        words.append(predictions)
        if print_results:
            print(f"Mask Predictions: {predictions}")
    return words
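For example, you can call the function on a sentence containing the RoBERTa mask token; the sample sentence below is illustrative.

# Predict the top 5 candidates for the masked word (example sentence)
predict_mask("The attacker used a phishing <mask> to steal credentials.", tokenizer, model, topk=5)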
Limitations & Risks
Domain-Specific Scope: SecureBERT+ is optimized for cybersecurity text and may not generalize as well to unrelated domains.
Bias in Training Data: The training corpus was collected from online sources and may contain biases, outdated knowledge, or inaccuracies.
Potential Misuse: While designed for defensive research, the model could be misapplied to generate adversarial content or obfuscate malicious behavior.
Resource-Intensive: The larger dataset and model training process require significant compute resources, which may limit reproducibility for smaller research teams.
Evolving Threats: The cybersecurity landscape evolves rapidly. Without regular retraining, the model may not capture emerging threats or terminology.
Users should apply SecureBERT+ responsibly, with appropriate oversight from cybersecurity professionals.
Reference
@inproceedings{aghaei2023securebert,
  title={SecureBERT: A Domain-Specific Language Model for Cybersecurity},
  author={Aghaei, Ehsan and Niu, Xi and Shadid, Waseem and Al-Shaer, Ehab},
  booktitle={Security and Privacy in Communication Networks: 18th EAI International Conference, SecureComm 2022, Virtual Event, October 2022, Proceedings},
  pages={39--56},
  year={2023},
  organization={Springer}
}