SecureBERT+
SecureBERT+ is an enhanced version of SecureBERT, trained on a corpus eight times larger than its predecessor's using 8×A100 GPUs.
The model achieves an average 9% improvement in Masked Language Modeling (MLM) performance over SecureBERT, strengthening language understanding and representation in the cybersecurity domain.
Dataset
SecureBERT+ was trained on a large-scale corpus of cybersecurity-related text, substantially expanding the coverage and depth of the original SecureBERT training data.
Using SecureBERT+
SecureBERT+ is available on the Hugging Face Hub.
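For a quick check without writing any loading code, the model can also be used through the Hugging Face fill-mask pipeline. This is a minimal sketch that assumes the Hub checkpoint ships with its masked-LM head; the example sentence is illustrative.

from transformers import pipeline

# Quick sketch: run the fill-mask pipeline directly on the SecureBERT+ checkpoint
unmasker = pipeline("fill-mask", model="ehsanaghaei/SecureBERT_Plus")
print(unmasker("Ransomware encrypts files and demands a <mask> for the decryption key."))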
Load the Model
from transformers import RobertaTokenizer, RobertaModel
import torch

# Load the SecureBERT+ tokenizer and encoder from the Hugging Face Hub
tokenizer = RobertaTokenizer.from_pretrained("ehsanaghaei/SecureBERT_Plus")
model = RobertaModel.from_pretrained("ehsanaghaei/SecureBERT_Plus")

# Tokenize an input sentence and run a forward pass
inputs = tokenizer("This is SecureBERT Plus!", return_tensors="pt")
outputs = model(**inputs)

# Token-level embeddings: shape (batch_size, sequence_length, hidden_size)
last_hidden_states = outputs.last_hidden_state
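The hidden states above are token-level embeddings. If you need a single fixed-size vector per input (for example, for similarity search or clustering over cybersecurity text), one common approach is attention-mask-aware mean pooling. The sketch below reuses the tokenizer and model loaded above; the mean_pool helper is illustrative and not part of the SecureBERT+ API.

import torch

def mean_pool(last_hidden_state, attention_mask):
    # Expand the attention mask so padded positions do not contribute to the average
    mask = attention_mask.unsqueeze(-1).type_as(last_hidden_state)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts

sentence_embedding = mean_pool(last_hidden_states, inputs["attention_mask"])
print(sentence_embedding.shape)  # expected torch.Size([1, 768]) for a RoBERTa-base-sized model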
Masked Language Modeling Example
Use the code below to predict masked words in text:
#!pip install transformers torch tokenizers
import torch
import transformers
from transformers import RobertaTokenizerFast

# Load the fast tokenizer and the masked-language-modeling head
tokenizer = RobertaTokenizerFast.from_pretrained("ehsanaghaei/SecureBERT_Plus")
model = transformers.RobertaForMaskedLM.from_pretrained("ehsanaghaei/SecureBERT_Plus")

def predict_mask(sent, tokenizer, model, topk=10, print_results=True):
    """Predict the top-k candidate tokens for each <mask> in `sent`."""
    token_ids = tokenizer.encode(sent, return_tensors="pt")
    # Positions of every <mask> token in the encoded sequence
    masked_pos = (token_ids.squeeze() == tokenizer.mask_token_id).nonzero(as_tuple=True)[0].tolist()
    words = []
    with torch.no_grad():
        output = model(token_ids)
    for pos in masked_pos:
        logits = output.logits[0, pos]
        top_tokens = torch.topk(logits, k=topk).indices.tolist()
        predictions = [tokenizer.decode(i).strip() for i in top_tokens]
        words.append(predictions)
        if print_results:
            print(f"Mask Predictions: {predictions}")
    return words
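For example, you can call the function on a sentence containing the RoBERTa mask token; the sample sentence below is illustrative.

# Predict the top 5 candidates for the masked word (example sentence)
predict_mask("The attacker used a phishing <mask> to steal credentials.", tokenizer, model, topk=5)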
Limitations & Risks
Domain-Specific Scope: SecureBERT+ is optimized for cybersecurity text and may not generalize as well to unrelated domains.
Bias in Training Data: The training corpus was collected from online sources and may contain biases, outdated knowledge, or inaccuracies.
Potential Misuse: While designed for defensive research, the model could be misapplied to generate adversarial content or obfuscate malicious behavior.
Resource-Intensive: The larger dataset and model training process require significant compute resources, which may limit reproducibility for smaller research teams.
Evolving Threats: The cybersecurity landscape evolves rapidly. Without regular retraining, the model may not capture emerging threats or terminology.
Users should apply SecureBERT+ responsibly, with appropriate oversight from cybersecurity professionals.
Reference
@inproceedings{aghaei2023securebert,
  title={SecureBERT: A Domain-Specific Language Model for Cybersecurity},
  author={Aghaei, Ehsan and Niu, Xi and Shadid, Waseem and Al-Shaer, Ehab},
  booktitle={Security and Privacy in Communication Networks: 18th EAI International Conference, SecureComm 2022, Virtual Event, October 2022, Proceedings},
  pages={39--56},
  year={2023},
  organization={Springer}
}