Model Card for cisco-ai/SecureBERT2.0-base

SecureBERT 2.0 Base is a domain-specific transformer model optimized for cybersecurity tasks. It extends the ModernBERT architecture with cybersecurity-focused pretraining to produce contextualized embeddings for both technical text and code. SecureBERT 2.0 supports tasks like masked language modeling, semantic search, named entity recognition, vulnerability detection, and code analysis.


Model Details

Model Description

SecureBERT 2.0 Base is designed for deep contextual understanding of cybersecurity language and code. It leverages domain-specific pretraining on a large, heterogeneous corpus covering threat reports, blogs, documentation, and codebases, making it effective for reasoning across natural language and programming syntax.

  • Developed by: Cisco AI
  • Model type: Transformer (ModernBERT architecture)
  • Language: English
  • License: Apache 2.0
  • Finetuned from model: answerdotai/ModernBERT-base

Model Sources

  • Paper: arXiv:2510.00240

Uses

Direct Use

  • Masked language modeling for cybersecurity text and code
  • Embedding generation for semantic search and retrieval (see the sketch after this list)
  • Code and text feature extraction for downstream classification or clustering
  • Named entity recognition (NER) on security-related entities
  • Vulnerability detection in source code
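
A minimal sketch of the embedding-generation use listed above, assuming mean pooling over the final hidden states; the pooling strategy, example sentences, and cosine-similarity comparison are illustrative choices, not a prescribed recipe from the model authors.

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "cisco-ai/SecureBERT2.0-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

sentences = [
    "The attacker used a phishing email to deliver the payload.",
    "A SQL injection flaw was found in the login endpoint.",
]

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state  # (batch, seq_len, hidden)

# Mean-pool over non-padding tokens to get one vector per sentence
mask = inputs["attention_mask"].unsqueeze(-1)
embeddings = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1)

# Cosine similarity between the two sentence embeddings
similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(embeddings.shape, similarity.item())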

Downstream Use

Fine-tuning for:

  • Threat intelligence extraction
  • Security question answering
  • Incident analysis and summarization
  • Automated code review and vulnerability prediction (a fine-tuning sketch follows this list)
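
A minimal fine-tuning sketch for the vulnerability-prediction item above, assuming a binary sequence-classification setup with the Hugging Face Trainer; the two-example dataset, label scheme, and hyperparameters are placeholders for illustration and do not come from the SecureBERT 2.0 paper.

from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "cisco-ai/SecureBERT2.0-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Toy dataset: label 1 = vulnerable snippet, 0 = safer variant (illustrative only)
examples = {
    "text": [
        "query = \"SELECT * FROM users WHERE name = '\" + user_input + \"'\"",
        "query = db.execute(\"SELECT * FROM users WHERE name = ?\", (user_input,))",
    ],
    "label": [1, 0],
}
dataset = Dataset.from_dict(examples).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="securebert2-vuln", num_train_epochs=3,
                           per_device_train_batch_size=8, learning_rate=2e-5),
    train_dataset=dataset,
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()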

Out-of-Scope Use

  • Non-English or non-technical text
  • General-purpose conversational AI
  • Decision-making in real-time security systems without human oversight

Bias, Risks, and Limitations

The model reflects biases in the cybersecurity sources it was trained on, which may include:

  • Overrepresentation of certain threat actors, technologies, or organizations
  • Inconsistent code or documentation quality
  • Limited exposure to non-public or proprietary data formats

Recommendations

Users should evaluate outputs in their specific context and avoid automated high-stakes decisions without expert validation.


How to Get Started with the Model

from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

model_name = "cisco-ai/SecureBERT2.0-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

text = "The malware exploits a vulnerability in the [MASK] system."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Predict only at the [MASK] position instead of decoding the whole sequence
mask_index = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_token_id = outputs.logits[0, mask_index].argmax(-1)
predicted_word = tokenizer.decode(predicted_token_id)
print(predicted_word)
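
For quick experimentation, the same prediction can also be obtained through the fill-mask pipeline, which locates the [MASK] token and decodes the top candidates automatically:

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="cisco-ai/SecureBERT2.0-base")
for prediction in fill_mask("The malware exploits a vulnerability in the [MASK] system."):
    print(prediction["token_str"], round(prediction["score"], 4))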

Training Details

Training Procedure

Preprocessing

Hybrid tokenization for text and code (natural language + structured syntax).
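
The exact preprocessing pipeline is not reproduced in this card; the snippet below only illustrates that a single tokenizer is applied to both natural-language prose and code syntax (both example strings are made up):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("cisco-ai/SecureBERT2.0-base")

# One tokenizer covers prose and structured code syntax alike
text = "The exploit calls system() with attacker-controlled input."
code = 'if (strcmp(user, "admin") == 0) { run(cmd); }'

print(tokenizer.tokenize(text))
print(tokenizer.tokenize(code))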

Training Hyperparameters

  • Objective: Masked Language Modeling (MLM)
  • Masking probability: 0.10
  • Optimizer: AdamW
  • Learning rate: 5e-5
  • Weight decay: 0.01
  • Epochs: 20
  • Batch size: 16 per GPU × 8 GPUs
  • Curriculum: Microannealing (gradual dataset diversification)
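
As a rough sketch (not the actual training script), the hyperparameters above map onto the Transformers Trainer roughly as follows; the pretraining corpus and the microannealing curriculum are not reproduced here, so the dataset is left as a placeholder.

from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "answerdotai/ModernBERT-base"  # starting checkpoint per this card
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# 10% masking probability, as listed above
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.10)

args = TrainingArguments(
    output_dir="securebert2-mlm",
    learning_rate=5e-5,              # listed learning rate
    weight_decay=0.01,               # listed weight decay
    num_train_epochs=20,             # listed epochs
    per_device_train_batch_size=16,  # 16 per GPU across 8 GPUs
)

# train_dataset stands in for the tokenized cybersecurity/code corpus,
# which is not distributed with this card.
# trainer = Trainer(model=model, args=args, data_collator=collator,
#                   train_dataset=train_dataset, tokenizer=tokenizer)
# trainer.train()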

Evaluation

Testing Data, Factors & Metrics

Testing Data

Internal held-out subset of cybersecurity and code corpora.

Factors

Evaluated across token categories:

  • Objects (nouns)
  • Actions (verbs)
  • Code tokens

Metrics

Top-n accuracy on masked token prediction.
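
A minimal sketch of top-n accuracy over masked positions, assuming the logits at each masked position and the ground-truth token ids have already been collected; the categorization into objects, actions, and code tokens and the evaluation corpus itself are not reproduced, and the random tensors are purely illustrative.

import torch

def top_n_accuracy(logits: torch.Tensor, true_token_ids: torch.Tensor, n: int = 5) -> float:
    """Fraction of masked positions whose true token appears in the model's top-n predictions.

    logits: (num_masked_positions, vocab_size) scores at the masked positions
    true_token_ids: (num_masked_positions,) ground-truth token ids
    """
    top_n = logits.topk(n, dim=-1).indices                       # (num_masked, n)
    hits = (top_n == true_token_ids.unsqueeze(-1)).any(dim=-1)   # (num_masked,)
    return hits.float().mean().item()

# Illustration with random scores and a made-up vocabulary size
logits = torch.randn(100, 50_000)
labels = torch.randint(0, 50_000, (100,))
print(top_n_accuracy(logits, labels, n=5))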

Results

Top-n   Objects (Nouns)   Verbs (Actions)   Code Tokens
 1      56.20%            45.02%            39.27%
 2      69.73%            60.00%            46.90%
 3      75.85%            66.68%            50.87%
 4      80.01%            71.56%            53.36%
 5      82.72%            74.12%            55.41%
10      88.80%            81.64%            60.03%

A comparative study of SecureBERT 2.0, the original SecureBERT, and ModernBERT on the masked language modeling (MLM) task (figure not reproduced here) shows that SecureBERT 2.0 outperforms both, particularly in code understanding and domain-specific terminology.

Summary

SecureBERT 2.0 outperforms both the original SecureBERT and ModernBERT on cybersecurity-specific and code-related tasks.


Environmental Impact

  • Hardware Type: 8× GPU cluster
  • Hours used: [Information Not Available]
  • Cloud Provider: [Information Not Available]
  • Compute Region: [Information Not Available]
  • Carbon Emitted: [Estimate Not Available]

Carbon footprint can be estimated using Lacoste et al. (2019).


Technical Specifications

Model Architecture and Objective

  • Architecture: ModernBERT
  • Max sequence length: 1024 tokens
  • Parameters: 150 M
  • Objective: Masked Language Modeling (MLM)
  • Tensor type: F32
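
These figures can be sanity-checked locally with a short snippet like the one below; the printed values depend on the released configuration rather than on anything added here.

from transformers import AutoConfig, AutoModelForMaskedLM

config = AutoConfig.from_pretrained("cisco-ai/SecureBERT2.0-base")
print(config.model_type)               # architecture family
print(config.max_position_embeddings)  # maximum sequence length

model = AutoModelForMaskedLM.from_pretrained("cisco-ai/SecureBERT2.0-base")
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")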

Compute Infrastructure

  • Framework: Transformers (PyTorch)
  • Precision: fp32
  • Hardware: 8 GPUs
  • Checkpoint Format: Safetensors

Citation

BibTeX:

@article{aghaei2025securebert,
  title={SecureBERT 2.0: Advanced Language Model for Cybersecurity Intelligence},
  author={Aghaei, Ehsan and Jain, Sarthak and Arun, Prashanth and Sambamoorthy, Arjun},
  journal={arXiv preprint arXiv:2510.00240},
  year={2025}
}

APA:

Aghaei, E., Jain, S., Arun, P., & Sambamoorthy, A. (2025). SecureBERT 2.0: Advanced language model for cybersecurity intelligence. arXiv preprint arXiv:2510.00240.


Model Card Authors

Cisco AI

Model Card Contact

For inquiries, please contact [email protected]
