Youngja Park committed on
Commit 4cd0a0e · verified · 1 Parent(s): ea74b51

Update README.md

Files changed (1)
  1. README.md +14 -12
README.md CHANGED
@@ -9,23 +9,28 @@ model-index:
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment. -->
 
-# security-bert256-50k
-
-This model is a fine-tuned version of [](https://huggingface.co/) on the None dataset.
-
-## Model description
-
-More information needed
-
-## Intended uses & limitations
-
-More information needed
-
-## Training and evaluation data
-
-More information needed
-
-## Training procedure
+# CTI-BERT
+
+CTI-BERT is a pre-trained language model for the cybersecurity domain.
+The model was trained on a large corpus of security-related text comprising approximately 1.2 billion tokens drawn from
+a diverse range of sources, including security news articles, vulnerability descriptions, books, academic publications, and security-related Wikipedia pages.
+
+For additional technical details and the model's performance metrics, please refer to [this paper](https://aclanthology.org/2023.emnlp-industry.12.pdf).
+
+## Model description
+
+This model has a vocabulary of 50,000 tokens and a sequence length of 256.
+Both the tokenizer and the BERT model were trained from scratch using the [run_mlm script](https://github.com/huggingface/transformers/blob/main/examples/pytorch/language-modeling/run_mlm.py)
+with the masked language modeling (MLM) objective.
+
+## Intended uses & limitations
+
+You can use the model for masked language modeling or token embedding generation, but it is primarily intended to be fine-tuned on a downstream task such as
+sequence classification, text classification, or question answering.
+
+The model has shown improved performance on various cybersecurity text classification tasks. However, it is not designed to be used as the main model for general-domain text.
 
 ### Training hyperparameters
 
@@ -41,9 +46,6 @@ The following hyperparameters were used during training:
 - lr_scheduler_warmup_steps: 10000
 - training_steps: 200000
 
-### Training results
-
-
 
 ### Framework versions
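
The updated card describes a from-scratch BERT masked language model with a 50,000-token vocabulary and a sequence length of 256, and mentions masked language modeling and token-embedding generation as direct uses. Below is a minimal usage sketch with the Hugging Face `transformers` library; the repo id `ibm/CTI-BERT` is an assumption for illustration and may not match the actual repository path.

```python
# A minimal sketch (not part of the model card): MLM inference and
# token-embedding extraction with Hugging Face transformers.
import torch
from transformers import AutoTokenizer, AutoModel, pipeline

model_id = "ibm/CTI-BERT"  # assumed repo id; substitute the actual repository path

tokenizer = AutoTokenizer.from_pretrained(model_id)

# 1) Masked language modeling via the fill-mask pipeline.
fill = pipeline("fill-mask", model=model_id, tokenizer=tokenizer)
text = f"The attacker exploited a {tokenizer.mask_token} in the web server."
for pred in fill(text):
    print(pred["token_str"], round(pred["score"], 3))

# 2) Token embeddings from the bare encoder (no MLM head).
encoder = AutoModel.from_pretrained(model_id)
inputs = tokenizer(
    "CVE-2021-44228 affects Apache Log4j.",
    return_tensors="pt",
    truncation=True,
    max_length=256,  # matches the sequence length stated in the card
)
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # shape: (1, seq_len, hidden_size)
print(hidden.shape)
```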
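
The card also states that the model is primarily meant to be fine-tuned on downstream tasks such as sequence classification. The sketch below shows one conventional way to do that with the `Trainer` API; the CSV files, label count, and repo id are illustrative assumptions rather than details from this commit.

```python
# A minimal fine-tuning sketch for the sequence-classification use case the
# card mentions. Data files, num_labels, and the repo id are assumptions.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "ibm/CTI-BERT"  # assumed repo id
num_labels = 5             # e.g. five incident categories (assumption)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=num_labels)

# Any text-classification dataset with "text" and "label" columns would do here.
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

def tokenize(batch):
    # Truncate to 256 tokens, the sequence length stated in the card.
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="cti-bert-finetuned",
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```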