metadata
license: mit
language:
  - en
metrics:
  - perplexity
base_model:
  - google-bert/bert-base-uncased
pipeline_tag: fill-mask
library_name: transformers
tags:
  - bert
  - masked-language-modeling
  - mlm
  - fill-mask
  - transformers
  - finance
  - central-bank
  - financial-nlp
  - economic-policy
  - monetary-policy
  - BIS
  - speeches
  - BIS-Speeches
  - pretraining
  - domain-adaptation
  - financial-domain-adaptation

CentralBank-BERT: Domain-Adaptive Masked Language Model for Central Bank Communication

CentralBank-BERT is a domain-adapted masked language model based on bert-base-uncased. It was further pretrained on more than 66 million tokens (over 2 million sentences) extracted from central bank speeches published by the Bank for International Settlements (BIS) between 1996 and 2024. The model is optimized for masked token prediction in the specialized domains of monetary policy, financial regulation, and macroeconomic communication, with the aim of improving contextual representations of central banking discourse and financial narratives.
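
A minimal sketch of how the model might be queried for masked-token prediction through the transformers `fill-mask` pipeline. The repository id `bilalzafar/CentralBank-BERT` is an assumption based on the naming of the downstream models listed later in this card; the helper names and the example sentence are illustrative only.

```python
MODEL_ID = "bilalzafar/CentralBank-BERT"  # assumed Hub repository id
MASK_TOKEN = "[MASK]"  # mask token used by bert-base-uncased


def mask_word(sentence: str, word: str) -> str:
    """Replace the first occurrence of `word` with the BERT mask token."""
    if word not in sentence:
        raise ValueError(f"{word!r} does not occur in the sentence")
    return sentence.replace(word, MASK_TOKEN, 1)


def predict_masked(sentence: str, top_k: int = 5):
    """Return (token, score) pairs for the single [MASK] in `sentence`.

    transformers is imported lazily; the first call downloads the model.
    """
    from transformers import pipeline

    fill = pipeline("fill-mask", model=MODEL_ID)
    return [(p["token_str"], p["score"]) for p in fill(sentence, top_k=top_k)]


# Illustrative sentence, not taken from the training corpus.
masked = mask_word(
    "The central bank raised interest rates to curb inflation.", "inflation"
)
print(masked)
```

Calling `predict_masked(masked)` would then return the model's top candidates for the masked position, provided the assumed repository id is correct.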

Dataset Summary

  • Source: BIS Central Bank Speeches (1996–2024)
  • Total Speeches: 19,609
  • MLM Sentences: 2,087,615 (~2.09M)
  • Total Tokens: 66,359,113 (~66.36M)
  • Avg. Tokens per Sentence: 31.79
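
The averages implied by these counts can be re-derived directly. The sketch below recomputes the tokens-per-sentence figure and adds two per-speech averages that are not stated in the card; treat those as derived values only.

```python
# Counts as reported in the dataset summary.
speeches = 19_609
sentences = 2_087_615
tokens = 66_359_113

tokens_per_sentence = tokens / sentences
# The two averages below are derived here and are not reported in the card.
sentences_per_speech = sentences / speeches
tokens_per_speech = tokens / speeches

print(f"tokens per sentence:  {tokens_per_sentence:.2f}")
print(f"sentences per speech: {sentences_per_speech:.2f}")
print(f"tokens per speech:    {tokens_per_speech:.0f}")
```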

Model & Training Details

| Category | Details |
|---|---|
| Tokenizer | BertTokenizerFast (base: bert-base-uncased) |
| | Vocab Size: 30,522 |
| | Max Seq Length: 128 |
| Model | BertForMaskedLM (initialized from bert-base-uncased) |
| | Total Params: 109,514,298 (~109.5M) |
| | Trainable Params: 109,514,298 |
| Training Setup | Epochs: 1 |
| | Batch Size (per device): 16 |
| | Gradient Accumulation: 2 |
| | Effective Batch Size: 32 |
| | MLM Probability: 15% |
| Hardware | Device: NVIDIA Tesla P100 (Kaggle) |
| | Mixed Precision: fp16 |
| Training Duration | ~8 hrs 18 mins |
| | Start: 2025-07-19 17:17 |
| | End: 2025-07-20 01:35 |
| Evaluation Results | Perplexity |
| | bert-base: 13.06 |
| | CentralBank-BERT: 4.66 |
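
The training-setup figures above can be cross-checked with a little arithmetic. This sketch assumes a single GPU and, for the step count, that every MLM sentence was used for training with no held-out split; the step count is therefore an upper bound, not a reported value.

```python
import math
from datetime import datetime

per_device_batch = 16
grad_accum_steps = 2
num_devices = 1  # assumption: one Tesla P100
effective_batch = per_device_batch * grad_accum_steps * num_devices

# Upper bound on optimizer steps for one epoch, assuming all sentences
# from the dataset summary were used for training.
sentences = 2_087_615
max_optimizer_steps = math.ceil(sentences / effective_batch)

# Wall-clock duration from the reported start and end timestamps.
start = datetime(2025, 7, 19, 17, 17)
end = datetime(2025, 7, 20, 1, 35)
hours, remainder = divmod(int((end - start).total_seconds()), 3600)
minutes = remainder // 60

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps (upper bound): {max_optimizer_steps}")
print(f"duration: {hours} h {minutes} min")
```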

Lower perplexity indicates a better fit to domain-specific central bank language: CentralBank-BERT reaches 4.66, compared with 13.06 for bert-base.
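
Perplexity is conventionally the exponential of the mean cross-entropy loss over the masked tokens. Assuming that convention was used here (the card does not spell out the computation), the reported scores can be mapped back to implied losses as follows.

```python
import math


def perplexity(mean_loss: float) -> float:
    """Perplexity as exp of the mean cross-entropy loss (in nats)."""
    return math.exp(mean_loss)


# Perplexities reported in this card.
reported = {"bert-base": 13.06, "CentralBank-BERT": 4.66}

# Mean losses implied by those scores under the exp(loss) convention.
implied_loss = {name: math.log(ppl) for name, ppl in reported.items()}
relative_drop = 1 - reported["CentralBank-BERT"] / reported["bert-base"]

for name, loss in implied_loss.items():
    print(f"{name}: implied mean loss = {loss:.3f}")
print(f"relative perplexity reduction: {relative_drop:.1%}")
```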

Notebook: Training, Evaluation & Results

The full training pipeline, including data preprocessing, tokenizer setup, model training, evaluation, and result visualizations, is documented in the notebook cb-bert-mlm.ipynb. The notebook contains the actual outputs from the training run, the perplexity comparison, manual masked-sentence evaluations, and a Top-K accuracy analysis, so that the model development process can be inspected and reproduced.
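
For readers unfamiliar with Top-K accuracy in the masked-token setting, here is a minimal sketch of the metric. It is not the notebook's code, and the ranked predictions below are made up for illustration rather than taken from model output.

```python
def top_k_accuracy(ranked_predictions, gold_tokens, k):
    """Fraction of masked positions whose gold token is in the top-k predictions."""
    if len(ranked_predictions) != len(gold_tokens):
        raise ValueError("need one ranked prediction list per gold token")
    if not gold_tokens:
        return 0.0
    hits = sum(
        gold in preds[:k] for preds, gold in zip(ranked_predictions, gold_tokens)
    )
    return hits / len(gold_tokens)


# Hypothetical ranked predictions (best first) for four masked positions.
ranked = [
    ["inflation", "prices", "growth"],
    ["policy", "stability", "easing"],
    ["banks", "markets", "institutions"],
    ["rates", "yields", "spreads"],
]
gold = ["inflation", "stability", "regulators", "rates"]

for k in (1, 2, 3):
    print(f"top-{k} accuracy: {top_k_accuracy(ranked, gold, k):.2f}")
```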

Model Files

  • model.safetensors: Trained model weights
  • config.json: Model architecture and hyperparameters
  • tokenizer.json: Serialized tokenizer
  • vocab.txt: Vocabulary file
  • tokenizer_config.json: Tokenizer configuration
  • special_tokens_map.json: Special tokens mapping
  • training_args.bin: Training arguments used during pretraining

This model repository includes all essential files required to load, evaluate, or fine-tune the CentralBank-BERT model using Hugging Face's transformers library. These components are necessary to ensure full compatibility with the original training environment and to support seamless deployment or transfer learning.
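
As a small convenience, a local copy of the repository can be checked against the file list above before loading it. The function name and the temporary-directory demo are illustrative and not part of the repository.

```python
import tempfile
from pathlib import Path

# File names as listed under "Model Files".
EXPECTED_FILES = [
    "model.safetensors",
    "config.json",
    "tokenizer.json",
    "vocab.txt",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "training_args.bin",
]


def missing_files(model_dir):
    """Return the expected files that are absent from `model_dir`."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Demo on a throwaway directory that lacks the last listed file.
with tempfile.TemporaryDirectory() as tmp:
    for name in EXPECTED_FILES[:-1]:
        (Path(tmp) / name).touch()
    demo_missing = missing_files(tmp)

print(demo_missing)
```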


Downstream Models

In addition to the domain-adapted masked language model (CentralBank-BERT), a suite of fine-tuned downstream models has been released to support research and policy analysis on central bank digital currencies (CBDCs). These models share the same encoder backbone and target different classification and information extraction tasks on central bank communication.

| Model | Purpose | Intended Use | Link |
|---|---|---|---|
| bilalzafar/CBDC-BERT | Binary classifier: CBDC vs. Non-CBDC. | Flagging CBDC-related discourse in large corpora. | CBDC-BERT |
| bilalzafar/CBDC-Stance | 3-class stance model (Pro, Wait-and-See, Anti). | Research on policy stances and discourse monitoring. | CBDC-Stance |
| bilalzafar/CBDC-Sentiment | 3-class sentiment model (Positive, Neutral, Negative). | Tone analysis in central bank communications. | CBDC-Sentiment |
| bilalzafar/CBDC-Type | Classifies Retail, Wholesale, General CBDC mentions. | Distinguishing policy focus (retail vs wholesale). | CBDC-Type |
| bilalzafar/CBDC-Discourse | 3-class discourse classifier (Feature, Process, Risk-Benefit). | Structured categorization of CBDC communications. | CBDC-Discourse |
| bilalzafar/CentralBank-NER | Named Entity Recognition (NER) model for central banking discourse. | Identifying institutions, persons, and policy entities in speeches. | CentralBank-NER |
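
One way to organise access to these models is a small registry keyed by task. The sketch below assumes the classifiers are published as sequence-classification checkpoints and the NER model as a token-classification checkpoint; the dictionary keys and function names are hypothetical.

```python
# Repository ids as listed above; the short keys are hypothetical.
DOWNSTREAM_MODELS = {
    "cbdc": "bilalzafar/CBDC-BERT",
    "stance": "bilalzafar/CBDC-Stance",
    "sentiment": "bilalzafar/CBDC-Sentiment",
    "type": "bilalzafar/CBDC-Type",
    "discourse": "bilalzafar/CBDC-Discourse",
    "ner": "bilalzafar/CentralBank-NER",
}


def pipeline_task(key: str) -> str:
    """Map a registry key to the transformers pipeline task (assumed)."""
    if key not in DOWNSTREAM_MODELS:
        raise KeyError(f"unknown model key: {key!r}")
    return "token-classification" if key == "ner" else "text-classification"


def load_downstream(key: str):
    """Build a pipeline for one downstream model (downloads on first use)."""
    from transformers import pipeline

    return pipeline(pipeline_task(key), model=DOWNSTREAM_MODELS[key])


for key, repo in DOWNSTREAM_MODELS.items():
    print(f"{key:<10} {pipeline_task(key):<22} {repo}")
```

With transformers installed, `load_downstream("stance")("...")` would return label and score pairs, assuming the checkpoint exposes a standard classification head.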

Repository and Replication Package

All training pipelines, preprocessing scripts, evaluation notebooks, and result outputs are available in the companion GitHub repository:

🔗 https://github.com/bilalezafar/CentralBank-BERT

The repository includes:

  • End-to-end notebooks for CentralBank-BERT pretraining and all downstream classifiers (CBDC-BERT, Stance, Sentiment, Type, Discourse, NER).
  • Preprocessed BIS speech dataset subsets (CBDC-related sentences, annotated splits).
  • Reproducible code to generate figures, tables, and evaluation metrics reported in the manuscript.
  • Deployment-ready scripts for applying models to new corpora.

Together, these materials support the transparency and reproducibility of the CentralBank-BERT family of models and make it easier to extend them.


Citation

If you use this model, please cite as:

Zafar, M. B. (2025). CentralBank-BERT: Machine Learning Evidence on Central Bank Digital Currency Discourse. SSRN. https://papers.ssrn.com/abstract=5404456

@article{zafar2025centralbankbert,
  title={CentralBank-BERT: Machine Learning Evidence on Central Bank Digital Currency Discourse},
  author={Zafar, Muhammad Bilal},
  year={2025},
  journal={SSRN Electronic Journal},
  url={https://papers.ssrn.com/abstract=5404456}
}