CentralBank-BERT for Named Entity Recognition (NER)

A domain-adapted BERT model (CentralBank-BERT) was fine-tuned for Named Entity Recognition (NER) in central banking discourse. The model automatically identifies and labels key entities in central bank speeches and related documents, focusing on three categories of interest:

  • AUTHOR / SPEAKER – the individual delivering the speech or statement
  • POSITION – the official title or role of the speaker (e.g., Governor, Deputy Governor, Board Member)
  • AFFILIATION – the institution or organization associated with the speaker (e.g., Bank of Japan, European Central Bank, Bank of England)

The COUNTRY label was not explicitly modeled, since this information can be reliably inferred from the affiliation of the central bank.
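In practice, recovering the country amounts to a simple lookup over the predicted affiliation. A minimal sketch, assuming predictions from the pipeline shown in the Usage section; the mapping below is illustrative and would need to cover every institution in the corpus:

# Minimal sketch: recovering COUNTRY from a predicted AFFILIATION.
# The mapping is illustrative, not part of the released model.
AFFILIATION_TO_COUNTRY = {
    "bank of japan": "Japan",
    "european central bank": "Euro area",
    "bank of england": "United Kingdom",
}

def infer_country(affiliation):
    return AFFILIATION_TO_COUNTRY.get(affiliation.lower().strip())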

Data

  • Source: BIS database of central bank speeches (1996–2024)
  • Corpus Size: 17,648 annotated speeches, of which 1,961 are held out for validation.
  • Input Field: Speech descriptions, which typically contain a short speech title along with the name, position, and institutional affiliation of the speaker.

Annotation Process:

  1. A subset of short speech descriptions was manually annotated with entity spans for Author, Position, and Affiliation.
  2. This annotated subset was used to train an initial NER model.
  3. The model was then applied to the larger dataset (1996–2024) to generate preliminary labels.
  4. All generated labels were manually reviewed and corrected, ensuring complete and consistent annotation across the entire corpus of available speeches.

This approach combined manual expertise with machine-assisted annotation, making it feasible to build a large-scale, high-quality dataset covering nearly three decades of central bank communication.
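A minimal sketch of step 3, the machine-assisted pre-labeling pass, assuming the initial model from step 2 is saved locally; the checkpoint path, file names, and output format are illustrative:

import json
from transformers import pipeline

# "initial-ner-model" is the checkpoint trained on the manually
# annotated subset (step 2); the path is illustrative.
ner = pipeline("token-classification", model="initial-ner-model",
               aggregation_strategy="simple")

with open("descriptions.txt") as src, open("prelabels.jsonl", "w") as out:
    for line in src:
        text = line.strip()
        ents = [{"label": e["entity_group"], "start": e["start"],
                 "end": e["end"], "text": e["word"]} for e in ner(text)]
        # Each record is then manually reviewed and corrected (step 4).
        out.write(json.dumps({"text": text, "entities": ents}) + "\n")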

Data Preparation

  1. Normalization: Lowercasing, removal of diacritics, and unification of punctuation.

  2. Alias resolution: Institution abbreviations normalized (e.g., “BOJ” → “Bank of Japan”, “ECB” → “European Central Bank”).

  3. Entity alignment: Fuzzy string matching used to locate annotated entities in raw text.

  4. BIO Encoding:

    • Tokenization with BERT WordPiece tokenizer.
    • Conversion of annotations into BIO tags (B-, I-, O) at token level.
    • Construction of a training file in JSONL format with tokens and ner_tags (steps 1–4 are sketched below).
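A minimal sketch of steps 1–4 under some simplifying assumptions: gold entities arrive as character spans (as produced by the fuzzy matching in step 3), the alias table is abbreviated, and tokens are kept at the word level here, with the WordPiece tokenization and subword alignment deferred to the training sketch in the next section:

import json
import re
import unicodedata

# Illustrative alias table (step 2); the real list is larger.
ALIASES = {r"\bboj\b": "bank of japan", r"\becb\b": "european central bank"}

def normalize(text):
    """Step 1: lowercase, strip diacritics, unify punctuation; step 2: aliases."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = text.replace("\u2019", "'").replace("\u2018", "'")
    for pattern, full in ALIASES.items():
        text = re.sub(pattern, full, text)
    return text

def to_bio(text, spans):
    """Step 4: BIO tags per word. `spans` holds (start, end, label) character
    offsets, e.g. recovered by fuzzy matching (step 3). Punctuation is dropped
    by this simplified word splitter."""
    tokens, tags = [], []
    for m in re.finditer(r"\w+", text):
        tokens.append(m.group())
        label = next((lab for s, e, lab in spans
                      if m.start() >= s and m.end() <= e), None)
        if label is None:
            tags.append("O")
        else:
            prev_inside = tags and tags[-1] != "O" and tags[-1].endswith(label)
            tags.append(("I-" if prev_inside else "B-") + label)
    return tokens, tags

# One JSONL record with tokens and ner_tags.
text = normalize("Speech by Mr Yi Gang, Governor of the People's Bank of China")
tokens, ner_tags = to_bio(text, [(13, 20, "AUTHOR")])
print(json.dumps({"tokens": tokens, "ner_tags": ner_tags}))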

Model Training

  • Base model: CentralBank-BERT, a domain-adapted BERT trained on central banking corpora.
  • Task head: Token classification layer with num_labels = 7 (BIO scheme for Author, Position, Affiliation).
  • Token alignment: Word-to-token mapping with subword label propagation (-100 used for ignored positions).
  • Training setup:
    • Optimizer: AdamW with weight decay 0.01
    • Learning rate: 2e-5
    • Batch size: 16 (train & eval)
    • Epochs: 3
    • Mixed precision (fp16) when available
    • Evaluation with seqeval metrics (precision, recall, F1); a condensed training sketch follows this list
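A condensed sketch of this setup, assuming the JSONL files from the previous section; the base-model repo id, file names, and label order are assumptions rather than confirmed details. Trainer's default optimizer is AdamW, matching the configuration above:

import numpy as np
import evaluate
from datasets import load_dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

LABELS = ["O", "B-AUTHOR", "I-AUTHOR", "B-POSITION", "I-POSITION",
          "B-AFFILIATION", "I-AFFILIATION"]  # num_labels = 7

BASE = "bilalzafar/CentralBank-BERT"  # assumed repo id for the base model
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForTokenClassification.from_pretrained(BASE, num_labels=len(LABELS))

ds = load_dataset("json", data_files={"train": "train.jsonl",
                                      "validation": "val.jsonl"})

def tokenize_and_align(batch):
    # Word-to-token mapping: the first WordPiece of each word keeps the
    # word's label; special tokens and remaining pieces get -100.
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    enc["labels"] = []
    for i, tags in enumerate(batch["ner_tags"]):
        prev, labels = None, []
        for wid in enc.word_ids(batch_index=i):
            if wid is None or wid == prev:
                labels.append(-100)
            else:
                labels.append(LABELS.index(tags[wid]))
            prev = wid
        enc["labels"].append(labels)
    return enc

ds = ds.map(tokenize_and_align, batched=True,
            remove_columns=ds["train"].column_names)

seqeval = evaluate.load("seqeval")

def compute_metrics(p):
    # Strip -100 positions, then score at the entity level with seqeval.
    preds = np.argmax(p.predictions, axis=-1)
    true_pred = [[LABELS[q] for q, l in zip(pr, lb) if l != -100]
                 for pr, lb in zip(preds, p.label_ids)]
    true_lab = [[LABELS[l] for l in lb if l != -100] for lb in p.label_ids]
    return seqeval.compute(predictions=true_pred, references=true_lab)

args = TrainingArguments(
    output_dir="centralbank-ner",
    learning_rate=2e-5,
    weight_decay=0.01,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    evaluation_strategy="epoch",
    fp16=True,  # mixed precision when a CUDA GPU is available
)

Trainer(model=model, args=args,
        train_dataset=ds["train"], eval_dataset=ds["validation"],
        data_collator=DataCollatorForTokenClassification(tokenizer),
        compute_metrics=compute_metrics).train()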

Results

The model was trained on the annotated corpus described above, with 1,961 speeches held out for validation. Evaluation metrics are entity-level precision, recall, and F1-score from the seqeval library. Final Validation Performance (Epoch 3):

Entity Type     Precision   Recall    F1-score   Support
Affiliation     0.9850      0.9862    0.9856     1,734
Author          0.9816      0.9912    0.9864     1,936
Position        0.9735      0.9846    0.9790     1,942
Overall         0.9798      0.9862    0.9830     5,612
  • Accuracy (token-level): 0.9956
  • Overall F1 (macro): 0.983

The results show high precision and recall across all three categories, confirming that the model provides reliable structured metadata extraction from central bank communications.
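The per-entity rows above come from seqeval's standard entity-level report; a toy example with invented tag sequences shows how such a table is produced:

from seqeval.metrics import classification_report

# Toy illustration of the entity-level report used above;
# the tag sequences are made up, not model output.
y_true = [["B-AUTHOR", "I-AUTHOR", "O", "B-POSITION", "O"]]
y_pred = [["B-AUTHOR", "I-AUTHOR", "O", "B-POSITION", "O"]]
print(classification_report(y_true, y_pred, digits=4))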

Usage

from transformers import pipeline

# HF model repo
model = "bilalzafar/CentralBank-NER"

ner = pipeline(
    task="token-classification",
    model=model,
    tokenizer=model,
    aggregation_strategy="simple"   # merges subword pieces
)

# Example text
text = "Speech by Mr Yi Gang, Governor of the People's Bank of China, at the IMF Annual Meeting."
for ent in ner(text):
    print(f"{ent['entity_group']:12}  {ent['word']:<25}  score={ent['score']:.3f}")

# Example output:
# AUTHOR        yi gang                    score=0.997
# POSITION      governor                   score=0.999
# AFFILIATION   people ' s bank of china   score=0.999