{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Dataset Background & Loading\n",
"\n",
"The training dataset was sourced from the publicly available BIS central bank speeches, downloaded using the `gingado` package:\n",
"\n",
"```python\n",
"from gingado.datasets import load_CB_speeches\n",
"all_speeches = load_CB_speeches()\n",
"all_speeches.to_csv(\"central_bank_speeches.csv\", index=False)\n",
"```\n",
"\n",
"A preprocessing script was applied to clean the text, lowercase it, split speeches into well-formed sentences, and filter out short/noisy segments. This generated over **2 million sentence-level samples**, saved as `speeches_data_preprocessed.csv`.\n",
"\n",
"For training on Kaggle, the preprocessed dataset was uploaded as an external file and loaded.\n",
"\n",
"This ensures clean and consistent input for masked language modeling (MLM)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19",
"_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5",
"execution": {
"iopub.execute_input": "2025-07-19T17:12:03.395329Z",
"iopub.status.busy": "2025-07-19T17:12:03.395050Z",
"iopub.status.idle": "2025-07-19T17:12:16.719665Z",
"shell.execute_reply": "2025-07-19T17:12:16.719049Z",
"shell.execute_reply.started": "2025-07-19T17:12:03.395302Z"
},
"trusted": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(19609, 8)\n"
]
},
{
"data": {
"text/html": [
"
[\"mr. dai looks at the possibilities of streng...
\n",
"
\n",
"
\n",
"
2
\n",
"
https://www.bis.org/review/r970211a.pdf
\n",
"
Mr. Dai assesses the outlook for Hong Kong as ...
\n",
"
Speech by the Governor of the People's Bank of...
\n",
"
1996-09-30 00:00:00
\n",
"
Mr. Dai assesses the outlook for Hong Kong as ...
\n",
"
Dai Xianglong
\n",
"
China
\n",
"
[\"mr. dai assesses the outlook for hong kong a...
\n",
"
\n",
"
\n",
"
3
\n",
"
https://www.bis.org/review/r970203b.pdf
\n",
"
Mr. Rangarajan examines the objectives of mone...
\n",
"
Address by the Governor of the Reserve Bank of...
\n",
"
1996-12-28 00:00:00
\n",
"
Mr. Rangarajan examines the objectives of mone...
\n",
"
Bimal Jalan
\n",
"
India
\n",
"
[\"mr. rangarajan examines the objectives of mo...
\n",
"
\n",
"
\n",
"
4
\n",
"
https://www.bis.org/review/r970115a.pdf
\n",
"
M. Trichet presents the monetary policy guidel...
\n",
"
BANK OF FRANCE, PRESS RELEASE, 17/12/96.
\n",
"
1996-12-17 00:00:00
\n",
"
M. Trichet presents the monetary policy guidel...
\n",
"
Bank of France
\n",
"
France
\n",
"
['m. trichet presents the monetary policy guid...
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" url \\\n",
"0 https://www.bis.org/review/r970211c.pdf \n",
"1 https://www.bis.org/review/r970211b.pdf \n",
"2 https://www.bis.org/review/r970211a.pdf \n",
"3 https://www.bis.org/review/r970203b.pdf \n",
"4 https://www.bis.org/review/r970115a.pdf \n",
"\n",
" title \\\n",
"0 Mr. Chen discusses monetary relations between ... \n",
"1 Mr. Dai looks at the possibilities of strength... \n",
"2 Mr. Dai assesses the outlook for Hong Kong as ... \n",
"3 Mr. Rangarajan examines the objectives of mone... \n",
"4 M. Trichet presents the monetary policy guidel... \n",
"\n",
" description date \\\n",
"0 Speech by the Deputy Governor of the People's ... 1996-09-10 00:00:00 \n",
"1 Speech by the Governor of the People's Bank of... 1996-11-13 00:00:00 \n",
"2 Speech by the Governor of the People's Bank of... 1996-09-30 00:00:00 \n",
"3 Address by the Governor of the Reserve Bank of... 1996-12-28 00:00:00 \n",
"4 BANK OF FRANCE, PRESS RELEASE, 17/12/96. 1996-12-17 00:00:00 \n",
"\n",
" text author country \\\n",
"0 Mr. Chen discusses monetary relations between ... Chen Yuan China \n",
"1 Mr. Dai looks at the possibilities of strength... Dai Xianglong China \n",
"2 Mr. Dai assesses the outlook for Hong Kong as ... Dai Xianglong China \n",
"3 Mr. Rangarajan examines the objectives of mone... Bimal Jalan India \n",
"4 M. Trichet presents the monetary policy guidel... Bank of France France \n",
"\n",
" processed_text \n",
"0 [\"mr. chen discusses monetary relations betwee... \n",
"1 [\"mr. dai looks at the possibilities of streng... \n",
"2 [\"mr. dai assesses the outlook for hong kong a... \n",
"3 [\"mr. rangarajan examines the objectives of mo... \n",
"4 ['m. trichet presents the monetary policy guid... "
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import pandas as pd\n",
"\n",
"df = pd.read_csv('/kaggle/input/bis-speeches/speeches_data_preprocessed.csv')\n",
"print(df.shape)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Tokenize BIS Sentences for MLM Training\n",
"\n",
"This section prepares the preprocessed central bank speech sentences for masked language modeling (MLM) by:\n",
"\n",
"- Flattening over 2 million cleaned sentences into a single list.\n",
"- Converting them into a Hugging Face `Dataset` object.\n",
"- Tokenizing using the `bert-base-uncased` tokenizer with:\n",
" - `max_length=128` (chosen based on sentence length distribution: ~99% of sentences fall within this limit),\n",
" - truncation and padding enabled.\n",
"- Applying tokenization in parallel using `num_proc=4` for efficiency.\n",
"- Saving the tokenized dataset locally for later training use.\n",
"\n",
"The tokenized dataset is saved to:\n",
"\n",
"```\n",
"/kaggle/working/tokenized_bis_dataset\n",
"```\n",
"\n",
"This ensures the input is consistently preprocessed and optimally sized for efficient MLM training.\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"execution": {
"iopub.execute_input": "2025-07-19T17:12:20.157315Z",
"iopub.status.busy": "2025-07-19T17:12:20.157050Z",
"iopub.status.idle": "2025-07-19T17:16:28.711161Z",
"shell.execute_reply": "2025-07-19T17:16:28.710348Z",
"shell.execute_reply.started": "2025-07-19T17:12:20.157296Z"
},
"trusted": true
},
"outputs": [
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "a51d2c2858e644a585bd2c6e07b2d618",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"tokenizer_config.json: 0%| | 0.00/48.0 [00:00, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "9f54ee9d0a5243af9ff7c6f7e434d011",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"vocab.txt: 0%| | 0.00/232k [00:00, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "406db007323a40aab7208aef89a08fad",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"tokenizer.json: 0%| | 0.00/466k [00:00, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "47f6de8956064f739e2775ac60de5311",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"config.json: 0%| | 0.00/570 [00:00, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"๐ Tokenizing dataset...\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "399b6b218b98493fa8569914186f0447",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"๐ Tokenizing with multiprocessing... (num_proc=4): 0%| | 0/2087615 [00:00, ? examples/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "7438a179d847487ba19dc132c0e52dd3",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Saving the dataset (0/4 shards): 0%| | 0/2087615 [00:00, ? examples/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"โ Tokenized dataset saved to: /kaggle/working/tokenized_bis_dataset\n"
]
}
],
"source": [
"# 1. Install Hugging Face libraries\n",
"# !pip install -U transformers datasets --quiet\n",
"\n",
"# 2. Import libraries\n",
"from transformers import BertTokenizerFast\n",
"from datasets import Dataset\n",
"import pandas as pd\n",
"import os\n",
"\n",
"# 3. Load CSV and extract valid sentences\n",
"df = pd.read_csv(\"/kaggle/input/bis-speeches/speeches_data_preprocessed.csv\")\n",
"df = df[df[\"processed_text\"].notna()]\n",
"df[\"processed_text\"] = df[\"processed_text\"].apply(eval)\n",
"\n",
"# 4. Flatten all sentences\n",
"sentences = [sentence for sublist in df[\"processed_text\"] for sentence in sublist]\n",
"dataset = Dataset.from_dict({\"text\": sentences})\n",
"\n",
"# 5. Load tokenizer\n",
"tokenizer = BertTokenizerFast.from_pretrained(\"bert-base-uncased\")\n",
"\n",
"# 6. Tokenization function\n",
"def tokenize_function(example):\n",
" return tokenizer(example[\"text\"], truncation=True, padding=\"max_length\", max_length=128)\n",
"\n",
"# 7. Apply tokenization with multiprocessing\n",
"print(\"๐ Tokenizing dataset...\")\n",
"tokenized_dataset = dataset.map(\n",
" tokenize_function,\n",
" batched=True,\n",
" remove_columns=[\"text\"],\n",
" num_proc=4,\n",
" desc=\"๐ Tokenizing with multiprocessing...\"\n",
")\n",
"\n",
"# 8. Save tokenized dataset\n",
"save_path = \"/kaggle/working/tokenized_bis_dataset\"\n",
"tokenized_dataset.save_to_disk(save_path)\n",
"print(f\"โ Tokenized dataset saved to: {save_path}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pretrain BERT on Central Bank Speech Corpus (MLM)\n",
"\n",
"This section fine-tunes `bert-base-uncased` on a domain-specific corpus of over 2 million central bank sentences using **Masked Language Modeling (MLM)**.\n",
"\n",
"Key training details:\n",
"\n",
"- โ **Single GPU (P100)** with controlled device visibility.\n",
"- โ **Gradient Accumulation**: 16 ร 2 โ effective batch size of 32.\n",
"- โ **MLM Probability**: 15% tokens masked per sample.\n",
"- โ **Training Epochs**: 1 full pass through the complete dataset.\n",
"- โ **Mixed Precision (fp16)**: Enabled for speed and memory efficiency.\n",
"- โ **Saving Strategy**: Model saved at the end of training.\n",
"\n",
"Output:\n",
"\n",
"- The domain-adapted model is saved to:\n",
" \n",
" ```\n",
" /kaggle/working/bert-mlm-bis\n",
" ```\n",
"\n",
"This fine-tuned model (CB-BERT-MLM) is specialized for financial and economic language understanding in masked token prediction tasks.\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"execution": {
"iopub.execute_input": "2025-07-19T17:16:39.868642Z",
"iopub.status.busy": "2025-07-19T17:16:39.868105Z",
"iopub.status.idle": "2025-07-20T01:35:51.439680Z",
"shell.execute_reply": "2025-07-20T01:35:51.438769Z",
"shell.execute_reply.started": "2025-07-19T17:16:39.868616Z"
},
"trusted": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"2025-07-19 17:16:46.215827: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered\n",
"WARNING: All log messages before absl::InitializeLog() is called are written to STDERR\n",
"E0000 00:00:1752945406.370939 36 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered\n",
"E0000 00:00:1752945406.420402 36 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"โ Tokenized dataset loaded with 2087615 samples.\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "7277654107b64ba3b2cb5e7fa6bf416d",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"model.safetensors: 0%| | 0.00/440M [00:00, ?B/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']\n",
"- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
"- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
"/tmp/ipykernel_36/1518358492.py:53: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.\n",
" trainer = Trainer(\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"โฑ๏ธ Training started at: 2025-07-19 17:17:02.401010\n"
]
},
{
"data": {
"text/html": [
"\n",
"
"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"โ Training completed at: 2025-07-20 01:35:50.877293\n",
"๐ Final model saved to /kaggle/working/bert-mlm-bis\n"
]
}
],
"source": [
"# 1. Install required packages\n",
"# !pip install -U transformers datasets --quiet\n",
"\n",
"# 2. Imports\n",
"from transformers import (\n",
" BertTokenizerFast,\n",
" BertForMaskedLM,\n",
" Trainer,\n",
" TrainingArguments,\n",
" DataCollatorForLanguageModeling\n",
")\n",
"from datasets import load_from_disk\n",
"from datetime import datetime\n",
"import torch\n",
"import os\n",
"\n",
"# 3. Force use of single GPU (for P100)\n",
"os.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\"\n",
"\n",
"# 4. Load tokenizer and dataset\n",
"tokenizer = BertTokenizerFast.from_pretrained(\"bert-base-uncased\")\n",
"dataset = load_from_disk(\"/kaggle/working/tokenized_bis_dataset\")\n",
"print(f\"โ Tokenized dataset loaded with {len(dataset)} samples.\")\n",
"\n",
"# 5. Load model\n",
"model = BertForMaskedLM.from_pretrained(\"bert-base-uncased\")\n",
"\n",
"# 6. Data collator for MLM\n",
"data_collator = DataCollatorForLanguageModeling(\n",
" tokenizer=tokenizer,\n",
" mlm=True,\n",
" mlm_probability=0.15\n",
")\n",
"\n",
"# 7. Training arguments (gradient accumulation + smaller per-device batch)\n",
"training_args = TrainingArguments(\n",
" output_dir=\"/kaggle/working/bert-mlm-bis\",\n",
" overwrite_output_dir=True,\n",
" num_train_epochs=1, # โ Full dataset, 1 pass\n",
" per_device_train_batch_size=16, # โ Lower memory per device\n",
" gradient_accumulation_steps=2, # โ Effective batch size = 32\n",
" eval_strategy=\"no\", # โ No eval during training\n",
" save_strategy=\"epoch\", # โ Save once at end\n",
" logging_dir=\"/kaggle/working/logs\",\n",
" logging_steps=200,\n",
" fp16=torch.cuda.is_available(), # โ Mixed precision\n",
" dataloader_num_workers=4,\n",
" save_total_limit=1,\n",
" report_to=\"none\"\n",
")\n",
"\n",
"# 8. Initialize Trainer\n",
"trainer = Trainer(\n",
" model=model,\n",
" args=training_args,\n",
" train_dataset=dataset,\n",
" tokenizer=tokenizer,\n",
" data_collator=data_collator,\n",
")\n",
"\n",
"# 9. Train\n",
"print(\"โฑ๏ธ Training started at:\", datetime.now())\n",
"trainer.train()\n",
"print(\"โ Training completed at:\", datetime.now())\n",
"\n",
"# 10. Save final model\n",
"trainer.save_model(\"/kaggle/working/bert-mlm-bis\")\n",
"print(\"๐ Final model saved to /kaggle/working/bert-mlm-bis\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Evaluate Trained Model and Compute Perplexity\n",
"\n",
"To assess the quality of the pretrained CB-BERT-MLM model, evaluated it on a randomly sampled subset of 10,000 sentences from the tokenized dataset. This step computes:\n",
"\n",
"- **Evaluation loss** on masked language modeling (MLM)\n",
"- **Perplexity**, a standard metric indicating how confidently the model predicts masked tokens (lower is better)\n",
"\n",
"```python\n",
"from datasets import load_from_disk\n",
"from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling\n",
"import math\n",
"\n",
"# Load trained model and tokenizer\n",
"model = AutoModelForMaskedLM.from_pretrained(...)\n",
"tokenizer = AutoTokenizer.from_pretrained(...)\n",
"\n",
"# Select a subset of 10,000 sentences for quick evaluation\n",
"eval_dataset = dataset.shuffle(seed=42).select(range(10000))\n",
"\n",
"# Evaluate\n",
"metrics = trainer.evaluate()\n",
"eval_loss = metrics[\"eval_loss\"]\n",
"perplexity = math.exp(eval_loss)\n",
"```\n",
"\n",
"> **Perplexity Score** is printed at the end of the cell. A lower perplexity indicates stronger masked token prediction performance and better fit to the domain-specific language.\n",
"\n",
"This provides a quantitative baseline for how well the model understands and reconstructs financial and monetary policy language.\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"execution": {
"iopub.execute_input": "2025-07-20T02:00:40.964023Z",
"iopub.status.busy": "2025-07-20T02:00:40.963322Z",
"iopub.status.idle": "2025-07-20T02:01:32.119053Z",
"shell.execute_reply": "2025-07-20T02:01:32.118331Z",
"shell.execute_reply.started": "2025-07-20T02:00:40.963997Z"
},
"trusted": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"๐ Using device: cuda\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_36/4227637877.py:39: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.\n",
" trainer = Trainer(\n"
]
},
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" [625/625 00:50]\n",
"
\n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"๐ Evaluation Loss: 1.5392\n",
"๐ Perplexity Score (subset of 10000): 4.66\n"
]
}
],
"source": [
"# ๐ฆ Imports\n",
"from transformers import (\n",
" AutoModelForMaskedLM,\n",
" AutoTokenizer,\n",
" DataCollatorForLanguageModeling,\n",
" Trainer,\n",
" TrainingArguments\n",
")\n",
"from datasets import load_from_disk\n",
"import torch\n",
"import math\n",
"\n",
"# ๐ง Ensure GPU is used\n",
"device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
"print(f\"๐ Using device: {device}\")\n",
"\n",
"# ๐ Load model and tokenizer from saved path\n",
"model = AutoModelForMaskedLM.from_pretrained(\"/kaggle/working/bert-mlm-bis\").to(device)\n",
"tokenizer = AutoTokenizer.from_pretrained(\"/kaggle/working/bert-mlm-bis\")\n",
"\n",
"# ๐ Load tokenized dataset and sample subset\n",
"dataset = load_from_disk(\"/kaggle/working/tokenized_bis_dataset\")\n",
"eval_dataset = dataset.shuffle(seed=42).select(range(10000)) # ๐ฝ reduce for speed\n",
"\n",
"# ๐ Data collator for masked LM\n",
"data_collator = DataCollatorForLanguageModeling(\n",
" tokenizer=tokenizer,\n",
" mlm=True,\n",
" mlm_probability=0.15\n",
")\n",
"\n",
"# โ๏ธ Trainer setup\n",
"training_args = TrainingArguments(\n",
" output_dir=\"/kaggle/working/tmp_eval\",\n",
" per_device_eval_batch_size=16,\n",
" report_to=\"none\"\n",
")\n",
"\n",
"trainer = Trainer(\n",
" model=model,\n",
" args=training_args,\n",
" data_collator=data_collator,\n",
" eval_dataset=eval_dataset,\n",
" tokenizer=tokenizer,\n",
")\n",
"\n",
"# ๐ Evaluate and compute perplexity\n",
"metrics = trainer.evaluate()\n",
"eval_loss = metrics[\"eval_loss\"]\n",
"perplexity = math.exp(eval_loss)\n",
"\n",
"print(f\"๐ Evaluation Loss: {eval_loss:.4f}\")\n",
"print(f\"๐ Perplexity Score (subset of 10000): {perplexity:.2f}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Compare Perplexity: BERT-Base vs CB-BERT-MLM\n",
"\n",
"This section evaluates and compares the perplexity of the original `bert-base-uncased` model and the domain-adapted `cb-bert-mlm` on a subset of 10,000 masked sentences from the BIS corpus.\n",
"\n",
"#### Evaluation Setup:\n",
"- Both models use the same evaluation subset and masking strategy (MLM probability = 15%)\n",
"- Performed on GPU (P100) with batch size 16\n",
"- Perplexity is calculated from the evaluation loss: `perplexity = exp(loss)`\n",
"\n",
"#### Output:\n",
"- Perplexity scores are printed for both models\n",
"- Lower perplexity indicates better performance in masked token prediction on financial text\n",
"\n",
"This comparison highlights the impact of domain adaptation through MLM pretraining on central bank communication data."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"execution": {
"iopub.execute_input": "2025-07-20T02:02:58.622861Z",
"iopub.status.busy": "2025-07-20T02:02:58.622182Z",
"iopub.status.idle": "2025-07-20T02:04:40.524560Z",
"shell.execute_reply": "2025-07-20T02:04:40.523804Z",
"shell.execute_reply.started": "2025-07-20T02:02:58.622839Z"
},
"trusted": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"๐ Evaluating: BERT-Base\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']\n",
"- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
"- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n",
"/tmp/ipykernel_36/810192027.py:37: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.\n",
" trainer = Trainer(\n"
]
},
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" [625/625 00:50]\n",
"
\n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"๐ Eval Loss: 2.5698\n",
"๐ Perplexity: 13.06\n",
"\n",
"๐ Evaluating: BIS-BERT-MLM\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_36/810192027.py:37: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Trainer.__init__`. Use `processing_class` instead.\n",
" trainer = Trainer(\n"
]
},
{
"data": {
"text/html": [
"\n",
"
\n",
" \n",
" \n",
" [625/625 00:50]\n",
"
\n",
" "
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"๐ Eval Loss: 1.5392\n",
"๐ Perplexity: 4.66\n",
"\n",
"๐งพ Summary:\n",
"โก๏ธ BERT-Base Perplexity : 13.06\n",
"โก๏ธ BIS-BERT-MLM Perplexity : 4.66\n"
]
}
],
"source": [
"from transformers import (\n",
" AutoModelForMaskedLM,\n",
" AutoTokenizer,\n",
" DataCollatorForLanguageModeling,\n",
" Trainer,\n",
" TrainingArguments\n",
")\n",
"from datasets import load_from_disk\n",
"import math\n",
"import torch\n",
"\n",
"# โ Load the tokenized dataset (use a subset for fast eval)\n",
"dataset = load_from_disk(\"/kaggle/working/tokenized_bis_dataset\")\n",
"eval_dataset = dataset.shuffle(seed=42).select(range(10000)) # adjust size if needed\n",
"\n",
"# โ Common data collator for both models\n",
"def get_data_collator(tokenizer):\n",
" return DataCollatorForLanguageModeling(\n",
" tokenizer=tokenizer,\n",
" mlm=True,\n",
" mlm_probability=0.15\n",
" )\n",
"\n",
"# ๐ Evaluation function\n",
"def evaluate_perplexity(model_path, label):\n",
" print(f\"\\n๐ Evaluating: {label}\")\n",
" tokenizer = AutoTokenizer.from_pretrained(model_path)\n",
" model = AutoModelForMaskedLM.from_pretrained(model_path).to(\"cuda\")\n",
"\n",
" collator = get_data_collator(tokenizer)\n",
" args = TrainingArguments(\n",
" output_dir=\"/kaggle/working/tmp_eval_\" + label.replace(\"-\", \"_\"),\n",
" per_device_eval_batch_size=16,\n",
" report_to=\"none\"\n",
" )\n",
"\n",
" trainer = Trainer(\n",
" model=model,\n",
" args=args,\n",
" eval_dataset=eval_dataset,\n",
" data_collator=collator,\n",
" tokenizer=tokenizer\n",
" )\n",
"\n",
" metrics = trainer.evaluate()\n",
" loss = metrics[\"eval_loss\"]\n",
" perplexity = math.exp(loss)\n",
"\n",
" print(f\"๐ Eval Loss: {loss:.4f}\")\n",
" print(f\"๐ Perplexity: {perplexity:.2f}\")\n",
" return perplexity\n",
"\n",
"# โ๏ธ Compare both models\n",
"p1 = evaluate_perplexity(\"bert-base-uncased\", \"BERT-Base\")\n",
"p2 = evaluate_perplexity(\"/kaggle/working/bert-mlm-bis\", \"BIS-BERT-MLM\")\n",
"\n",
"# ๐ Summary\n",
"print(\"\\n๐งพ Summary:\")\n",
"print(f\"โก๏ธ BERT-Base Perplexity : {p1:.2f}\")\n",
"print(f\"โก๏ธ BIS-BERT-MLM Perplexity : {p2:.2f}\")\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Manual Masked Sentence Evaluation\n",
"\n",
"This section tests the `cb-bert-mlm` model on 20 manually constructed masked sentences based on real central banking and financial policy language.\n",
"\n",
"Each sentence contains a single `[MASK]` token, and is evaluated for whether the model correctly predicts the expected token.\n",
"\n",
"#### Evaluation Highlights:\n",
"- Sentences represent realistic use cases in financial regulation, digital currency, and monetary policy\n",
"- Most mismatches were plausible paraphrases (e.g., synonyms or domain-relevant alternates)\n",
"\n",
"The test demonstrates the model's strong contextual understanding of domain-specific language, particularly in predicting terminology used in central bank communication.\n",
"Results are displayed in a tabular format showing the masked sentence, expected token, predicted token, and whether it matched exactly.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"execution": {
"iopub.execute_input": "2025-07-20T02:15:43.730657Z",
"iopub.status.busy": "2025-07-20T02:15:43.730330Z",
"iopub.status.idle": "2025-07-20T02:15:45.482523Z",
"shell.execute_reply": "2025-07-20T02:15:45.481827Z",
"shell.execute_reply.started": "2025-07-20T02:15:43.730635Z"
},
"trusted": true
},
"outputs": [
{
"data": {
"text/html": [
"
\n",
"\n",
"
\n",
" \n",
"
\n",
"
\n",
"
Sentence
\n",
"
Expected
\n",
"
Predicted
\n",
"
Match?
\n",
"
\n",
" \n",
" \n",
"
\n",
"
0
\n",
"
Central banks are exploring the potential of d...
\n",
"
currencies
\n",
"
##isation
\n",
"
โ
\n",
"
\n",
"
\n",
"
1
\n",
"
The governor highlighted the importance of mon...
\n",
"
policy
\n",
"
policy
\n",
"
โ
\n",
"
\n",
"
\n",
"
2
\n",
"
Inflation expectations remain [MASK] anchored ...
\n",
"
well
\n",
"
well
\n",
"
โ
\n",
"
\n",
"
\n",
"
3
\n",
"
Cross-border [MASK] are still slow and expensive.
\n",
"
payments
\n",
"
payments
\n",
"
โ
\n",
"
\n",
"
\n",
"
4
\n",
"
Financial [MASK] is a key objective for many c...
\n",
"
inclusion
\n",
"
stability
\n",
"
โ
\n",
"
\n",
"
\n",
"
5
\n",
"
Stablecoins pose new [MASK] for regulators and...
\n",
"
challenges
\n",
"
challenges
\n",
"
โ
\n",
"
\n",
"
\n",
"
6
\n",
"
Monetary [MASK] must adapt to technological in...
\n",
"
policy
\n",
"
policy
\n",
"
โ
\n",
"
\n",
"
\n",
"
7
\n",
"
The BIS supports the development of secure dig...
\n",
"
payment
\n",
"
payment
\n",
"
โ
\n",
"
\n",
"
\n",
"
8
\n",
"
Central banks need to coordinate on [MASK] fra...
\n",
"
regulatory
\n",
"
these
\n",
"
โ
\n",
"
\n",
"
\n",
"
9
\n",
"
Emerging markets are experiencing strong capit...
\n",
"
inflows
\n",
"
flows
\n",
"
โ
\n",
"
\n",
"
\n",
"
10
\n",
"
The committee emphasized the need for macropru...
\n",
"
oversight
\n",
"
policies
\n",
"
โ
\n",
"
\n",
"
\n",
"
11
\n",
"
Tokenization of [MASK] could transform financi...
\n",
"
assets
\n",
"
risk
\n",
"
โ
\n",
"
\n",
"
\n",
"
12
\n",
"
Interoperability between payment [MASK] is cru...
\n",
"
systems
\n",
"
systems
\n",
"
โ
\n",
"
\n",
"
\n",
"
13
\n",
"
Cybersecurity [MASK] increase with digital fin...
\n",
"
risks
\n",
"
risks
\n",
"
โ
\n",
"
\n",
"
\n",
"
14
\n",
"
Central banks must ensure [MASK] in digital in...
\n",
"
resilience
\n",
"
trust
\n",
"
โ
\n",
"
\n",
"
\n",
"
15
\n",
"
The future of [MASK] may involve public and pr...
\n",
"
money
\n",
"
finance
\n",
"
โ
\n",
"
\n",
"
\n",
"
16
\n",
"
Pilot [MASK] help central banks understand new...
\n",
"
projects
\n",
"
exercises
\n",
"
โ
\n",
"
\n",
"
\n",
"
17
\n",
"
Legal frameworks need to [MASK] for modern fin...
\n",
"
evolve
\n",
"
evolve
\n",
"
โ
\n",
"
\n",
"
\n",
"
18
\n",
"
Foreign exchange [MASK] have remained relative...
\n",
"
markets
\n",
"
reserves
\n",
"
โ
\n",
"
\n",
"
\n",
"
19
\n",
"
The central bank raised its key interest [MASK...
\n",
"
rate
\n",
"
rate
\n",
"
โ
\n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Sentence Expected Predicted \\\n",
"0 Central banks are exploring the potential of d... currencies ##isation \n",
"1 The governor highlighted the importance of mon... policy policy \n",
"2 Inflation expectations remain [MASK] anchored ... well well \n",
"3 Cross-border [MASK] are still slow and expensive. payments payments \n",
"4 Financial [MASK] is a key objective for many c... inclusion stability \n",
"5 Stablecoins pose new [MASK] for regulators and... challenges challenges \n",
"6 Monetary [MASK] must adapt to technological in... policy policy \n",
"7 The BIS supports the development of secure dig... payment payment \n",
"8 Central banks need to coordinate on [MASK] fra... regulatory these \n",
"9 Emerging markets are experiencing strong capit... inflows flows \n",
"10 The committee emphasized the need for macropru... oversight policies \n",
"11 Tokenization of [MASK] could transform financi... assets risk \n",
"12 Interoperability between payment [MASK] is cru... systems systems \n",
"13 Cybersecurity [MASK] increase with digital fin... risks risks \n",
"14 Central banks must ensure [MASK] in digital in... resilience trust \n",
"15 The future of [MASK] may involve public and pr... money finance \n",
"16 Pilot [MASK] help central banks understand new... projects exercises \n",
"17 Legal frameworks need to [MASK] for modern fin... evolve evolve \n",
"18 Foreign exchange [MASK] have remained relative... markets reserves \n",
"19 The central bank raised its key interest [MASK... rate rate \n",
"\n",
" Match? \n",
"0 โ \n",
"1 โ \n",
"2 โ \n",
"3 โ \n",
"4 โ \n",
"5 โ \n",
"6 โ \n",
"7 โ \n",
"8 โ \n",
"9 โ \n",
"10 โ \n",
"11 โ \n",
"12 โ \n",
"13 โ \n",
"14 โ \n",
"15 โ \n",
"16 โ \n",
"17 โ \n",
"18 โ \n",
"19 โ "
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from transformers import BertForMaskedLM, BertTokenizerFast\n",
"import torch\n",
"import pandas as pd\n",
"from IPython.display import display\n",
"\n",
"# 1) Load trained MLM\n",
"model_path = \"/kaggle/working/bert-mlm-bis\"\n",
"tokenizer = BertTokenizerFast.from_pretrained(model_path)\n",
"model = BertForMaskedLM.from_pretrained(model_path)\n",
"model.eval()\n",
"\n",
"# 2) Manual maskedโsentence test set\n",
"masked_data = [\n",
" (\"Central banks are exploring the potential of digital [MASK].\", \"currencies\"),\n",
" (\"The governor highlighted the importance of monetary [MASK] transparency.\", \"policy\"),\n",
" (\"Inflation expectations remain [MASK] anchored across most economies.\", \"well\"),\n",
" (\"Cross-border [MASK] are still slow and expensive.\", \"payments\"),\n",
" (\"Financial [MASK] is a key objective for many central banks.\", \"inclusion\"),\n",
" (\"Stablecoins pose new [MASK] for regulators and policymakers.\", \"challenges\"),\n",
" (\"Monetary [MASK] must adapt to technological innovation.\", \"policy\"),\n",
" (\"The BIS supports the development of secure digital [MASK] systems.\", \"payment\"),\n",
" (\"Central banks need to coordinate on [MASK] frameworks.\", \"regulatory\"),\n",
" (\"Emerging markets are experiencing strong capital [MASK].\", \"inflows\"),\n",
" (\"The committee emphasized the need for macroprudential [MASK].\", \"oversight\"),\n",
" (\"Tokenization of [MASK] could transform financial markets.\", \"assets\"),\n",
" (\"Interoperability between payment [MASK] is crucial.\", \"systems\"),\n",
" (\"Cybersecurity [MASK] increase with digital financial services.\", \"risks\"),\n",
" (\"Central banks must ensure [MASK] in digital infrastructure.\", \"resilience\"),\n",
" (\"The future of [MASK] may involve public and private sector collaboration.\", \"money\"),\n",
" (\"Pilot [MASK] help central banks understand new financial instruments.\", \"projects\"),\n",
" (\"Legal frameworks need to [MASK] for modern financial technology.\", \"evolve\"),\n",
" (\"Foreign exchange [MASK] have remained relatively stable.\", \"markets\"),\n",
" (\"The central bank raised its key interest [MASK] by 25 basis points.\", \"rate\"),\n",
"]\n",
"\n",
"# 3) Run predictions\n",
"results = []\n",
"for sent, true_word in masked_data:\n",
" # encode + mask\n",
" inputs = tokenizer(sent, return_tensors=\"pt\")\n",
" mask_index = torch.where(inputs.input_ids[0] == tokenizer.mask_token_id)[0]\n",
"\n",
" # forward pass\n",
" with torch.no_grad():\n",
" logits = model(**inputs).logits\n",
"\n",
" # pick top-1\n",
" token_id = logits[0, mask_index, :].argmax(dim=-1).item()\n",
" pred = tokenizer.decode([token_id]).strip()\n",
"\n",
" results.append({\n",
" \"Sentence\": sent,\n",
" \"Expected\": true_word,\n",
" \"Predicted\": pred,\n",
" \"Match?\": \"โ \" if pred.lower() == true_word.lower() else \"โ\"\n",
" })\n",
"\n",
"# 4) Show as DataFrame\n",
"df = pd.DataFrame(results)\n",
"display(df)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Top-K Accuracy Evaluation on 100,000 Randomly Masked Sentences\n",
"\n",
"This section evaluates the `cb-bert-mlm` model's ability to recover randomly masked words in context across **100,000 test sentences**. The procedure involves:\n",
"\n",
"#### Procedure:\n",
"\n",
"1. **Sentence Sampling** \n",
" 100,000 random sentences were sampled from the BIS preprocessed dataset.\n",
"\n",
"2. **Masking Strategy** \n",
" One random eligible word (min sentence length = 5, alphabetic tokens only) was replaced with `[MASK]` in each sentence.\n",
"\n",
"3. **Prediction** \n",
" The model generated **Top-K token predictions** for the masked position, with `k` ranging from 1 to 20.\n",
"\n",
"4. **Accuracy Computation** \n",
" A prediction is considered correct if the original word appears in the top-K list. The accuracy is computed as: \n",
" \\[\n",
" \\text{Top-k Accuracy} = \\frac{\\text{\\# correct predictions}}{\\text{total samples}} \\times 100\n",
" \\]\n",
"\n",
"\n",
"#### Results:\n",
"\n",
"> *Exact values are printed at the end of the cell and visualized in the curve below.*\n",
"\n",
"\n",
"#### Top-K Accuracy Curve\n",
"\n",
"A line plot visualizes model performance across increasing values of `k`, showing how quickly prediction confidence saturates.\n",
"\n",
"This benchmark confirms the model's strong ability to predict masked financial-domain tokens, with over **90% Top-20 accuracy**.\n"
]
},
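{
"cell_type": "markdown",
"metadata": {},
"source": [
"The accuracy computation in step 4 can be sketched in isolation. This is a minimal illustration rather than the notebook's exact code: `top_k_accuracy` and the toy words below are hypothetical, and each prediction list is assumed to be sorted from most to least likely.\n",
"\n",
"```python\n",
"def top_k_accuracy(true_words, ranked_predictions, k):\n",
"    '''Percentage of samples whose true word is among the first k predictions.'''\n",
"    if not true_words:\n",
"        return 0.0\n",
"    hits = sum(\n",
"        true in preds[:k]\n",
"        for true, preds in zip(true_words, ranked_predictions)\n",
"    )\n",
"    return hits / len(true_words) * 100\n",
"\n",
"\n",
"# Toy data (made up for illustration): one ranked candidate list per sample\n",
"true_words = ['rate', 'policy', 'markets']\n",
"ranked_predictions = [\n",
"    ['rate', 'rates', 'level'],\n",
"    ['stability', 'policy', 'union'],\n",
"    ['reserves', 'rates', 'flows'],\n",
"]\n",
"\n",
"for k in (1, 2, 3):\n",
"    print(k, round(top_k_accuracy(true_words, ranked_predictions, k), 2))\n",
"```\n",
"\n",
"Accuracy can only stay flat or rise as `k` grows, since the top-`k` list always contains the top-`(k-1)` list."
]
},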
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"execution": {
"iopub.execute_input": "2025-07-20T02:42:10.239306Z",
"iopub.status.busy": "2025-07-20T02:42:10.239002Z",
"iopub.status.idle": "2025-07-20T03:00:16.278533Z",
"shell.execute_reply": "2025-07-20T03:00:16.277747Z",
"shell.execute_reply.started": "2025-07-20T02:42:10.239285Z"
},
"trusted": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"โ๏ธ Using device: cuda\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"๐ Evaluating (Topโk): 100%|โโโโโโโโโโ| 100000/100000 [17:43<00:00, 94.01it/s]\n"
]
},
{
"data": {
"image/png": "",
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"Top- 1 Accuracy: 63.84%\n",
"Top- 2 Accuracy: 74.24%\n",
"Top- 3 Accuracy: 78.77%\n",
"Top- 4 Accuracy: 81.41%\n",
"Top- 5 Accuracy: 83.10%\n",
"Top- 6 Accuracy: 84.45%\n",
"Top- 7 Accuracy: 85.43%\n",
"Top- 8 Accuracy: 86.25%\n",
"Top- 9 Accuracy: 86.90%\n",
"Top-10 Accuracy: 87.49%\n",
"Top-11 Accuracy: 87.94%\n",
"Top-12 Accuracy: 88.37%\n",
"Top-13 Accuracy: 88.75%\n",
"Top-14 Accuracy: 89.07%\n",
"Top-15 Accuracy: 89.33%\n",
"Top-16 Accuracy: 89.59%\n",
"Top-17 Accuracy: 89.85%\n",
"Top-18 Accuracy: 90.07%\n",
"Top-19 Accuracy: 90.28%\n",
"Top-20 Accuracy: 90.46%\n"
]
}
],
"source": [
"import pandas as pd\n",
"import random\n",
"import torch\n",
"from transformers import BertTokenizerFast, BertForMaskedLM\n",
"from tqdm import tqdm\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# ===============================\n",
"# ๐น Step 1: Load raw BIS sentences\n",
"# ===============================\n",
"df = pd.read_csv(\"/kaggle/input/bis-speeches/speeches_data_preprocessed.csv\")\n",
"df = df[df[\"processed_text\"].notna()]\n",
"df[\"processed_text\"] = df[\"processed_text\"].apply(eval)\n",
"sentences = [sentence for sublist in df[\"processed_text\"] for sentence in sublist]\n",
"\n",
"# ===============================\n",
"# ๐น Step 2: Setup device, model & tokenizer\n",
"# ===============================\n",
"device = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n",
"print(\"โ๏ธ Using device:\", device)\n",
"\n",
"model_path = \"/kaggle/working/bert-mlm-bis\"\n",
"tokenizer = BertTokenizerFast.from_pretrained(model_path)\n",
"model = BertForMaskedLM.from_pretrained(model_path).to(device)\n",
"model.eval()\n",
"\n",
"# ===============================\n",
"# ๐น Step 3: Function to mask one word in a sentence\n",
"# ===============================\n",
"def mask_random_word(sentence):\n",
" words = sentence.strip().split()\n",
" if len(words) < 5:\n",
" return None\n",
" # choose only alphabetic tokens\n",
" candidates = [i for i, w in enumerate(words) if w.isalpha()]\n",
" if not candidates:\n",
" return None\n",
" idx = random.choice(candidates)\n",
" true_word = words[idx]\n",
" words[idx] = \"[MASK]\"\n",
" return \" \".join(words), true_word\n",
"\n",
"# ===============================\n",
"# ๐น Step 4: Generate 100,000 masked test samples\n",
"# ===============================\n",
"masked_samples = []\n",
"for sent in random.sample(sentences, len(sentences)):\n",
" pair = mask_random_word(sent)\n",
" if pair:\n",
" masked_samples.append(pair)\n",
" if len(masked_samples) >= 100000:\n",
" break\n",
"\n",
"df_masked = pd.DataFrame(masked_samples, columns=[\"Sentence with [MASK]\", \"Masked Word\"])\n",
"\n",
"# ===============================\n",
"# ๐น Step 5: Evaluate Topโk Accuracy\n",
"# ===============================\n",
"results = []\n",
"max_k = 20\n",
"\n",
"for _, row in tqdm(df_masked.iterrows(), total=len(df_masked), desc=\"๐ Evaluating (Topโk)\"):\n",
" masked_sentence = row[\"Sentence with [MASK]\"]\n",
" true_word = row[\"Masked Word\"].lower().strip()\n",
"\n",
" # Tokenize with truncation & padding\n",
" inputs = tokenizer(\n",
" masked_sentence,\n",
" return_tensors=\"pt\",\n",
" truncation=True,\n",
" max_length=128,\n",
" padding=\"max_length\"\n",
" ).to(device)\n",
"\n",
" mask_indices = torch.where(inputs.input_ids[0] == tokenizer.mask_token_id)[0]\n",
" if len(mask_indices) != 1:\n",
" continue\n",
" mask_idx = mask_indices.item()\n",
"\n",
" # Forward pass\n",
" with torch.no_grad():\n",
" outputs = model(**inputs)\n",
" logits = outputs.logits\n",
"\n",
" # Get topโk predictions\n",
" mask_logits = logits[0, mask_idx]\n",
" topk = torch.topk(mask_logits, k=max_k).indices.tolist()\n",
" top_tokens = [tokenizer.decode([tid]).strip().lower() for tid in topk]\n",
"\n",
" results.append({\n",
" \"Masked Word\": true_word,\n",
" \"Top-k Predictions\": top_tokens\n",
" })\n",
"\n",
"# ===============================\n",
"# ๐น Step 6: Compute Topโk Accuracy Curve\n",
"# ===============================\n",
"k_range = list(range(1, max_k+1))\n",
"accuracies = []\n",
"total = len(results)\n",
"\n",
"for k in k_range:\n",
" correct = sum(true in preds[:k] for true, preds in \n",
" [(r[\"Masked Word\"], r[\"Top-k Predictions\"]) for r in results])\n",
" accuracies.append(correct/total*100)\n",
"\n",
"# ===============================\n",
"# ๐น Step 7: Plot Topโk Curve\n",
"# ===============================\n",
"plt.figure(figsize=(10,6))\n",
"plt.plot(k_range, accuracies, marker='o')\n",
"plt.title(\"Topโk Accuracy Curve (BISโBERTโMLM)\", fontsize=14)\n",
"plt.xlabel(\"k\", fontsize=12)\n",
"plt.ylabel(\"Accuracy (%)\", fontsize=12)\n",
"plt.xticks(k_range)\n",
"plt.grid(True)\n",
"plt.ylim(0, 100)\n",
"plt.show()\n",
"\n",
"# ===============================\n",
"# ๐น Step 8: Print Summary\n",
"# ===============================\n",
"for k, acc in zip(k_range, accuracies):\n",
" print(f\"Top-{k:2d} Accuracy: {acc:5.2f}%\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Corpus Statistics and Training Metadata Summary\n",
"\n",
"This section computes descriptive statistics for the corpus, tokenizer, and model, and documents training configurations used for pretraining `cb-bert-mlm`.\n",
"\n",
"These figures provide reproducibility and clarity for evaluating the scale and setup of the domain-adaptive masked language modeling process."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"โ Loaded tokenized dataset with 2087615 sentences.\n"
]
},
{
"data": {
"application/vnd.jupyter.widget-view+json": {
"model_id": "9a15e03981b9432da3ea1226c3269018",
"version_major": 2,
"version_minor": 0
},
"text/plain": [
"Map: 0%| | 0/2087615 [00:00, ? examples/s]"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"๐ข Total number of MLM sentences: 2087615\n",
"๐ก Total number of tokens used: 66359113\n"
]
}
],
"source": [
"from datasets import load_from_disk\n",
"from transformers import BertTokenizerFast\n",
"\n",
"# === Step 1: Load tokenized dataset ===\n",
"dataset_path = \"./tokenized-bis-dataset\" \n",
"dataset = load_from_disk(dataset_path)\n",
"print(f\"โ Loaded tokenized dataset with {len(dataset)} sentences.\")\n",
"\n",
"# === Step 2: Load tokenizer ===\n",
"tokenizer_path = \"./cb-bert-mlm\" \n",
"tokenizer = BertTokenizerFast.from_pretrained(tokenizer_path)\n",
"\n",
"# === Step 3: Count tokens ===\n",
"def count_tokens(example):\n",
" return {\"num_tokens\": sum(example['attention_mask'])} \n",
"\n",
"token_counts = dataset.map(count_tokens, remove_columns=dataset.column_names)\n",
"total_tokens = sum(token_counts[\"num_tokens\"])\n",
"\n",
"# === Output results ===\n",
"print(f\"๐ข Total number of MLM sentences: {len(dataset)}\")\n",
"print(f\"๐ก Total number of tokens used: {total_tokens}\")\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"๐ Average tokens per sentence: 31.79\n",
"๐ Tokenizer vocab size: 30522\n",
"๐ง Total model parameters: 109,514,298\n",
"๐ง Trainable parameters: 109,514,298\n",
"\n",
"๐ Training Metadata:\n",
"๐ Epochs: 1\n",
"๐ฆ Batch size per device: 16\n",
"๐งฎ Gradient Accumulation: 2\n",
"๐งช Effective Batch Size: 32\n",
"๐ข Max sequence length: 128\n",
"๐ญ MLM Probability: 15.0%\n",
"๐ป Device: GPU P100\n",
"๐งฎ Mixed Precision (fp16): True\n"
]
}
],
"source": [
"import torch\n",
"from transformers import BertForMaskedLM, BertTokenizerFast\n",
"\n",
"# === Corpus Stats ===\n",
"avg_tokens_per_sentence = total_tokens / len(dataset)\n",
"print(f\"๐ Average tokens per sentence: {avg_tokens_per_sentence:.2f}\")\n",
"\n",
"# === Tokenizer Stats ===\n",
"tokenizer = BertTokenizerFast.from_pretrained(\"./cb-bert-mlm\") # or \"bert-base-uncased\"\n",
"vocab_size = tokenizer.vocab_size\n",
"print(f\"๐ Tokenizer vocab size: {vocab_size}\")\n",
"\n",
"# === Model Stats ===\n",
"model = BertForMaskedLM.from_pretrained(\"./cb-bert-mlm\") # or saved model dir\n",
"total_params = sum(p.numel() for p in model.parameters())\n",
"trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
"\n",
"print(f\"๐ง Total model parameters: {total_params:,}\")\n",
"print(f\"๐ง Trainable parameters: {trainable_params:,}\")\n",
"\n",
"# === Training Meta (manually input) ===\n",
"training_epochs = 1\n",
"max_seq_length = 128\n",
"batch_size = 16\n",
"grad_accum = 2\n",
"mlm_prob = 0.15\n",
"device_used = \"GPU P100\"\n",
"mixed_precision = True # โ based on actual training logs\n",
"\n",
"print(\"\\n๐ Training Metadata:\")\n",
"print(f\"๐ Epochs: {training_epochs}\")\n",
"print(f\"๐ฆ Batch size per device: {batch_size}\")\n",
"print(f\"๐งฎ Gradient Accumulation: {grad_accum}\")\n",
"print(f\"๐งช Effective Batch Size: {batch_size * grad_accum}\")\n",
"print(f\"๐ข Max sequence length: {max_seq_length}\")\n",
"print(f\"๐ญ MLM Probability: {mlm_prob * 100}%\")\n",
"print(f\"๐ป Device: {device_used}\")\n",
"print(f\"๐งฎ Mixed Precision (fp16): {mixed_precision}\")\n"
]
},
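{
"cell_type": "markdown",
"metadata": {},
"source": [
"The figures printed above can be cross-checked by hand. The step count is only an estimate: it assumes a single device, training on all 2,087,615 sentences, and no dropped final batch, none of which is confirmed by the output shown here.\n",
"\n",
"$$\n",
"\\text{average tokens per sentence} = \\frac{66{,}359{,}113}{2{,}087{,}615} \\approx 31.79\n",
"$$\n",
"\n",
"$$\n",
"\\text{effective batch size} = 16 \\times 2 = 32, \\qquad \\text{optimizer steps per epoch} \\approx \\left\\lceil \\frac{2{,}087{,}615}{32} \\right\\rceil = 65{,}238\n",
"$$"
]
},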
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Package Model for Upload\n",
"\n",
"The trained model is zipped for easy download and uploading to Hugging Face.\n",
"\n",
"```python\n",
"# Zip the fine-tuned model\n",
"shutil.make_archive(\"/kaggle/working/BIS-BERT-MLM\", 'zip', \"/kaggle/working/bert-mlm-bis\")\n",
"```\n",
"\n",
"- `BIS-BERT-MLM.zip`: Contains all model files (`config`, `pytorch_model`, tokenizer, vocab, etc.).\n",
"\n",
"These archives are ready for upload to the Hugging Face Model Hub and Dataset Hub respectively."
]
},
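{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before uploading, the archive contents can be checked locally. The sketch below is illustrative only: `list_archive` is a hypothetical helper, and it is demonstrated on a throwaway directory with placeholder files rather than the real model folder.\n",
"\n",
"```python\n",
"import shutil\n",
"import tempfile\n",
"import zipfile\n",
"from pathlib import Path\n",
"\n",
"\n",
"def list_archive(zip_path):\n",
"    '''Return the sorted file names stored in a zip archive.'''\n",
"    with zipfile.ZipFile(zip_path) as zf:\n",
"        return sorted(zf.namelist())\n",
"\n",
"\n",
"# Build a small stand-in directory (placeholder files, not real model weights)\n",
"workdir = Path(tempfile.mkdtemp())\n",
"model_dir = workdir / 'demo-model'\n",
"model_dir.mkdir()\n",
"(model_dir / 'config.json').write_text('{}')\n",
"(model_dir / 'vocab.txt').write_text('[PAD]')\n",
"\n",
"# make_archive appends the .zip suffix and returns the archive path\n",
"archive_path = shutil.make_archive(str(workdir / 'demo-model'), 'zip', str(model_dir))\n",
"print(list_archive(archive_path))\n",
"```\n",
"\n",
"For the real archive, the same helper can be pointed at `/kaggle/working/BIS-BERT-MLM.zip`."
]
},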
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"execution": {
"iopub.execute_input": "2025-07-20T01:38:42.849727Z",
"iopub.status.busy": "2025-07-20T01:38:42.849095Z",
"iopub.status.idle": "2025-07-20T01:40:08.481753Z",
"shell.execute_reply": "2025-07-20T01:40:08.481099Z",
"shell.execute_reply.started": "2025-07-20T01:38:42.849695Z"
},
"trusted": true
},
"outputs": [
{
"data": {
"text/plain": [
"'/kaggle/working/BIS-BERT-MLM.zip'"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import shutil\n",
"\n",
"# Zip the entire model directory\n",
"shutil.make_archive(\"/kaggle/working/BIS-BERT-MLM\", 'zip', \"/kaggle/working/bert-mlm-bis\")\n"
]
}
],
"metadata": {
"kaggle": {
"accelerator": "gpu",
"dataSources": [
{
"datasetId": 7900125,
"sourceId": 12515905,
"sourceType": "datasetVersion"
}
],
"dockerImageVersionId": 31090,
"isGpuEnabled": true,
"isInternetEnabled": true,
"language": "python",
"sourceType": "notebook"
},
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.12.7"
}
},
"nbformat": 4,
"nbformat_minor": 4
}