---
language: en
license: mit
library_name: transformers
tags:
- climate-change
- domain-adaptation
- masked-language-modeling
- scientific-nlp
- transformer
- BERT
- SciBERT
metrics:
- f1
model-index:
- name: CliSciBERT
  results:
  - task:
      type: text-classification
      name: Climate NLP Tasks (ClimaBench)
    dataset:
      name: ClimaBench
      type: benchmark
    metrics:
    - type: f1
      name: Macro F1 (avg)
      value: 60.502
---
# CliSciBERT 🌿📚
CliSciBERT is a domain-adapted version of SciBERT, further pretrained on a curated corpus of peer-reviewed research papers in the climate change domain. It is designed to enhance performance on climate-focused scientific NLP tasks by adapting the general scientific knowledge of SciBERT to the specialized subdomain of climate research.
## 🔍 Overview
- Base Model: SciBERT (BERT-base architecture, scientific vocab)
- Pretraining Method: Continued pretraining (domain adaptation) using Masked Language Modeling (MLM); see the sketch below the list
- Corpus: Scientific papers focused on climate change and environmental science
- Tokenizer: SciBERT tokenizer (unchanged)
- Language: English
- Domain: Climate change research
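The continued-pretraining step can be illustrated with the Hugging Face `Trainer` and an MLM data collator. This is a minimal sketch: the corpus file, batch size, epoch count, and learning rate are placeholders, not the configuration actually used for CliSciBERT.

```python
# Sketch of domain-adaptive pretraining (MLM) starting from SciBERT.
# Corpus path and hyperparameters are illustrative placeholders.
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

base = "allenai/scibert_scivocab_uncased"        # SciBERT starting checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)  # tokenizer stays unchanged
model = AutoModelForMaskedLM.from_pretrained(base)

# Hypothetical text file with one passage of climate-paper text per line
corpus = load_dataset("text", data_files={"train": "climate_papers.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Standard BERT-style masking of 15% of tokens
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="cliscibert-mlm",
    per_device_train_batch_size=16,  # placeholder
    num_train_epochs=3,              # placeholder
    learning_rate=5e-5,              # placeholder
)

Trainer(model=model, args=args, train_dataset=tokenized["train"], data_collator=collator).train()
```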
## 📊 Performance
Evaluated on ClimaBench, a benchmark for climate-focused NLP tasks:
| Metric | Value |
|---|---|
| Macro F1 (avg) | 60.50 |
| Tasks won | 0/7 |
| Avg. Std. Dev. | 0.01772 |
Note: While CliSciBERT builds on SciBERT’s scientific grounding, its domain specialization improves relevance for climate-related NLP tasks.
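The headline number is the macro F1 score computed on each of the seven ClimaBench tasks and then averaged across tasks. A minimal illustration with scikit-learn, using made-up scores:

```python
# Macro F1 per task, then a plain average across tasks.
# All numbers here are placeholders, not ClimaBench results.
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 1, 0]  # gold labels for one task
y_pred = [0, 1, 1, 1, 0]  # model predictions for that task
task_f1 = f1_score(y_true, y_pred, average="macro")

per_task_f1 = [task_f1, 0.61, 0.58]                 # one entry per benchmark task
avg_macro_f1 = sum(per_task_f1) / len(per_task_f1)  # the reported "Macro F1 (avg)"
print(avg_macro_f1)
```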
Climate performance model card:
| CliSciBERT | |
|---|---|
| 1. Model publicly available? | Yes |
| 2. Time to train final model | 463 h |
| 3. Time for all experiments | 1,226 h (~51 days) |
| 4. Power of GPU and CPU | 0.250 kW + 0.013 kW |
| 5. Location for computations | Croatia |
| 6. Energy mix at location | 224.71 gCO2eq/kWh |
| 7. CO$_2$eq for final model | 28 kg CO$_2$eq |
| 8. CO$_2$eq for all experiments | 74 kg CO$_2$eq |
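The CO$_2$eq figures follow from power draw, runtime, and the grid's carbon intensity (energy in kWh times gCO2eq/kWh). A quick check against the table values; the small gaps to the reported numbers are presumably due to rounding and additional overheads:

```python
# Emissions (kg) = power (kW) * time (h) * grid intensity (gCO2eq/kWh) / 1000
power_kw = 0.250 + 0.013  # GPU + CPU, from the table
intensity = 224.71        # gCO2eq/kWh for Croatia, from the table

final_model = power_kw * 463 * intensity / 1000       # ~27.4 kg (reported: 28 kg)
all_experiments = power_kw * 1226 * intensity / 1000  # ~72.5 kg (reported: 74 kg)
print(round(final_model, 1), round(all_experiments, 1))
```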
## 🧪 Intended Uses
Use for:
- Scientific text classification and relation extraction in climate change literature (a fine-tuning sketch follows the lists below)
- Domain-specific document tagging or summarization
- Supporting knowledge graph population for climate research
Not recommended for:
- Non-climate or general news content
- Non-English corpora
- Highly informal or colloquial text
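For the classification use cases above, a minimal fine-tuning sketch is given here; the CSV files, label count, and hyperparameters are placeholders, not a recommended setup:

```python
# Fine-tuning sketch for climate text classification.
# Dataset files, num_labels, and hyperparameters are placeholders.
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

model_name = "P0L3/clirebert_clirevocab_uncased"  # checkpoint from the example below
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Hypothetical CSVs with "text" and "label" columns
data = load_dataset("csv", data_files={"train": "train.csv", "validation": "dev.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

tokenized = data.map(tokenize, batched=True)

args = TrainingArguments(output_dir="cliscibert-cls", num_train_epochs=3, per_device_train_batch_size=16)
Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
).train()
```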
Fill-mask example:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline
import torch

# Load the pretrained model and tokenizer
model_name = "P0L3/clirebert_clirevocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Use the GPU for the pipeline if one is available
device = 0 if torch.cuda.is_available() else -1

# Create a fill-mask pipeline
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer, device=device)

# Example input from scientific climate literature
text = "The increase in greenhouse gas emissions has significantly affected the [MASK] balance of the Earth."

# Run prediction
predictions = fill_mask(text)

# Show top predictions
print(text)
print(10 * ">")
for p in predictions:
    print(f"{p['sequence']} — {p['score']:.4f}")
```
Output:

```text
The increase in greenhouse gas emissions has significantly affected the [MASK] balance of the Earth.
>>>>>>>>>>
the increase in greenhouse gas ... affected the energy balance of the earth. — 0.3911
the increase in greenhouse gas ... affected the radiative balance of the earth. — 0.2640
the increase in greenhouse gas ... affected the radiation balance of the earth. — 0.1233
the increase in greenhouse gas ... affected the carbon balance of the earth. — 0.0589
the increase in greenhouse gas ... affected the ecological balance of the earth. — 0.0332
```
## ⚠️ Limitations
- Retains SciBERT’s limitations outside the scientific domain
- May inherit biases from climate change literature
- No tokenizer retraining: tokenization is optimized for general scientific text rather than climate-specific vocabulary (see the check below)
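Since the SciBERT tokenizer is reused unchanged, climate-specific terms may be split into several general-science subwords. The segmentation can be inspected directly; the example terms are illustrative and the exact splits depend on the vocabulary:

```python
# Check how the unchanged SciBERT vocabulary segments climate-specific terms.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
for term in ["radiative forcing", "permafrost thaw", "anthropogenic emissions"]:
    print(term, "->", tokenizer.tokenize(term))
```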
## 🧾 Citation
If you use this model, please cite:
```bibtex
@article{poleksic_etal_2025,
  title={Climate Research Domain BERTs: Pretraining, Adaptation, and Evaluation},
  author={Poleksić, Andrija and Martinčić-Ipšić, Sanda},
  journal={None},
  year={2025}
}
```