---
language: en
license: mit
library_name: transformers
tags:
- climate-change
- domain-specific
- masked-language-modeling
- scientific-nlp
- transformer
- BERT
metrics:
- f1
model-index:
- name: CliReBERT
  results:
  - task:
      type: text-classification
      name: Climate NLP Tasks (ClimaBench)
    dataset:
      name: ClimaBench
      type: benchmark
    metrics:
    - type: f1
      name: Macro F1 (avg)
      value: 65.447
---
# CliReBERT 🌍🧠
CliReBERT (Climate Research BERT) is a domain-specific BERT model pretrained from scratch on a curated corpus of peer-reviewed climate change research papers. It is built to support natural language processing tasks in climate science and environmental studies.
## 🔍 Overview
- Architecture: BERT-base (uncased)
- Parameters: ~110M
- Pretraining Objective: Masked Language Modeling (MLM)
- Tokenizer: Trained from scratch (WordPiece) on the same domain corpus
- Language: English
- Domain: Climate change research (scientific)
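Because the WordPiece vocabulary was trained on the same climate-research corpus, domain terms tend to be kept as whole tokens rather than split into many sub-word pieces. A quick way to inspect this yourself (the exact splits depend on the released vocabulary):

```python
from transformers import AutoTokenizer

# Load the domain-specific tokenizer that ships with the model
tokenizer = AutoTokenizer.from_pretrained("P0L3/clirebert_clirevocab_uncased")

# Compare how climate-science terms are split into WordPiece tokens
for term in ["radiative forcing", "permafrost thaw", "anthropogenic emissions"]:
    print(term, "->", tokenizer.tokenize(term))
```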
## 📊 Performance
Evaluated on ClimaBench (a climate-focused NLP benchmark):
| Metric | Value |
|---|---|
| Macro F1 (avg) | 65.45 |
| Tasks won | 3 / 7 |
| Avg. std. dev. | 0.0118 |
CliReBERT outperformed baseline models such as SciBERT, RoBERTa, and ClimateBERT on key tasks.
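The scores above are macro F1, i.e., per-class F1 averaged with equal weight per class. The snippet below is only a minimal illustration of how downstream predictions would be scored with this metric; the labels are made up, not ClimaBench data:

```python
from sklearn.metrics import f1_score

# Hypothetical gold labels and model predictions for a 3-class climate task
y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 1, 2, 1, 1, 0, 2]

# Macro F1: per-class F1 scores averaged with equal weight,
# the aggregation used for the ClimaBench numbers above
print(f1_score(y_true, y_pred, average="macro"))
```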
Climate performance model card:

| | CliReBERT |
|---|---|
| 1. Model publicly available? | Yes |
| 2. Time to train final model | 463 h |
| 3. Time for all experiments | 1,226 h (~51 days) |
| 4. Power of GPU and CPU | 0.250 kW + 0.013 kW |
| 5. Location for computations | Croatia |
| 6. Energy mix at location | 224.71 gCO₂eq/kWh |
| 7. CO₂eq for final model | 28 kg CO₂ |
| 8. CO₂eq for all experiments | 74 kg CO₂ |
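The reported emissions are consistent with a simple power × time × grid-intensity estimate; the sketch below reproduces the final-model figure from the values in the table (rounding is mine):

```python
# Back-of-the-envelope check of the reported CO2eq for the final model
power_kw = 0.250 + 0.013          # GPU + CPU power from the table
hours_final = 463                 # training time for the final model
grid_gco2_per_kwh = 224.71        # energy mix at the compute location

energy_kwh = power_kw * hours_final
co2_kg = energy_kwh * grid_gco2_per_kwh / 1000
print(f"{energy_kwh:.1f} kWh, ~{co2_kg:.1f} kg CO2eq")  # ~121.8 kWh, ~27.4 kg (reported: 28 kg)
```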
## 🧪 Intended Uses
Use for:
- Scientific information extraction in climate change research
- Classification, relation extraction, and document tagging in climate-related corpora
- Enhancing climate-focused knowledge graph construction
Not suitable for:
- General-purpose NLP tasks
- Text outside the scientific environmental domain
- Non-English applications
Example:
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline
import torch

# Load the pretrained model and tokenizer
model_name = "P0L3/clirebert_clirevocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Use the GPU (device index 0) if available, otherwise fall back to the CPU
device = 0 if torch.cuda.is_available() else -1

# Create a fill-mask pipeline
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer, device=device)

# Example input from scientific climate literature
text = "The increase in greenhouse gas emissions has significantly affected the [MASK] balance of the Earth."

# Run prediction
predictions = fill_mask(text)

# Show top predictions
print(text)
print(10 * ">")
for p in predictions:
    print(f"{p['sequence']} — {p['score']:.4f}")
```
Output:
```
The increase in greenhouse gas emissions has significantly affected the [MASK] balance of the Earth.
>>>>>>>>>>
the increase in greenhouse gas ... affected the energy balance of the earth . — 0.6922
the increase in greenhouse gas ... affected the mass balance of the earth . — 0.0631
the increase in greenhouse gas ... affected the radiation balance of the earth . — 0.0606
the increase in greenhouse gas ... affected the radiative balance of the earth . — 0.0517
the increase in greenhouse gas ... affected the carbon balance of the earth . — 0.0365
```
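For the classification and document-tagging use cases listed under Intended Uses, the pretrained encoder can be fine-tuned with a sequence-classification head. The sketch below is illustrative only: the label count, example texts, and training settings are placeholders, not the setup from the paper.

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import Dataset

model_name = "P0L3/clirebert_clirevocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Placeholder 2-class dataset; replace with a real climate-domain corpus
data = Dataset.from_dict({
    "text": ["Glacier mass loss accelerated after 2000.",
             "The board approved the quarterly budget."],
    "label": [1, 0],
})
data = data.map(lambda x: tokenizer(x["text"], truncation=True,
                                    padding="max_length", max_length=128),
                batched=True)

# Load the encoder with a freshly initialized classification head
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="clirebert-cls", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=data,
)
trainer.train()
```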
## ⚠️ Limitations
- Trained only on scientific literature (limited sociopolitical text exposure)
- Monolingual (English)
- May reflect publication biases from the scientific community
## 🧾 Citation
If you use this model, please cite:
```bibtex
@article{poleksic_etal_2025,
  title={Climate Research Domain BERTs: Pretraining, Adaptation, and Evaluation},
  author={Poleksić, Andrija and Martinčić-Ipšić, Sanda},
  journal={PREPRINT (Version 1)},
  year={2025},
  doi={https://doi.org/10.21203/rs.3.rs-6644722/v1}
}
```