---
language: en
license: mit
library_name: transformers
tags:
  - climate-change
  - domain-adaptation
  - masked-language-modeling
  - scientific-nlp
  - transformer
  - BERT
  - SciBERT
metrics:
  - f1
model-index:
  - name: CliSciBERT
    results:
      - task:
          type: text-classification
          name: Climate NLP Tasks (ClimaBench)
        dataset:
          name: ClimaBench
          type: benchmark
        metrics:
          - type: f1
            name: Macro F1 (avg)
            value: 60.502
---

# CliSciBERT 🌿📚

**CliSciBERT** is a domain-adapted version of [**SciBERT**](https://huggingface.co/allenai/scibert_scivocab_uncased), further pretrained on a curated corpus of peer-reviewed research papers in the climate change domain. It is designed to enhance performance on climate-focused scientific NLP tasks by adapting the general scientific knowledge of SciBERT to the specialized subdomain of climate research.

## 🔍 Overview

- **Base Model**: SciBERT (BERT-base architecture, scientific vocab)
- **Pretraining Method**: Continued pretraining (domain adaptation) using Masked Language Modeling (MLM); see the setup sketch below
- **Corpus**: Scientific papers focused on climate change and environmental science
- **Tokenizer**: SciBERT tokenizer (unchanged)
- **Language**: English
- **Domain**: Climate change research

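The domain adaptation step is plain continued pretraining with the MLM objective, starting from the SciBERT checkpoint and its unchanged tokenizer. Below is a minimal sketch using 🤗 Transformers; the corpus file name, batch size, epochs, and learning rate are illustrative assumptions, not the published training configuration.

``` python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from SciBERT; the tokenizer is reused unchanged.
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForMaskedLM.from_pretrained("allenai/scibert_scivocab_uncased")

# "climate_corpus.txt" is a placeholder for the curated climate paper corpus.
dataset = load_dataset("text", data_files={"train": "climate_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard MLM collator: dynamically masks 15% of tokens in each batch.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="cliscibert-mlm",
    per_device_train_batch_size=16,  # illustrative
    num_train_epochs=3,              # illustrative
    learning_rate=5e-5,              # illustrative
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```
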
## 📊 Performance

Evaluated on **ClimaBench**, a benchmark for climate-focused NLP tasks:

| Metric         | Value   |
|----------------|---------|
| Macro F1 (avg) | 60.50   |
| Tasks won      | 0/7     |
| Avg. Std Dev   | 0.01772 |

Note: While CliSciBERT builds on SciBERT's scientific grounding, its domain specialization is intended to improve relevance for climate-related NLP tasks; on this benchmark it did not win any of the seven individual tasks.
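
The headline metric is macro F1, averaged first over classes within each task and then over the seven ClimaBench tasks. A toy illustration of the per-task computation with scikit-learn (the labels below are made up):

``` python
from sklearn.metrics import f1_score

# Toy gold labels and predictions for one classification task.
y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 1, 2, 1, 1, 0, 2]

# Macro F1 weights every class equally, regardless of class frequency.
task_f1 = f1_score(y_true, y_pred, average="macro")
print(f"Macro F1 for this task: {task_f1:.4f}")

# The reported 60.50 is the mean of such per-task macro F1 scores over 7 tasks.
```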

Climate performance model card:

| Question                                | CliSciBERT                   |
|-----------------------------------------|------------------------------|
| 1. Model publicly available?            | Yes                          |
| 2. Time to train final model            | 463h                         |
| 3. Time for all experiments             | 1,226h (~51 days)            |
| 4. Power of GPU and CPU                 | 0.250 kW + 0.013 kW          |
| 5. Location for computations            | Croatia                      |
| 6. Energy mix at location               | 224.71 gCO<sub>2</sub>eq/kWh |
| 7. CO<sub>2</sub>eq for final model     | 28 kg CO<sub>2</sub>         |
| 8. CO<sub>2</sub>eq for all experiments | 74 kg CO<sub>2</sub>         |
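
The emission figures in rows 7 and 8 follow directly from rows 2-4 and 6; a back-of-the-envelope check (ignoring PUE and partial utilization, which the card does not report):

``` python
# Back-of-the-envelope check of the reported CO2eq figures.
power_kw = 0.250 + 0.013   # GPU + CPU power draw (row 4)
mix_g_per_kwh = 224.71     # Croatian energy mix (row 6)

final_kg = 463 * power_kw * mix_g_per_kwh / 1000   # final model (row 2)
all_kg = 1226 * power_kw * mix_g_per_kwh / 1000    # all experiments (row 3)

print(f"Final model:     {final_kg:.1f} kg CO2eq")  # ~27.4 kg, reported as 28 kg
print(f"All experiments: {all_kg:.1f} kg CO2eq")    # ~72.4 kg, reported as 74 kg
```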

## 🧪 Intended Uses

**Use for:**
- Scientific text classification and relation extraction in climate change literature (a fine-tuning sketch follows the example below)
- Domain-specific document tagging or summarization
- Supporting knowledge graph population for climate research

**Not recommended for:**
- Non-climate or general news content
- Non-English corpora
- Highly informal or colloquial text

Example:
``` python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline
import torch

# Load the pretrained model and tokenizer
# (repo id assumed from this card's naming; the original snippet pointed
# at the companion CliReBERT checkpoint)
model_name = "P0L3/cliscibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Move model to GPU if available
device = 0 if torch.cuda.is_available() else -1

# Create a fill-mask pipeline
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer, device=device)

# Example input from scientific climate literature
text = "The increase in greenhouse gas emissions has significantly affected the [MASK] balance of the Earth."

# Run prediction
predictions = fill_mask(text)

# Show top predictions
print(text)
print(10*">")
for p in predictions:
    print(f"{p['sequence']} — {p['score']:.4f}")
```
Output:
``` shell
The increase in greenhouse gas emissions has significantly affected the [MASK] balance of the Earth.
>>>>>>>>>>
the increase in greenhouse gas ... affected the energy balance of the earth. — 0.3911
the increase in greenhouse gas ... affected the radiative balance of the earth. — 0.2640
the increase in greenhouse gas ... affected the radiation balance of the earth. — 0.1233
the increase in greenhouse gas ... affected the carbon balance of the earth. — 0.0589
the increase in greenhouse gas ... affected the ecological balance of the earth. — 0.0332
```
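
For the classification-style uses listed above (and for benchmarks like ClimaBench), the model is fine-tuned with a sequence classification head. A minimal sketch; the repo id is assumed as in the example above, and the CSV files, label count, and hyperparameters are placeholders rather than the evaluated setup:

``` python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "P0L3/cliscibert_scivocab_uncased"  # assumed repo id, as above
tokenizer = AutoTokenizer.from_pretrained(model_name)

# num_labels is task-specific; 3 is a placeholder.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Placeholder CSVs with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="cliscibert-cls",
    per_device_train_batch_size=16,  # illustrative
    num_train_epochs=3,              # illustrative
    learning_rate=2e-5,              # illustrative
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
).train()
```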

## ⚠️ Limitations

- Retains SciBERT’s limitations outside the scientific domain
- May inherit biases from climate change literature
- No tokenizer retraining: tokenization is optimized for general science, not climate-specific vocabulary (see the check below)
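
One way to see the last point is to inspect how the unchanged SciBERT vocabulary fragments climate terminology into subword pieces:

``` python
from transformers import AutoTokenizer

# The vocabulary was built on general scientific text, not climate literature.
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")

# Domain terms missing from the vocabulary split into several subword pieces.
for term in ["radiative forcing", "albedo", "thermohaline circulation"]:
    print(f"{term!r} -> {tokenizer.tokenize(term)}")
```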

## 🧾 Citation

If you use this model, please cite:

```bibtex
@article{poleksic_etal_2025,
  title={Climate Research Domain BERTs: Pretraining, Adaptation, and Evaluation},
  author={Poleksić, Andrija and Martinčić-Ipšić, Sanda},
  journal={None},
  year={2025}
}
```