---
language: en
license: mit
library_name: transformers
tags:
  - climate-change
  - domain-adaptation
  - masked-language-modeling
  - scientific-nlp
  - transformer
  - BERT
  - SciBERT
metrics:
  - f1
model-index:
  - name: CliSciBERT
    results:
      - task:
          type: text-classification
          name: Climate NLP Tasks (ClimaBench)
        dataset:
          name: ClimaBench
          type: benchmark
        metrics:
          - type: f1
            name: Macro F1 (avg)
            value: 60.502
---

# CliSciBERT 🌿📚

**CliSciBERT** is a domain-adapted version of [**SciBERT**](https://huggingface.co/allenai/scibert_scivocab_uncased), further pretrained on a curated corpus of peer-reviewed research papers in the climate change domain. It is designed to enhance performance on climate-focused scientific NLP tasks by adapting the general scientific knowledge of SciBERT to the specialized subdomain of climate research.

## 🔍 Overview

- **Base Model**: SciBERT (BERT-base architecture, scientific vocab)
- **Pretraining Method**: Continued pretraining (domain adaptation) using Masked Language Modeling (MLM); see the setup sketch below
- **Corpus**: Scientific papers focused on climate change and environmental science
- **Tokenizer**: SciBERT tokenizer (unchanged)
- **Language**: English
- **Domain**: Climate change research

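The domain adaptation step is plain continued pretraining with the MLM objective, starting from the SciBERT checkpoint and its unchanged tokenizer. Below is a minimal sketch using 🤗 Transformers; the corpus file name, batch size, epochs, and learning rate are illustrative assumptions, not the published training configuration.

``` python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from SciBERT; the tokenizer is reused unchanged.
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForMaskedLM.from_pretrained("allenai/scibert_scivocab_uncased")

# "climate_corpus.txt" is a placeholder for the curated climate paper corpus.
dataset = load_dataset("text", data_files={"train": "climate_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard MLM collator: dynamically masks 15% of tokens in each batch.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="cliscibert-mlm",
    per_device_train_batch_size=16,  # illustrative
    num_train_epochs=3,              # illustrative
    learning_rate=5e-5,              # illustrative
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```
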
## 📊 Performance

Evaluated on **ClimaBench**, a benchmark for climate-focused NLP tasks:

| Metric         | Value   |
|----------------|---------|
| Macro F1 (avg) | 60.50   |
| Tasks won      | 0/7     |
| Avg. Std Dev   | 0.01772 |

Note: While CliSciBERT builds on SciBERT's scientific grounding, its domain specialization is intended to improve relevance for climate-related NLP tasks; on this benchmark it did not win any of the seven individual tasks.
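
The headline metric is macro F1, averaged first over classes within each task and then over the seven ClimaBench tasks. A toy illustration of the per-task computation with scikit-learn (the labels below are made up):

``` python
from sklearn.metrics import f1_score

# Toy gold labels and predictions for one classification task.
y_true = [0, 1, 2, 2, 1, 0, 2]
y_pred = [0, 1, 2, 1, 1, 0, 2]

# Macro F1 weights every class equally, regardless of class frequency.
task_f1 = f1_score(y_true, y_pred, average="macro")
print(f"Macro F1 for this task: {task_f1:.4f}")

# The reported 60.50 is the mean of such per-task macro F1 scores over 7 tasks.
```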

Climate performance model card:

| Question                                | CliSciBERT                   |
|-----------------------------------------|------------------------------|
| 1. Model publicly available?            | Yes                          |
| 2. Time to train final model            | 463h                         |
| 3. Time for all experiments             | 1,226h (~51 days)            |
| 4. Power of GPU and CPU                 | 0.250 kW + 0.013 kW          |
| 5. Location for computations            | Croatia                      |
| 6. Energy mix at location               | 224.71 gCO<sub>2</sub>eq/kWh |
| 7. CO<sub>2</sub>eq for final model     | 28 kg CO<sub>2</sub>         |
| 8. CO<sub>2</sub>eq for all experiments | 74 kg CO<sub>2</sub>         |
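
The emission figures in rows 7 and 8 follow directly from rows 2-4 and 6; a back-of-the-envelope check (ignoring PUE and partial utilization, which the card does not report):

``` python
# Back-of-the-envelope check of the reported CO2eq figures.
power_kw = 0.250 + 0.013   # GPU + CPU power draw (row 4)
mix_g_per_kwh = 224.71     # Croatian energy mix (row 6)

final_kg = 463 * power_kw * mix_g_per_kwh / 1000   # final model (row 2)
all_kg = 1226 * power_kw * mix_g_per_kwh / 1000    # all experiments (row 3)

print(f"Final model:     {final_kg:.1f} kg CO2eq")  # ~27.4 kg, reported as 28 kg
print(f"All experiments: {all_kg:.1f} kg CO2eq")    # ~72.4 kg, reported as 74 kg
```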

## 🧪 Intended Uses

**Use for:**
- Scientific text classification and relation extraction in climate change literature (a fine-tuning sketch follows the example below)
- Domain-specific document tagging or summarization
- Supporting knowledge graph population for climate research

**Not recommended for:**
- Non-climate or general news content
- Non-English corpora
- Highly informal or colloquial text

Example:
``` python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline
import torch

# Load the pretrained model and tokenizer
# (repo id assumed from this card's naming; the original snippet pointed
# at the companion CliReBERT checkpoint)
model_name = "P0L3/cliscibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Move model to GPU if available
device = 0 if torch.cuda.is_available() else -1

# Create a fill-mask pipeline
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer, device=device)

# Example input from scientific climate literature
text = "The increase in greenhouse gas emissions has significantly affected the [MASK] balance of the Earth."

# Run prediction
predictions = fill_mask(text)

# Show top predictions
print(text)
print(10*">")
for p in predictions:
    print(f"{p['sequence']} — {p['score']:.4f}")
```
Output:
``` shell
The increase in greenhouse gas emissions has significantly affected the [MASK] balance of the Earth.
>>>>>>>>>>
the increase in greenhouse gas ... affected the energy balance of the earth. — 0.3911
the increase in greenhouse gas ... affected the radiative balance of the earth. — 0.2640
the increase in greenhouse gas ... affected the radiation balance of the earth. — 0.1233
the increase in greenhouse gas ... affected the carbon balance of the earth. — 0.0589
the increase in greenhouse gas ... affected the ecological balance of the earth. — 0.0332
```
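
For the classification-style uses listed above (and for benchmarks like ClimaBench), the model is fine-tuned with a sequence classification head. A minimal sketch; the repo id is assumed as in the example above, and the CSV files, label count, and hyperparameters are placeholders rather than the evaluated setup:

``` python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "P0L3/cliscibert_scivocab_uncased"  # assumed repo id, as above
tokenizer = AutoTokenizer.from_pretrained(model_name)

# num_labels is task-specific; 3 is a placeholder.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Placeholder CSVs with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="cliscibert-cls",
    per_device_train_batch_size=16,  # illustrative
    num_train_epochs=3,              # illustrative
    learning_rate=2e-5,              # illustrative
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
).train()
```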

## ⚠️ Limitations

- Retains SciBERT’s limitations outside the scientific domain
- May inherit biases from climate change literature
- No tokenizer retraining: tokenization is optimized for general science, not climate-specific vocabulary (see the check below)
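
One way to see the last point is to inspect how the unchanged SciBERT vocabulary fragments climate terminology into subword pieces:

``` python
from transformers import AutoTokenizer

# The vocabulary was built on general scientific text, not climate literature.
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")

# Domain terms missing from the vocabulary split into several subword pieces.
for term in ["radiative forcing", "albedo", "thermohaline circulation"]:
    print(f"{term!r} -> {tokenizer.tokenize(term)}")
```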

## 🧾 Citation

If you use this model, please cite:

```bibtex
@article{poleksic_etal_2025,
  title={Climate Research Domain BERTs: Pretraining, Adaptation, and Evaluation},
  author={Poleksić, Andrija and Martinčić-Ipšić, Sanda},
  journal={None},
  year={2025}
}
```