---
library_name: transformers
tags: []
---

# 🐦 Curió 1.1B

## 📖 Overview

Curió 1.1B is a Portuguese-adapted language model created by continuing the pretraining of TinyLlama 1.1B (1T), which was originally trained on 1 trillion English tokens, with an additional 150B Portuguese tokens from the ClassiCC-PT corpus.

This model was designed to explore the impact of language-specific corpora on adapting an English-trained base model to Portuguese, yielding performance improvements on Portuguese benchmarks without large-scale retraining from scratch.


## 🏗 Training Setup

- Base model: TinyLlama 1.1B (LLaMA-2 architecture)

- Parameters: 1.1B

- Continued pretraining tokens: 150B (ClassiCC-PT)

- Sequence length: 4096 tokens (with packing; see the sketch after this list)

- Hardware: TPU v2-128 (thanks to Google TRC program)

- Framework: T5X
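
The "packing" above refers to concatenating tokenized documents and slicing the resulting token stream into fixed 4096-token training examples. The snippet below is an illustrative sketch only, not the actual T5X input pipeline, and the helper `pack_documents` is our own name:

```python
# Illustrative sequence packing (assumed behavior: EOS-separated concatenation, fixed-length slices).
from transformers import AutoTokenizer

SEQ_LEN = 4096  # training sequence length used for Curió 1.1B

def pack_documents(docs, tokenizer, seq_len=SEQ_LEN):
    """Concatenate tokenized documents, separated by EOS, and cut into full-length examples."""
    stream = []
    for doc in docs:
        stream.extend(tokenizer(doc, add_special_tokens=False)["input_ids"])
        stream.append(tokenizer.eos_token_id)  # mark the document boundary
    n_full = len(stream) // seq_len            # drop the trailing partial sequence
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]

tokenizer = AutoTokenizer.from_pretrained("ClassiCC-Corpus/Curio-1.1B")
# Tiny demo with a short seq_len so the packing is visible.
examples = pack_documents(["Primeiro documento.", "Segundo documento."], tokenizer, seq_len=8)
```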


## 📊 Evaluation

The model was evaluated on the Poeta benchmark, a suite of 14 diverse Portuguese tasks (RTE, STS, multiple-choice exams, sentiment analysis, QA, etc.), using the Normalized Preferred Metric (NPM).
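
As a rough guide (our paraphrase; see the Poeta benchmark paper for the exact definition), NPM linearly rescales each task's preferred metric so that the random-guessing baseline maps to 0 and a perfect score maps to 100, then averages over tasks:

$$\mathrm{NPM} = \frac{1}{T}\sum_{t=1}^{T} 100 \cdot \frac{s_t - r_t}{m_t - r_t}$$

where $s_t$ is the model's score on task $t$, $r_t$ the random baseline, and $m_t$ the maximum achievable score.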


| Model             | Training Regimen                             | Poeta v2 NPM |
| ----------------- | -------------------------------------------- | ------------ |
| TinyLlama 1T (EN) | –                                            | 17.4         |
| TinyLlama 2T (EN) | +1T EN continued pretraining                 | 20.9         |
| TinyLlama 1T + mC4-PT        | +150B PT (mC4-PT) continued pretraining        | ~20          |
| TinyLlama 1T + ClueWeb-22-PT | +150B PT (ClueWeb-22-PT) continued pretraining | ~27          |
| **Curió 1.1B**    | +150B PT (ClassiCC-PT) continued pretraining | **27.1**     |




## 📥 Usage

Please note that **Curió 1.1B has not been trained to be used as a chat model**; it is a base model for plain text completion.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "ClassiCC-Corpus/Curio-1.1B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```
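
Continuing from the snippet above, inference is plain text completion; the prompt and generation settings here are only an example:

```python
# Greedy text completion with the base model; generation settings are illustrative.
prompt = "A capital do Brasil é"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```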

## 📜 Citation

If you use Curió 1.1B, please cite:
```
Coming soon
```


## Acknowledgements

We thank the Google TRC program, which generously granted us the resources needed for the development of this research.