---
library_name: transformers
tags: []
---

# 🐦 Curió 1.1B

## 📖 Overview

Curió 1.1B is a Portuguese-adapted language model created by continuing the pretraining of TinyLlama 1.1B (1T), which was originally trained on 1 trillion English tokens, with an additional 150B Portuguese tokens from the ClassiCC-PT corpus.

This model was designed to explore the impact of language-specific corpora on adapting an English-trained base model to Portuguese, yielding performance improvements on Portuguese benchmarks without large-scale retraining from scratch.


## 🏗 Training Setup

- Base model: TinyLlama 1.1B (LLaMA-2 architecture)

- Parameters: 1.1B

- Continued pretraining tokens: 150B (ClassiCC-PT)

- Sequence length: 4096 tokens (with packing; see the sketch after this list)

- Hardware: TPU v2-128 (thanks to Google TRC program)

- Framework: T5X
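
The "packing" above refers to concatenating tokenized documents and slicing the resulting token stream into fixed 4096-token training examples. The snippet below is an illustrative sketch only, not the actual T5X input pipeline, and the helper `pack_documents` is our own name:

```python
# Illustrative sequence packing (assumed behavior: EOS-separated concatenation, fixed-length slices).
from transformers import AutoTokenizer

SEQ_LEN = 4096  # training sequence length used for Curió 1.1B

def pack_documents(docs, tokenizer, seq_len=SEQ_LEN):
    """Concatenate tokenized documents, separated by EOS, and cut into full-length examples."""
    stream = []
    for doc in docs:
        stream.extend(tokenizer(doc, add_special_tokens=False)["input_ids"])
        stream.append(tokenizer.eos_token_id)  # mark the document boundary
    n_full = len(stream) // seq_len            # drop the trailing partial sequence
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]

tokenizer = AutoTokenizer.from_pretrained("ClassiCC-Corpus/Curio-1.1B")
# Tiny demo with a short seq_len so the packing is visible.
examples = pack_documents(["Primeiro documento.", "Segundo documento."], tokenizer, seq_len=8)
```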


## 📊 Evaluation

The model was evaluated on the Poeta benchmark, a suite of 14 diverse Portuguese tasks (RTE, STS, multiple-choice exams, sentiment analysis, QA, etc.), using the Normalized Preferred Metric (NPM).
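
As a rough guide (our paraphrase; see the Poeta benchmark paper for the exact definition), NPM linearly rescales each task's preferred metric so that the random-guessing baseline maps to 0 and a perfect score maps to 100, then averages over tasks:

$$\mathrm{NPM} = \frac{1}{T}\sum_{t=1}^{T} 100 \cdot \frac{s_t - r_t}{m_t - r_t}$$

where $s_t$ is the model's score on task $t$, $r_t$ the random baseline, and $m_t$ the maximum achievable score.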


| Model             | Training Regimen                             | Poeta v2 NPM |
| ----------------- | -------------------------------------------- | ------------ |
| TinyLlama 1T (EN) | –                                            | 17.4         |
| TinyLlama 2T (EN) | +1T EN continued pretraining                 | 20.9         |
| TinyLlama 1T + mC4-PT        | +150B PT (mC4-PT) continued pretraining        | ~20          |
| TinyLlama 1T + ClueWeb-22-PT | +150B PT (ClueWeb-22-PT) continued pretraining | ~27          |
| **Curió 1.1B**    | +150B PT (ClassiCC-PT) continued pretraining | **27.1**     |




## 📥 Usage

Please note that **Curió 1.1B has not been trained to be used as a chat model**; it is a base model for plain text completion.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "ClassiCC-Corpus/Curio-1.1B"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```
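
Continuing from the snippet above, inference is plain text completion; the prompt and generation settings here are only an example:

```python
# Greedy text completion with the base model; generation settings are illustrative.
prompt = "A capital do Brasil é"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```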

## 📜 Citation

If you use Curió 1.1B, please cite:
```
Coming soon
```


## Acknowledgements

We thank the Google TRC program, which generously granted us the resources needed for the development of this research.