---
license: mit
language:
- pt
tags:
- gervasio-pt*
- gervasio-ptpt
- gervasio-8b-portuguese-ptpt-decoder
- portulan
- albertina-pt*
- serafim-pt*
- clm
- gpt
- portuguese
- decoder
- foundation model
base_model:
- meta-llama/Llama-3.1-8B-Instruct
base_model_relation: finetune
pipeline_tag: text-generation
library_name: transformers
---
</br>
</br>
<img align="left" width="40" height="40" src="https://github.githubassets.com/images/icons/emoji/unicode/1f917.png">
<p style="text-align: center;"> This is the model card for <b>Gervásio 8B PTPT</b> decoder.
</br>
This model is integrated into the <a href="https://evaristo.ai"><b>Evaristo.ai chatbot</b></a>, where its generative capabilities can be tried out on the fly through a GUI.
</br>
You may also be interested in some of the other models in the <a href="https://huggingface.co/PORTULAN">Albertina (encoders) and Serafim (sentence encoder) families</a>.
</p>
</br>
</br>
<img width="500" src="logo_gervasio_long_color.png">
</br>
# Gervásio 8B PTPT
</br>
**Gervásio 8B PTPT** is an **open** decoder for the **Portuguese language**.
It is a **decoder** of the LLaMA family, based on the Transformer neural architecture and developed over the LLaMA 3.1 8B Instruct model.
Its further improvement through additional training was done over language resources that include datasets of Portuguese prepared for this purpose, namely [extraGLUE-Instruct](https://huggingface.co/datasets/PORTULAN/extraglue-instruct), as well as other datasets whose release is being prepared (MMLU PT, Natural Instructions PT, a Wikipedia subset, Provérbios PT).
**Gervásio 8B PTPT** is openly distributed for free under an open license, for both research and commercial purposes, and, given its size, it can be run on consumer-grade hardware.
**Gervásio 8B PTPT** is developed by NLX-Natural Language and Speech Group, at the University of Lisbon, Faculty of Sciences, Department of Informatics, Portugal.
For the record, its full name is **Gervásio Produz Textos em Português**, to which corresponds the natural acronym **GPT PT**,
and which is known more shortly as **Gervásio PT*** or, even more briefly, just as **Gervásio**, among its acquaintances.
<br>
<br>
# Model Description
The model has 8 billion parameters, over 32 layers, with a hidden size of 4096, an intermediate size of 14336, and 32 attention heads. It uses rotary position embeddings (RoPE) and a tokenizer with a vocabulary of 128,256 tokens.
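These figures can be checked against the model configuration without downloading the weights; a minimal sketch (the attribute names are those of the standard Llama configuration in `transformers`):
```python3
from transformers import AutoConfig

# Load only the configuration, not the weights.
config = AutoConfig.from_pretrained("PORTULAN/gervasio-8b-portuguese-ptpt-decoder")

print(config.num_hidden_layers)    # 32 layers
print(config.hidden_size)          # 4096
print(config.intermediate_size)    # 14336
print(config.num_attention_heads)  # 32
print(config.vocab_size)           # 128256
```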
<br>
<br>
# Training Data
**Gervásio 8B PTPT** was trained on various datasets, either native to European Portuguese or translated into European Portuguese.
For the latter, we selected only those datasets where the outcome of their translation into European Portuguese could preserve, in the target language, the linguistic properties at stake.
The training data comprises:
- [extraGLUE-Instruct](https://huggingface.co/datasets/PORTULAN/extraglue-instruct)
- MMLU PT (multiple choice question answering).
- A subset of Natural Instructions (mostly multiple choice question answering tasks).
- A manually curated subset of Wikipedia.
- A manually curated list of proverbs.
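
Of these, extraGLUE-Instruct is publicly released and can be inspected directly; a minimal sketch using the `datasets` library (if the dataset exposes multiple configurations, a configuration name may need to be passed to `load_dataset`):
```python3
from datasets import load_dataset

# Load the publicly released instruction dataset used for fine-tuning.
extraglue_instruct = load_dataset("PORTULAN/extraglue-instruct")

# Print an overview of the available splits and features.
print(extraglue_instruct)
```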
<br>
<br>
# Training Details
We applied supervised fine-tuning with a causal language modeling objective, following a zero-out technique: while the entire prompt and chat template received attention during fine-tuning, only the response tokens were subjected to back-propagation.
To accelerate training, the Fully Sharded Data Parallel (FSDP) paradigm was used over 10 L40S GPUs.
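A minimal sketch of this zero-out technique, assuming a standard `transformers` setup where prompt tokens are excluded from the loss by setting their labels to -100 (the exact prompt/response splitting and chat-template handling used for Gervásio may differ):
```python3
import copy

IGNORE_INDEX = -100  # labels with this value are ignored by the cross-entropy loss

def build_example(tokenizer, prompt, response):
    """Tokenize a prompt/response pair so only response tokens are back-propagated."""
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response + tokenizer.eos_token, add_special_tokens=False)["input_ids"]

    input_ids = prompt_ids + response_ids
    labels = copy.deepcopy(input_ids)
    # Zero out the prompt: these tokens are attended to, but excluded from the loss.
    labels[: len(prompt_ids)] = [IGNORE_INDEX] * len(prompt_ids)

    return {"input_ids": input_ids, "labels": labels}
```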
<br>
<br>
# Performance
For testing, we use translations of the standard benchmarks GPQA Diamond, MMLU and MMLU Pro, as well as the CoPA, MRPC and RTE datasets in [extraGLUE](https://huggingface.co/datasets/PORTULAN/extraglue).
| Model | GPQA Diamond PT | MMLU PT | MMLU Pro PT | CoPA | MRPC | RTE | Average |
| ---------------------- | --------------: | --------: | ----------: | --------: | --------: | --------: | --------: |
| Gervásio 8B PTPT | **34.85** | **62.15** | **36.79** | **87.00** | **77.45** | 77.62 | **62.64** |
| LLaMA 3.1 8B Instruct | 32.32 | 61.49 | 36.10 | 83.00 | 75.25 | **79.42** | 61.26 |
<br>
<br>
# How to use
You can use this model directly with a pipeline for causal language modeling:
```python3
>>> from transformers import pipeline
>>> generator = pipeline(model='PORTULAN/gervasio-8b-portuguese-ptpt-decoder')
>>> generator("A comida portuguesa é", max_new_tokens=10)
```
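Since the model was fine-tuned with a chat template (see Training Details), it can also be queried in a chat setting; a minimal sketch, assuming the tokenizer ships the base model's chat template, with a purely illustrative example prompt:
```python3
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PORTULAN/gervasio-8b-portuguese-ptpt-decoder"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Quais são os pratos típicos da cozinha portuguesa?"}]

# Build the prompt with the chat template and generate a response.
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```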
<br>
<br>
# Chatbot
This model is integrated into the **chatbot** [**Evaristo.ai**](https://evaristo.ai), where its generative capabilities can be tried out on the fly through a GUI.
<br>
<br>
# Please cite
```bibtex
@misc{gervasio,
      title={Advancing Generative AI for Portuguese with
             Open Decoder Gervásio PT-*},
      author={Rodrigo Santos and João Silva and Luís Gomes and
              João Rodrigues and António Branco},
      year={2024},
      eprint={2402.18766},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
Please use the above canonical reference when using or citing this model.
<br>
<br>
# Acknowledgments
The research reported here was partially supported by:
- PORTULAN CLARIN—Research Infrastructure for the Science and Technology of Language, funded by Lisboa 2020, Alentejo 2020 and FCT—Fundação para a Ciência e Tecnologia under the grant PINFRA/22117/2016;
- the innovation project ACCELERAT.AI - Multilingual Intelligent Contact Centers, funded by IAPMEI, I.P. - Agência para a Competitividade e Inovação I.P. under the grant C625734525-00462629 of Plano de Recuperação e Resiliência, call RE-C05-i01.01 – Agendas/Alianças Mobilizadoras para a Reindustrialização;
- the research project "Hey, Hal, curb your hallucination! / Enhancing AI chatbots with enhanced RAG solutions", funded by FCT—Fundação para a Ciência e a Tecnologia under the grant 2024.07592.IACDC;
- the project "CLARIN – Infraestrutura de Investigação para a Ciência e Tecnologia da Linguagem", funded by the programme Lisboa2030 under the grant LISBOA2030-FEDER-01316900PORTULAN.