---
license: mit
language:
- pt
tags:
- gervasio-pt*
- gervasio-ptpt
- gervasio-8b-portuguese-ptpt-decoder
- portulan
- albertina-pt*
- serafim-pt*
- clm
- gpt
- portuguese
- decoder
- foundation model
base_model:
- meta-llama/Llama-3.1-8B-Instruct
base_model_relation: finetune
pipeline_tag: text-generation
library_name: transformers
---

<br>
<br>
<img align="left" width="40" height="40" src="https://github.githubassets.com/images/icons/emoji/unicode/1f917.png">
<p style="text-align: center;">This is the model card for the <b>Gervásio 8B PTPT</b> decoder.
<br>
This model is integrated in the <a href="https://evaristo.ai"><b>Evaristo.ai chatbot</b></a>, where its generative capabilities can be tried out on the fly through a GUI.
<br>
You may also be interested in some of the other models in the <a href="https://huggingface.co/PORTULAN">Albertina (encoders) and Serafim (sentence encoder) families</a>.
</p>
<br>
<br>

<img width="500" src="logo_gervasio_long_color.png">

<br>

# Gervásio 8B PTPT

<br>

**Gervásio 8B PTPT** is an **open** decoder for the **Portuguese language**.

It is a **decoder** of the LLaMA family, based on the Transformer neural architecture and developed from the LLaMA 3.1 8B Instruct model.
It was further improved through additional training on language resources that include datasets of Portuguese prepared for this purpose, namely [extraGLUE-Instruct](https://huggingface.co/datasets/PORTULAN/extraglue-instruct), as well as other datasets whose release is being prepared (MMLU PT, Natural Instructions PT, a Wikipedia subset, Provérbios PT).

**Gervásio 8B PTPT** is openly distributed for free under an open license, including for research and commercial purposes, and, given its size, can be run on consumer-grade hardware.

**Gervásio 8B PTPT** is developed by the NLX-Natural Language and Speech Group, Department of Informatics, Faculty of Sciences, University of Lisbon, Portugal.

For the record, its full name is **Gervásio Produz Textos em Português**, to which corresponds the natural acronym **GPT PT**,
and which is known more shortly as **Gervásio PT*** or, even more briefly, just as **Gervásio**, among its acquaintances.

<br>
<br>

# Model Description

The model has 8 billion parameters, distributed over 32 layers, with a hidden size of 4096, an intermediate size of 14336, and 32 attention heads. It uses rotary position embeddings (RoPE) and a tokenizer with a vocabulary of 128256 tokens.
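
For a quick sanity check, these hyper-parameters can be read directly from the published model configuration; a minimal sketch, assuming the `transformers` library is installed:

```python
from transformers import AutoConfig

# Read the architecture hyper-parameters from the model repository.
config = AutoConfig.from_pretrained("PORTULAN/gervasio-8b-portuguese-ptpt-decoder")

print(config.num_hidden_layers)    # 32 layers
print(config.hidden_size)          # 4096
print(config.intermediate_size)    # 14336
print(config.num_attention_heads)  # 32 attention heads
print(config.vocab_size)           # 128256
```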

<br>
<br>

# Training Data

**Gervásio 8B PTPT** was trained on various datasets, either native to European Portuguese or translated into European Portuguese.
For the latter, we selected only those datasets whose translation into European Portuguese could preserve, in the target language, the linguistic properties at stake.

The training data comprises (the already released portion can be inspected as sketched below):

- [extraGLUE-Instruct](https://huggingface.co/datasets/PORTULAN/extraglue-instruct).
- MMLU PT (multiple choice question answering).
- A subset of Natural Instructions (mostly multiple choice question answering tasks).
- A manually curated subset of Wikipedia.
- A manually curated list of proverbs.
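
The extraGLUE-Instruct portion is publicly available and can be loaded for inspection; a minimal sketch with the `datasets` library, assuming its default configuration:

```python
from datasets import load_dataset

# Load the publicly released extraGLUE-Instruct training data for inspection.
dataset = load_dataset("PORTULAN/extraglue-instruct")
print(dataset)
```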

<br>
<br>

# Training Details

We applied supervised fine-tuning with a causal language modeling training objective, following a zero-out technique during the fine-tuning process: while the entire prompt and chat template received attention during fine-tuning, only the response tokens were subjected to back-propagation, as sketched below.
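
The following is an illustrative sketch of this zero-out masking, not the actual training code: under the usual `transformers`/PyTorch convention, label positions set to -100 are ignored by the cross-entropy loss, so only the response tokens drive back-propagation. The token ids below are hypothetical placeholders.

```python
import torch

# Hypothetical token ids for a templated prompt and its response.
prompt_ids = [128000, 882, 15339, 128009]
response_ids = [78191, 1917, 0, 128009]

input_ids = torch.tensor([prompt_ids + response_ids])
attention_mask = torch.ones_like(input_ids)  # the whole sequence is attended to

# Zero-out technique: prompt positions receive the ignore index -100, so the
# loss, and hence back-propagation, covers only the response tokens.
labels = torch.tensor([[-100] * len(prompt_ids) + response_ids])

# outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
# outputs.loss is then computed over the response tokens only.
```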

To accelerate training, the Fully Sharded Data Parallel (FSDP) paradigm was used over 10 L40S GPUs.
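
As an illustration only (the actual training configuration is not reproduced here), wrapping the model with PyTorch's FSDP looks roughly as follows, assuming a process group launched with `torchrun`:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

# Requires a distributed launch, e.g.: torchrun --nproc_per_node=10 train.py
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank())

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# FSDP shards parameters, gradients and optimizer state across the GPUs,
# making fine-tuning of an 8B model feasible on this hardware.
model = FSDP(model.cuda())
```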

<br>
<br>

# Performance

For testing, we used translations of the standard benchmarks GPQA Diamond, MMLU and MMLU Pro, as well as the CoPA, MRPC and RTE datasets in [extraGLUE](https://huggingface.co/datasets/PORTULAN/extraglue).

| Model                 | GPQA Diamond PT | MMLU PT   | MMLU Pro PT | CoPA      | MRPC      | RTE       | Average   |
| --------------------- | --------------: | --------: | ----------: | --------: | --------: | --------: | --------: |
| Gervásio 8B PTPT      | **34.85**       | **62.15** | **36.79**   | **87.00** | **77.45** | 77.62     | **62.64** |
| LLaMA 3.1 8B Instruct | 32.32           | 61.49     | 36.10       | 83.00     | 75.25     | **79.42** | 61.26     |

<br>
<br>

# How to use

You can use this model directly with a pipeline for causal language modeling:

```python
>>> from transformers import pipeline
>>> generator = pipeline(model='PORTULAN/gervasio-8b-portuguese-ptpt-decoder')
>>> generator("A comida portuguesa é", max_new_tokens=10)
```
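
Since the model was fine-tuned with a chat template, it can also be prompted with chat-style messages; a minimal sketch, assuming a recent version of `transformers` whose pipeline accepts chat inputs (the prompt and generation length are illustrative):

```python
>>> from transformers import pipeline
>>> generator = pipeline("text-generation", model='PORTULAN/gervasio-8b-portuguese-ptpt-decoder')
>>> messages = [{"role": "user", "content": "Descreve em poucas frases a comida portuguesa."}]
>>> generator(messages, max_new_tokens=50)
```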

<br>
<br>

# Chatbot

This model is integrated in the **chatbot** [**Evaristo.ai**](https://evaristo.ai), where its generative capabilities can be tried out on the fly through a GUI.

<br>
<br>

# Please cite

```bibtex
@misc{gervasio,
  title={Advancing Generative AI for Portuguese with
         Open Decoder Gervásio PT-*},
  author={Rodrigo Santos and João Silva and Luís Gomes
          and João Rodrigues and António Branco},
  year={2024},
  eprint={2402.18766},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

Please use the above canonical reference when using or citing this model.

<br>
<br>

# Acknowledgments


The research reported here was partially supported by:
PORTULAN CLARIN—Research Infrastructure for the Science and Technology of Language, funded by Lisboa 2020, Alentejo 2020 and FCT—Fundação para a Ciência e Tecnologia under the grant PINFRA/22117/2016;
innovation project ACCELERAT.AI - Multilingual Intelligent Contact Centers, funded by IAPMEI, I.P. - Agência para a Competitividade e Inovação I.P. under the grant C625734525-00462629, of Plano de Recuperação e Resiliência, call RE-C05-i01.01 – Agendas/Alianças Mobilizadoras para a Reindustrialização;
research project "Hey, Hal, curb your hallucination! / Enhancing AI chatbots with enhanced RAG solutions", funded by FCT-Fundação para a Ciência e a Tecnologia under the grant 2024.07592.IACDC;
project "CLARIN – Infraestrutura de Investigação para a Ciência e Tecnologia da Linguagem", funded by programme Lisboa2030 under the grant LISBOA2030-FEDER-01316900PORTULAN.