---
library_name: transformers
language:
- pt
---

# Model Card for ModernBERT-large-portuguese

<!-- Provide a quick summary of what the model is/does. -->

This model and its tokenizer were fully pretrained on Portuguese text.

I haven't had time yet to write up the training process in detail; contact me at elias.jacob at ufrn.br if you need information before a more complete write-up is published. The training data was the cleaned Portuguese subset of the HPLT V2 dataset, and I followed the exact same training recipe described in the original ModernBERT paper.

The snippet below shows how to use the model for masked language modeling:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "eliasjacob/ModernBERT-large-portuguese"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

device = torch.device("cpu")
model.to(device)
model.eval()

text = "O código penal brasileiro estabelece, em seu artigo [MASK], o crime de homicídio"
inputs = tokenizer(text, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model(**inputs)

# Locate the [MASK] position and take the highest-scoring token at that position
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(dim=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
```

Expected output:

```
Predicted token: 121
```

The prediction matches article 121 of the Brazilian penal code, which defines homicide.
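
For quick experiments, the same prediction can also be run with the `fill-mask` pipeline from `transformers`. This is a minimal sketch, assuming the checkpoint above is reachable on the Hugging Face Hub:

```python
from transformers import pipeline

# Minimal fill-mask sketch; assumes the checkpoint above is available on the Hub
fill_mask = pipeline("fill-mask", model="eliasjacob/ModernBERT-large-portuguese")

text = "O código penal brasileiro estabelece, em seu artigo [MASK], o crime de homicídio"

# Each returned dict contains the predicted token string, its score, and the filled sequence
for prediction in fill_mask(text, top_k=5):
    print(f"{prediction['token_str']!r} (score={prediction['score']:.3f})")
```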