---
language: it
---

# UmBERTo Commoncrawl Cased

[UmBERTo](https://github.com/musixmatchresearch/umberto) is a RoBERTa-based language model trained on large Italian corpora using two innovative approaches: SentencePiece and Whole Word Masking. Now available at [huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1](https://huggingface.co/Musixmatch/umberto-commoncrawl-cased-v1)
*Marco Lodola, Monument to Umberto Eco, Alessandria 2019*
## Dataset

UmBERTo-Commoncrawl-Cased uses the Italian subcorpus of [OSCAR](https://traces1.inria.fr/oscar/) as the training set for the language model. We used the deduplicated version of the Italian corpus, which consists of 70 GB of plain text data (210M sentences, 11B words), where the sentences have been filtered and shuffled at line level for use in NLP research.

## Pre-trained model

| Model | WWM | Cased | Tokenizer | Vocab Size | Train Steps | Download |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| `umberto-commoncrawl-cased-v1` | YES | YES | SPM | 32K | 125k | [Link](http://bit.ly/35zO7GH) |

This model was trained with [SentencePiece](https://github.com/google/sentencepiece) and Whole Word Masking.

## Downstream Tasks

These results refer to the umberto-commoncrawl-cased model. All details are on the [UmBERTo](https://github.com/musixmatchresearch/umberto) official page.

#### Named Entity Recognition (NER)

| Dataset | F1 | Precision | Recall | Accuracy |
| ------ | ------ | ------ | ------ | ------ |
| **ICAB-EvalITA07** | **87.565** | 86.596 | 88.556 | 98.690 |
| **WikiNER-ITA** | **92.531** | 92.509 | 92.553 | 99.136 |

#### Part of Speech (POS)

| Dataset | F1 | Precision | Recall | Accuracy |
| ------ | ------ | ------ | ------ | ------ |
| **UD_Italian-ISDT** | 98.870 | 98.861 | 98.879 | **98.977** |
| **UD_Italian-ParTUT** | 98.786 | 98.812 | 98.760 | **98.903** |

## Usage

##### Load UmBERTo with AutoModel and AutoTokenizer:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")
umberto = AutoModel.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")

encoded_input = tokenizer.encode("Umberto Eco è stato un grande scrittore")
input_ids = torch.tensor(encoded_input).unsqueeze(0)  # Batch size 1
outputs = umberto(input_ids)
last_hidden_states = outputs[0]  # The last hidden state is the first element of the output
```

##### Predict masked token:

```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="Musixmatch/umberto-commoncrawl-cased-v1",
    tokenizer="Musixmatch/umberto-commoncrawl-cased-v1"
)

result = fill_mask("Umberto Eco è <mask> un grande scrittore")
```
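The pipeline returns a list of candidate completions ranked by score. A minimal sketch for printing them, assuming the standard `transformers` fill-mask output keys (`sequence`, `score`, and `token_str`); the actual predictions depend on the model:

```python
# Each prediction is a dict with the standard fill-mask pipeline keys;
# predictions are returned sorted by descending score.
for prediction in result:
    print(f"{prediction['token_str']}\t{prediction['score']:.4f}\t{prediction['sequence']}")
```

##### Inspect SentencePiece tokenization:

Because the vocabulary was built with SentencePiece, the tokenizer splits raw text into subword pieces rather than whitespace tokens. A short sketch to see this in action (the exact pieces depend on the learned 32K vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Musixmatch/umberto-commoncrawl-cased-v1")

# SentencePiece marks word boundaries with "▁"; out-of-vocabulary words
# are split into multiple subword pieces.
print(tokenizer.tokenize("Umberto Eco è stato un grande scrittore"))
```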