NILC Portuguese Word Embeddings — FastText Skip-Gram 300d

This repository contains the FastText Skip-Gram 300d model in safetensors format.

About

NILC-Embeddings is a repository for storing and sharing word embeddings for the Portuguese language. The goal is to provide ready-to-use vector resources for Natural Language Processing (NLP) and Machine Learning tasks.

The embeddings were trained on a large Portuguese corpus covering both Brazilian and European Portuguese, composed of 17 corpora (~1.39B tokens). The NILC-Embeddings collection includes models trained with four algorithms: Word2Vec, FastText, Wang2Vec, and GloVe. This repository holds the FastText Skip-Gram variant.

📂 Files

embeddings.safetensors → embedding matrix ([vocab_size, 300])
vocab.txt → vocabulary (one token per line, aligned with rows)
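
Because line i of vocab.txt names the token stored in row i of the matrix, a token-to-row lookup is simply a dictionary built from the vocabulary list. The following is a minimal sketch using a toy vocabulary in place of the real file; `build_index` is a hypothetical helper name, not something shipped with this repository.

```python
def build_index(vocab):
    """Map each token to its row number in the embedding matrix."""
    index = {}
    for row, token in enumerate(vocab):
        # Keep the first occurrence if a token were ever repeated.
        index.setdefault(token, row)
    return index


# Toy stand-in for the lines of vocab.txt (the real list is far longer).
toy_vocab = ["casa", "gato", "cachorro"]
token_to_row = build_index(toy_vocab)
print(token_to_row["gato"])  # 1
```

With the real files, passing the loaded `vocab` list to the same helper gives the row to read from `vectors` for any in-vocabulary token.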

🚀 Usage

from huggingface_hub import hf_hub_download
from safetensors.numpy import load_file

path = hf_hub_download(repo_id="nilc-nlp/fasttext-skip-gram-300d",
                       filename="embeddings.safetensors")

data = load_file(path)
vectors = data["embeddings"]

vocab_path = hf_hub_download(repo_id="nilc-nlp/fasttext-skip-gram-300d",
                             filename="vocab.txt")
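# Assumes vocab.txt is UTF-8; an explicit encoding keeps accented tokens
# decoding the same way on every platform.
with open(vocab_path, encoding="utf-8") as f:
    vocab = [w.strip() for w in f]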

print(vectors.shape)
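
Once the matrix and vocabulary are loaded, a common first check is to look up a word's nearest neighbors by cosine similarity. The sketch below is one simple way to do that with NumPy, not a method prescribed by the authors; `nearest_neighbors` is a hypothetical helper, and the 2-dimensional toy data stands in for the real `[vocab_size, 300]` matrix.

```python
import numpy as np


def nearest_neighbors(query, vectors, vocab, k=3):
    """Return the k tokens most similar to `query` by cosine similarity."""
    token_to_row = {tok: i for i, tok in enumerate(vocab)}
    q = vectors[token_to_row[query]]  # raises KeyError if query is not in vocab
    # Cosine similarity = dot product divided by the product of the norms.
    norms = np.linalg.norm(vectors, axis=1) * np.linalg.norm(q)
    sims = vectors @ q / np.maximum(norms, 1e-12)
    # Sort by descending similarity and drop the query itself.
    order = np.argsort(-sims)
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != query][:k]


# Toy data (assumption): three tokens with hand-picked 2-d vectors.
toy_vocab = ["rei", "rainha", "banana"]
toy_vectors = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], dtype=np.float32)
result = nearest_neighbors("rei", toy_vectors, toy_vocab, k=2)
print(result)
```

On the toy data, "rainha" ranks ahead of "banana" because its vector points almost in the same direction as "rei". With the real files, the call would be `nearest_neighbors("rei", vectors, vocab)`.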

Or in PyTorch:

from safetensors.torch import load_file

# `path` is the local file returned by hf_hub_download above.
tensors = load_file(path)
vectors = tensors["embeddings"]  # torch.Tensor
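
In a PyTorch model, the loaded tensor would typically be wrapped in an embedding layer so that batches of row indices can be turned into vectors. The sketch below uses `nn.Embedding.from_pretrained` for this; `make_embedding_layer` is a hypothetical helper, and the small toy tensor stands in for the real `[vocab_size, 300]` matrix.

```python
import torch
import torch.nn as nn


def make_embedding_layer(vectors, freeze=True):
    """Wrap a [vocab_size, dim] tensor in an nn.Embedding lookup layer."""
    # freeze=True keeps the pretrained weights fixed during training.
    return nn.Embedding.from_pretrained(vectors.float(), freeze=freeze)


# Toy tensor (assumption): three tokens with 2-d vectors.
toy_vectors = torch.tensor([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
layer = make_embedding_layer(toy_vectors)
token_ids = torch.tensor([[2, 0]])  # one sequence holding two row indices
out = layer(token_ids)
print(out.shape)  # torch.Size([1, 2, 2])
```

Passing `freeze=False` instead would let the vectors be fine-tuned along with the rest of the model.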

📊 Corpus

The embeddings were trained on a combination of 17 corpora (~1.39B tokens):

Corpus	Tokens	Types	Genre	Description
LX-Corpus [Rodrigues et al. 2016]	714,286,638	2,605,393	Mixed genres	Large collection of texts from 19 sources, mostly European Portuguese
Wikipedia	219,293,003	1,758,191	Encyclopedic	Wikipedia dump (2016-10-20)
GoogleNews	160,396,456	664,320	Informative	News crawled from Google News
SubIMDB-PT	129,975,149	500,302	Spoken	Movie subtitles from IMDb
G1	105,341,070	392,635	Informative	News from G1 portal (2014–2015)
PLN-Br [Bruckschen et al. 2008]	31,196,395	259,762	Informative	Corpus of PLN-BR project (1994–2005)
Domínio Público	23,750,521	381,697	Prose	138,268 literary works
Lacio-Web [Aluísio et al. 2003]	8,962,718	196,077	Mixed	Literary, informative, scientific, law, didactic texts
Literatura Brasileira	1,299,008	66,706	Prose	Classical Brazilian fiction e-books
Mundo Estranho	1,047,108	55,000	Informative	Texts from Mundo Estranho magazine
CHC	941,032	36,522	Informative	Texts from Ciência Hoje das Crianças
FAPESP	499,008	31,746	Science communication	Texts from Pesquisa FAPESP magazine
Textbooks	96,209	11,597	Didactic	Elementary school textbooks
Folhinha	73,575	9,207	Informative	Children’s news from Folhinha (Folha de São Paulo)
NILC subcorpus	32,868	4,064	Informative	Children’s texts (3rd–4th grade)
Para Seu Filho Ler	21,224	3,942	Informative	Children’s news from Zero Hora
SARESP	13,308	3,293	Didactic	School evaluation texts
Total	1,395,926,282	3,827,725	—	—

📖 Paper

Portuguese Word Embeddings: Evaluating on Word Analogies and Natural Language Tasks
Hartmann, N. et al. (2017), STIL 2017.
ArXiv Paper

BibTeX

@inproceedings{hartmann-etal-2017-portuguese,
  title        = {{P}ortuguese Word Embeddings: Evaluating on Word Analogies and Natural Language Tasks},
  author       = {Hartmann, Nathan  and Fonseca, Erick  and Shulby, Christopher  and Treviso, Marcos  and Silva, J{\'e}ssica  and Alu{\'i}sio, Sandra},
  year         = 2017,
  month        = oct,
  booktitle    = {Proceedings of the 11th {B}razilian Symposium in Information and Human Language Technology},
  publisher    = {Sociedade Brasileira de Computa{\c{c}}{\~a}o},
  address      = {Uberl{\^a}ndia, Brazil},
  pages        = {122--131},
  url          = {https://aclanthology.org/W17-6615/},
  editor       = {Paetzold, Gustavo Henrique  and Pinheiro, Vl{\'a}dia}
}

📜 License

Creative Commons Attribution 4.0 International (CC BY 4.0)
