---
language: tg
license: mit
tags:
- fasttext
- tajik
- word-embeddings
- nlp
---

# Tajik FastText Word Embedding Model

This repository contains a pretrained **FastText** model for the Tajik language.

- **Training data**: Tokenized Tajik corpus
- **Total tokens**: 21,171,522
- **Vocabulary size**: 316,637
- **Model type**: FastText (with subword information)

## Files Included

| File | Description |
|------|-------------|
| `tajik_fasttext.model` | Gensim model file |
| `tajik_fasttext.model.wv.vectors_ngrams.npy` | Subword (n-gram) vectors |
| `tajik_fasttext.model.wv.vectors_vocab.npy` | Word vectors |

All three files are required to load the model correctly using Gensim.

## Usage

```python
from gensim.models import FastText

model = FastText.load("tajik_fasttext.model")
vector = model.wv["Точикистон"]  # Example word
```

## Citation

If you use this model, please cite the repository:

> ArabovMK, Tajik FastText Model, Hugging Face, 2025-05-08
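
## Verifying the files before loading

Because Gensim needs all three files from the Files Included table, it can help to confirm they are present before loading. The following is a minimal sketch, not part of this repository: the helper names `missing_model_files` and `load_tajik_fasttext` are hypothetical, and it assumes the three files sit together in one directory under the names listed in the table.

```python
from pathlib import Path

# File names as listed in the "Files Included" table.
REQUIRED_FILES = (
    "tajik_fasttext.model",
    "tajik_fasttext.model.wv.vectors_ngrams.npy",
    "tajik_fasttext.model.wv.vectors_vocab.npy",
)


def missing_model_files(model_dir="."):
    """Return the required file names absent from model_dir (hypothetical helper)."""
    base = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


def load_tajik_fasttext(model_dir="."):
    """Load the model with Gensim once all three files are found (hypothetical helper)."""
    missing = missing_model_files(model_dir)
    if missing:
        raise FileNotFoundError(f"Missing model files in {model_dir!r}: {missing}")
    # Imported lazily so the file check also works where Gensim is not installed.
    from gensim.models import FastText

    return FastText.load(str(Path(model_dir) / REQUIRED_FILES[0]))
```

With the files downloaded to the current directory, `model = load_tajik_fasttext(".")` should behave like the `FastText.load` call in the Usage section, after which `model.wv.most_similar("Точикистон", topn=5)` lists the nearest neighbours of the example word.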