metadata

language: tg
license: mit
tags:
  - fasttext
  - tajik
  - word-embeddings
  - nlp

Tajik FastText Word Embedding Model

Training data: Tokenized Tajik corpus
Total tokens: 21,171,522
Vocabulary size: 316,637
Model type: FastText (with subword information)

This repository contains a pretrained FastText model for the Tajik language.
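
The "subword information" noted above means that a FastText model represents a word through its character n-grams as well as the word itself. The sketch below is a simplified illustration of how such n-grams are derived, not the library's internal code. The function name `char_ngrams` is hypothetical, and the `min_n=3` / `max_n=6` defaults are an assumption based on common FastText defaults; the settings used to train this model are not stated here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return FastText-style character n-grams for `word` (simplified sketch)."""
    # FastText-style models wrap each word in boundary markers so that
    # prefixes and suffixes are distinguishable from word-internal n-grams.
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the wrapped word.
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


# Example with the Tajik word for "book", limited to 3- and 4-grams.
print(char_ngrams("китоб", min_n=3, max_n=4))
# ['<ки', 'кит', 'ито', 'тоб', 'об>', '<кит', 'кито', 'итоб', 'тоб>']
```

Because a word vector can be composed from n-gram vectors like these, the model can usually produce a vector for a word that never appeared in the training corpus.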

Files Included

| File | Description |
|------|-------------|
| `tajik_fasttext.model` | Gensim model file |
| `tajik_fasttext.model.wv.vectors_ngrams.npy` | Subword (n-gram) vectors |
| `tajik_fasttext.model.wv.vectors_vocab.npy` | Word vectors |

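All three files are required to load the model with Gensim. Keep them in the same directory: `FastText.load` reads the two `.npy` arrays from alongside `tajik_fasttext.model`.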

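from gensim.models import FastText

# Load the Gensim-format model (this is not the original fastText .bin format)
model = FastText.load("tajik_fasttext.model")

# Look up the vector for a word
vector = model.wv["Точикистон"]  # Example word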
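
Beyond looking up a single vector, a common next step is to query nearest neighbors. The helper below is a minimal sketch that assumes Gensim 4.x, where the vectors object exposes `key_to_index` and `most_similar`; the function name `nearest_words` is hypothetical and not part of this repository.

```python
def nearest_words(wv, word, topn=5):
    """Return (in_vocab, neighbors) for `word` using a Gensim 4.x vectors object.

    `wv` is expected to be `model.wv` from a loaded FastText model.
    `neighbors` is a list of (word, cosine_similarity) pairs.
    """
    # True only if the word itself was seen during training.
    in_vocab = word in wv.key_to_index
    # For FastText vectors this typically also works for unseen words,
    # since their vectors are composed from character n-grams.
    neighbors = wv.most_similar(word, topn=topn)
    return in_vocab, neighbors
```

With the model loaded as shown above, `nearest_words(model.wv, "Точикистон")` returns whether the word is in the vocabulary together with its five closest neighbors. The neighbors depend on the trained vectors, so no sample output is shown here.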

If you use this model, please cite the repository:

ArabovMK, Tajik FastText Model, Hugging Face, 2025-05-08