metadata
language: tg
license: mit
tags:
- fasttext
- tajik
- word-embeddings
- nlp
Tajik FastText Word Embedding Model
This repository contains a pretrained FastText model for the Tajik language.
- Training data: Tokenized Tajik corpus
- Total tokens: 21,171,522
- Vocabulary size: 316,637
- Model type: FastText (with subword information)
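As a rough illustration of what "subword information" means, the sketch below splits a word into character n-grams the way FastText-style models do: the word is wrapped in `<` and `>` boundary markers and every substring of length `min_n` to `max_n` is collected. The helper name `char_ngrams` is hypothetical, and the values 3 and 6 are Gensim's defaults, assumed here because the n-gram settings used for this model are not stated. A real implementation also hashes the n-grams into a fixed number of buckets, which this sketch leaves out.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, FastText-style (illustrative only)."""
    # FastText marks word boundaries with angle brackets before slicing.
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


# Example with a short Tajik word and a narrower n-gram range for readability.
print(char_ngrams("китоб", min_n=3, max_n=4))
# ['<ки', 'кит', 'ито', 'тоб', 'об>', '<кит', 'кито', 'итоб', 'тоб>']
```

Because a word vector is built from the vectors of these n-grams, the model can produce an embedding even for a word that never appeared in the training corpus.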
Files Included
File | Description |
---|---|
tajik_fasttext.model | Gensim model file |
tajik_fasttext.model.wv.vectors_ngrams.npy | Subword (n-gram) vectors |
tajik_fasttext.model.wv.vectors_vocab.npy | Word vectors |
All three files are required to load the model with Gensim. Keep them in the same directory, because Gensim looks for the two .npy files next to tajik_fasttext.model.
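A missing .npy file only shows up as an error at load time, so it can help to check the download first. The following is a minimal sketch using only the standard library; the function name `missing_model_files` and the `model_dir` parameter are hypothetical, and the file names are the ones listed in the table above.

```python
from pathlib import Path

# File names taken from the "Files Included" table.
REQUIRED_FILES = [
    "tajik_fasttext.model",
    "tajik_fasttext.model.wv.vectors_ngrams.npy",
    "tajik_fasttext.model.wv.vectors_vocab.npy",
]


def missing_model_files(model_dir="."):
    """Return the required file names that are not present in `model_dir`."""
    base = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


# Check the current directory; adjust the path to wherever the files were saved.
missing = missing_model_files(".")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All model files are present.")
```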
Usage
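from gensim.models import FastText
# Load the model (the two .npy files must be in the same directory)
model = FastText.load("tajik_fasttext.model")
# Look up the embedding vector for an example word
vector = model.wv["Точикистон"]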
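Beyond single lookups, a loaded model is typically used for nearest-neighbour queries. The sketch below assumes Gensim 4.x, where the word vectors expose `key_to_index` and `most_similar`; the helper names `describe_word` and `load_and_describe` are hypothetical and not part of this repository. Gensim is imported inside the loading function so that the helper itself can be read and reused without the model files at hand.

```python
def describe_word(wv, word, topn=5):
    """Return (in_vocabulary, neighbours) for `word`.

    `wv` is expected to behave like Gensim 4.x FastText word vectors:
    it has a `key_to_index` mapping and a `most_similar` method.
    """
    # True if the word was seen during training; False means its vector
    # is composed from subword n-grams only.
    in_vocabulary = word in wv.key_to_index
    # List of (word, cosine similarity) pairs, most similar first.
    neighbours = wv.most_similar(word, topn=topn)
    return in_vocabulary, neighbours


def load_and_describe(model_path, word, topn=5):
    """Load the model from `model_path` and describe `word` (needs Gensim installed)."""
    from gensim.models import FastText  # imported lazily on purpose

    model = FastText.load(model_path)
    return describe_word(model.wv, word, topn=topn)
```

Calling `load_and_describe("tajik_fasttext.model", "Точикистон")` from the directory that holds the three model files should return a flag showing whether the word is in the vocabulary, together with its five nearest neighbours.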
Citation
If you use this model, please cite the repository:
ArabovMK, Tajik FastText Model, Hugging Face, 2025-05-08