---
language: tg
license: mit
tags:
- fasttext
- tajik
- word-embeddings
- nlp
---
# Tajik FastText Word Embedding Model

This repository contains a pretrained **FastText** model for the Tajik language.

- **Training data**: Tokenized Tajik corpus
- **Total tokens**: 21,171,522
- **Vocabulary size**: 316,637
- **Model type**: FastText (with subword information)
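
To illustrate what "subword information" means, here is a minimal sketch of how FastText-style character n-grams are cut from a word. The function name `char_ngrams` is hypothetical, and the `min_n=3`, `max_n=6` values are Gensim's defaults, assumed here because the card does not state the settings used for training. The hashing of n-grams into buckets that the real implementation performs is left out.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return FastText-style character n-grams of `word` (simplified sketch)."""
    # The word is wrapped in boundary markers so that prefixes and
    # suffixes produce n-grams distinct from word-internal ones.
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


# Trigrams only, for the Tajik word "китоб" ("book").
print(char_ngrams("китоб", min_n=3, max_n=3))
```

Because a word vector is built from the vectors of its n-grams, the model can also produce a vector for a word that never appeared in the training corpus.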
## Files Included

| File | Description |
|------|-------------|
| `tajik_fasttext.model` | Gensim model file |
| `tajik_fasttext.model.wv.vectors_ngrams.npy` | Subword (n-gram) vectors |
| `tajik_fasttext.model.wv.vectors_vocab.npy` | Word vectors |

All three files are required to load the model with Gensim. Keep them together in the same directory, because the `.model` file refers to the two `.npy` files by name.
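
Since loading fails when any of the three files is absent, a quick check before loading can save some confusion. The sketch below uses only the standard library; the helper name `missing_model_files` is hypothetical and the directory argument is an assumption about where you downloaded the files.

```python
from pathlib import Path

# File names as listed in the "Files Included" table.
REQUIRED_FILES = (
    "tajik_fasttext.model",
    "tajik_fasttext.model.wv.vectors_ngrams.npy",
    "tajik_fasttext.model.wv.vectors_vocab.npy",
)


def missing_model_files(directory="."):
    """Return the required file names that are not present in `directory`."""
    base = Path(directory)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


missing = missing_model_files(".")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All model files found.")
```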
## Usage

```python
from gensim.models import FastText

# Load the model (the .npy files must be in the same directory).
model = FastText.load("tajik_fasttext.model")
vector = model.wv["Точикистон"]  # Embedding for an example word
```
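
Beyond a single lookup, the loaded vectors can be queried for nearest neighbours. The following is a minimal sketch that assumes Gensim 4.x, where `key_to_index` and `most_similar` are available on `model.wv`; the helper name `describe_word` is hypothetical. It takes the `wv` object as an argument so it can be tried without reloading the model.

```python
def describe_word(wv, word, topn=5):
    """Summarise `word` using a Gensim 4.x FastText keyed-vectors object."""
    return {
        "word": word,
        # key_to_index contains only words seen during training; FastText
        # can still build a vector for other words from their n-grams.
        "in_vocabulary": word in wv.key_to_index,
        # most_similar returns (word, cosine similarity) pairs.
        "neighbours": wv.most_similar(word, topn=topn),
    }


# Usage, assuming the model files are in the working directory:
# from gensim.models import FastText
# model = FastText.load("tajik_fasttext.model")
# print(describe_word(model.wv, "Точикистон"))
```

Checking `key_to_index` rather than `word in model.wv` is deliberate: with subword models the latter does not tell you whether the word itself was in the training vocabulary.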
## Citation

If you use this model, please cite the repository:

> ArabovMK, Tajik FastText Model, Hugging Face, 2025-05-08