language: tg
license: mit
tags:
  - fasttext
  - tajik
  - word-embeddings
  - nlp

Tajik FastText Word Embedding Model

This repository contains a pretrained FastText model for the Tajik language.

  • Training data: Tokenized Tajik corpus
  • Total tokens: 21,171,522
  • Vocabulary size: 316,637
  • Model type: FastText (with subword information)
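
To illustrate what "subword information" means, here is a minimal sketch of how FastText-style character n-grams are formed: the word is wrapped in `<` and `>` boundary markers and split into overlapping character windows. The helper name `char_ngrams` is hypothetical, and the `min_n`/`max_n` values are assumptions (3 and 6 are the usual FastText defaults); the settings actually used to train this model are not stated here.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return character n-grams of `word`, FastText style.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    are distinguishable from n-grams in the middle of the word.
    min_n and max_n are assumed values, not this model's known settings.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of length n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


# Example with a short Tajik word and a narrow n-gram range.
print(char_ngrams("китоб", min_n=3, max_n=4))
# ['<ки', 'кит', 'ито', 'тоб', 'об>', '<кит', 'кито', 'итоб', 'тоб>']
```

Because a word's vector is built from the vectors of its n-grams, the model can also produce a vector for a word that never appeared in the training corpus.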

Files Included

File                                          Description
tajik_fasttext.model                          Gensim model file
tajik_fasttext.model.wv.vectors_ngrams.npy    Subword (n-gram) vectors
tajik_fasttext.model.wv.vectors_vocab.npy     Word vectors

All three files are required. Keep them in the same directory: when Gensim loads tajik_fasttext.model, it reads the two .npy files from alongside it.
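
Before loading, it can help to confirm that the download is complete. The following is a small sketch, not part of the repository; the helper name `missing_model_files` and the default directory are hypothetical, while the file names are the ones listed above.

```python
from pathlib import Path

# File names as listed in this README.
REQUIRED_FILES = (
    "tajik_fasttext.model",
    "tajik_fasttext.model.wv.vectors_ngrams.npy",
    "tajik_fasttext.model.wv.vectors_vocab.npy",
)


def missing_model_files(model_dir="."):
    """Return the required file names that are not present in model_dir."""
    base = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]


missing = missing_model_files(".")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All model files are present.")
```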

Usage

from gensim.models import FastText

model = FastText.load("tajik_fasttext.model")
vector = model.wv["Тоҷикистон"]  # Example word ("Tajikistan")
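
A common next step is comparing two word vectors. The sketch below computes cosine similarity by hand so the arithmetic is visible; `cosine_similarity` and `compare_words` are hypothetical helper names, and the Gensim import is done lazily so the similarity function can be used without Gensim installed.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; report no similarity.
        return 0.0
    return dot / (norm_a * norm_b)


def compare_words(model_path, word_a, word_b):
    """Load the model and return the cosine similarity of two words."""
    from gensim.models import FastText  # requires gensim to be installed

    model = FastText.load(model_path)
    return cosine_similarity(model.wv[word_a], model.wv[word_b])


# Usage, assuming the model files are in the current directory:
# score = compare_words("tajik_fasttext.model", "китоб", "мактаб")
```

Gensim also provides this directly: `model.wv.similarity(word_a, word_b)` returns the cosine similarity, and `model.wv.most_similar(word, topn=5)` lists the nearest neighbours of a word.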

Citation

If you use this model, please cite the repository:

ArabovMK, Tajik FastText Model, Hugging Face, 2025-05-08