Tatar2Vec: Word Embeddings for the Tatar Language
This repository contains a collection of pre-trained word embedding models for the Tatar language. The models are trained on a large Tatar corpus using two popular algorithms: Word2Vec and FastText, with different architectures and vector sizes.
All models are ready to use with the gensim library and can be easily downloaded via the Hugging Face Hub.
📦 Available Models
The following models are included:
| Model Name | Type | Architecture | Vector Size | #Vectors | Notes |
|---|---|---|---|---|---|
| w2v_cbow_100 | Word2Vec | CBOW | 100 | 1.29M | Best overall for semantic analogy tasks |
| w2v_cbow_200 | Word2Vec | CBOW | 200 | 1.29M | Higher dimensionality, more expressive |
| w2v_sg_100 | Word2Vec | Skip-gram | 100 | 1.29M | Often better for rare words |
| ft_cbow_100 | FastText | CBOW | 100 | 1.29M | Handles subword information, good for morphology |
| ft_cbow_200 | FastText | CBOW | 200 | 1.29M | Larger FastText model |
All models share the same vocabulary of 1,293,992 unique tokens, achieving 100% coverage on the training corpus.
📁 Repository Structure
The files are organised in subdirectories for easy access:
```
Tatar2Vec/
├── word2vec/
│   ├── cbow100/    # w2v_cbow_100 model files
│   ├── cbow200/    # w2v_cbow_200 model files
│   └── sg100/      # w2v_sg_100 model files
└── fasttext/
    ├── cbow100/    # ft_cbow_100 model files
    └── cbow200/    # ft_cbow_200 model files
```
Each model folder contains the files saved by gensim (`.model`, `.npy` vector arrays, etc.).
🚀 Usage
Installation
First, install the required libraries:
```bash
pip install huggingface_hub gensim
```
Download a Model
Use `snapshot_download` to download all files of a specific model to a local directory:
```python
from huggingface_hub import snapshot_download
import gensim
import os

# Download the Word2Vec CBOW 100 model (best overall)
model_path = snapshot_download(
    repo_id="TatarNLPWorld/Tatar2Vec",
    allow_patterns="word2vec/cbow100/*",  # only download this model
    local_dir="./tatar2vec_cbow100",      # optional local folder
)

# Load the model with gensim
model_file = os.path.join(model_path, "word2vec/cbow100/w2v_cbow_100.model")
model = gensim.models.Word2Vec.load(model_file)

# Test it
print(model.wv.most_similar("татар"))
```
Alternatively, you can download the whole repository or individual files using `hf_hub_download`.
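A single-file download might look like the sketch below (the filename is assumed from the repository layout above; note that gensim `.model` files saved alongside `.npy` arrays need those sidecar files too, so `snapshot_download` with an `allow_patterns` filter is usually the safer choice):

```python
from huggingface_hub import hf_hub_download

# Fetch one file from the repo; the exact filename below is assumed
# from the directory layout shown earlier in this README.
path = hf_hub_download(
    repo_id="TatarNLPWorld/Tatar2Vec",
    filename="word2vec/sg100/w2v_sg_100.model",
)
print(path)  # local cache path of the downloaded file
```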
📊 Model Comparison
We evaluated all models on a set of intrinsic tasks:
- Word analogies (e.g., Мәскәү : Россия = Казан : ?)
- Semantic similarity (cosine similarity of related word pairs)
- Out-of-vocabulary (OOV) handling (for FastText)
- Nearest neighbours inspection
The Word2Vec CBOW (100-dim) model performed best overall, especially on analogy tasks (60% accuracy vs. 0% for FastText). Below is a summary of the key metrics:
| Metric | Word2Vec (cbow100) | FastText (cbow100) |
|---|---|---|
| Analogy accuracy | 60.0% | 0.0% |
| Avg. semantic similarity | 0.568 | 0.582 |
| OOV handling | N/A | Good (subword) |
| Vocabulary coverage | 100% | 100% |
| Training time | 1760s | 3323s |
Why Word2Vec? It produces cleaner nearest neighbours (actual words without punctuation artifacts) and captures semantic relationships more accurately. FastText, while slightly better on raw similarity, tends to return noisy forms with attached punctuation.
For a detailed report, see the model comparison results (included in the repository).
📝 License
All models are released under the MIT License. You are free to use, modify, and distribute them for any purpose, with proper attribution.
📜 Certificate
This software (Tatar2Vec) is registered with the Federal Service for Intellectual Property (Rospatent) under the following certificate:
- Certificate number: 2026610619
- Title: Tatar2Vec
- Filing date: December 23, 2025
- Publication date: January 14, 2026
- Author: Mullosharaf K. Arabov
- Applicant: Kazan Federal University
Certificate of state registration of a computer program No. 2026610619, Russian Federation. Tatar2Vec: filed 23 December 2025; published 14 January 2026 / M. K. Arabov; applicant: Federal State Autonomous Educational Institution of Higher Education "Kazan Federal University".
🤝 Citation
If you use these models in your research, please cite the software registration:
```bibtex
@software{tatar2vec_2026,
  title     = {Tatar2Vec},
  author    = {Arabov, Mullosharaf Kurbonovich},
  year      = {2026},
  publisher = {Kazan Federal University},
  note      = {Registered software, Certificate No. 2026610619},
  url       = {https://huggingface.co/TatarNLPWorld/Tatar2Vec}
}
```
🌐 Language
The models are trained on Tatar text and are intended for use with the Tatar language (language code tt).
🙌 Acknowledgements
These models were trained by TatarNLPWorld as part of an effort to advance NLP resources for the Tatar language. We thank the open-source community for the tools and libraries that made this work possible.