Tatar2Vec: Word Embeddings for the Tatar Language
This repository contains a collection of pre-trained word embedding models for the Tatar language. The models are trained on a large Tatar corpus using two popular algorithms: Word2Vec and FastText, with different architectures and vector sizes.
All models are ready to use with the gensim library and can be easily downloaded via the Hugging Face Hub.
📦 Available Models
The following models are included:
| Model Name | Type | Architecture | Vector Size | #Vectors | Notes |
|---|---|---|---|---|---|
| w2v_cbow_100 | Word2Vec | CBOW | 100 | 1.29M | Best overall for semantic analogy tasks |
| w2v_cbow_200 | Word2Vec | CBOW | 200 | 1.29M | Higher dimensionality, more expressive |
| w2v_sg_100 | Word2Vec | Skip-gram | 100 | 1.29M | Often better for rare words |
| ft_cbow_100 | FastText | CBOW | 100 | 1.29M | Handles subword information, good for morphology |
| ft_cbow_200 | FastText | CBOW | 200 | 1.29M | Larger FastText model |
All models share the same vocabulary of 1,293,992 unique tokens, achieving 100% coverage on the training corpus.
📁 Repository Structure
The files are organised in subdirectories for easy access:
```
Tatar2Vec/
├── word2vec/
│   ├── cbow100/    # w2v_cbow_100 model files
│   ├── cbow200/    # w2v_cbow_200 model files
│   └── sg100/      # w2v_sg_100 model files
└── fasttext/
    ├── cbow100/    # ft_cbow_100 model files
    └── cbow200/    # ft_cbow_200 model files
```
Each model folder contains the files saved by gensim (`.model`, `.npy` vector arrays, etc.).
🚀 Usage
Installation
First, install the required libraries:
```bash
pip install huggingface_hub gensim
```
Download a Model
Use `snapshot_download` to download all files of a specific model to a local directory:
```python
from huggingface_hub import snapshot_download
import gensim
import os

# Download the Word2Vec CBOW 100 model (best overall)
model_path = snapshot_download(
    repo_id="TatarNLPWorld/Tatar2Vec",
    allow_patterns="word2vec/cbow100/*",  # only download this model
    local_dir="./tatar2vec_cbow100",      # optional local folder
)

# Load the model with gensim
model_file = os.path.join(model_path, "word2vec/cbow100/w2v_cbow_100.model")
model = gensim.models.Word2Vec.load(model_file)

# Test it
print(model.wv.most_similar("татар"))
```
Alternatively, you can download the whole repository or individual files using `hf_hub_download`.
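A single-file download might look like the sketch below (the filename is assumed from the repository layout above; note that gensim `.model` files saved alongside `.npy` arrays need those sidecar files too, so `snapshot_download` with an `allow_patterns` filter is usually the safer choice):

```python
from huggingface_hub import hf_hub_download

# Fetch one file from the repo; the exact filename below is assumed
# from the directory layout shown earlier in this README.
path = hf_hub_download(
    repo_id="TatarNLPWorld/Tatar2Vec",
    filename="word2vec/sg100/w2v_sg_100.model",
)
print(path)  # local cache path of the downloaded file
```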
📊 Model Comparison
We evaluated all models on a set of intrinsic tasks:
- Word analogies (e.g., Мәскәү : Россия = Казан : ?)
- Semantic similarity (cosine similarity of related word pairs)
- Out-of-vocabulary (OOV) handling (for FastText)
- Nearest neighbours inspection
The Word2Vec CBOW (100-dim) model performed best overall, especially on analogy tasks (60% accuracy vs. 0% for FastText). Below is a summary of the key metrics:
| Metric | Word2Vec (cbow100) | FastText (cbow100) |
|---|---|---|
| Analogy accuracy | 60.0% | 0.0% |
| Avg. semantic similarity | 0.568 | 0.582 |
| OOV handling | N/A | Good (subword) |
| Vocabulary coverage | 100% | 100% |
| Training time | 1760s | 3323s |
Why Word2Vec? It produces cleaner nearest neighbours (actual words without punctuation artifacts) and captures semantic relationships more accurately. FastText, while slightly better on raw similarity, tends to return noisy forms with attached punctuation.
For a detailed report, see the model comparison results (included in the repository).
📝 License
All models are released under the MIT License. You are free to use, modify, and distribute them for any purpose, with proper attribution.
📜 Certificate
This software (Tatar2Vec) is registered with the Federal Service for Intellectual Property (Rospatent) under the following certificate:
- Certificate number: 2026610619
- Title: Tatar2Vec
- Filing date: December 23, 2025
- Publication date: January 14, 2026
- Author: Mullosharaf K. Arabov
- Applicant: Kazan Federal University
Certificate of state registration of a computer program No. 2026610619, Russian Federation. Tatar2Vec: filed 23 December 2025; published 14 January 2026 / M. K. Arabov; applicant: Federal State Autonomous Educational Institution of Higher Education "Kazan Federal University".
🤝 Citation
If you use these models in your research, please cite the software registration:
```bibtex
@software{tatar2vec_2026,
  title     = {Tatar2Vec},
  author    = {Arabov, Mullosharaf Kurbonovich},
  year      = {2026},
  publisher = {Kazan Federal University},
  note      = {Registered software, Certificate No. 2026610619},
  url       = {https://huggingface.co/TatarNLPWorld/Tatar2Vec}
}
```
🌐 Language
The models are trained on Tatar text and are intended for use with the Tatar language (language code tt).
🙌 Acknowledgements
These models were trained by TatarNLPWorld as part of an effort to advance NLP resources for the Tatar language. We thank the open-source community for the tools and libraries that made this work possible.