MiniDalaLM - Embedding Extractor for Latin Kazakh 🇰🇿

'Dala' means 'steppe' in Kazakh - a nod to where the voice of this model might echo.

MiniDalaLM is a fine-tuned version of paraphrase-multilingual-MiniLM-L12-v2, trained to extract embeddings from Kazakh text written in the officially adopted Latin script based on the 2021 alphabet reform. It is meant to serve as a foundational model to be improved upon as needed and used alongside its more powerful transliteration-based cousin, DalaT5.

⚠️ Limitations

May produce unexpected outputs for very short inputs or mixed-script text
Accuracy may vary across dialects or uncommon characters
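
One way to guard against the first limitation is a cheap pre-check before encoding. The sketch below is not part of MiniDalaLM: the function names and the 10-letter threshold are hypothetical, and the script detection simply relies on Unicode character names from Python's standard library.

```python
import unicodedata


def script_counts(text):
    """Count alphabetic characters by script, using Unicode character names."""
    counts = {"latin": 0, "cyrillic": 0, "other": 0}
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, whitespace, combining marks
        name = unicodedata.name(ch, "")
        if name.startswith("LATIN"):
            counts["latin"] += 1
        elif name.startswith("CYRILLIC"):
            counts["cyrillic"] += 1
        else:
            counts["other"] += 1
    return counts


def input_warnings(text, min_letters=10):
    """Return a list of warnings for inputs the model may handle poorly.

    min_letters is an illustrative threshold, not a value from the model card.
    """
    counts = script_counts(text)
    warnings = []
    if sum(counts.values()) < min_letters:
        warnings.append("very short input")
    if counts["latin"] > 0 and counts["cyrillic"] > 0:
        warnings.append("mixed Latin/Cyrillic text")
    return warnings
```

Calling `input_warnings(sentence)` before `model.encode` returns an empty list for ordinary Latin Kazakh sentences, so flagged inputs can be logged or skipped as needed.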

🧠 Purpose

Much like DalaT5, this model wasn’t built for production-grade embedding extraction or for linguistic study alone.

It was born from something else:

A deep respect for Kazakh culture
A belief that no language should ever be forgotten
A desire to aid the country's modernisation efforts through AI

I'm not Kazakh, but I believe that there is beauty in helping those who may be in need - with the sole expectation being that it may prove useful to them. So, I help and give away freely.

🌍 Жоба туралы / About the Project

🏕 Қазақша

MiniDalaLM - Қазақстанның ұлттық модернизациялау күш-жігерін қолдауға арналған, қазақша латын деректеріне дәл бапталған трансформатор. Модель ендірілгендер арқылы мәтіндік мүмкіндіктерді шығаруға бағытталған, бұл оны күшті лингвистикалық құралдардың негізі ретінде тамаша етеді.

Бұл жоба:

AI жүйесінде аз ұсынылған тілдерге қолдау көрсетеді
Қазақтың латыншаланған болашағына ашық қолжетімділік ұсынады
Шетелдік – кішіпейілділікпен, ізденімпаздықпен, терең қамқорлықпен жасаған

🌐 English

MiniDalaLM is a transformer fine-tuned on Kazakh Latin data, designed to support Kazakhstan’s national modernisation efforts. The model focuses on textual feature extraction via embeddings, making it ideal as the backbone of more powerful linguistic tools.

This project:

Supports underrepresented languages in AI
Offers open access to the Latinised future of Kazakh
Was created by a foreigner - with humility, curiosity, and deep care

💻 Байқап көріңіз / Try it out

Hugging Face 🤗 Sentence Transformers арқылы тікелей пайдаланыңыз / Use directly via Hugging Face 🤗 Sentence Transformers:

from sentence_transformers import SentenceTransformer

# Downloads the model from the Hugging Face Hub on first use
model = SentenceTransformer("crossroderick/minidalalm")

sentences = [
    "Vakkaritstso-Albaneze (ital. 'Vaccarizzo Albanese') — Italiiadağy kommuna, Kalabriia äkımşılık aimağyna qarasty Kozentsa provintsiiasynda ornalasqan.",
    "Qalanyñ tūraqty tūrğyndarynyñ sany 1236 adamdy qūraidy (2008). Halyq tyğyzdyğy 154 adam/km². Alyp jatqan jer aumağy 8 km² şamasynda. Poşta indeksı — 87060.",
    "Eldı mekennıñ qamqorşysy — Madonna di Costantinopoli.",
]

# encode() returns one embedding vector per input sentence
embeddings = model.encode(sentences)

print(embeddings)
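
A common next step is to compare the embeddings with cosine similarity. The following is a minimal sketch rather than part of the model's own tooling: `cosine_similarity`, `rank_by_similarity`, and `demo` are hypothetical helper names, and the similarity is written in plain Python so the logic is visible (Sentence Transformers also ships its own helpers, such as `sentence_transformers.util.cos_sim`).

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (lists or arrays)."""
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(float(y) ** 2 for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, corpus_vecs):
    """Return (index, score) pairs for corpus_vecs, most similar first."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def demo():
    """Rank two sentences against a query; needs sentence-transformers and network access."""
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("crossroderick/minidalalm")
    sentences = [
        "Qalanyñ tūraqty tūrğyndarynyñ sany 1236 adamdy qūraidy (2008).",
        "Halyq tyğyzdyğy 154 adam/km².",
        "Alyp jatqan jer aumağy 8 km² şamasynda.",
    ]
    embeddings = model.encode(sentences)
    # Compare the first sentence against the remaining ones
    for idx, score in rank_by_similarity(embeddings[0], embeddings[1:]):
        print(f"{score:.3f}  {sentences[idx + 1]}")
```

Calling `demo()` downloads the model and prints the remaining sentences ordered by similarity to the first one; the two helper functions work on any pair of numeric vectors.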

🙏 Алғыс / Acknowledgements

Тәуелсіз жоба болғанына қарамастан, MiniDalaLM өте маңызды үш деректер жиынтығын пайдаланады / Despite being an independent project, MiniDalaLM makes use of three very important datasets:

The first ~50 thousand records of the Kazakh subset of the CC100 dataset by Conneau et al. (2020)
The first ~55 thousand records of the raw, Kazakh-focused part of the Kazakh Parallel Corpus (KazParC) from Nazarbayev University's Institute of Smart Systems and Artificial Intelligence (ISSAI), graciously made available on Hugging Face
The Wikipedia dump of articles in the Kazakh language, obtained via the wikiextractor Python package

🤖 Нақты баптау нұсқаулары / Fine-tuning instructions

Деректер жиынының жалпы өлшемін ескере отырып, олар осы үлгінің репозиторийіне қосылмаған. Дегенмен, MiniDalaLM-ті өзіңіз дәл баптағыңыз келсе, келесі әрекеттерді орындаңыз / Given the total size of the datasets, they haven't been included in this model's repository. However, should you wish to fine-tune MiniDalaLM yourself, please do the following:

get_data.sh қабық сценарий файлын "src/data" қалтасында іске қосыңыз / Run the get_data.sh shell script file in the "src/data" folder
Сол қалтадағы generate_lat_pairs.py файлын іске қосыңыз / Run the generate_lat_pairs.py file in the same folder
Қазақ корпус файлын тазалау және деректер жинағын араластыру үшін generate_clean_corpus.sh іске қосыңыз / Run generate_clean_corpus.sh to clean the Kazakh corpus file and shuffle the dataset

KazParC деректер жинағын жүктеп алу үшін сізге Hugging Face есептік жазбасы қажет екенін ескеріңіз. Бұған қоса, жүктеп алуды бастау үшін өзіңізді аутентификациялау үшін huggingface-cli орнатуыңыз қажет. Бұл туралы толығырақ мына жерден оқыңыз / Please note that you'll need a Hugging Face account to download the KazParC dataset. Additionally, you'll need to install huggingface-cli to authenticate yourself for the download to commence. Read more about it here.

Егер сіз Windows жүйесінде болсаңыз, get_data.sh сценарийі жұмыс істемеуі мүмкін. Дегенмен, файлдағы сілтемелерді орындап, ондағы қадамдарды қолмен орындау арқылы әлі де деректерді алуға болады. Сол сияқты, generate_clean_corpus.sh файлында да қате пайда болады, бұл сізге kazakh_latin_pairs.json файлындағы бос жолдарды сүзу, сондай-ақ оны араластыру үшін баламалы Windows функциясын табуды талап етеді. Бұған қоса, wikiextractor және sentence-transformers бумаларын алдын ала орнатуды ұмытпаңыз (нақты нұсқаларды requirements.txt файлынан табуға болады) / If you're on Windows, the get_data.sh script likely won't work. However, you can still get the data by following the links in the file and performing its steps manually. Likewise, generate_clean_corpus.sh will also fail, so you'll need a Windows equivalent to filter out blank lines in the kazakh_latin_pairs.json file and to shuffle it. Additionally, be sure to install the wikiextractor and sentence-transformers packages beforehand (the exact versions can be found in the requirements.txt file).
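
For Windows users, the blank-line filtering and shuffling can be done with a few lines of Python instead of shell tools. This is a stand-in sketch, not a port of generate_clean_corpus.sh: it assumes kazakh_latin_pairs.json stores one record per line (which the blank-line filtering suggests but the text above does not confirm), and the function name, seed, and output filename are hypothetical.

```python
import random


def clean_and_shuffle(src_path, dst_path, seed=42):
    """Drop blank lines from src_path, shuffle the rest, write them to dst_path.

    Returns the number of lines written. Assumes one record per line.
    """
    with open(src_path, encoding="utf-8") as src:
        # Keep only lines that contain something other than whitespace
        lines = [line.rstrip("\r\n") for line in src if line.strip()]

    # A seeded generator keeps the shuffle reproducible between runs
    random.Random(seed).shuffle(lines)

    with open(dst_path, "w", encoding="utf-8", newline="\n") as dst:
        for line in lines:
            dst.write(line + "\n")
    return len(lines)
```

For example, `clean_and_shuffle("kazakh_latin_pairs.json", "kazakh_latin_pairs_clean.json")` writes the cleaned, shuffled copy next to the original. The whole file is read into memory, which keeps the sketch short but may need rethinking for very large corpora.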

📋 Өзгеріс журналы / Changelog

MiniDalaLM v1: 5 мамырда жөнделді және сол күні қолжетімді болды. Қазақ морфологиясына тез бейімделіп, өзінің негізгі үлгісінің табиғатын пайдаланған бастапқы нұсқа / Fine-tuned on May 5 and made available on the same day. Initial version that benefitted from the nature of its base model, quickly adapting to Kazakh morphology

📚 Несиелер / Credits

Егер сіз MiniDalaLM-ті зерттеуде немесе туынды жұмыстарда қолдансаңыз - біріншіден, рахмет. Екіншіден, егер сіз қаласаңыз, дәйексөз келтіріңіз / If you use MiniDalaLM in research or derivative works - first off, thank you. Secondly, should you be willing, feel free to cite:

@misc{pereira_cruz_dalat5_2025,
  author = {Rodrigo Pereira Cruz},
  title = {MiniDalaLM: Feature extraction on Latin Kazakh via embeddings},
  year = 2025,
  url = {https://huggingface.co/crossroderick/minidalalm},
  doi = {10.57967/hf/5369},
  publisher = {Hugging Face}
}
