Llama-3.2-1B-Kannada-Tokenizer
This repository describes a custom tokenizer that extends the original Llama 3.2-1B tokenizer with enhanced support for the Kannada language.
Model Details
Model Description
This tokenizer is an extended version of the meta-llama/Llama-3.2-1B tokenizer, augmented with a custom vocabulary derived from a large Kannada text corpus using the SentencePiece Unigram algorithm. The primary goal of this extension is to provide more efficient and accurate tokenization of Kannada text, which is crucial for improving the performance of Large Language Models (LLMs) on Kannada-specific tasks. It retains the full original Llama 3.2-1B vocabulary, ensuring backward compatibility with models pre-trained on the original tokenizer.
- Developed by: Manjunath S N
- Model type: Llama 3.2-1B tokenizer extended with SentencePiece Unigram-trained Kannada tokens
- Language(s) (NLP): English (base), Kannada (enhanced), and other languages supported by the base Llama 3.2-1B tokenizer.
- License: MIT License
- Extended from tokenizer: meta-llama/Llama-3.2-1B
Uses
Direct Use
This tokenizer is designed for:
- Preprocessing Kannada text for the Llama 3.2-1B model.
- Analyzing tokenization patterns in Kannada text.
- Providing efficient subword segmentation for Kannada in NLP pipelines.
Downstream Use
This tokenizer is a critical component for:
- Training or fine-tuning any Llama 3.2-compatible LLM to achieve high performance in Kannada.
- Developing applications that require accurate and efficient tokenization of Kannada text (e.g., search, sentiment analysis, text classification).
- Research on multilingual tokenization and subword units for Indic languages.
Out-of-Scope Use
- Using this tokenizer with models whose embedding layers have not been resized to accommodate the extended vocabulary; the added Kannada token ids would fall outside the original embedding matrix, leading to errors or meaningless outputs.
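To illustrate the required resize, here is a minimal sketch using the published model and tokenizer ids. The `load_resized` helper is illustrative only and not part of this repository:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_resized(model_id="meta-llama/Llama-3.2-1B",
                 tokenizer_id="imanjunathn/Llama-3.2-1B-Kannada-Tokenizer"):
    """Load the base model and grow its embedding matrix to the extended vocabulary."""
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    # Without this resize, the added Kannada token ids would index past the
    # end of the original embedding matrix.
    model.resize_token_embeddings(len(tokenizer))
    return model, tokenizer
```

After loading, the resized model can be fine-tuned so the newly initialized embedding rows learn useful representations for the added Kannada tokens.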
How to Get Started with the Model
Use the code below to get started with the model.
```python
from transformers import AutoTokenizer

tokenizer_id = "imanjunathn/Llama-3.2-1B-Kannada-Tokenizer"

# Load the custom tokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)

# Example tokenization in Kannada
test_text_kannada = "ನಮಸ್ತೆ, ಇದು ಕನ್ನಡದಲ್ಲಿ ಹೊಸ ಟೋಕನೈಸರ್ ಆಗಿದೆ. ಕನ್ನಡವು ಒಂದು ಸುಂದರ ಭಾಷೆ. ಬೆಂಗಳೂರು ಕರ್ನಾಟಕದ ರಾಜಧಾನಿ."
test_tokens = tokenizer.tokenize(test_text_kannada)

print(f"Original Kannada text: '{test_text_kannada}'")
print(f"Tokenized output (adapted tokenizer): {test_tokens}")
print(f"Decoded output: '{tokenizer.decode(tokenizer.encode(test_text_kannada), skip_special_tokens=True)}'")

# Check vocabulary size
print(f"Tokenizer Vocabulary Size: {len(tokenizer)}")
```
Training Details
Training Procedure
- Tokenizer Algorithm: SentencePiece (Unigram model), trained on the Kannada corpus.
- Vocabulary Extension Process:
  1. The base meta-llama/Llama-3.2-1B tokenizer was loaded.
  2. A new SentencePiece Unigram model was trained on the preprocessed Kannada dataset with a target vocabulary size sufficient to capture Kannada subwords effectively.
  3. Tokens generated by the new Kannada SentencePiece model were compared against the existing Llama 3.2-1B vocabulary.
  4. Only unique Kannada tokens (i.e., those not already present in the Llama 3.2-1B vocabulary) were extracted.
  5. These unique Kannada tokens were then added to the original Llama 3.2-1B tokenizer's vocabulary using tokenizer.add_tokens().
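The filtering step above amounts to a simple set-membership check. A sketch with illustrative names (the actual extension script is not published):

```python
def select_new_tokens(new_pieces, base_vocab):
    """Keep only pieces absent from the base vocabulary, preserving order
    and dropping duplicates among the new pieces themselves."""
    seen = set()
    out = []
    for piece in new_pieces:
        if piece not in base_vocab and piece not in seen:
            seen.add(piece)
            out.append(piece)
    return out

# Toy illustration: only the Kannada pieces missing from the base survive.
base_vocab = {"hello": 0, "world": 1}
pieces = ["hello", "ಕನ್ನಡ", "world", "ಭಾಷೆ"]
print(select_new_tokens(pieces, base_vocab))  # ['ಕನ್ನಡ', 'ಭಾಷೆ']
```

The surviving pieces would then be passed to `tokenizer.add_tokens()`, which appends them after the original vocabulary so existing token ids remain unchanged.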
Training Hyperparameters (for new SPM):
- Model Type: unigram
- Vocabulary Size: 32000
- Character Coverage: 1.0 (recommended for languages with rich character sets like Kannada)
Summary
This custom tokenizer extends the powerful Llama 3.2-1B tokenizer with a specialized Kannada vocabulary. By leveraging the Unigram algorithm on a dedicated Kannada corpus, it provides more granular and culturally relevant subword segmentation for Kannada text.