Llama-3.2-1B-Kannada-Tokenizer
This repository describes a custom tokenizer that extends the original Llama 3.2-1B tokenizer with enhanced support for the Kannada language.
Model Details
Model Description
This tokenizer is an extended version of the meta-llama/Llama-3.2-1B tokenizer, augmented with a custom vocabulary derived from a large Kannada text corpus using the SentencePiece Unigram algorithm. The primary goal of this extension is to provide more efficient and accurate tokenization of Kannada text, which is crucial for improving the performance of Large Language Models (LLMs) on Kannada-specific tasks. It retains the full original Llama 3.2-1B vocabulary, ensuring backward compatibility with models pre-trained on the original tokenizer.
- Developed by: Manjunath S N
- Model type: Llama 3.2-1B tokenizer extended with SentencePiece Unigram-trained Kannada tokens
- Language(s) (NLP): English (base), Kannada (enhanced), and other languages supported by the base Llama 3.2-1B tokenizer.
- License: MIT License
- Extended from tokenizer: meta-llama/Llama-3.2-1B
Uses
Direct Use
This tokenizer is designed for:
- Preprocessing Kannada text for the Llama 3.2-1B model.
- Analyzing tokenization patterns in Kannada text.
- Providing efficient subword segmentation for Kannada in NLP pipelines.
Downstream Use
This tokenizer is a critical component for:
- Training or fine-tuning any Llama 3.2-compatible LLM to achieve high performance in Kannada.
- Developing applications that require accurate and efficient tokenization of Kannada text (e.g., search, sentiment analysis, text classification).
- Research on multilingual tokenization and subword units for Indic languages.
Out-of-Scope Use
- Using this tokenizer with models whose embedding layers have not been resized to accommodate the extended vocabulary; the added Kannada token ids would fall outside the original embedding matrix, leading to errors or meaningless outputs.
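To illustrate the required resize, here is a minimal sketch using the published model and tokenizer ids. The `load_resized` helper is illustrative only and not part of this repository:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_resized(model_id="meta-llama/Llama-3.2-1B",
                 tokenizer_id="imanjunathn/Llama-3.2-1B-Kannada-Tokenizer"):
    """Load the base model and grow its embedding matrix to the extended vocabulary."""
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    # Without this resize, the added Kannada token ids would index past the
    # end of the original embedding matrix.
    model.resize_token_embeddings(len(tokenizer))
    return model, tokenizer
```

After loading, the resized model can be fine-tuned so the newly initialized embedding rows learn useful representations for the added Kannada tokens.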
How to Get Started with the Model
Use the code below to get started with the model.
```python
from transformers import AutoTokenizer

tokenizer_id = "imanjunathn/Llama-3.2-1B-Kannada-Tokenizer"

# Load the custom tokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)

# Example tokenization in Kannada
test_text_kannada = "ನಮಸ್ತೆ, ಇದು ಕನ್ನಡದಲ್ಲಿ ಹೊಸ ಟೋಕನೈಸರ್ ಆಗಿದೆ. ಕನ್ನಡವು ಒಂದು ಸುಂದರ ಭಾಷೆ. ಬೆಂಗಳೂರು ಕರ್ನಾಟಕದ ರಾಜಧಾನಿ."
test_tokens = tokenizer.tokenize(test_text_kannada)

print(f"Original Kannada text: '{test_text_kannada}'")
print(f"Tokenized output (adapted tokenizer): {test_tokens}")
print(f"Decoded output: '{tokenizer.decode(tokenizer.encode(test_text_kannada), skip_special_tokens=True)}'")

# Check vocabulary size
print(f"Tokenizer Vocabulary Size: {len(tokenizer)}")
```
Training Details
Training Procedure
- Tokenizer Algorithm: SentencePiece (Unigram model), trained on the Kannada corpus.
- Vocabulary Extension Process:
  1. The base meta-llama/Llama-3.2-1B tokenizer was loaded.
  2. A new SentencePiece Unigram model was trained on the preprocessed Kannada dataset with a target vocabulary size sufficient to capture Kannada subwords effectively.
  3. Tokens generated by the new Kannada SentencePiece model were compared against the existing Llama 3.2-1B vocabulary.
  4. Only unique Kannada tokens (i.e., those not already present in the Llama 3.2-1B vocabulary) were extracted.
  5. These unique Kannada tokens were then added to the original Llama 3.2-1B tokenizer's vocabulary using tokenizer.add_tokens().
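The filtering step above amounts to a simple set-membership check. A sketch with illustrative names (the actual extension script is not published):

```python
def select_new_tokens(new_pieces, base_vocab):
    """Keep only pieces absent from the base vocabulary, preserving order
    and dropping duplicates among the new pieces themselves."""
    seen = set()
    out = []
    for piece in new_pieces:
        if piece not in base_vocab and piece not in seen:
            seen.add(piece)
            out.append(piece)
    return out

# Toy illustration: only the Kannada pieces missing from the base survive.
base_vocab = {"hello": 0, "world": 1}
pieces = ["hello", "ಕನ್ನಡ", "world", "ಭಾಷೆ"]
print(select_new_tokens(pieces, base_vocab))  # ['ಕನ್ನಡ', 'ಭಾಷೆ']
```

The surviving pieces would then be passed to `tokenizer.add_tokens()`, which appends them after the original vocabulary so existing token ids remain unchanged.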
Training Hyperparameters (for new SPM):
- Model Type: unigram
- Vocabulary Size: 32000
- Character Coverage: 1.0 (recommended for languages with rich character sets like Kannada)
Summary
This custom tokenizer extends the powerful Llama 3.2-1B tokenizer with a specialized Kannada vocabulary. By leveraging the Unigram algorithm on a dedicated Kannada corpus, it provides more granular and culturally relevant subword segmentation for Kannada text.