Sinhala BPE Tokenizer
A Byte-Pair Encoding (BPE) tokenizer tailored for Sinhala text, designed for seamless integration with Hugging Face's Transformers library. Optimized for modern NLP tasks, it ensures efficient and accurate tokenization of Sinhala language data.
π Overview
- Tokenizer Type: Byte-Pair Encoding (BPE)
- Vocabulary Size: 32,000 tokens
- Training Dataset: Navanjana/sinhala-articles
- Language: Sinhala (ΰ·ΰ·ΰΆΰ·ΰΆ½)
- License: Apache 2.0
π§ Features
- Utilizes Unicode NFD normalization for accurate Sinhala character segmentation
- Employs whitespace and punctuation pre-tokenization for precise token boundaries
- Fully compatible with Hugging Face's
transformers
andtokenizers
libraries - Optimized for tasks like text generation, classification, and translation
π§Ή Special Tokens
Token | Purpose | ||
---|---|---|---|
<bos> |
Beginning of sequence | ||
<eos> |
End of sequence | ||
</s> |
Separator token | ||
<unk> |
Unknown token | ||
<pad> |
Padding token | ||
[MASK] |
Masking token for MLM tasks | ||
`< | startoftext | >` | Start of raw text segment |
`< | endoftext | >` | End of raw text segment |
π Getting Started
β¨ Installation
pip install transformers tokenizers
π οΈ Usage
from transformers import AutoTokenizer
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("your-username/sinhala-bpe-tokenizer")
# Sample text
text = "ΰΆΈΰ· ΰ·ΰ·ΰΆΰ·ΰΆ½ ΰΆ΄ΰ·ΰΆ¨ΰΆΊΰΆΰ·"
# Tokenize
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)
# Encode
token_ids = tokenizer.encode(text)
print("Token IDs:", token_ids)
# Decode
decoded = tokenizer.decode(token_ids)
print("Decoded:", decoded)
# Encode with special tokens
with_special = tokenizer.encode(text, add_special_tokens=True)
print("With special tokens:", with_special)
π§ Advanced Usage
texts = [
"ΰ·ΰ·βΰΆ»ΰ· ΰΆ½ΰΆΰΆΰ·ΰ· ΰΆ
ΰΆ΄ΰ· ΰΆ»ΰΆ§",
"ΰΆ
ΰΆ― ΰ·ΰ·ΰΆ³ ΰΆ―ΰ·ΰΆ±ΰΆΊΰΆΰ·",
"ΰ·ΰ·ΰΆΰ·ΰΆ½ ΰΆ·ΰ·ΰ·ΰ·ΰ· ΰΆ½ΰ·ΰ·ΰ·ΰΆ±"
]
# Batch encode
encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
print(encoded)
βοΈ Training Specifications
- Vocabulary Size: 32,000
- Minimum Frequency: 2
- Normalization: NFD
- Pre-tokenizer: Whitespace and Punctuation
- End-of-word marker:
</w>
Configuration Example
tokenizer_config = {
"vocab_size": 32000,
"min_frequency": 2,
"normalizer": "NFD",
"pre_tokenizer": ["Whitespace", "Punctuation"],
"decoder": "BPEDecoder",
"post_processor": "TemplateProcessing"
}
π Use Cases
NLP Tasks
- Language Modeling
- Text Generation
- Machine Translation
- Text Classification
- Named Entity Recognition (NER)
Research
- Low-resource NLP
- Multilingual and cross-lingual model training
π― Example Tokenizations
text = "ΰΆΈΰΆΈ ΰ·ΰ·ΰΆΰ·ΰΆ½ ΰΆΰΆΰ·ΰΆ± ΰΆΰΆ±ΰ·ΰΆ±ΰ·ΰ·"
# Output: ['βΰΆΈΰΆΈ', 'βΰ·ΰ·ΰΆΰ·ΰΆ½', 'βΰΆΰΆΰ·ΰΆ±', 'βΰΆΰΆ±ΰ·ΰΆ±ΰ·ΰ·']
text = "ΰ·ΰ·βΰΆ»ΰ· ΰΆ½ΰΆΰΆΰ·ΰ·ΰ· ΰΆΰΆΰ·ΰ·ΰ·ΰ·ΰ·ΰΆ ΰΆ±ΰΆΰΆ»ΰΆΊΰΆΰ· ΰ·ΰΆ± ΰΆ
ΰΆ±ΰ·ΰΆ»ΰ·ΰΆ°ΰΆ΄ΰ·ΰΆ»ΰΆΊ"
# Output: ['βΰ·ΰ·βΰΆ»ΰ·', 'βΰΆ½ΰΆΰΆΰ·ΰ·ΰ·', 'βΰΆΰΆΰ·ΰ·ΰ·ΰ·ΰ·ΰΆ', 'βΰΆ±ΰΆΰΆ»ΰΆΊΰΆΰ·', 'βΰ·ΰΆ±', 'βΰΆ
ΰΆ±ΰ·ΰΆ»ΰ·ΰΆ°ΰΆ΄ΰ·ΰΆ»ΰΆΊ']
text = "2024 ΰ·ΰΆ»ΰ·ΰ·ΰΆΊΰ· ΰ·ΰ·βΰΆ»ΰ· ΰΆ½ΰΆΰΆΰ·ΰ·"
# Output: ['β2024', 'βΰ·ΰΆ»ΰ·ΰ·ΰΆΊΰ·', 'βΰ·ΰ·βΰΆ»ΰ·', 'βΰΆ½ΰΆΰΆΰ·ΰ·']
π§ Model Integration
GPT-style Models
from transformers import GPT2LMHeadModel, GPT2Config
config = GPT2Config(vocab_size=tokenizer.vocab_size)
model = GPT2LMHeadModel(config)
BERT-style Models
from transformers import BertForMaskedLM, BertConfig
config = BertConfig(vocab_size=tokenizer.vocab_size)
model = BertForMaskedLM(config)
π Evaluation Metrics
Metric | Value | Description |
---|---|---|
Vocabulary Coverage | >95% | Coverage of common Sinhala words |
Compression Ratio | ~3.2 | Average characters per token |
Special Token Ratio | 0.025% | Ratio of special to regular tokens |
OOV Rate | <2% | Out-of-vocabulary rate on test set |
ποΈ Version History
v1.0: Initial release with 32K vocabulary
- Trained on Navanjana/sinhala-articles dataset
- Compatible with Transformers 4.0+
π Citation
If you use this tokenizer in your research, please cite:
@misc{sinhala-bpe-tokenizer-2025,
title={Sinhala BPE Tokenizer: A Specialized Tokenizer for Sinhala Text Processing},
author={Your Name},
year={2025},
url={https://huggingface.co/your-username/sinhala-bpe-tokenizer},
note={Hugging Face Model Hub}
}
π€ Contributing
Contributions are welcome! You can:
- Report issues with specific edge cases
- Suggest improvements or optimizations
- Submit evaluation results on downstream Sinhala NLP tasks
- Share model training results using this tokenizer
π License
This tokenizer is released under the Apache 2.0 License. See the LICENSE file for details.
π Acknowledgments
- Built using Hugging Face's Tokenizers library
- Trained on the Navanjana/sinhala-articles dataset
- Inspired by modern tokenization best practices
- Special thanks to the Sinhala NLP community
π¬ Contact
For questions, issues, or collaboration:
- Open an issue in this repository
- Email: [[email protected]]
- Twitter: [@your-twitter-handle]
Inference Providers
NEW
This model isn't deployed by any Inference Provider.
π
Ask for provider support