# Kurdish Kurmanji BPE Tokenizer

This tokenizer was trained on Kurdish Kurmanji text using Byte-Pair Encoding (BPE). It was trained on this dataset.
## Details
- Tokenization Method: BPE (Byte-Pair Encoding)
- Vocabulary Size: 20,000 tokens
- Special Tokens: [UNK], [CLS], [SEP], [PAD], [MASK]
- Language: Kurdish Kurmanji
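To illustrate how BPE builds its vocabulary, here is a minimal sketch of one merge step: count adjacent symbol pairs across a corpus and fuse the most frequent pair into a new symbol. This is an illustration of the algorithm only, not the actual training code used for this tokenizer.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs over {symbol_tuple: frequency} and return the top one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: character sequences with word frequencies.
corpus = {tuple("diçim"): 5, tuple("diçin"): 3, tuple("malê"): 2}
best = most_frequent_pair(corpus)   # most frequent adjacent pair
corpus = merge_pair(corpus, best)   # fuse it into one symbol everywhere
```

Training repeats this merge step until the vocabulary reaches the target size (here, 20,000 tokens).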
## Usage

### Encoding
```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("muzaffercky/kurdish-kurmanji-tokenizer", revision="v1.0")

# Encode to token IDs and inspect the token strings.
ids = tokenizer.encode("Ez diçim malê")
tokens = tokenizer.tokenize("Ez diçim malê")
print(f"Tokens: {tokens}")
print(f"IDs: {ids}")
```
### Decoding
This tokenizer includes spaces within some tokens (e.g., 'Ez ê ', 'di vê '), which causes the default tokenizer.decode() method from the transformers library to add extra spaces between tokens during decoding. To decode text correctly and preserve the original formatting, decode each token ID individually and join the results without spaces.
Use the following code example:
```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("muzaffercky/kurdish-kurmanji-tokenizer", revision="v1.0")

text = "Ez diçim malê"
ids = tokenizer.encode(text)

# Decode each ID on its own, then concatenate without inserting spaces,
# so spaces embedded inside tokens are preserved.
individual_tokens = [tokenizer.decode([token_id]) for token_id in ids]
decoded_text = "".join(individual_tokens)
print(decoded_text)  # Output: Ez diçim malê
```
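The per-ID decoding above can be wrapped in a small helper so callers don't repeat the join logic. The function name `decode_preserving_spaces` is hypothetical, not part of the tokenizer package or the transformers API.

```python
def decode_preserving_spaces(tokenizer, ids):
    """Decode each token ID individually and concatenate the pieces,
    keeping any spaces embedded inside tokens intact."""
    # Hypothetical helper; works with any tokenizer exposing decode(list_of_ids).
    return "".join(tokenizer.decode([token_id]) for token_id in ids)
```

Usage: `decode_preserving_spaces(tokenizer, tokenizer.encode("Ez diçim malê"))`.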