---
license: mit
tags:
- persian
- bpe
- tokenizer
language:
- fa
---

# Persian BPE Tokenizer (30K)

A Byte-Pair Encoding (BPE) tokenizer with a 30,000-token vocabulary, built for Persian NLP tasks and trained on roughly 2M Persian texts averaging 10,000 characters each.

## Usage

### Encoding

```python
from tokenizers import Tokenizer

# Load the pretrained tokenizer from its JSON file
tokenizer = Tokenizer.from_file("Persian_BPE_Tokenizer_30K.json")

# Encode a sample sentence ("این یک متن آزمایشی است." = "This is a test text.")
encoded_text = tokenizer.encode("این یک متن آزمایشی است.")

print("Tokens:", encoded_text.tokens)
print("IDs:", encoded_text.ids)
```

### Decoding

```python
# Decode each ID on its own to inspect the per-token strings
decoded_tokens = tokenizer.decode_batch([[token_id] for token_id in encoded_text.ids])
print("Decoded:", decoded_tokens)
```
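
To reconstruct the full string rather than one token at a time, `decode` can be called on the whole ID sequence (a minimal sketch, continuing from the encoding example above):

```python
# Decode the entire ID sequence back into a single string
full_text = tokenizer.decode(encoded_text.ids)
print("Full text:", full_text)
```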

## Training Data

This tokenizer was trained on the following datasets (a loading sketch follows the list):

- Wikipedia (20231101.fa): https://huggingface.co/datasets/wikimedia/wikipedia
- Persian Blog: https://huggingface.co/datasets/RohanAiLab/persian_blog
- HomoRich: https://huggingface.co/datasets/MahtaFetrat/HomoRich-G2P-Persian
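
The preprocessing pipeline is not included in this card; the sketch below only shows how two of the listed corpora could be pulled with the `datasets` library (the use of the "train" split for Persian Blog is an assumption, and HomoRich is omitted because its schema is not described here):

```python
from datasets import load_dataset

# Persian Wikipedia dump, config name as listed above
wiki = load_dataset("wikimedia/wikipedia", "20231101.fa", split="train")

# Persian blog corpus (split name assumed)
blog = load_dataset("RohanAiLab/persian_blog", split="train")

# The Wikipedia dataset exposes article bodies in a "text" field
print(wiki[0]["text"][:100])
```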

## License

Code and tokenizer: MIT License

## Evaluation Metrics

- UNK Rate: 0.0% (on 100,000 samples)
- Compression Ratio: 4.56 (on 100,000 samples)
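
The card does not spell out how these metrics are defined; below is a minimal sketch assuming the UNK rate is the percentage of unknown-token IDs produced and the compression ratio is characters per token (both definitions, and the `[UNK]` token name, are assumptions):

```python
from tokenizers import Tokenizer

def evaluate(tokenizer: Tokenizer, texts: list[str]) -> tuple[float, float]:
    """Return (unk_rate_percent, chars_per_token) over the given texts."""
    unk_id = tokenizer.token_to_id("[UNK]")  # assumed unknown-token name
    total_chars = total_tokens = unk_count = 0
    for text in texts:
        ids = tokenizer.encode(text).ids
        total_chars += len(text)
        total_tokens += len(ids)
        unk_count += sum(1 for i in ids if i == unk_id)
    return 100 * unk_count / total_tokens, total_chars / total_tokens

tokenizer = Tokenizer.from_file("Persian_BPE_Tokenizer_30K.json")
unk_rate, compression = evaluate(tokenizer, ["این یک متن آزمایشی است."])
print(f"UNK rate: {unk_rate:.2f}%  Compression ratio: {compression:.2f}")
```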

## Requirements

- **For using the tokenizer**:
  - Python >= 3.9
  - tokenizers
- **For training the tokenizer** (a training sketch follows this list):
  - pandas
  - datasets
  - requests
  - hazm
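
The training script is not part of this repository; the following is a minimal sketch of how a 30K BPE tokenizer could be trained with the `tokenizers` library, assuming the corpora have already been normalized (e.g. with `hazm`) and written to a plain-text file (the file name, special tokens, and pre-tokenizer choice are all assumptions):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# BPE model with an explicit unknown token (name assumed)
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=30_000,           # matches the 30K vocabulary above
    special_tokens=["[UNK]"],
)

# corpus.txt: one normalized Persian document per line (assumed layout)
tokenizer.train(files=["corpus.txt"], trainer=trainer)
tokenizer.save("Persian_BPE_Tokenizer_30K.json")
```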