---
library_name: transformers
license: apache-2.0
datasets:
- pints-ai/Expository-Prose-V1
language:
- en
---
# BPE tokenizer with byte-fallback: 32k vocab, uncased
An uncased BPE tokenizer with byte fallback, intended for encoder models trained with an MLM objective:
- Trained on `pints-ai/Expository-Prose-V1`; primarily intended for English text and code.
- This tokenizer is **uncased**: "HELLO WORLD" tokenizes **the same** as "hello world".
- `model_max_length` ships as 1e9 so the tokenizer never silently truncates by default. **Set `tokenizer.model_max_length` to your model's maximum position embeddings** when training (see the sketch below).
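
A minimal usage sketch; the repo id below is a placeholder, not this tokenizer's published path:

```python
from transformers import AutoTokenizer

# Placeholder repo id -- substitute the actual Hub path of this tokenizer.
tokenizer = AutoTokenizer.from_pretrained("your-org/this-tokenizer")

# Uncased: both strings map to identical token ids.
upper = tokenizer("HELLO WORLD")["input_ids"]
lower = tokenizer("hello world")["input_ids"]
assert upper == lower

# model_max_length defaults to 1e9; align it with your model's context window before training.
tokenizer.model_max_length = 512  # e.g. a BERT-style encoder with 512 position embeddings
```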