---
library_name: transformers
license: apache-2.0
datasets:
- pints-ai/Expository-Prose-V1
language:
- en
---
# BPE tokenizer with byte-fallback: 32k vocab, uncased
An uncased BPE tokenizer with byte fallback, intended for encoder models trained with an MLM objective:
- Trained on `pints-ai/Expository-Prose-V1`; primarily intended for English text and code.
- This tokenizer is **uncased**: "HELLO WORLD" tokenizes **the same** as "hello world".
- `model_max_length` ships as 1e9 so the tokenizer never silently truncates by default. **Set `tokenizer.model_max_length` to your model's maximum position embeddings** when training (see the sketch below).
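
A minimal usage sketch; the repo id below is a placeholder, not this tokenizer's published path:

```python
from transformers import AutoTokenizer

# Placeholder repo id -- substitute the actual Hub path of this tokenizer.
tokenizer = AutoTokenizer.from_pretrained("your-org/this-tokenizer")

# Uncased: both strings map to identical token ids.
upper = tokenizer("HELLO WORLD")["input_ids"]
lower = tokenizer("hello world")["input_ids"]
assert upper == lower

# model_max_length defaults to 1e9; align it with your model's context window before training.
tokenizer.model_max_length = 512  # e.g. a BERT-style encoder with 512 position embeddings
```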