---
library_name: transformers
license: apache-2.0
datasets:
  - pints-ai/Expository-Prose-V1
language:
  - en
---

bpe tokenizer with byte fallback: 32k vocab, uncased

Uncased BPE tokenizer with byte fallback, intended for encoder models trained with an MLM objective:

  • Trained on pints-ai/Expository-Prose-V1; intended primarily for English text and code.
  • Uncased: "HELLO WORLD" tokenizes identically to "hello world".
  • model_max_length ships set to 1e9 so the tokenizer never silently truncates input. Set tokenizer.model_max_length to your model's max position embeddings before training.
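Byte fallback means any token missing from the 32k vocab decomposes into single-byte tokens (e.g. <0xC3>), so no input ever maps to an unknown token. A minimal sketch of the idea, assuming an illustrative `<0xNN>` byte-token format and a plain set as the vocab (the real tokenizer applies fallback inside its BPE pipeline, not as a standalone pass):

```python
def byte_fallback(token: str, vocab: set[str]) -> list[str]:
    """Return the token itself if it is in the vocab; otherwise emit one
    <0xNN> token per UTF-8 byte of the token (byte fallback)."""
    if token in vocab:
        return [token]
    return [f"<0x{b:02X}>" for b in token.encode("utf-8")]

# "café" is out-of-vocab here, so it falls back to its five UTF-8 bytes
print(byte_fallback("café", {"hello", "world"}))
# → ['<0x63>', '<0x61>', '<0x66>', '<0xC3>', '<0xA9>']
```

Because every possible byte has its own token, coverage is total: unseen scripts, emoji, and binary-ish code all round-trip without `<unk>`.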