lexdec-small-char is a small, autoregressive Llama model with character-level tokenization, trained on the 2024/2025 BabyLM dataset. The `checkpoints` branch contains 19 checkpoints: 10 spread across the first 10% of pretraining and 9 more across the remaining 90%.
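
A minimal loading sketch with the Hugging Face `transformers` library (assuming the tokenizer files ship with the repo; `revision` is the standard way to load from a non-default branch, though how the 19 checkpoints are organized within the `checkpoints` branch is not specified here):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "bbunzeck/lexdec-small-char"

# Final model from the main branch
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Intermediate checkpoints live on the `checkpoints` branch;
# `revision` selects that branch
model = AutoModelForCausalLM.from_pretrained(repo_id, revision="checkpoints")
```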

We used this model to trace the development of linguistic knowledge (word-level and syntactic) across pretraining and to compare it with both larger character-level models and comparable subword (BPE) models:

|                 | small-char | medium-char | large-char | small-bpe | medium-bpe | large-bpe  |
|-----------------|-----------:|------------:|-----------:|----------:|-----------:|-----------:|
| Embedding size  | 128        | 256         | 512        | 128       | 256        | 512        |
| Hidden size     | 128        | 256         | 512        | 128       | 256        | 512        |
| Layers          | 4          | 8           | 12         | 4         | 8          | 12         |
| Attention heads | 4          | 8           | 12         | 4         | 8          | 12         |
| Context size    | 128        | 128         | 128        | 128       | 128        | 128        |
| Vocab. size     | 102        | 102         | 102        | 8,002     | 8,002      | 8,002      |
| Parameters      | 486,016    | 3,726,592   | 21,940,736 | 2,508,416 | 7,771,392  | 30,030,336 |

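As a sanity check, the parameter counts above can be reproduced from the listed settings. The sketch below assumes details the table does not state: a standard bias-free Llama decoder with untied input/output embeddings, RoPE (no learned position parameters), and an MLP intermediate size equal to the hidden size.

```python
def param_count(vocab, hidden, layers, heads):
    """Parameter count for a bias-free Llama-style decoder (a sketch;
    assumes untied embeddings and MLP width equal to the hidden size)."""
    head_dim = hidden // heads
    attn_dim = heads * head_dim          # 504 for the large models (512 // 12 = 42)
    attention = 4 * hidden * attn_dim    # q, k, v, o projections
    mlp = 3 * hidden * hidden            # gate, up, down projections
    norms = 2 * hidden                   # two RMSNorms per layer
    embeddings = 2 * vocab * hidden      # input embedding + LM head (untied)
    final_norm = hidden
    return embeddings + final_norm + layers * (attention + mlp + norms)

for name, vocab, hidden, layers, heads in [
    ("small-char", 102, 128, 4, 4), ("medium-char", 102, 256, 8, 8),
    ("large-char", 102, 512, 12, 12), ("small-bpe", 8002, 128, 4, 4),
    ("medium-bpe", 8002, 256, 8, 8), ("large-bpe", 8002, 512, 12, 12),
]:
    print(f"{name}: {param_count(vocab, hidden, layers, heads):,}")
# small-char: 486,016 ... large-bpe: 30,030,336 (matches the table)
```
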
If you use this model, please cite the following preprint (the reference will be updated as soon as the final version is published):

@misc{bunzeck2025subwordmodelsstruggleword,
      title={Subword models struggle with word learning, but surprisal hides it},
      author={Bastian Bunzeck and Sina Zarrieß},
      year={2025},
      eprint={2502.12835},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.12835},
}