lexdec-small-char is a small, autoregressive Llama model with character-level tokenization, trained on the 2024/2025 BabyLM dataset. The checkpoints branch contains 19 checkpoints: 10 spanning the first 10% of pretraining and 9 more spanning the remaining 90%.
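A minimal loading sketch with 🤗 Transformers is given below. The repo id is a placeholder and the exact layout of the checkpoints branch is an assumption, so adjust both to the actual model path and revision names.

```python
# Minimal usage sketch. The repo id is a placeholder and the checkpoint branch
# layout is assumed, not confirmed by this card.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-org/lexdec-small-char"  # placeholder: replace with the real repo id

# Final model from the main branch
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# An intermediate checkpoint, loaded by pointing `revision` at the checkpoints branch
early = AutoModelForCausalLM.from_pretrained(repo_id, revision="checkpoints")

# Character-level generation: prompts are tokenized character by character
inputs = tokenizer("The dog chased the", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(out[0]))
```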
We used this model to trace the development of linguistic knowledge (word-level and syntactic) across pretraining and to compare it with larger character-level models and with comparable subword models:
| | small-char | medium-char | large-char | small-bpe | medium-bpe | large-bpe |
|---|---|---|---|---|---|---|
| Embedding size | 128 | 256 | 512 | 128 | 256 | 512 |
| Hidden size | 128 | 256 | 512 | 128 | 256 | 512 |
| Layers | 4 | 8 | 12 | 4 | 8 | 12 |
| Attention heads | 4 | 8 | 12 | 4 | 8 | 12 |
| Context size | 128 | 128 | 128 | 128 | 128 | 128 |
| Vocab. size | 102 | 102 | 102 | 8,002 | 8,002 | 8,002 |
| Parameters | 486,016 | 3,726,592 | 21,940,736 | 2,508,416 | 7,771,392 | 30,030,336 |
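As a rough sanity check, the parameter counts above can be reproduced by instantiating a Llama config with these hyperparameters. The MLP (intermediate) size and embedding tying are not listed in the table; the sketch below assumes the intermediate size equals the hidden size and that input/output embeddings are untied, which yields the listed count for small-char.

```python
# Sketch: reproduce the small-char parameter count from the table with a LlamaConfig.
# Assumptions (not stated in the table): intermediate_size == hidden_size,
# untied input/output embeddings, and default Llama settings otherwise.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=102,
    hidden_size=128,
    intermediate_size=128,      # assumption: equal to hidden size
    num_hidden_layers=4,
    num_attention_heads=4,
    num_key_value_heads=4,
    max_position_embeddings=128,
    tie_word_embeddings=False,  # assumption: untied embeddings
)
model = LlamaForCausalLM(config)
print(sum(p.numel() for p in model.parameters()))  # expected: 486016
```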
If you use this model, please cite the following preprint (the final version will be added as soon as it is published):
```bibtex
@misc{bunzeck2025subwordmodelsstruggleword,
  title={Subword models struggle with word learning, but surprisal hides it},
  author={Bastian Bunzeck and Sina Zarrieß},
  year={2025},
  eprint={2502.12835},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.12835},
}
```