lexdec-small-char is a small, autoregressive Llama model with character-level tokenization, trained on the 2024/2025 BabyLM dataset. The checkpoints branch contains 19 checkpoints: 10 spanning the first 10% of pretraining and 9 more spanning the remaining 90%.
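A minimal loading sketch with 🤗 Transformers is given below. The repo id is a placeholder and the exact layout of the checkpoints branch is an assumption, so adjust both to the actual model path and revision names.

```python
# Minimal usage sketch. The repo id is a placeholder and the checkpoint branch
# layout is assumed, not confirmed by this card.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-org/lexdec-small-char"  # placeholder: replace with the real repo id

# Final model from the main branch
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# An intermediate checkpoint, loaded by pointing `revision` at the checkpoints branch
early = AutoModelForCausalLM.from_pretrained(repo_id, revision="checkpoints")

# Character-level generation: prompts are tokenized character by character
inputs = tokenizer("The dog chased the", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(out[0]))
```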
We used this model to trace the development of linguistic knowledge (word-level and syntactic) across pretraining and to compare it with larger character-level models and with comparable subword models:
| | small-char | medium-char | large-char | small-bpe | medium-bpe | large-bpe |
|---|---|---|---|---|---|---|
| Embedding size | 128 | 256 | 512 | 128 | 256 | 512 |
| Hidden size | 128 | 256 | 512 | 128 | 256 | 512 |
| Layers | 4 | 8 | 12 | 4 | 8 | 12 |
| Attention heads | 4 | 8 | 12 | 4 | 8 | 12 |
| Context size | 128 | 128 | 128 | 128 | 128 | 128 |
| Vocab. size | 102 | 102 | 102 | 8,002 | 8,002 | 8,002 |
| Parameters | 486,016 | 3,726,592 | 21,940,736 | 2,508,416 | 7,771,392 | 30,030,336 |
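As a rough sanity check, the parameter counts above can be reproduced by instantiating a Llama config with these hyperparameters. The MLP (intermediate) size and embedding tying are not listed in the table; the sketch below assumes the intermediate size equals the hidden size and that input/output embeddings are untied, which yields the listed count for small-char.

```python
# Sketch: reproduce the small-char parameter count from the table with a LlamaConfig.
# Assumptions (not stated in the table): intermediate_size == hidden_size,
# untied input/output embeddings, and default Llama settings otherwise.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=102,
    hidden_size=128,
    intermediate_size=128,      # assumption: equal to hidden size
    num_hidden_layers=4,
    num_attention_heads=4,
    num_key_value_heads=4,
    max_position_embeddings=128,
    tie_word_embeddings=False,  # assumption: untied embeddings
)
model = LlamaForCausalLM(config)
print(sum(p.numel() for p in model.parameters()))  # expected: 486016
```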
If you use this model, please cite the following preprint (the final version will be added as soon as it is published):
```bibtex
@misc{bunzeck2025subwordmodelsstruggleword,
  title={Subword models struggle with word learning, but surprisal hides it},
  author={Bastian Bunzeck and Sina Zarrieß},
  year={2025},
  eprint={2502.12835},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.12835},
}
```