BabyLM's First Words
Collection
Models trained on IPA-CHILDES and evaluated for phonological knowledge using the word segmentation task, linked to child language acquisition.
Phoneme-based GPT-2 models trained on all 31 sections of the IPA-CHILDES dataset for the paper BabyLM's First Words: Word Segmentation as a Phonological Probing Task.
Each model has 600k non-embedding parameters and was trained on 100k tokens of its language. The models were evaluated for phonological knowledge using the word segmentation task. Check out the paper for more details. Training and analysis scripts can be found here.
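As an illustration of how a phoneme language model can be probed with word segmentation, the sketch below uses a simple surprisal-spike heuristic: posit a word boundary before any phoneme whose surprisal rises relative to the preceding one. This is a minimal example of one common approach, not necessarily the exact procedure used in the paper; the phonemes and surprisal values are hypothetical.

def segment(phonemes, surprisals):
    # Start a new word before any phoneme whose surprisal exceeds
    # that of the phoneme before it (a simple spike heuristic).
    words, current = [], [phonemes[0]]
    for i in range(1, len(phonemes)):
        if surprisals[i] > surprisals[i - 1]:
            words.append(current)
            current = [phonemes[i]]
        else:
            current.append(phonemes[i])
    words.append(current)
    return words

# Hypothetical values: the surprisal spike at 'd' yields a boundary before 'dɒɡ'.
print(segment(list('ðədɒɡ'), [3.1, 1.2, 4.0, 1.5, 0.9]))
# [['ð', 'ə'], ['d', 'ɒ', 'ɡ']]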
To load a model:
from transformers import AutoModel
farsi_model = AutoModel.from_pretrained('phonemetransformers/ipa-childes-models-tiny', subfolder='Farsi')
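To go from a loaded checkpoint to the per-phoneme surprisals that a segmentation probe relies on, something like the following can work. This is a sketch under assumptions: it presumes each language subfolder also ships a tokenizer and that the checkpoint loads with a language-modelling head via AutoModelForCausalLM, and the example utterance is a hypothetical space-separated IPA string; check the repository layout and input format before relying on it.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = 'phonemetransformers/ipa-childes-models-tiny'
# Assumption: a tokenizer is stored alongside each model subfolder.
tokenizer = AutoTokenizer.from_pretrained(repo, subfolder='Farsi')
model = AutoModelForCausalLM.from_pretrained(repo, subfolder='Farsi')
model.eval()

utterance = 'ʔ i n tʃ i j e'  # hypothetical IPA input; the real format may differ
inputs = tokenizer(utterance, return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits
# Surprisal (negative log-probability, in nats) of each phoneme given its left context.
log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
targets = inputs['input_ids'][0, 1:]
surprisals = -log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)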