IPA CHILDES
Collection
The IPA-CHILDES dataset along with the models and tokenizers used for phoneme-based language modeling for the 31 languages in CHILDES.
Tokenizers for each language in IPA-CHILDES used to train cross-lingual phoneme LLMs in our papers:
Scripts for creating the tokenizers can be found here. Scripts for training models using these tokenizers can be found here.
To load a tokenizer:
from transformers import AutoTokenizer
dutch_tokenizer = AutoTokenizer.from_pretrained('phonemetransformers/ipa-childes-tokenizers', subfolder='Dutch')
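
If you need tokenizers for several languages at once, the call above can be wrapped in a small helper. This is a minimal sketch, not part of the repository: `load_tokenizers` and its `loader` parameter are hypothetical names introduced here for illustration, and the sketch assumes each language's tokenizer lives in a subfolder named like `'Dutch'` above. Check the repository file listing for the exact subfolder names before passing other languages.

```python
from typing import Any, Callable, Dict, Iterable, Optional

# Repository ID taken from the loading example above.
REPO_ID = "phonemetransformers/ipa-childes-tokenizers"


def load_tokenizers(
    languages: Iterable[str],
    loader: Optional[Callable[..., Any]] = None,
) -> Dict[str, Any]:
    """Load one tokenizer per language subfolder (hypothetical helper).

    `loader` defaults to AutoTokenizer.from_pretrained; it is a parameter
    only so the helper can be exercised without network access.
    """
    if loader is None:
        # Imported lazily so the module can be imported without transformers.
        from transformers import AutoTokenizer

        loader = AutoTokenizer.from_pretrained

    # Assumption: each subfolder name matches the language name exactly.
    return {lang: loader(REPO_ID, subfolder=lang) for lang in languages}
```

For example, `load_tokenizers(['Dutch'])['Dutch']` should return the same tokenizer as `dutch_tokenizer` above. Downloading requires network access to the Hugging Face Hub.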