IPA CHILDES
Collection
The IPA-CHILDES dataset along with the models and tokenizers used for phoneme-based language modeling for the 31 languages in CHILDES.
Phoneme-based GPT-2 models trained on the 11 largest sections of the IPA-CHILDES dataset for our paper "IPA-CHILDES & G2P+: Feature-Rich Resources for Cross-Lingual Phonology and Phonemic Language Modeling".
Each model has 5M non-embedding parameters and was trained on 1.8M tokens of its language. The models were then probed for phonetic features using the corresponding phoneme inventories in Phoible. See the paper for more details. Training and analysis scripts can be found here.
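For a sense of what "5M non-embedding parameters" means in practice, the sketch below counts the weights and biases of a standard GPT-2 stack while leaving out the token and position embeddings. The layer count and hidden size are hypothetical values chosen only to land near 5M. They are not taken from the paper or from the released configs, so treat them as an assumption and check each model's config for the real values.

```python
def gpt2_non_embedding_params(n_layer: int, d_model: int) -> int:
    """Count parameters in a standard GPT-2 stack, excluding embeddings.

    Assumes the default GPT-2 layout: fused QKV projection, an MLP with
    a 4x hidden width, and biases on every linear layer.
    """
    # Fused QKV projection (d x 3d + 3d) plus output projection (d x d + d)
    attention = 4 * d_model * d_model + 4 * d_model
    # Two linear layers: d -> 4d (with 4d biases) and 4d -> d (with d biases)
    mlp = 8 * d_model * d_model + 5 * d_model
    # Two LayerNorms per block, each with a scale and a bias vector
    layer_norms = 4 * d_model
    per_block = attention + mlp + layer_norms
    # One final LayerNorm after the last block
    final_layer_norm = 2 * d_model
    return n_layer * per_block + final_layer_norm


# HYPOTHETICAL hyperparameters, for illustration only
HYPOTHETICAL_N_LAYER = 6
HYPOTHETICAL_D_MODEL = 256

total = gpt2_non_embedding_params(HYPOTHETICAL_N_LAYER, HYPOTHETICAL_D_MODEL)
print(f"{total:,}")  # 4,739,072
```

Excluding embeddings keeps the count comparable across languages, since the embedding matrix grows with the size of each language's phoneme vocabulary.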
To load a model:
from transformers import AutoModel
dutch_model = AutoModel.from_pretrained('phonemetransformers/ipa-childes-models', subfolder='Dutch')
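If you need models for several languages, the same call can be wrapped in a small helper. This is a minimal sketch: `load_language_model` and its `loader` argument are hypothetical names introduced here, and only the `'Dutch'` subfolder is confirmed by the snippet above. Other subfolder names are assumed to follow the same pattern, so check the repository's file listing before relying on them.

```python
REPO_ID = "phonemetransformers/ipa-childes-models"


def load_language_model(language, loader=None):
    """Load the model stored in the subfolder named after `language`.

    `loader` is a hypothetical hook for swapping in another
    `from_pretrained`-style callable (for example in tests). By default
    it falls back to `transformers.AutoModel.from_pretrained`.
    """
    if loader is None:
        # Imported lazily so the helper can be defined without transformers
        from transformers import AutoModel

        loader = AutoModel.from_pretrained
    return loader(REPO_ID, subfolder=language)


# Example (downloads weights from the Hub on first use):
# dutch_model = load_language_model("Dutch")
```

Passing a different class's `from_pretrained` as `loader`, such as `AutoModelForCausalLM.from_pretrained`, should return the model with its language modeling head, assuming the checkpoints include one.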