Baby Tokenizer (Uncased)

Compact, uncased SentencePiece tokenizer for sample-efficient English language modeling on plain natural-language text.

Usage

Transformers

from transformers import AutoTokenizer

tokenizer_baby = AutoTokenizer.from_pretrained("nilq/baby-tokenizer")

Tokenizers

from tokenizers import Tokenizer

tokenizer_baby = Tokenizer.from_pretrained("nilq/baby-tokenizer")
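
Once loaded, the tokenizer can be inspected by encoding a sentence and looking at the resulting pieces. The following is a minimal sketch using the `tokenizers` API; the helper names `token_id_pairs` and `demo` and the sample sentence are illustrative and not part of this repository.

```python
from tokenizers import Tokenizer


def token_id_pairs(tokenizer: Tokenizer, text: str) -> list[tuple[str, int]]:
    """Encode `text` and pair each token string with its vocabulary id."""
    encoding = tokenizer.encode(text)
    return list(zip(encoding.tokens, encoding.ids))


def demo() -> None:
    # Hypothetical helper: downloads the tokenizer from the Hugging Face Hub,
    # so it needs network access.
    tokenizer_baby = Tokenizer.from_pretrained("nilq/baby-tokenizer")
    for token, token_id in token_id_pairs(tokenizer_baby, "The cat sat on the mat."):
        print(token_id, token)
```

Calling `demo()` prints one id and token per line. Because the tokenizer is uncased, differently cased inputs are expected to map to the same ids.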

Data

This tokenizer was trained on the BabyLM 100M dataset, a mixed-domain corpus drawn from the following sources:

  • CHILDES (child-directed speech)
  • Subtitles (speech)
  • BNC (speech)
  • TED talks (speech)
  • children's books (simple written language).

Specifications

  • Vocabulary size: 20k
  • Alphabet limit: 150
  • Minimum token frequency: 100
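
For reference, the specifications above map onto trainer parameters in the `tokenizers` library roughly as sketched below. This is not the original training script: the BPE model, the NFKC and lowercase normalizers, the Metaspace pre-tokenizer, the `<unk>` token, and the function name `train_uncased_tokenizer` are all assumptions chosen to illustrate an uncased SentencePiece-style setup.

```python
from tokenizers import Tokenizer, decoders, normalizers, pre_tokenizers
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer


def train_uncased_tokenizer(lines, vocab_size=20_000, limit_alphabet=150, min_frequency=100):
    """Train a SentencePiece-style BPE tokenizer on an iterable of strings."""
    # Assumption: a BPE model; the card does not state which algorithm was used.
    tokenizer = Tokenizer(BPE(unk_token="<unk>"))
    # "Uncased" is modeled here by lowercasing after Unicode normalization.
    tokenizer.normalizer = normalizers.Sequence(
        [normalizers.NFKC(), normalizers.Lowercase()]
    )
    # Metaspace marks word boundaries with "▁", as SentencePiece does.
    tokenizer.pre_tokenizer = pre_tokenizers.Metaspace()
    tokenizer.decoder = decoders.Metaspace()
    trainer = BpeTrainer(
        vocab_size=vocab_size,          # Vocabulary size: 20k
        limit_alphabet=limit_alphabet,  # Alphabet limit: 150
        min_frequency=min_frequency,    # Minimum token frequency: 100
        special_tokens=["<unk>"],
        show_progress=False,
    )
    tokenizer.train_from_iterator(lines, trainer=trainer)
    return tokenizer
```

A small alphabet limit drops rare characters from the base alphabet, and a high minimum frequency keeps rare merges out of the vocabulary, which both help keep the vocabulary compact.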