Custom tokenizer used to tokenize my dataset terrycraddock/GPT2-PretrainV1-en. It is essentially the default GPT-2 tokenizer with a [PAD] token added, so the trainer can pad batches during knowledge distillation training, and it was retrained on my dataset to be more efficient. A minimal sketch of the [PAD] addition is shown below.
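
The snippet below is a rough sketch of the [PAD] step only: it starts from the stock `gpt2` tokenizer on the Hub and registers a [PAD] special token. The retraining on the dataset is omitted, and the exact values here are illustrative, not taken from this repo.

```python
from transformers import GPT2TokenizerFast

# Start from the stock GPT-2 tokenizer, which has no pad token by default,
# and register a [PAD] special token.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.add_special_tokens({"pad_token": "[PAD]"})

# Padding now works, e.g. for fixed-length batches during distillation.
enc = tokenizer("Hello world", padding="max_length", max_length=16)
print(enc["input_ids"])
```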

Link to tokenized dataset: https://huggingface.co/datasets/terrycraddock/GPT2-PretrainV1-Tokenized-en

Link to non-tokenized dataset: https://huggingface.co/datasets/terrycraddock/GPT2-PretrainV1-en
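
Both datasets can be pulled with the `datasets` library; a quick sketch:

```python
from datasets import load_dataset

# Pre-tokenized corpus, ready for training.
tokenized = load_dataset("terrycraddock/GPT2-PretrainV1-Tokenized-en")

# Raw text corpus, if you want to re-tokenize it yourself.
raw = load_dataset("terrycraddock/GPT2-PretrainV1-en")
```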

This custom tokenizer, together with the datasets above, is intended as a pretraining corpus for knowledge distillation from a larger GPT-2 model into a smaller custom one; a minimal sketch of a distillation objective follows.
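
For reference, here is a minimal sketch of a standard distillation loss (softened KL divergence between teacher and student logits). The temperature value is an illustrative assumption, not a detail from this card:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    # Soften both distributions with the temperature, then measure how far
    # the student is from the teacher via KL divergence. The T**2 factor
    # keeps gradient magnitudes comparable across temperatures.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```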
