Model S
Model S is a sequence-to-sequence model; we selected Pegasus as the model type. This architecture uses masked language modeling (MLM), a pretraining technique that masks out a percentage of the tokens in the input sequence and trains the model to predict the masked tokens from the context provided by the unmasked tokens. Pegasus additionally applies gap-sentence generation (GSG), in which whole sentences are removed from the input and the model is trained to generate them, filling in the gaps left in the input. The model architecture is described in figure 3.1. Pegasus distinguishes itself from other sequence-to-sequence architectures primarily through this GSG pretraining objective.
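To make the GSG objective concrete, the sketch below builds an (input, target) pair by removing one sentence from a short document and replacing it with Pegasus's sentence-mask token `<mask_1>`; the example text, the sentence-selection rule, and the helper name are illustrative, not taken from the actual training code.

```python
# Illustrative GSG example: gap out one sentence and use it as the target.
MASK_SENT = "<mask_1>"  # Pegasus sentence-mask token

def make_gsg_pair(sentences, gap_index):
    """Return (input_text, target_text) with one sentence gapped out."""
    target = sentences[gap_index]
    masked = [MASK_SENT if i == gap_index else s for i, s in enumerate(sentences)]
    return " ".join(masked), target

doc = ["Sentence one.", "Sentence two.", "Sentence three."]
inp, tgt = make_gsg_pair(doc, gap_index=1)
print(inp)  # "Sentence one. <mask_1> Sentence three."
print(tgt)  # "Sentence two."
```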
Model Details
The model is trained from scratch, i.e. its weights are randomly initialized. The configuration uses a vocabulary size of 96103, a model dimension of 1024, 2 layers, and a maximum position embedding length of 512.
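A minimal sketch of how a randomly initialized Pegasus model with these hyperparameters could be constructed with the transformers library; it assumes "2 layers" means 2 encoder and 2 decoder layers, and leaves every setting not listed above (attention heads, feed-forward size, etc.) at its default value.

```python
from transformers import PegasusConfig, PegasusForConditionalGeneration

# From-scratch initialization with the hyperparameters listed above.
# Assumption: 2 encoder and 2 decoder layers; unspecified settings use defaults.
config = PegasusConfig(
    vocab_size=96103,
    d_model=1024,
    encoder_layers=2,
    decoder_layers=2,
    max_position_embeddings=512,
)
model = PegasusForConditionalGeneration(config)  # weights are randomly initialized
print(f"{model.num_parameters():,} parameters")
```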
Model Description
- Developed by: Ronny Paul
- Model type: Pegasus
- Language(s) (NLP): Northern Sami
Uses
This model was used in an experiment to determine which architecture is favourable in a low-resource setting for Northern Sami.
Dataset
The model is trained on the rpa020/SALT dataset. The formatted dataset, named the SAmi LLM Token (SALT) dataset, contains around 22 million tokens and approximately 2 million sentences, so each sentence consists of around ten tokens on average. The dataset was designed to support the pretraining phase of foundational model development.
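The snippet below sketches how the dataset could be loaded from the Hugging Face Hub; the split and column names are assumptions, as the card only gives the repository id rpa020/SALT.

```python
from datasets import load_dataset

# Assumed layout: a "train" split with a "text" column (not specified in the card).
salt = load_dataset("rpa020/SALT", split="train")
print(salt)                   # number of rows and column names
print(salt[0]["text"][:200])  # peek at the first example
```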
How to Get Started with the Model
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("rpa020/S")
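The short example below additionally loads a tokenizer and generates a continuation; it assumes the matching tokenizer is published under the same rpa020/S repository, which the card does not state explicitly, and the prompt is only a placeholder.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("rpa020/S")  # assumed to ship with the model
model = AutoModelForSeq2SeqLM.from_pretrained("rpa020/S")

# Encode a Northern Sami prompt and generate output text.
inputs = tokenizer("Bures boahtin!", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```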
Performance
- CE Loss: 7.01
- Perplexity: 1160
- Self-BLEU: 0.39
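As a reference for how the Self-BLEU figure could be reproduced, the sketch below follows the common definition (each generated sentence is scored with BLEU against all other generations, then the scores are averaged); the smoothing choice and the function name are assumptions, not details taken from the evaluation code.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(generations):
    """Average BLEU of each generated sentence against the other generations.

    Higher values indicate less diverse (more repetitive) model output.
    """
    smooth = SmoothingFunction().method1
    scores = []
    for i, hypothesis in enumerate(generations):
        references = [g.split() for j, g in enumerate(generations) if j != i]
        scores.append(sentence_bleu(references, hypothesis.split(),
                                    smoothing_function=smooth))
    return sum(scores) / len(scores)
```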