Model S
Model S is a sequence-to-sequence model; we selected Pegasus as the model type. This architecture uses masked language modeling (MLM), a pretraining technique that masks out a percentage of the tokens in the input sequence and trains the model to predict the masked tokens from the context provided by the unmasked tokens. Pegasus additionally applies gap-sentence generation (GSG), in which whole sentences are removed from the input and the model is trained to generate them, filling in the gaps left in the input. The model architecture is described in figure 3.1. Pegasus distinguishes itself from other sequence-to-sequence architectures primarily through this GSG pretraining objective.
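To make the GSG objective concrete, the sketch below builds an (input, target) pair by removing one sentence from a short document and replacing it with Pegasus's sentence-mask token `<mask_1>`; the example text, the sentence-selection rule, and the helper name are illustrative, not taken from the actual training code.

```python
# Illustrative GSG example: gap out one sentence and use it as the target.
MASK_SENT = "<mask_1>"  # Pegasus sentence-mask token

def make_gsg_pair(sentences, gap_index):
    """Return (input_text, target_text) with one sentence gapped out."""
    target = sentences[gap_index]
    masked = [MASK_SENT if i == gap_index else s for i, s in enumerate(sentences)]
    return " ".join(masked), target

doc = ["Sentence one.", "Sentence two.", "Sentence three."]
inp, tgt = make_gsg_pair(doc, gap_index=1)
print(inp)  # "Sentence one. <mask_1> Sentence three."
print(tgt)  # "Sentence two."
```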
Model Details
The model is trained from scratch, i.e. its weights are randomly initialized. The configuration uses a vocabulary size of 96103, a model dimension of 1024, 2 layers, and a maximum position embedding length of 512.
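A minimal sketch of how a randomly initialized Pegasus model with these hyperparameters could be constructed with the transformers library; it assumes "2 layers" means 2 encoder and 2 decoder layers, and leaves every setting not listed above (attention heads, feed-forward size, etc.) at its default value.

```python
from transformers import PegasusConfig, PegasusForConditionalGeneration

# From-scratch initialization with the hyperparameters listed above.
# Assumption: 2 encoder and 2 decoder layers; unspecified settings use defaults.
config = PegasusConfig(
    vocab_size=96103,
    d_model=1024,
    encoder_layers=2,
    decoder_layers=2,
    max_position_embeddings=512,
)
model = PegasusForConditionalGeneration(config)  # weights are randomly initialized
print(f"{model.num_parameters():,} parameters")
```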
Model Description
- Developed by: Ronny Paul
- Model type: Pegasus
- Language(s) (NLP): Northern Sami
Uses
This model was used in an experiment to determine which architecture is favourable in a low-resource setting for Northern Sami.
Dataset
The model is trained on the rpa020/SALT dataset. The formatted dataset, named the SAmi LLM Token (SALT) dataset, contains around 22 million tokens and approximately 2 million sentences, so each sentence consists of around ten tokens on average. The dataset was designed to support the pretraining phase of foundational model development.
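The snippet below sketches how the dataset could be loaded from the Hugging Face Hub; the split and column names are assumptions, as the card only gives the repository id rpa020/SALT.

```python
from datasets import load_dataset

# Assumed layout: a "train" split with a "text" column (not specified in the card).
salt = load_dataset("rpa020/SALT", split="train")
print(salt)                   # number of rows and column names
print(salt[0]["text"][:200])  # peek at the first example
```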
How to Get Started with the Model
from transformers import AutoModelForSeq2SeqLM
model = AutoModelForSeq2SeqLM.from_pretrained("rpa020/S")
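The short example below additionally loads a tokenizer and generates a continuation; it assumes the matching tokenizer is published under the same rpa020/S repository, which the card does not state explicitly, and the prompt is only a placeholder.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("rpa020/S")  # assumed to ship with the model
model = AutoModelForSeq2SeqLM.from_pretrained("rpa020/S")

# Encode a Northern Sami prompt and generate output text.
inputs = tokenizer("Bures boahtin!", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```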
Performance
- CE Loss: 7.01
- Perplexity: 1160
- Self-BLEU: 0.39
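As a reference for how the Self-BLEU figure could be reproduced, the sketch below follows the common definition (each generated sentence is scored with BLEU against all other generations, then the scores are averaged); the smoothing choice and the function name are assumptions, not details taken from the evaluation code.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(generations):
    """Average BLEU of each generated sentence against the other generations.

    Higher values indicate less diverse (more repetitive) model output.
    """
    smooth = SmoothingFunction().method1
    scores = []
    for i, hypothesis in enumerate(generations):
        references = [g.split() for j, g in enumerate(generations) if j != i]
        scores.append(sentence_bleu(references, hypothesis.split(),
                                    smoothing_function=smooth))
    return sum(scores) / len(scores)
```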