license: mit
language:
  - en
tags:
  - causal-lm

GPT2(023) Model Card

This is the smallest GPT-2 model (124M parameters) from OpenAI, finetuned on approximately 2.23B tokens: 1.3B from Common Crawl sites from 2023, 540M from arXiv, and 390M from GitHub. That is almost the 2.48B tokens needed to 'chinchilla-optimally' pretrain it! It's also more tokens than Cerebras-GPT-111M was trained on in total.
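
The token figures above can be checked with a little arithmetic. The sketch below assumes the commonly quoted rule of thumb of roughly 20 training tokens per parameter for a "chinchilla-optimal" budget; that ratio and the variable names are illustrative assumptions, not something stated in this card.

```python
# Figures quoted in this model card
n_params = 124e6
finetune_tokens = {
    "common_crawl_2023": 1.3e9,
    "arxiv": 540e6,
    "github": 390e6,
}

# Assumption: ~20 tokens per parameter as the "chinchilla-optimal" rule of thumb
TOKENS_PER_PARAM = 20

total_tokens = sum(finetune_tokens.values())
optimal_tokens = n_params * TOKENS_PER_PARAM
fraction = total_tokens / optimal_tokens

print(f"finetuning tokens:         {total_tokens / 1e9:.2f}B")
print(f"chinchilla-optimal tokens: {optimal_tokens / 1e9:.2f}B")
print(f"fraction of optimal:       {fraction:.0%}")
```

Under that assumption the three sources add up to the 2.23B total, which is about 90% of the 2.48B budget.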

The model was trained with a peak learning rate of 1e-4, warmed up over 1024 steps and then decayed to 0. Training ran for 4400 steps at a batch size of 512 examples with a context length of 1024 tokens, the same batch size and context length used to pretrain GPT-2 itself. Training took a total of 1.18e+18 floating-point operations over 79.32 hours on a local 12 GB RTX 3060. The final train loss was 2.73.
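
As a rough illustration of the schedule described above, here is a minimal sketch. The card only says the rate warms up and then decays to 0; the linear shape of both phases is an assumption, and `learning_rate` is a hypothetical helper, not the actual training code.

```python
PEAK_LR = 1e-4
WARMUP_STEPS = 1024
TOTAL_STEPS = 4400
BATCH_SIZE = 512
CONTEXT_LENGTH = 1024


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to 0 (shape is assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step / WARMUP_STEPS)
    decay_steps = TOTAL_STEPS - WARMUP_STEPS
    remaining = max(TOTAL_STEPS - step, 0)
    return PEAK_LR * (remaining / decay_steps)


# Upper bound on tokens seen, assuming every sequence fills the full context
tokens_per_step = BATCH_SIZE * CONTEXT_LENGTH
max_tokens_seen = TOTAL_STEPS * tokens_per_step

for step in (0, 512, 1024, 2712, 4400):
    print(step, learning_rate(step))
print(f"max tokens seen: {max_tokens_seen:,}")
```

With full-length sequences, 4400 steps of 512 x 1024 tokens give an upper bound of about 2.31B tokens, in the same range as the roughly 2.23B reported.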

Evaluation of GPT2023

(in progress)

| model | piqa acc | winogrande acc | lambada ppl | lambada acc | arc acc | sciq acc | wsc acc |
| --- | --- | --- | --- | --- | --- | --- | --- |
| pythia-70m | 59.85 | 51.22 | 140.81 | 21.40 | 17.15 | 65.00 | 36.53 |
| pythia-160m | 62.68 | 51.07 | 30.03 | 36.76 | 19.62 | 76.20 | 36.58 |
| pythia-410m | 66.54 | 52.24 | 11.75 | 49.93 | 21.67 | 80.80 | 60.58 |
| opt-125m | 63.00 | 50.27 | 26.02 | 37.90 | 18.94 | 75.1 | 36.54 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| gpt2 (124m) | 62.89 | 51.61 | 40.06 | 32.56 | 19.03 | 75 | 43.27 |
| gpt2023 (124m) | 62.02 | 49.64 | 34.55 | 33.98 | 18.94 | 76.1 | 36.54 |

The resulting model achieves a puplexity of 339.38, making it competitive with Cerebras-GPT-590M with only 21% of the parameters, and much better than the original GPT-2, which scores 491.57!

(metric explanation here: https://twitter.com/aicrumb/status/1650350363898265601 , tldr it's a joke)

To demonstrate how GPT2(023) is aware of recent events, let’s take a look at a given example:

# About Covid-19
 - -
The Covid-19

The model completes the text as:

# About Covid-19
 - -
The Covid-19 pandemic is the worldwide pandemic that has left thousands of people unable to enter and work in or continue their normal daily normal life. In this brief post, we examine three of the main factors that have accelerated the pandemic and predict the path the pandemic will take through the rest of the world.

As you can see, GPT2(023) can generate coherent and relevant text pertaining to the Covid-19 pandemic, showcasing its awareness of recent events. However, it struggles with subjects that were not well represented in its training data. As only 2.23 billion tokens were used during finetuning, the model may have missed many recent events, one of them being the latest US election.

Given text in a question and answer format:

Q: Who is the last president?
A: Donald Trump

Q: Who is the most recent president?
A:

The model completes the text with: Barack Obama
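
For anyone who wants to repeat this kind of probe, the question-and-answer prompt above can be assembled programmatically. This is only a sketch; `build_qa_prompt` is a hypothetical helper introduced here for illustration.

```python
def build_qa_prompt(answered, question):
    """Format answered (question, answer) pairs, then leave the last answer open."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in answered]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


prompt = build_qa_prompt(
    [("Who is the last president?", "Donald Trump")],
    "Who is the most recent president?",
)
print(prompt)
```

The resulting string can be passed to a text-generation pipeline, and the model's continuation after the final `A:` is its answer.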

Model description

(from GPT-2 model card)

GPT-2 is a transformer model pretrained on a very large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was trained to guess the next word in sentences.

More precisely, inputs are sequences of continuous text of a certain length and the targets are the same sequence, shifted one token (word or piece of word) to the right. The model uses internally a mask-mechanism to make sure the predictions for the token i only uses the inputs from 1 to i but not the future tokens.
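
The shifted targets and the mask described above can be illustrated in plain Python. This is a toy sketch with made-up token ids and hypothetical helper names, not the model's actual implementation.

```python
def make_inputs_and_targets(token_ids):
    """Pair each input token with the token that follows it in the sequence."""
    return token_ids[:-1], token_ids[1:]


def causal_mask(length):
    """mask[i][j] is True when position i may look at position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


tokens = [10, 11, 12, 13, 14]  # hypothetical token ids
inputs, targets = make_inputs_and_targets(tokens)
mask = causal_mask(len(inputs))

for i, row in enumerate(mask):
    visible = [tok for tok, allowed in zip(inputs, row) if allowed]
    print(f"position {i}: sees {visible}, predicts {targets[i]}")
```

Each position sees only itself and earlier tokens, and its target is the next token in the original sequence.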

This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks. The model is best at what it was pretrained for however, which is generating texts from a prompt.

This is the smallest version of GPT-2, with 124M parameters.

How to use

You can use this model directly with a pipeline for text generation. Since the generation relies on some randomness, we set a seed for reproducibility:

>>> from transformers import pipeline, set_seed
>>> generator = pipeline('text-generation', model='crumb/gpt2023')
>>> set_seed(42)
>>> generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)

[{'generated_text': "Hello, I'm a language model, a language for thinking, a language for expressing thoughts."},
 {'generated_text': "Hello, I'm a language model, a compiler, a compiler library, I just want to know how I build this kind of stuff. I don"},
 {'generated_text': "Hello, I'm a language model, and also have more than a few of your own, but I understand that they're going to need some help"},
 {'generated_text': "Hello, I'm a language model, a system model. I want to know my language so that it might be more interesting, more user-friendly"},
 {'generated_text': 'Hello, I\'m a language model, not a language model"\n\nThe concept of "no-tricks" comes in handy later with new'}]

Here is how to use this model to get the features of a given text in PyTorch:

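from transformers import GPT2Tokenizer, GPT2Model

# Load the tokenizer and the base model (no language-modeling head)
tokenizer = GPT2Tokenizer.from_pretrained('crumb/gpt2023')
model = GPT2Model.from_pretrained('crumb/gpt2023')
text = "Replace me by any text you'd like."
# Tokenize to PyTorch tensors and run a forward pass
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
# output.last_hidden_state holds one feature vector per input token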

Limitations and bias

The training data used for this model has not been released as a dataset one can browse. We know it contains a lot of unfiltered content from the internet, which is far from neutral. As the OpenAI team themselves point out in their model card:

Because large-scale language models like GPT-2 do not distinguish fact from fiction, we don’t support use-cases that require the generated text to be true.

Additionally, language models like GPT-2 reflect the biases inherent to the systems they were trained on, so we do not recommend that they be deployed into systems that interact with humans unless the deployers first carry out a study of biases relevant to the intended use-case. We found no statistically significant difference in gender, race, and religious bias probes between 774M and 1.5B, implying all versions of GPT-2 should be approached with similar levels of caution around use cases that are sensitive to biases around human attributes.

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric | Value |
| --- | --- |
| Avg. | 24.85 |
| ARC (25-shot) | 21.93 |
| HellaSwag (10-shot) | 31.11 |
| MMLU (5-shot) | 25.05 |
| TruthfulQA (0-shot) | 40.71 |
| Winogrande (5-shot) | 50.12 |
| GSM8K (5-shot) | 0.3 |
| DROP (3-shot) | 4.73 |
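
The "Avg." figure appears to be the unweighted mean of the seven benchmark scores. The short check below assumes that reading; the dictionary just restates the numbers from the results above.

```python
scores = {
    "ARC (25-shot)": 21.93,
    "HellaSwag (10-shot)": 31.11,
    "MMLU (5-shot)": 25.05,
    "TruthfulQA (0-shot)": 40.71,
    "Winogrande (5-shot)": 50.12,
    "GSM8K (5-shot)": 0.3,
    "DROP (3-shot)": 4.73,
}

# Assumption: the reported average weights every benchmark equally
average = sum(scores.values()) / len(scores)
print(round(average, 2))
```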