---
library_name: transformers
license: mit
datasets:
- HuggingFaceFW/fineweb-edu
language:
- en
base_model:
- openai-community/gpt2
pipeline_tag: text-generation
tags:
- GPT
- GPT-3 Small
- GPT-3 Medium
- GPT-3 Large
- GPT-3 XL
- GPT-3 2.7B
- GPT-3 6.7B
- GPT-3 13B
- GPT-3 175B
- GPT-3
- GPT-2
- GPT-2 124M
- transformers
- mit
- HuggingFace
- fineweb-edu
- Decoder-Only
---
# Model Card for GPT-124M

## Overview

GPT-124M is a decoder-only transformer model based on OpenAI’s GPT-2 architecture. It is trained for general-purpose language modeling and can be used for text generation and other natural language processing (NLP) tasks, such as text completion.

- **Library:** 🤗 `transformers`
- **License:** MIT
- **Datasets:** `HuggingFaceFW/fineweb-edu`
- **Language:** English
- **Base Model:** `openai-community/gpt2`
- **Pipeline Tag:** `text-generation`
- **Developer:** Samkeet Sangai
- **Funded By:** Samkeet Sangai
- **Shared By:** Samkeet Sangai
- **Model Type:** GPT Decoder-Only

## Model Sources

- **Paper:** [Language Models are Unsupervised Multitask Learners](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
- **Paper:** [Language Models are Few-Shot Learners](https://arxiv.org/pdf/2005.14165)
- **Paper:** [Training Compute-Optimal Large Language Models](https://arxiv.org/pdf/2203.15556)
- **Video:** [Andrej Karpathy-Let's reproduce GPT-2 (124M)](https://youtu.be/l8pRSuU81PU?si=KAo1y9dHYQAGJmj5)
- **Demo:** [GPT 124M Demo](https://huggingface.co/spaces/samkeet/GPT_124M)
- **GitHub:** [SamkeetSangai/GPT_124M](https://github.com/SamkeetSangai/GPT_124M)

## Model Details

### Model Description
GPT-124M is a lightweight generative language model trained on the `fineweb-edu` dataset. It can generate coherent and contextually relevant text, but it is not fine-tuned for instruction following, safety, or factual accuracy.

### Training Configuration
- **Block Size:** `1024`
- **Vocabulary Size:** `50304`
- **Number of Layers:** `12`
- **Number of Attention Heads:** `12`
- **Embedding Size:** `768`
- **Hardware:** `8x NVIDIA RTX 4090 GPUs`
- **Training Duration:** `13 hours`
- **Dataset:** `fineweb-edu` (10 billion tokens)
- **Training Date:** `January 2025`
- **Validation Dataset:** 100 million tokens of HuggingFaceFW/fineweb-edu
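
For reference, the architecture settings above map onto the standard 🤗 `GPT2Config` roughly as shown below. This is an illustrative sketch only; the repository ships its own model code via `trust_remote_code`, which may differ in detail.

```python
# Approximate architecture definition using the stock GPT-2 config from transformers.
# Illustrative only: the repository's custom (trust_remote_code) classes may differ.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=50304,  # padded vocabulary size used during training
    n_positions=1024,  # block size (maximum context length)
    n_layer=12,        # number of transformer layers
    n_head=12,         # attention heads per layer
    n_embd=768,        # embedding / hidden size
)

model = GPT2LMHeadModel(config)
n_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {n_params / 1e6:.1f}M")  # ~124M (embedding and LM head weights are tied)
```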
  
## Usage
You can use this model for text generation using the `transformers` library.

### Method 1: Using Pipeline
```python
# Import necessary modules from transformers
from transformers import pipeline, AutoTokenizer

# Load the tokenizer
model_name = "samkeet/GPT_124M"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Create text generation pipeline
pipe = pipeline("text-generation", model=model_name, tokenizer=tokenizer, trust_remote_code=True, device="cpu")

# Generate text
result = pipe("Earth revolves around the", do_sample=True, max_length=40, temperature=0.9, top_p=0.5, top_k=50)
print("Pipeline Output:", result)
```

### Method 2: Direct Generation
```python
# Import necessary libraries
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load tokenizer and model
model_name = "samkeet/GPT_124M"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Function for direct tokenization and text generation
def generate_text(input_text, device="cuda" if torch.cuda.is_available() else "cpu"):
    tokens = tokenizer.encode(input_text, return_tensors="pt").to(device)
    model.to(device)

    # Generate output
    output = model.generate(
        tokens,
        do_sample=True,
        max_length=40,
        temperature=0.9,
        top_p=0.5,
        top_k=50,
    )

    # Decode the first (and only) generated sequence
    generated_sentence = tokenizer.decode(output[0], skip_special_tokens=True)
    return generated_sentence

# Generate text
input_text = "Earth revolves around the"
print("Direct Output:", generate_text(input_text))
```

### Fine-tuning & Downstream Use
This model can be fine-tuned for specific NLP applications (a minimal fine-tuning sketch follows this list), such as:
- Dialogue generation
- Text summarization
- Creative writing
- Code generation
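
The sketch below shows one way to run such a fine-tune with the standard 🤗 `Trainer` API under a causal language modeling objective. The dataset (`wikitext`), column names, and hyperparameters are placeholder assumptions, not values prescribed by this card.

```python
# Minimal causal-LM fine-tuning sketch with the Hugging Face Trainer.
# Dataset, columns, and hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "samkeet/GPT_124M"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # GPT-2 tokenizers ship without a pad token

# Replace with your own corpus; wikitext is used here only as a small public example.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")
dataset = dataset.filter(lambda ex: len(ex["text"].strip()) > 0)  # drop empty lines

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gpt124m-finetuned",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        learning_rate=5e-5,
    ),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```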

## Limitations & Risks

### Out-of-Scope Use
- The model is **not instruction-tuned** for safety, ethics, or factual accuracy.
- It may produce **biased, misleading, or unsafe outputs**.
- It should **not** be used for tasks requiring high reliability, such as medical, legal, or financial applications.

### Bias, Risks, and Limitations
- The dataset may contain biases present in public web data.
- The model does not filter or detect offensive content.
- The model may **hallucinate** incorrect facts.

### Recommendations
- Always **verify** generated content before use.
- Implement **content filtering mechanisms** for deployment.
- Use in supervised environments only.

## Evaluation

### Training & Validation Loss
Validation was conducted using `100 million tokens` from the `HuggingFaceFW/fineweb-edu` dataset. The training and validation loss curves indicate stable convergence with minimal overfitting: the training loss reached a minimum of 2.88, while the validation loss stabilized at 2.97.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/670142e648894dfbedacacaf/fAwiSHr4f9SmO9PYiCntY.png)

### Results
The model was benchmarked against OpenAI’s GPT-2 Small and GPT-3 Small (both ~124M parameters). Remarkably, despite being trained on only `10 billion tokens`, compared to GPT-3 Small's `300 billion tokens`, GPT-124M outperformed both models on the `HellaSwag` benchmark. This advantage is attributed to the specialized training data (educational content), which contrasts with GPT-3 Small’s broader multilingual and multi-domain training data.

According to the Chinchilla scaling laws, the compute-optimal ratio of roughly 20 training tokens per parameter suggests that a 124M-parameter model needs about `2.48 billion tokens` of training data. The excess training tokens used for GPT-3 Small may therefore have yielded diminishing returns.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/670142e648894dfbedacacaf/Ne2MYAB2C0yHWFJLjCww3.png)
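
As a quick sanity check on the Chinchilla estimate above (assuming the commonly cited ratio of about 20 tokens per parameter):

```python
# Chinchilla-style compute-optimal token estimate (~20 tokens per parameter).
params = 124e6          # GPT-124M parameter count
tokens_per_param = 20   # approximate ratio reported in the Chinchilla paper
optimal_tokens = params * tokens_per_param
print(f"{optimal_tokens / 1e9:.2f}B tokens")  # -> 2.48B tokens
```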

### Key Insights from Evaluation
- **Efficient Training:** The model performs well relative to its training token count; Distributed Data Parallel (DDP) training across 8 GPUs also kept the wall-clock cost low.
- **Data-Specific Advantage:** Training exclusively on educational data may have given GPT-124M an edge on benchmarks such as `HellaSwag`.
- **Scaling Considerations:** GPT-3 Small, despite being trained on 300B tokens, does not show proportionally better performance, consistent with diminishing returns beyond the compute-optimal token budget.

## Environmental Impact

- **Hardware Used:** `8x NVIDIA RTX 4090 GPUs`
- **Training Time:** `13 hours wall-clock (8 GPUs × 13 h = 104 GPU-hours)`
- **Estimated Carbon Emissions:** `13.48 kg CO2 eq.`
- **Equivalent to:**
  - `54.5 km` driven by an average ICE car
  - `6.75 kg` of coal burned
  - `0.22` tree seedlings sequestering carbon for 10 years
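
These figures can be roughly reproduced with the usual GPU-hours × power × carbon-intensity estimate. The per-GPU power draw and grid carbon intensity below are assumptions chosen for illustration; the card does not state which values were actually used.

```python
# Back-of-the-envelope reproduction of the emissions estimate above.
# Power draw and carbon intensity are assumptions, not values from the card.
gpu_hours = 8 * 13        # 8 GPUs x 13 hours = 104 GPU-hours
power_kw = 0.450          # assumed RTX 4090 board power (450 W)
carbon_intensity = 0.288  # assumed kg CO2 eq. per kWh of electricity

energy_kwh = gpu_hours * power_kw
emissions_kg = energy_kwh * carbon_intensity
print(f"{energy_kwh:.1f} kWh -> {emissions_kg:.2f} kg CO2 eq.")  # ~46.8 kWh -> ~13.48 kg
```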

## Technical Specifications

### Model Architecture
GPT-124M follows the architecture of OpenAI's GPT-2, which consists of:
- **Transformer-based decoder model**
- **Self-attention mechanism**
- **Layer normalization & feed-forward networks**
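
For orientation, these components combine into a GPT-2-style pre-norm decoder block roughly as sketched below. This is a simplified illustration, not the exact implementation used to train this model.

```python
# Simplified GPT-2-style decoder block (pre-layer-norm), for illustration only.
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, n_embd=768, n_head=12):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        # Causal mask: each position may attend only to itself and earlier positions
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln_1(x)
        a, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + a                       # residual connection around attention
        x = x + self.mlp(self.ln_2(x))  # residual connection around the MLP
        return x

x = torch.randn(1, 16, 768)  # (batch, sequence, embedding)
print(Block()(x).shape)      # torch.Size([1, 16, 768])
```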

### Compute Infrastructure
- **Hardware:** 8x NVIDIA RTX 4090 GPUs
- **Software:** PyTorch, Hugging Face Transformers
- **Precision:** FP32

## Citation

If you use this model, please cite:

```bibtex
@article{gpt124m,
  title={GPT-124M: A Compact Transformer Model for NLP},
  author={Samkeet Sangai},
  year={2024},
  url={https://huggingface.co/samkeet/GPT_124M}
}
```

## Contact
For inquiries, contact [Samkeet Sangai](https://www.linkedin.com/in/samkeet-sangai/).