---
library_name: transformers
license: mit
datasets:
- HuggingFaceFW/fineweb-edu
language:
- en
base_model:
- openai-community/gpt2
pipeline_tag: text-generation
tags:
- GPT
- GPT-3 Small
- GPT-3 Medium
- GPT-3 Large
- GPT-3 XL
- GPT-3 2.7B
- GPT-3 6.7B
- GPT-3 13B
- GPT-3 175B
- GPT-3
- GPT-2
- GPT-2 124M
- transformers
- mit
- HuggingFace
- fineweb-edu
- Decoder-Only
---
# Model Card for GPT-124M
## Overview
GPT-124M is a decoder-only transformer model based on OpenAI’s GPT-2 architecture. It is trained for text generation and other natural language processing (NLP) tasks. The model is designed for general-purpose language modeling, making it useful for applications such as text completion.
- **Library:** 🤗 `transformers`
- **License:** MIT
- **Datasets:** `HuggingFaceFW/fineweb-edu`
- **Language:** English
- **Base Model:** `openai-community/gpt2`
- **Pipeline Tag:** `text-generation`
- **Developer:** Samkeet Sangai
- **Funded By:** Samkeet Sangai
- **Shared By:** Samkeet Sangai
- **Model Type:** GPT Decoder-Only
## Model Sources
- **Paper:** [Language Models are Unsupervised Multitask Learners](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)
- **Paper:** [Language Models are Few-Shot Learners](https://arxiv.org/pdf/2005.14165)
- **Paper:** [Training Compute-Optimal Large Language Models](https://arxiv.org/pdf/2203.15556)
- **Video:** [Andrej Karpathy: Let's reproduce GPT-2 (124M)](https://youtu.be/l8pRSuU81PU?si=KAo1y9dHYQAGJmj5)
- **Demo:** [GPT 124M Demo](https://huggingface.co/spaces/samkeet/GPT_124M)
- **GitHub:** [SamkeetSangai/GPT_124M](https://github.com/SamkeetSangai/GPT_124M)
## Model Details
### Model Description
GPT-124M is a lightweight generative language model trained on the `fineweb-edu` dataset. It can generate coherent and contextually relevant text, but it is not instruction-tuned and is not optimized for safety or factual accuracy.
### Training Configuration
- **Block Size:** `1024`
- **Vocabulary Size:** `50304`
- **Number of Layers:** `12`
- **Number of Attention Heads:** `12`
- **Embedding Size:** `768`
- **Hardware:** `8x NVIDIA RTX 4090 GPUs`
- **Training Duration:** `13 hours`
- **Dataset:** `fineweb-edu` (10 billion tokens)
- **Training Date:** `January 2025`
- **Validation Dataset:** `100 million tokens` of `HuggingFaceFW/fineweb-edu`
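For reference, these hyperparameters map onto a standard GPT-2 configuration roughly as follows. This is only an illustrative sketch using `GPT2Config`; the checkpoint itself ships custom model code (loaded via `trust_remote_code`), so the actual configuration class may differ.
```python
# Illustrative only: the hyperparameters above expressed as a stock GPT2Config.
from transformers import GPT2Config

config = GPT2Config(
    vocab_size=50304,  # padded vocabulary size
    n_positions=1024,  # block size (maximum context length)
    n_embd=768,        # embedding size
    n_layer=12,        # number of transformer layers
    n_head=12,         # number of attention heads
)
print(config)
```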
## Usage
You can use this model for text generation using the `transformers` library.
### Method 1: Using Pipeline
```python
# Import necessary modules from transformers
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

# Load tokenizer and model
model_name = "samkeet/GPT_124M"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Create text generation pipeline
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, device="cpu")

# Generate text with sampling
result = pipe("Earth revolves around the", do_sample=True, max_length=40, temperature=0.9, top_p=0.5, top_k=50)
print("Pipeline Output:", result)
```
### Method 2: Direct Generation
```python
# Import necessary libraries
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load tokenizer and model
model_name = "samkeet/GPT_124M"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Function for direct tokenization and text generation
def generate_text(input_text, device="cpu"):
    # Tokenize the prompt and move model and inputs to the target device
    tokens = tokenizer.encode(input_text, return_tensors="pt").to(device)
    model.to(device)

    # Generate output with sampling
    with torch.no_grad():
        output = model.generate(
            tokens,
            do_sample=True,
            max_length=40,
            temperature=0.9,
            top_p=0.5,
            top_k=50,
        )

    # Decode the first generated sequence back into text
    generated_sentence = tokenizer.decode(output[0], skip_special_tokens=True)
    return generated_sentence

# Generate text
input_text = "Earth revolves around the"
print("Direct Output:", generate_text(input_text))
```
### Fine-tuning & Downstream Use
This model can be fine-tuned for specific NLP applications like:
- Dialogue generation
- Text summarization
- Creative writing
- Code generation
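A minimal fine-tuning sketch for such downstream uses is shown below, based on the Hugging Face `Trainer` API. The dataset file, text column, and hyperparameters are placeholders for illustration, not values used to train this model.
```python
# Hypothetical fine-tuning sketch; dataset and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "samkeet/GPT_124M"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 tokenizers have no pad token
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Placeholder corpus: replace with your own data and text column
dataset = load_dataset("text", data_files={"train": "my_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gpt124m-finetuned",
        num_train_epochs=1,
        per_device_train_batch_size=4,
    ),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```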
## Limitations & Risks
### Out-of-Scope Use
- The model is **not instruction-tuned** for safety, ethics, or factual accuracy.
- It may produce **biased, misleading, or unsafe outputs**.
- It should **not** be used for tasks requiring high reliability, such as medical, legal, or financial applications.
### Bias, Risks, and Limitations
- The dataset may contain biases present in public web data.
- The model does not filter or detect offensive content.
- The model may **hallucinate** incorrect facts.
### Recommendations
- Always **verify** generated content before use.
- Implement **content filtering mechanisms** for deployment (a minimal sketch follows this list).
- Use in supervised environments only.
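As a minimal illustration of the filtering recommendation, the snippet below applies a placeholder blocklist to generated text, reusing the `pipe` object from the Usage section. A production deployment would use a proper moderation classifier rather than a hard-coded word list.
```python
# Placeholder blocklist filter; not a substitute for a real moderation model.
BLOCKLIST = {"example_unsafe_term", "another_blocked_term"}  # hypothetical terms

def is_safe(text: str) -> bool:
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKLIST)

generated = pipe("Earth revolves around the", do_sample=True, max_length=40)[0]["generated_text"]
print(generated if is_safe(generated) else "[filtered]")
```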
## Evaluation
### Training & Validation Loss
Validation was conducted using `100 million tokens` from the `HuggingFaceFW/fineweb-edu` dataset. The training and validation loss curves indicate stable convergence with minimal overfitting: the training loss reached a minimum of 2.88, while the validation loss stabilized at 2.97.

### Results
The model was benchmarked against OpenAI’s GPT-2 Small and GPT-3 Small (both ~124M parameters). Remarkably, despite being trained on only `10 billion tokens` versus GPT-3 Small's `300 billion tokens`, GPT-124M outperformed both models on the `HellaSwag` evaluation. This advantage is attributed to the specialized training data (educational content), in contrast to GPT-3 Small’s broader multilingual and multi-domain training data.
According to the Chinchilla scaling laws (roughly 20 training tokens per parameter), a 124M-parameter model is compute-optimal at around `2.48 billion tokens`; the much larger token budget used for GPT-3 Small may therefore have yielded diminishing returns in performance.
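The token budget quoted above follows directly from the ~20 tokens-per-parameter rule of thumb:
```python
# Compute-optimal token budget under the ~20 tokens-per-parameter heuristic
params = 124e6
optimal_tokens = params * 20
print(f"{optimal_tokens / 1e9:.2f}B tokens")  # 2.48B tokens
```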

### Key Insights from Evaluation
- **Efficient Training:** The model performs impressively relative to its training token count, suggesting efficient use of compute; wall-clock training time was further reduced by using Distributed Data Parallel (DDP) across 8 GPUs.
- **Data-Specific Advantage:** Training exclusively on educational data may have given GPT-124M an edge in evaluation metrics like `HellaSwag`.
- **Scaling Considerations:** GPT-3 Small, despite being trained on 300B tokens, does not exhibit proportionally better performance due to scaling limitations.
## Environmental Impact
- **Hardware Used:** `8x NVIDIA RTX 4090 GPUs`
- **Training Time:** `13 hours` wall-clock on 8 GPUs (`104 GPU-hours`; see the sketch after this list)
- **Estimated Carbon Emissions:** `13.48 kg CO2 eq.`
- **Equivalent to:**
- `54.5 km` driven by an average ICE car
- `6.75 kg` of coal burned
- `0.22` tree seedlings sequestering carbon for 10 years
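The figures above are roughly consistent with the back-of-the-envelope calculation below; the per-GPU power draw and grid carbon intensity are assumptions made here for illustration, not values reported for the actual training run.
```python
# Rough reconstruction; power draw and carbon intensity are assumed values.
gpus, hours = 8, 13
gpu_hours = gpus * hours                    # 104 GPU-hours
power_kw_per_gpu = 0.45                     # assumed ~450 W per RTX 4090
energy_kwh = gpu_hours * power_kw_per_gpu   # ~46.8 kWh
carbon_intensity = 0.288                    # assumed kg CO2 eq. per kWh
print(f"{energy_kwh * carbon_intensity:.2f} kg CO2 eq.")  # ~13.48
```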
## Technical Specifications
### Model Architecture
GPT-124M follows the architecture of OpenAI's GPT-2, which consists of:
- **Transformer-based decoder model**
- **Self-attention mechanism**
- **Layer normalization & feed-forward networks**
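For intuition, a simplified pre-norm decoder block in plain PyTorch is sketched below. This is illustrative only; the repository's actual implementation (custom code loaded via `trust_remote_code`) may differ in details such as the attention implementation and weight initialization.
```python
# Simplified GPT-2-style (pre-norm) decoder block; illustrative, not the
# exact module used in this repository.
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, n_embd=768, n_head=12):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        # Causal mask: each position may only attend to itself and earlier positions
        t = x.size(1)
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln_1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out                 # residual connection around attention
        x = x + self.mlp(self.ln_2(x))   # residual connection around the MLP
        return x

x = torch.randn(1, 16, 768)        # (batch, sequence length, embedding size)
print(DecoderBlock()(x).shape)     # torch.Size([1, 16, 768])
```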
### Compute Infrastructure
- **Hardware:** 8x NVIDIA RTX 4090 GPUs
- **Software:** PyTorch, Hugging Face Transformers
- **Precision:** FP32
## Citation
If you use this model, please cite:
```bibtex
@article{gpt124m,
title={GPT-124M: A Compact Transformer Model for NLP},
author={Samkeet Sangai},
year={2024},
url={https://huggingface.co/samkeet/GPT_124M}
}
```
## Contact
For inquiries, contact [Samkeet Sangai](https://www.linkedin.com/in/samkeet-sangai/).