juewang committed
Commit 0187783
1 Parent(s): 3dbabd8

Update README.md

Files changed (1)
  1. README.md +14 -9
README.md CHANGED
@@ -78,9 +78,14 @@ widget:
 
  # Model Summary
 
- We present GPT-JT, a fork of GPT-J (6B), trained on 3.53 billion tokens, that outperforms most 100B+ parameter models at classification.
- GPT-JT was trained with a new decentralized algorithm on computers networked with a 1Gbps interconnect, in contrast with typical 100Gbps-1.6Tbps data center networks.
- GPT-JT is a bidirectional dense model, which processes the prompt with bidirectional attention to fully leverage the context information, and uses causal attention only for token generation.
+ > With a new decentralized training algorithm, we fine-tuned GPT-J (6B) on 3.53 billion tokens, resulting in GPT-JT (6B), a model that outperforms many 100B+ parameter models on classification benchmarks.
+
+ We incorporated a collection of open techniques and datasets to build GPT-JT:
+ - GPT-JT was fine-tuned from GPT-J (6B), created by [EleutherAI](https://www.eleuther.ai);
+ - We used [UL2](https://github.com/google-research/google-research/tree/master/ul2)'s training objective, which allows the model to use bidirectional context to process the prompt;
+ - The model was trained on a large collection of diverse data, including [Chain-of-Thought (CoT)](https://ai.googleblog.com/2022/05/language-models-perform-reasoning-via.html), the [Public Pool of Prompts (P3)](https://huggingface.co/datasets/bigscience/P3) dataset, and the [Natural-Instructions (NI)](https://github.com/allenai/natural-instructions) dataset.
+
+ With the help of the techniques mentioned above, GPT-JT significantly improves performance on classification tasks over the original GPT-J, and even outperforms most 100B+ parameter models!
 
  ***Please try out our [Online Demo](https://huggingface.co/spaces/togethercomputer/GPT-JT)!***
 
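To try the fine-tuned model locally rather than through the demo, here is a minimal sketch using the Hugging Face `transformers` API; the model id matches the usage line visible in the next hunk header, while the tokenizer call and the sentiment prompt are illustrative assumptions rather than the README's own example:

```python
# Minimal sketch: load GPT-JT and run a short classification-style prompt.
# The model id comes from the README's usage section; the prompt below is
# only an illustration, not an official example.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/GPT-JT-6B-v1")
model = AutoModelForCausalLM.from_pretrained("togethercomputer/GPT-JT-6B-v1")

prompt = "The weather is lovely and the food was great.\nSentiment (positive/negative):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=5, do_sample=False)

# Print only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
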
@@ -105,8 +110,9 @@ model = AutoModelForCausalLM.from_pretrained("togethercomputer/GPT-JT-6B-v1")
105
  ## UL2 Training Objective
106
 
107
  We train GPT-J using UL2 training objective [1][2].
108
- The usual GPT model, including GPT-J, uses the lower left causal mask to do autoregressive generation, so for each token, it can only see the context information before itself.
109
- In order to fully leverage the context information, we continue training with UL2 training objectives, and uses the lower right causal mask with prefix -- using bidirectional attention for the prompt and causal attention for token generation.
 
110
 
111
  $$
112
  \begin{bmatrix}
@@ -126,15 +132,13 @@ $$
  \end{bmatrix}
  $$
 
- ## Data
-
- We fine-tune [GPT-J-6B](https://huggingface.co/EleutherAI/gpt-j-6B) on NI, P3, COT, and the Pile data.
+ Furthermore, we leverage a large collection of data, including NI, P3, COT, and the Pile:
  - [Natural-Instructions](https://github.com/allenai/natural-instructions)
  - [P3](https://huggingface.co/datasets/Muennighoff/P3)
  - [MMLU-COT](https://github.com/jasonwei20/flan-2/blob/main/mmlu-cot.json)
  - [the pile](https://huggingface.co/datasets/the_pile)
 
- We first conduct training for 2.62 billion tokens using the UL2 loss on the Pile, followed by 0.92 billion tokens with a mixture of the above datasets: 5% of COT, 20% of P3, 20% of NI, and 55% of the Pile.
+ Specifically, we first conduct training for 2.62 billion tokens using the UL2 loss on the Pile, followed by 0.92 billion tokens with a mixture of the above datasets: 5% of COT, 20% of P3, 20% of NI, and 55% of the Pile.
 
  ## Hyperparameters
 
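As an illustration of the two attention patterns contrasted in the UL2 hunks above (a plain causal mask versus a causal mask with prefix), here is a minimal sketch of how such masks could be built in PyTorch; the function names and the example prefix length are illustrative assumptions, not code from this repository:

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # Standard GPT-style mask: token i may attend to tokens 0..i (lower-triangular).
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

def prefix_causal_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    # UL2-style "causal mask with prefix": the first `prefix_len` tokens (the prompt)
    # attend to each other bidirectionally; the remaining tokens stay causal.
    mask = causal_mask(seq_len)
    mask[:prefix_len, :prefix_len] = True
    return mask

# Example: 6 tokens, the first 3 forming the prompt/prefix.
print(causal_mask(6).int())
print(prefix_causal_mask(6, prefix_len=3).int())
```

Rows index query positions and columns index key positions; `True` marks the positions a token may attend to, so the prefix block is fully bidirectional while generation remains causal.
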
@@ -146,6 +150,7 @@ During training, we truncate the input sequence to 2048 tokens, and for input se
  ## Infrastructure
 
  We used [the Together Research Computer](https://together.xyz/) to conduct training.
+ The model was trained on computers networked with a 1Gbps interconnect (in contrast, typical data center networks are 100Gbps-1.6Tbps).
 
  # References
 
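The data hunk above describes the second training stage as a 0.92-billion-token mixture of 5% CoT, 20% P3, 20% NI, and 55% the Pile. Here is a minimal sketch of drawing examples according to those weights, assuming only the Python standard library; the placeholder datasets stand in for real iterators and are not the project's actual data-loading code:

```python
import random

# Mixture weights from the README: 5% CoT, 20% P3, 20% NI, 55% the Pile.
# The example lists below are placeholders standing in for real dataset iterators.
mixture = {
    "cot": (0.05, ["cot example 1", "cot example 2"]),
    "p3": (0.20, ["p3 example 1", "p3 example 2"]),
    "ni": (0.20, ["ni example 1", "ni example 2"]),
    "pile": (0.55, ["pile example 1", "pile example 2"]),
}

def sample_batch(batch_size: int, seed: int = 0) -> list[str]:
    # Draw each example from a dataset chosen according to the mixture weights.
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name][0] for name in names]
    batch = []
    for _ in range(batch_size):
        name = rng.choices(names, weights=weights, k=1)[0]
        batch.append(rng.choice(mixture[name][1]))
    return batch

print(sample_batch(8))
```
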
 