nixgd committed
Commit
97b7275
1 Parent(s): 85c27af

Correct maximum positional embeddings


The model appears to have been trained with a context window of 512, not 2048 as the config claims. This can be seen from the average loss by sequence position on the GPT-4-generated TinyStories dataset (packed into inputs of length 2048):

![image.png](https://cdn-uploads.huggingface.co/production/uploads/65b0cb8770773c0ab8fde1e0/qXnk9-RtXGrXlUlkZCxl3.png)

It would be great to get this changed (for all TinyStories models), as the current config is misleading. A sketch of the measurement follows below.
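For reference, here is a minimal sketch of that diagnostic, assuming a TinyStories checkpoint such as `roneneldan/TinyStories-33M` (the model id and batching are placeholders, not necessarily the exact setup behind the plot):

```python
# Compute average cross-entropy loss per sequence position over packed
# 2048-token inputs. A sharp jump after position ~512 indicates the
# effective trained context window is 512, as in the plot above.
import torch
from transformers import AutoModelForCausalLM

model_id = "roneneldan/TinyStories-33M"  # assumption: any TinyStories checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

@torch.no_grad()
def loss_by_position(input_ids: torch.Tensor) -> torch.Tensor:
    """input_ids: (batch, 2048) packed token ids; returns mean loss per position."""
    logits = model(input_ids).logits                  # (batch, seq, vocab)
    shift_logits = logits[:, :-1].transpose(1, 2)     # (batch, vocab, seq-1)
    shift_labels = input_ids[:, 1:]                   # (batch, seq-1)
    losses = torch.nn.functional.cross_entropy(
        shift_logits, shift_labels, reduction="none"
    )                                                 # per-token loss, (batch, seq-1)
    return losses.mean(dim=0)                         # average over the batch
```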

Files changed (1):
  1. config.json +1 -1
config.json CHANGED
```diff
@@ -28,7 +28,7 @@
   "initializer_range": 0.02,
   "intermediate_size": null,
   "layer_norm_epsilon": 1e-05,
-  "max_position_embeddings": 2048,
+  "max_position_embeddings": 512,
   "model_type": "gpt_neo",
   "num_heads": 16,
   "num_layers": 4,
```