Update README.md
add info about kv cache saving
README.md CHANGED

@@ -1,4 +1,5 @@
 ---
+library_name: transformers
 datasets:
 - cerebras/SlimPajama-627B
 language:
@@ -65,6 +66,8 @@ print(response[0]["generated_text"])
 
 ## The LCKV Collection
 
+The model has 2 warmup layers, i.e. 3/22 of the KV cache of a standard TinyLlama.
+
 This model was first initialized from the [TinyLlama 2.5T checkpoint](https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1195k-token-2.5T), then continued pre-training on 100B tokens from [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B).
 
 Since the model structure has been changed, the initialization cannot inherit the performance of the TinyLlama checkpoint, but it effectively boosts the training process compared to pre-training from scratch.
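For a rough sense of what 3/22 means in memory terms, here is a back-of-the-envelope sketch. It is not from the model card: it assumes the usual TinyLlama-1.1B configuration (22 layers, 4 KV heads under grouped-query attention, head dimension 64) with an fp16 cache, and it reads the "3" as the 2 warmup layers plus the single top layer whose KV cache LCKV keeps; the helper function below is hypothetical, for illustration only.

```python
# Back-of-the-envelope KV-cache size: standard TinyLlama vs. this LCKV model.
# Assumed TinyLlama-1.1B config: 22 layers, 4 KV heads (GQA), head_dim 64, fp16.

def kv_cache_bytes(num_layers, seq_len, batch_size=1,
                   num_kv_heads=4, head_dim=64, dtype_bytes=2):
    """Total bytes for the K and V tensors across `num_layers` decoder layers."""
    return 2 * num_layers * batch_size * seq_len * num_kv_heads * head_dim * dtype_bytes

seq_len = 2048  # TinyLlama's context length

standard = kv_cache_bytes(num_layers=22, seq_len=seq_len)  # every layer caches KV
lckv = kv_cache_bytes(num_layers=3, seq_len=seq_len)       # 2 warmup layers + top layer

print(f"standard TinyLlama: {standard / 2**20:.1f} MiB")   # 44.0 MiB
print(f"LCKV (this model):  {lckv / 2**20:.1f} MiB")       # 6.0 MiB
print(f"ratio:              {lckv / standard:.3f}")        # 0.136 ~= 3/22
```

The absolute saving grows linearly with sequence length and batch size, but the ratio stays fixed at 3/22, since only the number of KV-caching layers changes.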