rskuzma committed on
Commit 287d8f9 · 1 Parent(s): 908e05c

update README, slight revisions and typos (#5)

- update README, slight revisions and typos (b09d46a667d2d83d21acdb95fc79731dc5bdd0d6)

Files changed (1)
  1. README.md +8 -8
README.md CHANGED
@@ -27,7 +27,7 @@ BTLM-3B-8k was trained with a similar architecture to [CerebrasGPT](https://arxi
27
  BTLM-3B-8k-base:
28
  - **Licensed for commercial use** (Apache 2.0).
29
  - **[State of the art 3B parameter model](#performance-vs-3b-models)**.
30
- - **Provides 7B model performance in a 3B model** via performance enhancements from [ALiBi](https://arxiv.org/abs/2108.12409), [SwiGLU](https://arxiv.org/abs/2002.05202), [maximal update parameterization (muP)](https://arxiv.org/abs/2203.03466) and the the extensively duduplicated and cleaned [SlimPajama-627B dataset](https://huggingface.co/datasets/cerebras/SlimPajama-627B).
31
  - **[Fits in devices with as little as 3GB of memory](#memory-requirements) when quantized to 4-bit**.
32
  - **One of few 3B models that supports 8k sequence length** thanks to ALiBi.
33
  - **Requires 71% fewer training FLOPs, has 58% smaller memory footprint** for inference than comparable 7B models.
@@ -115,6 +115,12 @@ Table 2: Performance at 7B model size. Detailed down-stream tasks comparisons. M
115
  ![figure_4_image](./figure_4_performance_vs_7b_models.jpg)
116
  Figure 4: Performance at 7B model size
117
 
118
  ## Model Details
119
  - Developed by: [Cerebras Systems](https://www.cerebras.net/) and [Opentensor](https://opentensor.ai/) with generous support from [G42 Cloud](https://www.g42cloud.com/) and [IIAI](https://www.inceptioniai.org/en/)
120
  - License: Apache 2.0
@@ -127,7 +133,7 @@ Figure 4: Performance at 7B model size
127
  - Optimizer: AdamW
128
  - Positional Encoding: ALiBi
129
  - Language: English
130
- - Learn more: [BTLM-3B-8k blog post](https://www.cerebras.net/blog/btlm-3b-8k-7b-performance-in-a-3-billion-parameter-model/)
131
  - Paper: Coming soon
132
 
133
  ## To continue training with PyTorch and Maximal Update Parameterization
@@ -162,12 +168,6 @@ The primary intended use is to further research into large language models. BTLM
162
 
163
  You may fine-tune and adapt BTLM-3B-8k-base model via either Cerebras [Model Studio](https://www.cerebras.net/product-cloud/) or third-party libraries. Further safety-related testing and mitigations should be applied before using the BTLM-3B-8k-base in production downstream applications.
164
 
165
- ## Long Sequence Lengths
166
- To enable long sequence applications, we use ALiBi position embeddings and trained on 470B tokens at the context length of 2,048 followed by 157B of tokens trained at 8,192 context length. To assess BTLM’s long sequence capability, we evaluate the on SlimPajama test set with 32,768 context length and plot loss at each token position. Although ALiBi allows extrapolation in theory, 2,048 context length training alone does not extrapolate well in practice. Thankfully variable sequence length training allows substantially improves extrapolation. BTLM-3B extrapolates well up to 10k context length but the performance degrades slightly beyond this.
167
-
168
- ![figure_5_image](./figure_5_xentropy_with_sequence_lengths.png)
169
- Figure 5: BTLM-3B model's cross-entropy evaluation on the SlimPajama’s test set. Inference performed on the extrapolated sequence length of 32,768 tokens.
170
-
171
  ### Out of Scope Use
172
  BTLM-3B-8k-base was trained on SlimPajama, with primarily English language, and is not recommended for machine translation tasks. BTLM-3B-8k-base has not been tuned for instruction-following or chat-based use cases.
173
 
 
27
  BTLM-3B-8k-base:
28
  - **Licensed for commercial use** (Apache 2.0).
29
  - **[State of the art 3B parameter model](#performance-vs-3b-models)**.
30
+ - **Provides 7B model performance in a 3B model** via performance enhancements from [ALiBi](https://arxiv.org/abs/2108.12409), [SwiGLU](https://arxiv.org/abs/2002.05202), [maximal update parameterization (muP)](https://arxiv.org/abs/2203.03466) and the the extensively deduplicated and cleaned [SlimPajama-627B dataset](https://huggingface.co/datasets/cerebras/SlimPajama-627B).
31
  - **[Fits in devices with as little as 3GB of memory](#memory-requirements) when quantized to 4-bit**.
32
  - **One of few 3B models that supports 8k sequence length** thanks to ALiBi.
33
  - **Requires 71% fewer training FLOPs, has 58% smaller memory footprint** for inference than comparable 7B models.
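
The 4-bit memory bullet above can be sanity-checked with simple arithmetic. The sketch below is illustrative only: it assumes a nominal parameter count of 3 billion (taken from the "3B" in the model name, not from a measured count) and counts weights alone, ignoring activations, the KV cache, and quantization metadata. `weight_memory_gb` and `NOMINAL_PARAMS` are hypothetical names.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: nominal count implied by the "3B" in the model name.
NOMINAL_PARAMS = 3e9

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_memory_gb(NOMINAL_PARAMS, bits):.1f} GB")
```

Under these assumptions the 4-bit weights take roughly half of the quoted 3GB budget, which leaves the remainder for runtime buffers.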
 
115
  ![figure_4_image](./figure_4_performance_vs_7b_models.jpg)
116
  Figure 4: Performance at 7B model size
117
 
118
+ ## Long Sequence Lengths
119
+ To enable long sequence applications, we use ALiBi position embeddings and trained on 470B tokens at the context length of 2,048 followed by 157B of tokens trained at 8,192 context length. To assess BTLM’s long sequence capability, we evaluate it on SlimPajama test set with 32,768 context length and plot loss at each token position. Although ALiBi allows extrapolation in theory, 2,048 context length training alone does not extrapolate well in practice. Thankfully variable sequence length training allows for substantially improved extrapolation. BTLM-3B extrapolates well up to 10k context length but the performance degrades slightly beyond this.
120
+
121
+ ![figure_5_image](./figure_5_xentropy_with_sequence_lengths.png)
122
+ Figure 5: BTLM-3B model's cross-entropy evaluation on the SlimPajama’s test set. Inference performed on the extrapolated sequence length of 32,768 tokens.
123
+
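
For readers unfamiliar with ALiBi, the following is a minimal sketch of the attention bias as formulated in the ALiBi paper linked above, not the BTLM implementation. It assumes a power-of-two head count, and the function names `alibi_slopes` and `alibi_bias` are hypothetical. Each head adds a penalty proportional to the query-key distance, so no learned position table caps the sequence length; this is why extrapolation beyond the training context is possible in principle.

```python
def alibi_slopes(num_heads: int) -> list[float]:
    """Geometric per-head slopes: 2^(-8/n), 2^(-16/n), ..., 2^(-8)."""
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (i + 1) for i in range(num_heads)]


def alibi_bias(slope: float, seq_len: int) -> list[list[float]]:
    """Causal bias for one head, added to the attention scores before softmax.

    Entry [q][k] is -slope * (q - k) for keys at or before the query,
    and -inf for future keys (the causal mask).
    """
    return [
        [-slope * (q - k) if k <= q else float("-inf") for k in range(seq_len)]
        for q in range(seq_len)
    ]


slopes = alibi_slopes(8)
bias = alibi_bias(slopes[0], seq_len=4)
```

Because the bias depends only on the distance `q - k`, the same function works unchanged for any `seq_len`, including lengths longer than those seen in training.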
124
  ## Model Details
125
  - Developed by: [Cerebras Systems](https://www.cerebras.net/) and [Opentensor](https://opentensor.ai/) with generous support from [G42 Cloud](https://www.g42cloud.com/) and [IIAI](https://www.inceptioniai.org/en/)
126
  - License: Apache 2.0
 
133
  - Optimizer: AdamW
134
  - Positional Encoding: ALiBi
135
  - Language: English
136
+ - Learn more: [BTLM-3B-8k blog](https://www.cerebras.net/blog/btlm-3b-8k-7b-performance-in-a-3-billion-parameter-model/)
137
  - Paper: Coming soon
138
 
139
  ## To continue training with PyTorch and Maximal Update Parameterization
 
168
 
169
  You may fine-tune and adapt BTLM-3B-8k-base model via either Cerebras [Model Studio](https://www.cerebras.net/product-cloud/) or third-party libraries. Further safety-related testing and mitigations should be applied before using the BTLM-3B-8k-base in production downstream applications.
170
 
171
  ### Out of Scope Use
172
  BTLM-3B-8k-base was trained on SlimPajama, with primarily English language, and is not recommended for machine translation tasks. BTLM-3B-8k-base has not been tuned for instruction-following or chat-based use cases.
173