update README, slight revisions and typos
README.md CHANGED
```diff
@@ -27,7 +27,7 @@ BTLM-3B-8k was trained with a similar architecture to [CerebrasGPT](https://arxi…
 BTLM-3B-8k-base:
 - **Licensed for commercial use** (Apache 2.0).
 - **[State of the art 3B parameter model](#performance-vs-3b-models)**.
-- **Provides 7B model performance in a 3B model** via performance enhancements from [ALiBi](https://arxiv.org/abs/2108.12409), [SwiGLU](https://arxiv.org/abs/2002.05202), [maximal update parameterization (muP)](https://arxiv.org/abs/2203.03466) and the extensively…
+- **Provides 7B model performance in a 3B model** via performance enhancements from [ALiBi](https://arxiv.org/abs/2108.12409), [SwiGLU](https://arxiv.org/abs/2002.05202), [maximal update parameterization (muP)](https://arxiv.org/abs/2203.03466) and the extensively deduplicated and cleaned [SlimPajama-627B dataset](https://huggingface.co/datasets/cerebras/SlimPajama-627B).
 - **[Fits in devices with as little as 3GB of memory](#memory-requirements) when quantized to 4-bit**.
 - **One of the few 3B models that supports 8k sequence length** thanks to ALiBi.
 - **Requires 71% fewer training FLOPs, has 58% smaller memory footprint** for inference than comparable 7B models.
```
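For readers who want to see what the 4-bit memory claim in this hunk looks like in practice, here is a minimal loading sketch. It assumes the Hugging Face `transformers` + `bitsandbytes` integration and a CUDA GPU; the quantization settings are illustrative defaults, and actual memory use depends on your environment (the README's memory-requirements section has the reported figures).

```python
# Minimal sketch: load BTLM-3B-8k-base with 4-bit weights via the
# transformers + bitsandbytes integration. Assumes a CUDA GPU with both
# libraries installed; exact memory use depends on your environment.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "cerebras/btlm-3b-8k-base"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights as 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # do matmuls in bf16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True,  # BTLM ships a custom model class
)

prompt = "BTLM-3B-8k is a language model that"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

NF4 storage with bf16 compute is a common default; other 4-bit schemes will land at slightly different footprints.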
```diff
@@ -115,6 +115,12 @@ Table 2: Performance at 7B model size. Detailed down-stream tasks comparisons. M…
 ![figure_4_image](./figure_4_performance_vs_7b_models.jpg)
 Figure 4: Performance at 7B model size

+## Long Sequence Lengths
+To enable long sequence applications, we use ALiBi position embeddings and trained on 470B tokens at a context length of 2,048, followed by 157B tokens at a context length of 8,192. To assess BTLM’s long sequence capability, we evaluate it on the SlimPajama test set with a 32,768 context length and plot loss at each token position. Although ALiBi allows extrapolation in theory, training at a 2,048 context length alone does not extrapolate well in practice. Thankfully, variable sequence length training allows for substantially improved extrapolation. BTLM-3B extrapolates well up to 10k context length, but performance degrades slightly beyond this.
+
+![figure_5_image](./figure_5_xentropy_with_sequence_lengths.png)
+Figure 5: BTLM-3B model's cross-entropy evaluation on the SlimPajama test set. Inference performed at the extrapolated sequence length of 32,768 tokens.
+
 ## Model Details
 - Developed by: [Cerebras Systems](https://www.cerebras.net/) and [Opentensor](https://opentensor.ai/) with generous support from [G42 Cloud](https://www.g42cloud.com/) and [IIAI](https://www.inceptioniai.org/en/)
 - License: Apache 2.0
```
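The relocated Long Sequence Lengths section describes measuring cross-entropy at every token position over a 32,768-token context. The sketch below shows that kind of measurement in rough form; it is not the exact pipeline behind Figure 5, and `slimpajama_sample.txt` is a hypothetical stand-in for a SlimPajama test document.

```python
# Rough sketch: per-position cross-entropy over a long context, in the spirit
# of Figure 5. A full 32,768-token forward pass needs a large-memory GPU;
# chunk the sequence if it does not fit.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cerebras/btlm-3b-8k-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

text = open("slimpajama_sample.txt").read()  # hypothetical local evaluation text
ids = tokenizer(text, return_tensors="pt").input_ids[:, :32768].to(model.device)

with torch.no_grad():
    logits = model(ids).logits  # shape: [1, seq_len, vocab_size]

# Cross-entropy of each predicted token, i.e. loss as a function of position.
# Plotting this against position shows how well the model extrapolates past
# the 8,192-token training context.
per_position_loss = F.cross_entropy(
    logits[0, :-1].float(), ids[0, 1:], reduction="none"
)
print(per_position_loss.shape, per_position_loss.mean().item())
```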
```diff
@@ -127,7 +133,7 @@ Figure 4: Performance at 7B model size
 - Optimizer: AdamW
 - Positional Encoding: ALiBi
 - Language: English
-- Learn more: [BTLM-3B-8k blog
+- Learn more: [BTLM-3B-8k blog](https://www.cerebras.net/blog/btlm-3b-8k-7b-performance-in-a-3-billion-parameter-model/)
 - Paper: Coming soon

 ## To continue training with PyTorch and Maximal Update Parameterization
```
```diff
@@ -162,12 +168,6 @@ The primary intended use is to further research into large language models. BTLM…

 You may fine-tune and adapt the BTLM-3B-8k-base model via either Cerebras [Model Studio](https://www.cerebras.net/product-cloud/) or third-party libraries. Further safety-related testing and mitigations should be applied before using BTLM-3B-8k-base in production downstream applications.

-## Long Sequence Lengths
-To enable long sequence applications, we use ALiBi position embeddings and trained on 470B tokens at the context length of 2,048 followed by 157B of tokens trained at 8,192 context length. To assess BTLM’s long sequence capability, we evaluate the on SlimPajama test set with 32,768 context length and plot loss at each token position. Although ALiBi allows extrapolation in theory, 2,048 context length training alone does not extrapolate well in practice. Thankfully variable sequence length training allows substantially improves extrapolation. BTLM-3B extrapolates well up to 10k context length but the performance degrades slightly beyond this.
-
-![figure_5_image](./figure_5_xentropy_with_sequence_lengths.png)
-Figure 5: BTLM-3B model's cross-entropy evaluation on the SlimPajama’s test set. Inference performed on the extrapolated sequence length of 32,768 tokens.
-
 ### Out of Scope Use
 BTLM-3B-8k-base was trained on SlimPajama, with primarily English language, and is not recommended for machine translation tasks. BTLM-3B-8k-base has not been tuned for instruction-following or chat-based use cases.

```
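The hunk above keeps the pointer to fine-tuning via Cerebras Model Studio or third-party libraries. As one illustration of the third-party route, here is a minimal LoRA sketch assuming the Hugging Face `transformers`, `peft`, and `datasets` libraries; the corpus file, hyperparameters, and LoRA target-module name are placeholders rather than a recommended recipe.

```python
# Minimal sketch of one third-party fine-tuning route: LoRA adapters with
# transformers + peft. Placeholder dataset and hyperparameters; the LoRA
# target module assumes BTLM's GPT-2-style fused attention projection
# ("c_attn") and should be checked against the model's code.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "cerebras/btlm-3b-8k-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, trust_remote_code=True
)

# Train small LoRA adapters instead of updating all 3B parameters.
model = get_peft_model(
    model,
    LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"], task_type="CAUSAL_LM"),
)

# Placeholder corpus: any plain-text file, one example per line.
dataset = load_dataset("text", data_files="my_corpus.txt")["train"]
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="btlm3b-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=1e-4,
        num_train_epochs=1,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```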