Update README.md
README.md CHANGED
@@ -22,6 +22,9 @@ ipt-350m is:
 - **Capable of fast training and inference** (via [FlashAttention](https://arxiv.org/pdf/2205.14135.pdf) and [FasterTransformer](https://github.com/NVIDIA/FasterTransformer))
 - **Equipped with highly efficient open-source training code** via the [llm-foundry repository](https://github.com/mosaicml/llm-foundry)
 
+If you find this project useful, consider supporting its development:
+[![Buy me a coffee](https://badgen.net/badge/icon/Buy%20Me%20A%20Coffee?icon=buymeacoffee&label)](https://bmc.link/edoardofederici)
+
 ## How to Use
 
 ```python
@@ -92,7 +95,4 @@ The model has been modified from a standard transformer in the following ways:
 The model was trained for ~13B tokens (with batch size 64 and sequence length 2048) on [OSCAR-2301](https://huggingface.co/datasets/oscar-corpus/OSCAR-2301).
 Each example was constructed from as many sequences from that dataset as were necessary to fill the 2048 sequence length.
 
-Vocabulary size is 50432, a multiple of 128 as suggested in [MEGATRON-LM](https://arxiv.org/abs/1909.08053); this increased model flop utilization (MFU) by up to four percentage points.
-
-If you like this project, consider supporting me with a cup of coffee! 🤖✨🌞
-[![Buy me a coffee](https://badgen.net/badge/icon/Buy%20Me%20A%20Coffee?icon=buymeacoffee&label)](https://bmc.link/edoardofederici)
+Vocabulary size is 50432, a multiple of 128 as suggested in [MEGATRON-LM](https://arxiv.org/abs/1909.08053); this increased model flop utilization (MFU) by up to four percentage points.
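The `## How to Use` section in the first hunk is truncated at its opening ```python fence, so the README's actual usage snippet is not visible in this diff. As a rough sketch only, this is how an MPT-style checkpoint trained with llm-foundry is commonly loaded through `transformers`; the repo id, dtype, generation settings, and `trust_remote_code` flag below are assumptions for illustration, not taken from the README:

```python
# Minimal sketch: loading a causal LM checkpoint from the Hugging Face Hub.
# Assumptions (not from the README): the repo id "efederici/ipt-350m",
# fp16 weights, and trust_remote_code for custom MPT-style modeling code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "efederici/ipt-350m"  # hypothetical repo id, for illustration only
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # assumption: half precision for faster inference
    trust_remote_code=True,     # assumption: custom model class shipped with the repo
)

inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```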
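The data section says each training example packs as many OSCAR-2301 sequences as needed to fill the 2048-token context. A simplified sketch of that packing idea follows; it is not the actual llm-foundry dataloader, and the EOS separator id and the dropping of the trailing partial buffer are assumptions:

```python
# Sequence packing sketch: tokenized documents are concatenated (separated by
# an EOS token) and sliced into fixed-length examples of exactly `seq_len` tokens.
from typing import Iterable, List

def pack_sequences(token_streams: Iterable[List[int]], seq_len: int = 2048,
                   eos_id: int = 0) -> List[List[int]]:
    buffer: List[int] = []
    examples: List[List[int]] = []
    for tokens in token_streams:
        buffer.extend(tokens + [eos_id])
        while len(buffer) >= seq_len:
            examples.append(buffer[:seq_len])
            buffer = buffer[seq_len:]
    return examples  # any trailing partial buffer is dropped in this sketch

docs = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]
print(pack_sequences(docs, seq_len=4, eos_id=0))
# [[11, 12, 13, 0], [21, 22, 0, 31], [32, 33, 34, 0]]
```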
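The vocabulary note follows the Megatron-LM observation that padding the vocabulary to a multiple of 128 keeps the embedding and output-projection GEMMs well-aligned on the GPU, which is where the reported MFU gain comes from. A tiny illustration of the padding arithmetic (the unpadded tokenizer size below is made up for the example):

```python
# Round a tokenizer's true vocabulary size up to the next multiple of `multiple`,
# as suggested in Megatron-LM. The README only states the padded size (50432).
def pad_vocab(true_vocab_size: int, multiple: int = 128) -> int:
    return ((true_vocab_size + multiple - 1) // multiple) * multiple

print(pad_vocab(50368))    # 50432: any size from 50305 to 50432 pads to 50432
print(50432 % 128 == 0)    # True: 50432 = 128 * 394
```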