small fixes
README.md
CHANGED
@@ -56,7 +56,7 @@ Training a multilingual 176 billion parameters model in the open
 
 [BigScience](https://bigscience.huggingface.co) is an open and collaborative workshop around the study and creation of very large language models, gathering more than 1000 researchers around the world. You can find more information on the main website at https://bigscience.huggingface.co.
 
-The training of BigScience’s main model started on **March 11, 2022 11:42am PST** and will
+The training of BigScience’s main model started on **March 11, 2022 11:42am PST** and will continue for 3-4 months on 384 A100 80GB GPUs of the Jean Zay public supercomputer
 
 You can follow the training at [https://twitter.com/BigScienceLLM](https://twitter.com/BigScienceLLM)
 
@@ -75,16 +75,16 @@ You can follow the training at [https://twitter.com/BigScienceLLM](https://twitter.com/BigScienceLLM)
 
 - Multilingual: 46 languages: Full list is here: [https://bigscience.huggingface.co/blog/building-a-tb-scale-multilingual-dataset-for-language-modeling](https://bigscience.huggingface.co/blog/building-a-tb-scale-multilingual-dataset-for-language-modeling)
 - 341.6 billion tokens (1.5 TB of text data)
-- Tokenizer vocabulary: 250
+- Tokenizer vocabulary: 250,680 tokens
 - More information:
 - Blog post detailing the design choices during the dataset creation: [https://bigscience.huggingface.co/blog/building-a-tb-scale-multilingual-dataset-for-language-modeling](https://bigscience.huggingface.co/blog/building-a-tb-scale-multilingual-dataset-for-language-modeling)
 
 ### **The engineering side**
 
-- number of GPU used for the training: 384 A100 GPU with 80
+- number of GPUs used for the training: 384 A100 GPUs with 80 GB of memory each
 - one copy of the model takes 48 GPUs (using 60 GB of memory on each GPU)
-- checkpoint size:
-- training throughput:
+- checkpoint size: the bf16 weights are 329 GB, the full checkpoint with optimizer states is 2.3 TB
+- training throughput: ~150 TFLOPs
 - estimated training time: 3-4 months depending on throughput and unexpected events
 - **More information**:
 - Blog post on the hardware/engineering side: [https://bigscience.huggingface.co/blog/which-hardware-to-train-a-176b-parameters-model](https://bigscience.huggingface.co/blog/which-hardware-to-train-a-176b-parameters-model)
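The dataset section of the diff above quotes a 250,680-token vocabulary. As a quick way to check that figure, here is a minimal sketch that loads the tokenizer with the `transformers` library; the Hub repo id `bigscience/tokenizer` is an assumption (the diff does not name one), so substitute whatever identifier the tokenizer is actually published under.

```python
# Minimal sketch: inspect the multilingual tokenizer's vocabulary size.
# Assumption: the tokenizer is published on the Hugging Face Hub under a
# repo id like "bigscience/tokenizer" -- the diff above does not name one.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/tokenizer")

# The README quotes a 250,680-token vocabulary.
print("vocabulary size:", tokenizer.vocab_size)

# Quick multilingual smoke test on a few of the 46 languages.
for text in [
    "The training started on March 11, 2022.",
    "L'entraînement a commencé le 11 mars 2022.",
    "El entrenamiento comenzó el 11 de marzo de 2022.",
]:
    ids = tokenizer(text)["input_ids"]
    print(len(ids), "tokens for:", text)
```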
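The checkpoint sizes quoted in the engineering list (329 GB of bf16 weights, 2.3 TB with optimizer states) can be sanity-checked with a bit of arithmetic. The sketch below assumes a standard mixed-precision Adam layout (bf16 weights plus fp32 master weights and two fp32 Adam moments); those per-parameter byte counts are an assumption, not something stated in the diff.

```python
# Back-of-the-envelope check of the checkpoint sizes quoted in the README.
# Assumption: mixed-precision Adam, i.e. bf16 weights (2 bytes/param) plus
# fp32 master weights, fp32 momentum and fp32 variance (4 bytes/param each).
N_PARAMS = 176e9  # 176 billion parameters

GiB = 2**30
TiB = 2**40

bf16_weights = N_PARAMS * 2                   # bf16 copy of the weights
full_checkpoint = N_PARAMS * (2 + 4 + 4 + 4)  # + fp32 weights + Adam moments

print(f"bf16 weights:    {bf16_weights / GiB:.0f} GiB  (README: 329 GB)")
print(f"full checkpoint: {full_checkpoint / TiB:.2f} TiB (README: 2.3 TB)")
# -> roughly 328 GiB and 2.24 TiB, in line with the quoted figures.
```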
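Likewise, the 3-4 month estimate is consistent with the quoted throughput under the common ~6 * parameters * tokens FLOPs approximation for decoder-only training, reading the ~150 TFLOPs figure as per-GPU throughput and assuming roughly one pass over the 341.6 billion-token dataset; both readings are assumptions, and the result ignores restarts and downtime.

```python
# Rough sanity check of the 3-4 month training-time estimate.
# Uses the common ~6 * parameters * tokens FLOPs approximation for
# decoder-only LM training; restarts and downtime are ignored.
N_PARAMS = 176e9        # 176 billion parameters
N_TOKENS = 341.6e9      # 341.6 billion tokens (from the README)
GPUS = 384
FLOPS_PER_GPU = 150e12  # reading "~150 TFLOPs" as per-GPU throughput

total_flops = 6 * N_PARAMS * N_TOKENS   # ~3.6e23 FLOPs
cluster_flops = GPUS * FLOPS_PER_GPU    # ~5.8e16 FLOP/s

seconds = total_flops / cluster_flops
print(f"pure compute time: {seconds / 86400:.0f} days "
      f"(~{seconds / (30 * 86400):.1f} months)")
# -> ~72 days of pure compute, which lands in the 3-4 month wall-clock
#    range once checkpointing, restarts and other slowdowns are added.
```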