sippycoder committed on
Commit 883ff9f • 1 Parent(s): 2e48d83

initial commit

Files changed (1)
  1. README.md +8 -8
README.md CHANGED
@@ -4,14 +4,14 @@ language:
   - en
   ---
 
- # 🚀 Nucleus-22B-token-350B
+ # 🚀 Nucleus-22B-token-500B
 
- **Nucleus-22B-token-350B is a 22B-parameter causal decoder-only model built by Nucleus.AI and trained on 350B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) along with curated corpora. It is made available under the Apache 2.0 license.**
+ **Nucleus-22B-token-500B is a 22B-parameter causal decoder-only model built by Nucleus.AI and trained on 500B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) along with curated corpora. It is made available under the Apache 2.0 license.**
 
   *1T-token model coming soon* 😊.
 
 
- ## What about Nucleus-22B-token-350B?
+ ## What about Nucleus-22B-token-500B?
 
   * **It performs well compared to similar-size open-source models** (e.g., [MPT-7B](https://huggingface.co/mosaicml/mpt-7b), [StableLM](https://github.com/Stability-AI/StableLM), [RedPajama](https://huggingface.co/togethercomputer/RedPajama-INCITE-Base-7B-v0.1), etc.), thanks to being trained on 1,500B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb) enhanced with curated corpora. See the [OpenLLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).
   * **It is made available under an MIT license**.
@@ -19,7 +19,7 @@ language:
 
   ⚠️ **This is a raw, pretrained model, which should be further finetuned for most use cases.**
 
- # Model Card for Nucleus-22B-token-350B
+ # Model Card for Nucleus-22B-token-500B
 
   ## Model Details
 
@@ -46,11 +46,11 @@ Production use without adequate assessment of risks and mitigation; any use case
 
   ## Bias, Risks, and Limitations
 
- Nucleus-22B-token-350B is trained on English data only, and will not generalize appropriately to other languages. Furthermore, as it is trained on large-scale corpora representative of the web, it will carry the stereotypes and biases commonly encountered online.
+ Nucleus-22B-token-500B is trained on English data only, and will not generalize appropriately to other languages. Furthermore, as it is trained on large-scale corpora representative of the web, it will carry the stereotypes and biases commonly encountered online.
 
   ### Recommendations
 
- We recommend that users of Nucleus-22B-token-350B finetune it for their specific tasks of interest, and that guardrails and appropriate precautions be taken for any production use.
+ We recommend that users of Nucleus-22B-token-500B finetune it for their specific tasks of interest, and that guardrails and appropriate precautions be taken for any production use.
 
   ## How to Get Started with the Model
 
@@ -59,7 +59,7 @@ We recommend users of Nucleus-22B-token-350B to consider finetuning it for the s
 
   ### Training Data
 
- Nucleus-22B-token-350B was trained on 350B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), along with other corpora.
+ Nucleus-22B-token-500B was trained on 500B tokens of [RefinedWeb](https://huggingface.co/datasets/tiiuae/falcon-refinedweb), along with other corpora.
 
   | **Data source** | **Fraction** | **Tokens** | **Sources** |
   |--------------------|--------------|------------|-----------------------------------|
@@ -74,7 +74,7 @@ The data was tokenized with the tokenizer similar to Llama-[7B](https://huggingf
 
   ### Training Procedure
 
- Nucleus-22B-token-350B was trained on 256 A100 80GB GPUs, using FSDP (Fully Sharded Data Parallel).
+ Nucleus-22B-token-500B was trained on 256 A100 80GB GPUs, using FSDP (Fully Sharded Data Parallel).
 
   #### Training Hyperparameters
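The body of the "How to Get Started with the Model" section falls outside the hunks shown above, so the card's own snippet is not visible in this diff. As a minimal sketch only: loading a Hugging Face causal decoder-only model of this size typically looks like the following. The repository id, dtype, and generation settings here are assumptions for illustration, not values taken from the card.

```python
# Minimal sketch, not the model card's own snippet.
# Assumption: the repo id below is hypothetical; use the actual Hub id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "NucleusAI/Nucleus-22B-token-500B"  # hypothetical repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # roughly 44 GB of weights in bf16 for a 22B model
    device_map="auto",            # shard across available GPUs
)

inputs = tokenizer("Nucleus is a 22B-parameter model that", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```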
 
 
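The training-procedure line states only that the model was trained on 256 A100 80GB GPUs using FSDP; the hyperparameters themselves fall outside the hunks shown. For readers unfamiliar with FSDP, a rough sketch of wrapping a causal LM with PyTorch's FullyShardedDataParallel follows. The wrapping policy and precision choices are assumptions for illustration and are not taken from the Nucleus training setup.

```python
# Illustrative sketch of FSDP wrapping; not the authors' training code.
# Launch with torchrun (one process per GPU); policy and dtypes are assumptions.
import functools
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

def shard_model(model: torch.nn.Module) -> FSDP:
    # One rank per GPU, e.g. 256 ranks for 256 A100s.
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)
    return FSDP(
        model,
        # Shard any submodule above ~100M parameters (assumed threshold).
        auto_wrap_policy=functools.partial(size_based_auto_wrap_policy,
                                           min_num_params=100_000_000),
        mixed_precision=MixedPrecision(param_dtype=torch.bfloat16,
                                       reduce_dtype=torch.bfloat16,
                                       buffer_dtype=torch.bfloat16),
        device_id=local_rank,
    )
```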