update readme
README.md
@@ -13,16 +13,15 @@ AutoConfig.register("moduleformer", ModuleFormerConfig)
 AutoModelForCausalLM.register(ModuleFormerConfig, ModuleFormerForCausalLM)
 AutoModelForSequenceClassification.register(ModuleFormerConfig, ModuleFormerForSequenceClassification)
 
-tokenizer = AutoTokenizer.from_pretrained('ibm/MoLM-
-model = AutoModelForCausalLM.from_pretrained('ibm/MoLM-
+tokenizer = AutoTokenizer.from_pretrained('ibm/MoLM-700M-8B')
+model = AutoModelForCausalLM.from_pretrained('ibm/MoLM-700M-8B')
 ```
 
 **Model Details**
-MoLM-350M-4B is a MoE-based language
-MoLM-700M-4B has 4 billion parameters and computationally
-MoLM-700M-8B has 8 billion parameters and computationally
-
-
+MoLM-350M-4B is a MoE-based language model. It has 4 billion parameters, but each input token only activates 350M parameters. Thus, it's computationally equivalent to a 350M dense model.
+MoLM-700M-4B has 4 billion parameters and is computationally equivalent to a 700M dense model.
+MoLM-700M-8B has 8 billion parameters and is computationally equivalent to a 700M dense model.
+All models are trained on 300 billion tokens from publicly available sources, with a learning rate of 3.0 x 10<sup>-4</sup> and a global batch-size of 3M tokens.
 **Model Developers** IBM
 
 **Variations** MoLM comes in two different parameter sizes — 4B and 8B. The 4B model has two variants with different computation cost — 350M and 700M.
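For context, the registration lines shown as unchanged context in the hunk and the updated `from_pretrained` calls fit together roughly as below. This is a minimal sketch, not part of the diff itself: the `from moduleformer import ...` line and the prompt are assumptions (the import path and prompt text are not shown in this hunk), while the `AutoConfig.register` / `AutoModelForCausalLM.register` calls and the `ibm/MoLM-700M-8B` checkpoint name come from the README.

```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

# Assumed import path for the custom classes registered below.
from moduleformer import ModuleFormerConfig, ModuleFormerForCausalLM

# Register ModuleFormer with the transformers Auto* factories so that
# from_pretrained() can resolve the custom "moduleformer" model type.
AutoConfig.register("moduleformer", ModuleFormerConfig)
AutoModelForCausalLM.register(ModuleFormerConfig, ModuleFormerForCausalLM)

# Load the checkpoint referenced in the updated README lines.
tokenizer = AutoTokenizer.from_pretrained("ibm/MoLM-700M-8B")
model = AutoModelForCausalLM.from_pretrained("ibm/MoLM-700M-8B")

# Standard generation call; the prompt is an arbitrary placeholder.
inputs = tokenizer("Mixture-of-Experts language models are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```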