Update README.md (#17)
- Update README.md (7d02209c9a1860ada0889eefde5a29d1776ee3f4)
- Update README.md (5f50772b94e6463052b999eda50bea4ade595e03)
Co-authored-by: Vaibhav Srivastav <[email protected]>
README.md
CHANGED
@@ -48,6 +48,8 @@ Below we share some code snippets on how to get quickly started with running the
 
 #### Running the model on a single / multi GPU
 
+> [!IMPORTANT]
+> Given the model instabilities with SDPA/ FA2, by default, the model inference would utilise `eager` attention.
 
 ```python
 # pip install accelerate
@@ -71,51 +73,10 @@ print(tokenizer.decode(outputs[0]))
 <a name="precisions"></a>
 #### Running the model on a GPU using different precisions
 
-The native weights of this model were exported in `bfloat16` precision.
+The native weights of this model were exported in `bfloat16` precision.
 
 You can also use `float32` if you skip the dtype, but no precision increase will occur (model weights will just be upcasted to `float32`). See examples below.
 
-* _Using `torch.float16`_
-
-```python
-# pip install accelerate
-from transformers import AutoTokenizer, AutoModelForCausalLM
-import torch
-
-tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
-model = AutoModelForCausalLM.from_pretrained(
-    "google/gemma-2-27b-it",
-    device_map="auto",
-    torch_dtype=torch.float16,
-    revision="float16",
-)
-
-input_text = "Write me a poem about Machine Learning."
-input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
-
-outputs = model.generate(**input_ids)
-print(tokenizer.decode(outputs[0]))
-```
-
-* _Using `torch.bfloat16`_
-
-```python
-# pip install accelerate
-from transformers import AutoTokenizer, AutoModelForCausalLM
-
-tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-27b-it")
-model = AutoModelForCausalLM.from_pretrained(
-    "google/gemma-2-27b-it",
-    device_map="auto",
-    torch_dtype=torch.bfloat16)
-
-input_text = "Write me a poem about Machine Learning."
-input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")
-
-outputs = model.generate(**input_ids)
-print(tokenizer.decode(outputs[0]))
-```
-
 * _Upcasting to `torch.float32`_
 
 ```python
@@ -182,6 +143,9 @@ print(tokenizer.decode(outputs[0]))
 
 * _Flash Attention 2_
 
+> [!WARNING]
+> Gemma 2 is currently incompatible with Flash Attention/ SDPA, using it might result in unreliable generations. Use at your own risk.
+
 First make sure to install `flash-attn` in your environment `pip install flash-attn`
 
 ```diff
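
Note on the `eager` attention default: the `[!IMPORTANT]` and `[!WARNING]` notes added in this change say that inference falls back to `eager` attention because SDPA and Flash Attention 2 are unreliable with Gemma 2. The sketch below is not part of the commit. It shows one way to pin that choice explicitly, assuming a `transformers` version whose `from_pretrained` accepts the `attn_implementation` argument. The helper name `load_gemma2_eager` and the `LOAD_KWARGS` dict are hypothetical, introduced here only for illustration.

```python
# Hypothetical helper, not part of the README: request eager attention
# explicitly instead of relying on the library default.

# Keyword arguments shared by every load. "eager" selects the plain
# PyTorch attention path rather than SDPA or Flash Attention 2.
LOAD_KWARGS = {
    "device_map": "auto",            # requires `pip install accelerate`
    "attn_implementation": "eager",
}


def load_gemma2_eager(model_id: str = "google/gemma-2-27b-it"):
    """Load the tokenizer and model in bfloat16 with eager attention."""
    # Imported lazily so that defining this helper needs neither library.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # the precision the weights were exported in
        **LOAD_KWARGS,
    )
    return tokenizer, model
```

Calling `load_gemma2_eager()` downloads the full checkpoint, so it is only worth running on hardware that can hold the 27B model. Whether passing the argument changes anything depends on the installed `transformers` version, which may already default to `eager` for this model.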