> persimmon-8b went to the vocab lipo clinic

A slimmed-down version of [persimmon-8b-base](https://huggingface.co/adept/persimmon-8b-base) which removes the ~70,000 unused entries in the model vocabulary and tokenizer (see the safetensors layer overview). Should be _slightly_ faster.

Credit: [fine-tune-fuyu](https://github.com/phillip-kravtsov/fine-tune-fuyu) (`scripts/surgery.py` was adapted for persimmon)
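If you want to verify the slimming yourself, one quick check (a sketch, assuming both repos download cleanly) is to compare the `vocab_size` reported by the two model configs:

```python
from transformers import AutoConfig

# compare vocab sizes of the original and slimmed checkpoints
base = AutoConfig.from_pretrained("adept/persimmon-8b-base")
slim = AutoConfig.from_pretrained("pszemraj/perSLIMmon-8b-base")

print("original vocab_size:", base.vocab_size)
print("slimmed vocab_size: ", slim.vocab_size)
print("entries removed:    ", base.vocab_size - slim.vocab_size)  # should be ~70k
```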
## inference

install required pkgs:

```sh
pip install -U transformers accelerate bitsandbytes sentencepiece
```

load in 4bit & run inference:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pszemraj/perSLIMmon-8b-base")
model = AutoModelForCausalLM.from_pretrained(
    "pszemraj/perSLIMmon-8b-base",
    load_in_4bit=True,  # GPU required
    torch_dtype="auto",
    device_map="auto",
)
inputs = tokenizer("The weather is always wonderful", return_tensors="pt").to(
    model.device
)
tokens = model.generate(
    **inputs,
    max_new_tokens=64,
    temperature=0.75,
    top_p=0.95,
    epsilon_cutoff=1e-5,
    repetition_penalty=1.05,
    renormalize_logits=True,
    do_sample=True,
)  # adapt inference params as needed

print(tokenizer.decode(tokens[0], skip_special_tokens=True))
```
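As a quick sanity check that the 4-bit weights fit on a small GPU, you can print the loaded model's memory footprint (a sketch using `get_memory_footprint()`, which `transformers` models expose; the exact number will vary):

```python
# rough size of the quantized weights in GB (run after loading the model above)
print(f"model memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")
```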
inference is decently fast on a colab T4:

```
CPU times: user 6.01 s, sys: 138 ms, total: 6.15 s
Wall time: 6.23 s
```
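Those numbers look like notebook `%%time` output around the `generate` call; outside a notebook, a rough equivalent (a sketch reusing the `model`/`inputs` objects from above) is:

```python
import time

# time a single generate() call; wall time should be in the same ballpark on a T4
start = time.perf_counter()
_ = model.generate(**inputs, max_new_tokens=64, do_sample=True)
print(f"wall time: {time.perf_counter() - start:.2f} s")
```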