---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- persimmon
---

# perSLIMmon-8b-base

> persimmon-8b went to the vocab lipo clinic 


A slimmed-down version of [persimmon-8b-base](https://huggingface.co/adept/persimmon-8b-base) that removes the ~70,000 unused entries from the model vocabulary and tokenizer (see the safetensors layer overview). It should be _slightly_ faster.

Credit: [fine-tune-fuyu](https://github.com/phillip-kravtsov/fine-tune-fuyu) (`scripts/surgery.py` was adapted for persimmon)
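
For intuition, the vocab "surgery" boils down to keeping only the token IDs the tokenizer actually uses and slicing the embedding and LM-head rows to match. A rough sketch of the idea (not the actual `surgery.py`; `used_token_ids` is a hypothetical placeholder for the kept IDs, and the real script also rewrites the tokenizer):

```python
import torch
from transformers import AutoModelForCausalLM

# illustrative only: prune unused vocab rows from the embeddings + LM head
model = AutoModelForCausalLM.from_pretrained("adept/persimmon-8b-base")
keep_ids = torch.tensor(sorted(used_token_ids))  # hypothetical: IDs the tokenizer actually emits

emb = model.get_input_embeddings()
new_emb = torch.nn.Embedding(len(keep_ids), emb.embedding_dim)
new_emb.weight.data = emb.weight.data[keep_ids].clone()
model.set_input_embeddings(new_emb)

head = model.get_output_embeddings()
new_head = torch.nn.Linear(head.in_features, len(keep_ids), bias=False)
new_head.weight.data = head.weight.data[keep_ids].clone()
model.set_output_embeddings(new_head)

model.config.vocab_size = len(keep_ids)
```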


## inference

install required pkgs:

```sh
pip install -U transformers accelerate bitsandbytes sentencepiece
```

load in 4bit & run inference:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("pszemraj/perSLIMmon-8b-base")
model = AutoModelForCausalLM.from_pretrained(
    "pszemraj/perSLIMmon-8b-base",
    load_in_4bit=True, # GPU required
    torch_dtype="auto",
    device_map="auto",
)
inputs = tokenizer("The weather is always wonderful", return_tensors="pt").to(
    model.device
)
tokens = model.generate(
    **inputs,
    max_new_tokens=64,
    temperature=0.75,
    top_p=0.95,
    epsilon_cutoff=1e-5,
    repetition_penalty=1.05,
    renormalize_logits=True,
    do_sample=True,
) # adapt inference params as needed

print(tokenizer.decode(tokens[0], skip_special_tokens=True))
```
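
note: on recent `transformers` versions, passing `load_in_4bit=True` directly is deprecated in favor of a `BitsAndBytesConfig`; an equivalent load would look roughly like:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "pszemraj/perSLIMmon-8b-base",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,  # assumes bf16-capable GPU; use torch.float16 otherwise
    ),
    device_map="auto",
)
```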

inference is decently fast on a Colab T4:

```
CPU times: user 6.01 s, sys: 138 ms, total: 6.15 s
Wall time: 6.23 s
```
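
the numbers above come from a notebook `%%time` cell; to get a comparable wall-clock figure outside a notebook, something like this works:

```python
import time

start = time.perf_counter()
tokens = model.generate(**inputs, max_new_tokens=64, do_sample=True)
print(f"generation took {time.perf_counter() - start:.2f}s")
```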