---
library_name: transformers
datasets:
- cerebras/SlimPajama-627B
language:
- en
---

# LCKV

This is a pretrained model for research purposes, described in the paper "[Layer-Condensed KV Cache for Efficient Inference of Large Language Models](https://arxiv.org/abs/2405.10637)".

## About

Layer-Condensed KV Cache (LCKV) is a variant of the transformer decoder in which the queries of all layers are paired with the keys and values of just the top layer. It reduces memory and computation cost, reduces the number of parameters, and significantly improves inference throughput, with comparable or better task performance. See our GitHub repo for more details: https://github.com/whyNLP/LCKV

## Quick Start

```python
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="whynlp/tinyllama-lckv-w2-100b", trust_remote_code=True)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("whynlp/tinyllama-lckv-w2-100b", trust_remote_code=True)
```

Sample text generation script:

```python
# This is consistent with the `run_generation.py` script in the GitHub repo: https://github.com/whyNLP/LCKV
import torch
from accelerate.utils import set_seed

from transformers import pipeline


set_seed(42)

pipe = pipeline(
    "text-generation",
    model="whynlp/tinyllama-lckv-w2-100b",
    torch_dtype=torch.bfloat16,
    device="cuda",
    trust_remote_code=True,
    model_kwargs={"attn_implementation": "flash_attention_2"},
)

response = pipe(
    "the meaning of life is",
    add_special_tokens=False,
    max_new_tokens=50,
    temperature=1.0,
    top_k=0,
    top_p=0.9,
    repetition_penalty=1.0,
    do_sample=True,
)

print(response[0]["generated_text"])
# the meaning of life is the magazine, however this time it will take it seems an absolute fantastic. Keeping the key to my appearance. Recently we did cool our liking anyone also up hours, health type process.
# With kids to of this is and
```


## The LCKV Collection

This model has 2 warmup layers, i.e., its KV cache is 3/22 the size of that of a standard TinyLlama.
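
To make the 3/22 figure concrete, here is a rough back-of-the-envelope sketch of the KV cache size. It assumes that the cached layers are the 2 warmup layers plus the top layer (3 in total), and that the standard TinyLlama has 22 layers with 4 key/value heads of dimension 64 stored in bfloat16. These configuration values and the helper name `kv_cache_bytes` are assumptions for illustration, not taken from this model card.

```python
def kv_cache_bytes(
    num_kv_layers: int,
    seq_len: int,
    batch_size: int = 1,
    num_kv_heads: int = 4,     # assumed TinyLlama value (grouped-query attention)
    head_dim: int = 64,        # assumed TinyLlama value
    bytes_per_value: int = 2,  # bfloat16
) -> int:
    """Estimate KV cache size: one key and one value tensor per cached layer."""
    return 2 * num_kv_layers * batch_size * seq_len * num_kv_heads * head_dim * bytes_per_value


seq_len = 2048
standard = kv_cache_bytes(num_kv_layers=22, seq_len=seq_len)  # every layer caches KV
lckv_w2 = kv_cache_bytes(num_kv_layers=3, seq_len=seq_len)    # 2 warmup layers + top layer (assumed)

print(f"standard TinyLlama: {standard / 2**20:.1f} MiB")
print(f"LCKV (w2):          {lckv_w2 / 2**20:.1f} MiB")
print(f"ratio:              {lckv_w2 / standard:.4f}")
```

Under these assumptions the ratio is independent of sequence length and batch size, since both caches scale linearly with them.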

This model was randomly initialized, then pre-trained on 100B tokens from [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B).

The evaluation follows that of TinyLlama. Refer to [our paper](https://arxiv.org/abs/2405.10637) for more details.

| Model                                                                                         | Paper Section                  | Dev ppl. | Common-sense Reasoning |
| --------------------------------------------------------------------------------------------- | ------------------------------ | -------- | ---------------------- |
| [whynlp/tinyllama-lckv-w10-ft-250b](https://huggingface.co/whynlp/tinyllama-lckv-w10-ft-250b) | --                             | 7.939    | 50.86                  |
| [whynlp/tinyllama-lckv-w2-ft-100b](https://huggingface.co/whynlp/tinyllama-lckv-w2-ft-100b)   | Appendix C.1, Table 7 (line 5) | 8.514    | 49.55                  |
| [whynlp/tinyllama-lckv-w10-100b](https://huggingface.co/whynlp/tinyllama-lckv-w10-100b)       | Section 3.2, Table 2 (line 3)  | 9.265    | 46.84                  |
| **whynlp/tinyllama-lckv-w2-100b**                                                             | Section 3.2, Table 2 (line 2)  | 9.746    | 45.45                  |