---
library_name: transformers
datasets:
- cerebras/SlimPajama-627B
language:
- en
---

# LCKV

This is a research-purpose pretrained model described in the paper "[Layer-Condensed KV Cache for Efficient Inference of Large Language Models](https://arxiv.org/abs/2405.10637)".

## About

Layer-Condensed KV Cache (LCKV) is a variant of the transformer decoder in which the queries of all layers are paired with the keys and values of just the top layer. It reduces memory and computation costs, reduces the number of parameters, and significantly improves inference throughput, with comparable or better task performance. See more details in our GitHub repo: https://github.com/whyNLP/LCKV

## Quick Start

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="whynlp/tinyllama-lckv-w10-100b", trust_remote_code=True)

# Load model directly
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("whynlp/tinyllama-lckv-w10-100b", trust_remote_code=True)
```

Sample text generation script:

```python
# This is consistent with the `run_generation.py` script in the github repo: https://github.com/whyNLP/LCKV
import torch
from accelerate.utils import set_seed
from transformers import pipeline

set_seed(42)

pipe = pipeline(
    "text-generation",
    model="whynlp/tinyllama-lckv-w10-100b",
    torch_dtype=torch.bfloat16,
    device="cuda",
    trust_remote_code=True,
    model_kwargs={"attn_implementation": "flash_attention_2"},
)

response = pipe(
    "the meaning of life is",
    add_special_tokens=False,
    max_new_tokens=50,
    temperature=1.0,
    top_k=0,
    top_p=0.9,
    repetition_penalty=1.0,
    do_sample=True,
)

print(response[0]["generated_text"])
# the meaning of life is the honest, however this time it will take it is an absolute, to let the time does give them to do their sentence which will be how sense is what anyone use up hours, health as well.
# Your rate kids must of this is and
```

## The LCKV Collection

The model has 10 warmup layers, i.e., its KV cache is 1/2 the size of a standard TinyLlama's. This model was randomly initialized, then pre-trained on 100B tokens from [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B).

The evaluation follows that of TinyLlama. Refer to [our paper](https://arxiv.org/abs/2405.10637) for more details.

| Model | Paper Section | Dev ppl. | Common-sense Reasoning |
| --------------------------------------------------------------------------------------------- | ------------------------------ | -------- | ---------------------- |
| [whynlp/tinyllama-lckv-w10-ft-250b](https://huggingface.co/whynlp/tinyllama-lckv-w10-ft-250b) | -- | 7.939 | 50.86 |
| [whynlp/tinyllama-lckv-w2-ft-100b](https://huggingface.co/whynlp/tinyllama-lckv-w2-ft-100b) | Appendix C.1, Table 7 (line 5) | 8.514 | 49.55 |
| **whynlp/tinyllama-lckv-w10-100b** | Section 3.2, Table 2 (line 3) | 9.265 | 46.84 |
| [whynlp/tinyllama-lckv-w2-100b](https://huggingface.co/whynlp/tinyllama-lckv-w2-100b) | Section 3.2, Table 2 (line 2) | 9.746 | 45.45 |
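
The "10 warmup layers, 1/2 KV cache" figure in this section can be sanity-checked with a little arithmetic. The sketch below is only an illustration and rests on assumptions that this card does not state: that a standard TinyLlama has 22 decoder layers, 4 key/value heads, and a head dimension of 64, and that an LCKV model caches keys and values for its warmup layers plus one shared top layer. The function names are hypothetical helpers, not part of the LCKV codebase.

```python
from fractions import Fraction

# Assumption: a standard TinyLlama decoder has 22 layers (not stated in this card).
TINYLLAMA_LAYERS = 22


def lckv_cached_layers(warmup_layers: int) -> int:
    """Number of layers whose keys/values are cached.

    Assumed counting rule: the warmup layers plus the single shared top layer.
    This is inferred from the "1/2 KV cache" figure, not taken from the code.
    """
    return warmup_layers + 1


def kv_cache_fraction(warmup_layers: int, total_layers: int = TINYLLAMA_LAYERS) -> Fraction:
    """KV cache size relative to a standard model with `total_layers` layers."""
    return Fraction(lckv_cached_layers(warmup_layers), total_layers)


def kv_cache_bytes(
    cached_layers: int,
    seq_len: int,
    num_kv_heads: int = 4,     # assumption: TinyLlama grouped-query attention
    head_dim: int = 64,        # assumption: TinyLlama head dimension
    bytes_per_value: int = 2,  # bfloat16, as in the sample script
) -> int:
    """Rough per-sequence KV cache size: one key and one value tensor per cached layer."""
    return 2 * cached_layers * seq_len * num_kv_heads * head_dim * bytes_per_value


if __name__ == "__main__":
    print(kv_cache_fraction(10))  # 1/2
    standard = kv_cache_bytes(TINYLLAMA_LAYERS, seq_len=2048)
    condensed = kv_cache_bytes(lckv_cached_layers(10), seq_len=2048)
    print(f"standard: {standard} bytes, LCKV (w=10): {condensed} bytes")
```

Under these assumptions, the w10 models in the table cache 11 of 22 layers, which matches the 1/2 figure. The same helper can be called with `warmup_layers=2` to estimate the cache size of the w2 models.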