sharpenb committed on
Commit
8c8ec70
1 Parent(s): dd99ebf

Upload folder using huggingface_hub (#1)

- 3c16884b6f18ac571e0c0e8dc3f828d0a52ea074873ac887bc78246c92ccafd4 (d183119bd2a56a27fe68abae520fcfe8af0d0b0a)
- 271adbb8c0b1a520264ff8edf0bf681604342ab34e9832baf60c54e360ae3085 (525cc2ac5726c5fefbdbd1fea63dec6624d2c3b8)

README.md ADDED
@@ -0,0 +1,83 @@
1
+ ---
2
+ library_name: pruna-engine
3
+ thumbnail: "https://assets-global.website-files.com/646b351987a8d8ce158d1940/64ec9e96b4334c0e1ac41504_Logo%20with%20white%20text.svg"
4
+ metrics:
5
+ - memory_disk
6
+ - memory_inference
7
+ - inference_latency
8
+ - inference_throughput
9
+ - inference_CO2_emissions
10
+ - inference_energy_consumption
11
+ ---
12
+ <!-- header start -->
13
+ <!-- 200823 -->
14
+ <div style="width: auto; margin-left: auto; margin-right: auto">
15
+ <a href="https://www.pruna.ai/" target="_blank" rel="noopener noreferrer">
16
+ <img src="https://i.imgur.com/eDAlcgk.png" alt="PrunaAI" style="width: 100%; min-width: 400px; display: block; margin: auto;">
17
+ </a>
18
+ </div>
19
+ <!-- header end -->
20
+
21
+ [![Twitter](https://img.shields.io/twitter/follow/PrunaAI?style=social)](https://twitter.com/PrunaAI)
22
+ [![GitHub](https://img.shields.io/github/followers/PrunaAI?label=Follow%20%40PrunaAI&style=social)](https://github.com/PrunaAI)
23
+ [![LinkedIn](https://img.shields.io/badge/LinkedIn-Connect-blue)](https://www.linkedin.com/company/93832878/admin/feed/posts/?feedType=following)
24
+ [![Discord](https://img.shields.io/badge/Discord-Join%20Us-blue?style=social&logo=discord)](https://discord.gg/CP4VSgck)
25
+
26
+ # Simply make AI models cheaper, smaller, faster, and greener!
27
+
28
+ - Give a thumbs up if you like this model!
29
+ - Contact us and tell us which model to compress next [here](https://www.pruna.ai/contact).
30
+ - Request access to easily compress your *own* AI models [here](https://z0halsaff74.typeform.com/pruna-access?typeform-source=www.pruna.ai).
31
+ - Read the documentation to learn more [here](https://pruna-ai-pruna.readthedocs-hosted.com/en/latest/).
32
+ - Join the Pruna AI community on Discord [here](https://discord.gg/CP4VSgck) to share feedback/suggestions or get help.
33
+
34
+ ## Results
35
+
36
+ ![image info](./plots.png)
37
+
38
+ **Frequently Asked Questions**
39
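+ - ***How does the compression work?*** The model is compressed with llm-int8, here applied as bitsandbytes 4-bit quantization (`load_in_4bit: true`, `fp4` quant type; see `quantization_config` in `config.json`).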
40
+ - ***How does the model quality change?*** The quality of the model output might vary compared to the base model.
41
+ - ***How is the model efficiency evaluated?*** These results were obtained on an NVIDIA A100-PCIE-40GB with the configuration described in `model/smash_config.json`, after a hardware warmup. The smashed model is directly compared to the original base model. Efficiency results may vary in other settings (e.g. other hardware, image size, batch size, ...). We recommend running the measurements directly under your use-case conditions to find out whether the smashed model benefits you.
42
+ - ***What is the model format?*** We use safetensors.
43
+ - ***What calibration data has been used?*** If needed by the compression method, we used WikiText as the calibration data.
44
+ - ***What is the naming convention for Pruna Hugging Face models?*** We take the original model name and append "turbo", "tiny", or "green" if the smashed model's measured inference speed, inference memory, or inference energy consumption, respectively, is less than 90% of the original base model's.
45
+ - ***How to compress my own models?*** You can request premium access to more compression methods and tech support for your specific use-cases [here](https://z0halsaff74.typeform.com/pruna-access?typeform-source=www.pruna.ai).
46
+ - ***What are "first" metrics?*** Results mentioning "first" are obtained after the first run of the model. The first run might take more memory or be slower than the subsequent runs due to CUDA overheads.
47
+ - ***What are "Sync" and "Async" metrics?*** "Sync" metrics are obtained by syncing all GPU processes and stopping the measurement when all of them have finished. "Async" metrics are obtained without syncing all GPU processes; the measurement stops as soon as the model output can be used by the CPU. We provide both metrics since either could be relevant depending on the use-case. We recommend testing the efficiency gains directly in your use-cases.
48
+
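The naming convention from the FAQ can be sketched in a few lines. This is only an illustration, not Pruna's actual tooling: the function name, the metric keys and the measurements are hypothetical, and "inference speed" is read here as latency (lower is better), which is an assumption.

```python
def smashed_suffixes(base, smashed, threshold=0.9):
    """Return the suffixes a smashed model would earn under the FAQ's 90% rule.

    `base` and `smashed` map hypothetical metric names to measurements
    where lower is better.
    """
    rules = [
        ("inference_latency", "turbo"),  # assumption: "speed" is measured as latency
        ("inference_memory", "tiny"),
        ("inference_energy", "green"),
    ]
    suffixes = []
    for metric, suffix in rules:
        # A suffix is earned only if the smashed value is below 90% of the base value.
        if smashed[metric] < threshold * base[metric]:
            suffixes.append(suffix)
    return suffixes


# Made-up measurements, for illustration only.
base = {"inference_latency": 100.0, "inference_memory": 14.0, "inference_energy": 50.0}
smashed = {"inference_latency": 95.0, "inference_memory": 5.0, "inference_energy": 48.0}
print(smashed_suffixes(base, smashed))  # ['tiny']
```

With these invented numbers only the memory footprint drops below the 90% threshold, so only "tiny" would be appended.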
49
+ ## Setup
50
+
51
+ You can run the smashed model with these steps:
52
+
53
+ 0. Check that the requirements from the original repo NousResearch/Yarn-Llama-2-7b-128k are installed. In particular, check the Python, CUDA, and transformers versions.
54
+ 1. Make sure that you have installed the quantization-related packages.
55
+ ```bash
56
+ pip install transformers accelerate "bitsandbytes>0.37.0"
57
+ ```
58
+ 2. Load & run the model.
59
+ ```python
60
+ from transformers import AutoModelForCausalLM, AutoTokenizer
61
+
62
+ model = AutoModelForCausalLM.from_pretrained("PrunaAI/NousResearch-Yarn-Llama-2-7b-128k-bnb-4bit-smashed",
63
+ trust_remote_code=True)
64
+ tokenizer = AutoTokenizer.from_pretrained("NousResearch/Yarn-Llama-2-7b-128k")
65
+
66
+ input_ids = tokenizer("What is the color of prunes?", return_tensors='pt').to(model.device)["input_ids"]
67
+
68
+ outputs = model.generate(input_ids, max_new_tokens=216)
69
+ print(tokenizer.decode(outputs[0]))
70
+ ```
71
+
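Step 0 of the setup asks for a version check. A minimal sketch using only the standard library is shown here; the helper name `installed_versions` is hypothetical. It reports Python and package versions but does not check the CUDA version. `flash-attn` is included because `modeling_llama_together_yarn.py` in this commit raises an `ImportError` when it is missing.

```python
import sys
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


print("python", sys.version.split()[0])
report = installed_versions(["transformers", "accelerate", "bitsandbytes", "flash-attn"])
for name, version in report.items():
    print(name, version or "not installed")
```

For comparison, `config.json` in this commit records `transformers_version` 4.37.1.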
72
+ ## Configurations
73
+
74
+ The configuration info is in `smash_config.json`.
75
+
76
+ ## Credits & License
77
+
78
+ The license of the smashed model follows the license of the original model. Please check the license of the original model NousResearch/Yarn-Llama-2-7b-128k, which provided the base model, before using this model. The license of the `pruna-engine` is [here](https://pypi.org/project/pruna-engine/) on PyPI.
79
+
80
+ ## Want to compress other models?
81
+
82
+ - Contact us and tell us which model to compress next [here](https://www.pruna.ai/contact).
83
+ - Request access to easily compress your own AI models [here](https://z0halsaff74.typeform.com/pruna-access?typeform-source=www.pruna.ai).
config.json ADDED
@@ -0,0 +1,50 @@
1
+ {
2
+ "_name_or_path": "/tmp/tmplpkfk47m",
3
+ "architectures": [
4
+ "LlamaForCausalLM"
5
+ ],
6
+ "auto_map": {
7
+ "AutoConfig": "configuration_llama.LlamaConfig",
8
+ "AutoModelForCausalLM": "modeling_llama_together_yarn.LlamaForCausalLM"
9
+ },
10
+ "bos_token_id": 1,
11
+ "eos_token_id": 2,
12
+ "hidden_act": "silu",
13
+ "hidden_size": 4096,
14
+ "initializer_range": 0.02,
15
+ "intermediate_size": 11008,
16
+ "max_position_embeddings": 131072,
17
+ "model_type": "llama",
18
+ "num_attention_heads": 32,
19
+ "num_hidden_layers": 32,
20
+ "num_key_value_heads": 32,
21
+ "pad_token_id": 0,
22
+ "pretraining_tp": 1,
23
+ "quantization_config": {
24
+ "bnb_4bit_compute_dtype": "bfloat16",
25
+ "bnb_4bit_quant_type": "fp4",
26
+ "bnb_4bit_use_double_quant": true,
27
+ "llm_int8_enable_fp32_cpu_offload": false,
28
+ "llm_int8_has_fp16_weight": false,
29
+ "llm_int8_skip_modules": [
30
+ "lm_head"
31
+ ],
32
+ "llm_int8_threshold": 6.0,
33
+ "load_in_4bit": true,
34
+ "load_in_8bit": false,
35
+ "quant_method": "bitsandbytes"
36
+ },
37
+ "rms_norm_eps": 1e-05,
38
+ "rope_scaling": {
39
+ "factor": 32.0,
40
+ "finetuned": true,
41
+ "original_max_position_embeddings": 4096,
42
+ "type": "yarn"
43
+ },
44
+ "tie_word_embeddings": false,
45
+ "torch_dtype": "float16",
46
+ "transformers_version": "4.37.1",
47
+ "use_cache": true,
48
+ "use_flash_attention": false,
49
+ "vocab_size": 32000
50
+ }
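Two numbers in `config.json` can be cross-checked with a little arithmetic: the YaRN scaling factor times the original context length should give `max_position_embeddings`, and the per-head dimension follows from `hidden_size` and `num_attention_heads`. The sketch below copies only the relevant fields from the file above; the variable names are illustrative.

```python
import json

# Fields copied from the config.json added in this commit.
config_fragment = json.loads("""
{
  "hidden_size": 4096,
  "num_attention_heads": 32,
  "max_position_embeddings": 131072,
  "rope_scaling": {
    "factor": 32.0,
    "finetuned": true,
    "original_max_position_embeddings": 4096,
    "type": "yarn"
  }
}
""")

rope = config_fragment["rope_scaling"]
extended_context = int(rope["factor"] * rope["original_max_position_embeddings"])
head_dim = config_fragment["hidden_size"] // config_fragment["num_attention_heads"]

print(extended_context)  # 131072
print(head_dim)          # 128
```

The extended context matches the `max_position_embeddings` value, i.e. the "128k" in the model name.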
configuration_llama.py ADDED
@@ -0,0 +1,186 @@
1
+ # coding=utf-8
2
+ # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
5
+ # and OPT implementations in this library. It has been modified from its
6
+ # original forms to accommodate minor architectural differences compared
7
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
8
+ #
9
+ # Licensed under the Apache License, Version 2.0 (the "License");
10
+ # you may not use this file except in compliance with the License.
11
+ # You may obtain a copy of the License at
12
+ #
13
+ # http://www.apache.org/licenses/LICENSE-2.0
14
+ #
15
+ # Unless required by applicable law or agreed to in writing, software
16
+ # distributed under the License is distributed on an "AS IS" BASIS,
17
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
18
+ # See the License for the specific language governing permissions and
19
+ # limitations under the License.
20
+ """ LLaMA model configuration"""
21
+
22
+ from transformers.configuration_utils import PretrainedConfig
23
+ from transformers.utils import logging
24
+
25
+
26
+ logger = logging.get_logger(__name__)
27
+
28
+ LLAMA_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
29
+
30
+
31
+ class LlamaConfig(PretrainedConfig):
32
+ r"""
33
+ This is the configuration class to store the configuration of a [`LlamaModel`]. It is used to instantiate an LLaMA
34
+ model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
35
+ defaults will yield a similar configuration to that of the LLaMA-7B.
36
+
37
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
38
+ documentation from [`PretrainedConfig`] for more information.
39
+
40
+
41
+ Args:
42
+ vocab_size (`int`, *optional*, defaults to 32000):
43
+ Vocabulary size of the LLaMA model. Defines the number of different tokens that can be represented by the
44
+ `input_ids` passed when calling [`LlamaModel`].
45
+ hidden_size (`int`, *optional*, defaults to 4096):
46
+ Dimension of the hidden representations.
47
+ intermediate_size (`int`, *optional*, defaults to 11008):
48
+ Dimension of the MLP representations.
49
+ num_hidden_layers (`int`, *optional*, defaults to 32):
50
+ Number of hidden layers in the Transformer decoder.
51
+ num_attention_heads (`int`, *optional*, defaults to 32):
52
+ Number of attention heads for each attention layer in the Transformer decoder.
53
+ num_key_value_heads (`int`, *optional*):
54
+ This is the number of key_value heads that should be used to implement Grouped Query Attention. If
55
+ `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
56
+ `num_key_value_heads=1`, the model will use Multi Query Attention (MQA); otherwise GQA is used. When
57
+ converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
58
+ by mean-pooling all the original heads within that group. For more details, check out [this
59
+ paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
60
+ `num_attention_heads`.
61
+ pretraining_tp (`int`, *optional*, defaults to `1`):
62
+ Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
63
+ document](https://huggingface.co/docs/transformers/parallelism) to understand more about it. This value is
64
+ necessary to ensure exact reproducibility of the pretraining results. Please refer to [this
65
+ issue](https://github.com/pytorch/pytorch/issues/76232).
66
+ hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
67
+ The non-linear activation function (function or string) in the decoder.
68
+ max_position_embeddings (`int`, *optional*, defaults to 2048):
69
+ The maximum sequence length that this model might ever be used with. Typically set this to something large
70
+ just in case (e.g., 512 or 1024 or 2048).
71
+ initializer_range (`float`, *optional*, defaults to 0.02):
72
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
73
+ rms_norm_eps (`float`, *optional*, defaults to 1e-6):
74
+ The epsilon used by the rms normalization layers.
75
+ use_cache (`bool`, *optional*, defaults to `True`):
76
+ Whether or not the model should return the last key/values attentions (not used by all models). Only
77
+ relevant if `config.is_decoder=True`.
78
+ tie_word_embeddings(`bool`, *optional*, defaults to `False`):
79
+ Whether to tie weight embeddings
80
+ rope_scaling (`Dict`, *optional*):
81
+ Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports five scaling
82
+ strategies: linear, dynamic, ntk-by-parts, yarn, and dynamic-yarn. Their scaling factor must be a float greater than 1. The expected format
83
+ is `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
84
+ `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
85
+ these scaling strategies behave:
86
+ https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
87
+ experimental feature, subject to breaking API changes in future versions.
88
+
89
+ Example:
90
+
91
+ ```python
92
+ >>> from transformers import LlamaModel, LlamaConfig
93
+
94
+ >>> # Initializing a LLaMA llama-7b style configuration
95
+ >>> configuration = LlamaConfig()
96
+
97
+ >>> # Initializing a model from the llama-7b style configuration
98
+ >>> model = LlamaModel(configuration)
99
+
100
+ >>> # Accessing the model configuration
101
+ >>> configuration = model.config
102
+ ```"""
103
+ model_type = "llama"
104
+ keys_to_ignore_at_inference = ["past_key_values"]
105
+
106
+ def __init__(
107
+ self,
108
+ vocab_size=32000,
109
+ hidden_size=4096,
110
+ intermediate_size=11008,
111
+ num_hidden_layers=32,
112
+ num_attention_heads=32,
113
+ num_key_value_heads=None,
114
+ hidden_act="silu",
115
+ max_position_embeddings=2048,
116
+ initializer_range=0.02,
117
+ rms_norm_eps=1e-6,
118
+ use_cache=True,
119
+ pad_token_id=0,
120
+ bos_token_id=1,
121
+ eos_token_id=2,
122
+ pretraining_tp=1,
123
+ tie_word_embeddings=False,
124
+ rope_scaling=None,
125
+ use_flash_attention=False,
126
+ **kwargs,
127
+ ):
128
+ self.vocab_size = vocab_size
129
+ self.max_position_embeddings = max_position_embeddings
130
+ self.hidden_size = hidden_size
131
+ self.intermediate_size = intermediate_size
132
+ self.num_hidden_layers = num_hidden_layers
133
+ self.num_attention_heads = num_attention_heads
134
+
135
+ # for backward compatibility
136
+ if num_key_value_heads is None:
137
+ num_key_value_heads = num_attention_heads
138
+
139
+ self.num_key_value_heads = num_key_value_heads
140
+ self.hidden_act = hidden_act
141
+ self.initializer_range = initializer_range
142
+ self.rms_norm_eps = rms_norm_eps
143
+ self.pretraining_tp = pretraining_tp
144
+ self.use_cache = use_cache
145
+ self.rope_scaling = rope_scaling
146
+ self._rope_scaling_validation()
147
+ self.use_flash_attention = use_flash_attention
148
+ if self.use_flash_attention:
149
+ try:
150
+ from flash_attn.flash_attn_interface import flash_attn_varlen_func
151
+ from einops import rearrange
152
+ except ImportError:
153
+ raise ValueError("`use_flash_attention` requires Flash Attention 2+ and einops.\nTry `pip install einops` and installing Flash Attention from https://github.com/Dao-AILab/flash-attention")
154
+
155
+ super().__init__(
156
+ pad_token_id=pad_token_id,
157
+ bos_token_id=bos_token_id,
158
+ eos_token_id=eos_token_id,
159
+ tie_word_embeddings=tie_word_embeddings,
160
+ **kwargs,
161
+ )
162
+
163
+ def _rope_scaling_validation(self):
164
+ """
165
+ Validate the `rope_scaling` configuration.
166
+ """
167
+ if self.rope_scaling is None:
168
+ return
169
+
170
+ if not isinstance(self.rope_scaling, dict):
171
+ raise ValueError(
172
+ "`rope_scaling` must be a dictionary, "
173
+ f"got {self.rope_scaling}"
174
+ )
175
+ rope_scaling_type = self.rope_scaling.get("type", None)
176
+ rope_scaling_factor = self.rope_scaling.get("factor", None)
177
+ if rope_scaling_type is None or rope_scaling_type not in ["linear", "dynamic", "ntk-by-parts", "yarn", "dynamic-yarn"]:
178
+ raise ValueError(
179
+ f"`rope_scaling`'s type field must be one of ['linear', 'dynamic', 'ntk-by-parts', 'yarn', 'dynamic-yarn'], got {rope_scaling_type}"
180
+ )
181
+ if rope_scaling_factor is None or not isinstance(rope_scaling_factor, float) or rope_scaling_factor <= 1.0:
182
+ raise ValueError(f"`rope_scaling`'s factor field must be a float > 1, got {rope_scaling_factor}")
183
+ if rope_scaling_type == "ntk-by-parts" or rope_scaling_type == "yarn" or rope_scaling_type == "dynamic-yarn":
184
+ original_max_position_embeddings = self.rope_scaling.get("original_max_position_embeddings", None)
185
+ if original_max_position_embeddings is None or not isinstance(original_max_position_embeddings, int):
186
+ raise ValueError("`rope_scaling.original_max_position_embeddings` must be set to an int when using ntk-by-parts, yarn, or dynamic-yarn")
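The rules enforced by `_rope_scaling_validation` can be exercised outside the config class. The standalone function below is a sketch that mirrors the checks shown above; the names `validate_rope_scaling` and `is_valid` are hypothetical and not part of the file. One consequence worth noticing is that the factor must be a Python `float`, so an integer `32` is rejected while `32.0` passes.

```python
ALLOWED_TYPES = ["linear", "dynamic", "ntk-by-parts", "yarn", "dynamic-yarn"]
NEEDS_ORIGINAL_LENGTH = {"ntk-by-parts", "yarn", "dynamic-yarn"}


def validate_rope_scaling(rope_scaling):
    """Raise ValueError if `rope_scaling` breaks the rules of the config class."""
    if rope_scaling is None:
        return
    if not isinstance(rope_scaling, dict):
        raise ValueError(f"`rope_scaling` must be a dictionary, got {rope_scaling}")
    scaling_type = rope_scaling.get("type")
    factor = rope_scaling.get("factor")
    if scaling_type not in ALLOWED_TYPES:
        raise ValueError(f"type must be one of {ALLOWED_TYPES}, got {scaling_type}")
    # isinstance(..., float) rejects ints, so 32 fails where 32.0 passes.
    if not isinstance(factor, float) or factor <= 1.0:
        raise ValueError(f"factor must be a float > 1, got {factor}")
    if scaling_type in NEEDS_ORIGINAL_LENGTH:
        original = rope_scaling.get("original_max_position_embeddings")
        if not isinstance(original, int):
            raise ValueError("original_max_position_embeddings must be an int")


def is_valid(rope_scaling):
    """Convenience wrapper that turns the exception into a boolean."""
    try:
        validate_rope_scaling(rope_scaling)
        return True
    except ValueError:
        return False


shipped = {"type": "yarn", "factor": 32.0, "original_max_position_embeddings": 4096}
print(is_valid(shipped))                       # True
print(is_valid({**shipped, "factor": 32}))     # False: int, not float
print(is_valid({"type": "yarn", "factor": 32.0}))  # False: original length missing
```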
generation_config.json ADDED
@@ -0,0 +1,10 @@
1
+ {
2
+ "bos_token_id": 1,
3
+ "do_sample": true,
4
+ "eos_token_id": 2,
5
+ "max_length": 131072,
6
+ "pad_token_id": 0,
7
+ "temperature": 0.6,
8
+ "top_p": 0.9,
9
+ "transformers_version": "4.37.1"
10
+ }
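The generation defaults above enable sampling with `temperature` 0.6 and `top_p` 0.9. As a rough illustration of what these two settings do to a next-token distribution, here is a simplified pure-Python sketch. It is not the transformers implementation, whose handling of ties and thresholds may differ, and the function name and the toy logits are hypothetical.

```python
import math
import random


def sampling_distribution(logits, temperature=0.6, top_p=0.9):
    """Apply temperature scaling, then keep the smallest top-probability set
    whose cumulative mass reaches `top_p`, and renormalise it."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}


dist = sampling_distribution([2.0, 1.0, 0.0])  # toy logits for a 3-token vocabulary
print(sorted(dist))  # [0, 1]
token = random.choices(list(dist), weights=list(dist.values()))[0]
```

For the toy logits, the least likely token falls outside the 0.9 nucleus and is never sampled; a temperature below 1 sharpens the distribution before the cut is made.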
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7e33accbda4814e0eb09395194557c2b0de461d74e9adf61660cfbb9ec9a1407
3
+ size 3866042099
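The three lines above are a Git LFS pointer, not the weights themselves: each line is a key followed by a space and a value. A small sketch of reading such a pointer and converting the size to GiB follows; the helper name `parse_lfs_pointer` is hypothetical.

```python
def parse_lfs_pointer(text):
    """Parse 'key value' lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    fields["size"] = int(fields["size"])  # size is given in bytes
    return fields


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:7e33accbda4814e0eb09395194557c2b0de461d74e9adf61660cfbb9ec9a1407
size 3866042099
"""

info = parse_lfs_pointer(pointer)
print(round(info["size"] / 2**30, 2))  # 3.6
```

So the quantized checkpoint occupies roughly 3.6 GiB on disk.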
modeling_llama_together_yarn.py ADDED
@@ -0,0 +1,1183 @@
1
+ # coding=utf-8
2
+ # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
5
+ # and OPT implementations in this library. It has been modified from its
6
+ # original forms to accommodate minor architectural differences compared
7
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
8
+ #
9
+ # Licensed under the Apache License, Version 2.0 (the "License");
10
+ # you may not use this file except in compliance with the License.
11
+ # You may obtain a copy of the License at
12
+ #
13
+ # http://www.apache.org/licenses/LICENSE-2.0
14
+ #
15
+ # Unless required by applicable law or agreed to in writing, software
16
+ # distributed under the License is distributed on an "AS IS" BASIS,
17
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
18
+ # See the License for the specific language governing permissions and
19
+ # limitations under the License.
20
+ """ PyTorch LLaMA model."""
21
+ import math
22
+ from typing import List, Optional, Tuple, Union
23
+
24
+ import torch
25
+ import torch.nn.functional as F
26
+ import torch.utils.checkpoint
27
+ from torch import nn
28
+ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
29
+
30
+ from transformers.activations import ACT2FN
31
+ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast, SequenceClassifierOutputWithPast
32
+ from transformers.modeling_utils import PreTrainedModel
33
+ from transformers.utils import add_start_docstrings, add_start_docstrings_to_model_forward, logging, replace_return_docstrings
34
+ from .configuration_llama import LlamaConfig
35
+
36
+
37
+ try:
38
+ from flash_attn.flash_attn_interface import (
39
+ flash_attn_func,
40
+ flash_attn_kvpacked_func,
41
+ flash_attn_qkvpacked_func,
42
+ flash_attn_varlen_kvpacked_func,
43
+ )
44
+ from flash_attn.bert_padding import unpad_input, pad_input
45
+ flash_attn_v2_installed = True
46
+ print('>>>> Flash Attention installed')
47
+ except ImportError:
48
+ flash_attn_v2_installed = False
49
+ raise ImportError('Please install Flash Attention: `pip install flash-attn --no-build-isolation`')
50
+
51
+ try:
52
+ from flash_attn.layers.rotary import apply_rotary_emb_func
53
+ flash_rope_installed = True
54
+ print('>>>> Flash RoPE installed')
55
+ except ImportError:
56
+ flash_rope_installed = False
57
+ raise ImportError('Please install RoPE kernels: `pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary`')
58
+
59
+
60
+ logger = logging.get_logger(__name__)
61
+
62
+ _CONFIG_FOR_DOC = "LlamaConfig"
63
+
64
+
65
+ #@torch.jit.script
66
+ def rmsnorm_func(hidden_states, weight, variance_epsilon):
67
+ input_dtype = hidden_states.dtype
68
+ hidden_states = hidden_states.to(torch.float32)
69
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
70
+ hidden_states = hidden_states * torch.rsqrt(variance + variance_epsilon)
71
+ return (weight * hidden_states).to(input_dtype)
72
+
73
+
74
+ class LlamaRMSNorm(nn.Module):
75
+ def __init__(self, hidden_size, eps=1e-6):
76
+ """
77
+ LlamaRMSNorm is equivalent to T5LayerNorm
78
+ """
79
+ super().__init__()
80
+ self.weight = nn.Parameter(torch.ones(hidden_size))
81
+ self.register_buffer(
82
+ "variance_epsilon",
83
+ torch.tensor(eps),
84
+ persistent=False,
85
+ )
86
+
87
+ def forward(self, hidden_states):
88
+ return rmsnorm_func(hidden_states, self.weight, self.variance_epsilon)
89
+
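What `rmsnorm_func` computes can be followed on a tiny vector without PyTorch: each element is divided by the root mean square of the vector (with a small epsilon added to the mean square) and multiplied by a learned weight. The sketch below is a plain-Python restatement for illustration; the function name `rms_norm` is hypothetical, and `eps=1e-5` is taken from `rms_norm_eps` in `config.json`.

```python
import math


def rms_norm(x, weight, eps=1e-5):
    """Scale `x` by 1/sqrt(mean(x^2) + eps), then by the per-element weight."""
    variance = sum(v * v for v in x) / len(x)  # mean of squares, no mean subtraction
    scale = 1.0 / math.sqrt(variance + eps)
    return [w * v * scale for w, v in zip(weight, x)]


out = rms_norm([3.0, 4.0], [1.0, 1.0])
print([round(v, 4) for v in out])
```

Unlike LayerNorm, no mean is subtracted; with unit weights the output has a mean square very close to 1.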
90
+
91
+ # Inverse dim formula to find dim based on number of rotations
92
+ def _yarn_find_correction_dim(num_rotations, dim, base=10000, max_position_embeddings=2048):
93
+ return (dim * math.log(max_position_embeddings/(num_rotations * 2 * math.pi)))/(2 * math.log(base))
94
+
95
+ # Find dim range bounds based on rotations
96
+ def _yarn_find_correction_range(low_rot, high_rot, dim, base=10000, max_position_embeddings=2048):
97
+ low = math.floor(_yarn_find_correction_dim(
98
+ low_rot, dim, base, max_position_embeddings))
99
+ high = math.ceil(_yarn_find_correction_dim(
100
+ high_rot, dim, base, max_position_embeddings))
101
+ return max(low, 0), min(high, dim-1) # Clamp values just in case
102
+
103
+ def _yarn_linear_ramp_mask(min, max, dim):
104
+ if min == max:
105
+ max += 0.001 # Prevent singularity
106
+
107
+ linear_func = (torch.arange(dim, dtype=torch.float32) - min) / (max - min)
108
+ ramp_func = torch.clamp(linear_func, 0, 1)
109
+ return ramp_func
110
+
111
+ def _yarn_get_mscale(scale=1):
112
+ if scale <= 1:
113
+ return 1.0
114
+ return 0.1 * math.log(scale) + 1.0
115
+
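To get a feel for the YaRN helpers above, they can be evaluated for values that appear elsewhere in this commit: head dimension 128 (`hidden_size` 4096 divided by 32 heads), original context length 4096 and scaling factor 32.0 from `config.json`. The functions below restate the scalar helpers without the leading underscore so the snippet is self-contained. Using `beta_fast=32` and `beta_slow=1` is an assumption: they are the constructor defaults of `FlashYaRNRotaryEmbedding`, and the part of the file shown here does not reveal which values the attention layers actually pass.

```python
import math


def find_correction_dim(num_rotations, dim, base=10000.0, max_position_embeddings=2048):
    # Dimension index at which a rotary pair completes `num_rotations` turns
    # over the original context length.
    return (dim * math.log(max_position_embeddings / (num_rotations * 2 * math.pi))) / (2 * math.log(base))


def find_correction_range(low_rot, high_rot, dim, base=10000.0, max_position_embeddings=2048):
    low = math.floor(find_correction_dim(low_rot, dim, base, max_position_embeddings))
    high = math.ceil(find_correction_dim(high_rot, dim, base, max_position_embeddings))
    return max(low, 0), min(high, dim - 1)  # clamp to valid indices


def get_mscale(scale=1.0):
    # Attention magnitude correction; no correction when there is no scaling.
    if scale <= 1:
        return 1.0
    return 0.1 * math.log(scale) + 1.0


low, high = find_correction_range(32, 1, dim=128, max_position_embeddings=4096)
print(low, high)                   # 20 46
print(round(get_mscale(32.0), 4))  # 1.3466
```

Under these assumptions the ramp between extrapolated and interpolated frequencies spans rotary pair indices 20 to 46, and the cos/sin tables are multiplied by about 1.35.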
116
+ class FlashYaRNRotaryEmbedding(torch.nn.Module):
117
+ """
118
+ The rotary position embeddings from RoFormer_ (Su et al.).
119
+ A crucial insight from the method is that the query and keys are
120
+ transformed by rotation matrices which depend on the relative positions.
121
+
122
+ Other implementations are available in the Rotary Transformer repo_ and in
123
+ GPT-NeoX_, GPT-NeoX was an inspiration
124
+
125
+ .. _RoFormer: https://arxiv.org/abs/2104.09864
126
+ .. _repo: https://github.com/ZhuiyiTechnology/roformer
127
+ .. _GPT-NeoX: https://github.com/EleutherAI/gpt-neox
128
+
129
+ This implements the YaRN extension method.
130
+ """
131
+
132
+ def __init__(self, dim: int, base=10000.0, interleaved=False,
133
+ scaling_factor=1.0, pos_idx_in_fp32=True,
134
+ max_position_embeddings=2048,
135
+ original_max_position_embeddings=2048, extrapolation_factor=1,
136
+ attn_factor=1, beta_fast=32, beta_slow=1,
137
+ dynamic=False, finetuned=False, device=None):
138
+ """
139
+ interleaved: if True, rotate pairs of even and odd dimensions (GPT-J style) instead
140
+ of 1st half and 2nd half (GPT-NeoX style).
141
+ pos_idx_in_fp32: if True, the position indices [0.0, ..., seqlen - 1] are in fp32,
142
+ otherwise they might be in lower precision.
143
+ This option was added because previously (before 2023-07-02), when we construct
144
+ the position indices, we use the dtype of self.inv_freq. In most cases this would
145
+ be fp32, but if the model is trained in pure bf16 (not mixed precision), then
146
+ self.inv_freq would be bf16, and the position indices are also in bf16.
147
+ Because of the limited precision of bf16 (e.g. 1995.0 is rounded to 2000.0), the
148
+ embeddings for some positions will coincide.
149
+ To maintain compatibility with models previously trained in pure bf16,
150
+ we add this option.
151
+ scaling_factor: RotaryEmbedding extended with YaRN scaling.
152
+ """
153
+ super().__init__()
154
+
155
+ self.dim = dim
156
+ self.base = float(base)
157
+ self.interleaved = interleaved
158
+ self.scaling_factor = scaling_factor
159
+ self.max_position_embeddings = max_position_embeddings
160
+ self.original_max_position_embeddings = original_max_position_embeddings if original_max_position_embeddings else max_position_embeddings
161
+ self.extrapolation_factor = extrapolation_factor
162
+ self.attn_factor = attn_factor
163
+ self.beta_fast = beta_fast
164
+ self.beta_slow = beta_slow
165
+ self.pos_idx_in_fp32 = pos_idx_in_fp32
166
+ self.mscale = float(_yarn_get_mscale(self.scaling_factor) * attn_factor) # Get n-d magnitude scaling corrected for interpolation
167
+ self.dynamic = dynamic
168
+ self.finetuned = finetuned
169
+
170
+ # Generate and save the inverse frequency buffer (non trainable)
171
+ if not dynamic:
172
+ self._compute_inv_freq(scaling_factor, device)
173
+
174
+ self._seq_len_cached = 0
175
+ self._cos_cached = None
176
+ self._sin_cached = None
177
+
178
+ def _compute_inv_freq(self, scaling_factor, device=None):
179
+ pos_freqs = self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim)
180
+ inv_freq_extrapolation = 1.0 / pos_freqs
181
+ inv_freq_interpolation = 1.0 / (scaling_factor * pos_freqs)
182
+
183
+ low, high = _yarn_find_correction_range(self.beta_fast, self.beta_slow, self.dim, self.base, self.original_max_position_embeddings)
184
+ inv_freq_mask = (1 - _yarn_linear_ramp_mask(low, high, self.dim // 2).float().to(device)) * self.extrapolation_factor # Get n-d rotational scaling corrected for extrapolation
185
+ inv_freq = inv_freq_interpolation * (1 - inv_freq_mask) + inv_freq_extrapolation * inv_freq_mask
186
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
187
+
188
+ def _compute_inv_freq_original(self, device=None):
189
+ inv_freq = 1 / (self.base ** (torch.arange(0, self.dim, 2, device=device,
190
+ dtype=torch.float32) / self.dim))
191
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
192
+
193
+ def _update_cos_sin_cache(self, seqlen, device=None, dtype=None):
194
+ # Reset the tables if the sequence length has changed,
195
+ # if we're on a new device (possibly due to tracing for instance),
196
+ # or if we're switching from inference mode to training
197
+ if (seqlen > self._seq_len_cached or self._cos_cached.device != device
198
+ or self._cos_cached.dtype != dtype
199
+ or (self.training and self._cos_cached.is_inference())):
200
+ self._seq_len_cached = seqlen
201
+
202
+ if self.dynamic:
203
+ scaling_factor = None
204
+ if seqlen <= self.max_position_embeddings:
205
+ if self.finetuned:
206
+ scaling_factor = self.scaling_factor
207
+ else:
208
+ scaling_factor = seqlen / self.original_max_position_embeddings
209
+ if scaling_factor:
210
+ self._compute_inv_freq(scaling_factor, device)
211
+ self.mscale = float(_yarn_get_mscale(scaling_factor) * self.attn_factor)
212
+ else:
213
+ self._compute_inv_freq_original(device)
214
+
215
+ # We want fp32 here, not self.inv_freq.dtype, since the model could be loaded in bf16
216
+ # And the output of arange can be quite large, so bf16 would lose a lot of precision.
217
+ # However, for compatibility reasons, we add an option to use the dtype of self.inv_freq.
218
+ if self.pos_idx_in_fp32:
219
+ t = torch.arange(seqlen, device=device, dtype=torch.float32)
220
+ # We want fp32 here as well since inv_freq will be multiplied with t, and the output
221
+ # will be large. Having it in bf16 will lose a lot of precision and cause the
222
+ # cos & sin output to change significantly.
223
+ # We want to recompute self.inv_freq if it was not loaded in fp32
224
+ if self.inv_freq.dtype != torch.float32:
225
+ inv_freq = self.inv_freq.to(torch.float32)
226
+ else:
227
+ inv_freq = self.inv_freq
228
+ else:
229
+ t = torch.arange(seqlen, device=device, dtype=self.inv_freq.dtype)
230
+ inv_freq = self.inv_freq
231
+ # Don't do einsum, it converts fp32 to fp16 under AMP
232
+ # freqs = torch.einsum("i,j->ij", t, self.inv_freq)
233
+ freqs = torch.outer(t, inv_freq)
234
+ self._cos_cached = (torch.cos(freqs) * self.mscale).to(dtype)
235
+ self._sin_cached = (torch.sin(freqs) * self.mscale).to(dtype)
236
+
237
+
238
+ def forward(self, q: torch.Tensor, k: torch.Tensor, seqlen_offset: int = 0) -> Tuple[torch.Tensor, torch.Tensor]:
239
+ """
240
+ q: (batch, seqlen, nheads, headdim)
241
+ k: (batch, seqlen, nheads, headdim)
242
+ seqlen_offset: can be used in generation where the qkv being passed in is only the last
243
+ token in the batch.
244
+ """
245
+ self._update_cos_sin_cache(q.shape[1] + seqlen_offset, device=q.device, dtype=q.dtype)
246
+ return apply_rotary_emb_func(
247
+ q, self._cos_cached[seqlen_offset:], self._sin_cached[seqlen_offset:],
248
+ self.interleaved, True # inplace=True
249
+ ), apply_rotary_emb_func(
250
+ k, self._cos_cached[seqlen_offset:], self._sin_cached[seqlen_offset:],
251
+ self.interleaved, True # inplace=True
252
+ )
253
+
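The frequency blending in `_compute_inv_freq` can be reproduced with plain lists to see which rotary pairs keep their original frequency and which are interpolated. This is a sketch, not a replacement for the module: the function name `yarn_inv_freq` is hypothetical, head dimension 128, factor 32.0 and original length 4096 come from `config.json`, and `beta_fast=32`, `beta_slow=1`, `extrapolation_factor=1` are the constructor defaults, which is an assumption about how the class is instantiated.

```python
import math


def yarn_inv_freq(dim=128, base=10000.0, scaling_factor=32.0,
                  original_max_position_embeddings=4096,
                  beta_fast=32, beta_slow=1, extrapolation_factor=1.0):
    """Blend extrapolated (unscaled) and interpolated (scaled) inverse frequencies."""
    def correction_dim(num_rotations):
        return (dim * math.log(original_max_position_embeddings / (num_rotations * 2 * math.pi))
                / (2 * math.log(base)))

    low = max(math.floor(correction_dim(beta_fast)), 0)
    high = min(math.ceil(correction_dim(beta_slow)), dim - 1)
    if low == high:
        high += 0.001  # prevent division by zero in the ramp

    inv_freq = []
    for i in range(dim // 2):
        pos_freq = base ** (2 * i / dim)
        extrapolation = 1.0 / pos_freq
        interpolation = 1.0 / (scaling_factor * pos_freq)
        ramp = min(max((i - low) / (high - low), 0.0), 1.0)
        mask = (1.0 - ramp) * extrapolation_factor  # 1 keeps the original frequency
        inv_freq.append(interpolation * (1.0 - mask) + extrapolation * mask)
    return inv_freq


freqs = yarn_inv_freq()
print(len(freqs))  # 64
print(freqs[0])    # 1.0
```

The highest-frequency pair (index 0) is left untouched, the lowest-frequency pairs are divided by the full scaling factor, and the pairs inside the ramp get a linear mix of the two. In the module, the resulting cos and sin tables are additionally multiplied by `mscale`.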
254
+
255
+ class FlashRotaryEmbedding(torch.nn.Module):
256
+ """
257
+ The rotary position embeddings from RoFormer_ (Su et al.).
258
+ A crucial insight from the method is that the query and keys are
259
+ transformed by rotation matrices which depend on the relative positions.
260
+ Other implementations are available in the Rotary Transformer repo_ and in
261
+ GPT-NeoX_; GPT-NeoX was an inspiration.
262
+ .. _RoFormer: https://arxiv.org/abs/2104.09864
263
+ .. _repo: https://github.com/ZhuiyiTechnology/roformer
264
+ .. _GPT-NeoX: https://github.com/EleutherAI/gpt-neox
265
+ If scale_base is not None, this implements XPos (Sun et al., https://arxiv.org/abs/2212.10554).
266
+ A recommended value for scale_base is 512: https://github.com/HazyResearch/flash-attention/issues/96
267
+ Reference: https://github.com/sunyt32/torchscale/blob/main/torchscale/component/xpos_relative_position.py
268
+ """
269
+
270
+ def __init__(self, dim: int, base=10000.0, interleaved=False, scale_base=None,
271
+ scaling_factor=1.0, pos_idx_in_fp32=True, device=None):
272
+ """
273
+ interleaved: if True, rotate pairs of even and odd dimensions (GPT-J style) instead
274
+ of 1st half and 2nd half (GPT-NeoX style).
275
+ pos_idx_in_fp32: if True, the position indices [0.0, ..., seqlen - 1] are in fp32,
276
+ otherwise they might be in lower precision.
277
+ This option was added because previously (before 2023-07-02), when we constructed
277
+ the position indices, we used the dtype of self.inv_freq. In most cases this would
279
+ be fp32, but if the model is trained in pure bf16 (not mixed precision), then
280
+ self.inv_freq would be bf16, and the position indices are also in bf16.
281
+ Because of the limited precision of bf16 (e.g. 1995.0 is rounded to 1992.0), the
282
+ embeddings for some positions will coincide.
283
+ To maintain compatibility with models previously trained in pure bf16,
284
+ we add this option.
285
+ scaling_factor: linear scaling factor; position indices are divided by it before the rotation angles are computed.
286
+ """
287
+ super().__init__()
288
+ self.dim = dim
289
+ self.base = float(base)
290
+ self.pos_idx_in_fp32 = pos_idx_in_fp32
291
+ # Generate and save the inverse frequency buffer (non trainable)
292
+ inv_freq = self._compute_inv_freq(device)
293
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
294
+ self.interleaved = interleaved
295
+ self.scale_base = scale_base
296
+ self.scaling_factor = scaling_factor
297
+ scale = ((torch.arange(0, dim, 2, device=device, dtype=torch.float32) + 0.4 * dim)
298
+ / (1.4 * dim) if scale_base is not None else None)
299
+ self.register_buffer("scale", scale)
300
+
301
+ self._seq_len_cached = 0
302
+ self._cos_cached = None
303
+ self._sin_cached = None
304
+ self._cos_k_cached = None
305
+ self._sin_k_cached = None
306
+
307
+ def _compute_inv_freq(self, device=None):
308
+ return 1 / (self.base ** (torch.arange(0, self.dim, 2, device=device,
309
+ dtype=torch.float32) / self.dim))
310
+
311
+
312
+ def _update_cos_sin_cache(self, seqlen, device=None, dtype=None):
313
+ # Reset the tables if the sequence length has changed,
314
+ # if we're on a new device (possibly due to tracing for instance),
315
+ # or if we're switching from inference mode to training
316
+ if (seqlen > self._seq_len_cached or self._cos_cached.device != device
317
+ or self._cos_cached.dtype != dtype
318
+ or (self.training and self._cos_cached.is_inference())):
319
+ self._seq_len_cached = seqlen
320
+ # We want fp32 here, not self.inv_freq.dtype, since the model could be loaded in bf16
321
+ # And the output of arange can be quite large, so bf16 would lose a lot of precision.
322
+ # However, for compatibility reasons, we add an option to use the dtype of self.inv_freq.
323
+ if self.pos_idx_in_fp32:
324
+ t = torch.arange(seqlen, device=device, dtype=torch.float32)
325
+ t /= self.scaling_factor
326
+ # We want fp32 here as well since inv_freq will be multiplied with t, and the output
327
+ # will be large. Having it in bf16 will lose a lot of precision and cause the
328
+ # cos & sin output to change significantly.
329
+ # We want to recompute self.inv_freq if it was not loaded in fp32
330
+ if self.inv_freq.dtype != torch.float32:
331
+ inv_freq = self.inv_freq.to(torch.float32)
332
+ else:
333
+ inv_freq = self.inv_freq
334
+ else:
335
+ t = torch.arange(seqlen, device=device, dtype=self.inv_freq.dtype)
336
+ t /= self.scaling_factor
337
+ inv_freq = self.inv_freq
338
+ # Don't do einsum, it converts fp32 to fp16 under AMP
339
+ # freqs = torch.einsum("i,j->ij", t, self.inv_freq)
340
+ freqs = torch.outer(t, inv_freq)
341
+ if self.scale is None:
342
+ self._cos_cached = torch.cos(freqs).to(dtype)
343
+ self._sin_cached = torch.sin(freqs).to(dtype)
344
+ else:
345
+ power = ((torch.arange(seqlen, dtype=self.scale.dtype, device=self.scale.device)
346
+ - seqlen // 2) / self.scale_base)
347
+ scale = self.scale.to(device=power.device) ** power.unsqueeze(-1)
348
+ # We want the multiplication by scale to happen in fp32
349
+ self._cos_cached = (torch.cos(freqs) * scale).to(dtype)
350
+ self._sin_cached = (torch.sin(freqs) * scale).to(dtype)
351
+ self._cos_k_cached = (torch.cos(freqs) / scale).to(dtype)
352
+ self._sin_k_cached = (torch.sin(freqs) / scale).to(dtype)
353
+
354
+ def forward(self, q: torch.Tensor, k: torch.Tensor, seqlen_offset: int = 0) -> Tuple[torch.Tensor, torch.Tensor]:
355
+ """
356
+ q: (batch, seqlen, nheads, headdim)
357
+ k: (batch, seqlen, nheads, headdim)
358
+ seqlen_offset: can be used in generation where the qkv being passed in is only the last
359
+ token in the batch.
360
+ """
361
+ self._update_cos_sin_cache(q.shape[1] + seqlen_offset, device=q.device, dtype=q.dtype)
362
+ if self.scale is None:
363
+ return apply_rotary_emb_func(
364
+ q, self._cos_cached[seqlen_offset:], self._sin_cached[seqlen_offset:],
365
+ self.interleaved, True # inplace=True
366
+ ), apply_rotary_emb_func(
367
+ k, self._cos_cached[seqlen_offset:], self._sin_cached[seqlen_offset:],
368
+ self.interleaved, True # inplace=True
369
+ )
370
+ else:
371
+ raise NotImplementedError("XPos scaling (scale_base is not None) is not supported in forward")
372
+
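To make the rotary computation above concrete, here is a minimal pure-Python sketch of the non-interleaved (GPT-NeoX style) rotation with linear scaling. The function names `rotary_inv_freq` and `apply_rotary` are illustrative assumptions and are not the fused `apply_rotary_emb_func` kernel used by this file; the sketch only mirrors the arithmetic (inverse frequencies, position divided by `scaling_factor`, pairing of element `i` with element `i + dim/2`):

```python
import math


def rotary_inv_freq(dim: int, base: float = 10000.0) -> list:
    """Inverse frequencies 1 / base**(i / dim) for i = 0, 2, 4, ..."""
    return [1.0 / (base ** (i / dim)) for i in range(0, dim, 2)]


def apply_rotary(x: list, pos: int, base: float = 10000.0,
                 scaling_factor: float = 1.0) -> list:
    """Rotate one head vector x at position pos (non-interleaved pairing)."""
    dim = len(x)
    half = dim // 2
    t = pos / scaling_factor  # linear scaling divides the position index
    out = [0.0] * dim
    for i, freq in enumerate(rotary_inv_freq(dim, base)):
        c, s = math.cos(t * freq), math.sin(t * freq)
        x1, x2 = x[i], x[i + half]
        out[i] = x1 * c - x2 * s
        out[i + half] = x1 * s + x2 * c
    return out


def dot(a: list, b: list) -> float:
    return sum(p * q for p, q in zip(a, b))


q = [0.1, -0.2, 0.3, 0.4]
k = [0.5, 0.6, -0.7, 0.8]

# Same relative offset (2) at different absolute positions.
score_a = dot(apply_rotary(q, 5), apply_rotary(k, 3))
score_b = dot(apply_rotary(q, 12), apply_rotary(k, 10))
print(score_a, score_b)
```

The two scores agree up to floating-point error, which is the relative-position property the class docstring refers to. With `scaling_factor=2.0`, position 8 receives the same rotation as position 4 does without scaling.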
373
+ class LlamaMLP(nn.Module):
374
+ def __init__(self, config):
375
+ super().__init__()
376
+ self.config = config
377
+ self.hidden_size = config.hidden_size
378
+ self.intermediate_size = config.intermediate_size
379
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
380
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
381
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
382
+ self.act_fn = ACT2FN[config.hidden_act]
383
+
384
+ def forward(self, x):
385
+ if self.config.pretraining_tp > 1:
386
+ slice = self.intermediate_size // self.config.pretraining_tp
387
+ gate_proj_slices = self.gate_proj.weight.split(slice, dim=0)
388
+ up_proj_slices = self.up_proj.weight.split(slice, dim=0)
389
+ down_proj_slices = self.down_proj.weight.split(slice, dim=1)
390
+
391
+ gate_proj = torch.cat(
392
+ [F.linear(x, gate_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1
393
+ )
394
+ up_proj = torch.cat([F.linear(x, up_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1)
395
+
396
+ intermediate_states = (self.act_fn(gate_proj) * up_proj).split(slice, dim=2)
397
+ down_proj = [
398
+ F.linear(intermediate_states[i], down_proj_slices[i]) for i in range(self.config.pretraining_tp)
399
+ ]
400
+ down_proj = sum(down_proj)
401
+ else:
402
+ down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
403
+
404
+ return down_proj
405
+
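The two branches of `LlamaMLP.forward` compute the same result; the `pretraining_tp > 1` branch only splits the weights into slices and sums the partial outputs. A minimal pure-Python sketch of that equivalence follows. It assumes the activation is SiLU (the actual function comes from `ACT2FN[config.hidden_act]`), and the helper names `matvec`, `gated_mlp` and `gated_mlp_sliced` are hypothetical:

```python
import math


def silu(z: float) -> float:
    return z / (1.0 + math.exp(-z))


def matvec(w: list, x: list) -> list:
    """Multiply a (rows x cols) weight matrix by a vector, like F.linear."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def gated_mlp(x, w_gate, w_up, w_down):
    """down(act(gate(x)) * up(x)) in one piece."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])


def gated_mlp_sliced(x, w_gate, w_up, w_down, tp):
    """Same computation with gate/up split by rows and down split by columns."""
    step = len(w_gate) // tp
    out = [0.0] * len(w_down)
    for s in range(tp):
        rows = slice(s * step, (s + 1) * step)
        gate = [silu(g) for g in matvec(w_gate[rows], x)]
        up = matvec(w_up[rows], x)
        hidden = [g * u for g, u in zip(gate, up)]
        partial = matvec([row[rows] for row in w_down], hidden)
        out = [o + p for o, p in zip(out, partial)]
    return out


x = [0.5, -1.0]
w_gate = [[0.1, 0.2], [0.3, -0.4], [0.5, 0.6], [-0.7, 0.8]]
w_up = [[0.2, 0.1], [-0.3, 0.4], [0.6, -0.5], [0.7, 0.9]]
w_down = [[0.1, -0.2, 0.3, 0.4], [0.5, 0.6, -0.7, 0.8]]

full = gated_mlp(x, w_gate, w_up, w_down)
sliced = gated_mlp_sliced(x, w_gate, w_up, w_down, tp=2)
print(full, sliced)
```

The outputs match up to summation-order rounding, which is why the sliced branch is only needed to reproduce the numerics of tensor-parallel pretraining.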
406
+ @torch.jit.script
407
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
408
+ """
409
+ This is the equivalent of torch.repeat_interleave(x, dim=3, repeats=n_rep) on the packed key/value tensor. The hidden states go from (batch,
410
+ seqlen, 2, num_key_value_heads, head_dim) to (batch, seqlen, 2, num_attention_heads, head_dim)
411
+ """
412
+ batch, slen, _, num_key_value_heads, head_dim = hidden_states.shape
413
+ if n_rep == 1:
414
+ return hidden_states
415
+ hidden_states = hidden_states[:, :, :, :, None, :].expand(batch, slen, 2, num_key_value_heads, n_rep, head_dim)
416
+ return hidden_states.reshape(batch, slen, 2, num_key_value_heads * n_rep, head_dim)
417
+
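The ordering produced by `repeat_kv` matters for grouped-query attention: each key/value head is repeated `n_rep` times consecutively, so consecutive query heads share one key/value head. A small sketch with plain lists, where `repeat_heads` is a hypothetical stand-in for the expand-and-reshape on the head axis:

```python
def repeat_heads(heads: list, n_rep: int) -> list:
    """Repeat each head n_rep times in order (repeat-interleave semantics)."""
    return [head for head in heads for _ in range(n_rep)]


query_heads = ["q0", "q1", "q2", "q3"]
kv_heads = ["kv0", "kv1"]

n_rep = len(query_heads) // len(kv_heads)
expanded = repeat_heads(kv_heads, n_rep)

# Each query head is paired with the key/value head at the same index.
pairs = list(zip(query_heads, expanded))
print(pairs)
```

Note that this differs from tiling the whole list (`kv_heads * n_rep`), which would pair `q1` with `kv1` instead of `kv0`.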
418
+
419
+ class LlamaAttention(nn.Module):
420
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
421
+
422
+ def __init__(self, config: LlamaConfig):
423
+ super().__init__()
424
+ self.config = config
425
+ self.hidden_size = config.hidden_size
426
+ self.num_heads = config.num_attention_heads
427
+ self.head_dim = self.hidden_size // self.num_heads
428
+ self.num_key_value_heads = config.num_key_value_heads
429
+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads
430
+ self.max_position_embeddings = config.max_position_embeddings
431
+
432
+ if (self.head_dim * self.num_heads) != self.hidden_size:
433
+ raise ValueError(
434
+ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
435
+ f" and `num_heads`: {self.num_heads})."
436
+ )
437
+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
438
+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
439
+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
440
+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
441
+
442
+ self.register_buffer(
443
+ "norm_factor",
444
+ torch.sqrt(torch.tensor(self.head_dim, dtype=torch.float32)).to(torch.get_default_dtype()),
445
+ persistent=False,
446
+ )
447
+
448
+ if self.config.rope_scaling is None:
449
+ scaling_type = "linear"
450
+ scaling_factor = 1.0
451
+ else:
452
+ scaling_type = self.config.rope_scaling["type"]
453
+ scaling_factor = self.config.rope_scaling["factor"]
454
+ if scaling_type == "yarn" or scaling_type == "dynamic-yarn":
455
+ original_max_position_embeddings = self.config.rope_scaling["original_max_position_embeddings"]
456
+
457
+ self.rotary_emb = FlashYaRNRotaryEmbedding(
458
+ self.head_dim, base=10000, interleaved=False, scaling_factor=scaling_factor,
459
+ max_position_embeddings=self.max_position_embeddings,
460
+ original_max_position_embeddings=original_max_position_embeddings,
461
+ dynamic=scaling_type.startswith("dynamic"), finetuned=self.config.rope_scaling.get("finetuned", False)
462
+ )
463
+ elif scaling_type == "linear":
464
+ self.rotary_emb = FlashRotaryEmbedding(
465
+ self.head_dim, base=10000, interleaved=False, scaling_factor=scaling_factor,
466
+ )
467
+ else:
468
+ raise RuntimeError(f"Unknown scaling type {scaling_type}")
469
+
470
+ def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
471
+ return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
472
+
473
+ def forward(
474
+ self,
475
+ hidden_states: torch.Tensor,
476
+ attention_mask: Optional[torch.Tensor] = None,
477
+ position_ids: Optional[torch.LongTensor] = None,
478
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
479
+ output_attentions: bool = False,
480
+ use_cache: bool = False,
481
+ is_padded_inputs: Optional[bool] = False,
482
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
483
+ bsz, q_len, h_size = hidden_states.size()
484
+
485
+ has_layer_past = past_key_value is not None
486
+
487
+ if has_layer_past:
488
+ past_kv = past_key_value[0]
489
+ past_len = past_key_value[1]
490
+ else:
491
+ past_len = 0
492
+
493
+ if self.config.pretraining_tp > 1:
494
+ key_value_slicing = (self.num_key_value_heads * self.head_dim) // self.config.pretraining_tp
495
+ query_slices = self.q_proj.weight.split(
496
+ (self.num_heads * self.head_dim) // self.config.pretraining_tp, dim=0
497
+ )
498
+ key_slices = self.k_proj.weight.split(key_value_slicing, dim=0)
499
+ value_slices = self.v_proj.weight.split(key_value_slicing, dim=0)
500
+
501
+ q = [F.linear(hidden_states, query_slices[i]) for i in range(self.config.pretraining_tp)]
502
+ q = torch.cat(q, dim=-1)
503
+
504
+ k = [F.linear(hidden_states, key_slices[i]) for i in range(self.config.pretraining_tp)]
505
+ k = torch.cat(k, dim=-1)
506
+
507
+ v = [F.linear(hidden_states, value_slices[i]) for i in range(self.config.pretraining_tp)]
508
+ v = torch.cat(v, dim=-1)
509
+
510
+ else:
511
+ q = self.q_proj(hidden_states)
512
+ k = self.k_proj(hidden_states)
513
+ v = self.v_proj(hidden_states)
514
+
515
+ q = q.view(bsz, q_len, self.num_heads, self.head_dim)
516
+ k = k.view(bsz, q_len, self.num_key_value_heads, self.head_dim)
517
+ v = v.view(bsz, q_len, self.num_key_value_heads, self.head_dim)
518
+
519
+ q, k = self.rotary_emb(q, k, past_len)
520
+
521
+ kv = torch.stack([k, v], 2)
522
+ kv = repeat_kv(kv, self.num_key_value_groups)
523
+
524
+ # Cache KV values (queries are not cached)
525
+ if has_layer_past:
526
+ new_len = past_len+q.size(1)
527
+ if new_len > past_kv.size(1):
528
+ past_kv = torch.cat([past_kv, torch.empty(bsz, max(256, new_len - past_kv.size(1)), 2, kv.size(3), kv.size(4), dtype=kv.dtype, device=kv.device)], 1)
529
+ past_kv[:, past_len:new_len] = kv
530
+ kv = past_kv[:, :new_len]
531
+ else:
532
+ past_kv = kv
533
+
534
+ past_key_value = (past_kv, past_len+q.size(1)) if use_cache else None
535
+
536
+ if is_padded_inputs:
537
+
538
+ # Variable-length path: drop padding tokens; efficient for large batches with heavy padding
539
+
540
+ assert attention_mask is not None
541
+
542
+ unpadded_kv, indices_k, cu_seqlens_k, max_seqlen_k = unpad_input(kv, attention_mask)
543
+ unpadded_q, indices_q, cu_seqlens_q, max_seqlen_q = unpad_input(q, attention_mask[:, -q.size(1):])
544
+ attn_outputs = flash_attn_varlen_kvpacked_func(
545
+ unpadded_q, unpadded_kv, cu_seqlens_q, cu_seqlens_k,
546
+ max_seqlen_q, max_seqlen_k,
547
+ dropout_p=0.0, softmax_scale=1.0/self.norm_factor,
548
+ causal=(not has_layer_past), return_attn_probs=output_attentions
549
+ )
550
+
551
+ attn_output = attn_outputs[0] if output_attentions else attn_outputs
552
+ attn_output = pad_input(
553
+ attn_output, indices_q, bsz, q_len
554
+ ).reshape(bsz, q_len, h_size)
555
+ attn_weights = attn_outputs[2] if output_attentions else None
556
+
557
+ else:
558
+
559
+ # no padding tokens, more efficient
560
+
561
+ attn_outputs = flash_attn_kvpacked_func(
562
+ q, kv, dropout_p=0.0, softmax_scale=1.0/self.norm_factor, causal=(not has_layer_past), return_attn_probs=output_attentions)
563
+
564
+ attn_output = attn_outputs[0] if output_attentions else attn_outputs
565
+ attn_output = attn_output.reshape(bsz, q_len, h_size)
566
+ attn_weights = attn_outputs[2] if output_attentions else None
567
+
568
+ if self.config.pretraining_tp > 1:
569
+ attn_output = attn_output.split(self.hidden_size // self.config.pretraining_tp, dim=2)
570
+ o_proj_slices = self.o_proj.weight.split(self.hidden_size // self.config.pretraining_tp, dim=1)
571
+ attn_output = sum([F.linear(attn_output[i], o_proj_slices[i]) for i in range(self.config.pretraining_tp)])
572
+ else:
573
+ attn_output = self.o_proj(attn_output)
574
+
575
+ if not output_attentions:
576
+ attn_weights = None
577
+
578
+ return attn_output, attn_weights, past_key_value
579
+
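The key/value cache in `LlamaAttention.forward` is stored as a pair `(buffer, valid_length)` and grown in blocks rather than reallocated on every decoding step. The sketch below illustrates that bookkeeping with a plain list. `ChunkedCache` is a hypothetical class, not the author's implementation, and it assumes the buffer should grow by at least the amount actually needed when more than one block of new entries arrives at once:

```python
class ChunkedCache:
    """Append-only cache that over-allocates in blocks (illustrative only)."""

    def __init__(self, chunk: int = 256):
        self.chunk = chunk
        self.buffer = []   # allocated slots, some may be unused
        self.length = 0    # number of valid entries

    def append(self, new_items: list) -> list:
        new_len = self.length + len(new_items)
        if new_len > len(self.buffer):
            # Grow by a whole block, or more if the new items need it.
            grow = max(self.chunk, new_len - len(self.buffer))
            self.buffer.extend([None] * grow)
        self.buffer[self.length:new_len] = new_items
        self.length = new_len
        # Only the valid prefix is handed to attention.
        return self.buffer[:new_len]


cache = ChunkedCache(chunk=4)
cache.append([1, 2, 3])
size_after_prefill = len(cache.buffer)
cache.append([4])
size_after_one_token = len(cache.buffer)
view = cache.append([5])
print(size_after_prefill, size_after_one_token, len(cache.buffer), view)
```

Over-allocating trades a little memory for far fewer concatenations during token-by-token generation. One difference from the code above: there, the first call stores the prompt's keys and values without any spare capacity, so the first decoding step always triggers a growth.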
580
+
581
+ class LlamaDecoderLayer(nn.Module):
582
+ def __init__(self, config: LlamaConfig):
583
+ super().__init__()
584
+ self.hidden_size = config.hidden_size
585
+ self.self_attn = LlamaAttention(config=config)
586
+ self.mlp = LlamaMLP(config)
587
+ self.input_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
588
+ self.post_attention_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
589
+
590
+ def forward(
591
+ self,
592
+ hidden_states: torch.Tensor,
593
+ attention_mask: Optional[torch.Tensor] = None,
594
+ position_ids: Optional[torch.LongTensor] = None,
595
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
596
+ is_padded_inputs: Optional[bool] = False,
597
+ output_attentions: Optional[bool] = False,
598
+ use_cache: Optional[bool] = False,
599
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
600
+ """
601
+ Args:
602
+ hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
603
+ attention_mask (`torch.Tensor`, *optional*): padding mask of size
604
+ `(batch, src_len)` with 1 for real tokens and 0 for padding; only consulted when `is_padded_inputs` is True.
605
+ output_attentions (`bool`, *optional*):
606
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
607
+ returned tensors for more detail.
608
+ use_cache (`bool`, *optional*):
609
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
610
+ (see `past_key_values`).
611
+ past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
612
+ """
613
+
614
+ residual = hidden_states
615
+
616
+ hidden_states = self.input_layernorm(hidden_states)
617
+
618
+ # Self Attention
619
+ hidden_states, self_attn_weights, present_key_value = self.self_attn(
620
+ hidden_states=hidden_states,
621
+ attention_mask=attention_mask,
622
+ position_ids=position_ids,
623
+ past_key_value=past_key_value,
624
+ output_attentions=output_attentions,
625
+ use_cache=use_cache,
626
+ is_padded_inputs=is_padded_inputs,
627
+ )
628
+ hidden_states = residual + hidden_states
629
+
630
+ # Fully Connected
631
+ residual = hidden_states
632
+ hidden_states = self.post_attention_layernorm(hidden_states)
633
+ hidden_states = self.mlp(hidden_states)
634
+ hidden_states = residual + hidden_states
635
+
636
+ outputs = (hidden_states,)
637
+
638
+ if output_attentions:
639
+ outputs += (self_attn_weights,)
640
+
641
+ if use_cache:
642
+ outputs += (present_key_value,)
643
+
644
+ return outputs
645
+
646
+
647
+ LLAMA_START_DOCSTRING = r"""
648
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
649
+ library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
650
+ etc.)
651
+
652
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
653
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
654
+ and behavior.
655
+
656
+ Parameters:
657
+ config ([`LlamaConfig`]):
658
+ Model configuration class with all the parameters of the model. Initializing with a config file does not
659
+ load the weights associated with the model, only the configuration. Check out the
660
+ [`~PreTrainedModel.from_pretrained`] method to load the model weights.
661
+ """
662
+
663
+
664
+ @add_start_docstrings(
665
+ "The bare LLaMA Model outputting raw hidden-states without any specific head on top.",
666
+ LLAMA_START_DOCSTRING,
667
+ )
668
+ class LlamaPreTrainedModel(PreTrainedModel):
669
+ config_class = LlamaConfig
670
+ base_model_prefix = "model"
671
+ supports_gradient_checkpointing = True
672
+ _no_split_modules = ["LlamaDecoderLayer"]
673
+ _skip_keys_device_placement = "past_key_values"
674
+
675
+ def _init_weights(self, module):
676
+ std = self.config.initializer_range
677
+ if isinstance(module, nn.Linear):
678
+ module.weight.data.normal_(mean=0.0, std=std)
679
+ if module.bias is not None:
680
+ module.bias.data.zero_()
681
+ elif isinstance(module, nn.Embedding):
682
+ module.weight.data.normal_(mean=0.0, std=std)
683
+ if module.padding_idx is not None:
684
+ module.weight.data[module.padding_idx].zero_()
685
+
686
+ def _set_gradient_checkpointing(self, module, value=False):
687
+ if isinstance(module, LlamaModel):
688
+ module.gradient_checkpointing = value
689
+
690
+
691
+ LLAMA_INPUTS_DOCSTRING = r"""
692
+ Args:
693
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
694
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
695
+ it.
696
+
697
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
698
+ [`PreTrainedTokenizer.__call__`] for details.
699
+
700
+ [What are input IDs?](../glossary#input-ids)
701
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
702
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
703
+
704
+ - 1 for tokens that are **not masked**,
705
+ - 0 for tokens that are **masked**.
706
+
707
+ [What are attention masks?](../glossary#attention-mask)
708
+
709
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
710
+ [`PreTrainedTokenizer.__call__`] for details.
711
+
712
+ If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
713
+ `past_key_values`).
714
+
715
+ If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
716
+ and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
717
+ information on the default strategy.
718
+
719
+ - 1 indicates the head is **not masked**,
720
+ - 0 indicates the head is **masked**.
721
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
722
+ Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0,
723
+ config.n_positions - 1]`.
724
+
725
+ [What are position IDs?](../glossary#position-ids)
726
+ past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
727
+ Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
728
+ `(batch_size, num_heads, sequence_length, embed_size_per_head)`) and 2 additional tensors of shape
729
+ `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.
730
+
731
+ Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
732
+ blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.
733
+
734
+ If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that
735
+ don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
736
+ `input_ids` of shape `(batch_size, sequence_length)`.
737
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
738
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
739
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
740
+ model's internal embedding lookup matrix.
741
+ use_cache (`bool`, *optional*):
742
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
743
+ `past_key_values`).
744
+ output_attentions (`bool`, *optional*):
745
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
746
+ tensors for more detail.
747
+ output_hidden_states (`bool`, *optional*):
748
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
749
+ more detail.
750
+ return_dict (`bool`, *optional*):
751
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
752
+ """
753
+
754
+
755
+ @add_start_docstrings(
756
+ "The bare LLaMA Model outputting raw hidden-states without any specific head on top.",
757
+ LLAMA_START_DOCSTRING,
758
+ )
759
+ class LlamaModel(LlamaPreTrainedModel):
760
+ """
761
+ Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`LlamaDecoderLayer`]
762
+
763
+ Args:
764
+ config: LlamaConfig
765
+ """
766
+
767
+ def __init__(self, config: LlamaConfig):
768
+ super().__init__(config)
769
+ self.padding_idx = config.pad_token_id
770
+ self.vocab_size = config.vocab_size
771
+
772
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
773
+ self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
774
+ self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
775
+
776
+ self.gradient_checkpointing = False
777
+ # Initialize weights and apply final processing
778
+ self.post_init()
779
+
780
+ def get_input_embeddings(self):
781
+ return self.embed_tokens
782
+
783
+ def set_input_embeddings(self, value):
784
+ self.embed_tokens = value
785
+
786
+ @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
787
+ def forward(
788
+ self,
789
+ input_ids: torch.LongTensor = None,
790
+ attention_mask: Optional[torch.Tensor] = None,
791
+ position_ids: Optional[torch.LongTensor] = None,
792
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
793
+ inputs_embeds: Optional[torch.FloatTensor] = None,
794
+ use_cache: Optional[bool] = None,
795
+ output_attentions: Optional[bool] = None,
796
+ output_hidden_states: Optional[bool] = None,
797
+ return_dict: Optional[bool] = None,
798
+ is_padded_inputs: Optional[bool] = False,
799
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
800
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
801
+ output_hidden_states = (
802
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
803
+ )
804
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
805
+
806
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
807
+
808
+ # retrieve input_ids and inputs_embeds
809
+ if input_ids is not None and inputs_embeds is not None:
810
+ raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
811
+ elif input_ids is not None:
812
+ batch_size, seq_length = input_ids.shape
813
+ elif inputs_embeds is not None:
814
+ batch_size, seq_length, _ = inputs_embeds.shape
815
+ else:
816
+ raise ValueError("You have to specify either input_ids or inputs_embeds")
817
+
818
+ seq_length_with_past = seq_length
819
+ past_key_values_length = 0
820
+
821
+ if past_key_values is not None:
822
+ past_key_values_length = past_key_values[0][1]  # each layer's cache is (past_kv, cached_len)
823
+ seq_length_with_past = seq_length_with_past + past_key_values_length
824
+
825
+ position_ids = None
826
+
827
+ if inputs_embeds is None:
828
+ inputs_embeds = self.embed_tokens(input_ids)
829
+
830
+ hidden_states = inputs_embeds
831
+
832
+ if self.gradient_checkpointing and self.training:
833
+ if use_cache:
834
+ logger.warning_once(
835
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
836
+ )
837
+ use_cache = False
838
+
839
+ # decoder layers
840
+ all_hidden_states = () if output_hidden_states else None
841
+ all_self_attns = () if output_attentions else None
842
+ next_decoder_cache = () if use_cache else None
843
+
844
+ for idx, decoder_layer in enumerate(self.layers):
845
+ if output_hidden_states:
846
+ all_hidden_states += (hidden_states,)
847
+
848
+ past_key_value = past_key_values[idx] if past_key_values is not None else None
849
+
850
+ if self.gradient_checkpointing and self.training:
851
+
852
+ def create_custom_forward(module):
853
+ def custom_forward(*inputs):
854
+ # Trailing None is use_cache; past_key_value is already passed as None by the caller
855
+ return module(*inputs, output_attentions, None)
856
+
857
+ return custom_forward
858
+
859
+ layer_outputs = torch.utils.checkpoint.checkpoint(
860
+ create_custom_forward(decoder_layer),
861
+ hidden_states,
862
+ attention_mask,
863
+ position_ids,
864
+ None,
865
+ is_padded_inputs
866
+ )
867
+ else:
868
+ layer_outputs = decoder_layer(
869
+ hidden_states,
870
+ attention_mask=attention_mask,
871
+ position_ids=position_ids,
872
+ past_key_value=past_key_value,
873
+ output_attentions=output_attentions,
874
+ use_cache=use_cache,
875
+ is_padded_inputs=is_padded_inputs,
876
+ )
877
+
878
+ hidden_states = layer_outputs[0]
879
+
880
+ if use_cache:
881
+ next_decoder_cache += (layer_outputs[2 if output_attentions else 1],)
882
+
883
+ if output_attentions:
884
+ all_self_attns += (layer_outputs[1],)
885
+
886
+ hidden_states = self.norm(hidden_states)
887
+
888
+ # add hidden states from the last decoder layer
889
+ if output_hidden_states:
890
+ all_hidden_states += (hidden_states,)
891
+
892
+ next_cache = next_decoder_cache if use_cache else None
893
+ if not return_dict:
894
+ return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
895
+ return BaseModelOutputWithPast(
896
+ last_hidden_state=hidden_states,
897
+ past_key_values=next_cache,
898
+ hidden_states=all_hidden_states,
899
+ attentions=all_self_attns,
900
+ )
901
+
902
+
903
+ class LlamaForCausalLM(LlamaPreTrainedModel):
904
+ _tied_weights_keys = ["lm_head.weight"]
905
+
906
+ def __init__(self, config):
907
+ super().__init__(config)
908
+ self.model = LlamaModel(config)
909
+ self.vocab_size = config.vocab_size
910
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
911
+
912
+ # Initialize weights and apply final processing
913
+ self.post_init()
914
+
915
+ def get_input_embeddings(self):
916
+ return self.model.embed_tokens
917
+
918
+ def set_input_embeddings(self, value):
919
+ self.model.embed_tokens = value
920
+
921
+ def get_output_embeddings(self):
922
+ return self.lm_head
923
+
924
+ def set_output_embeddings(self, new_embeddings):
925
+ self.lm_head = new_embeddings
926
+
927
+ def set_decoder(self, decoder):
928
+ self.model = decoder
929
+
930
+ def get_decoder(self):
931
+ return self.model
932
+
933
+ @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
934
+ @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
935
+ def forward(
936
+ self,
937
+ input_ids: torch.LongTensor = None,
938
+ attention_mask: Optional[torch.Tensor] = None,
939
+ position_ids: Optional[torch.LongTensor] = None,
940
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
941
+ inputs_embeds: Optional[torch.FloatTensor] = None,
942
+ labels: Optional[torch.LongTensor] = None,
943
+ use_cache: Optional[bool] = None,
944
+ output_attentions: Optional[bool] = None,
945
+ output_hidden_states: Optional[bool] = None,
946
+ return_dict: Optional[bool] = None,
947
+ is_padded_inputs: Optional[bool] = None,
948
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
949
+ r"""
950
+ Args:
951
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
952
+ Labels for computing the causal language modeling (next-token prediction) loss. Indices should either be in `[0, ...,
953
+ config.vocab_size - 1]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
954
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size - 1]`.
955
+
956
+ Returns:
957
+
958
+ Example:
959
+
960
+ ```python
961
+ >>> from transformers import AutoTokenizer, LlamaForCausalLM
962
+
963
+ >>> model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
964
+ >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
965
+
966
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
967
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
968
+
969
+ >>> # Generate
970
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
971
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
972
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
973
+ ```"""
974
+
975
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
976
+ output_hidden_states = (
977
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
978
+ )
979
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
980
+
981
+ is_padded_inputs = ((attention_mask is not None) and (not attention_mask.all().item()))
982
+
983
+         # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
984
+ outputs = self.model(
985
+ input_ids=input_ids,
986
+ attention_mask=attention_mask,
987
+ position_ids=position_ids,
988
+ past_key_values=past_key_values,
989
+ inputs_embeds=inputs_embeds,
990
+ use_cache=use_cache,
991
+ output_attentions=output_attentions,
992
+ output_hidden_states=output_hidden_states,
993
+ return_dict=return_dict,
994
+ is_padded_inputs=is_padded_inputs,
995
+ )
996
+
997
+ hidden_states = outputs[0]
998
+ if self.config.pretraining_tp > 1:
999
+ lm_head_slices = self.lm_head.weight.split(self.vocab_size // self.config.pretraining_tp, dim=0)
1000
+ logits = [F.linear(hidden_states, lm_head_slices[i]) for i in range(self.config.pretraining_tp)]
1001
+ logits = torch.cat(logits, dim=-1)
1002
+ else:
1003
+ logits = self.lm_head(hidden_states)
1004
+ logits = logits.float()
1005
+
1006
+ loss = None
1007
+ if labels is not None:
1008
+ # Shift so that tokens < n predict n
1009
+ shift_logits = logits[..., :-1, :].contiguous()
1010
+ shift_labels = labels[..., 1:].contiguous()
1011
+ # Flatten the tokens
1012
+ loss_fct = CrossEntropyLoss()
1013
+ shift_logits = shift_logits.view(-1, self.config.vocab_size)
1014
+ shift_labels = shift_labels.view(-1)
1015
+ # Enable model parallelism
1016
+ shift_labels = shift_labels.to(shift_logits.device)
1017
+ loss = loss_fct(shift_logits, shift_labels)
1018
+
1019
+ if not return_dict:
1020
+ output = (logits,) + outputs[1:]
1021
+ return (loss,) + output if loss is not None else output
1022
+
1023
+ return CausalLMOutputWithPast(
1024
+ loss=loss,
1025
+ logits=logits,
1026
+ past_key_values=outputs.past_key_values,
1027
+ hidden_states=outputs.hidden_states,
1028
+ attentions=outputs.attentions,
1029
+ )
1030
+
1031
+ def prepare_inputs_for_generation(
1032
+ self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
1033
+ ):
1034
+ if past_key_values:
1035
+ input_ids = input_ids[:, -1:]
1036
+
1037
+ position_ids = kwargs.get("position_ids", None)
1038
+
1039
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
1040
+ if inputs_embeds is not None and past_key_values is None:
1041
+ model_inputs = {"inputs_embeds": inputs_embeds}
1042
+ else:
1043
+ model_inputs = {"input_ids": input_ids}
1044
+
1045
+ model_inputs.update(
1046
+ {
1047
+ "position_ids": position_ids,
1048
+ "past_key_values": past_key_values,
1049
+ "use_cache": kwargs.get("use_cache"),
1050
+ "attention_mask": attention_mask,
1051
+ "is_padded_inputs": ((attention_mask is not None) and (not attention_mask.all().item()))
1052
+ }
1053
+ )
1054
+ return model_inputs
1055
+
1056
+ @staticmethod
1057
+ def _reorder_cache(past_key_values, beam_idx):
1058
+ reordered_past = ()
1059
+ for layer_past in past_key_values:
1060
+ reordered_past += (
1061
+ tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
1062
+ )
1063
+ return reordered_past
1064
+
1065
+
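The loss branch of `LlamaForCausalLM.forward` above drops the last position's logits and the first label, so the logits at position `t` are scored against the token at `t + 1`, and `CrossEntropyLoss` skips labels equal to `-100`. The following is a minimal pure-Python sketch of that computation for a single sequence. It does not use PyTorch, and the name `shifted_cross_entropy` is hypothetical and does not appear in the commit.

```python
import math

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def shifted_cross_entropy(logits, labels):
    """Mean next-token cross-entropy for one sequence (illustrative only).

    logits: list of per-position score lists, shape (seq_len, vocab_size)
    labels: list of token ids of length seq_len; IGNORE_INDEX entries are skipped
    """
    # Drop the last position's logits and the first label, so that the
    # logits at position t are compared with the token at position t + 1.
    shift_logits = logits[:-1]
    shift_labels = labels[1:]

    total, count = 0.0, 0
    for scores, target in zip(shift_logits, shift_labels):
        if target == IGNORE_INDEX:
            continue
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]
        count += 1
    # Assumption of this sketch: return NaN when every label is ignored.
    return total / count if count else float("nan")


uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
loss_uniform = shifted_cross_entropy(uniform, [1, 2, 3])
loss_masked = shifted_cross_entropy(uniform, [1, IGNORE_INDEX, 3])
```

With uniform logits over four tokens, every scored position contributes `log(4)`, so both calls return the same value whether or not one label is masked out.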
1066
+ @add_start_docstrings(
1067
+ """
1068
+ The LLaMa Model transformer with a sequence classification head on top (linear layer).
1069
+
1070
+ [`LlamaForSequenceClassification`] uses the last token in order to do the classification, as other causal models
1071
+ (e.g. GPT-2) do.
1072
+
1073
+     Since it does classification on the last token, it needs to know the position of the last token. If a
1074
+ `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
1075
+ no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
1076
+ padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
1077
+ each row of the batch).
1078
+ """,
1079
+ LLAMA_START_DOCSTRING,
1080
+ )
1081
+ class LlamaForSequenceClassification(LlamaPreTrainedModel):
1082
+ def __init__(self, config):
1083
+ super().__init__(config)
1084
+ self.num_labels = config.num_labels
1085
+ self.model = LlamaModel(config)
1086
+ self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
1087
+
1088
+ # Initialize weights and apply final processing
1089
+ self.post_init()
1090
+
1091
+ def get_input_embeddings(self):
1092
+ return self.model.embed_tokens
1093
+
1094
+ def set_input_embeddings(self, value):
1095
+ self.model.embed_tokens = value
1096
+
1097
+ @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
1098
+ def forward(
1099
+ self,
1100
+ input_ids: torch.LongTensor = None,
1101
+ attention_mask: Optional[torch.Tensor] = None,
1102
+ position_ids: Optional[torch.LongTensor] = None,
1103
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1104
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1105
+ labels: Optional[torch.LongTensor] = None,
1106
+ use_cache: Optional[bool] = None,
1107
+ output_attentions: Optional[bool] = None,
1108
+ output_hidden_states: Optional[bool] = None,
1109
+ return_dict: Optional[bool] = None,
1110
+ ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
1111
+ r"""
1112
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1113
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
1114
+             config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss). If
1115
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
1116
+ """
1117
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1118
+
1119
+ transformer_outputs = self.model(
1120
+ input_ids,
1121
+ attention_mask=attention_mask,
1122
+ position_ids=position_ids,
1123
+ past_key_values=past_key_values,
1124
+ inputs_embeds=inputs_embeds,
1125
+ use_cache=use_cache,
1126
+ output_attentions=output_attentions,
1127
+ output_hidden_states=output_hidden_states,
1128
+ return_dict=return_dict,
1129
+ )
1130
+ hidden_states = transformer_outputs[0]
1131
+ logits = self.score(hidden_states)
1132
+
1133
+ if input_ids is not None:
1134
+ batch_size = input_ids.shape[0]
1135
+ else:
1136
+ batch_size = inputs_embeds.shape[0]
1137
+
1138
+ if self.config.pad_token_id is None and batch_size != 1:
1139
+ raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
1140
+ if self.config.pad_token_id is None:
1141
+ sequence_lengths = -1
1142
+ else:
1143
+ if input_ids is not None:
1144
+ sequence_lengths = (torch.ne(input_ids, self.config.pad_token_id).sum(-1) - 1).to(logits.device)
1145
+ else:
1146
+ sequence_lengths = -1
1147
+
1148
+ pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]
1149
+
1150
+ loss = None
1151
+ if labels is not None:
1152
+ labels = labels.to(logits.device)
1153
+ if self.config.problem_type is None:
1154
+ if self.num_labels == 1:
1155
+ self.config.problem_type = "regression"
1156
+ elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
1157
+ self.config.problem_type = "single_label_classification"
1158
+ else:
1159
+ self.config.problem_type = "multi_label_classification"
1160
+
1161
+ if self.config.problem_type == "regression":
1162
+ loss_fct = MSELoss()
1163
+ if self.num_labels == 1:
1164
+ loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
1165
+ else:
1166
+ loss = loss_fct(pooled_logits, labels)
1167
+ elif self.config.problem_type == "single_label_classification":
1168
+ loss_fct = CrossEntropyLoss()
1169
+ loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
1170
+ elif self.config.problem_type == "multi_label_classification":
1171
+ loss_fct = BCEWithLogitsLoss()
1172
+ loss = loss_fct(pooled_logits, labels)
1173
+ if not return_dict:
1174
+ output = (pooled_logits,) + transformer_outputs[1:]
1175
+ return ((loss,) + output) if loss is not None else output
1176
+
1177
+ return SequenceClassifierOutputWithPast(
1178
+ loss=loss,
1179
+ logits=pooled_logits,
1180
+ past_key_values=transformer_outputs.past_key_values,
1181
+ hidden_states=transformer_outputs.hidden_states,
1182
+ attentions=transformer_outputs.attentions,
1183
+ )
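`LlamaForSequenceClassification.forward` above picks one logit vector per row by counting the non-padding tokens (`torch.ne(input_ids, pad_token_id).sum(-1) - 1`). A small pure-Python sketch of that indexing may make the implicit right-padding assumption easier to see. The helper names `last_token_indices` and `pool_last_token` are hypothetical, and nested lists stand in for tensors.

```python
def last_token_indices(input_ids, pad_token_id):
    """Index of the last non-padding token in each row (illustrative only)."""
    if pad_token_id is None:
        # Mirrors the fallback in the diff: take the final position.
        return [-1] * len(input_ids)
    # Count the non-padding tokens in each row and subtract one.
    return [sum(tok != pad_token_id for tok in row) - 1 for row in input_ids]


def pool_last_token(logits, indices):
    """Select one per-token entry per row, as the pooled_logits indexing does."""
    return [row[i] for row, i in zip(logits, indices)]


right_padded = [[5, 6, 7, 0, 0], [8, 9, 0, 0, 0]]
idx = last_token_indices(right_padded, pad_token_id=0)
pooled = pool_last_token([["a", "b", "c", "d", "e"], ["f", "g", "h", "i", "j"]], idx)

# With left padding, the same count no longer lands on the last real token.
left_padded_idx = last_token_indices([[0, 0, 5, 6, 7]], pad_token_id=0)
```

Because the index is derived from a count rather than a position, a left-padded row yields index 2 here, which points at token `5` instead of the final token `7`. The counting approach therefore only matches the intended token when padding is on the right.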
plots.png ADDED
smash_config.json ADDED
@@ -0,0 +1,27 @@
1
+ {
2
+ "api_key": null,
3
+ "verify_url": "http://johnrachwan.pythonanywhere.com",
4
+ "smash_config": {
5
+ "pruners": "None",
6
+ "factorizers": "None",
7
+ "quantizers": "['llm-int8']",
8
+ "compilers": "None",
9
+ "task": "text_text_generation",
10
+ "device": "cuda",
11
+ "cache_dir": "/ceph/hdd/staff/charpent/.cache/modelsz8prnsvt",
12
+ "batch_size": 1,
13
+ "model_name": "NousResearch/Yarn-Llama-2-7b-128k",
14
+ "pruning_ratio": 0.0,
15
+ "n_quantization_bits": 4,
16
+ "output_deviation": 0.005,
17
+ "max_batch_size": 1,
18
+ "qtype_weight": "torch.qint8",
19
+ "qtype_activation": "torch.quint8",
20
+ "qobserver": "<class 'torch.ao.quantization.observer.MinMaxObserver'>",
21
+ "qscheme": "torch.per_tensor_symmetric",
22
+ "qconfig": "x86",
23
+ "group_size": 128,
24
+ "damp_percent": 0.1,
25
+ "save_load_fn": "bitsandbytes"
26
+ }
27
+ }
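Several values in `smash_config.json` above are Python literals stored as JSON strings (for example `"None"` and `"['llm-int8']"`). The sketch below shows one way a reader could decode them for inspection with only the standard library. It is not the pruna-engine loader, whose behavior is not shown in this commit. The helper name `decode_value` is hypothetical, and the embedded JSON is a shortened copy of the file.

```python
import ast
import json

# Shortened copy of the smash_config.json contents shown in the diff.
RAW = """
{
  "smash_config": {
    "pruners": "None",
    "quantizers": "['llm-int8']",
    "task": "text_text_generation",
    "batch_size": 1,
    "save_load_fn": "bitsandbytes"
  }
}
"""


def decode_value(value):
    """Turn stringified Python literals into objects; leave other values alone."""
    if not isinstance(value, str):
        return value
    try:
        # literal_eval only accepts literals, so it never executes code.
        return ast.literal_eval(value)
    except (ValueError, SyntaxError):
        # Plain strings such as "bitsandbytes" are not literals; keep them.
        return value


config = {k: decode_value(v) for k, v in json.loads(RAW)["smash_config"].items()}
```

After decoding, `config["pruners"]` is `None` and `config["quantizers"]` is a list with the single entry `"llm-int8"`, while ordinary strings and numbers pass through unchanged.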