TheBloke committed · Commit 2f03821 · 1 Parent(s): 11027f5

AWQ model commit

LICENSE.txt ADDED
@@ -0,0 +1,126 @@
+ LLAMA 2 COMMUNITY LICENSE AGREEMENT
+ Llama 2 Version Release Date: July 18, 2023
+
+ "Agreement" means the terms and conditions for use, reproduction, distribution and
+ modification of the Llama Materials set forth herein.
+
+ "Documentation" means the specifications, manuals and documentation
+ accompanying Llama 2 distributed by Meta at ai.meta.com/resources/models-and-
+ libraries/llama-downloads/.
+
+ "Licensee" or "you" means you, or your employer or any other person or entity (if
+ you are entering into this Agreement on such person or entity's behalf), of the age
+ required under applicable laws, rules or regulations to provide legal consent and that
+ has legal authority to bind your employer or such other person or entity if you are
+ entering in this Agreement on their behalf.
+
+ "Llama 2" means the foundational large language models and software and
+ algorithms, including machine-learning model code, trained model weights,
+ inference-enabling code, training-enabling code, fine-tuning enabling code and other
+ elements of the foregoing distributed by Meta at ai.meta.com/resources/models-and-
+ libraries/llama-downloads/.
+
+ "Llama Materials" means, collectively, Meta's proprietary Llama 2 and
+ Documentation (and any portion thereof) made available under this Agreement.
+
+ "Meta" or "we" means Meta Platforms Ireland Limited (if you are located in or, if you
+ are an entity, your principal place of business is in the EEA or Switzerland) and Meta
+ Platforms, Inc. (if you are located outside of the EEA or Switzerland).
+
+ By clicking "I Accept" below or by using or distributing any portion or element of the
+ Llama Materials, you agree to be bound by this Agreement.
+
+ 1. License Rights and Redistribution.
+
+ a. Grant of Rights. You are granted a non-exclusive, worldwide, non-
+ transferable and royalty-free limited license under Meta's intellectual property or
+ other rights owned by Meta embodied in the Llama Materials to use, reproduce,
+ distribute, copy, create derivative works of, and make modifications to the Llama
+ Materials.
+
+ b. Redistribution and Use.
+
+ i. If you distribute or make the Llama Materials, or any derivative works
+ thereof, available to a third party, you shall provide a copy of this Agreement to such
+ third party.
+ ii. If you receive Llama Materials, or any derivative works thereof, from
+ a Licensee as part of an integrated end user product, then Section 2 of this
+ Agreement will not apply to you.
+
+ iii. You must retain in all copies of the Llama Materials that you
+ distribute the following attribution notice within a "Notice" text file distributed as a
+ part of such copies: "Llama 2 is licensed under the LLAMA 2 Community License,
+ Copyright (c) Meta Platforms, Inc. All Rights Reserved."
+
+ iv. Your use of the Llama Materials must comply with applicable laws
+ and regulations (including trade compliance laws and regulations) and adhere to the
+ Acceptable Use Policy for the Llama Materials (available at
+ https://ai.meta.com/llama/use-policy), which is hereby incorporated by reference into
+ this Agreement.
+
+ v. You will not use the Llama Materials or any output or results of the
+ Llama Materials to improve any other large language model (excluding Llama 2 or
+ derivative works thereof).
+
+ 2. Additional Commercial Terms. If, on the Llama 2 version release date, the
+ monthly active users of the products or services made available by or for Licensee,
+ or Licensee's affiliates, is greater than 700 million monthly active users in the
+ preceding calendar month, you must request a license from Meta, which Meta may
+ grant to you in its sole discretion, and you are not authorized to exercise any of the
+ rights under this Agreement unless or until Meta otherwise expressly grants you
+ such rights.
+
+ 3. Disclaimer of Warranty. UNLESS REQUIRED BY APPLICABLE LAW, THE
+ LLAMA MATERIALS AND ANY OUTPUT AND RESULTS THEREFROM ARE
+ PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND,
+ EITHER EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, ANY
+ WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR
+ FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE
+ FOR DETERMINING THE APPROPRIATENESS OF USING OR REDISTRIBUTING
+ THE LLAMA MATERIALS AND ASSUME ANY RISKS ASSOCIATED WITH YOUR
+ USE OF THE LLAMA MATERIALS AND ANY OUTPUT AND RESULTS.
+
+ 4. Limitation of Liability. IN NO EVENT WILL META OR ITS AFFILIATES BE
+ LIABLE UNDER ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, TORT,
+ NEGLIGENCE, PRODUCTS LIABILITY, OR OTHERWISE, ARISING OUT OF THIS
+ AGREEMENT, FOR ANY LOST PROFITS OR ANY INDIRECT, SPECIAL,
+ CONSEQUENTIAL, INCIDENTAL, EXEMPLARY OR PUNITIVE DAMAGES, EVEN
+ IF META OR ITS AFFILIATES HAVE BEEN ADVISED OF THE POSSIBILITY OF
+ ANY OF THE FOREGOING.
+
+ 5. Intellectual Property.
+
+ a. No trademark licenses are granted under this Agreement, and in
+ connection with the Llama Materials, neither Meta nor Licensee may use any name
+ or mark owned by or associated with the other or any of its affiliates, except as
+ required for reasonable and customary use in describing and redistributing the
+ Llama Materials.
+
+ b. Subject to Meta's ownership of Llama Materials and derivatives made by or
+ for Meta, with respect to any derivative works and modifications of the Llama
+ Materials that are made by you, as between you and Meta, you are and will be the
+ owner of such derivative works and modifications.
+
+ c. If you institute litigation or other proceedings against Meta or any entity
+ (including a cross-claim or counterclaim in a lawsuit) alleging that the Llama
+ Materials or Llama 2 outputs or results, or any portion of any of the foregoing,
+ constitutes infringement of intellectual property or other rights owned or licensable
+ by you, then any licenses granted to you under this Agreement shall terminate as of
+ the date such litigation or claim is filed or instituted. You will indemnify and hold
+ harmless Meta from and against any claim by any third party arising out of or related
+ to your use or distribution of the Llama Materials.
+
+ 6. Term and Termination. The term of this Agreement will commence upon your
+ acceptance of this Agreement or access to the Llama Materials and will continue in
+ full force and effect until terminated in accordance with the terms and conditions
+ herein. Meta may terminate this Agreement if you are in breach of any term or
+ condition of this Agreement. Upon termination of this Agreement, you shall delete
+ and cease use of the Llama Materials. Sections 3, 4 and 7 shall survive the
+ termination of this Agreement.
+
+ 7. Governing Law and Jurisdiction. This Agreement will be governed and
+ construed under the laws of the State of California without regard to choice of law
+ principles, and the UN Convention on Contracts for the International Sale of Goods
+ does not apply to this Agreement. The courts of California shall have exclusive
+ jurisdiction of any dispute arising out of this Agreement.
+
Notice ADDED
@@ -0,0 +1 @@
+ Llama 2 is licensed under the LLAMA 2 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved.
USE_POLICY.md ADDED
@@ -0,0 +1,50 @@
+ # Llama 2 Acceptable Use Policy
+
+ Meta is committed to promoting safe and fair use of its tools and features, including Llama 2. If you access or use Llama 2, you agree to this Acceptable Use Policy (“Policy”). The most recent copy of this policy can be found at [ai.meta.com/llama/use-policy](http://ai.meta.com/llama/use-policy).
+
+ ## Prohibited Uses
+ We want everyone to use Llama 2 safely and responsibly. You agree you will not use, or allow others to use, Llama 2 to:
+
+ 1. Violate the law or others’ rights, including to:
+     1. Engage in, promote, generate, contribute to, encourage, plan, incite, or further illegal or unlawful activity or content, such as:
+         1. Violence or terrorism
+         2. Exploitation or harm to children, including the solicitation, creation, acquisition, or dissemination of child exploitative content or failure to report Child Sexual Abuse Material
+         3. Human trafficking, exploitation, and sexual violence
+         4. The illegal distribution of information or materials to minors, including obscene materials, or failure to employ legally required age-gating in connection with such information or materials.
+         5. Sexual solicitation
+         6. Any other criminal activity
+     2. Engage in, promote, incite, or facilitate the harassment, abuse, threatening, or bullying of individuals or groups of individuals
+     3. Engage in, promote, incite, or facilitate discrimination or other unlawful or harmful conduct in the provision of employment, employment benefits, credit, housing, other economic benefits, or other essential goods and services
+     4. Engage in the unauthorized or unlicensed practice of any profession including, but not limited to, financial, legal, medical/health, or related professional practices
+     5. Collect, process, disclose, generate, or infer health, demographic, or other sensitive personal or private information about individuals without rights and consents required by applicable laws
+     6. Engage in or facilitate any action or generate any content that infringes, misappropriates, or otherwise violates any third-party rights, including the outputs or results of any products or services using the Llama 2 Materials
+     7. Create, generate, or facilitate the creation of malicious code, malware, computer viruses or do anything else that could disable, overburden, interfere with or impair the proper working, integrity, operation or appearance of a website or computer system
+
+
+
+ 2. Engage in, promote, incite, facilitate, or assist in the planning or development of activities that present a risk of death or bodily harm to individuals, including use of Llama 2 related to the following:
+     1. Military, warfare, nuclear industries or applications, espionage, use for materials or activities that are subject to the International Traffic Arms Regulations (ITAR) maintained by the United States Department of State
+     2. Guns and illegal weapons (including weapon development)
+     3. Illegal drugs and regulated/controlled substances
+     4. Operation of critical infrastructure, transportation technologies, or heavy machinery
+     5. Self-harm or harm to others, including suicide, cutting, and eating disorders
+     6. Any content intended to incite or promote violence, abuse, or any infliction of bodily harm to an individual
+
+
+
+ 3. Intentionally deceive or mislead others, including use of Llama 2 related to the following:
+     1. Generating, promoting, or furthering fraud or the creation or promotion of disinformation
+     2. Generating, promoting, or furthering defamatory content, including the creation of defamatory statements, images, or other content
+     3. Generating, promoting, or further distributing spam
+     4. Impersonating another individual without consent, authorization, or legal right
+     5. Representing that the use of Llama 2 or outputs are human-generated
+     6. Generating or facilitating false online engagement, including fake reviews and other means of fake online engagement
+ 4. Fail to appropriately disclose to end users any known dangers of your AI system
+
+ Please report any violation of this Policy, software “bug,” or other problems that could lead to a violation of this Policy through one of the following means:
+
+ * Reporting issues with the model: [github.com/facebookresearch/llama](http://github.com/facebookresearch/llama)
+ * Reporting risky content generated by the model: [developers.facebook.com/llama_output_feedback](http://developers.facebook.com/llama_output_feedback)
+ * Reporting bugs and security concerns: [facebook.com/whitehat/info](http://facebook.com/whitehat/info)
+ * Reporting violations of the Acceptable Use Policy or unlicensed uses of Llama: [[email protected]](mailto:[email protected])
+
config.json ADDED
@@ -0,0 +1,35 @@
+ {
+   "_name_or_path": "../../../basemodels/Yarn-Llama-2-13b-64k",
+   "architectures": [
+     "LlamaForCausalLM"
+   ],
+   "auto_map": {
+     "AutoConfig": "configuration_llama.LlamaConfig",
+     "AutoModelForCausalLM": "modeling_llama.LlamaForCausalLM"
+   },
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "hidden_act": "silu",
+   "hidden_size": 5120,
+   "initializer_range": 0.02,
+   "intermediate_size": 13824,
+   "max_position_embeddings": 65536,
+   "model_type": "llama",
+   "num_attention_heads": 40,
+   "num_hidden_layers": 40,
+   "num_key_value_heads": 40,
+   "pad_token_id": 0,
+   "pretraining_tp": 1,
+   "rms_norm_eps": 1e-05,
+   "rope_scaling": {
+     "factor": 16.0,
+     "original_max_position_embeddings": 4096,
+     "type": "yarn"
+   },
+   "tie_word_embeddings": false,
+   "torch_dtype": "float16",
+   "transformers_version": "4.33.0.dev0",
+   "use_cache": true,
+   "use_flash_attention": false,
+   "vocab_size": 32000
+ }
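
A few values in `config.json` are tied together by simple arithmetic: the YaRN `factor` times `original_max_position_embeddings` should give `max_position_embeddings`, and `hidden_size` divided by `num_attention_heads` gives the per-head dimension that the rotary embedding operates on. The sketch below checks this using only the values shown above; the variable names are illustrative and not part of the commit.

```python
import json

# Subset of the config.json values shown in this commit.
config_text = """
{
  "hidden_size": 5120,
  "num_attention_heads": 40,
  "max_position_embeddings": 65536,
  "rope_scaling": {
    "factor": 16.0,
    "original_max_position_embeddings": 4096,
    "type": "yarn"
  }
}
"""

config = json.loads(config_text)
rope = config["rope_scaling"]

# Context length implied by scaling the original window by the YaRN factor.
extended_context = int(rope["factor"] * rope["original_max_position_embeddings"])

# Per-head dimension, i.e. the `dim` seen by the rotary embedding helpers.
head_dim = config["hidden_size"] // config["num_attention_heads"]

print(extended_context)  # 65536
print(head_dim)          # 128
```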
configuration_llama.py ADDED
@@ -0,0 +1,186 @@
+ # coding=utf-8
+ # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
+ #
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
+ # and OPT implementations in this library. It has been modified from its
+ # original forms to accommodate minor architectural differences compared
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """ LLaMA model configuration"""
+
+ from transformers.configuration_utils import PretrainedConfig
+ from transformers.utils import logging
+
+
+ logger = logging.get_logger(__name__)
+
+ LLAMA_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
+
+
+ class LlamaConfig(PretrainedConfig):
+     r"""
+     This is the configuration class to store the configuration of a [`LlamaModel`]. It is used to instantiate an LLaMA
+     model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
+     defaults will yield a similar configuration to that of the LLaMA-7B.
+
+     Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
+     documentation from [`PretrainedConfig`] for more information.
+
+
+     Args:
+         vocab_size (`int`, *optional*, defaults to 32000):
+             Vocabulary size of the LLaMA model. Defines the number of different tokens that can be represented by the
+             `inputs_ids` passed when calling [`LlamaModel`]
+         hidden_size (`int`, *optional*, defaults to 4096):
+             Dimension of the hidden representations.
+         intermediate_size (`int`, *optional*, defaults to 11008):
+             Dimension of the MLP representations.
+         num_hidden_layers (`int`, *optional*, defaults to 32):
+             Number of hidden layers in the Transformer decoder.
+         num_attention_heads (`int`, *optional*, defaults to 32):
+             Number of attention heads for each attention layer in the Transformer decoder.
+         num_key_value_heads (`int`, *optional*):
+             This is the number of key_value heads that should be used to implement Grouped Query Attention. If
+             `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
+             `num_key_value_heads=1` the model will use Multi Query Attention (MQA), otherwise GQA is used. When
+             converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
+             by meanpooling all the original heads within that group. For more details check out [this
+             paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
+             `num_attention_heads`.
+         pretraining_tp (`int`, *optional*, defaults to `1`):
+             Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
+             document](https://huggingface.co/docs/transformers/parallelism) to understand more about it. This value is
+             necessary to ensure exact reproducibility of the pretraining results. Please refer to [this
+             issue](https://github.com/pytorch/pytorch/issues/76232).
+         hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
+             The non-linear activation function (function or string) in the decoder.
+         max_position_embeddings (`int`, *optional*, defaults to 2048):
+             The maximum sequence length that this model might ever be used with. Typically set this to something large
+             just in case (e.g., 512 or 1024 or 2048).
+         initializer_range (`float`, *optional*, defaults to 0.02):
+             The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
+         rms_norm_eps (`float`, *optional*, defaults to 1e-6):
+             The epsilon used by the rms normalization layers.
+         use_cache (`bool`, *optional*, defaults to `True`):
+             Whether or not the model should return the last key/values attentions (not used by all models). Only
+             relevant if `config.is_decoder=True`.
+         tie_word_embeddings (`bool`, *optional*, defaults to `False`):
+             Whether to tie weight embeddings
+         rope_scaling (`Dict`, *optional*):
+             Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports five scaling
+             strategies: linear, dynamic, ntk-by-parts, yarn and dynamic-yarn. Their scaling factor must be a float
+             greater than 1. The expected format is `{"type": strategy name, "factor": scaling factor}`; the
+             ntk-by-parts, yarn and dynamic-yarn strategies additionally require an integer
+             `original_max_position_embeddings` entry. When using this flag, don't update
+             `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
+             these scaling strategies behave:
+             https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
+             experimental feature, subject to breaking API changes in future versions.
+         use_flash_attention (`bool`, *optional*, defaults to `False`):
+             Whether to use Flash Attention. Requires the `flash_attn` (version 2+) and `einops` packages.
+
+         Example:
+
+     ```python
+     >>> from transformers import LlamaModel, LlamaConfig
+
+     >>> # Initializing a LLaMA llama-7b style configuration
+     >>> configuration = LlamaConfig()
+
+     >>> # Initializing a model from the llama-7b style configuration
+     >>> model = LlamaModel(configuration)
+
+     >>> # Accessing the model configuration
+     >>> configuration = model.config
+     ```"""
+     model_type = "llama"
+     keys_to_ignore_at_inference = ["past_key_values"]
+
+     def __init__(
+         self,
+         vocab_size=32000,
+         hidden_size=4096,
+         intermediate_size=11008,
+         num_hidden_layers=32,
+         num_attention_heads=32,
+         num_key_value_heads=None,
+         hidden_act="silu",
+         max_position_embeddings=2048,
+         initializer_range=0.02,
+         rms_norm_eps=1e-6,
+         use_cache=True,
+         pad_token_id=0,
+         bos_token_id=1,
+         eos_token_id=2,
+         pretraining_tp=1,
+         tie_word_embeddings=False,
+         rope_scaling=None,
+         use_flash_attention=False,
+         **kwargs,
+     ):
+         self.vocab_size = vocab_size
+         self.max_position_embeddings = max_position_embeddings
+         self.hidden_size = hidden_size
+         self.intermediate_size = intermediate_size
+         self.num_hidden_layers = num_hidden_layers
+         self.num_attention_heads = num_attention_heads
+
+         # for backward compatibility
+         if num_key_value_heads is None:
+             num_key_value_heads = num_attention_heads
+
+         self.num_key_value_heads = num_key_value_heads
+         self.hidden_act = hidden_act
+         self.initializer_range = initializer_range
+         self.rms_norm_eps = rms_norm_eps
+         self.pretraining_tp = pretraining_tp
+         self.use_cache = use_cache
+         self.rope_scaling = rope_scaling
+         self._rope_scaling_validation()
+         self.use_flash_attention = use_flash_attention
+         if self.use_flash_attention:
+             # Fail early if the optional dependencies cannot be imported.
+             try:
+                 from flash_attn.flash_attn_interface import flash_attn_varlen_func  # noqa: F401
+                 from einops import rearrange  # noqa: F401
+             except Exception as exc:
+                 raise ValueError("`use_flash_attention` requires Flash Attention 2+ and einops.\nTry `pip install einops` and installing Flash Attention from https://github.com/Dao-AILab/flash-attention") from exc
+
+         super().__init__(
+             pad_token_id=pad_token_id,
+             bos_token_id=bos_token_id,
+             eos_token_id=eos_token_id,
+             tie_word_embeddings=tie_word_embeddings,
+             **kwargs,
+         )
+
+     def _rope_scaling_validation(self):
+         """
+         Validate the `rope_scaling` configuration.
+         """
+         if self.rope_scaling is None:
+             return
+
+         if not isinstance(self.rope_scaling, dict):
+             raise ValueError(
+                 "`rope_scaling` must be a dictionary, "
+                 f"got {self.rope_scaling}"
+             )
+         rope_scaling_type = self.rope_scaling.get("type", None)
+         rope_scaling_factor = self.rope_scaling.get("factor", None)
+         if rope_scaling_type is None or rope_scaling_type not in ["linear", "dynamic", "ntk-by-parts", "yarn", "dynamic-yarn"]:
+             raise ValueError(
+                 f"`rope_scaling`'s type field must be one of ['linear', 'dynamic', 'ntk-by-parts', 'yarn', 'dynamic-yarn'], got {rope_scaling_type}"
+             )
+         if rope_scaling_factor is None or not isinstance(rope_scaling_factor, float) or rope_scaling_factor <= 1.0:
+             raise ValueError(f"`rope_scaling`'s factor field must be a float > 1, got {rope_scaling_factor}")
+         if rope_scaling_type in ("ntk-by-parts", "yarn", "dynamic-yarn"):
+             original_max_position_embeddings = self.rope_scaling.get("original_max_position_embeddings", None)
+             if original_max_position_embeddings is None or not isinstance(original_max_position_embeddings, int):
+                 raise ValueError("`rope_scaling.original_max_position_embeddings` must be set to an int when using ntk-by-parts, yarn, and dynamic-yarn")
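
To see which `rope_scaling` dictionaries pass the checks in `_rope_scaling_validation`, the sketch below mirrors those checks as a standalone function so it runs without `transformers` installed. `validate_rope_scaling` and `is_valid` are hypothetical helper names introduced here for illustration; they are not part of the commit.

```python
SUPPORTED_TYPES = ["linear", "dynamic", "ntk-by-parts", "yarn", "dynamic-yarn"]
NEEDS_ORIGINAL_LENGTH = {"ntk-by-parts", "yarn", "dynamic-yarn"}


def validate_rope_scaling(rope_scaling):
    """Hypothetical standalone mirror of LlamaConfig._rope_scaling_validation."""
    if rope_scaling is None:
        return
    if not isinstance(rope_scaling, dict):
        raise ValueError(f"`rope_scaling` must be a dictionary, got {rope_scaling}")
    scaling_type = rope_scaling.get("type")
    factor = rope_scaling.get("factor")
    if scaling_type not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported type: {scaling_type}")
    # The factor must be a float strictly greater than 1; an int is rejected.
    if not isinstance(factor, float) or factor <= 1.0:
        raise ValueError(f"factor must be a float > 1, got {factor}")
    if scaling_type in NEEDS_ORIGINAL_LENGTH:
        original = rope_scaling.get("original_max_position_embeddings")
        if not isinstance(original, int):
            raise ValueError("original_max_position_embeddings must be an int")


def is_valid(rope_scaling):
    """Return True when the dictionary passes validation, False otherwise."""
    try:
        validate_rope_scaling(rope_scaling)
        return True
    except ValueError:
        return False


yarn_ok = {"type": "yarn", "factor": 16.0, "original_max_position_embeddings": 4096}
yarn_int_factor = {"type": "yarn", "factor": 16, "original_max_position_embeddings": 4096}
yarn_missing_length = {"type": "yarn", "factor": 16.0}

print(is_valid(yarn_ok))              # True
print(is_valid(yarn_int_factor))      # False
print(is_valid(yarn_missing_length))  # False
```

One practical consequence of the `isinstance(..., float)` check is that a factor written as the integer `16` is rejected, which is presumably why `config.json` spells it `16.0`.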
generation_config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 1,
+   "eos_token_id": 2,
+   "pad_token_id": 0,
+   "temperature": 0.9,
+   "top_p": 0.6,
+   "transformers_version": "4.32.0.dev0"
+ }
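
The `temperature` and `top_p` values above only matter when sampling is enabled at generation time. As a rough illustration of what they control, here is a dependency-free sketch of temperature scaling followed by nucleus (top-p) filtering. It follows the textbook description of the technique and is not the `transformers` implementation; `top_p_filter` and the example logits are made up for this sketch.

```python
import math


def top_p_filter(logits, temperature=0.9, top_p=0.6):
    """Return {token_index: renormalised probability} for the top-p nucleus."""
    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


example_logits = [2.0, 1.0, 0.5, -1.0]
narrow = top_p_filter(example_logits)            # defaults from generation_config.json
wide = top_p_filter(example_logits, top_p=0.9)   # a looser nucleus for comparison

print(sorted(narrow))  # [0]
print(sorted(wide))    # [0, 1, 2]
```

With these example logits the most likely token alone already exceeds a cumulative probability of 0.6, so the default nucleus collapses to a single candidate; raising `top_p` admits more tokens.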
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0828e5bef3957143b18c88c59dea19685578f073db59bca8bd29f0ad0d4ace76
+ size 7247987392
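
The three lines added for `model.safetensors` are a Git LFS pointer, not the weights themselves: each line is a key followed by a space and a value. A small sketch of reading such a pointer and converting the size to GiB follows; `parse_lfs_pointer` is a hypothetical helper written for this example.

```python
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:0828e5bef3957143b18c88c59dea19685578f073db59bca8bd29f0ad0d4ace76
size 7247987392
"""


def parse_lfs_pointer(text):
    """Split a Git LFS pointer into its version, hash algorithm, digest and size."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


info = parse_lfs_pointer(pointer_text)
size_gib = info["size_bytes"] / 2**30

print(info["hash_algorithm"])  # sha256
print(f"{size_gib:.2f} GiB")   # 6.75 GiB
```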
modeling_llama.py ADDED
@@ -0,0 +1,1377 @@
+ # coding=utf-8
+ # Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
+ #
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
+ # and OPT implementations in this library. It has been modified from its
+ # original forms to accommodate minor architectural differences compared
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+ """ PyTorch LLaMA model."""
+ import math
+ from typing import List, Optional, Tuple, Union
+
+ import torch
+ import torch.nn.functional as F
+ import torch.utils.checkpoint
+ from torch import nn
+ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
+
+ from transformers.activations import ACT2FN
+ from transformers.modeling_outputs import (
+     BaseModelOutputWithPast,
+     CausalLMOutputWithPast,
+     SequenceClassifierOutputWithPast,
+ )
+ from transformers.modeling_utils import PreTrainedModel
+ from transformers.utils import (
+     add_start_docstrings,
+     add_start_docstrings_to_model_forward,
+     logging,
+     replace_return_docstrings,
+ )
+ from .configuration_llama import LlamaConfig
+
+ # Flash Attention is optional; fall back gracefully if it cannot be imported.
+ try:
+     from flash_attn.flash_attn_interface import flash_attn_varlen_func
+     from flash_attn.modules.mha import FlashSelfAttention
+     from einops import rearrange
+
+     have_flash_attention = True
+ except Exception:
+     have_flash_attention = False
+
+ logger = logging.get_logger(__name__)
+
+ _CONFIG_FOR_DOC = "LlamaConfig"
+
+
+ # Copied from transformers.models.bart.modeling_bart._make_causal_mask
+ def _make_causal_mask(
+     input_ids_shape: torch.Size, dtype: torch.dtype, device: torch.device, past_key_values_length: int = 0
+ ):
+     """
+     Make the causal mask used for uni-directional (causal) self-attention.
+     """
+     bsz, tgt_len = input_ids_shape
+     mask = torch.full((tgt_len, tgt_len), torch.finfo(dtype).min, device=device)
+     mask_cond = torch.arange(mask.size(-1), device=device)
+     mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0)
+     mask = mask.to(dtype)
+
+     if past_key_values_length > 0:
+         mask = torch.cat([torch.zeros(tgt_len, past_key_values_length, dtype=dtype, device=device), mask], dim=-1)
+     return mask[None, None, :, :].expand(bsz, 1, tgt_len, tgt_len + past_key_values_length)
+
+
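
The mask built by `_make_causal_mask` holds 0 wherever a query position may attend to a key position and the most negative representable value everywhere else; columns for cached (past) positions are prepended and are always visible. The sketch below reproduces that layout with plain Python lists so it can be inspected without PyTorch. It is a simplification: it uses `float("-inf")` in place of `torch.finfo(dtype).min` and leaves out the batch and head dimensions, and `causal_mask` is a name chosen for this example.

```python
NEG = float("-inf")  # stand-in for torch.finfo(dtype).min


def causal_mask(tgt_len, past_len=0):
    """Build a (tgt_len, past_len + tgt_len) additive causal mask as nested lists."""
    rows = []
    for query in range(tgt_len):
        past = [0.0] * past_len  # cached positions are always visible
        # A query may attend to keys at its own position or earlier.
        current = [0.0 if key <= query else NEG for key in range(tgt_len)]
        rows.append(past + current)
    return rows


for row in causal_mask(3):
    print(row)
# [0.0, -inf, -inf]
# [0.0, 0.0, -inf]
# [0.0, 0.0, 0.0]

with_cache = causal_mask(2, past_len=2)
print(with_cache[0])  # [0.0, 0.0, 0.0, -inf]
```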
+ # Copied from transformers.models.bart.modeling_bart._expand_mask
+ def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None):
+     """
+     Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`.
+     """
+     bsz, src_len = mask.size()
+     tgt_len = tgt_len if tgt_len is not None else src_len
+
+     expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
+
+     inverted_mask = 1.0 - expanded_mask
+
+     return inverted_mask.masked_fill(inverted_mask.to(torch.bool), torch.finfo(dtype).min)
+
+
+ def _ntk_find_correction_factor(num_rotations, dim, base=10000, max_position_embeddings=2048):
+     return (dim * math.log(max_position_embeddings / (num_rotations * 2 * math.pi))) / (
+             2 * math.log(base))  # Inverse dim formula to find dim based on number of rotations
+
+
+ def _ntk_find_correction_range(low_rot, high_rot, dim, base=10000, max_position_embeddings=2048):
+     low = math.floor(_ntk_find_correction_factor(low_rot, dim, base, max_position_embeddings))
+     high = math.ceil(_ntk_find_correction_factor(high_rot, dim, base, max_position_embeddings))
+     return max(low, 0), min(high, dim - 1)  # Clamp values just in case
+
+
+ def _ntk_linear_ramp_mask(min, max, dim):
+     if min == max:
+         max += 0.001  # Prevent singularity
+
+     linear_func = (torch.arange(dim, dtype=torch.float32) - min) / (max - min)
+     ramp_func = torch.clamp(linear_func, 0, 1)
+     return ramp_func
+
+
+ def _ntk_find_newbase_ntk(dim, base=10000, scale=1):
+     return base * scale ** (dim / (dim - 2))
+
+
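
The exponent `dim / (dim - 2)` in `_ntk_find_newbase_ntk` has a concrete effect on the rotary frequencies `base ** (-i / dim)` for `i = 0, 2, ..., dim - 2`: the highest frequency (`i = 0`) is left unchanged, while the lowest frequency (`i = dim - 2`) is divided by exactly `scale`, because `scale ** (dim / (dim - 2))` raised to the power `(dim - 2) / dim` is `scale`. The pure-Python sketch below checks this numerically; the values `dim = 128` and `scale = 16.0` are taken from `config.json` (5120 / 40 heads, and the `factor` field), and the helper names are illustrative.

```python
import math


def find_newbase_ntk(dim, base=10000, scale=1):
    """Same formula as _ntk_find_newbase_ntk in the diff."""
    return base * scale ** (dim / (dim - 2))


def inv_freq(dim, base):
    """Rotary inverse frequencies as a plain list instead of a tensor."""
    return [1.0 / (base ** (i / dim)) for i in range(0, dim, 2)]


dim, base, scale = 128, 10000, 16.0
original = inv_freq(dim, base)
rescaled = inv_freq(dim, find_newbase_ntk(dim, base, scale))

# Highest frequency: unchanged by the new base.
highest_ratio = original[0] / rescaled[0]
# Lowest frequency: slowed down by the full scale factor.
lowest_ratio = original[-1] / rescaled[-1]

print(highest_ratio)                                  # 1.0
print(math.isclose(lowest_ratio, scale, rel_tol=1e-9))  # True
```

In other words, the NTK base change interpolates the slowest-rotating dimension as much as linear scaling would, and progressively less for faster-rotating dimensions.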
+ def _ntk_build_inv_freq(dim, base, scaling_factor, ntk_factor, extrapolation_factor, original_max_position_embeddings,
+                         device):
+     # Interpolation constants found experimentally for LLaMA (might not be totally optimal though)
+     # Do not change unless there is a good reason for doing so!
+     beta_0 = 1.25
+     beta_1 = 0.75
+     gamma_0 = 16
+     gamma_1 = 2
+
+     # Three RoPE extrapolation/interpolation methods
+     inv_freq_base = 1.0 / (base ** (torch.arange(0, dim, 2).float().to(device) / dim))
+     inv_freq_linear = 1.0 / (scaling_factor * (base ** (torch.arange(0, dim, 2).float().to(device) / dim)))
+     inv_freq_ntk = 1.0 / (
+             _ntk_find_newbase_ntk(dim, base, scaling_factor) ** (torch.arange(0, dim, 2).float().to(device) / dim))
+
+     current_dtype = inv_freq_ntk.dtype
+     current_device = inv_freq_ntk.device
+
+     # Combine NTK and Linear
+     low, high = _ntk_find_correction_range(beta_0, beta_1, dim, base, original_max_position_embeddings)
+     inv_freq_mask = (1 - _ntk_linear_ramp_mask(low, high, dim // 2).type(current_dtype).to(current_device)) * ntk_factor
+     inv_freq = inv_freq_linear * (1 - inv_freq_mask) + inv_freq_ntk * inv_freq_mask
+
+     # Combine Extrapolation and NTK and Linear
+     low, high = _ntk_find_correction_range(gamma_0, gamma_1, dim, base, original_max_position_embeddings)
+     inv_freq_mask = (1 - _ntk_linear_ramp_mask(low, high, dim // 2).type(current_dtype).to(
+         current_device)) * extrapolation_factor
+     return inv_freq * (1 - inv_freq_mask) + inv_freq_base * inv_freq_mask
+
+
+ # Inverse dim formula to find dim based on number of rotations
+ def _yarn_find_correction_dim(num_rotations, dim, base=10000, max_position_embeddings=2048):
+     return (dim * math.log(max_position_embeddings / (num_rotations * 2 * math.pi))) / (2 * math.log(base))
+
+
+ # Find dim range bounds based on rotations
+ def _yarn_find_correction_range(low_rot, high_rot, dim, base=10000, max_position_embeddings=2048):
+     low = math.floor(_yarn_find_correction_dim(
+         low_rot, dim, base, max_position_embeddings))
+     high = math.ceil(_yarn_find_correction_dim(
+         high_rot, dim, base, max_position_embeddings))
+     return max(low, 0), min(high, dim - 1)  # Clamp values just in case
+
+
153
+ def _yarn_linear_ramp_mask(min, max, dim):
154
+ if min == max:
155
+ max += 0.001 # Prevent singularity
156
+
157
+ linear_func = (torch.arange(dim, dtype=torch.float32) - min) / (max - min)
158
+ ramp_func = torch.clamp(linear_func, 0, 1)
159
+ return ramp_func
160
+
161
+
162
+ def _yarn_get_mscale(scale=1):
163
+ if scale <= 1:
164
+ return 1.0
165
+ return 0.1 * math.log(scale) + 1.0
166
+
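To see how the YaRN helpers above fit together, the following sketch re-implements them with the standard `math` module only. The function names and the settings (`dim=128`, an original context of 4096, a 16x scale) are assumptions chosen for illustration, not values taken from this commit.

```python
import math


def find_correction_dim(num_rotations, dim, base=10000, max_pos=2048):
    # Index of the rotary frequency pair that completes `num_rotations`
    # full turns over `max_pos` positions.
    return (dim * math.log(max_pos / (num_rotations * 2 * math.pi))) / (2 * math.log(base))


def find_correction_range(low_rot, high_rot, dim, base=10000, max_pos=2048):
    low = math.floor(find_correction_dim(low_rot, dim, base, max_pos))
    high = math.ceil(find_correction_dim(high_rot, dim, base, max_pos))
    return max(low, 0), min(high, dim - 1)


def linear_ramp(lo, hi, n):
    if lo == hi:
        hi += 0.001  # avoid division by zero
    return [min(max((i - lo) / (hi - lo), 0.0), 1.0) for i in range(n)]


def get_mscale(scale=1):
    return 1.0 if scale <= 1 else 0.1 * math.log(scale) + 1.0


# Assumed illustrative settings.
dim, original_max_pos, scale = 128, 4096, 16
low, high = find_correction_range(32, 1, dim, 10000, original_max_pos)
ramp = linear_ramp(low, high, dim // 2)
# 1.0 keeps the original frequency, 0.0 uses the fully interpolated one.
extrapolation_weight = [1.0 - r for r in ramp]
mscale = get_mscale(scale)
```

High-frequency pairs (small indices) keep their original frequency, low-frequency pairs are interpolated, and the pairs between `low` and `high` are blended linearly.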
167
+
168
+ def compute_flash_attention_packed(flash_attn, q, k, v, attention_mask=None):
169
+ if attention_mask is not None:
170
+ attention_mask = attention_mask[:, 0, -1]
171
+ q, k, v = (q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2))
172
+
173
+ # q, k, v: [bs, seq_len, num_attention_heads, attn_head_size]
174
+ # attention_mask (float): [bs, seq_len]
175
+ batch_size, max_len = q.size(0), q.size(1)
176
+
177
+ qkv = torch.stack([q, k, v], dim=2).to(
178
+ torch.float16
179
+ ) # downcast to fp16 in case the input is fp32
180
+ cu_seqlens, max_seqlen = None, None
181
+
182
+ if attention_mask is None:
183
+ return flash_attn(qkv, cu_seqlens=cu_seqlens, max_seqlen=max_seqlen)
184
+ else:
185
+ # Limitation: non-contiguous attention mask will not be handled correctly
186
+ # The model attends to every token between the first and last non-masked token, so only left- and right-side padding is supported.
187
+ csums = (attention_mask >= 0).cumsum(dim=1)
188
+ ends = csums.argmax(dim=1) + 1
189
+ starts = ends - csums.max(dim=1).values
190
+ seqlens = ends - starts
191
+
192
+ qkv = torch.cat([qkv[i, starts[i]: ends[i]] for i in range(batch_size)], dim=0)
193
+ zero = torch.zeros_like(
194
+ seqlens[:1]
195
+ ) # torch.tensor([0]) with correct dtype and device
196
+ cu_seqlens = torch.cat([zero, seqlens.cumsum(dim=0)], dim=0).to(torch.int32)
197
+ max_seqlen = seqlens.max().item()
198
+
199
+ out = flash_attn(qkv, cu_seqlens=cu_seqlens, max_seqlen=max_seqlen)
200
+ # out: [num_unmasked_tokens, num_attention_heads, attn_head_size]
201
+
202
+ seqs = [out[start:end] for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:])]
203
+ # stack and pad sequences together
204
+ padded_seqs = [
205
+ F.pad(
206
+ seqs[i],
207
+ (0, 0) * (seqs[i].dim() - 1) + (starts[i], max_len - ends[i]),
208
+ value=0.0,
209
+ )
210
+ for i in range(batch_size)
211
+ ]
212
+
213
+ return torch.stack(padded_seqs).transpose(1, 2)
214
+
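The start/end bookkeeping in `compute_flash_attention_packed` can be traced by hand. The sketch below is a plain-Python rendering of the same cumulative-sum trick on lists. The helper name `packed_lengths` and the constant `NEG` are hypothetical, and the input is assumed to be the last query row of the additive mask for each sequence.

```python
def packed_lengths(additive_mask_rows):
    """Each row holds 0.0 for real tokens and a large negative value for padding."""
    starts, ends = [], []
    for row in additive_mask_rows:
        csum, running = [], 0
        for value in row:
            running += 1 if value >= 0 else 0
            csum.append(running)
        # The first index where the running count peaks is the last real token.
        end = csum.index(max(csum)) + 1
        starts.append(end - max(csum))
        ends.append(end)
    seqlens = [e - s for s, e in zip(starts, ends)]
    cu_seqlens = [0]
    for n in seqlens:
        cu_seqlens.append(cu_seqlens[-1] + n)
    return starts, ends, cu_seqlens, max(seqlens)


NEG = -1e9  # assumed stand-in for torch.finfo(dtype).min
rows = [
    [NEG, NEG, 0.0, 0.0, 0.0],  # left-padded, 3 real tokens
    [0.0, 0.0, 0.0, 0.0, NEG],  # right-padded, 4 real tokens
]
starts, ends, cu_seqlens, max_seqlen = packed_lengths(rows)
```

Because only one `start` and one `end` are derived per row, a mask with holes in the middle is treated as if the span between the first and last real token were fully visible, which is the limitation noted in the code comment.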
215
+
216
+ def compute_flash_attention_inference(query_states, key_states, value_states, attention_mask=None, dropout=0.0):
217
+ scale = query_states.shape[-1] ** (-0.5)
218
+
219
+ batch, _, seq_len_q, _ = query_states.shape
220
+ _, _, seq_len_k, _ = value_states.shape
221
+
222
+ query_states = rearrange(query_states, "b h s d -> b s h d").to(torch.float16)
223
+ key_states = rearrange(key_states, "b h s d -> b s h d").to(torch.float16)
224
+ value_states = rearrange(value_states, "b h s d -> b s h d").to(torch.float16)
225
+
226
+ if attention_mask is not None:
227
+ attention_mask = attention_mask[:, 0, -1]
228
+ csums = (attention_mask >= 0).cumsum(dim=1)
229
+ ends = csums.argmax(dim=1) + 1
230
+ starts = ends - csums.max(dim=1).values
231
+
232
+ query_states = torch.cat([query_states[i, starts[i]: ends[i]] for i in range(batch)], dim=0)
233
+ key_states = torch.cat([key_states[i, starts[i]: ends[i]] for i in range(batch)], dim=0)
234
+ value_states = torch.cat([value_states[i, starts[i]: ends[i]] for i in range(batch)], dim=0)
235
+
236
+ cu_seqlens_q = torch.arange(0, (batch + 1) * seq_len_q, step=seq_len_q, dtype=torch.int32,
237
+ device=query_states.device)
238
+
239
+ cu_seqlens_k = torch.arange(0, (batch + 1) * seq_len_k, step=seq_len_k, dtype=torch.int32,
240
+ device=key_states.device)
241
+
242
+ # No point returning attn_probs since it is not guaranteed to be correct
243
+ if seq_len_q == seq_len_k:
244
+ attn_output = flash_attn_varlen_func(query_states, key_states, value_states,
245
+ cu_seqlens_q, cu_seqlens_k, seq_len_q, seq_len_k,
246
+ dropout, scale, causal=True, return_attn_probs=False)
247
+ else:
248
+ attn_output = flash_attn_varlen_func(query_states, key_states, value_states,
249
+ cu_seqlens_q, cu_seqlens_k, seq_len_q, seq_len_k,
250
+ dropout, scale, causal=False, return_attn_probs=False)
251
+
252
+ return rearrange(attn_output, "(b s) h d -> b h s d", b=batch)
253
+
254
+
255
+ class LlamaRMSNorm(nn.Module):
256
+ def __init__(self, hidden_size, eps=1e-6):
257
+ """
258
+ LlamaRMSNorm is equivalent to T5LayerNorm
259
+ """
260
+ super().__init__()
261
+ self.weight = nn.Parameter(torch.ones(hidden_size))
262
+ self.variance_epsilon = eps
263
+
264
+ def forward(self, hidden_states):
265
+ input_dtype = hidden_states.dtype
266
+ hidden_states = hidden_states.to(torch.float32)
267
+ variance = hidden_states.pow(2).mean(-1, keepdim=True)
268
+ hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
269
+ return (self.weight * hidden_states).to(input_dtype)
270
+
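For reference, the normalization performed by `LlamaRMSNorm.forward` can be written for a single vector in plain Python. This is only a sketch: the function name `rms_norm` is hypothetical, and it skips the float32 upcast and the cast back to the input dtype that the module performs.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    # Mean of squares over the last dimension (no mean subtraction, unlike LayerNorm).
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    # Scale each normalized element by its learned weight.
    return [w * v * inv_rms for w, v in zip(weight, x)]


y = rms_norm([3.0, 4.0], [1.0, 1.0])
```

With unit weights the output has a mean square of approximately 1, whatever the scale of the input.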
271
+
272
+ class LlamaRotaryEmbedding(torch.nn.Module):
273
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
274
+ super().__init__()
275
+
276
+ self.dim = dim
277
+ self.max_position_embeddings = max_position_embeddings
278
+ self.base = base
279
+ inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
280
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
281
+
282
+ # Build here to make `torch.jit.trace` work.
283
+ self._set_cos_sin_cache(
284
+ seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
285
+ )
286
+
287
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
288
+ self.max_seq_len_cached = seq_len
289
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
290
+
291
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
292
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
293
+ emb = torch.cat((freqs, freqs), dim=-1)
294
+ self.register_buffer("cos_cached", emb.cos()[None, None, :, :].to(dtype), persistent=False)
295
+ self.register_buffer("sin_cached", emb.sin()[None, None, :, :].to(dtype), persistent=False)
296
+
297
+ def forward(self, x, seq_len=None):
298
+ # x: [bs, num_attention_heads, seq_len, head_size]
299
+ if seq_len > self.max_seq_len_cached:
300
+ self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
301
+
302
+ return (
303
+ self.cos_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
304
+ self.sin_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
305
+ )
306
+
307
+
308
+ class LlamaLinearScalingRotaryEmbedding(LlamaRotaryEmbedding):
309
+ """LlamaRotaryEmbedding extended with linear scaling. Credits to the Reddit user /u/kaiokendev"""
310
+
311
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
312
+ self.scaling_factor = scaling_factor
313
+ super().__init__(dim, max_position_embeddings, base, device)
314
+
315
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
316
+ self.max_seq_len_cached = seq_len
317
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
318
+ t = t / self.scaling_factor
319
+
320
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
321
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
322
+ emb = torch.cat((freqs, freqs), dim=-1)
323
+ self.register_buffer("cos_cached", emb.cos()[None, None, :, :].to(dtype), persistent=False)
324
+ self.register_buffer("sin_cached", emb.sin()[None, None, :, :].to(dtype), persistent=False)
325
+
326
+
327
+ class LlamaDynamicNTKScalingRotaryEmbedding(LlamaRotaryEmbedding):
328
+ """LlamaRotaryEmbedding extended with Dynamic NTK scaling. Credits to the Reddit users /u/bloc97 and /u/emozilla"""
329
+
330
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
331
+ self.scaling_factor = scaling_factor
332
+ super().__init__(dim, max_position_embeddings, base, device)
333
+
334
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
335
+ self.max_seq_len_cached = seq_len
336
+
337
+ if seq_len > self.max_position_embeddings:
338
+ base = self.base * (
339
+ (self.scaling_factor * seq_len / self.max_position_embeddings) - (self.scaling_factor - 1)
340
+ ) ** (self.dim / (self.dim - 2))
341
+ inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
342
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
343
+
344
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
345
+
346
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
347
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
348
+ emb = torch.cat((freqs, freqs), dim=-1)
349
+ self.register_buffer("cos_cached", emb.cos()[None, None, :, :].to(dtype), persistent=False)
350
+ self.register_buffer("sin_cached", emb.sin()[None, None, :, :].to(dtype), persistent=False)
351
+
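The base adjustment in `LlamaDynamicNTKScalingRotaryEmbedding._set_cos_sin_cache` depends only on scalars, so it can be explored without torch. In this sketch the function name `dynamic_ntk_base` and the default values are assumptions chosen for illustration.

```python
def dynamic_ntk_base(seq_len, dim=128, base=10000.0,
                     max_position_embeddings=2048, scaling_factor=1.0):
    # Within the trained context the base is left untouched.
    if seq_len <= max_position_embeddings:
        return base
    growth = (scaling_factor * seq_len / max_position_embeddings) - (scaling_factor - 1)
    return base * growth ** (dim / (dim - 2))


bases = [dynamic_ntk_base(n, scaling_factor=2.0) for n in (1024, 2048, 4096, 8192)]
```

The base stays at its original value up to `max_position_embeddings` and then grows with the sequence length, which lowers every rotary frequency so that longer sequences fit into the range of angles seen during training.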
352
+
353
+ class LlamaNTKByPartsRotaryEmbedding(LlamaRotaryEmbedding):
354
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0, ntk_factor=1.0,
355
+ extrapolation_factor=1.0, original_max_position_embeddings=2048):
356
+ super().__init__(dim, max_position_embeddings, base, device)
357
+
358
+ inv_freq = _ntk_build_inv_freq(dim, base, scaling_factor, ntk_factor, extrapolation_factor,
359
+ original_max_position_embeddings, device)
360
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
361
+
362
+ # Build here to make `torch.jit.trace` work.
363
+ self._set_cos_sin_cache(
364
+ seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
365
+ )
366
+
367
+
368
+ class LlamaYaRNScaledRotaryEmbedding(torch.nn.Module):
369
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, scale=1, original_max_position_embeddings=2048,
370
+ extrapolation_factor=1, attn_factor=1, beta_fast=32, beta_slow=1, finetuned=False, device=None):
371
+ super().__init__()
372
+
373
+ self.dim = dim
374
+ self.max_position_embeddings = max_position_embeddings
375
+ self.base = base
376
+ self.scale = scale
377
+ self.original_max_position_embeddings = original_max_position_embeddings
378
+ self.extrapolation_factor = extrapolation_factor
379
+ self.attn_factor = attn_factor
380
+ self.beta_fast = beta_fast
381
+ self.beta_slow = beta_slow
382
+
383
+ self.yarn(device)
384
+
385
+ # Build here to make `torch.jit.trace` work.
386
+ self.max_seq_len_cached = max_position_embeddings
387
+ t = torch.arange(self.max_seq_len_cached, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
388
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
389
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
390
+ emb = torch.cat((freqs, freqs), dim=-1)
391
+ dtype = torch.get_default_dtype()
392
+
393
+ self.register_buffer("cos_cached", (emb.cos() * self.mscale)[None, None, :, :].to(dtype), persistent=False)
394
+ self.register_buffer("sin_cached", (emb.sin() * self.mscale)[None, None, :, :].to(dtype), persistent=False)
395
+
396
+ def forward(self, x, seq_len=None):
397
+ # x: [bs, num_attention_heads, seq_len, head_size]
398
+ # This `if` block is unlikely to be run after we build sin/cos in `__init__`. Keep the logic here just in case.
399
+ if seq_len > self.max_seq_len_cached:
400
+ self.max_seq_len_cached = seq_len
401
+
402
+ t = torch.arange(self.max_seq_len_cached, device=x.device, dtype=self.inv_freq.dtype)
403
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
404
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
405
+ emb = torch.cat((freqs, freqs), dim=-1).to(x.device)
406
+
407
+ self.register_buffer("cos_cached", (emb.cos() * self.mscale)[None, None, :, :].to(x.dtype),
408
+ persistent=False)
409
+ self.register_buffer("sin_cached", (emb.sin() * self.mscale)[None, None, :, :].to(x.dtype),
410
+ persistent=False)
411
+ return (
412
+ self.cos_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
413
+ self.sin_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
414
+ )
415
+
416
+ def yarn(self, device):
417
+ pos_freqs = self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim)
418
+ inv_freq_extrapolation = 1.0 / pos_freqs
419
+ inv_freq_interpolation = 1.0 / (self.scale * pos_freqs)
420
+
421
+ low, high = _yarn_find_correction_range(self.beta_fast, self.beta_slow, self.dim, self.base,
422
+ self.original_max_position_embeddings)
423
+ inv_freq_mask = (1 - _yarn_linear_ramp_mask(low, high, self.dim // 2).float().to(
424
+ device)) * self.extrapolation_factor # Get n-d rotational scaling corrected for extrapolation
425
+ inv_freq = inv_freq_interpolation * (1 - inv_freq_mask) + inv_freq_extrapolation * inv_freq_mask
426
+
427
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
428
+ self.mscale = float(
429
+ _yarn_get_mscale(self.scale) * self.attn_factor) # Get n-d magnitude scaling corrected for interpolation
430
+
431
+
432
+ class LlamaDynamicYaRNScaledRotaryEmbedding(torch.nn.Module):
433
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, original_max_position_embeddings=2048,
434
+ extrapolation_factor=1, attn_factor=1, beta_fast=32, beta_slow=1, finetuned=False, device=None):
435
+ super().__init__()
436
+
437
+ self.dim = dim
438
+ self.max_position_embeddings = max_position_embeddings
439
+ self.base = base
440
+ self.original_max_position_embeddings = original_max_position_embeddings
441
+ self.extrapolation_factor = extrapolation_factor
442
+ self.attn_factor = attn_factor
443
+ self.beta_fast = beta_fast
444
+ self.beta_slow = beta_slow
445
+
446
+ if finetuned:
447
+ self.yarn(self.max_position_embeddings / self.original_max_position_embeddings, device)
448
+ else:
449
+ inv_freq = 1.0 / \
450
+ (base ** (torch.arange(0, dim, 2).float().to(device) / dim))
451
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
452
+ self.mscale = 1
453
+
454
+ # Build here to make `torch.jit.trace` work.
455
+ self.max_seq_len_cached = max_position_embeddings
456
+ t = torch.arange(self.max_seq_len_cached, device=self.inv_freq.device, dtype=torch.float32)
457
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
458
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
459
+ emb = torch.cat((freqs, freqs), dim=-1)
460
+ dtype = torch.get_default_dtype()
461
+
462
+ self.register_buffer("cos_cached", (emb.cos() * self.mscale)[None, None, :, :].to(dtype), persistent=False)
463
+ self.register_buffer("sin_cached", (emb.sin() * self.mscale)[None, None, :, :].to(dtype), persistent=False)
464
+
465
+ def forward(self, x, seq_len=None):
466
+ # x: [bs, num_attention_heads, seq_len, head_size]
467
+ # This `if` block is unlikely to be run after we build sin/cos in `__init__`. Keep the logic here just in case.
468
+ if seq_len > self.max_seq_len_cached:
469
+ self.max_seq_len_cached = seq_len
470
+
471
+ self.yarn(seq_len / self.max_position_embeddings, x.device)
472
+
473
+ t = torch.arange(self.max_seq_len_cached, device=x.device, dtype=self.inv_freq.dtype)
474
+ freqs = torch.einsum("i,j->ij", t, self.inv_freq)
475
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
476
+ emb = torch.cat((freqs, freqs), dim=-1).to(x.device)
477
+
478
+ self.register_buffer("cos_cached", (emb.cos() * self.mscale)[None, None, :, :].to(x.dtype),
479
+ persistent=False)
480
+ self.register_buffer("sin_cached", (emb.sin() * self.mscale)[None, None, :, :].to(x.dtype),
481
+ persistent=False)
482
+ return (
483
+ self.cos_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
484
+ self.sin_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
485
+ )
486
+
487
+ def yarn(self, scale, device):
488
+ pos_freqs = self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim)
489
+ inv_freq_extrapolation = 1.0 / pos_freqs
490
+ inv_freq_interpolation = 1.0 / (scale * pos_freqs)
491
+
492
+ low, high = _yarn_find_correction_range(self.beta_fast, self.beta_slow, self.dim, self.base,
493
+ self.original_max_position_embeddings)
494
+ inv_freq_mask = (1 - _yarn_linear_ramp_mask(low, high, self.dim // 2).float().to(
495
+ device)) * self.extrapolation_factor # Get n-d rotational scaling corrected for extrapolation
496
+ inv_freq = inv_freq_interpolation * (1 - inv_freq_mask) + inv_freq_extrapolation * inv_freq_mask
497
+
498
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
499
+ self.mscale = float(
500
+ _yarn_get_mscale(scale) * self.attn_factor) # Get n-d magnitude scaling corrected for interpolation
501
+
502
+
503
+ def rotate_half(x):
504
+ """Rotates half the hidden dims of the input."""
505
+ x1 = x[..., : x.shape[-1] // 2]
506
+ x2 = x[..., x.shape[-1] // 2:]
507
+ return torch.cat((-x2, x1), dim=-1)
508
+
509
+
510
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids):
511
+ # The first two dimensions of cos and sin are always 1, so we can `squeeze` them.
512
+ cos = cos.squeeze(1).squeeze(0) # [seq_len, dim]
513
+ sin = sin.squeeze(1).squeeze(0) # [seq_len, dim]
514
+ cos = cos[position_ids].unsqueeze(1) # [bs, 1, seq_len, dim]
515
+ sin = sin[position_ids].unsqueeze(1) # [bs, 1, seq_len, dim]
516
+ q_embed = (q * cos) + (rotate_half(q) * sin)
517
+ k_embed = (k * cos) + (rotate_half(k) * sin)
518
+ return q_embed, k_embed
519
+
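A small numeric check can make the purpose of `rotate_half` and `apply_rotary_pos_emb` concrete: after rotation, the query-key dot product depends only on the distance between the two positions. The sketch below uses plain Python lists for a single head. The helper names and the example vectors are hypothetical.

```python
import math


def rope_cos_sin(pos, dim, base=10000.0):
    inv_freq = [1.0 / (base ** (i / dim)) for i in range(0, dim, 2)]
    # Repeating the list mirrors torch.cat((freqs, freqs), dim=-1).
    angles = [pos * f for f in inv_freq] * 2
    return [math.cos(a) for a in angles], [math.sin(a) for a in angles]


def rotate_half(x):
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def apply_rope(x, pos, base=10000.0):
    cos, sin = rope_cos_sin(pos, len(x), base)
    rot = rotate_half(x)
    return [xi * c + ri * s for xi, ri, c, s in zip(x, rot, cos, sin)]


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.75, -0.5, 1.0]
# Both pairs of positions are two tokens apart.
score_near_start = dot(apply_rope(q, 3), apply_rope(k, 1))
score_shifted = dot(apply_rope(q, 103), apply_rope(k, 101))
```

The two scores agree up to floating-point error, and the rotation leaves the norm of each vector unchanged.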
520
+
521
+ class LlamaMLP(nn.Module):
522
+ def __init__(self, config):
523
+ super().__init__()
524
+ self.config = config
525
+ self.hidden_size = config.hidden_size
526
+ self.intermediate_size = config.intermediate_size
527
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
528
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
529
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
530
+ self.act_fn = ACT2FN[config.hidden_act]
531
+
532
+ def forward(self, x):
533
+ if self.config.pretraining_tp > 1:
534
+ slice = self.intermediate_size // self.config.pretraining_tp
535
+ gate_proj_slices = self.gate_proj.weight.split(slice, dim=0)
536
+ up_proj_slices = self.up_proj.weight.split(slice, dim=0)
537
+ down_proj_slices = self.down_proj.weight.split(slice, dim=1)
538
+
539
+ gate_proj = torch.cat(
540
+ [F.linear(x, gate_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1
541
+ )
542
+ up_proj = torch.cat([F.linear(x, up_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1)
543
+
544
+ intermediate_states = (self.act_fn(gate_proj) * up_proj).split(slice, dim=2)
545
+ down_proj = [
546
+ F.linear(intermediate_states[i], down_proj_slices[i]) for i in range(self.config.pretraining_tp)
547
+ ]
548
+ down_proj = sum(down_proj)
549
+ else:
550
+ down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
551
+
552
+ return down_proj
553
+
554
+
555
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
556
+ """
557
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
558
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
559
+ """
560
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
561
+ if n_rep == 1:
562
+ return hidden_states
563
+ hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
564
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
565
+
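The head ordering produced by `repeat_kv` matters for grouped-query attention: each key/value head is repeated in place rather than the whole set being tiled. A plain-Python sketch on head labels, with the hypothetical name `repeat_kv_heads` and assumed head counts:

```python
def repeat_kv_heads(kv_heads, n_rep):
    # Each K/V head is repeated n_rep times in place, like
    # torch.repeat_interleave along the head axis.
    return [head for head in kv_heads for _ in range(n_rep)]


num_attention_heads, num_key_value_heads = 8, 2  # assumed example sizes
n_rep = num_attention_heads // num_key_value_heads
expanded = repeat_kv_heads(["kv0", "kv1"], n_rep)
```

With this layout, query head `h` reads from key/value head `h // n_rep`.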
566
+
567
+ class LlamaAttention(nn.Module):
568
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
569
+
570
+ def __init__(self, config: LlamaConfig):
571
+ super().__init__()
572
+ self.config = config
573
+ self.hidden_size = config.hidden_size
574
+ self.num_heads = config.num_attention_heads
575
+ self.head_dim = self.hidden_size // self.num_heads
576
+ self.num_key_value_heads = config.num_key_value_heads
577
+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads
578
+ self.max_position_embeddings = config.max_position_embeddings
579
+
580
+ if (self.head_dim * self.num_heads) != self.hidden_size:
581
+ raise ValueError(
582
+ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
583
+ f" and `num_heads`: {self.num_heads})."
584
+ )
585
+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
586
+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
587
+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
588
+ self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
589
+ self._init_rope()
590
+ self.use_flash_attention = config.use_flash_attention
591
+ if self.use_flash_attention:
592
+ if not have_flash_attention:
593
+ raise RuntimeError("Flash Attention 2 not installed")
594
+ self.flash_attention = FlashSelfAttention(causal=True)
595
+
596
+ def _init_rope(self):
597
+ if self.config.rope_scaling is None:
598
+ self.rotary_emb = LlamaRotaryEmbedding(self.head_dim, max_position_embeddings=self.max_position_embeddings)
599
+ else:
600
+ scaling_type = self.config.rope_scaling["type"]
601
+ scaling_factor = self.config.rope_scaling["factor"]
602
+ if scaling_type == "linear":
603
+ self.rotary_emb = LlamaLinearScalingRotaryEmbedding(
604
+ self.head_dim, max_position_embeddings=self.max_position_embeddings, scaling_factor=scaling_factor
605
+ )
606
+ elif scaling_type == "dynamic":
607
+ self.rotary_emb = LlamaDynamicNTKScalingRotaryEmbedding(
608
+ self.head_dim, max_position_embeddings=self.max_position_embeddings, scaling_factor=scaling_factor
609
+ )
610
+ elif scaling_type == "ntk-by-parts":
611
+ original_max_position_embeddings = self.config.rope_scaling["original_max_position_embeddings"]
612
+ self.rotary_emb = LlamaNTKByPartsRotaryEmbedding(
613
+ self.head_dim, max_position_embeddings=self.max_position_embeddings, scaling_factor=scaling_factor,
614
+ original_max_position_embeddings=original_max_position_embeddings
615
+ )
616
+ elif scaling_type == "yarn":
617
+ original_max_position_embeddings = self.config.rope_scaling["original_max_position_embeddings"]
618
+ self.rotary_emb = LlamaYaRNScaledRotaryEmbedding(
619
+ self.head_dim, max_position_embeddings=self.max_position_embeddings, scale=scaling_factor,
620
+ original_max_position_embeddings=original_max_position_embeddings
621
+ )
622
+ elif scaling_type == "dynamic-yarn":
623
+ original_max_position_embeddings = self.config.rope_scaling["original_max_position_embeddings"]
624
+ self.rotary_emb = LlamaDynamicYaRNScaledRotaryEmbedding(
625
+ self.head_dim, max_position_embeddings=self.max_position_embeddings,
626
+ original_max_position_embeddings=original_max_position_embeddings
627
+ )
628
+ else:
629
+ raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
630
+
631
+ def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
632
+ return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
633
+
634
+ def forward(
635
+ self,
636
+ hidden_states: torch.Tensor,
637
+ attention_mask: Optional[torch.Tensor] = None,
638
+ position_ids: Optional[torch.LongTensor] = None,
639
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
640
+ output_attentions: bool = False,
641
+ use_cache: bool = False,
642
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
643
+ bsz, q_len, _ = hidden_states.size()
644
+
645
+ if self.config.pretraining_tp > 1:
646
+ key_value_slicing = (self.num_key_value_heads * self.head_dim) // self.config.pretraining_tp
647
+ query_slices = self.q_proj.weight.split(
648
+ (self.num_heads * self.head_dim) // self.config.pretraining_tp, dim=0
649
+ )
650
+ key_slices = self.k_proj.weight.split(key_value_slicing, dim=0)
651
+ value_slices = self.v_proj.weight.split(key_value_slicing, dim=0)
652
+
653
+ query_states = [F.linear(hidden_states, query_slices[i]) for i in range(self.config.pretraining_tp)]
654
+ query_states = torch.cat(query_states, dim=-1)
655
+
656
+ key_states = [F.linear(hidden_states, key_slices[i]) for i in range(self.config.pretraining_tp)]
657
+ key_states = torch.cat(key_states, dim=-1)
658
+
659
+ value_states = [F.linear(hidden_states, value_slices[i]) for i in range(self.config.pretraining_tp)]
660
+ value_states = torch.cat(value_states, dim=-1)
661
+
662
+ else:
663
+ query_states = self.q_proj(hidden_states)
664
+ key_states = self.k_proj(hidden_states)
665
+ value_states = self.v_proj(hidden_states)
666
+
667
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
668
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
669
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
670
+
671
+ kv_seq_len = key_states.shape[-2]
672
+ if past_key_value is not None:
673
+ kv_seq_len += past_key_value[0].shape[-2]
674
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
675
+ query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
676
+
677
+ if past_key_value is not None:
678
+ # reuse k, v, self_attention
679
+ key_states = torch.cat([past_key_value[0], key_states], dim=2)
680
+ value_states = torch.cat([past_key_value[1], value_states], dim=2)
681
+
682
+ past_key_value = (key_states, value_states) if use_cache else None
683
+
684
+ # repeat k/v heads if n_kv_heads < n_heads
685
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
686
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
687
+
688
+ if self.use_flash_attention and not output_attentions:
689
+ out_dtype = value_states.dtype
690
+ if self.training or query_states.shape == key_states.shape:
691
+ self.flash_attention.train(self.training)
692
+ attn_output = compute_flash_attention_packed(self.flash_attention, query_states, key_states,
693
+ value_states, attention_mask)
694
+ else:
695
+ attn_output = compute_flash_attention_inference(query_states, key_states, value_states, attention_mask)
696
+ attn_output = attn_output.to(out_dtype)
697
+ else:
698
+ attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
699
+
700
+ if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
701
+ raise ValueError(
702
+ f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
703
+ f" {attn_weights.size()}"
704
+ )
705
+
706
+ if attention_mask is not None:
707
+ if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
708
+ raise ValueError(
709
+ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
710
+ )
711
+ attn_weights = attn_weights + attention_mask
712
+
713
+ # upcast attention to fp32
714
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
715
+ attn_output = torch.matmul(attn_weights, value_states)
716
+
717
+ if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
718
+ raise ValueError(
719
+ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
720
+ f" {attn_output.size()}"
721
+ )
722
+
723
+ attn_output = attn_output.transpose(1, 2).contiguous()
724
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
725
+
726
+ if self.config.pretraining_tp > 1:
727
+ attn_output = attn_output.split(self.hidden_size // self.config.pretraining_tp, dim=2)
728
+ o_proj_slices = self.o_proj.weight.split(self.hidden_size // self.config.pretraining_tp, dim=1)
729
+ attn_output = sum([F.linear(attn_output[i], o_proj_slices[i]) for i in range(self.config.pretraining_tp)])
730
+ else:
731
+ attn_output = self.o_proj(attn_output)
732
+
733
+ if not output_attentions:
734
+ attn_weights = None
735
+
736
+ return attn_output, attn_weights, past_key_value
737
+
738
+
739
+ class LlamaDecoderLayer(nn.Module):
740
+ def __init__(self, config: LlamaConfig):
741
+ super().__init__()
742
+ self.hidden_size = config.hidden_size
743
+ self.self_attn = LlamaAttention(config=config)
744
+ self.mlp = LlamaMLP(config)
745
+ self.input_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
746
+ self.post_attention_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
747
+
748
+ def forward(
749
+ self,
750
+ hidden_states: torch.Tensor,
751
+ attention_mask: Optional[torch.Tensor] = None,
752
+ position_ids: Optional[torch.LongTensor] = None,
753
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
754
+ output_attentions: Optional[bool] = False,
755
+ use_cache: Optional[bool] = False,
756
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
757
+ """
758
+ Args:
759
+ hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
760
+ attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
761
+ `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
762
+ output_attentions (`bool`, *optional*):
763
+ Whether or not to return the attention tensors of all attention layers. See `attentions` under
764
+ returned tensors for more detail.
765
+ use_cache (`bool`, *optional*):
766
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
767
+ (see `past_key_values`).
768
+ past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
769
+ """
770
+
771
+ residual = hidden_states
772
+
773
+ hidden_states = self.input_layernorm(hidden_states)
774
+
775
+ # Self Attention
776
+ hidden_states, self_attn_weights, present_key_value = self.self_attn(
777
+ hidden_states=hidden_states,
778
+ attention_mask=attention_mask,
779
+ position_ids=position_ids,
780
+ past_key_value=past_key_value,
781
+ output_attentions=output_attentions,
782
+ use_cache=use_cache,
783
+ )
784
+ hidden_states = residual + hidden_states
785
+
786
+ # Fully Connected
787
+ residual = hidden_states
788
+ hidden_states = self.post_attention_layernorm(hidden_states)
789
+ hidden_states = self.mlp(hidden_states)
790
+ hidden_states = residual + hidden_states
791
+
792
+ outputs = (hidden_states,)
793
+
794
+ if output_attentions:
795
+ outputs += (self_attn_weights,)
796
+
797
+ if use_cache:
798
+ outputs += (present_key_value,)
799
+
800
+ return outputs
801
+
802
+
803
+ LLAMA_START_DOCSTRING = r"""
804
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
805
+ library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
806
+ etc.)
807
+
808
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
809
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
810
+ and behavior.
811
+
812
+ Parameters:
813
+ config ([`LlamaConfig`]):
814
+ Model configuration class with all the parameters of the model. Initializing with a config file does not
815
+ load the weights associated with the model, only the configuration. Check out the
816
+ [`~PreTrainedModel.from_pretrained`] method to load the model weights.
817
+ """
818
+
819
+
820
+ @add_start_docstrings(
821
+ "The bare LLaMA Model outputting raw hidden-states without any specific head on top.",
822
+ LLAMA_START_DOCSTRING,
823
+ )
824
+ class LlamaPreTrainedModel(PreTrainedModel):
825
+ config_class = LlamaConfig
826
+ base_model_prefix = "model"
827
+ supports_gradient_checkpointing = True
828
+ _no_split_modules = ["LlamaDecoderLayer"]
829
+ _skip_keys_device_placement = "past_key_values"
830
+
831
+ def _init_weights(self, module):
832
+ std = self.config.initializer_range
833
+ if isinstance(module, nn.Linear):
834
+ module.weight.data.normal_(mean=0.0, std=std)
835
+ if module.bias is not None:
836
+ module.bias.data.zero_()
837
+ elif isinstance(module, nn.Embedding):
838
+ module.weight.data.normal_(mean=0.0, std=std)
839
+ if module.padding_idx is not None:
840
+ module.weight.data[module.padding_idx].zero_()
841
+
842
+ def _set_gradient_checkpointing(self, module, value=False):
843
+ if isinstance(module, LlamaModel):
844
+ module.gradient_checkpointing = value
845
+
846
+
847
+ LLAMA_INPUTS_DOCSTRING = r"""
848
+ Args:
849
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
850
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
851
+ it.
852
+
853
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
854
+ [`PreTrainedTokenizer.__call__`] for details.
855
+
856
+ [What are input IDs?](../glossary#input-ids)
857
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
858
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
859
+
860
+ - 1 for tokens that are **not masked**,
861
+ - 0 for tokens that are **masked**.
862
+
863
+ [What are attention masks?](../glossary#attention-mask)
864
+
865
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
866
+ [`PreTrainedTokenizer.__call__`] for details.
867
+
868
+ If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
869
+ `past_key_values`).
870
+
871
+ If you want to change padding behavior, you should read [`LlamaModel._prepare_decoder_attention_mask`]
872
+ and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
873
+ information on the default strategy.
874
+
875
+ - 1 indicates the head is **not masked**,
876
+ - 0 indicates the head is **masked**.
877
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
878
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
879
+ config.max_position_embeddings - 1]`.
880
+
881
+ [What are position IDs?](../glossary#position-ids)
882
+ past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
883
+ Tuple of `tuple(torch.FloatTensor)` of length `config.num_hidden_layers`, with each tuple having 2 tensors of shape
884
+ `(batch_size, num_heads, sequence_length, embed_size_per_head)` (the cached key and value states). This
885
+ decoder-only model has no encoder, so there are no additional cross-attention tensors.
886
+
887
+ Contains pre-computed hidden-states (keys and values in the self-attention blocks) that can be used (see
888
+ `past_key_values` input) to speed up sequential decoding.
889
+
890
+ If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that
891
+ don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
892
+ `input_ids` of shape `(batch_size, sequence_length)`.
893
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
894
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
895
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
896
+ model's internal embedding lookup matrix.
897
+ use_cache (`bool`, *optional*):
898
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
899
+ `past_key_values`).
900
+ output_attentions (`bool`, *optional*):
901
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
902
+ tensors for more detail.
903
+ output_hidden_states (`bool`, *optional*):
904
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
905
+ more detail.
906
+ return_dict (`bool`, *optional*):
907
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
908
+ """
909
+
910
+
911
+ @add_start_docstrings(
912
+ "The bare LLaMA Model outputting raw hidden-states without any specific head on top.",
913
+ LLAMA_START_DOCSTRING,
914
+ )
915
+ class LlamaModel(LlamaPreTrainedModel):
916
+ """
917
+ Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`LlamaDecoderLayer`]
918
+
919
+ Args:
920
+ config: LlamaConfig
921
+ """
922
+
923
+ def __init__(self, config: LlamaConfig):
924
+ super().__init__(config)
925
+ self.padding_idx = config.pad_token_id
926
+ self.vocab_size = config.vocab_size
927
+
928
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
929
+ self.layers = nn.ModuleList([LlamaDecoderLayer(config) for _ in range(config.num_hidden_layers)])
930
+ self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
931
+
932
+ self.gradient_checkpointing = False
933
+ self.use_flash_attention = config.use_flash_attention
934
+ # Initialize weights and apply final processing
935
+ self.post_init()
936
+
937
+ def get_input_embeddings(self):
938
+ return self.embed_tokens
939
+
940
+ def set_input_embeddings(self, value):
941
+ self.embed_tokens = value
942
+
943
+ # Copied from transformers.models.bart.modeling_bart.BartDecoder._prepare_decoder_attention_mask
944
+ def _prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length):
945
+ # create causal mask
946
+ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
947
+ combined_attention_mask = None
948
+ if input_shape[-1] > 1:
949
+ combined_attention_mask = _make_causal_mask(
950
+ input_shape,
951
+ inputs_embeds.dtype,
952
+ device=inputs_embeds.device,
953
+ past_key_values_length=past_key_values_length,
954
+ )
955
+
956
+ if attention_mask is not None:
957
+ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
958
+ expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to(
959
+ inputs_embeds.device
960
+ )
961
+ combined_attention_mask = (
962
+ expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask
963
+ )
964
+
965
+ return combined_attention_mask
966
+
967
+ @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
968
+ def forward(
969
+ self,
970
+ input_ids: torch.LongTensor = None,
971
+ attention_mask: Optional[torch.Tensor] = None,
972
+ position_ids: Optional[torch.LongTensor] = None,
973
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
974
+ inputs_embeds: Optional[torch.FloatTensor] = None,
975
+ use_cache: Optional[bool] = None,
976
+ output_attentions: Optional[bool] = None,
977
+ output_hidden_states: Optional[bool] = None,
978
+ return_dict: Optional[bool] = None,
979
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
980
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
981
+ output_hidden_states = (
982
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
983
+ )
984
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
985
+
986
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
987
+
988
+ # retrieve input_ids and inputs_embeds
989
+ if input_ids is not None and inputs_embeds is not None:
990
+ raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
991
+ elif input_ids is not None:
992
+ batch_size, seq_length = input_ids.shape
993
+ elif inputs_embeds is not None:
994
+ batch_size, seq_length, _ = inputs_embeds.shape
995
+ else:
996
+ raise ValueError("You have to specify either input_ids or inputs_embeds")
997
+
998
+ seq_length_with_past = seq_length
999
+ past_key_values_length = 0
1000
+
1001
+ if past_key_values is not None:
1002
+ past_key_values_length = past_key_values[0][0].shape[2]
1003
+ seq_length_with_past = seq_length_with_past + past_key_values_length
1004
+
1005
+ if position_ids is None:
1006
+ device = input_ids.device if input_ids is not None else inputs_embeds.device
1007
+ position_ids = torch.arange(
1008
+ past_key_values_length, seq_length + past_key_values_length, dtype=torch.long, device=device
1009
+ )
1010
+ position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
1011
+ else:
1012
+ position_ids = position_ids.view(-1, seq_length).long()
1013
+
1014
+ if inputs_embeds is None:
1015
+ inputs_embeds = self.embed_tokens(input_ids)
1016
+ # default to an all-ones mask over cached and current positions, then build the causal mask
1017
+ if attention_mask is None:
1018
+ attention_mask = torch.ones(
1019
+ (batch_size, seq_length_with_past), dtype=torch.bool, device=inputs_embeds.device
1020
+ )
1021
+ attention_mask = self._prepare_decoder_attention_mask(
1022
+ attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length
1023
+ )
1024
+
1025
+ hidden_states = inputs_embeds
1026
+
1027
+ if self.gradient_checkpointing and self.training:
1028
+ if use_cache:
1029
+ logger.warning_once(
1030
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
1031
+ )
1032
+ use_cache = False
1033
+
1034
+ # decoder layers
1035
+ all_hidden_states = () if output_hidden_states else None
1036
+ all_self_attns = () if output_attentions else None
1037
+ next_decoder_cache = () if use_cache else None
1038
+
1039
+ for idx, decoder_layer in enumerate(self.layers):
1040
+ if output_hidden_states:
1041
+ all_hidden_states += (hidden_states,)
1042
+
1043
+ past_key_value = past_key_values[idx] if past_key_values is not None else None
1044
+
1045
+ if self.gradient_checkpointing and self.training:
1046
+
1047
+ def create_custom_forward(module):
1048
+ def custom_forward(*inputs):
1049
+ # `inputs` already ends with None for past_key_value; the trailing None below is use_cache
1050
+ return module(*inputs, output_attentions, None)
1051
+
1052
+ return custom_forward
1053
+
1054
+ layer_outputs = torch.utils.checkpoint.checkpoint(
1055
+ create_custom_forward(decoder_layer),
1056
+ hidden_states,
1057
+ attention_mask,
1058
+ position_ids,
1059
+ None,
1060
+ )
1061
+ else:
1062
+ layer_outputs = decoder_layer(
1063
+ hidden_states,
1064
+ attention_mask=attention_mask,
1065
+ position_ids=position_ids,
1066
+ past_key_value=past_key_value,
1067
+ output_attentions=output_attentions,
1068
+ use_cache=use_cache,
1069
+ )
1070
+
1071
+ hidden_states = layer_outputs[0]
1072
+
1073
+ if use_cache:
1074
+ next_decoder_cache += (layer_outputs[2 if output_attentions else 1],)
1075
+
1076
+ if output_attentions:
1077
+ all_self_attns += (layer_outputs[1],)
1078
+
1079
+ hidden_states = self.norm(hidden_states)
1080
+
1081
+ # add hidden states from the last decoder layer
1082
+ if output_hidden_states:
1083
+ all_hidden_states += (hidden_states,)
1084
+
1085
+ next_cache = next_decoder_cache if use_cache else None
1086
+ if not return_dict:
1087
+ return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
1088
+ return BaseModelOutputWithPast(
1089
+ last_hidden_state=hidden_states,
1090
+ past_key_values=next_cache,
1091
+ hidden_states=all_hidden_states,
1092
+ attentions=all_self_attns,
1093
+ )
1094
+
1095
+
1096
+ class LlamaForCausalLM(LlamaPreTrainedModel):
1097
+ _tied_weights_keys = ["lm_head.weight"]
1098
+
1099
+ def __init__(self, config):
1100
+ super().__init__(config)
1101
+ self.model = LlamaModel(config)
1102
+ self.vocab_size = config.vocab_size
1103
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
1104
+
1105
+ # Initialize weights and apply final processing
1106
+ self.post_init()
1107
+
1108
+ def get_input_embeddings(self):
1109
+ return self.model.embed_tokens
1110
+
1111
+ def set_input_embeddings(self, value):
1112
+ self.model.embed_tokens = value
1113
+
1114
+ def get_output_embeddings(self):
1115
+ return self.lm_head
1116
+
1117
+ def set_output_embeddings(self, new_embeddings):
1118
+ self.lm_head = new_embeddings
1119
+
1120
+ def set_decoder(self, decoder):
1121
+ self.model = decoder
1122
+
1123
+ def get_decoder(self):
1124
+ return self.model
1125
+
1126
+ @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
1127
+ @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
1128
+ def forward(
1129
+ self,
1130
+ input_ids: torch.LongTensor = None,
1131
+ attention_mask: Optional[torch.Tensor] = None,
1132
+ position_ids: Optional[torch.LongTensor] = None,
1133
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1134
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1135
+ labels: Optional[torch.LongTensor] = None,
1136
+ use_cache: Optional[bool] = None,
1137
+ output_attentions: Optional[bool] = None,
1138
+ output_hidden_states: Optional[bool] = None,
1139
+ return_dict: Optional[bool] = None,
1140
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
1141
+ r"""
1142
+ Args:
1143
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1144
+ Labels for computing the causal language modeling (next-token prediction) loss. Indices should either be in `[0, ...,
1145
+ config.vocab_size - 1]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
1146
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size - 1]`.
1147
+
1148
+ Returns:
1149
+
1150
+ Example:
1151
+
1152
+ ```python
1153
+ >>> from transformers import AutoTokenizer, LlamaForCausalLM
1154
+
1155
+ >>> model = LlamaForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
1156
+ >>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
1157
+
1158
+ >>> prompt = "Hey, are you conscious? Can you talk to me?"
1159
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
1160
+
1161
+ >>> # Generate
1162
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
1163
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
1164
+ "Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
1165
+ ```"""
1166
+
1167
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1168
+ output_hidden_states = (
1169
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1170
+ )
1171
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1172
+
1173
+ # decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
1174
+ outputs = self.model(
1175
+ input_ids=input_ids,
1176
+ attention_mask=attention_mask,
1177
+ position_ids=position_ids,
1178
+ past_key_values=past_key_values,
1179
+ inputs_embeds=inputs_embeds,
1180
+ use_cache=use_cache,
1181
+ output_attentions=output_attentions,
1182
+ output_hidden_states=output_hidden_states,
1183
+ return_dict=return_dict,
1184
+ )
1185
+
1186
+ hidden_states = outputs[0]
1187
+ if self.config.pretraining_tp > 1:
1188
+ lm_head_slices = self.lm_head.weight.split(self.vocab_size // self.config.pretraining_tp, dim=0)
1189
+ logits = [F.linear(hidden_states, lm_head_slices[i]) for i in range(self.config.pretraining_tp)]
1190
+ logits = torch.cat(logits, dim=-1)
1191
+ else:
1192
+ logits = self.lm_head(hidden_states)
1193
+ logits = logits.float()
1194
+
1195
+ loss = None
1196
+ if labels is not None:
1197
+ # Shift so that tokens < n predict n
1198
+ shift_logits = logits[..., :-1, :].contiguous()
1199
+ shift_labels = labels[..., 1:].contiguous()
1200
+ # Flatten the tokens
1201
+ loss_fct = CrossEntropyLoss()
1202
+ shift_logits = shift_logits.view(-1, self.config.vocab_size)
1203
+ shift_labels = shift_labels.view(-1)
1204
+ # Enable model parallelism
1205
+ shift_labels = shift_labels.to(shift_logits.device)
1206
+ loss = loss_fct(shift_logits, shift_labels)
1207
+
1208
+ if not return_dict:
1209
+ output = (logits,) + outputs[1:]
1210
+ return (loss,) + output if loss is not None else output
1211
+
1212
+ return CausalLMOutputWithPast(
1213
+ loss=loss,
1214
+ logits=logits,
1215
+ past_key_values=outputs.past_key_values,
1216
+ hidden_states=outputs.hidden_states,
1217
+ attentions=outputs.attentions,
1218
+ )
1219
+
1220
+ def prepare_inputs_for_generation(
1221
+ self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
1222
+ ):
1223
+ if past_key_values:
1224
+ input_ids = input_ids[:, -1:]
1225
+
1226
+ position_ids = kwargs.get("position_ids", None)
1227
+ if attention_mask is not None and position_ids is None:
1228
+ # create position_ids on the fly for batch generation
1229
+ position_ids = attention_mask.long().cumsum(-1) - 1
1230
+ position_ids.masked_fill_(attention_mask == 0, 1)
1231
+ if past_key_values:
1232
+ position_ids = position_ids[:, -1].unsqueeze(-1)
1233
+
1234
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
1235
+ if inputs_embeds is not None and past_key_values is None:
1236
+ model_inputs = {"inputs_embeds": inputs_embeds}
1237
+ else:
1238
+ model_inputs = {"input_ids": input_ids}
1239
+
1240
+ model_inputs.update(
1241
+ {
1242
+ "position_ids": position_ids,
1243
+ "past_key_values": past_key_values,
1244
+ "use_cache": kwargs.get("use_cache"),
1245
+ "attention_mask": attention_mask,
1246
+ }
1247
+ )
1248
+ return model_inputs
1249
+
1250
+ @staticmethod
1251
+ def _reorder_cache(past_key_values, beam_idx):
1252
+ reordered_past = ()
1253
+ for layer_past in past_key_values:
1254
+ reordered_past += (
1255
+ tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
1256
+ )
1257
+ return reordered_past
1258
+
1259
+
1260
+ @add_start_docstrings(
1261
+ """
1262
+ The LLaMA Model transformer with a sequence classification head on top (linear layer).
1263
+
1264
+ [`LlamaForSequenceClassification`] uses the last token in order to do the classification, as other causal models
1265
+ (e.g. GPT-2) do.
1266
+
1267
+ Since it does classification on the last token, it needs to know the position of the last token. If a
1268
+ `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
1269
+ no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
1270
+ padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
1271
+ each row of the batch).
1272
+ """,
1273
+ LLAMA_START_DOCSTRING,
1274
+ )
1275
+ class LlamaForSequenceClassification(LlamaPreTrainedModel):
1276
+ def __init__(self, config):
1277
+ super().__init__(config)
1278
+ self.num_labels = config.num_labels
1279
+ self.model = LlamaModel(config)
1280
+ self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
1281
+
1282
+ # Initialize weights and apply final processing
1283
+ self.post_init()
1284
+
1285
+ def get_input_embeddings(self):
1286
+ return self.model.embed_tokens
1287
+
1288
+ def set_input_embeddings(self, value):
1289
+ self.model.embed_tokens = value
1290
+
1291
+ @add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
1292
+ def forward(
1293
+ self,
1294
+ input_ids: torch.LongTensor = None,
1295
+ attention_mask: Optional[torch.Tensor] = None,
1296
+ position_ids: Optional[torch.LongTensor] = None,
1297
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1298
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1299
+ labels: Optional[torch.LongTensor] = None,
1300
+ use_cache: Optional[bool] = None,
1301
+ output_attentions: Optional[bool] = None,
1302
+ output_hidden_states: Optional[bool] = None,
1303
+ return_dict: Optional[bool] = None,
1304
+ ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
1305
+ r"""
1306
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1307
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
1308
+ config.num_labels - 1]`. If `config.num_labels == 1`, a regression loss is computed (mean-square loss); if
1309
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
1310
+ """
1311
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1312
+
1313
+ transformer_outputs = self.model(
1314
+ input_ids,
1315
+ attention_mask=attention_mask,
1316
+ position_ids=position_ids,
1317
+ past_key_values=past_key_values,
1318
+ inputs_embeds=inputs_embeds,
1319
+ use_cache=use_cache,
1320
+ output_attentions=output_attentions,
1321
+ output_hidden_states=output_hidden_states,
1322
+ return_dict=return_dict,
1323
+ )
1324
+ hidden_states = transformer_outputs[0]
1325
+ logits = self.score(hidden_states)
1326
+
1327
+ if input_ids is not None:
1328
+ batch_size = input_ids.shape[0]
1329
+ else:
1330
+ batch_size = inputs_embeds.shape[0]
1331
+
1332
+ if self.config.pad_token_id is None and batch_size != 1:
1333
+ raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
1334
+ if self.config.pad_token_id is None:
1335
+ sequence_lengths = -1
1336
+ else:
1337
+ if input_ids is not None:
1338
+ sequence_lengths = (torch.ne(input_ids, self.config.pad_token_id).sum(-1) - 1).to(logits.device)
1339
+ else:
1340
+ sequence_lengths = -1
1341
+
1342
+ pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]
1343
+
1344
+ loss = None
1345
+ if labels is not None:
1346
+ labels = labels.to(logits.device)
1347
+ if self.config.problem_type is None:
1348
+ if self.num_labels == 1:
1349
+ self.config.problem_type = "regression"
1350
+ elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
1351
+ self.config.problem_type = "single_label_classification"
1352
+ else:
1353
+ self.config.problem_type = "multi_label_classification"
1354
+
1355
+ if self.config.problem_type == "regression":
1356
+ loss_fct = MSELoss()
1357
+ if self.num_labels == 1:
1358
+ loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
1359
+ else:
1360
+ loss = loss_fct(pooled_logits, labels)
1361
+ elif self.config.problem_type == "single_label_classification":
1362
+ loss_fct = CrossEntropyLoss()
1363
+ loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
1364
+ elif self.config.problem_type == "multi_label_classification":
1365
+ loss_fct = BCEWithLogitsLoss()
1366
+ loss = loss_fct(pooled_logits, labels)
1367
+ if not return_dict:
1368
+ output = (pooled_logits,) + transformer_outputs[1:]
1369
+ return ((loss,) + output) if loss is not None else output
1370
+
1371
+ return SequenceClassifierOutputWithPast(
1372
+ loss=loss,
1373
+ logits=pooled_logits,
1374
+ past_key_values=transformer_outputs.past_key_values,
1375
+ hidden_states=transformer_outputs.hidden_states,
1376
+ attentions=transformer_outputs.attentions,
1377
+ )
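Two index computations in the file above are easy to misread: `prepare_inputs_for_generation` derives `position_ids` from the attention mask, and `LlamaForSequenceClassification` picks the last non-padding token by counting. The following is a minimal, dependency-free sketch of both for a single row, using plain lists instead of tensors. The helper names `position_ids_from_mask` and `last_token_index` are hypothetical and do not appear in the commit.

```python
from itertools import accumulate


def position_ids_from_mask(attention_mask):
    """Mirror `attention_mask.long().cumsum(-1) - 1` followed by
    `masked_fill_(attention_mask == 0, 1)` for one row of 0/1 values."""
    running_totals = accumulate(attention_mask)
    return [total - 1 if keep else 1 for total, keep in zip(running_totals, attention_mask)]


def last_token_index(input_ids, pad_token_id):
    """Mirror `torch.ne(input_ids, pad_token_id).sum(-1) - 1` for one row."""
    return sum(1 for token in input_ids if token != pad_token_id) - 1


# Left-padded row: two padding slots, then three real tokens.
print(position_ids_from_mask([0, 0, 1, 1, 1]))  # [1, 1, 0, 1, 2]

# Right-padded row with pad id 0: the last real token sits at index 2.
print(last_token_index([5, 6, 7, 0, 0], pad_token_id=0))  # 2
```

Because the classification index is a count of non-padding tokens, it points at the last real token only when padding is on the right. For a left-padded row such as `[0, 0, 5, 6, 7]` the same count gives 2, while the last real token is at index 4.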
quant_config.json ADDED
@@ -0,0 +1,6 @@
1
+ {
2
+ "zero_point": true,
3
+ "q_group_size": 128,
4
+ "w_bit": 4,
5
+ "version": "GEMM"
6
+ }
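To make the fields in `quant_config.json` concrete, here is a rough sketch of group-wise 4-bit quantization with an integer zero point, which is what `w_bit`, `q_group_size`, and `zero_point` describe. It is a generic illustration in pure Python, not the AWQ implementation: it omits the activation-aware scaling search and the packed GEMM kernel layout, and it assumes each group's values span zero. All function names are hypothetical.

```python
def quantize_group(weights, w_bit=4):
    """Asymmetric (zero-point) quantization of one group of floats.

    Assumes min(weights) <= 0 <= max(weights), so the zero point fits in w_bit bits.
    """
    q_max = 2 ** w_bit - 1                    # 15 for 4-bit codes
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / q_max or 1.0          # fall back to 1.0 for a constant group
    zero = max(0, min(q_max, round(-lo / scale)))
    codes = [max(0, min(q_max, round(w / scale) + zero)) for w in weights]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate float weights."""
    return [(code - zero) * scale for code in codes]


def quantize(weights, q_group_size=128, w_bit=4):
    """Quantize a flat weight list in independent groups of q_group_size."""
    return [
        quantize_group(weights[start:start + q_group_size], w_bit)
        for start in range(0, len(weights), q_group_size)
    ]


# Deterministic toy weights in [-0.5, 0.5]; 256 values give two groups of 128.
weights = [((i * 37) % 101 - 50) / 100 for i in range(256)]
groups = quantize(weights, q_group_size=128, w_bit=4)
restored = [w for codes, scale, zero in groups for w in dequantize_group(codes, scale, zero)]
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
print(len(groups), worst_error)
```

Each group stores its own `scale` and `zero`, so a smaller `q_group_size` tracks local weight ranges more closely at the cost of more per-group metadata.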
special_tokens_map.json ADDED
@@ -0,0 +1,24 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<s>",
4
+ "lstrip": false,
5
+ "normalized": true,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "</s>",
11
+ "lstrip": false,
12
+ "normalized": true,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": "<unk>",
17
+ "unk_token": {
18
+ "content": "<unk>",
19
+ "lstrip": false,
20
+ "normalized": true,
21
+ "rstrip": false,
22
+ "single_word": false
23
+ }
24
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
tokenizer.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347
3
+ size 499723
tokenizer_config.json ADDED
@@ -0,0 +1,36 @@
1
+ {
2
+ "add_bos_token": false,
3
+ "add_eos_token": false,
4
+ "bos_token": {
5
+ "__type": "AddedToken",
6
+ "content": "<s>",
7
+ "lstrip": false,
8
+ "normalized": true,
9
+ "rstrip": false,
10
+ "single_word": false
11
+ },
12
+ "clean_up_tokenization_spaces": false,
13
+ "eos_token": {
14
+ "__type": "AddedToken",
15
+ "content": "</s>",
16
+ "lstrip": false,
17
+ "normalized": true,
18
+ "rstrip": false,
19
+ "single_word": false
20
+ },
21
+ "legacy": false,
22
+ "model_max_length": 1000000000000000019884624838656,
23
+ "pad_token": null,
24
+ "sp_model_kwargs": {},
25
+ "spaces_between_special_tokens": false,
26
+ "tokenizer_class": "LlamaTokenizer",
27
+ "unk_token": {
28
+ "__type": "AddedToken",
29
+ "content": "<unk>",
30
+ "lstrip": false,
31
+ "normalized": true,
32
+ "rstrip": false,
33
+ "single_word": false
34
+ },
35
+ "use_default_system_prompt": true
36
+ }
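The `model_max_length` value in `tokenizer_config.json` is not a real limit: `1000000000000000019884624838656` equals `int(1e30)`, the placeholder that `transformers` writes when no maximum length was set. A caller that truncates inputs may therefore want to substitute its own limit. The sketch below shows one way to do that; the helper name `effective_max_length` and the `fallback` value of 4096 are hypothetical choices for illustration, not values taken from this commit.

```python
import json

# Placeholder meaning "no maximum length configured".
UNSET_MAX_LENGTH = int(1e30)


def effective_max_length(tokenizer_config, fallback):
    """Return the configured max length, or `fallback` when it is missing or unset."""
    value = tokenizer_config.get("model_max_length")
    if value is None or value >= UNSET_MAX_LENGTH:
        return fallback
    return value


# The relevant fields from the config above, parsed from JSON.
raw = '{"model_max_length": 1000000000000000019884624838656, "pad_token": null}'
config = json.loads(raw)

print(effective_max_length(config, fallback=4096))                       # 4096
print(effective_max_length({"model_max_length": 2048}, fallback=4096))   # 2048
```

Python's `json` module parses the large literal as an exact `int`, so the comparison against `int(1e30)` involves no floating-point rounding.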