ganser4566 committed on
Commit
3638607
·
verified ·
1 Parent(s): 022c1b3

Upload 10 files

README.md ADDED
@@ -0,0 +1,158 @@
1
+ ---
2
+ datasets:
3
+ - HuggingFaceH4/ultrachat_200k
4
+ - allenai/ultrafeedback_binarized_cleaned
5
+ - meta-math/MetaMathQA
6
+ - WizardLM/WizardLM_evol_instruct_V2_196k
7
+ - openchat/openchat_sharegpt4_dataset
8
+ - LDJnr/Capybara
9
+ - Intel/orca_dpo_pairs
10
+ - hkust-nlp/deita-10k-v0
11
+ language:
12
+ - en
13
+ tags:
14
+ - causal-lm
15
+ extra_gated_fields:
16
+ Name: text
17
+ Email: text
18
+ Country: text
19
+ Organization or Affiliation: text
20
+ I ALLOW Stability AI to email me about new model releases: checkbox
21
+ license: other
22
+ ---
23
+ # `StableLM 2 Zephyr 1.6B`
24
+
25
+ ## Model Description
26
+
27
+ `StableLM 2 Zephyr 1.6B` is a 1.6 billion parameter instruction-tuned language model inspired by [HuggingFaceH4's Zephyr 7B](https://huggingface.co/HuggingFaceH4/zephyr-7b-beta) training pipeline. The model is trained on a mix of publicly available and synthetic datasets using [Direct Preference Optimization (DPO)](https://arxiv.org/abs/2305.18290).
28
+
29
+ ## Usage
30
+
31
+ `StableLM 2 Zephyr 1.6B` uses the following instruction format:
32
+ ```
33
+ <|user|>
34
+ Which famous math number begins with 1.6 ...?<|endoftext|>
35
+ <|assistant|>
36
+ The number you are referring to is 1.618033988749895. This is the famous value known as the golden ratio<|endoftext|>
37
+ ```
38
+
39
+ This format is also available through the tokenizer's `apply_chat_template` method:
40
+
41
+ ```python
42
+ from transformers import AutoModelForCausalLM, AutoTokenizer
43
+
44
+ tokenizer = AutoTokenizer.from_pretrained('stabilityai/stablelm-2-zephyr-1_6b')
45
+ model = AutoModelForCausalLM.from_pretrained(
46
+ 'stabilityai/stablelm-2-zephyr-1_6b',
47
+ device_map="auto"
48
+ )
49
+
50
+ prompt = [{'role': 'user', 'content': 'Which famous math number begins with 1.6 ...?'}]
51
+ inputs = tokenizer.apply_chat_template(
52
+ prompt,
53
+ add_generation_prompt=True,
54
+ return_tensors='pt'
55
+ )
56
+
57
+ tokens = model.generate(
58
+ inputs.to(model.device),
59
+ max_new_tokens=1024,
60
+ temperature=0.5,
61
+ do_sample=True
62
+ )
63
+
64
+ print(tokenizer.decode(tokens[0], skip_special_tokens=False))
65
+ ```
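The instruction format shown above can also be assembled by hand, which is useful for checking what the chat template produces. The following is a minimal sketch, assuming the turn layout is exactly what the README example displays (tag, newline, content, `<|endoftext|>`, newline). `build_prompt` is a hypothetical helper, not part of the model repository, and the tokenizer's own chat template remains the authoritative definition.

```python
# Special tokens taken from the instruction format shown in the README.
USER_TAG = "<|user|>"
ASSISTANT_TAG = "<|assistant|>"
END_OF_TEXT = "<|endoftext|>"


def build_prompt(messages, add_generation_prompt=True):
    """Render a list of {'role', 'content'} dicts in the README's format.

    Hypothetical helper for illustration only; newline placement is an
    assumption read off the README example.
    """
    tags = {"user": USER_TAG, "assistant": ASSISTANT_TAG}
    parts = []
    for message in messages:
        tag = tags[message["role"]]  # KeyError for roles this sketch does not cover
        parts.append(f"{tag}\n{message['content']}{END_OF_TEXT}\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append(f"{ASSISTANT_TAG}\n")
    return "".join(parts)


prompt = build_prompt(
    [{"role": "user", "content": "Which famous math number begins with 1.6 ...?"}]
)
print(prompt)
```

Comparing this string with `tokenizer.apply_chat_template(..., tokenize=False)` is a quick way to confirm whether the assumed layout matches the real template.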
66
+
67
+ ## Model Details
68
+
69
+ * **Developed by**: [Stability AI](https://stability.ai/)
70
+ * **Model type**: `StableLM 2 Zephyr 1.6B` is an auto-regressive language model based on the transformer decoder architecture.
71
+ * **Language(s)**: English
72
+ * **Paper**: [Stable LM 2 1.6B Technical Report](https://drive.google.com/file/d/1JYJHszhS8EFChTbNAf8xmqhKjogWRrQF/view?usp=sharing)
73
+ * **Library**: [Alignment Handbook](https://github.com/huggingface/alignment-handbook.git)
74
+ * **Finetuned from model**: [https://huggingface.co/stabilityai/stablelm-2-1_6b](https://huggingface.co/stabilityai/stablelm-2-1_6b)
75
+ * **License**: [StabilityAI Non-Commercial Research Community License](https://huggingface.co/stabilityai/stablelm-2-zephyr-1_6b/blob/main/LICENSE). If you want to use this model for your commercial products or purposes, please contact us [here](https://stability.ai/contact) to learn more.
76
+ * **Contact**: For questions and comments about the model, please email `[email protected]`
77
+
78
+ ### Training Dataset
79
+
80
+ The dataset is a mixture of large-scale open datasets available on the [HuggingFace Hub](https://huggingface.co/datasets):
81
+ 1. SFT Datasets:
82
+ - HuggingFaceH4/ultrachat_200k
83
+ - meta-math/MetaMathQA
84
+ - WizardLM/WizardLM_evol_instruct_V2_196k
85
+ - Open-Orca/SlimOrca
86
+ - openchat/openchat_sharegpt4_dataset
87
+ - LDJnr/Capybara
88
+ - hkust-nlp/deita-10k-v0
89
+
90
+ 2. Preference Datasets:
91
+ - allenai/ultrafeedback_binarized_cleaned
92
+ - Intel/orca_dpo_pairs
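Preference datasets such as the two listed above supply (chosen, rejected) response pairs, which is the input Direct Preference Optimization needs. As a rough illustration of the objective, here is a sketch of the per-pair DPO loss from the linked paper, written in plain Python. The function name, the example log-probabilities and `beta=0.1` are illustrative assumptions; the README does not state the hyperparameters used for this model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities.

    `beta` is an illustrative value, not one reported for this model.
    """
    # How much more the policy favors each response than the frozen reference does.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)), written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: the margin is 0 and the loss is log(2).
baseline = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has shifted probability toward the chosen response: the loss drops.
improved = dpo_loss(-10.0, -17.0, -12.0, -15.0)
print(baseline, improved)
```

The loss only depends on how the policy's preference differs from the reference model's, which is why no separate reward model is needed.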
93
+
94
+ ## Performance
95
+
96
+ ### MT-Bench
97
+
98
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/61b2bf4f5b1f7cad1799cfbb/QH00HVM3lg-5f17U_py4K.png" alt="mt_bench_plot" width="600"/>
99
+
100
+ | Model | Size | MT-Bench |
101
+ |-------------------------|------|----------|
102
+ | Mistral-7B-Instruct-v0.2| 7B | 7.61 |
103
+ | Llama2-Chat | 70B | 6.86 |
104
+ | stablelm-zephyr-3b | 3B | 6.64 |
105
+ | MPT-30B-Chat | 30B | 6.39 |
106
+ | **stablelm-2-zephyr-1.6b** | 1.6B | 5.42 |
107
+ | Falcon-40B-Instruct | 40B | 5.17 |
108
+ | Qwen-1.8B-Chat | 1.8B | 4.95 |
109
+ | dolphin-2.6-phi-2 | 2.7B | 4.93 |
110
+ | phi-2 | 2.7B | 4.29 |
111
+ | TinyLlama-1.1B-Chat-v1.0| 1.1B | 3.46 |
112
+
113
+ ### OpenLLM Leaderboard
114
+
115
+ | Model | Size | Average | ARC Challenge (acc_norm) | HellaSwag (acc_norm) | MMLU (acc_norm) | TruthfulQA (mc2) | Winogrande (acc) | Gsm8k (acc) |
116
+ |----------------------------------------|------|---------|-------------------------|----------------------|-----------------|------------------|------------------|-------------|
117
+ | microsoft/phi-2 | 2.7B | 61.32% | 61.09% | 75.11% | 58.11% | 44.47% | 74.35% | 54.81% |
118
+ | **stabilityai/stablelm-2-zephyr-1_6b** | 1.6B | 49.89% | 43.69% | 69.34% | 41.85% | 45.21% | 64.09% | 35.18% |
119
+ | microsoft/phi-1_5 | 1.3B | 47.69% | 52.90% | 63.79% | 43.89% | 40.89% | 72.22% | 12.43% |
120
+ | stabilityai/stablelm-2-1_6b | 1.6B | 45.54% | 43.43% | 70.49% | 38.93% | 36.65% | 65.90% | 17.82% |
121
+ | mosaicml/mpt-7b | 7B | 44.28% | 47.70% | 77.57% | 30.80% | 33.40% | 72.14% | 4.02% |
122
+ | KnutJaegersberg/Qwen-1_8B-Llamaified* | 1.8B | 44.75% | 37.71% | 58.87% | 46.37% | 39.41% | 61.72% | 24.41% |
123
+ | openlm-research/open_llama_3b_v2 | 3B | 40.28% | 40.27% | 71.60% | 27.12% | 34.78% | 67.01% | 0.91% |
124
+ | tiiuae/falcon-rw-1b | 1B | 37.07% | 35.07% | 63.56% | 25.28% | 35.96% | 62.04% | 0.53% |
125
+ | TinyLlama/TinyLlama-1.1B-3T | 1.1B | 36.40% | 33.79% | 60.31% | 26.04% | 37.32% | 59.51% | 1.44% |
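The README does not say how the Average column is derived, but it appears to be the arithmetic mean of the six benchmark scores. A small check on two rows, with the numbers copied from the table above:

```python
# Per-benchmark scores copied from the table, in column order:
# ARC Challenge, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K.
rows = {
    "stabilityai/stablelm-2-zephyr-1_6b": ([43.69, 69.34, 41.85, 45.21, 64.09, 35.18], 49.89),
    "stabilityai/stablelm-2-1_6b": ([43.43, 70.49, 38.93, 36.65, 65.90, 17.82], 45.54),
}

for name, (scores, reported) in rows.items():
    mean = sum(scores) / len(scores)
    print(f"{name}: computed {mean:.2f}, reported {reported:.2f}")
```

Small differences in the last digit are possible for other rows, since the reported averages were presumably computed from unrounded scores.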
126
+
127
+
128
+
129
+ ### Training Infrastructure
130
+
131
+ * **Hardware**: `StableLM 2 Zephyr 1.6B` was trained on the Stability AI cluster across 8 nodes, each with 8 A100 80GB GPUs.
132
+ * **Code Base**: We used our internal script for the SFT steps and the [HuggingFace Alignment Handbook script](https://github.com/huggingface/alignment-handbook) for DPO training.
133
+
134
+ ## Use and Limitations
135
+
136
+ ### Intended Use
137
+
138
+ The model is intended to be used in chat-like applications. Developers must evaluate the model for safety performance in their specific use case. Read more about [safety and limitations](#limitations-and-bias) below.
139
+
140
+ ### Limitations and Bias
141
+
142
+ This model is not trained against adversarial inputs. We strongly recommend pairing this model with an input and output classifier to prevent harmful responses.
143
+
144
+ Through our internal red teaming, we discovered that while the model will not output harmful information unless prompted to do so, it will hallucinate many facts. It will also produce potentially harmful outputs or misinformation when the user requests it.
145
+ Using this model will require guardrails around your inputs and outputs to ensure that any outputs returned are not misinformation or harmful.
146
+ Additionally, as each use case is unique, we recommend running your own suite of tests to ensure proper performance of this model.
147
+ Finally, do not use this model if it is unsuitable for your application, or for any application that may cause deliberate or unintentional harm to others.
148
+
149
+
150
+ ## How to Cite
151
+
152
+ ```bibtex
153
+ @misc{StableLM-2-1.6B,
154
+ url={https://huggingface.co/stabilityai/stablelm-2-1.6b},
155
+ title={Stable LM 2 1.6B},
156
+ author={Stability AI Language Team}
157
+ }
158
+ ```
config.json ADDED
@@ -0,0 +1,25 @@
1
+ {
2
+ "architectures": [
3
+ "StableLmForCausalLM"
4
+ ],
5
+ "bos_token_id": 100257,
6
+ "eos_token_id": 100257,
7
+ "hidden_act": "silu",
8
+ "hidden_size": 2048,
9
+ "initializer_range": 0.02,
10
+ "intermediate_size": 5632,
11
+ "max_position_embeddings": 4096,
12
+ "model_type": "stablelm",
13
+ "layer_norm_eps": 1e-05,
14
+ "num_attention_heads": 32,
15
+ "num_hidden_layers": 24,
16
+ "num_key_value_heads": 32,
17
+ "partial_rotary_factor": 0.25,
18
+ "rope_theta": 10000,
19
+ "tie_word_embeddings": false,
20
+ "torch_dtype": "float16",
21
+ "transformers_version": "4.38.0",
22
+ "use_cache": true,
23
+ "use_qkv_bias": true,
24
+ "vocab_size": 100352
25
+ }
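A few derived quantities follow directly from this config. The sketch below is a back-of-the-envelope calculation, not code from the repository: it assumes the attention and MLP layouts visible in `modeling_stablelm.py` (q/k/v projections with bias, an unbiased output projection, three unbiased MLP projections), assumes an untied output head of size `vocab_size x hidden_size`, and ignores normalization layers.

```python
# Values copied from the config.json above.
config = {
    "hidden_size": 2048,
    "intermediate_size": 5632,
    "num_attention_heads": 32,
    "num_hidden_layers": 24,
    "num_key_value_heads": 32,
    "partial_rotary_factor": 0.25,
    "vocab_size": 100352,
}

h = config["hidden_size"]
m = config["intermediate_size"]

head_dim = h // config["num_attention_heads"]
# Only this many dimensions of each head receive rotary position embeddings.
rotary_dim = int(config["partial_rotary_factor"] * head_dim)
# 1 means every query head has its own key/value head (multi-head attention).
kv_groups = config["num_attention_heads"] // config["num_key_value_heads"]

# Assumed layout: q, k, v, o matrices plus q/k/v biases (k/v are full width here).
attention_weights = 4 * h * h + 3 * h
# Gate, up and down projections, all without bias.
mlp_weights = 3 * h * m
# Input embeddings plus an untied output head (tie_word_embeddings is false).
embedding_weights = 2 * config["vocab_size"] * h

approx_total = embedding_weights + config["num_hidden_layers"] * (attention_weights + mlp_weights)
print(head_dim, rotary_dim, kv_groups, f"{approx_total / 1e9:.2f}B parameters (approx.)")
```

With these values each head has 64 dimensions, of which 16 are rotated, and the estimate lands at roughly 1.64 billion parameters, consistent with the "1.6B" in the model name.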
configuration_stablelm.py ADDED
@@ -0,0 +1,183 @@
1
+ # coding=utf-8
2
+ # Copyright 2024 Stability AI and The HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # Licensed under the Apache License, Version 2.0 (the "License");
5
+ # you may not use this file except in compliance with the License.
6
+ # You may obtain a copy of the License at
7
+ #
8
+ # http://www.apache.org/licenses/LICENSE-2.0
9
+ #
10
+ # Unless required by applicable law or agreed to in writing, software
11
+ # distributed under the License is distributed on an "AS IS" BASIS,
12
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
13
+ # See the License for the specific language governing permissions and
14
+ # limitations under the License.
15
+ """ StableLM model configuration """
16
+
17
+ from transformers.configuration_utils import PretrainedConfig
18
+ from transformers.utils import logging
19
+
20
+
21
+ logger = logging.get_logger(__name__)
22
+
23
+ STABLELM_PRETRAINED_CONFIG_ARCHIVE_MAP = {
24
+ "stabilityai/stablelm-3b-4e1t": "https://huggingface.co/stabilityai/stablelm-3b-4e1t/resolve/main/config.json",
25
+ # See all StableLM models at https://huggingface.co/models?filter=stablelm
26
+ }
27
+
28
+
29
+ class StableLmConfig(PretrainedConfig):
30
+ r"""
31
+ This is the configuration class to store the configuration of a [`~StableLmModel`].
32
+ It is used to instantiate a StableLM model according to the specified arguments, defining the model
33
+ architecture. Instantiating a configuration with the defaults will yield a similar configuration to that of
34
+ the StableLM [stabilityai/stablelm-3b-4e1t](https://huggingface.co/stabilityai/stablelm-3b-4e1t) architecture.
35
+
36
+ Configuration objects inherit from [`PretrainedConfig`] and can be used
37
+ to control the model outputs. Read the documentation from [`PretrainedConfig`]
38
+ for more information.
39
+
40
+
41
+ Args:
42
+ vocab_size (`int`, *optional*, defaults to 50304):
43
+ Vocabulary size of the StableLM model. Defines the number of different tokens that
44
+ can be represented by the `input_ids` passed when calling [`StableLmModel`].
45
+ intermediate_size (`int`, *optional*, defaults to 6912):
46
+ Dimension of the MLP representations.
47
+ hidden_size (`int`, *optional*, defaults to 2560):
48
+ Dimension of the hidden representations.
49
+ num_hidden_layers (`int`, *optional*, defaults to 32):
50
+ Number of hidden layers in the Transformer decoder.
51
+ num_attention_heads (`int`, *optional*, defaults to 32):
52
+ Number of attention heads for each attention layer in the Transformer decoder.
53
+ num_key_value_heads (`int`, *optional*, defaults to 32):
54
+ This is the number of key_value heads that should be used to implement Grouped Query Attention. If
55
+ `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
56
+ `num_key_value_heads=1`, the model will use Multi Query Attention (MQA); otherwise GQA is used. When
57
+ converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
58
+ by meanpooling all the original heads within that group. For more details check out [this
59
+ paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
60
+ `num_attention_heads`.
61
+ hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
62
+ The non-linear activation function (function or string).
63
+ max_position_embeddings (`int`, *optional*, defaults to 4096):
64
+ The maximum sequence length that this model might ever be used with.
65
+ Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
66
+ initializer_range (`float`, *optional*, defaults to 0.02):
67
+ The standard deviation of the truncated_normal_initializer for initializing
68
+ all weight matrices.
69
+ layer_norm_eps (`float`, *optional*, defaults to 1e-05):
70
+ The epsilon used by the normalization layers.
71
+ use_cache (`bool`, *optional*, defaults to `True`):
72
+ Whether or not the model should return the last key/values attentions
73
+ (not used by all models). Only relevant if `config.is_decoder=True`.
74
+ tie_word_embeddings (`bool`, *optional*, defaults to `False`):
75
+ Whether the model's input and output word embeddings should be tied.
76
+ rope_theta (`float`, *optional*, defaults to `10000.0`):
77
+ The base period of the RoPE embeddings.
78
+ rope_scaling (`Dict`, *optional*):
79
+ Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
80
+ strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
81
+ `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
82
+ `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
83
+ these scaling strategies behave:
84
+ https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This
85
+ is an experimental feature, subject to breaking API changes in future versions.
86
+ use_qkv_bias (`bool`, *optional*, defaults to `False`):
87
+ Whether or not the model should use bias for qkv layers.
88
+ hidden_dropout (`float`, *optional*, defaults to 0.0):
89
+ The dropout ratio after applying the MLP to the hidden states.
90
+ attention_dropout (`float`, *optional*, defaults to 0.0):
91
+ The dropout ratio for the attention probabilities.
92
+ partial_rotary_factor (`float`, *optional*, defaults to 0.25):
93
+ Fraction of the query and key dimensions that will have rotary embedding applied.
94
+ bos_token_id (`int`, *optional*, defaults to 0):
95
+ The id of the `BOS` token in the vocabulary.
96
+ eos_token_id (`int`, *optional*, defaults to 0):
97
+ The id of the `EOS` token in the vocabulary.
98
+
99
+ Example:
100
+
101
+ ```python
102
+ >>> from transformers import StableLmModel, StableLmConfig
103
+
104
+ >>> # Initializing a StableLM stablelm-3b style configuration
105
+ >>> configuration = StableLmConfig()
106
+ ```"""
107
+
108
+ model_type = "stablelm"
109
+ keys_to_ignore_at_inference = ["past_key_values"]
110
+
111
+ def __init__(
112
+ self,
113
+ vocab_size=50304,
114
+ intermediate_size=6912,
115
+ hidden_size=2560,
116
+ num_hidden_layers=32,
117
+ num_attention_heads=32,
118
+ num_key_value_heads=32,
119
+ hidden_act="silu",
120
+ max_position_embeddings=4096,
121
+ initializer_range=0.02,
122
+ layer_norm_eps=1.0e-5,
123
+ use_cache=True,
124
+ tie_word_embeddings=False,
125
+ rope_theta=10_000,
126
+ rope_scaling=None,
127
+ use_qkv_bias=False,
128
+ hidden_dropout=0.0,
129
+ attention_dropout=0.0,
130
+ partial_rotary_factor=0.25,
131
+ bos_token_id=0,
132
+ eos_token_id=0,
133
+ **kwargs,
134
+ ):
135
+ self.vocab_size = vocab_size
136
+ self.max_position_embeddings = max_position_embeddings
137
+
138
+ self.hidden_size = hidden_size
139
+ self.intermediate_size = intermediate_size
140
+ self.num_hidden_layers = num_hidden_layers
141
+ self.num_attention_heads = num_attention_heads
142
+ self.num_key_value_heads = num_key_value_heads
143
+ self.hidden_act = hidden_act
144
+
145
+ self.initializer_range = initializer_range
146
+ self.layer_norm_eps = layer_norm_eps
147
+ self.use_cache = use_cache
148
+ self.rope_theta = rope_theta
149
+ self.rope_scaling = rope_scaling
150
+ self.use_qkv_bias = use_qkv_bias
151
+ self.hidden_dropout = hidden_dropout
152
+ self.attention_dropout = attention_dropout
153
+ self.partial_rotary_factor = partial_rotary_factor
154
+ self._rope_scaling_validation()
155
+
156
+ super().__init__(
157
+ bos_token_id=bos_token_id,
158
+ eos_token_id=eos_token_id,
159
+ tie_word_embeddings=tie_word_embeddings,
160
+ **kwargs,
161
+ )
162
+
163
+ # Copied from transformers.models.llama.configuration_llama.LlamaConfig._rope_scaling_validation
164
+ def _rope_scaling_validation(self):
165
+ """
166
+ Validate the `rope_scaling` configuration.
167
+ """
168
+ if self.rope_scaling is None:
169
+ return
170
+
171
+ if not isinstance(self.rope_scaling, dict) or len(self.rope_scaling) != 2:
172
+ raise ValueError(
173
+ "`rope_scaling` must be a dictionary with two fields, `type` and `factor`, "
174
+ f"got {self.rope_scaling}"
175
+ )
176
+ rope_scaling_type = self.rope_scaling.get("type", None)
177
+ rope_scaling_factor = self.rope_scaling.get("factor", None)
178
+ if rope_scaling_type is None or rope_scaling_type not in ["linear", "dynamic"]:
179
+ raise ValueError(
180
+ f"`rope_scaling`'s type field must be one of ['linear', 'dynamic'], got {rope_scaling_type}"
181
+ )
182
+ if rope_scaling_factor is None or not isinstance(rope_scaling_factor, float) or rope_scaling_factor <= 1.0:
183
+ raise ValueError(f"`rope_scaling`'s factor field must be a float > 1, got {rope_scaling_factor}")
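To see which `rope_scaling` values the validation above accepts, the checks can be reproduced outside the config class. This is a standalone sketch that mirrors the logic of `_rope_scaling_validation`; `validate_rope_scaling` and `is_valid` are hypothetical helper names introduced here for experimentation, not part of the repository.

```python
def validate_rope_scaling(rope_scaling):
    """Standalone copy of the checks performed by `_rope_scaling_validation`."""
    if rope_scaling is None:
        return
    if not isinstance(rope_scaling, dict) or len(rope_scaling) != 2:
        raise ValueError(
            f"`rope_scaling` must be a dictionary with two fields, `type` and `factor`, got {rope_scaling}"
        )
    scaling_type = rope_scaling.get("type")
    factor = rope_scaling.get("factor")
    if scaling_type not in ("linear", "dynamic"):
        raise ValueError(f"`rope_scaling`'s type field must be 'linear' or 'dynamic', got {scaling_type}")
    # The factor must be a float: the integer 2 is rejected, 2.0 is accepted.
    if not isinstance(factor, float) or factor <= 1.0:
        raise ValueError(f"`rope_scaling`'s factor field must be a float > 1, got {factor}")


def is_valid(rope_scaling):
    """Return True if the value passes validation, False if it raises."""
    try:
        validate_rope_scaling(rope_scaling)
    except ValueError:
        return False
    return True


candidates = [
    None,                                  # scaling disabled
    {"type": "linear", "factor": 2.0},
    {"type": "dynamic", "factor": 4.0},
    {"type": "linear", "factor": 2},       # int instead of float
    {"type": "yarn", "factor": 2.0},       # unsupported strategy
    {"type": "linear", "factor": 1.0},     # factor must be strictly greater than 1
]
results = [is_valid(candidate) for candidate in candidates]
print(results)
```

The integer case is the one most likely to surprise: a factor written as `2` in a JSON config fails the `isinstance(..., float)` check, so it has to be written as `2.0`.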
generation_config.json ADDED
@@ -0,0 +1,6 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 100257,
4
+ "eos_token_id": 100257,
5
+ "transformers_version": "4.38.0"
6
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
modeling_stablelm.py ADDED
@@ -0,0 +1,1341 @@
1
+ # coding=utf-8
2
+ # Copyright 2024 EleutherAI and the HuggingFace Inc. team. All rights reserved.
3
+ #
4
+ # This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
5
+ # and OPT implementations in this library. It has been modified from its
6
+ # original forms to accommodate minor architectural differences compared
7
+ # to GPT-NeoX and OPT used by the Meta AI team that trained the model.
8
+ #
9
+ # Licensed under the Apache License, Version 2.0 (the "License");
10
+ # you may not use this file except in compliance with the License.
11
+ # You may obtain a copy of the License at
12
+ #
13
+ # http://www.apache.org/licenses/LICENSE-2.0
14
+ #
15
+ # Unless required by applicable law or agreed to in writing, software
16
+ # distributed under the License is distributed on an "AS IS" BASIS,
17
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
18
+ # See the License for the specific language governing permissions and
19
+ # limitations under the License.
20
+ """ PyTorch StableLM model."""
21
+ import math
22
+ from typing import List, Optional, Tuple, Union
23
+
24
+ import torch
25
+ import torch.nn.functional as F
26
+ import torch.utils.checkpoint
27
+ from torch import nn
28
+ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
29
+
30
+ from transformers.activations import ACT2FN
31
+ from transformers.cache_utils import Cache, DynamicCache
32
+ from transformers.modeling_attn_mask_utils import _prepare_4d_causal_attention_mask
33
+ from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast, SequenceClassifierOutputWithPast
34
+ from transformers.modeling_utils import PreTrainedModel
35
+ from transformers.utils import (
36
+ add_start_docstrings,
37
+ add_start_docstrings_to_model_forward,
38
+ is_flash_attn_2_available,
39
+ is_flash_attn_greater_or_equal_2_10,
40
+ logging,
41
+ replace_return_docstrings,
42
+ )
43
+ from .configuration_stablelm import StableLmConfig
44
+
45
+
46
+ if is_flash_attn_2_available():
47
+ from flash_attn import flash_attn_func, flash_attn_varlen_func
48
+ from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input # noqa
49
+
50
+
51
+ logger = logging.get_logger(__name__)
52
+
53
+ _CONFIG_FOR_DOC = "StableLmConfig"
54
+
55
+
56
+ # Copied from transformers.models.llama.modeling_llama._get_unpad_data
57
+ def _get_unpad_data(attention_mask):
58
+ seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
59
+ indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
60
+ max_seqlen_in_batch = seqlens_in_batch.max().item()
61
+ cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
62
+ return (
63
+ indices,
64
+ cu_seqlens,
65
+ max_seqlen_in_batch,
66
+ )
67
+
68
+
69
+ # Copied from transformers.models.mistral.modeling_mistral.MistralRotaryEmbedding with Mistral->StableLm
70
+ class StableLmRotaryEmbedding(nn.Module):
71
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
72
+ super().__init__()
73
+
74
+ self.dim = dim
75
+ self.max_position_embeddings = max_position_embeddings
76
+ self.base = base
77
+ inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64).float().to(device) / self.dim))
78
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
79
+
80
+ # Build here to make `torch.jit.trace` work.
81
+ self._set_cos_sin_cache(
82
+ seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
83
+ )
84
+
85
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
86
+ self.max_seq_len_cached = seq_len
87
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.int64).type_as(self.inv_freq)
88
+
89
+ freqs = torch.outer(t, self.inv_freq)
90
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
91
+ emb = torch.cat((freqs, freqs), dim=-1)
92
+ self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
93
+ self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
94
+
95
+ def forward(self, x, seq_len=None):
96
+ # x: [bs, num_attention_heads, seq_len, head_size]
97
+ if seq_len > self.max_seq_len_cached:
98
+ self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
99
+
100
+ return (
101
+ self.cos_cached[:seq_len].to(dtype=x.dtype),
102
+ self.sin_cached[:seq_len].to(dtype=x.dtype),
103
+ )
104
+
105
+
106
+ # Copied from transformers.models.falcon.modeling_falcon.FalconLinearScalingRotaryEmbedding with Falcon->StableLm
107
+ class StableLmLinearScalingRotaryEmbedding(StableLmRotaryEmbedding):
108
+ """StableLmRotaryEmbedding extended with linear scaling. Credits to the Reddit user /u/kaiokendev"""
109
+
110
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
111
+ self.scaling_factor = scaling_factor
112
+ super().__init__(dim, max_position_embeddings, base, device)
113
+
114
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
115
+ self.max_seq_len_cached = seq_len
116
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.int64).type_as(self.inv_freq)
117
+ t = t / self.scaling_factor
118
+
119
+ freqs = torch.outer(t, self.inv_freq)
120
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
121
+ emb = torch.cat((freqs, freqs), dim=-1)
122
+ self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
123
+ self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
124
+
125
+
126
+ # Copied from transformers.models.falcon.modeling_falcon.FalconDynamicNTKScalingRotaryEmbedding with Falcon->StableLm
127
+ class StableLmDynamicNTKScalingRotaryEmbedding(StableLmRotaryEmbedding):
128
+ """StableLmRotaryEmbedding extended with Dynamic NTK scaling. Credits to the Reddit users /u/bloc97 and /u/emozilla"""
129
+
130
+ def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
131
+ self.scaling_factor = scaling_factor
132
+ super().__init__(dim, max_position_embeddings, base, device)
133
+
134
+ def _set_cos_sin_cache(self, seq_len, device, dtype):
135
+ self.max_seq_len_cached = seq_len
136
+
137
+ if seq_len > self.max_position_embeddings:
138
+ base = self.base * (
139
+ (self.scaling_factor * seq_len / self.max_position_embeddings) - (self.scaling_factor - 1)
140
+ ) ** (self.dim / (self.dim - 2))
141
+ inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2, dtype=torch.int64).float().to(device) / self.dim))
142
+ self.register_buffer("inv_freq", inv_freq, persistent=False)
143
+
144
+ t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.int64).type_as(self.inv_freq)
145
+
146
+ freqs = torch.outer(t, self.inv_freq)
147
+ # Different from paper, but it uses a different permutation in order to obtain the same calculation
148
+ emb = torch.cat((freqs, freqs), dim=-1)
149
+ self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
150
+ self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
151
+
152
+
153
+ # Copied from transformers.models.llama.modeling_llama.rotate_half
154
+ def rotate_half(x):
155
+ """Rotates half the hidden dims of the input."""
156
+ x1 = x[..., : x.shape[-1] // 2]
157
+ x2 = x[..., x.shape[-1] // 2 :]
158
+ return torch.cat((-x2, x1), dim=-1)
159
+
160
+
161
+ # Copied from transformers.models.mistral.modeling_mistral.apply_rotary_pos_emb
162
+ def apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1):
163
+ """Applies Rotary Position Embedding to the query and key tensors.
164
+
165
+ Args:
166
+ q (`torch.Tensor`): The query tensor.
167
+ k (`torch.Tensor`): The key tensor.
168
+ cos (`torch.Tensor`): The cosine part of the rotary embedding.
169
+ sin (`torch.Tensor`): The sine part of the rotary embedding.
170
+ position_ids (`torch.Tensor`):
171
+ The position indices of the tokens corresponding to the query and key tensors. For example, this can be
172
+ used to pass offsetted position ids when working with a KV-cache.
173
+ unsqueeze_dim (`int`, *optional*, defaults to 1):
174
+ The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
175
+ sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
176
+ that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
177
+ k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
178
+ cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
179
+ the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
180
+ Returns:
181
+ `tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
182
+ """
183
+ cos = cos[position_ids].unsqueeze(unsqueeze_dim)
184
+ sin = sin[position_ids].unsqueeze(unsqueeze_dim)
185
+ q_embed = (q * cos) + (rotate_half(q) * sin)
186
+ k_embed = (k * cos) + (rotate_half(k) * sin)
187
+ return q_embed, k_embed
188
+
189
+
190
+ # Copied from transformers.models.mistral.modeling_mistral.MistralMLP with Mistral->StableLm
191
+ class StableLmMLP(nn.Module):
192
+ def __init__(self, config):
193
+ super().__init__()
194
+ self.config = config
195
+ self.hidden_size = config.hidden_size
196
+ self.intermediate_size = config.intermediate_size
197
+ self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
198
+ self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
199
+ self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
200
+ self.act_fn = ACT2FN[config.hidden_act]
201
+
202
+ def forward(self, x):
203
+ return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
204
+
205
+
206
+ # Copied from transformers.models.llama.modeling_llama.repeat_kv
207
+ def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
208
+ """
209
+ This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
210
+ num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
211
+ """
212
+ batch, num_key_value_heads, slen, head_dim = hidden_states.shape
213
+ if n_rep == 1:
214
+ return hidden_states
215
+ hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
216
+ return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
217
+
218
+
219
+ class StableLmAttention(nn.Module):
220
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
221
+
222
+ def __init__(self, config: StableLmConfig, layer_idx: Optional[int] = None):
223
+ super().__init__()
224
+ self.config = config
225
+ self.layer_idx = layer_idx
226
+ if layer_idx is None:
227
+ logger.warning_once(
228
+ f"Instantiating {self.__class__.__name__} without passing a `layer_idx` is not recommended and will "
229
+ "lead to errors during the forward call if caching is used. Please make sure to provide a `layer_idx` "
230
+ "when creating this class."
231
+ )
232
+
233
+ self.hidden_size = config.hidden_size
234
+ self.num_heads = config.num_attention_heads
235
+ self.head_dim = self.hidden_size // self.num_heads
236
+ self.num_key_value_heads = config.num_key_value_heads
237
+ self.num_key_value_groups = self.num_heads // self.num_key_value_heads
238
+ self.max_position_embeddings = config.max_position_embeddings
239
+ self.rope_theta = config.rope_theta
240
+ self.partial_rotary_factor = config.partial_rotary_factor
241
+ self.is_causal = True
242
+
243
+ if (self.head_dim * self.num_heads) != self.hidden_size:
244
+ raise ValueError(
245
+ f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
246
+ f" and `num_heads`: {self.num_heads})."
247
+ )
248
+ self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.use_qkv_bias)
249
+ self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.use_qkv_bias)
250
+ self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.use_qkv_bias)
251
+ self.o_proj = nn.Linear(self.hidden_size, self.hidden_size, bias=False)
252
+
253
+ self.attention_dropout = nn.Dropout(config.attention_dropout)
254
+ self._init_rope()
255
+
256
+ # Copied from transformers.models.persimmon.modeling_persimmon.PersimmonAttention._init_rope with Persimmon->StableLm
257
+ def _init_rope(self):
258
+ if self.config.rope_scaling is None:
259
+ self.rotary_emb = StableLmRotaryEmbedding(
260
+ int(self.partial_rotary_factor * self.head_dim),
261
+ max_position_embeddings=self.max_position_embeddings,
262
+ base=self.rope_theta,
263
+ )
264
+ else:
265
+ scaling_type = self.config.rope_scaling["type"]
266
+ scaling_factor = self.config.rope_scaling["factor"]
267
+ if scaling_type == "linear":
268
+ self.rotary_emb = StableLmLinearScalingRotaryEmbedding(
269
+ int(self.partial_rotary_factor * self.head_dim),
270
+ max_position_embeddings=self.max_position_embeddings,
271
+ scaling_factor=scaling_factor,
272
+ base=self.rope_theta,
273
+ )
274
+ elif scaling_type == "dynamic":
275
+ self.rotary_emb = StableLmDynamicNTKScalingRotaryEmbedding(
276
+ int(self.partial_rotary_factor * self.head_dim),
277
+ max_position_embeddings=self.max_position_embeddings,
278
+ scaling_factor=scaling_factor,
279
+ base=self.rope_theta,
280
+ )
281
+ else:
282
+ raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
283
+
284
+ def forward(
285
+ self,
286
+ hidden_states: torch.Tensor,
287
+ attention_mask: Optional[torch.Tensor] = None,
288
+ position_ids: Optional[torch.LongTensor] = None,
289
+ past_key_value: Optional[Cache] = None,
290
+ output_attentions: bool = False,
291
+ use_cache: bool = False,
292
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
293
+ bsz, q_len, _ = hidden_states.size()
294
+
295
+ query_states = self.q_proj(hidden_states)
296
+ key_states = self.k_proj(hidden_states)
297
+ value_states = self.v_proj(hidden_states)
298
+
299
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
300
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
301
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
302
+
303
+ kv_seq_len = key_states.shape[-2]
304
+ if past_key_value is not None:
305
+ if self.layer_idx is None:
306
+ raise ValueError(
307
+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} "
308
+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
309
+ "with a layer index."
310
+ )
311
+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
312
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
313
+
314
+ # Partial rotary embedding
315
+ query_rot, query_pass = (
316
+ query_states[..., : self.rotary_emb.dim],
317
+ query_states[..., self.rotary_emb.dim :],
318
+ )
319
+ key_rot, key_pass = (
320
+ key_states[..., : self.rotary_emb.dim],
321
+ key_states[..., self.rotary_emb.dim :],
322
+ )
323
+ # [batch_size, num_heads, seq_length, int(head_dim * config.partial_rotary_factor)]
324
+ query_rot, key_rot = apply_rotary_pos_emb(query_rot, key_rot, cos, sin, position_ids)
325
+
326
+ # [batch_size, num_heads, seq_length, head_dim]
327
+ query_states = torch.cat((query_rot, query_pass), dim=-1)
328
+ key_states = torch.cat((key_rot, key_pass), dim=-1)
329
+
330
+ if past_key_value is not None:
331
+ # Specific to RoPE models with partial rotation
332
+ cache_kwargs = {"sin": sin, "cos": cos, "partial_rotation_size": self.rotary_emb.dim}
333
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
334
+
335
+ # Repeat k/v heads if n_kv_heads < n_heads
336
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
337
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
338
+
339
+ attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
340
+
341
+ if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
342
+ raise ValueError(
343
+ f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
344
+ f" {attn_weights.size()}"
345
+ )
346
+
347
+ if attention_mask is not None:
348
+ if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
349
+ raise ValueError(
350
+ f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
351
+ )
352
+ attn_weights = attn_weights + attention_mask
353
+
354
+ # upcast attention to fp32
355
+ attn_weights = nn.functional.softmax(attn_weights, dtype=torch.float32, dim=-1).to(query_states.dtype)
356
+ attn_weights = self.attention_dropout(attn_weights)
357
+
358
+ attn_output = torch.matmul(attn_weights, value_states)
359
+
360
+ if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
361
+ raise ValueError(
362
+ f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
363
+ f" {attn_output.size()}"
364
+ )
365
+
366
+ attn_output = attn_output.transpose(1, 2).contiguous()
367
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
368
+
369
+ attn_output = self.o_proj(attn_output)
370
+
371
+ if not output_attentions:
372
+ attn_weights = None
373
+
374
+ return attn_output, attn_weights, past_key_value
375
+
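The partial rotary step in `forward` slices the last axis of each head at `self.rotary_emb.dim`, rotates only the first slice, and concatenates the untouched remainder back. A small sketch of that slicing on one head vector, assuming a hypothetical head size of 8 and a `partial_rotary_factor` of 0.25 (neither value is taken from this commit):

```python
def split_partial_rotary(head_vector, partial_rotary_factor):
    """Split one head vector into the part that is rotated and the part passed through."""
    rotary_dim = int(partial_rotary_factor * len(head_vector))
    return head_vector[:rotary_dim], head_vector[rotary_dim:]

head = list(range(8))  # stand-in for one head of size head_dim = 8
rot, passthrough = split_partial_rotary(head, 0.25)

# After the rotary embedding is applied to `rot`, the two parts are joined again,
# so the head keeps its original size.
recombined = rot + passthrough
```

Only the first `int(partial_rotary_factor * head_dim)` features carry positional information; the rest reach the attention scores unrotated.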
376
+
377
+ class StableLmSdpaAttention(StableLmAttention):
378
+ def forward(
379
+ self,
380
+ hidden_states: torch.Tensor,
381
+ attention_mask: Optional[torch.Tensor] = None,
382
+ position_ids: Optional[torch.LongTensor] = None,
383
+ past_key_value: Optional[Cache] = None,
384
+ output_attentions: bool = False,
385
+ use_cache: bool = False,
386
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
387
+ if output_attentions:
388
+ # TODO: Improve this warning with e.g. `model.config.attn_implementation = "manual"` once this is implemented.
389
+ logger.warning_once(
390
+ "StableLmModel is using StableLmSdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to the manual attention implementation, "
391
+ 'but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
392
+ )
393
+ return super().forward(
394
+ hidden_states=hidden_states,
395
+ attention_mask=attention_mask,
396
+ position_ids=position_ids,
397
+ past_key_value=past_key_value,
398
+ output_attentions=output_attentions,
399
+ use_cache=use_cache,
400
+ )
401
+
402
+ bsz, q_len, _ = hidden_states.size()
403
+
404
+ query_states = self.q_proj(hidden_states)
405
+ key_states = self.k_proj(hidden_states)
406
+ value_states = self.v_proj(hidden_states)
407
+
408
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
409
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
410
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
411
+
412
+ kv_seq_len = key_states.shape[-2]
413
+ if past_key_value is not None:
414
+ if self.layer_idx is None:
415
+ raise ValueError(
416
+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} "
417
+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
418
+ "with a layer index."
419
+ )
420
+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
421
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
422
+
423
+ # Partial rotary embedding
424
+ query_rot, query_pass = (
425
+ query_states[..., : self.rotary_emb.dim],
426
+ query_states[..., self.rotary_emb.dim :],
427
+ )
428
+ key_rot, key_pass = (
429
+ key_states[..., : self.rotary_emb.dim],
430
+ key_states[..., self.rotary_emb.dim :],
431
+ )
432
+ # [batch_size, num_heads, seq_length, int(head_dim * config.partial_rotary_factor)]
433
+ query_rot, key_rot = apply_rotary_pos_emb(query_rot, key_rot, cos, sin, position_ids)
434
+
435
+ # [batch_size, num_heads, seq_length, head_dim]
436
+ query_states = torch.cat((query_rot, query_pass), dim=-1)
437
+ key_states = torch.cat((key_rot, key_pass), dim=-1)
438
+
439
+ if past_key_value is not None:
440
+ # Specific to RoPE models with partial rotation
441
+ cache_kwargs = {"sin": sin, "cos": cos, "partial_rotation_size": self.rotary_emb.dim}
442
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
443
+
444
+ # Repeat k/v heads if n_kv_heads < n_heads
445
+ key_states = repeat_kv(key_states, self.num_key_value_groups)
446
+ value_states = repeat_kv(value_states, self.num_key_value_groups)
447
+
448
+ # SDPA with memory-efficient backend is currently (torch==2.1.2) bugged with non-contiguous inputs with custom attn_mask,
449
+ # Reference: https://github.com/pytorch/pytorch/issues/112577.
450
+ if query_states.device.type == "cuda" and attention_mask is not None:
451
+ query_states = query_states.contiguous()
452
+ key_states = key_states.contiguous()
453
+ value_states = value_states.contiguous()
454
+
455
+ attn_output = torch.nn.functional.scaled_dot_product_attention(
456
+ query_states,
457
+ key_states,
458
+ value_states,
459
+ attn_mask=attention_mask,
460
+ dropout_p=self.attention_dropout.p if self.training else 0.0,
461
+ # The q_len > 1 is necessary to match with AttentionMaskConverter.to_causal_4d that does not create a causal mask in case q_len == 1.
462
+ is_causal=self.is_causal and attention_mask is None and q_len > 1,
463
+ )
464
+
465
+ attn_output = attn_output.transpose(1, 2).contiguous()
466
+ attn_output = attn_output.view(bsz, q_len, self.hidden_size)
467
+
468
+ attn_output = self.o_proj(attn_output)
469
+
470
+ return attn_output, None, past_key_value
471
+
472
+
473
+ class StableLmFlashAttention2(StableLmAttention):
474
+ """
475
+ StableLM flash attention module. This module inherits from `StableLmAttention` as the weights of the module stay
476
+ untouched. The only required change would be on the forward pass where it needs to correctly call the public API of
477
+ flash attention and deal with padding tokens in case the input contains any of them.
478
+ """
479
+
480
+ # Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2.__init__
481
+ def __init__(self, *args, **kwargs):
482
+ super().__init__(*args, **kwargs)
483
+
484
+ # TODO: Should be removed once Flash Attention for ROCm is bumped to 2.1.
485
+ # flash_attn<2.1 generates top-left aligned causal mask, while what is needed here is bottom-right alignment, that was made default for flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
486
+ # Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
487
+ self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
488
+
489
+ def forward(
490
+ self,
491
+ hidden_states: torch.Tensor,
492
+ attention_mask: Optional[torch.LongTensor] = None,
493
+ position_ids: Optional[torch.LongTensor] = None,
494
+ past_key_value: Optional[Cache] = None,
495
+ output_attentions: bool = False,
496
+ use_cache: bool = False,
497
+ **kwargs,
498
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
499
+ # StableLmFlashAttention2 attention does not support output_attentions
500
+
501
+ output_attentions = False
502
+
503
+ bsz, q_len, _ = hidden_states.size()
504
+
505
+ query_states = self.q_proj(hidden_states)
506
+ key_states = self.k_proj(hidden_states)
507
+ value_states = self.v_proj(hidden_states)
508
+
509
+ # Flash attention requires the input to have the shape
510
+ # batch_size x seq_length x num_heads x head_dim
511
+ # therefore we just need to keep the original shape
512
+ query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
513
+ key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
514
+ value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
515
+
516
+ kv_seq_len = key_states.shape[-2]
517
+ if past_key_value is not None:
518
+ if self.layer_idx is None:
519
+ raise ValueError(
520
+ f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} "
521
+ "for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
522
+ "with a layer index."
523
+ )
524
+ kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
525
+ cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
526
+
527
+ # Partial rotary embedding
528
+ query_rot, query_pass = (
529
+ query_states[..., : self.rotary_emb.dim],
530
+ query_states[..., self.rotary_emb.dim :],
531
+ )
532
+ key_rot, key_pass = (
533
+ key_states[..., : self.rotary_emb.dim],
534
+ key_states[..., self.rotary_emb.dim :],
535
+ )
536
+ query_rot, key_rot = apply_rotary_pos_emb(query_rot, key_rot, cos, sin, position_ids)
537
+
538
+ # [batch_size, num_heads, seq_length, head_dim]
539
+ query_states = torch.cat((query_rot, query_pass), dim=-1)
540
+ key_states = torch.cat((key_rot, key_pass), dim=-1)
541
+
542
+ if past_key_value is not None:
543
+ cache_kwargs = {"sin": sin, "cos": cos, "partial_rotation_size": self.rotary_emb.dim}
544
+ key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
545
+
546
+ # TODO: These transpose are quite inefficient but Flash Attention requires the layout [batch_size, sequence_length, num_heads, head_dim]. We would need to refactor the KV cache
547
+ # to be able to avoid many of these transpose/reshape/view.
548
+ query_states = query_states.transpose(1, 2)
549
+ key_states = key_states.transpose(1, 2)
550
+ value_states = value_states.transpose(1, 2)
551
+
552
+ dropout_rate = self.attention_dropout.p if self.training else 0.0
553
+
554
+ attn_output = self._flash_attention_forward(
555
+ query_states,
556
+ key_states,
557
+ value_states,
558
+ attention_mask,
559
+ q_len,
560
+ dropout=dropout_rate,
561
+ )
562
+
563
+ attn_output = attn_output.reshape(bsz, q_len, self.hidden_size).contiguous()
564
+ attn_output = self.o_proj(attn_output)
565
+
566
+ if not output_attentions:
567
+ attn_weights = None
568
+
569
+ return attn_output, attn_weights, past_key_value
570
+
571
+ # Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2._flash_attention_forward
572
+ def _flash_attention_forward(
573
+ self, query_states, key_states, value_states, attention_mask, query_length, dropout=0.0, softmax_scale=None
574
+ ):
575
+ """
576
+ Calls the forward method of Flash Attention. If the input hidden states contain at least one padding token,
577
+ first unpads the input, then computes the attention scores and pads the final attention scores.
578
+
579
+ Args:
580
+ query_states (`torch.Tensor`):
581
+ Input query states to be passed to Flash Attention API
582
+ key_states (`torch.Tensor`):
583
+ Input key states to be passed to Flash Attention API
584
+ value_states (`torch.Tensor`):
585
+ Input value states to be passed to Flash Attention API
586
+ attention_mask (`torch.Tensor`):
587
+ The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
588
+ position of padding tokens and 1 for the position of non-padding tokens.
589
+ dropout (`float`, *optional*):
590
+ Attention dropout
591
+ softmax_scale (`float`, *optional*):
592
+ The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
593
+ """
594
+ if not self._flash_attn_uses_top_left_mask:
595
+ causal = self.is_causal
596
+ else:
597
+ # TODO: Remove the `query_length != 1` check once Flash Attention for ROCm is bumped to 2.1. For details, please see the comment in LlamaFlashAttention2 __init__.
598
+ causal = self.is_causal and query_length != 1
599
+
600
+ # Contains at least one padding token in the sequence
601
+ if attention_mask is not None:
602
+ batch_size = query_states.shape[0]
603
+ query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
604
+ query_states, key_states, value_states, attention_mask, query_length
605
+ )
606
+
607
+ cu_seqlens_q, cu_seqlens_k = cu_seq_lens
608
+ max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
609
+
610
+ attn_output_unpad = flash_attn_varlen_func(
611
+ query_states,
612
+ key_states,
613
+ value_states,
614
+ cu_seqlens_q=cu_seqlens_q,
615
+ cu_seqlens_k=cu_seqlens_k,
616
+ max_seqlen_q=max_seqlen_in_batch_q,
617
+ max_seqlen_k=max_seqlen_in_batch_k,
618
+ dropout_p=dropout,
619
+ softmax_scale=softmax_scale,
620
+ causal=causal,
621
+ )
622
+
623
+ attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
624
+ else:
625
+ attn_output = flash_attn_func(
626
+ query_states, key_states, value_states, dropout, softmax_scale=softmax_scale, causal=causal
627
+ )
628
+
629
+ return attn_output
630
+
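The `query_length != 1` guard above follows from how the two causal-mask alignments differ when there are fewer queries than keys, as happens when decoding with a KV cache. The sketch below builds both alignments as nested lists; `causal_mask` is a hypothetical helper written for this illustration and is not part of flash-attn or Transformers.

```python
def causal_mask(q_len, k_len, bottom_right=True):
    """Return rows of 1 (may attend) and 0 (masked) for a causal mask."""
    # Bottom-right alignment shifts the diagonal so the last query sees every key.
    offset = k_len - q_len if bottom_right else 0
    return [[1 if j <= i + offset else 0 for j in range(k_len)] for i in range(q_len)]

# Two new query tokens attending over four keys (two cached, two new).
top_left_mask = causal_mask(2, 4, bottom_right=False)
bottom_right_mask = causal_mask(2, 4, bottom_right=True)

# One query token during decoding: top-left alignment hides all but the first key.
single_query_top_left = causal_mask(1, 4, bottom_right=False)
```

With top-left alignment a single decoding query could attend only to the first key, which is why the code turns the causal flag off in that case and lets the query see the whole cache.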
631
+ # Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2._upad_input
632
+ def _upad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length):
633
+ indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask)
634
+ batch_size, kv_seq_len, num_key_value_heads, head_dim = key_layer.shape
635
+
636
+ key_layer = index_first_axis(
637
+ key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
638
+ )
639
+ value_layer = index_first_axis(
640
+ value_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
641
+ )
642
+ if query_length == kv_seq_len:
643
+ query_layer = index_first_axis(
644
+ query_layer.reshape(batch_size * kv_seq_len, self.num_heads, head_dim), indices_k
645
+ )
646
+ cu_seqlens_q = cu_seqlens_k
647
+ max_seqlen_in_batch_q = max_seqlen_in_batch_k
648
+ indices_q = indices_k
649
+ elif query_length == 1:
650
+ max_seqlen_in_batch_q = 1
651
+ cu_seqlens_q = torch.arange(
652
+ batch_size + 1, dtype=torch.int32, device=query_layer.device
653
+ ) # There is a memcpy here, that is very bad.
654
+ indices_q = cu_seqlens_q[:-1]
655
+ query_layer = query_layer.squeeze(1)
656
+ else:
657
+ # The -q_len: slice assumes left padding.
658
+ attention_mask = attention_mask[:, -query_length:]
659
+ query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask)
660
+
661
+ return (
662
+ query_layer,
663
+ key_layer,
664
+ value_layer,
665
+ indices_q,
666
+ (cu_seqlens_q, cu_seqlens_k),
667
+ (max_seqlen_in_batch_q, max_seqlen_in_batch_k),
668
+ )
669
+
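`_upad_input` relies on `_get_unpad_data`, which is not shown in this part of the diff. Judging from how its outputs are used, it returns the flat indices of the non-padding tokens, the cumulative sequence lengths, and the longest sequence length in the batch. The following plain-Python sketch is an assumption about that behaviour, written with nested lists instead of tensors; `unpad_metadata` is a hypothetical name.

```python
from itertools import accumulate

def unpad_metadata(attention_mask):
    """Compute unpadding metadata from a 0/1 padding mask given as a list of rows."""
    seqlens = [sum(row) for row in attention_mask]
    width = len(attention_mask[0])
    # Positions of real tokens in the flattened (batch * seq_len) layout.
    indices = [
        r * width + c
        for r, row in enumerate(attention_mask)
        for c, keep in enumerate(row)
        if keep
    ]
    # Sequence b occupies packed rows cu_seqlens[b]:cu_seqlens[b + 1].
    cu_seqlens = [0] + list(accumulate(seqlens))
    return indices, cu_seqlens, max(seqlens)

mask = [
    [0, 0, 1, 1],  # left-padded sequence with two real tokens
    [1, 1, 1, 1],  # full-length sequence
]
indices, cu_seqlens, max_seqlen = unpad_metadata(mask)
```

The packed layout lets the variable-length kernel skip padding entirely: the six real tokens are stored back to back, and `cu_seqlens` marks where each sequence starts and ends.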
670
+
671
+ ATTENTION_CLASSES = {
672
+ "eager": StableLmAttention,
673
+ "sdpa": StableLmSdpaAttention,
674
+ "flash_attention_2": StableLmFlashAttention2,
675
+ }
676
+
677
+
678
+ class StableLmDecoderLayer(nn.Module):
679
+ def __init__(self, config: StableLmConfig, layer_idx: int):
680
+ super().__init__()
681
+ self.hidden_size = config.hidden_size
682
+ self.self_attn = ATTENTION_CLASSES[config._attn_implementation](config, layer_idx=layer_idx)
683
+ self.mlp = StableLmMLP(config)
684
+ self.input_layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
685
+ self.post_attention_layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
686
+ self.dropout = nn.Dropout(config.hidden_dropout)
687
+
688
+ def forward(
689
+ self,
690
+ hidden_states: torch.Tensor,
691
+ attention_mask: Optional[torch.Tensor] = None,
692
+ position_ids: Optional[torch.LongTensor] = None,
693
+ past_key_value: Optional[Tuple[torch.Tensor]] = None,
694
+ output_attentions: Optional[bool] = False,
695
+ use_cache: Optional[bool] = False,
696
+ ) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
697
+ """
698
+ Args:
699
+ hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
700
+ attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
701
+ `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
702
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
703
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range
704
+ `[0, config.n_positions - 1]`.
705
+
706
+ [What are position IDs?](../glossary#position-ids)
707
+ past_key_value (`Tuple(torch.FloatTensor)`, *optional*):
708
+ cached past key and value projection states
709
+ output_attentions (`bool`, *optional*):
710
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
711
+ returned tensors for more detail.
712
+ use_cache (`bool`, *optional*):
713
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
714
+ (see `past_key_values`).
715
+ """
716
+
717
+ residual = hidden_states
718
+
719
+ hidden_states = self.input_layernorm(hidden_states)
720
+
721
+ # Self Attention
722
+ hidden_states, self_attn_weights, present_key_value = self.self_attn(
723
+ hidden_states=hidden_states,
724
+ attention_mask=attention_mask,
725
+ position_ids=position_ids,
726
+ past_key_value=past_key_value,
727
+ output_attentions=output_attentions,
728
+ use_cache=use_cache,
729
+ )
730
+ hidden_states = residual + hidden_states
731
+
732
+ # Fully Connected
733
+ residual = hidden_states
734
+ hidden_states = self.post_attention_layernorm(hidden_states)
735
+ hidden_states = self.mlp(hidden_states)
736
+
737
+ hidden_states = self.dropout(hidden_states)
738
+ hidden_states = hidden_states + residual
739
+
740
+ outputs = (hidden_states,)
741
+
742
+ if output_attentions:
743
+ outputs += (self_attn_weights,)
744
+
745
+ if use_cache:
746
+ outputs += (present_key_value,)
747
+
748
+ return outputs
749
+
750
+
751
+ STABLELM_START_DOCSTRING = r"""
752
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
753
+ library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
754
+ etc.)
755
+
756
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
757
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
758
+ and behavior.
759
+
760
+ Parameters:
761
+ config ([`StableLmConfig`]):
762
+ Model configuration class with all the parameters of the model. Initializing with a config file does not
763
+ load the weights associated with the model, only the configuration. Check out the
764
+ [`~PreTrainedModel.from_pretrained`] method to load the model weights.
765
+ """
766
+
767
+
768
+ @add_start_docstrings(
769
+ "The bare StableLm Model outputting raw hidden-states without any specific head on top.",
770
+ STABLELM_START_DOCSTRING,
771
+ )
772
+ class StableLmPreTrainedModel(PreTrainedModel):
773
+ config_class = StableLmConfig
774
+ base_model_prefix = "model"
775
+ supports_gradient_checkpointing = True
776
+ _no_split_modules = ["StableLmDecoderLayer"]
777
+ _skip_keys_device_placement = "past_key_values"
778
+ _supports_flash_attn_2 = True
779
+ _supports_cache_class = True
780
+ _supports_sdpa = True
781
+
782
+ def _init_weights(self, module):
783
+ std = self.config.initializer_range
784
+ if isinstance(module, nn.Linear):
785
+ module.weight.data.normal_(mean=0.0, std=std)
786
+ if module.bias is not None:
787
+ module.bias.data.zero_()
788
+ elif isinstance(module, nn.Embedding):
789
+ module.weight.data.normal_(mean=0.0, std=std)
790
+ if module.padding_idx is not None:
791
+ module.weight.data[module.padding_idx].zero_()
792
+
793
+
794
+ STABLELM_INPUTS_DOCSTRING = r"""
795
+ Args:
796
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
797
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
798
+ it.
799
+
800
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
801
+ [`PreTrainedTokenizer.__call__`] for details.
802
+
803
+ [What are input IDs?](../glossary#input-ids)
804
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
805
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
806
+
807
+ - 1 for tokens that are **not masked**,
808
+ - 0 for tokens that are **masked**.
809
+
810
+ [What are attention masks?](../glossary#attention-mask)
811
+
812
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
813
+ [`PreTrainedTokenizer.__call__`] for details.
814
+
815
+ If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
816
+ `past_key_values`).
817
+
818
+ If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
819
+ and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
820
+ information on the default strategy.
821
+
822
+ - 1 indicates the head is **not masked**,
823
+ - 0 indicates the head is **masked**.
824
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
825
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
826
+ config.n_positions - 1]`.
827
+
828
+ [What are position IDs?](../glossary#position-ids)
829
+ past_key_values (`Cache` or `tuple(tuple(torch.FloatTensor))`, *optional*):
830
+ Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
831
+ blocks) that can be used to speed up sequential decoding. This typically consists of the `past_key_values`
832
+ returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
833
+
834
+ Two formats are allowed:
835
+ - a [`~cache_utils.Cache`] instance;
836
+ - Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of
837
+ shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`. This is also known as the legacy
838
+ cache format.
839
+
840
+ The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the
841
+ legacy cache format will be returned.
842
+
843
+ If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't
844
+ have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids`
845
+ of shape `(batch_size, sequence_length)`.
846
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
847
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
848
+ is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
849
+ model's internal embedding lookup matrix.
850
+ use_cache (`bool`, *optional*):
851
+ If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
852
+ `past_key_values`).
853
+ output_attentions (`bool`, *optional*):
854
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
855
+ tensors for more detail.
856
+ output_hidden_states (`bool`, *optional*):
857
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
858
+ more detail.
859
+ return_dict (`bool`, *optional*):
860
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
861
+ """
862
+
863
+
864
+ @add_start_docstrings(
865
+ "The bare StableLm Model outputting raw hidden-states without any specific head on top.",
866
+ STABLELM_START_DOCSTRING,
867
+ )
868
+ class StableLmModel(StableLmPreTrainedModel):
869
+ """
870
+ Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`StableLmDecoderLayer`]
871
+
872
+ Args:
873
+ config: StableLmConfig
874
+ """
875
+
876
+ def __init__(self, config: StableLmConfig):
877
+ super().__init__(config)
878
+ self.padding_idx = config.pad_token_id
879
+ self.vocab_size = config.vocab_size
880
+
881
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
882
+ self.layers = nn.ModuleList(
883
+ [StableLmDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
884
+ )
885
+ self.norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
886
+
887
+ self._attn_implementation = config._attn_implementation
888
+ self.gradient_checkpointing = False
889
+ # Initialize weights and apply final processing
890
+ self.post_init()
891
+
892
+ def get_input_embeddings(self):
893
+ return self.embed_tokens
894
+
895
+ def set_input_embeddings(self, value):
896
+ self.embed_tokens = value
897
+
898
+ @add_start_docstrings_to_model_forward(STABLELM_INPUTS_DOCSTRING)
899
+ def forward(
900
+ self,
901
+ input_ids: torch.LongTensor = None,
902
+ attention_mask: Optional[torch.Tensor] = None,
903
+ position_ids: Optional[torch.LongTensor] = None,
904
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
905
+ inputs_embeds: Optional[torch.FloatTensor] = None,
906
+ use_cache: Optional[bool] = None,
907
+ output_attentions: Optional[bool] = None,
908
+ output_hidden_states: Optional[bool] = None,
909
+ return_dict: Optional[bool] = None,
910
+ ) -> Union[Tuple, BaseModelOutputWithPast]:
911
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
912
+ output_hidden_states = (
913
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
914
+ )
915
+ use_cache = use_cache if use_cache is not None else self.config.use_cache
916
+
917
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
918
+
919
+ # retrieve input_ids and inputs_embeds
920
+ if input_ids is not None and inputs_embeds is not None:
921
+ raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
922
+ elif input_ids is not None:
923
+ batch_size, seq_length = input_ids.shape
924
+ elif inputs_embeds is not None:
925
+ batch_size, seq_length, _ = inputs_embeds.shape
926
+ else:
927
+ raise ValueError("You have to specify either input_ids or inputs_embeds")
928
+
929
+ seq_length_with_past = seq_length
930
+ past_key_values_length = 0
931
+
932
+ if self.gradient_checkpointing and self.training:
933
+ if use_cache:
934
+ logger.warning_once(
935
+ "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
936
+ )
937
+ use_cache = False
938
+
939
+ if use_cache:
940
+ use_legacy_cache = not isinstance(past_key_values, Cache)
941
+ if use_legacy_cache:
942
+ past_key_values = DynamicCache.from_legacy_cache(past_key_values)
943
+ past_key_values_length = past_key_values.get_usable_length(seq_length)
944
+ seq_length_with_past = seq_length_with_past + past_key_values_length
945
+
946
+ if position_ids is None:
947
+ device = input_ids.device if input_ids is not None else inputs_embeds.device
948
+ position_ids = torch.arange(
949
+ past_key_values_length, seq_length + past_key_values_length, dtype=torch.long, device=device
950
+ )
951
+ position_ids = position_ids.unsqueeze(0)
952
+
953
+ if inputs_embeds is None:
954
+ inputs_embeds = self.embed_tokens(input_ids)
955
+ # embed positions
956
+ if self._attn_implementation == "flash_attention_2":
957
+ # 2d mask is passed through the layers
958
+ attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None
959
+ # when output_attentions is set, fall back to the eager attention implementation
960
+ elif self._attn_implementation == "sdpa" and not output_attentions:
961
+ attention_mask = _prepare_4d_causal_attention_mask_for_sdpa(
962
+ attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length
963
+ )
964
+ else:
965
+ # 4d mask is passed through the layers
966
+ attention_mask = _prepare_4d_causal_attention_mask(
967
+ attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length
968
+ )
969
+
970
+ hidden_states = inputs_embeds
971
+
972
+ # decoder layers
973
+ all_hidden_states = () if output_hidden_states else None
974
+ all_self_attns = () if output_attentions else None
975
+ next_decoder_cache = None
976
+
977
+ for decoder_layer in self.layers:
978
+ if output_hidden_states:
979
+ all_hidden_states += (hidden_states,)
980
+
981
+ if self.gradient_checkpointing and self.training:
982
+ layer_outputs = self._gradient_checkpointing_func(
983
+ decoder_layer.__call__,
984
+ hidden_states,
985
+ attention_mask,
986
+ position_ids,
987
+ past_key_values,
988
+ output_attentions,
989
+ )
990
+ else:
991
+ layer_outputs = decoder_layer(
992
+ hidden_states,
993
+ attention_mask=attention_mask,
994
+ position_ids=position_ids,
995
+ past_key_value=past_key_values,
996
+ output_attentions=output_attentions,
997
+ use_cache=use_cache,
998
+ )
999
+
1000
+ hidden_states = layer_outputs[0]
1001
+
1002
+ if use_cache:
1003
+ next_decoder_cache = layer_outputs[2 if output_attentions else 1]
1004
+
1005
+ if output_attentions:
1006
+ all_self_attns += (layer_outputs[1],)
1007
+
1008
+ hidden_states = self.norm(hidden_states)
1009
+
1010
+ # add hidden states from the last decoder layer
1011
+ if output_hidden_states:
1012
+ all_hidden_states += (hidden_states,)
1013
+
1014
+ next_cache = None
1015
+ if use_cache:
1016
+ next_cache = next_decoder_cache.to_legacy_cache() if use_legacy_cache else next_decoder_cache
1017
+
1018
+ if not return_dict:
1019
+ return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
1020
+ return BaseModelOutputWithPast(
1021
+ last_hidden_state=hidden_states,
1022
+ past_key_values=next_cache,
1023
+ hidden_states=all_hidden_states,
1024
+ attentions=all_self_attns,
1025
+ )
1026
+
1027
+
1028
+ # Copied from transformers.models.persimmon.modeling_persimmon.PersimmonForCausalLM with PERSIMMON->STABLELM,Persimmon->StableLm
1029
+ class StableLmForCausalLM(StableLmPreTrainedModel):
1030
+ _tied_weights_keys = ["lm_head.weight"]
1031
+
1032
+ # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.__init__ with LLAMA->STABLELM,Llama->StableLm
1033
+ def __init__(self, config):
1034
+ super().__init__(config)
1035
+ self.model = StableLmModel(config)
1036
+ self.vocab_size = config.vocab_size
1037
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
1038
+
1039
+ # Initialize weights and apply final processing
1040
+ self.post_init()
1041
+
1042
+ # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.get_input_embeddings
1043
+ def get_input_embeddings(self):
1044
+ return self.model.embed_tokens
1045
+
1046
+ # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.set_input_embeddings
1047
+ def set_input_embeddings(self, value):
1048
+ self.model.embed_tokens = value
1049
+
1050
+ # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.get_output_embeddings
1051
+ def get_output_embeddings(self):
1052
+ return self.lm_head
1053
+
1054
+ # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.set_output_embeddings
1055
+ def set_output_embeddings(self, new_embeddings):
1056
+ self.lm_head = new_embeddings
1057
+
1058
+ # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.set_decoder
1059
+ def set_decoder(self, decoder):
1060
+ self.model = decoder
1061
+
1062
+ # Copied from transformers.models.llama.modeling_llama.LlamaForCausalLM.get_decoder
1063
+ def get_decoder(self):
1064
+ return self.model
1065
+
1066
+ @add_start_docstrings_to_model_forward(STABLELM_INPUTS_DOCSTRING)
1067
+ @replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
1068
+ # Ignore copy
1069
+ def forward(
1070
+ self,
1071
+ input_ids: torch.LongTensor = None,
1072
+ attention_mask: Optional[torch.Tensor] = None,
1073
+ position_ids: Optional[torch.LongTensor] = None,
1074
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1075
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1076
+ labels: Optional[torch.LongTensor] = None,
1077
+ use_cache: Optional[bool] = None,
1078
+ output_attentions: Optional[bool] = None,
1079
+ output_hidden_states: Optional[bool] = None,
1080
+ return_dict: Optional[bool] = None,
1081
+ ) -> Union[Tuple, CausalLMOutputWithPast]:
1082
+ r"""
1083
+ Args:
1084
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
1085
+ Labels for computing the causal language modeling loss. Indices should either be in `[0, ...,
1086
+ config.vocab_size - 1]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
1087
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size - 1]`.
1088
+
1089
+ Returns:
1090
+
1091
+ Example:
1092
+
1093
+ ```python
1094
+ >>> from transformers import AutoTokenizer, StableLmForCausalLM
1095
+
1096
+ >>> model = StableLmForCausalLM.from_pretrained("stabilityai/stablelm-3b-4e1t")
1097
+ >>> tokenizer = AutoTokenizer.from_pretrained("stabilityai/stablelm-3b-4e1t")
1098
+
1099
+ >>> prompt = "The weather is always wonderful in"
1100
+ >>> inputs = tokenizer(prompt, return_tensors="pt")
1101
+
1102
+ >>> # Generate
1103
+ >>> generate_ids = model.generate(inputs.input_ids, max_length=30)
1104
+ >>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
1105
+ 'The weather is always wonderful in the summer in the city of San Diego. The city is located on the coast of the Pacific Ocean and is surrounded by'
1106
+ ```"""
1107
+
1108
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1109
+ output_hidden_states = (
1110
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1111
+ )
1112
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1113
+
1114
+ outputs = self.model(
1115
+ input_ids=input_ids,
1116
+ attention_mask=attention_mask,
1117
+ position_ids=position_ids,
1118
+ past_key_values=past_key_values,
1119
+ inputs_embeds=inputs_embeds,
1120
+ use_cache=use_cache,
1121
+ output_attentions=output_attentions,
1122
+ output_hidden_states=output_hidden_states,
1123
+ return_dict=return_dict,
1124
+ )
1125
+
1126
+ hidden_states = outputs[0]
1127
+ logits = self.lm_head(hidden_states)
1128
+
1129
+ loss = None
1130
+ if labels is not None:
1131
+ # Shift so that tokens < n predict n
1132
+ shift_logits = logits[..., :-1, :].contiguous()
1133
+ shift_labels = labels[..., 1:].contiguous()
1134
+ # Flatten the tokens
1135
+ loss_fct = CrossEntropyLoss()
1136
+ shift_logits = shift_logits.view(-1, self.config.vocab_size)
1137
+ shift_labels = shift_labels.view(-1)
1138
+ # Enable model parallelism
1139
+ shift_labels = shift_labels.to(shift_logits.device)
1140
+ loss = loss_fct(shift_logits, shift_labels)
1141
+
1142
+ if not return_dict:
1143
+ output = (logits,) + outputs[1:]
1144
+ return (loss,) + output if loss is not None else output
1145
+
1146
+ return CausalLMOutputWithPast(
1147
+ loss=loss,
1148
+ logits=logits,
1149
+ past_key_values=outputs.past_key_values,
1150
+ hidden_states=outputs.hidden_states,
1151
+ attentions=outputs.attentions,
1152
+ )
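Annotation (not part of the commit): the loss branch above drops the last logit and the first label, so the prediction at position t is scored against the token at position t+1. Below is a minimal pure-Python sketch of that shift, assuming mean reduction and the `-100` ignore index that `CrossEntropyLoss` uses by default. `shifted_cross_entropy` is a hypothetical helper for illustration, not a function from this file.

```python
import math


def shifted_cross_entropy(logits, labels, ignore_index=-100):
    """Next-token loss for one sequence (hypothetical helper, illustration only).

    logits: list of per-position score lists, one list of vocab scores per token.
    labels: list of token ids with the same length as logits.
    """
    shift_logits = logits[:-1]  # the last position has no next token to predict
    shift_labels = labels[1:]   # the first token is never a prediction target
    losses = []
    for scores, target in zip(shift_logits, shift_labels):
        if target == ignore_index:
            continue  # ignored positions do not contribute to the mean
        log_z = math.log(sum(math.exp(s) for s in scores))
        losses.append(log_z - scores[target])  # negative log-softmax at the target
    # Assumes at least one label is not ignored; otherwise this divides by zero.
    return sum(losses) / len(losses)


# Uniform scores over a 4-token vocabulary: every target has probability 1/4.
uniform = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
loss = shifted_cross_entropy(uniform, [1, 2, 3])
```

With uniform scores the loss equals ln 4 regardless of the labels, which gives a quick sanity check of the shift and the averaging.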
1153
+
1154
+ def prepare_inputs_for_generation(
1155
+ self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
1156
+ ):
1157
+ if past_key_values is not None:
1158
+ if isinstance(past_key_values, Cache):
1159
+ cache_length = past_key_values.get_seq_length()
1160
+ past_length = past_key_values.seen_tokens
1161
+ max_cache_length = past_key_values.get_max_length()
1162
+ else:
1163
+ cache_length = past_length = past_key_values[0][0].shape[2]
1164
+ max_cache_length = None
1165
+
1166
+ # Keep only the unprocessed tokens:
1167
+ # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
1168
+ # some of the inputs are exclusively passed as part of the cache (e.g. when passing inputs_embeds as
1169
+ # input)
1170
+ if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
1171
+ input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
1172
+ # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
1173
+ # input_ids based on the past_length.
1174
+ elif past_length < input_ids.shape[1]:
1175
+ input_ids = input_ids[:, past_length:]
1176
+ # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
1177
+
1178
+ # If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
1179
+ if (
1180
+ max_cache_length is not None
1181
+ and attention_mask is not None
1182
+ and cache_length + input_ids.shape[1] > max_cache_length
1183
+ ):
1184
+ attention_mask = attention_mask[:, -max_cache_length:]
1185
+
1186
+ position_ids = kwargs.get("position_ids", None)
1187
+ if attention_mask is not None and position_ids is None:
1188
+ # create position_ids on the fly for batch generation
1189
+ position_ids = attention_mask.long().cumsum(-1) - 1
1190
+ position_ids.masked_fill_(attention_mask == 0, 1)
1191
+ if past_key_values:
1192
+ position_ids = position_ids[:, -input_ids.shape[1] :]
1193
+
1194
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
1195
+ if inputs_embeds is not None and past_key_values is None:
1196
+ model_inputs = {"inputs_embeds": inputs_embeds}
1197
+ else:
1198
+ model_inputs = {"input_ids": input_ids}
1199
+
1200
+ model_inputs.update(
1201
+ {
1202
+ "position_ids": position_ids,
1203
+ "past_key_values": past_key_values,
1204
+ "use_cache": kwargs.get("use_cache"),
1205
+ "attention_mask": attention_mask,
1206
+ }
1207
+ )
1208
+ return model_inputs
1209
+
1210
+ @staticmethod
1211
+ def _reorder_cache(past_key_values, beam_idx):
1212
+ reordered_past = ()
1213
+ for layer_past in past_key_values:
1214
+ reordered_past += (
1215
+ tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
1216
+ )
1217
+ return reordered_past
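Annotation (not part of the commit): `prepare_inputs_for_generation` above derives `position_ids` from the attention mask with a cumulative sum, then overwrites padded positions with a dummy value of 1. The sketch below reproduces that arithmetic for a single row in plain Python. `position_ids_from_mask` is a hypothetical name, and the list-based input stands in for the tensor used in the real code.

```python
from itertools import accumulate


def position_ids_from_mask(attention_mask):
    """Mirror `mask.cumsum(-1) - 1` followed by `masked_fill_(mask == 0, 1)` for one row."""
    running = accumulate(attention_mask)  # cumulative count of real (non-pad) tokens
    return [1 if m == 0 else c - 1 for m, c in zip(attention_mask, running)]


# Two left-padding slots followed by three real tokens.
left_padded = [0, 0, 1, 1, 1]
ids = position_ids_from_mask(left_padded)
```

The real tokens receive positions 0, 1, 2 no matter how much left padding precedes them, which is why batched generation with left padding keeps positions aligned across rows.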
1218
+
1219
+
1220
+ @add_start_docstrings(
1221
+ """
1222
+ The StableLm transformer with a sequence classification head on top (linear layer).
1223
+
1224
+ [`StableLmForSequenceClassification`] uses the last token in order to do the classification, as other causal
1225
+ models (e.g. GPT-2) do.
1226
+
1227
+ Since it does classification on the last token, it needs to know the position of the last token. If a
1228
+ `pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
1229
+ no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
1230
+ padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
1231
+ each row of the batch).
1232
+ """,
1233
+ STABLELM_START_DOCSTRING,
1234
+ )
1235
+ # Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with LLAMA->STABLELM,Llama->StableLm
1236
+ class StableLmForSequenceClassification(StableLmPreTrainedModel):
1237
+ def __init__(self, config):
1238
+ super().__init__(config)
1239
+ self.num_labels = config.num_labels
1240
+ self.model = StableLmModel(config)
1241
+ self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
1242
+
1243
+ # Initialize weights and apply final processing
1244
+ self.post_init()
1245
+
1246
+ def get_input_embeddings(self):
1247
+ return self.model.embed_tokens
1248
+
1249
+ def set_input_embeddings(self, value):
1250
+ self.model.embed_tokens = value
1251
+
1252
+ @add_start_docstrings_to_model_forward(STABLELM_INPUTS_DOCSTRING)
1253
+ def forward(
1254
+ self,
1255
+ input_ids: torch.LongTensor = None,
1256
+ attention_mask: Optional[torch.Tensor] = None,
1257
+ position_ids: Optional[torch.LongTensor] = None,
1258
+ past_key_values: Optional[List[torch.FloatTensor]] = None,
1259
+ inputs_embeds: Optional[torch.FloatTensor] = None,
1260
+ labels: Optional[torch.LongTensor] = None,
1261
+ use_cache: Optional[bool] = None,
1262
+ output_attentions: Optional[bool] = None,
1263
+ output_hidden_states: Optional[bool] = None,
1264
+ return_dict: Optional[bool] = None,
1265
+ ) -> Union[Tuple, SequenceClassifierOutputWithPast]:
1266
+ r"""
1267
+ labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
1268
+ Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
1269
+ config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (mean squared error); if
1270
+ `config.num_labels > 1` a classification loss is computed (Cross-Entropy).
1271
+ """
1272
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1273
+
1274
+ transformer_outputs = self.model(
1275
+ input_ids,
1276
+ attention_mask=attention_mask,
1277
+ position_ids=position_ids,
1278
+ past_key_values=past_key_values,
1279
+ inputs_embeds=inputs_embeds,
1280
+ use_cache=use_cache,
1281
+ output_attentions=output_attentions,
1282
+ output_hidden_states=output_hidden_states,
1283
+ return_dict=return_dict,
1284
+ )
1285
+ hidden_states = transformer_outputs[0]
1286
+ logits = self.score(hidden_states)
1287
+
1288
+ if input_ids is not None:
1289
+ batch_size = input_ids.shape[0]
1290
+ else:
1291
+ batch_size = inputs_embeds.shape[0]
1292
+
1293
+ if self.config.pad_token_id is None and batch_size != 1:
1294
+ raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
1295
+ if self.config.pad_token_id is None:
1296
+ sequence_lengths = -1
1297
+ else:
1298
+ if input_ids is not None:
1299
+ # if no pad token found, use modulo instead of reverse indexing for ONNX compatibility
1300
+ sequence_lengths = torch.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1
1301
+ sequence_lengths = sequence_lengths % input_ids.shape[-1]
1302
+ sequence_lengths = sequence_lengths.to(logits.device)
1303
+ else:
1304
+ sequence_lengths = -1
1305
+
1306
+ pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]
1307
+
1308
+ loss = None
1309
+ if labels is not None:
1310
+ labels = labels.to(logits.device)
1311
+ if self.config.problem_type is None:
1312
+ if self.num_labels == 1:
1313
+ self.config.problem_type = "regression"
1314
+ elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
1315
+ self.config.problem_type = "single_label_classification"
1316
+ else:
1317
+ self.config.problem_type = "multi_label_classification"
1318
+
1319
+ if self.config.problem_type == "regression":
1320
+ loss_fct = MSELoss()
1321
+ if self.num_labels == 1:
1322
+ loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
1323
+ else:
1324
+ loss = loss_fct(pooled_logits, labels)
1325
+ elif self.config.problem_type == "single_label_classification":
1326
+ loss_fct = CrossEntropyLoss()
1327
+ loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
1328
+ elif self.config.problem_type == "multi_label_classification":
1329
+ loss_fct = BCEWithLogitsLoss()
1330
+ loss = loss_fct(pooled_logits, labels)
1331
+ if not return_dict:
1332
+ output = (pooled_logits,) + transformer_outputs[1:]
1333
+ return ((loss,) + output) if loss is not None else output
1334
+
1335
+ return SequenceClassifierOutputWithPast(
1336
+ loss=loss,
1337
+ logits=pooled_logits,
1338
+ past_key_values=transformer_outputs.past_key_values,
1339
+ hidden_states=transformer_outputs.hidden_states,
1340
+ attentions=transformer_outputs.attentions,
1341
+ )
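Annotation (not part of the commit): the classification head above pools the logits at the last non-padding token. It locates that token by taking the index of the first pad token, subtracting one, and wrapping with a modulo so that rows without any padding resolve to the final position. Here is a minimal sketch of that index arithmetic for one row, assuming right padding. `last_token_index` is a hypothetical helper name.

```python
def last_token_index(input_ids, pad_token_id):
    """Index of the last non-pad token in one right-padded row (illustration only)."""
    is_pad = [int(t == pad_token_id) for t in input_ids]
    # Equivalent of argmax over the 0/1 pad indicator: first pad position,
    # or 0 when the row contains no pad token at all.
    first_pad = is_pad.index(max(is_pad))
    # The modulo turns -1 (no padding found) into the last position.
    return (first_pad - 1) % len(input_ids)


padded_row = [5, 6, 7, 0, 0]
full_row = [5, 6, 7]
```

One caveat, stated as an observation rather than documented behaviour: in this repository the pad token and the EOS token are both `<|endoftext|>`, so a row that contains an EOS token before its padding would be pooled at the token just before that EOS.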
special_tokens_map.json ADDED
@@ -0,0 +1,40 @@
+ {
+   "additional_special_tokens": [
+     "<|reg_extra|>",
+     "<|endoftext|>",
+     "<|fim_prefix|>",
+     "<|fim_middle|>",
+     "<|fim_suffix|>",
+     "<|fim_pad|>",
+     "<gh_stars>",
+     "<filename>",
+     "<issue_start>",
+     "<issue_comment>",
+     "<issue_closed>",
+     "<jupyter_start>",
+     "<jupyter_text>",
+     "<jupyter_code>",
+     "<jupyter_output>",
+     "<empty_output>",
+     "<commit_before>",
+     "<commit_msg>",
+     "<commit_after>",
+     "<reponame>",
+     "<|endofprompt|>",
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|pause|>",
+     "<|reg0|>",
+     "<|reg1|>",
+     "<|reg2|>",
+     "<|reg3|>",
+     "<|reg4|>",
+     "<|reg5|>",
+     "<|reg6|>",
+     "<|reg7|>",
+     "<|extra0|>"
+   ],
+   "bos_token": "<|endoftext|>",
+   "eos_token": "<|endoftext|>",
+   "unk_token": "<|endoftext|>"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,46 @@
+ {
+   "add_prefix_space": false,
+   "additional_special_tokens": [
+     "<|reg_extra|>",
+     "<|endoftext|>",
+     "<|fim_prefix|>",
+     "<|fim_middle|>",
+     "<|fim_suffix|>",
+     "<|fim_pad|>",
+     "<gh_stars>",
+     "<filename>",
+     "<issue_start>",
+     "<issue_comment>",
+     "<issue_closed>",
+     "<jupyter_start>",
+     "<jupyter_text>",
+     "<jupyter_code>",
+     "<jupyter_output>",
+     "<empty_output>",
+     "<commit_before>",
+     "<commit_msg>",
+     "<commit_after>",
+     "<reponame>",
+     "<|endofprompt|>",
+     "<|im_start|>",
+     "<|im_end|>",
+     "<|pause|>",
+     "<|reg0|>",
+     "<|reg1|>",
+     "<|reg2|>",
+     "<|reg3|>",
+     "<|reg4|>",
+     "<|reg5|>",
+     "<|reg6|>",
+     "<|reg7|>",
+     "<|extra0|>"
+   ],
+   "bos_token": "<|endoftext|>",
+   "chat_template": "{% for message in messages %}\n{% if message['role'] == 'user' %}\n{{ '<|user|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'system' %}\n{{ '<|system|>\n' + message['content'] + eos_token }}\n{% elif message['role'] == 'assistant' %}\n{{ '<|assistant|>\n' + message['content'] + eos_token }}\n{% endif %}\n{% if loop.last and add_generation_prompt %}\n{{ '<|assistant|>' }}\n{% endif %}\n{% endfor %}",
+   "clean_up_tokenization_spaces": true,
+   "eos_token": "<|endoftext|>",
+   "tokenizer_class": "GPT2Tokenizer",
+   "model_max_length": 2048,
+   "pad_token": "<|endoftext|>",
+   "unk_token": "<|endoftext|>"
+ }
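Annotation (not part of the commit): the `chat_template` entry above wraps each message in a `<|role|>` header followed by the content and the EOS token, and optionally appends an `<|assistant|>` header to prompt a reply. The sketch below re-implements that logic with plain string formatting so the resulting layout is easy to inspect. `render_chat` is a hypothetical function, and the newline placement assumes the Jinja environment trims the newline after block tags; the authoritative rendering is the tokenizer's own `apply_chat_template`.

```python
EOS = "<|endoftext|>"  # eos_token from the config above


def render_chat(messages, add_generation_prompt=False, eos_token=EOS):
    """Approximate the template with plain Python (illustration only)."""
    parts = []
    for message in messages:
        role = message["role"]
        # The template only handles these three roles; others render nothing.
        if role in ("user", "system", "assistant"):
            parts.append(f"<|{role}|>\n{message['content']}{eos_token}\n")
    # The template emits the generation prompt on the last loop iteration,
    # so an empty message list produces no prompt at all.
    if add_generation_prompt and messages:
        parts.append("<|assistant|>\n")
    return "".join(parts)


prompt = render_chat(
    [{"role": "user", "content": "Which famous math number begins with 1.6 ...?"}],
    add_generation_prompt=True,
)
```

Under those assumptions the rendered prompt matches the instruction format shown in the README: a user turn terminated by `<|endoftext|>`, then an open `<|assistant|>` turn.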
vocab.json ADDED
The diff for this file is too large to render. See raw diff