TKDKid1000 committed on
Commit
c2fc08c
1 Parent(s): e441d80

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,10 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ phi-1_5-Q2_K.gguf filter=lfs diff=lfs merge=lfs -text
37
+ phi-1_5-Q3_K_M.gguf filter=lfs diff=lfs merge=lfs -text
38
+ phi-1_5-Q4_K_M.gguf filter=lfs diff=lfs merge=lfs -text
39
+ phi-1_5-Q5_K_M.gguf filter=lfs diff=lfs merge=lfs -text
40
+ phi-1_5-Q6_K.gguf filter=lfs diff=lfs merge=lfs -text
41
+ phi-1_5-Q8_0.gguf filter=lfs diff=lfs merge=lfs -text
42
+ phi-1_5-f16.gguf filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,167 @@
1
+ ---
2
+ inference: false
3
+ license: other
4
+ license_name: microsoft-research-license
5
+ license_link: https://huggingface.co/microsoft/phi-1_5/resolve/main/Research%20License.docx
6
+ language:
7
+ - en
8
+ pipeline_tag: text-generation
9
+ tags:
10
+ - nlp
11
+ - code
12
+ ---
13
+ ## Model Summary
14
+
15
+ The language model Phi-1.5 is a Transformer with **1.3 billion** parameters. It was trained using the same data sources as [phi-1](https://huggingface.co/microsoft/phi-1), augmented with a new data source that consists of various NLP synthetic texts. When assessed against benchmarks testing common sense, language understanding, and logical reasoning, Phi-1.5 demonstrates nearly state-of-the-art performance among models with fewer than 10 billion parameters.
16
+
17
+ We **did not** fine-tune Phi-1.5 either for **instruction following or through reinforcement learning from human feedback**. The intention behind crafting this open-source model is to provide the research community with a non-restricted small model to explore vital safety challenges, such as reducing toxicity, understanding societal biases, enhancing controllability, and more.
18
+
19
+ For a safer model release, we exclude generic web-crawl data sources such as common-crawl from the training. This strategy prevents direct exposure to potentially harmful online content, enhancing the model's safety without RLHF. However, the model is still vulnerable to generating harmful content. We hope the model can help the research community to further study the safety of language models.
20
+
21
+ Phi-1.5 can write poems, draft emails, create stories, summarize texts, write Python code (such as downloading a Hugging Face transformer model), etc.
22
+
23
+ ## Intended Uses
24
+ Given the nature of the training data, Phi-1.5 is best suited for prompts using the QA format, the chat format, and the code format. Note that Phi-1.5, being a base model, often produces irrelevant text following the main answer. In the following example, we've truncated the answer for illustrative purposes only.
25
+
26
+ ### QA Format:
27
+
28
+ ```markdown
29
+ Write a detailed analogy between mathematics and a lighthouse.
30
+
31
+ Answer: Mathematics is like a lighthouse, guiding us through the vast ocean of numbers and calculations. Just as a lighthouse illuminates the darkness, mathematics provides us with a clear path to navigate through complex problems. It helps us make sense of the world around us, just like a lighthouse helps ships find their way home.
32
+ ```
33
+ where the model generates the text after "Answer:".
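A minimal sketch of how a QA-format prompt might be assembled, and how text trailing the main answer could be trimmed. The helper names `build_qa_prompt` and `truncate_answer`, and the choice to cut at the first blank line, are illustrative assumptions and are not prescribed by the README or the model code.

```python
def build_qa_prompt(question: str) -> str:
    """Return the question followed by the "Answer:" cue the model completes."""
    return f"{question.strip()}\n\nAnswer:"


def truncate_answer(generated: str, prompt: str) -> str:
    """Keep only the first paragraph produced after the prompt.

    Assumption: the main answer ends at the first blank line; a base model
    may keep generating unrelated text after that point.
    """
    completion = generated[len(prompt):] if generated.startswith(prompt) else generated
    return completion.strip().split("\n\n", 1)[0].strip()


prompt = build_qa_prompt("Write a detailed analogy between mathematics and a lighthouse.")
# Stand-in for decoded model output (no model is loaded in this sketch).
fake_output = prompt + " Mathematics is like a lighthouse.\n\nExercise 1: unrelated text"
print(truncate_answer(fake_output, prompt))
```

Cutting at the first blank line is only a heuristic; a stop sequence or a token budget may suit other prompts better.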
34
+
35
+ ### Chat Format:
36
+
37
+ ```markdown
38
+ Alice: I don't know why, I'm struggling to maintain focus while studying. Any suggestions?
39
+
40
+ Bob: Have you tried using a timer? It can help you stay on track and avoid distractions.
41
+
42
+ Alice: That's a good idea. I'll give it a try.
43
+
44
+ Charlie: Another thing that can help is to break up your study sessions into smaller chunks. It's easier to concentrate on one thing at a time.
45
+
46
+ Alice: That makes sense. I'll try that too.
47
+
48
+ Bob: And don't forget to take breaks! It's important to give your brain a rest so you can come back to your studies with a fresh perspective.
49
+
50
+ Alice: Thanks for the advice, guys. I feel more motivated now.
51
+
52
+ Charlie: No problem, Alice. We're all in this together.
53
+
54
+ Bob: Yeah, and remember that it's okay to ask for help if you need it. We're here to support each other.
55
+ ```
56
+ where the model generates the text after the first "Bob:".
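The chat format can be produced the same way: prior turns are joined as `Name: text` paragraphs and the prompt ends with the next speaker's cue. This is a sketch under that assumption; `build_chat_prompt` is a hypothetical helper, not part of the repository.

```python
from typing import List, Tuple


def build_chat_prompt(turns: List[Tuple[str, str]], next_speaker: str) -> str:
    """Join prior turns as "Name: text" paragraphs and end with the next speaker's cue."""
    lines = [f"{name}: {text.strip()}" for name, text in turns]
    lines.append(f"{next_speaker}:")
    return "\n\n".join(lines)


prompt = build_chat_prompt(
    [("Alice", "I don't know why, I'm struggling to maintain focus while studying. Any suggestions?")],
    next_speaker="Bob",
)
print(prompt)
```

Because the model is a base model, it may continue with further speakers after Bob's turn, so callers would typically trim the output at the next `Name:` line.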
57
+
58
+ ### Code Format:
59
+
60
+ ```python
+ def print_prime(n):
+     """
+     Print all primes between 1 and n
+     """
+     primes = []
+     for num in range(2, n+1):
+         is_prime = True
+         for i in range(2, int(math.sqrt(num))+1):
+             if num % i == 0:
+                 is_prime = False
+                 break
+         if is_prime:
+             primes.append(num)
+     print(primes)
75
+ ```
76
+ where the model generates the text after the comments.
77
+
78
+ **Notes:**
79
+ * Phi-1.5 is intended for research purposes. The model-generated text/code should be treated as a starting point rather than a definitive solution for potential use cases. Users should be cautious when employing these models in their applications.
80
+ * Direct adoption for production tasks is out of the scope of this research project. As a result, Phi-1.5 has not been tested to ensure that it performs adequately for any production-level application. Please refer to the Limitations section of this document for more details.
81
+ * If you are using `transformers>=4.36.0`, always load the model with `trust_remote_code=True` to prevent side-effects.
82
+
83
+ ## Sample Code
84
+
85
+ There are four execution modes:
86
+
87
+ 1. FP16 / Flash-Attention / CUDA:
88
+ ```python
89
+ model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", torch_dtype="auto", flash_attn=True, flash_rotary=True, fused_dense=True, device_map="cuda", trust_remote_code=True)
90
+ ```
91
+ 2. FP16 / CUDA:
92
+ ```python
93
+ model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", torch_dtype="auto", device_map="cuda", trust_remote_code=True)
94
+ ```
95
+ 3. FP32 / CUDA:
96
+ ```python
97
+ model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", torch_dtype=torch.float32, device_map="cuda", trust_remote_code=True)
98
+ ```
99
+ 4. FP32 / CPU:
100
+ ```python
101
+ model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", torch_dtype=torch.float32, device_map="cpu", trust_remote_code=True)
102
+ ```
103
+
104
+ To ensure maximum compatibility, we recommend using the second execution mode (FP16 / CUDA), as follows:
105
+
106
+ ```python
107
+ import torch
108
+ from transformers import AutoModelForCausalLM, AutoTokenizer
109
+
110
+ torch.set_default_device("cuda")
111
+
112
+ model = AutoModelForCausalLM.from_pretrained("microsoft/phi-1_5", torch_dtype="auto", trust_remote_code=True)
113
+ tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-1_5", trust_remote_code=True)
114
+
115
+ inputs = tokenizer('''def print_prime(n):
+     """
+     Print all primes between 1 and n
+     """''', return_tensors="pt", return_attention_mask=False)
119
+
120
+ outputs = model.generate(**inputs, max_length=200)
121
+ text = tokenizer.batch_decode(outputs)[0]
122
+ print(text)
123
+ ```
124
+
125
+ **Remark:** In the generation function, our model currently does not support beam search (`num_beams > 1`).
126
+ Furthermore, in the forward pass of the model, we currently do not support outputting hidden states or attention values, or using custom input embeddings.
127
+
128
+
129
+ ## Limitations of Phi-1.5
130
+
131
+ * Generate Inaccurate Code and Facts: The model often produces incorrect code snippets and statements. Users should treat these outputs as suggestions or starting points, not as definitive or accurate solutions.
132
+ * Limited Scope for code: If the model generates Python scripts that utilize uncommon packages or scripts in other languages, we strongly recommend users manually verify all API uses.
133
+ * Unreliable Responses to Instruction: The model has not undergone instruction fine-tuning. As a result, it may struggle or fail to adhere to intricate or nuanced instructions provided by users.
134
+ * Language Limitations: The model is primarily designed to understand standard English. Informal English, slang, or any other language outside of English might pose challenges to its comprehension, leading to potential misinterpretations or errors in response.
135
+ * Potential Societal Biases: Despite the safe data used for its training, the model is not entirely free from societal biases. There's a possibility it may generate content that mirrors these societal biases, particularly if prompted or instructed to do so. We urge users to be aware of this and to exercise caution and critical thinking when interpreting model outputs.
136
+ * Toxicity: Although the model is trained on carefully selected data, it can still produce harmful content if explicitly prompted or instructed to do so. We chose to release the model for research purposes only. We hope to help the open-source community develop the most effective ways to reduce the toxicity of a model directly after pretraining.
137
+
138
+ ## Training
139
+
140
+ ### Model
141
+ * Architecture: a Transformer-based model with next-word prediction objective
142
+ * Dataset size: 30B tokens
143
+ * Training tokens: 150B tokens
144
+ * Precision: fp16
145
+ * GPUs: 32xA100-40G
146
+ * Training time: 8 days
147
+
148
+ ### Software
149
+ * [PyTorch](https://github.com/pytorch/pytorch)
150
+ * [DeepSpeed](https://github.com/microsoft/DeepSpeed)
151
+ * [Flash-Attention](https://github.com/HazyResearch/flash-attention)
152
+
153
+ ### License
154
+ The model is licensed under the [Research License](https://huggingface.co/microsoft/phi-1_5/resolve/main/Research%20License.docx).
155
+
156
+ ### Citation
157
+
158
+ You can find the paper at https://arxiv.org/abs/2309.05463
159
+
160
+ ```bib
161
+ @article{textbooks2,
162
+ title={Textbooks Are All You Need II: \textbf{phi-1.5} technical report},
163
+ author={Li, Yuanzhi and Bubeck, S{\'e}bastien and Eldan, Ronen and Del Giorno, Allie and Gunasekar, Suriya and Lee, Yin Tat},
164
+ journal={arXiv preprint arXiv:2309.05463},
165
+ year={2023}
166
+ }
167
+ ```
Research License.docx ADDED
Binary file (38.9 kB).
 
added_tokens.json ADDED
@@ -0,0 +1,40 @@
1
+ {
2
+ "\t\t": 50294,
3
+ "\t\t\t": 50293,
4
+ "\t\t\t\t": 50292,
5
+ "\t\t\t\t\t": 50291,
6
+ "\t\t\t\t\t\t": 50290,
7
+ "\t\t\t\t\t\t\t": 50289,
8
+ "\t\t\t\t\t\t\t\t": 50288,
9
+ "\t\t\t\t\t\t\t\t\t": 50287,
10
+ " ": 50286,
11
+ " ": 50285,
12
+ " ": 50284,
13
+ " ": 50283,
14
+ " ": 50282,
15
+ " ": 50281,
16
+ " ": 50280,
17
+ " ": 50279,
18
+ " ": 50278,
19
+ " ": 50277,
20
+ " ": 50276,
21
+ " ": 50275,
22
+ " ": 50274,
23
+ " ": 50273,
24
+ " ": 50272,
25
+ " ": 50271,
26
+ " ": 50270,
27
+ " ": 50269,
28
+ " ": 50268,
29
+ " ": 50267,
30
+ " ": 50266,
31
+ " ": 50265,
32
+ " ": 50264,
33
+ " ": 50263,
34
+ " ": 50262,
35
+ " ": 50261,
36
+ " ": 50260,
37
+ " ": 50259,
38
+ " ": 50258,
39
+ " ": 50257
40
+ }
config.json ADDED
@@ -0,0 +1,31 @@
1
+ {
2
+ "_name_or_path": "microsoft/phi-1_5",
3
+ "activation_function": "gelu_new",
4
+ "architectures": [
5
+ "PhiForCausalLM"
6
+ ],
7
+ "attn_pdrop": 0.0,
8
+ "auto_map": {
9
+ "AutoConfig": "configuration_phi.PhiConfig",
10
+ "AutoModelForCausalLM": "modeling_phi.PhiForCausalLM"
11
+ },
12
+ "embd_pdrop": 0.0,
13
+ "flash_attn": false,
14
+ "flash_rotary": false,
15
+ "fused_dense": false,
16
+ "initializer_range": 0.02,
17
+ "layer_norm_epsilon": 1e-05,
18
+ "model_type": "phi-msft",
19
+ "n_embd": 2048,
20
+ "n_head": 32,
21
+ "n_head_kv": null,
22
+ "n_inner": null,
23
+ "n_layer": 24,
24
+ "n_positions": 2048,
25
+ "resid_pdrop": 0.0,
26
+ "rotary_dim": 32,
27
+ "tie_word_embeddings": false,
28
+ "torch_dtype": "float16",
29
+ "transformers_version": "4.34.1",
30
+ "vocab_size": 51200
31
+ }
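The values above imply a few derived sizes. A minimal sketch, assuming the rules used in the accompanying code (head size `n_embd // n_head` as in `_find_mha_dims`, an MLP width of `4 * n_embd` when `n_inner` is null as in `MLP`) and assuming the rotary embedding is built with `dim = rotary_dim`; the function name `derived_dims` is hypothetical.

```python
def derived_dims(config: dict) -> dict:
    """Compute per-head and MLP sizes implied by a Phi-style config dict."""
    head_dim = config["n_embd"] // config["n_head"]
    n_inner = config["n_inner"] if config["n_inner"] is not None else 4 * config["n_embd"]
    n_head_kv = config["n_head_kv"] or config["n_head"]
    return {
        "head_dim": head_dim,
        "n_inner": n_inner,
        "n_head_kv": n_head_kv,
        # Share of each head's channels that is rotated (assumes dim = rotary_dim).
        "rotary_fraction": config["rotary_dim"] / head_dim,
    }


# Only the fields needed for the calculation are copied from config.json.
phi_1_5 = {"n_embd": 2048, "n_head": 32, "n_head_kv": None, "n_inner": None, "rotary_dim": 32}
print(derived_dims(phi_1_5))
```

Under these assumptions each of the 32 heads is 64 channels wide, half of which carry rotary position information, and the MLP hidden width is 8192.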
configuration_phi.py ADDED
@@ -0,0 +1,62 @@
1
+ # Copyright (c) Microsoft Corporation.
2
+ # Licensed under the MIT license.
3
+
4
+ import math
5
+ from typing import Optional
6
+
7
+ from transformers import PretrainedConfig
8
+
9
+
+ class PhiConfig(PretrainedConfig):
+     """Phi configuration."""
+
+     model_type = "phi-msft"
+     attribute_map = {
+         "max_position_embeddings": "n_positions",
+         "hidden_size": "n_embd",
+         "num_attention_heads": "n_head",
+         "num_hidden_layers": "n_layer",
+     }
+
+     def __init__(
+         self,
+         vocab_size: int = 50304,
+         n_positions: int = 2048,
+         n_embd: int = 1024,
+         n_layer: int = 20,
+         n_inner: Optional[int] = None,
+         n_head: int = 16,
+         n_head_kv: Optional[int] = None,
+         rotary_dim: Optional[int] = 32,
+         activation_function: Optional[str] = "gelu_new",
+         flash_attn: bool = False,
+         flash_rotary: bool = False,
+         fused_dense: bool = False,
+         attn_pdrop: float = 0.0,
+         embd_pdrop: float = 0.0,
+         resid_pdrop: float = 0.0,
+         layer_norm_epsilon: float = 1e-5,
+         initializer_range: float = 0.02,
+         tie_word_embeddings: bool = False,
+         pad_vocab_size_multiple: int = 64,
+         **kwargs
+     ) -> None:
+         self.vocab_size = int(math.ceil(vocab_size / pad_vocab_size_multiple) * pad_vocab_size_multiple)
+         self.n_positions = n_positions
+         self.n_embd = n_embd
+         self.n_layer = n_layer
+         self.n_inner = n_inner
+         self.n_head = n_head
+         self.n_head_kv = n_head_kv
+         self.rotary_dim = min(rotary_dim, n_embd // n_head)
+         self.activation_function = activation_function
+         self.flash_attn = flash_attn
+         self.flash_rotary = flash_rotary
+         self.fused_dense = fused_dense
+         self.attn_pdrop = attn_pdrop
+         self.embd_pdrop = embd_pdrop
+         self.resid_pdrop = resid_pdrop
+         self.layer_norm_epsilon = layer_norm_epsilon
+         self.initializer_range = initializer_range
+
+         super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)
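`PhiConfig` rounds `vocab_size` up to a multiple of `pad_vocab_size_multiple`. The sketch below reproduces that arithmetic with the same expression as the constructor; the standalone function name `padded_vocab_size` is an illustrative assumption, as is treating ids below 50257 as the base vocabulary.

```python
import math


def padded_vocab_size(vocab_size: int, multiple: int = 64) -> int:
    """Round vocab_size up to the next multiple, mirroring PhiConfig.__init__."""
    return int(math.ceil(vocab_size / multiple) * multiple)


# Assumed base ids 0..50256 plus the 38 ids (50257..50294) listed in added_tokens.json.
print(padded_vocab_size(50257 + 38))  # 50304, the constructor's default vocab_size
print(padded_vocab_size(51200))       # 51200, already a multiple of 64
```

Padding to a multiple of 64 leaves some embedding rows without a corresponding token; that is expected and the value in `config.json` (51200) passes through unchanged.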
generation_config.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "transformers_version": "4.32.1"
4
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
modeling_phi.py ADDED
@@ -0,0 +1,961 @@
1
+ # Copyright (c) Microsoft Corporation.
2
+ # Licensed under the MIT license.
3
+ #
4
+ # Copyright (c) 2022, Tri Dao, [email protected].
5
+ # Licensed under the BSD 3-Clause License.
6
+
7
+ from __future__ import annotations
8
+
9
+ import math
10
+ from dataclasses import dataclass, field
11
+ from typing import Any, Dict, Optional, Tuple, Union
12
+
13
+ import torch
14
+ import torch.nn as nn
15
+ from einops import rearrange, repeat
16
+ from transformers import PretrainedConfig, PreTrainedModel
17
+ from transformers.activations import ACT2FN
18
+ from transformers.modeling_outputs import CausalLMOutputWithPast
19
+
20
+ from .configuration_phi import PhiConfig
21
+
+ try:
+     from flash_attn.bert_padding import pad_input, unpad_input
+     from flash_attn.layers.rotary import RotaryEmbedding as FlashRotaryEmbedding
+     from flash_attn.modules.mha import FlashCrossAttention, FlashSelfAttention
+     from flash_attn.ops.fused_dense import FusedDense
+ except Exception:
+     # flash-attn is optional; its names are set to None when it cannot be imported
+     pad_input, unpad_input = None, None
+     FlashRotaryEmbedding = None
+     FlashSelfAttention, FlashCrossAttention = None, None
+     FusedDense = None
32
+
33
+
+ @dataclass
+ class InferenceParams:
+     """Inference parameters passed to model to efficiently calculate
+     and store context during inference.
+
+     Reference:
+         https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/utils/generation.py.
+
+     Args:
+         max_seqlen: Maximum sequence length.
+         max_batch_size: Maximum batch size.
+         seqlen_offset: Sequence length offset.
+         batch_size_offset: Batch size offset.
+         key_value_memory_dict: Key value memory dictionary.
+         lengths_per_sample: Lengths per sample.
+
+     """
+
+     max_seqlen: int = field(metadata={"help": "Maximum sequence length."})
+
+     max_batch_size: int = field(metadata={"help": "Maximum batch size."})
+
+     seqlen_offset: int = field(default=0, metadata={"help": "Sequence length offset."})
+
+     batch_size_offset: int = field(default=0, metadata={"help": "Batch size offset."})
+
+     key_value_memory_dict: Dict[str, Any] = field(
+         default_factory=dict, metadata={"help": "Key value memory dictionary."}
+     )
+
+     lengths_per_sample: Optional[torch.Tensor] = field(default=None, metadata={"help": "Lengths per sample."})
65
+
66
+
+ class Embedding(nn.Module):
+     """Token embedding with dropout."""
+
+     def __init__(self, config: PretrainedConfig) -> None:
+         super().__init__()
+
+         self.wte = nn.Embedding(config.vocab_size, config.n_embd)
+         self.drop = nn.Dropout(config.embd_pdrop)
+
+     def forward(self, input_ids: torch.LongTensor) -> torch.FloatTensor:
+         input_shape = input_ids.size()
+         input_ids = input_ids.view(-1, input_shape[-1])
+
+         hidden_states = self.wte(input_ids)
+         hidden_states = self.drop(hidden_states)
+
+         return hidden_states
84
+
85
+
+ def _apply_rotary_emb(
+     x: torch.FloatTensor,
+     cos: torch.FloatTensor,
+     sin: torch.FloatTensor,
+ ) -> torch.FloatTensor:
+     _, seqlen, _, _ = x.shape
+     _, rotary_dim = cos.shape
+     rotary_dim *= 2
+
+     x_rot = x[:, :, :, :rotary_dim]
+     x_pass = x[:, :, :, rotary_dim:]
+
+     x1, x2 = x_rot.chunk(2, dim=-1)
+     c, s = rearrange(cos[:seqlen], "s d -> s 1 d"), rearrange(sin[:seqlen], "s d -> s 1 d")
+     x1, x2, c, s = [t.to(dtype=torch.float32) for t in [x1, x2, c, s]]
+
+     x_rot = torch.cat([x1 * c - x2 * s, x1 * s + x2 * c], axis=-1).to(x.dtype)
+
+     return torch.cat([x_rot, x_pass], axis=-1)
105
+
106
+
+ def _apply_rotary_emb_kv(
+     kv: torch.FloatTensor,
+     cos: torch.FloatTensor,
+     sin: torch.FloatTensor,
+     cos_k: Optional[torch.FloatTensor] = None,
+     sin_k: Optional[torch.FloatTensor] = None,
+ ) -> torch.FloatTensor:
+     _, seqlen, _, _, _ = kv.shape
+     _, rotary_dim = cos.shape
+     rotary_dim *= 2
+
+     k_rot = kv[:, :, 0, :, :rotary_dim]
+     k_pass = kv[:, :, 0, :, rotary_dim:]
+
+     k1, k2 = k_rot.chunk(2, dim=-1)
+     c, s = rearrange(cos[:seqlen], "s d -> s 1 d"), rearrange(sin[:seqlen], "s d -> s 1 d")
+     k1, k2, c, s = [t.to(dtype=torch.float32) for t in [k1, k2, c, s]]
+
+     k_rot = torch.cat([k1 * c - k2 * s, k1 * s + k2 * c], axis=-1).to(kv.dtype)
+
+     return torch.cat(
+         [
+             torch.cat([k_rot, k_pass], axis=-1).unsqueeze(2),
+             kv[:, :, 1:2, :, :],
+         ],
+         axis=2,
+     )
134
+
135
+
+ def _apply_rotary_emb_qkv(
+     qkv: torch.FloatTensor,
+     cos: torch.FloatTensor,
+     sin: torch.FloatTensor,
+     cos_k: Optional[torch.FloatTensor] = None,
+     sin_k: Optional[torch.FloatTensor] = None,
+ ) -> torch.FloatTensor:
+     _, seqlen, _, _, _ = qkv.shape
+     _, rotary_dim = cos.shape
+     rotary_dim *= 2
+
+     q_rot = qkv[:, :, 0, :, :rotary_dim]
+     q_pass = qkv[:, :, 0, :, rotary_dim:]
+
+     k_rot = qkv[:, :, 1, :, :rotary_dim]
+     k_pass = qkv[:, :, 1, :, rotary_dim:]
+
+     q1, q2 = q_rot.chunk(2, dim=-1)
+     k1, k2 = k_rot.chunk(2, dim=-1)
+     c, s = rearrange(cos[:seqlen], "s d -> s 1 d"), rearrange(sin[:seqlen], "s d -> s 1 d")
+     q1, q2, k1, k2, c, s = [t.to(dtype=torch.float32) for t in [q1, q2, k1, k2, c, s]]
+
+     q_rot = torch.cat([q1 * c - q2 * s, q1 * s + q2 * c], axis=-1).to(qkv.dtype)
+     k_rot = torch.cat([k1 * c - k2 * s, k1 * s + k2 * c], axis=-1).to(qkv.dtype)
+
+     return torch.cat(
+         [
+             torch.cat([q_rot, q_pass], axis=-1).unsqueeze(2),
+             torch.cat([k_rot, k_pass], axis=-1).unsqueeze(2),
+             qkv[:, :, 2:3, :, :],
+         ],
+         axis=2,
+     )
169
+
170
+
+ class RotaryEmbedding(nn.Module):
+     """Rotary positional embedding (RoPE).
+
+     Reference:
+         RoFormer: Enhanced Transformer with Rotary Position Embedding.
+         https://arxiv.org/pdf/2104.09864.pdf.
+
+     """
+
+     def __init__(
+         self,
+         dim: int,
+         base: int = 10000,
+         scale_base: Optional[float] = None,
+         pos_idx_in_fp32: bool = True,
+         max_position_embeddings: int = 2048,
+         device: Optional[str] = None,
+         **kwargs,
+     ) -> None:
+         super().__init__()
+
+         if scale_base is not None:
+             raise NotImplementedError
+
+         self.dim = dim
+         self.base = float(base)
+         self.scale_base = scale_base
+         self.pos_idx_in_fp32 = pos_idx_in_fp32
+         self.max_position_embeddings = max_position_embeddings
+         self.device = device
+
+         # Generate and save the inverse frequency buffer (non-trainable)
+         inv_freq = self._compute_inv_freq(device)
+         self.register_buffer("inv_freq", inv_freq, persistent=False)
+
+         # Generate and save the scale buffer (non-trainable)
+         scale = (
+             (torch.arange(0, dim, 2, device=device, dtype=torch.float32) + 0.4 * dim) / (1.4 * dim)
+             if scale_base is not None
+             else None
+         )
+         self.register_buffer("scale", scale, persistent=False)
+
+         # Initialize cached attributes since ONNX can't rely on dynamic initialization
+         self._update_cos_sin_cache(max_position_embeddings, device=device, dtype=torch.float32)
+
+     def _compute_inv_freq(self, device: Optional[str] = None) -> torch.FloatTensor:
+         return 1.0 / (self.base ** (torch.arange(0, self.dim, 2, device=device, dtype=torch.float32) / self.dim))
+
+     def _update_cos_sin_cache(
+         self,
+         seqlen: int,
+         device: Optional[str] = None,
+         dtype: Optional[torch.dtype] = None,
+     ) -> None:
+         self._seq_len_cached = seqlen
+
+         # fp32 is preferred since the output of `torch.arange` can be quite large
+         # and bf16 would lose a lot of precision
+         if self.pos_idx_in_fp32:
+             t = torch.arange(seqlen, device=device, dtype=torch.float32)
+             if self.inv_freq.dtype != torch.float32:
+                 inv_freq = self._compute_inv_freq(device=device)
+             else:
+                 inv_freq = self.inv_freq
+         else:
+             t = torch.arange(seqlen, device=device, dtype=self.inv_freq.dtype)
+             inv_freq = self.inv_freq
+
+         # `torch.outer` is preferred since `torch.einsum` converts from fp32 to fp16 if used with AMP
+         freqs = torch.outer(t, inv_freq)
+         if self.scale is None:
+             self._cos_cached = torch.cos(freqs).to(dtype)
+             self._sin_cached = torch.sin(freqs).to(dtype)
+         else:
+             power = (
+                 torch.arange(seqlen, dtype=self.scale.dtype, device=self.scale.device) - seqlen // 2
+             ) / self.scale_base
+             scale = self.scale.to(device=power.device) ** rearrange(power, "s -> s 1")
+
+             # Force the scale multiplication to happen in fp32
+             self._cos_cached = (torch.cos(freqs) * scale).to(dtype)
+             self._sin_cached = (torch.sin(freqs) * scale).to(dtype)
+             self._cos_k_cached = (torch.cos(freqs) / scale).to(dtype)
+             self._sin_k_cached = (torch.sin(freqs) / scale).to(dtype)
+
+     def forward(
+         self,
+         qkv: torch.Tensor,
+         kv: Optional[torch.Tensor] = None,
+         seqlen_offset: int = 0,
+         **kwargs,
+     ) -> Tuple[torch.Tensor, torch.Tensor]:
+         if (
+             self._seq_len_cached < qkv.shape[1] + seqlen_offset
+             or self._cos_cached.device != qkv.device
+             or self._cos_cached.dtype != qkv.dtype
+             or (self.training and self._cos_cached.is_inference())
+         ):
+             self._update_cos_sin_cache(qkv.shape[1] + seqlen_offset, device=qkv.device, dtype=qkv.dtype)
+
+         if kv is None:
+             return _apply_rotary_emb_qkv(
+                 qkv,
+                 self._cos_cached[seqlen_offset:],
+                 self._sin_cached[seqlen_offset:],
+             )
+         else:
+             q = _apply_rotary_emb(
+                 qkv,
+                 self._cos_cached[seqlen_offset:],
+                 self._sin_cached[seqlen_offset:],
+             )
+             kv = _apply_rotary_emb_kv(
+                 kv,
+                 self._cos_cached[seqlen_offset:],
+                 self._sin_cached[seqlen_offset:],
+             )
+
+             return q, kv
291
+
292
+
+ class MLP(nn.Module):
+     """Multi-Layer Perceptron.
+
+     Reference:
+         Attention Is All You Need.
+         https://arxiv.org/pdf/1706.03762.pdf.
+
+     """
+
+     def __init__(
+         self,
+         config: PretrainedConfig,
+         n_inner: Optional[int] = None,
+         act_fn: Optional[str] = None,
+     ) -> None:
+         super().__init__()
+
+         act_fn = config.activation_function if act_fn is None else act_fn
+
+         n_inner = getattr(config, "n_inner", None) if n_inner is None else n_inner
+         n_inner = n_inner if n_inner is not None else 4 * config.n_embd
+
+         self.fc1 = nn.Linear(config.n_embd, n_inner)
+         self.fc2 = nn.Linear(n_inner, config.n_embd)
+         self.act = ACT2FN[act_fn]
+
+     def forward(self, hidden_states: torch.FloatTensor) -> torch.FloatTensor:
+         hidden_states = self.fc1(hidden_states)
+         hidden_states = self.act(hidden_states)
+         hidden_states = self.fc2(hidden_states)
+
+         return hidden_states
325
+
326
+
+ class SelfAttention(nn.Module):
+     """Self-attention layer (compatible with PyTorch).
+
+     Reference:
+         https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/modules/mha.py.
+
+     """
+
+     def __init__(
+         self,
+         causal: bool = True,
+         softmax_scale: Optional[float] = None,
+         attention_dropout: float = 0.0,
+     ) -> None:
+         super().__init__()
+
+         self.causal = causal
+         self.softmax_scale = softmax_scale
+         self.drop = nn.Dropout(attention_dropout)
+
+     @torch.autocast("cpu", enabled=False)
+     @torch.autocast("cuda", enabled=False)
+     def forward(
+         self,
+         qkv: torch.FloatTensor,
+         causal: Optional[bool] = None,
+         key_padding_mask: Optional[torch.BoolTensor] = None,
+         **kwargs,
+     ) -> torch.FloatTensor:
+         batch_size, seqlen = qkv.shape[0], qkv.shape[1]
+         q, k, v = qkv.unbind(dim=2)
+
+         q = q.to(torch.float32)
+         k = k.to(torch.float32)
+
+         causal = self.causal if causal is None else causal
+         softmax_scale = self.softmax_scale or 1.0 / math.sqrt(q.shape[-1])
+
+         # Autocast is manually disabled to avoid `torch.einsum` performing the operation
+         # using float16, which might lead to overflow
+         scores = torch.einsum("bthd,bshd->bhts", q, k * softmax_scale)
+
+         if key_padding_mask is not None:
+             padding_mask = torch.full((batch_size, seqlen), -10000.0, dtype=scores.dtype, device=scores.device)
+             padding_mask.masked_fill_(key_padding_mask, 0.0)
+
+             scores = scores + rearrange(padding_mask, "b s -> b 1 1 s")
+
+         if causal:
+             causal_mask = torch.triu(torch.full((seqlen, seqlen), -10000.0, device=scores.device), 1)
+             scores = scores + causal_mask.to(dtype=scores.dtype)
+
+         attention = torch.softmax(scores, dim=-1).to(v.dtype)
+         attention = self.drop(attention)
+
+         output = torch.einsum("bhts,bshd->bthd", attention, v)
+
+         return output
385
+
386
+
+ class CrossAttention(nn.Module):
+     """Cross-attention layer (compatible with PyTorch).
+
+     Reference:
+         https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/modules/mha.py.
+
+     """
+
+     def __init__(
+         self,
+         causal: bool = True,
+         softmax_scale: Optional[float] = None,
+         attention_dropout: float = 0.0,
+     ) -> None:
+         super().__init__()
+
+         self.causal = causal
+         self.softmax_scale = softmax_scale
+         self.drop = nn.Dropout(attention_dropout)
+
+     @torch.autocast("cpu", enabled=False)
+     @torch.autocast("cuda", enabled=False)
+     def forward(
+         self,
+         q: torch.FloatTensor,
+         kv: torch.FloatTensor,
+         causal: Optional[bool] = None,
+         key_padding_mask: Optional[torch.BoolTensor] = None,
+         **kwargs,
+     ) -> torch.FloatTensor:
+         batch_size, seqlen_q = q.shape[0], q.shape[1]
+         seqlen_k = kv.shape[1]
+
+         if kv.shape[3] != q.shape[2]:
+             kv = repeat(kv, "... hkv d -> ... (hkv g) d", g=q.shape[2] // kv.shape[3])
+         k, v = kv.unbind(dim=2)
+
+         q = q.to(torch.float32)
+         k = k.to(torch.float32)
+
+         causal = self.causal if causal is None else causal
+         softmax_scale = self.softmax_scale or 1.0 / math.sqrt(q.shape[-1])
+
+         # Autocast is manually disabled to avoid `torch.einsum` performing the operation
+         # using float16, which might lead to overflow
+         scores = torch.einsum("bthd,bshd->bhts", q, k * softmax_scale)
+
+         if key_padding_mask is not None:
+             padding_mask = torch.full(
+                 (batch_size, seqlen_k),
+                 -10000.0,
+                 dtype=scores.dtype,
+                 device=scores.device,
+             )
+             padding_mask.masked_fill_(key_padding_mask, 0.0)
+
+             scores = scores + rearrange(padding_mask, "b s -> b 1 1 s")
+
+         if causal:
+             rows = rearrange(torch.arange(seqlen_q, device=q.device, dtype=torch.long), "s -> s 1")
+             cols = torch.arange(seqlen_k, device=k.device, dtype=torch.long)
+             causal_mask = cols > rows + seqlen_k - seqlen_q
+
+             scores = scores.masked_fill(causal_mask, -10000.0)
+
+         attention = torch.softmax(scores, dim=-1).to(v.dtype)
+         attention = self.drop(attention)
+
+         output = torch.einsum("bhts,bshd->bthd", attention, v)
+
+         return output
458
+
459
+
460
+ def _find_mha_dims(
461
+ config: PretrainedConfig,
462
+ n_head: Optional[int] = None,
463
+ n_head_kv: Optional[int] = None,
464
+ head_dim: Optional[int] = None,
465
+ ) -> Tuple[int, int, int]:
466
+ if n_head is None and head_dim is None:
467
+ head_dim = config.n_embd // config.n_head
468
+ n_head = config.n_head
469
+ elif n_head is None or head_dim is None:
470
+ raise ValueError("`n_head` and `head_dim` must be both specified or `None`.")
471
+
472
+ if n_head_kv is None:
473
+ n_head_kv = getattr(config, "n_head_kv", None) or n_head
474
+
475
+ return n_head, n_head_kv, head_dim
476
+
477
+
478
+ def _update_kv_cache(kv: torch.FloatTensor, inference_params: InferenceParams, layer_idx: int) -> torch.FloatTensor:
479
+ num_heads, head_dim = kv.shape[-2:]
480
+
481
+ if layer_idx not in inference_params.key_value_memory_dict:
482
+ inference_params.key_value_memory_dict[layer_idx] = torch.empty(
483
+ inference_params.max_batch_size,
484
+ inference_params.max_seqlen,
485
+ 2,
486
+ num_heads,
487
+ head_dim,
488
+ dtype=kv.dtype,
489
+ device=kv.device,
490
+ )
491
+
492
+ batch_start = inference_params.batch_size_offset
493
+ batch_end = batch_start + kv.shape[0]
494
+
495
+ sequence_start = inference_params.seqlen_offset
496
+ sequence_end = sequence_start + kv.shape[1]
497
+
498
+ # When the current sequence length is larger than the maximum sequence length,
499
+ # we need to concatenate the current `kv` with the cached `kv` to expand its length
500
+ if sequence_end > inference_params.max_seqlen:
501
+ inference_params.key_value_memory_dict[layer_idx] = torch.concatenate((inference_params.key_value_memory_dict[layer_idx], kv), dim=1)
502
+
503
+ inference_params.key_value_memory_dict[layer_idx][batch_start:batch_end, sequence_start:sequence_end, ...] = kv
504
+ kv = inference_params.key_value_memory_dict[layer_idx][batch_start:batch_end, :sequence_end, ...]
505
+
506
+ return kv
507
+
508
+
509
+ class MHA(nn.Module):
510
+ """Multi-head attention layer."""
511
+
512
+ def __init__(
513
+ self,
514
+ config: PretrainedConfig,
515
+ dtype: Optional[torch.dtype] = None,
516
+ device: Optional[str] = None,
517
+ rotary_dim: Optional[int] = None,
518
+ rotary_base: float = 10000.0,
519
+ rotary_scale_base: Optional[float] = None,
520
+ n_head: Optional[int] = None,
521
+ n_head_kv: Optional[int] = None,
522
+ head_dim: Optional[int] = None,
523
+ bias: bool = True,
524
+ causal: bool = True,
525
+ softmax_scale: Optional[float] = None,
526
+ layer_idx: Optional[int] = None,
527
+ return_residual: bool = False,
528
+ checkpointing: bool = False,
529
+ ) -> None:
530
+ super().__init__()
531
+
532
+ # Rotary embedding
533
+ self.rotary_dim = rotary_dim if rotary_dim is not None else getattr(config, "rotary_dim", 0)
534
+ if self.rotary_dim > 0:
535
+ rotary_cls = FlashRotaryEmbedding if config.flash_rotary else RotaryEmbedding
536
+ if rotary_cls is None:
537
+ rotary_cls = RotaryEmbedding
538
+
539
+ rotary_kwargs = {}
540
+ if rotary_cls is RotaryEmbedding:
541
+ rotary_kwargs["max_position_embeddings"] = config.n_positions
542
+
543
+ self.rotary_emb = rotary_cls(
544
+ self.rotary_dim,
545
+ base=rotary_base,
546
+ scale_base=rotary_scale_base,
547
+ device=device,
548
+ **rotary_kwargs,
549
+ )
550
+
551
+ # QKV and output projections
552
+ self.n_head, self.n_head_kv, self.head_dim = _find_mha_dims(
553
+ config, n_head=n_head, n_head_kv=n_head_kv, head_dim=head_dim
554
+ )
555
+ op_size = self.head_dim * (self.n_head + 2 * self.n_head_kv)
556
+ hidden_size = config.n_embd
557
+
558
+ linear_cls = FusedDense if config.fused_dense else nn.Linear
559
+ if linear_cls is None:
560
+ linear_cls = nn.Linear
561
+
562
+ self.Wqkv = linear_cls(hidden_size, op_size, bias=bias, device=device, dtype=dtype)
563
+ self.out_proj = linear_cls(hidden_size, hidden_size, bias=bias, device=device, dtype=dtype)
564
+
565
+ # Attention
566
+ attn_cls = FlashSelfAttention if config.flash_attn else SelfAttention
567
+ if attn_cls is None:
568
+ attn_cls = SelfAttention
569
+
570
+ cross_attn_cls = FlashCrossAttention if config.flash_attn else CrossAttention
571
+ if cross_attn_cls is None:
572
+ cross_attn_cls = CrossAttention
573
+
574
+ self.inner_attn = attn_cls(
575
+ causal=causal,
576
+ softmax_scale=softmax_scale,
577
+ attention_dropout=config.attn_pdrop,
578
+ )
579
+ self.inner_cross_attn = cross_attn_cls(
580
+ causal=causal,
581
+ softmax_scale=softmax_scale,
582
+ attention_dropout=config.attn_pdrop,
583
+ )
584
+
585
+ self.flash_attn = config.flash_attn and attn_cls is FlashSelfAttention
586
+ self.layer_idx = layer_idx
587
+ self.return_residual = return_residual
588
+ self.checkpointing = checkpointing
589
+
590
+ def _forward_self_attn(
591
+ self, x: torch.FloatTensor, key_padding_mask: Optional[torch.BoolTensor]
592
+ ) -> torch.FloatTensor:
593
+ qkv = self.Wqkv(x)
594
+ qkv = rearrange(qkv, "... (three h d) -> ... three h d", three=3, d=self.head_dim)
595
+
596
+ if self.rotary_dim > 0:
597
+ qkv = self.rotary_emb(qkv)
598
+
599
+ if self.flash_attn:
600
+ batch_size, seqlen = qkv.shape[0], qkv.shape[1]
601
+
602
+ cu_seqlens, max_seqlen = None, None
603
+ if key_padding_mask is not None:
604
+ # If `key_padding_mask` is supplied, we need to unpad the input and retrieve
605
+ # the `cu_seqlens` and `max_seqlen` to be used by `flash-attn`
606
+ qkv, indices, cu_seqlens, max_seqlen = unpad_input(qkv, key_padding_mask)
607
+
608
+ if self.checkpointing:
609
+ attn_output = torch.utils.checkpoint.checkpoint(
610
+ self.inner_attn, qkv, cu_seqlens=cu_seqlens, max_seqlen=max_seqlen
611
+ )
612
+ else:
613
+ attn_output = self.inner_attn(qkv, cu_seqlens=cu_seqlens, max_seqlen=max_seqlen).to(qkv.device)
614
+
615
+ # If `key_padding_mask` is supplied, we need to pad the output back to the original shape
616
+ return pad_input(attn_output, indices, batch_size, seqlen) if key_padding_mask is not None else attn_output
617
+
618
+ if self.checkpointing:
619
+ return torch.utils.checkpoint.checkpoint(self.inner_attn, qkv, key_padding_mask=key_padding_mask)
620
+
621
+ return self.inner_attn(qkv, key_padding_mask=key_padding_mask)
622
+
623
+ def _forward_cross_attn(
624
+ self,
625
+ x: torch.FloatTensor,
626
+ past_key_values: Optional[InferenceParams],
627
+ key_padding_mask: Optional[torch.BoolTensor],
628
+ ) -> torch.FloatTensor:
629
+ batch_size = x.shape[0]
630
+
631
+ qkv = self.Wqkv(x)
632
+
633
+ q = qkv[..., : self.n_head * self.head_dim]
634
+ q = rearrange(q, "... (h d) -> ... h d", d=self.head_dim)
635
+
636
+ kv = qkv[..., self.n_head * self.head_dim :]
637
+ kv = rearrange(kv, "... (two hkv d) -> ... two hkv d", two=2, d=self.head_dim)
638
+
639
+ seqlen_offset = past_key_values.seqlen_offset if past_key_values is not None else 0
640
+ causal = None if seqlen_offset == 0 else False
641
+ if self.rotary_dim > 0:
642
+ q, kv = self.rotary_emb(q, kv=kv, seqlen_offset=seqlen_offset)
643
+
644
+ if past_key_values is not None:
645
+ kv = _update_kv_cache(kv, past_key_values, self.layer_idx)
646
+
647
+ if self.flash_attn:
648
+ batch_size, seqlen_q = q.shape[0], q.shape[1]
649
+ seqlen_k = kv.shape[1]
650
+
651
+ cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k = (
652
+ None,
653
+ None,
654
+ None,
655
+ None,
656
+ )
657
+ if key_padding_mask is not None:
658
+ kv, _, cu_seqlens_k, max_seqlen_k = unpad_input(kv, key_padding_mask)
659
+
660
+ if seqlen_q == 1:
661
+ key_padding_mask = torch.ones(batch_size, 1, device=q.device)
662
+ elif seqlen_q != seqlen_k:
663
+ key_padding_mask = key_padding_mask[:, -seqlen_q:]
664
+
665
+ q, indices_q, cu_seqlens_q, max_seqlen_q = unpad_input(q, key_padding_mask)
666
+
667
+ if self.checkpointing:
668
+ attn_output = torch.utils.checkpoint.checkpoint(
669
+ self.inner_cross_attn,
670
+ q,
671
+ kv,
672
+ causal=causal,
673
+ cu_seqlens=cu_seqlens_q,
674
+ max_seqlen=max_seqlen_q,
675
+ cu_seqlens_k=cu_seqlens_k,
676
+ max_seqlen_k=max_seqlen_k,
677
+ )
678
+ else:
679
+ attn_output = self.inner_cross_attn(
680
+ q,
681
+ kv,
682
+ causal=causal,
683
+ cu_seqlens=cu_seqlens_q,
684
+ max_seqlen=max_seqlen_q,
685
+ cu_seqlens_k=cu_seqlens_k,
686
+ max_seqlen_k=max_seqlen_k,
687
+ )
688
+
689
+ return (
690
+ pad_input(attn_output, indices_q, batch_size, max_seqlen_q)
691
+ if key_padding_mask is not None
692
+ else attn_output
693
+ )
694
+
695
+ if self.checkpointing:
696
+ return torch.utils.checkpoint.checkpoint(
697
+ self.inner_cross_attn,
698
+ q,
699
+ kv,
700
+ key_padding_mask=key_padding_mask,
701
+ causal=causal,
702
+ )
703
+
704
+ return self.inner_cross_attn(q, kv, key_padding_mask=key_padding_mask, causal=causal)
705
+
706
+ def forward(
707
+ self,
708
+ x: torch.FloatTensor,
709
+ past_key_values: Optional[InferenceParams] = None,
710
+ attention_mask: Optional[Union[torch.LongTensor, torch.BoolTensor]] = None,
711
+ **kwargs,
712
+ ) -> Union[torch.FloatTensor, Tuple[torch.FloatTensor, torch.FloatTensor]]:
713
+ if attention_mask is not None:
714
+ attention_mask = attention_mask.bool()
715
+ else:
716
+ attention_mask = None
717
+
718
+ # MHA
719
+ if self.n_head == self.n_head_kv:
720
+ if past_key_values is None:
721
+ # If `past_key_values` are not supplied, we run self-attention
722
+ attn_output = self._forward_self_attn(x, attention_mask)
723
+ else:
724
+ # If `past_key_values` are supplied, it means that we might have cached values and
725
+ # could take advantage of cross-attention
726
+ attn_output = self._forward_cross_attn(x, past_key_values, attention_mask)
727
+ # MQA / GQA
728
+ else:
729
+ # Regardless of whether `past_key_values` is supplied, this path always uses cross-attention
730
+ # because `q` and `kv` lengths might be different
731
+ attn_output = self._forward_cross_attn(x, past_key_values, attention_mask)
732
+
733
+ output = rearrange(attn_output, "... h d -> ... (h d)")
734
+ output = self.out_proj(output)
735
+
736
+ return output if not self.return_residual else (output, x)
737
+
738
+
739
+ class ParallelBlock(nn.Module):
740
+ """Parallel block.
741
+
742
+ This block applies parallel mixer and MLP layers to the input (used in GPT-J and CodeGen).
743
+
744
+ """
745
+
746
+ def __init__(
747
+ self,
748
+ config: PretrainedConfig,
749
+ block_idx: Optional[int] = None,
750
+ ) -> None:
751
+ super().__init__()
752
+
753
+ self.ln = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon)
754
+ self.resid_dropout = nn.Dropout(config.resid_pdrop)
755
+ self.block_idx = block_idx
756
+
757
+ self.mixer = MHA(config, layer_idx=block_idx)
758
+ self.mlp = MLP(config)
759
+
760
+ def forward(
761
+ self,
762
+ hidden_states: torch.FloatTensor,
763
+ past_key_values: Optional[Union[torch.FloatTensor, InferenceParams]] = None,
764
+ attention_mask: Optional[torch.BoolTensor] = None,
765
+ **kwargs,
766
+ ) -> torch.FloatTensor:
767
+ residual = hidden_states
768
+ hidden_states = self.ln(hidden_states)
769
+
770
+ attn_outputs = self.mixer(
771
+ hidden_states,
772
+ past_key_values=past_key_values,
773
+ attention_mask=attention_mask,
774
+ )
775
+ if isinstance(attn_outputs, tuple):
776
+ attn_outputs = attn_outputs[0]
777
+
778
+ attn_outputs = self.resid_dropout(attn_outputs)
779
+ feed_forward_hidden_states = self.resid_dropout(self.mlp(hidden_states))
780
+
781
+ hidden_states = attn_outputs + feed_forward_hidden_states + residual
782
+
783
+ return hidden_states
784
+
785
+
786
+ class CausalLMHead(nn.Module):
787
+ """Causal Language Modeling head.
788
+
789
+ Reference:
790
+ Improving Language Understanding by Generative Pre-Training.
791
+ https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf.
792
+
793
+ """
794
+
795
+ def __init__(self, config: PretrainedConfig) -> None:
796
+ super().__init__()
797
+
798
+ self.ln = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon)
799
+ self.linear = nn.Linear(config.n_embd, config.vocab_size)
800
+
801
+ def forward(self, hidden_states: torch.FloatTensor) -> torch.FloatTensor:
802
+ hidden_states = self.ln(hidden_states)
803
+ logits = self.linear(hidden_states).to(torch.float32)
804
+
805
+ return logits
806
+
807
+
808
+ class CausalLMLoss(nn.Module):
809
+ """Causal Language Modeling loss.
810
+
811
+ Reference:
812
+ Improving Language Understanding by Generative Pre-Training.
813
+ https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf.
814
+
815
+ """
816
+
817
+ def __init__(self, shift_labels: bool = True) -> None:
818
+ super().__init__()
819
+
820
+ self.shift_labels = shift_labels
821
+ self.loss_fct = nn.CrossEntropyLoss()
822
+
823
+ def forward(self, logits: torch.FloatTensor, labels: torch.LongTensor) -> torch.FloatTensor:
824
+ if self.shift_labels:
825
+ logits = logits[..., :-1, :].contiguous()
826
+ labels = labels[..., 1:].contiguous()
827
+
828
+ loss = self.loss_fct(logits.view(-1, logits.size(-1)), labels.view(-1))
829
+
830
+ return loss
831
+
832
+
833
+ class PhiPreTrainedModel(PreTrainedModel):
834
+ """Phi pre-trained model."""
835
+
836
+ config_class = PhiConfig
837
+ base_model_prefix = "transformer"
838
+ supports_gradient_checkpointing = False
839
+ _no_split_modules = ["ParallelBlock"]
840
+
841
+ def __init__(self, *inputs, **kwargs) -> None:
842
+ super().__init__(*inputs, **kwargs)
843
+
844
+ def _init_weights(self, module: nn.Module) -> None:
845
+ if isinstance(module, (nn.Linear,)):
846
+ module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
847
+ if module.bias is not None:
848
+ module.bias.data.zero_()
849
+ elif isinstance(module, nn.Embedding):
850
+ module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
851
+ if module.padding_idx is not None:
852
+ module.weight.data[module.padding_idx].zero_()
853
+ elif isinstance(module, nn.LayerNorm):
854
+ if module.bias is not None:
855
+ module.bias.data.zero_()
856
+ module.weight.data.fill_(1.0)
857
+
858
+ def prepare_inputs_for_generation(
859
+ self,
860
+ input_ids: torch.LongTensor,
861
+ past_key_values: Optional[Union[torch.FloatTensor, InferenceParams]] = None,
862
+ attention_mask: Optional[Union[torch.LongTensor, torch.BoolTensor]] = None,
863
+ **kwargs,
864
+ ) -> Dict[str, Any]:
865
+ if past_key_values is None or not (isinstance(past_key_values, InferenceParams)):
866
+ max_batch_size, max_seqlen = input_ids.shape
867
+ past_key_values = InferenceParams(
868
+ max_seqlen=max(max_seqlen, self.config.n_positions),
869
+ max_batch_size=max_batch_size,
870
+ seqlen_offset=0,
871
+ batch_size_offset=0,
872
+ key_value_memory_dict={},
873
+ lengths_per_sample=None,
874
+ )
875
+ else:
876
+ # Assume that `past_key_values` has cached all tokens up to the last token in `input_ids`
877
+ past_key_values.seqlen_offset = input_ids.shape[1] - 1
878
+ input_ids = input_ids[:, -1].unsqueeze(-1)
879
+
880
+ return {
881
+ "input_ids": input_ids,
882
+ "past_key_values": past_key_values,
883
+ "attention_mask": attention_mask,
884
+ }
885
+
886
+
887
+ class PhiModel(PhiPreTrainedModel):
888
+ """Phi model."""
889
+
890
+ _keys_to_ignore_on_load_missing = [""]
891
+ _keys_to_ignore_on_load_unexpected = [r"h\.\d+\.mlp\.(fc_in|fc_out)\.(weight|bias)"]
892
+
893
+ def __init__(self, config: PhiConfig) -> None:
894
+ super().__init__(config)
895
+
896
+ self.embd = Embedding(config)
897
+ self.h = nn.ModuleList([ParallelBlock(config, block_idx=i) for i in range(config.n_layer)])
898
+ self.gradient_checkpointing = False
899
+ self.post_init()
900
+
901
+ def get_input_embeddings(self) -> nn.Embedding:
902
+ return self.embd.wte
903
+
904
+ def set_input_embeddings(self, new_embeddings: nn.Embedding) -> None:
905
+ self.embd.wte = new_embeddings
906
+
907
+ def forward(
908
+ self,
909
+ input_ids: torch.LongTensor,
910
+ past_key_values: Optional[Union[torch.FloatTensor, InferenceParams]] = None,
911
+ attention_mask: Optional[torch.BoolTensor] = None,
912
+ ) -> torch.FloatTensor:
913
+ hidden_states = self.embd(input_ids)
914
+
915
+ for layer in self.h:
916
+ hidden_states = layer(
917
+ hidden_states,
918
+ past_key_values=past_key_values,
919
+ attention_mask=attention_mask,
920
+ )
921
+
922
+ return hidden_states
923
+
924
+
925
+ class PhiForCausalLM(PhiPreTrainedModel):
926
+ """Phi for Causal Language Modeling."""
927
+
928
+ _keys_to_ignore_on_load_missing = [""]
929
+ _keys_to_ignore_on_load_unexpected = [r"transformer\.h\.\d+\.mlp\.(fc_in|fc_out)\.(weight|bias)"]
930
+
931
+ def __init__(self, config: PhiConfig) -> None:
932
+ super().__init__(config)
933
+
934
+ self.transformer = PhiModel(config)
935
+ self.lm_head = CausalLMHead(config)
936
+ self.loss = CausalLMLoss()
937
+
938
+ self.post_init()
939
+
940
+ def get_output_embeddings(self) -> nn.Linear:
941
+ return self.lm_head.linear
942
+
943
+ def set_output_embeddings(self, new_embeddings: nn.Linear) -> None:
944
+ self.lm_head.linear = new_embeddings
945
+
946
+ def forward(
947
+ self,
948
+ input_ids: torch.LongTensor,
949
+ past_key_values: Optional[Union[torch.FloatTensor, InferenceParams]] = None,
950
+ attention_mask: Optional[torch.BoolTensor] = None,
951
+ labels: Optional[torch.LongTensor] = None,
952
+ **kwargs,
953
+ ) -> CausalLMOutputWithPast:
954
+ hidden_states = self.transformer(input_ids, past_key_values=past_key_values, attention_mask=attention_mask)
955
+ lm_logits = self.lm_head(hidden_states)
956
+
957
+ loss = None
958
+ if labels is not None:
959
+ loss = self.loss(lm_logits, labels)
960
+
961
+ return CausalLMOutputWithPast(loss=loss, logits=lm_logits, past_key_values=past_key_values)
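Review note on the causal mask in `CrossAttention.forward` above: the expression `cols > rows + seqlen_k - seqlen_q` shifts the usual upper-triangular mask so that it still lines up when the keys include cached positions and the queries are only the newest tokens. The following is a minimal pure-Python sketch of that indexing, not the code from this commit; `causal_mask` is a hypothetical helper name and it uses nested lists instead of tensors so it runs without PyTorch.

```python
from typing import List


def causal_mask(seqlen_q: int, seqlen_k: int) -> List[List[bool]]:
    """Return a (seqlen_q x seqlen_k) mask; True marks key positions to hide."""
    # When the keys include cached positions, the queries correspond to the
    # last `seqlen_q` positions of the key sequence, so rows shift by the gap.
    offset = seqlen_k - seqlen_q
    return [[col > row + offset for col in range(seqlen_k)] for row in range(seqlen_q)]


# No cache: three queries over three keys gives the usual upper-triangular mask.
prefill = causal_mask(3, 3)

# Two new queries over four keys (two cached, two new): only the final key
# is hidden from the first new query.
chunked = causal_mask(2, 4)

for row in chunked:
    print("".join("x" if hidden else "." for hidden in row))
```

With a single new query (`seqlen_q == 1`) the mask hides nothing, which is consistent with `_forward_cross_attn` passing `causal=False` once `seqlen_offset` is non-zero.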
phi-1_5-Q2_K.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5ccfaae7a6f38e9124fd6fca68eabdd320f934e8b4bf92bb1027871c7e16a47f
3
+ size 612982176
phi-1_5-Q3_K_M.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5b24908dd1be16b36d22950b3a87a71038b92443b336cd485b920c804f49a412
3
+ size 765451680
phi-1_5-Q4_K_M.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:5dfe310d09cc9ee85251e21c60e0a54d44480e3b69e27190d9f0edb1fc36325f
3
+ size 918314400
phi-1_5-Q5_K_M.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:95a41a80a031d8c676acd26afbcc66087643f731989933229121a7310330d5c6
3
+ size 1059610016
phi-1_5-Q6_K.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:004f3d8102df7f0f98cb9c641062d53d73f24104c2ef4362e0c2540e02eb14e7
3
+ size 1167121824
phi-1_5-Q8_0.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e8c26615319e1141348b8534641da54d58e02b7baa01ee611b9c69cc07bf43fd
3
+ size 1510464928
phi-1_5-f16.gguf ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:08f32c8026ba6770734f4df228c55673198f6a987301cfa340add79e2e3d0f10
3
+ size 2839534976
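Review note: the `size` fields in the seven GGUF pointers above make it easy to compare the quantized files against the f16 file. The sketch below only does arithmetic on the byte counts shown in this diff; it makes no claim about output quality or about bits per weight, since the exact parameter count is not part of this commit.

```python
# Byte sizes copied from the Git LFS pointers in this commit.
GGUF_SIZES = {
    "Q2_K": 612982176,
    "Q3_K_M": 765451680,
    "Q4_K_M": 918314400,
    "Q5_K_M": 1059610016,
    "Q6_K": 1167121824,
    "Q8_0": 1510464928,
    "f16": 2839534976,
}

# Size of each file as a fraction of the f16 file.
RATIOS = {name: size / GGUF_SIZES["f16"] for name, size in GGUF_SIZES.items()}

for name, size in GGUF_SIZES.items():
    print(f"{name:7s} {size / 2**30:5.2f} GiB  {RATIOS[name]:6.1%} of f16")
```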
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:842bc8cf6dd49e0fdcaf745febaaceff37b927185a297d24591b3d0fb275a5b1
3
+ size 2836621662
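Review note: each three-line file above is a Git LFS pointer (`version`, `oid`, `size`), not the weights themselves. A minimal sketch of reading such a pointer and checking a downloaded file against it, assuming the file is already on disk; `parse_lfs_pointer` and `matches_pointer` are hypothetical helper names, not part of any library.

```python
import hashlib
import os
from typing import Dict, Union


def parse_lfs_pointer(text: str) -> Dict[str, Union[str, int]]:
    """Split a Git LFS pointer into its version, hash algorithm, digest and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: str, info: Dict[str, Union[str, int]], chunk_size: int = 1 << 20) -> bool:
    """Check file size first (cheap), then the SHA-256 digest, reading in chunks."""
    if os.path.getsize(path) != info["size"]:
        return False
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == info["digest"]


# Pointer text for pytorch_model.bin, copied from this diff.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:842bc8cf6dd49e0fdcaf745febaaceff37b927185a297d24591b3d0fb275a5b1
size 2836621662
"""

info = parse_lfs_pointer(POINTER)
print(info["algo"], info["size"])
```

Calling `matches_pointer("pytorch_model.bin", info)` after a download would confirm that the file on disk is the one this commit refers to; the path is an assumption about where the file was saved.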
special_tokens_map.json ADDED
@@ -0,0 +1,5 @@
1
+ {
2
+ "bos_token": "<|endoftext|>",
3
+ "eos_token": "<|endoftext|>",
4
+ "unk_token": "<|endoftext|>"
5
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,9 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "bos_token": "<|endoftext|>",
4
+ "clean_up_tokenization_spaces": true,
5
+ "eos_token": "<|endoftext|>",
6
+ "model_max_length": 2048,
7
+ "tokenizer_class": "CodeGenTokenizer",
8
+ "unk_token": "<|endoftext|>"
9
+ }
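Review note: two details of the tokenizer files above are easy to miss. `bos_token`, `eos_token` and `unk_token` all map to the same string, and `model_max_length` caps inputs at 2048 tokens. The sketch below checks both using only the JSON shown in this diff. `clip_to_context` is a hypothetical helper, and keeping the most recent tokens is an illustrative choice, not something this commit specifies.

```python
import json
from typing import Any, Dict, List

# Contents of tokenizer_config.json as added in this commit.
TOKENIZER_CONFIG: Dict[str, Any] = json.loads("""
{
  "add_prefix_space": false,
  "bos_token": "<|endoftext|>",
  "clean_up_tokenization_spaces": true,
  "eos_token": "<|endoftext|>",
  "model_max_length": 2048,
  "tokenizer_class": "CodeGenTokenizer",
  "unk_token": "<|endoftext|>"
}
""")


def special_tokens_are_shared(config: Dict[str, Any]) -> bool:
    """True when BOS, EOS and UNK are all the same token string."""
    return len({config["bos_token"], config["eos_token"], config["unk_token"]}) == 1


def clip_to_context(token_ids: List[int], config: Dict[str, Any]) -> List[int]:
    """Hypothetical helper: keep at most `model_max_length` of the most recent ids."""
    return token_ids[-config["model_max_length"]:]


print(special_tokens_are_shared(TOKENIZER_CONFIG))
print(len(clip_to_context(list(range(3000)), TOKENIZER_CONFIG)))
```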
vocab.json ADDED
The diff for this file is too large to render. See raw diff