mrhacker7599 committed on
Commit ed0f56d
1 Parent(s): d13d103

Upload 33 files

README.md CHANGED
@@ -1,3 +1,49 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ language:
3
+ - en
4
+ ---
5
+
6
+ # 🌔 moondream1
7
+
8
+ moondream1 is a 1.6B parameter model built by [@vikhyatk](https://x.com/vikhyatk) using SigLIP, Phi-1.5, and the LLaVA training dataset.
9
+ The model is released for research purposes only; commercial use is not allowed.
10
+
11
+ Try it out on [Hugging Face Spaces](https://huggingface.co/spaces/vikhyatk/moondream1)!
12
+
13
+ **Usage**
14
+
15
+ ```
16
+ pip install transformers timm einops
17
+ ```
18
+
19
+ ```python
20
+ from transformers import AutoModelForCausalLM, CodeGenTokenizerFast as Tokenizer
21
+ from PIL import Image
22
+
23
+ model_id = "vikhyatk/moondream1"
24
+ model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)  # trust_remote_code is needed to load this repo's custom Moondream class
25
+ tokenizer = Tokenizer.from_pretrained(model_id)
26
+
27
+ image = Image.open('<IMAGE_PATH>')
28
+ enc_image = model.encode_image(image)
29
+ print(model.answer_question(enc_image, "<QUESTION>", tokenizer))
30
+ ```
31
+
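+ Because the image is encoded once and the encoding is passed separately to `answer_question`, the same `enc_image` can be queried more than once (the Examples below do exactly that). A minimal sketch continuing the snippet above; reusing the encoding this way is an assumption based on that API rather than documented behaviour:
+
+ ```python
+ # Ask several questions about the same encoded image, reusing enc_image from above.
+ questions = [
+     "What is in this image?",
+     "What colors stand out?",
+ ]
+ for q in questions:
+     print(q)
+     print(model.answer_question(enc_image, q, tokenizer))
+ ```
+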
32
+ ## Benchmarks
33
+
34
+ | Model | Parameters | VQAv2 | GQA | TextVQA |
35
+ | --- | --- | --- | --- | --- |
36
+ | LLaVA-1.5 | 13.3B | 80.0 | 63.3 | 61.3 |
37
+ | LLaVA-1.5 | 7.3B | 78.5 | 62.0 | 58.2 |
38
+ | **moondream1** | 1.6B | 74.7 | 57.9 | 35.6 |
39
+
40
+
41
+ ## Examples
42
+
43
+ | Image | Examples |
44
+ | --- | --- |
45
+ | ![](assets/demo-1.jpg) | **What is the title of this book?**<br>The Little Book of Deep Learning<br><br>**What can you tell me about this book?**<br>The book in the image is titled "The Little Book of Deep Learning." It appears to be a guide or manual that provides information and instructions on the subject of deep learning. The book is described as being small and thin, making it easy to carry and read. It is set in a room with a view outside, suggesting that the person holding the book might be in a living or working space with a view. The content of the book is likely to be complex, as the title implies a focus on advanced concepts in deep learning.<br><br>**Where is the person standing?**<br> The person is standing on a balcony or deck, which is outside. |
46
+ | ![](assets/demo-2.jpg) | **What type of food is the girl holding?**<br>The girl is holding a hamburger.<br><br>**What color is the woman's hair?**<br>The woman's hair is white.<br><br>**What is the girl doing in the image?**<br>The girl in the image is eating a hamburger. |
47
+ | ![](assets/demo-3.jpg) | **What color is the train?**<br>The train is red.<br><br>**What is the bus doing in the image?**<br> The bus in the image is driving down a city street, passing through an intersection.<br><br>**What kind of public transportation is in the image?**<br>The image features a red train or trolley as the public transportation option. |
48
+ | ![](assets/demo-4.jpg) | **What is the girl looking at?**<br>The girl is looking at her reflection in the mirror.<br><br>**What color is the woman's hair?**<br>The woman's hair is brown.<br><br>**What is the woman doing in the mirror?**<br>The woman is adjusting her hair in the mirror. |
49
+ | ![](assets/demo-5.jpg) | **What might be the reason for the dog's aggressive behavior?**<br>The dog's aggressive behavior, including its growling, baring its teeth, and showing its sharp teeth, could be due to various reasons. It is possible that the dog is feeling threatened, scared, or defensive about something in its environment, such as another animal, a loud noise, or a person nearby. Alternatively, the dog might be reacting to a command it received from its owner, or it could be experiencing discomfort or pain, which is causing it to display aggressive behavior. It is essential to observe the dog's body language and context to determine the underlying cause of its aggressive behavior and respond accordingly to address its needs or concerns.<br><br>**What kind of dog is in the picture?**<br>The dog in the picture is a small, black and white chihuahua. |
added_tokens.json ADDED
@@ -0,0 +1,40 @@
1
+ {
2
+ "\t\t": 50294,
3
+ "\t\t\t": 50293,
4
+ "\t\t\t\t": 50292,
5
+ "\t\t\t\t\t": 50291,
6
+ "\t\t\t\t\t\t": 50290,
7
+ "\t\t\t\t\t\t\t": 50289,
8
+ "\t\t\t\t\t\t\t\t": 50288,
9
+ "\t\t\t\t\t\t\t\t\t": 50287,
10
+ " ": 50286,
11
+ " ": 50285,
12
+ " ": 50284,
13
+ " ": 50283,
14
+ " ": 50282,
15
+ " ": 50281,
16
+ " ": 50280,
17
+ " ": 50279,
18
+ " ": 50278,
19
+ " ": 50277,
20
+ " ": 50276,
21
+ " ": 50275,
22
+ " ": 50274,
23
+ " ": 50273,
24
+ " ": 50272,
25
+ " ": 50271,
26
+ " ": 50270,
27
+ " ": 50269,
28
+ " ": 50268,
29
+ " ": 50267,
30
+ " ": 50266,
31
+ " ": 50265,
32
+ " ": 50264,
33
+ " ": 50263,
34
+ " ": 50262,
35
+ " ": 50261,
36
+ " ": 50260,
37
+ " ": 50259,
38
+ " ": 50258,
39
+ " ": 50257
40
+ }
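The file above registers runs of tabs and spaces as single tokens with IDs 50257–50294, on top of a 50257-token GPT-2-style base vocabulary (implied by the added IDs starting at 50257). A quick, hedged check that the tokenizer picks these up; the printed count is an expectation derived from the 38 entries above, not something stated in the commit:

```python
from transformers import CodeGenTokenizerFast as Tokenizer

# 38 added whitespace-run tokens (IDs 50257-50294) on top of the 50257-token base vocab.
tokenizer = Tokenizer.from_pretrained("vikhyatk/moondream1")
print(len(tokenizer))  # expected: 50295
```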
assets/demo-1.jpg ADDED
assets/demo-2.jpg ADDED
assets/demo-3.jpg ADDED
assets/demo-4.jpg ADDED
assets/demo-5.jpg ADDED
config.json ADDED
@@ -0,0 +1,15 @@
1
+ {
2
+ "architectures": [
3
+ "Moondream"
4
+ ],
5
+ "auto_map": {
6
+ "AutoConfig": "configuration_moondream.MoondreamConfig",
7
+ "AutoModelForCausalLM": "moondream.Moondream"
8
+ },
9
+ "model_type": "moondream1",
10
+ "phi_config": {
11
+ "model_type": "phi-msft"
12
+ },
13
+ "torch_dtype": "float16",
14
+ "transformers_version": "4.36.2"
15
+ }
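The `auto_map` entries are why the README passes `trust_remote_code=True`: they point `AutoConfig` and `AutoModelForCausalLM` at the custom classes shipped in this repo (`configuration_moondream.MoondreamConfig` and `moondream.Moondream`). A small sketch of that resolution; the printed class name is an expectation based on the mapping above:

```python
from transformers import AutoConfig

# trust_remote_code=True lets transformers import configuration_moondream.py from the
# repo and instantiate the config class named in auto_map.
config = AutoConfig.from_pretrained("vikhyatk/moondream1", trust_remote_code=True)
print(type(config).__name__)  # expected: MoondreamConfig
print(config.model_type)      # moondream1
```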
configuration_moondream.py ADDED
@@ -0,0 +1,74 @@
1
+ from transformers import PretrainedConfig
2
+
3
+ from typing import Optional
4
+ import math
5
+
6
+
7
+ class PhiConfig(PretrainedConfig):
8
+ model_type = "phi-msft"
9
+
10
+ def __init__(
11
+ self,
12
+ vocab_size: int = 51200,
13
+ n_positions: int = 2048,
14
+ n_embd: int = 2048,
15
+ n_layer: int = 24,
16
+ n_inner: Optional[int] = None,
17
+ n_head: int = 32,
18
+ n_head_kv: Optional[int] = None,
19
+ rotary_dim: Optional[int] = 32,
20
+ activation_function: Optional[str] = "gelu_new",
21
+ flash_attn: bool = False,
22
+ flash_rotary: bool = False,
23
+ fused_dense: bool = False,
24
+ attn_pdrop: float = 0.0,
25
+ embd_pdrop: float = 0.0,
26
+ resid_pdrop: float = 0.0,
27
+ layer_norm_epsilon: float = 1e-5,
28
+ initializer_range: float = 0.02,
29
+ tie_word_embeddings: bool = False,
30
+ pad_vocab_size_multiple: int = 64,
31
+ gradient_checkpointing: bool = False,
32
+ **kwargs
33
+ ):
34
+ pad_vocab_size = (
35
+ math.ceil(vocab_size / pad_vocab_size_multiple) * pad_vocab_size_multiple
36
+ )
37
+ super().__init__(
38
+ vocab_size=pad_vocab_size,
39
+ n_positions=n_positions,
40
+ n_embd=n_embd,
41
+ n_layer=n_layer,
42
+ n_inner=n_inner,
43
+ n_head=n_head,
44
+ n_head_kv=n_head_kv,
45
+ activation_function=activation_function,
46
+ attn_pdrop=attn_pdrop,
47
+ embd_pdrop=embd_pdrop,
48
+ resid_pdrop=resid_pdrop,
49
+ layer_norm_epsilon=layer_norm_epsilon,
50
+ initializer_range=initializer_range,
51
+ pad_vocab_size_multiple=pad_vocab_size_multiple,
52
+ tie_word_embeddings=tie_word_embeddings,
53
+ gradient_checkpointing=gradient_checkpointing,
54
+ **kwargs
55
+ )
56
+ self.rotary_dim = min(rotary_dim, n_embd // n_head)
57
+ self.flash_attn = flash_attn
58
+ self.flash_rotary = flash_rotary
59
+ self.fused_dense = fused_dense
60
+
61
+ attribute_map = {
62
+ "max_position_embeddings": "n_positions",
63
+ "hidden_size": "n_embd",
64
+ "num_attention_heads": "n_head",
65
+ "num_hidden_layers": "n_layer",
66
+ }
67
+
68
+
69
+ class MoondreamConfig(PretrainedConfig):
70
+ model_type = "moondream1"
71
+
72
+ def __init__(self, **kwargs):
73
+ self.phi_config = PhiConfig(**kwargs)
74
+ super().__init__(**kwargs)
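Two derived values in `PhiConfig` are worth spelling out: the vocabulary padding and the rotary dimension cap. With the defaults shown above, 51200 is already a multiple of 64 (so no padding is added), and the rotary dimension is the smaller of `rotary_dim` and the per-head size. The arithmetic, as plain Python:

```python
import math

# Defaults from PhiConfig above.
vocab_size, pad_multiple = 51200, 64
print(math.ceil(vocab_size / pad_multiple) * pad_multiple)  # 51200 -> stays 51200

n_embd, n_head, rotary_dim = 2048, 32, 32
head_dim = n_embd // n_head        # 64
print(min(rotary_dim, head_dim))   # 32 -> rotary embedding covers half of each head
```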
generation_config.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "transformers_version": "4.36.2"
4
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model-00001-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:44ea739f35b3eae160979d3bc03e4a091816a61acad2a58aff3518812c891b1c
3
+ size 135
model-00002-of-00002.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:e0520d63ad66cc7dfe1f8cc6a8230735ce8791152917b45fe9e7eec751f86526
3
+ size 135
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:3746971ff772573912a5bb83d1a3dce1bde96eb49d2ac5dc504e31a9aa60105e
3
+ size 135
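Each of the three weight files above is stored as a Git LFS pointer, with the payload addressed by its `oid sha256:...` digest. A hypothetical helper (the function name and chunk size are illustrative, not part of this repo) for checking a downloaded file against the digest in its pointer:

```python
import hashlib

def matches_lfs_oid(path: str, oid_hex: str) -> bool:
    """Return True if the file's SHA-256 equals the pointer's oid (illustrative helper)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # hash in 1 MiB chunks
            h.update(chunk)
    return h.hexdigest() == oid_hex

# oid taken from the model.safetensors pointer above
print(matches_lfs_oid(
    "model.safetensors",
    "3746971ff772573912a5bb83d1a3dce1bde96eb49d2ac5dc504e31a9aa60105e",
))
```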
model.safetensors.index.json ADDED
@@ -0,0 +1,591 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 7564205504
4
+ },
5
+ "weight_map": {
6
+ "text_model.model.lm_head.linear.bias": "model-00002-of-00002.safetensors",
7
+ "text_model.model.lm_head.linear.weight": "model-00002-of-00002.safetensors",
8
+ "text_model.model.lm_head.ln.bias": "model-00002-of-00002.safetensors",
9
+ "text_model.model.lm_head.ln.weight": "model-00002-of-00002.safetensors",
10
+ "text_model.model.transformer.embd.wte.weight": "model-00001-of-00002.safetensors",
11
+ "text_model.model.transformer.h.0.ln.bias": "model-00001-of-00002.safetensors",
12
+ "text_model.model.transformer.h.0.ln.weight": "model-00001-of-00002.safetensors",
13
+ "text_model.model.transformer.h.0.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
14
+ "text_model.model.transformer.h.0.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
15
+ "text_model.model.transformer.h.0.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
16
+ "text_model.model.transformer.h.0.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
17
+ "text_model.model.transformer.h.0.mlp.fc1.bias": "model-00001-of-00002.safetensors",
18
+ "text_model.model.transformer.h.0.mlp.fc1.weight": "model-00001-of-00002.safetensors",
19
+ "text_model.model.transformer.h.0.mlp.fc2.bias": "model-00001-of-00002.safetensors",
20
+ "text_model.model.transformer.h.0.mlp.fc2.weight": "model-00001-of-00002.safetensors",
21
+ "text_model.model.transformer.h.1.ln.bias": "model-00001-of-00002.safetensors",
22
+ "text_model.model.transformer.h.1.ln.weight": "model-00001-of-00002.safetensors",
23
+ "text_model.model.transformer.h.1.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
24
+ "text_model.model.transformer.h.1.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
25
+ "text_model.model.transformer.h.1.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
26
+ "text_model.model.transformer.h.1.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
27
+ "text_model.model.transformer.h.1.mlp.fc1.bias": "model-00001-of-00002.safetensors",
28
+ "text_model.model.transformer.h.1.mlp.fc1.weight": "model-00001-of-00002.safetensors",
29
+ "text_model.model.transformer.h.1.mlp.fc2.bias": "model-00001-of-00002.safetensors",
30
+ "text_model.model.transformer.h.1.mlp.fc2.weight": "model-00001-of-00002.safetensors",
31
+ "text_model.model.transformer.h.10.ln.bias": "model-00001-of-00002.safetensors",
32
+ "text_model.model.transformer.h.10.ln.weight": "model-00001-of-00002.safetensors",
33
+ "text_model.model.transformer.h.10.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
34
+ "text_model.model.transformer.h.10.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
35
+ "text_model.model.transformer.h.10.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
36
+ "text_model.model.transformer.h.10.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
37
+ "text_model.model.transformer.h.10.mlp.fc1.bias": "model-00001-of-00002.safetensors",
38
+ "text_model.model.transformer.h.10.mlp.fc1.weight": "model-00001-of-00002.safetensors",
39
+ "text_model.model.transformer.h.10.mlp.fc2.bias": "model-00001-of-00002.safetensors",
40
+ "text_model.model.transformer.h.10.mlp.fc2.weight": "model-00001-of-00002.safetensors",
41
+ "text_model.model.transformer.h.11.ln.bias": "model-00001-of-00002.safetensors",
42
+ "text_model.model.transformer.h.11.ln.weight": "model-00001-of-00002.safetensors",
43
+ "text_model.model.transformer.h.11.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
44
+ "text_model.model.transformer.h.11.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
45
+ "text_model.model.transformer.h.11.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
46
+ "text_model.model.transformer.h.11.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
47
+ "text_model.model.transformer.h.11.mlp.fc1.bias": "model-00001-of-00002.safetensors",
48
+ "text_model.model.transformer.h.11.mlp.fc1.weight": "model-00001-of-00002.safetensors",
49
+ "text_model.model.transformer.h.11.mlp.fc2.bias": "model-00001-of-00002.safetensors",
50
+ "text_model.model.transformer.h.11.mlp.fc2.weight": "model-00001-of-00002.safetensors",
51
+ "text_model.model.transformer.h.12.ln.bias": "model-00001-of-00002.safetensors",
52
+ "text_model.model.transformer.h.12.ln.weight": "model-00001-of-00002.safetensors",
53
+ "text_model.model.transformer.h.12.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
54
+ "text_model.model.transformer.h.12.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
55
+ "text_model.model.transformer.h.12.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
56
+ "text_model.model.transformer.h.12.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
57
+ "text_model.model.transformer.h.12.mlp.fc1.bias": "model-00001-of-00002.safetensors",
58
+ "text_model.model.transformer.h.12.mlp.fc1.weight": "model-00001-of-00002.safetensors",
59
+ "text_model.model.transformer.h.12.mlp.fc2.bias": "model-00001-of-00002.safetensors",
60
+ "text_model.model.transformer.h.12.mlp.fc2.weight": "model-00001-of-00002.safetensors",
61
+ "text_model.model.transformer.h.13.ln.bias": "model-00001-of-00002.safetensors",
62
+ "text_model.model.transformer.h.13.ln.weight": "model-00001-of-00002.safetensors",
63
+ "text_model.model.transformer.h.13.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
64
+ "text_model.model.transformer.h.13.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
65
+ "text_model.model.transformer.h.13.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
66
+ "text_model.model.transformer.h.13.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
67
+ "text_model.model.transformer.h.13.mlp.fc1.bias": "model-00002-of-00002.safetensors",
68
+ "text_model.model.transformer.h.13.mlp.fc1.weight": "model-00002-of-00002.safetensors",
69
+ "text_model.model.transformer.h.13.mlp.fc2.bias": "model-00002-of-00002.safetensors",
70
+ "text_model.model.transformer.h.13.mlp.fc2.weight": "model-00002-of-00002.safetensors",
71
+ "text_model.model.transformer.h.14.ln.bias": "model-00002-of-00002.safetensors",
72
+ "text_model.model.transformer.h.14.ln.weight": "model-00002-of-00002.safetensors",
73
+ "text_model.model.transformer.h.14.mixer.Wqkv.bias": "model-00002-of-00002.safetensors",
74
+ "text_model.model.transformer.h.14.mixer.Wqkv.weight": "model-00002-of-00002.safetensors",
75
+ "text_model.model.transformer.h.14.mixer.out_proj.bias": "model-00002-of-00002.safetensors",
76
+ "text_model.model.transformer.h.14.mixer.out_proj.weight": "model-00002-of-00002.safetensors",
77
+ "text_model.model.transformer.h.14.mlp.fc1.bias": "model-00002-of-00002.safetensors",
78
+ "text_model.model.transformer.h.14.mlp.fc1.weight": "model-00002-of-00002.safetensors",
79
+ "text_model.model.transformer.h.14.mlp.fc2.bias": "model-00002-of-00002.safetensors",
80
+ "text_model.model.transformer.h.14.mlp.fc2.weight": "model-00002-of-00002.safetensors",
81
+ "text_model.model.transformer.h.15.ln.bias": "model-00002-of-00002.safetensors",
82
+ "text_model.model.transformer.h.15.ln.weight": "model-00002-of-00002.safetensors",
83
+ "text_model.model.transformer.h.15.mixer.Wqkv.bias": "model-00002-of-00002.safetensors",
84
+ "text_model.model.transformer.h.15.mixer.Wqkv.weight": "model-00002-of-00002.safetensors",
85
+ "text_model.model.transformer.h.15.mixer.out_proj.bias": "model-00002-of-00002.safetensors",
86
+ "text_model.model.transformer.h.15.mixer.out_proj.weight": "model-00002-of-00002.safetensors",
87
+ "text_model.model.transformer.h.15.mlp.fc1.bias": "model-00002-of-00002.safetensors",
88
+ "text_model.model.transformer.h.15.mlp.fc1.weight": "model-00002-of-00002.safetensors",
89
+ "text_model.model.transformer.h.15.mlp.fc2.bias": "model-00002-of-00002.safetensors",
90
+ "text_model.model.transformer.h.15.mlp.fc2.weight": "model-00002-of-00002.safetensors",
91
+ "text_model.model.transformer.h.16.ln.bias": "model-00002-of-00002.safetensors",
92
+ "text_model.model.transformer.h.16.ln.weight": "model-00002-of-00002.safetensors",
93
+ "text_model.model.transformer.h.16.mixer.Wqkv.bias": "model-00002-of-00002.safetensors",
94
+ "text_model.model.transformer.h.16.mixer.Wqkv.weight": "model-00002-of-00002.safetensors",
95
+ "text_model.model.transformer.h.16.mixer.out_proj.bias": "model-00002-of-00002.safetensors",
96
+ "text_model.model.transformer.h.16.mixer.out_proj.weight": "model-00002-of-00002.safetensors",
97
+ "text_model.model.transformer.h.16.mlp.fc1.bias": "model-00002-of-00002.safetensors",
98
+ "text_model.model.transformer.h.16.mlp.fc1.weight": "model-00002-of-00002.safetensors",
99
+ "text_model.model.transformer.h.16.mlp.fc2.bias": "model-00002-of-00002.safetensors",
100
+ "text_model.model.transformer.h.16.mlp.fc2.weight": "model-00002-of-00002.safetensors",
101
+ "text_model.model.transformer.h.17.ln.bias": "model-00002-of-00002.safetensors",
102
+ "text_model.model.transformer.h.17.ln.weight": "model-00002-of-00002.safetensors",
103
+ "text_model.model.transformer.h.17.mixer.Wqkv.bias": "model-00002-of-00002.safetensors",
104
+ "text_model.model.transformer.h.17.mixer.Wqkv.weight": "model-00002-of-00002.safetensors",
105
+ "text_model.model.transformer.h.17.mixer.out_proj.bias": "model-00002-of-00002.safetensors",
106
+ "text_model.model.transformer.h.17.mixer.out_proj.weight": "model-00002-of-00002.safetensors",
107
+ "text_model.model.transformer.h.17.mlp.fc1.bias": "model-00002-of-00002.safetensors",
108
+ "text_model.model.transformer.h.17.mlp.fc1.weight": "model-00002-of-00002.safetensors",
109
+ "text_model.model.transformer.h.17.mlp.fc2.bias": "model-00002-of-00002.safetensors",
110
+ "text_model.model.transformer.h.17.mlp.fc2.weight": "model-00002-of-00002.safetensors",
111
+ "text_model.model.transformer.h.18.ln.bias": "model-00002-of-00002.safetensors",
112
+ "text_model.model.transformer.h.18.ln.weight": "model-00002-of-00002.safetensors",
113
+ "text_model.model.transformer.h.18.mixer.Wqkv.bias": "model-00002-of-00002.safetensors",
114
+ "text_model.model.transformer.h.18.mixer.Wqkv.weight": "model-00002-of-00002.safetensors",
115
+ "text_model.model.transformer.h.18.mixer.out_proj.bias": "model-00002-of-00002.safetensors",
116
+ "text_model.model.transformer.h.18.mixer.out_proj.weight": "model-00002-of-00002.safetensors",
117
+ "text_model.model.transformer.h.18.mlp.fc1.bias": "model-00002-of-00002.safetensors",
118
+ "text_model.model.transformer.h.18.mlp.fc1.weight": "model-00002-of-00002.safetensors",
119
+ "text_model.model.transformer.h.18.mlp.fc2.bias": "model-00002-of-00002.safetensors",
120
+ "text_model.model.transformer.h.18.mlp.fc2.weight": "model-00002-of-00002.safetensors",
121
+ "text_model.model.transformer.h.19.ln.bias": "model-00002-of-00002.safetensors",
122
+ "text_model.model.transformer.h.19.ln.weight": "model-00002-of-00002.safetensors",
123
+ "text_model.model.transformer.h.19.mixer.Wqkv.bias": "model-00002-of-00002.safetensors",
124
+ "text_model.model.transformer.h.19.mixer.Wqkv.weight": "model-00002-of-00002.safetensors",
125
+ "text_model.model.transformer.h.19.mixer.out_proj.bias": "model-00002-of-00002.safetensors",
126
+ "text_model.model.transformer.h.19.mixer.out_proj.weight": "model-00002-of-00002.safetensors",
127
+ "text_model.model.transformer.h.19.mlp.fc1.bias": "model-00002-of-00002.safetensors",
128
+ "text_model.model.transformer.h.19.mlp.fc1.weight": "model-00002-of-00002.safetensors",
129
+ "text_model.model.transformer.h.19.mlp.fc2.bias": "model-00002-of-00002.safetensors",
130
+ "text_model.model.transformer.h.19.mlp.fc2.weight": "model-00002-of-00002.safetensors",
131
+ "text_model.model.transformer.h.2.ln.bias": "model-00001-of-00002.safetensors",
132
+ "text_model.model.transformer.h.2.ln.weight": "model-00001-of-00002.safetensors",
133
+ "text_model.model.transformer.h.2.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
134
+ "text_model.model.transformer.h.2.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
135
+ "text_model.model.transformer.h.2.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
136
+ "text_model.model.transformer.h.2.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
137
+ "text_model.model.transformer.h.2.mlp.fc1.bias": "model-00001-of-00002.safetensors",
138
+ "text_model.model.transformer.h.2.mlp.fc1.weight": "model-00001-of-00002.safetensors",
139
+ "text_model.model.transformer.h.2.mlp.fc2.bias": "model-00001-of-00002.safetensors",
140
+ "text_model.model.transformer.h.2.mlp.fc2.weight": "model-00001-of-00002.safetensors",
141
+ "text_model.model.transformer.h.20.ln.bias": "model-00002-of-00002.safetensors",
142
+ "text_model.model.transformer.h.20.ln.weight": "model-00002-of-00002.safetensors",
143
+ "text_model.model.transformer.h.20.mixer.Wqkv.bias": "model-00002-of-00002.safetensors",
144
+ "text_model.model.transformer.h.20.mixer.Wqkv.weight": "model-00002-of-00002.safetensors",
145
+ "text_model.model.transformer.h.20.mixer.out_proj.bias": "model-00002-of-00002.safetensors",
146
+ "text_model.model.transformer.h.20.mixer.out_proj.weight": "model-00002-of-00002.safetensors",
147
+ "text_model.model.transformer.h.20.mlp.fc1.bias": "model-00002-of-00002.safetensors",
148
+ "text_model.model.transformer.h.20.mlp.fc1.weight": "model-00002-of-00002.safetensors",
149
+ "text_model.model.transformer.h.20.mlp.fc2.bias": "model-00002-of-00002.safetensors",
150
+ "text_model.model.transformer.h.20.mlp.fc2.weight": "model-00002-of-00002.safetensors",
151
+ "text_model.model.transformer.h.21.ln.bias": "model-00002-of-00002.safetensors",
152
+ "text_model.model.transformer.h.21.ln.weight": "model-00002-of-00002.safetensors",
153
+ "text_model.model.transformer.h.21.mixer.Wqkv.bias": "model-00002-of-00002.safetensors",
154
+ "text_model.model.transformer.h.21.mixer.Wqkv.weight": "model-00002-of-00002.safetensors",
155
+ "text_model.model.transformer.h.21.mixer.out_proj.bias": "model-00002-of-00002.safetensors",
156
+ "text_model.model.transformer.h.21.mixer.out_proj.weight": "model-00002-of-00002.safetensors",
157
+ "text_model.model.transformer.h.21.mlp.fc1.bias": "model-00002-of-00002.safetensors",
158
+ "text_model.model.transformer.h.21.mlp.fc1.weight": "model-00002-of-00002.safetensors",
159
+ "text_model.model.transformer.h.21.mlp.fc2.bias": "model-00002-of-00002.safetensors",
160
+ "text_model.model.transformer.h.21.mlp.fc2.weight": "model-00002-of-00002.safetensors",
161
+ "text_model.model.transformer.h.22.ln.bias": "model-00002-of-00002.safetensors",
162
+ "text_model.model.transformer.h.22.ln.weight": "model-00002-of-00002.safetensors",
163
+ "text_model.model.transformer.h.22.mixer.Wqkv.bias": "model-00002-of-00002.safetensors",
164
+ "text_model.model.transformer.h.22.mixer.Wqkv.weight": "model-00002-of-00002.safetensors",
165
+ "text_model.model.transformer.h.22.mixer.out_proj.bias": "model-00002-of-00002.safetensors",
166
+ "text_model.model.transformer.h.22.mixer.out_proj.weight": "model-00002-of-00002.safetensors",
167
+ "text_model.model.transformer.h.22.mlp.fc1.bias": "model-00002-of-00002.safetensors",
168
+ "text_model.model.transformer.h.22.mlp.fc1.weight": "model-00002-of-00002.safetensors",
169
+ "text_model.model.transformer.h.22.mlp.fc2.bias": "model-00002-of-00002.safetensors",
170
+ "text_model.model.transformer.h.22.mlp.fc2.weight": "model-00002-of-00002.safetensors",
171
+ "text_model.model.transformer.h.23.ln.bias": "model-00002-of-00002.safetensors",
172
+ "text_model.model.transformer.h.23.ln.weight": "model-00002-of-00002.safetensors",
173
+ "text_model.model.transformer.h.23.mixer.Wqkv.bias": "model-00002-of-00002.safetensors",
174
+ "text_model.model.transformer.h.23.mixer.Wqkv.weight": "model-00002-of-00002.safetensors",
175
+ "text_model.model.transformer.h.23.mixer.out_proj.bias": "model-00002-of-00002.safetensors",
176
+ "text_model.model.transformer.h.23.mixer.out_proj.weight": "model-00002-of-00002.safetensors",
177
+ "text_model.model.transformer.h.23.mlp.fc1.bias": "model-00002-of-00002.safetensors",
178
+ "text_model.model.transformer.h.23.mlp.fc1.weight": "model-00002-of-00002.safetensors",
179
+ "text_model.model.transformer.h.23.mlp.fc2.bias": "model-00002-of-00002.safetensors",
180
+ "text_model.model.transformer.h.23.mlp.fc2.weight": "model-00002-of-00002.safetensors",
181
+ "text_model.model.transformer.h.3.ln.bias": "model-00001-of-00002.safetensors",
182
+ "text_model.model.transformer.h.3.ln.weight": "model-00001-of-00002.safetensors",
183
+ "text_model.model.transformer.h.3.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
184
+ "text_model.model.transformer.h.3.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
185
+ "text_model.model.transformer.h.3.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
186
+ "text_model.model.transformer.h.3.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
187
+ "text_model.model.transformer.h.3.mlp.fc1.bias": "model-00001-of-00002.safetensors",
188
+ "text_model.model.transformer.h.3.mlp.fc1.weight": "model-00001-of-00002.safetensors",
189
+ "text_model.model.transformer.h.3.mlp.fc2.bias": "model-00001-of-00002.safetensors",
190
+ "text_model.model.transformer.h.3.mlp.fc2.weight": "model-00001-of-00002.safetensors",
191
+ "text_model.model.transformer.h.4.ln.bias": "model-00001-of-00002.safetensors",
192
+ "text_model.model.transformer.h.4.ln.weight": "model-00001-of-00002.safetensors",
193
+ "text_model.model.transformer.h.4.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
194
+ "text_model.model.transformer.h.4.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
195
+ "text_model.model.transformer.h.4.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
196
+ "text_model.model.transformer.h.4.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
197
+ "text_model.model.transformer.h.4.mlp.fc1.bias": "model-00001-of-00002.safetensors",
198
+ "text_model.model.transformer.h.4.mlp.fc1.weight": "model-00001-of-00002.safetensors",
199
+ "text_model.model.transformer.h.4.mlp.fc2.bias": "model-00001-of-00002.safetensors",
200
+ "text_model.model.transformer.h.4.mlp.fc2.weight": "model-00001-of-00002.safetensors",
201
+ "text_model.model.transformer.h.5.ln.bias": "model-00001-of-00002.safetensors",
202
+ "text_model.model.transformer.h.5.ln.weight": "model-00001-of-00002.safetensors",
203
+ "text_model.model.transformer.h.5.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
204
+ "text_model.model.transformer.h.5.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
205
+ "text_model.model.transformer.h.5.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
206
+ "text_model.model.transformer.h.5.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
207
+ "text_model.model.transformer.h.5.mlp.fc1.bias": "model-00001-of-00002.safetensors",
208
+ "text_model.model.transformer.h.5.mlp.fc1.weight": "model-00001-of-00002.safetensors",
209
+ "text_model.model.transformer.h.5.mlp.fc2.bias": "model-00001-of-00002.safetensors",
210
+ "text_model.model.transformer.h.5.mlp.fc2.weight": "model-00001-of-00002.safetensors",
211
+ "text_model.model.transformer.h.6.ln.bias": "model-00001-of-00002.safetensors",
212
+ "text_model.model.transformer.h.6.ln.weight": "model-00001-of-00002.safetensors",
213
+ "text_model.model.transformer.h.6.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
214
+ "text_model.model.transformer.h.6.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
215
+ "text_model.model.transformer.h.6.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
216
+ "text_model.model.transformer.h.6.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
217
+ "text_model.model.transformer.h.6.mlp.fc1.bias": "model-00001-of-00002.safetensors",
218
+ "text_model.model.transformer.h.6.mlp.fc1.weight": "model-00001-of-00002.safetensors",
219
+ "text_model.model.transformer.h.6.mlp.fc2.bias": "model-00001-of-00002.safetensors",
220
+ "text_model.model.transformer.h.6.mlp.fc2.weight": "model-00001-of-00002.safetensors",
221
+ "text_model.model.transformer.h.7.ln.bias": "model-00001-of-00002.safetensors",
222
+ "text_model.model.transformer.h.7.ln.weight": "model-00001-of-00002.safetensors",
223
+ "text_model.model.transformer.h.7.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
224
+ "text_model.model.transformer.h.7.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
225
+ "text_model.model.transformer.h.7.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
226
+ "text_model.model.transformer.h.7.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
227
+ "text_model.model.transformer.h.7.mlp.fc1.bias": "model-00001-of-00002.safetensors",
228
+ "text_model.model.transformer.h.7.mlp.fc1.weight": "model-00001-of-00002.safetensors",
229
+ "text_model.model.transformer.h.7.mlp.fc2.bias": "model-00001-of-00002.safetensors",
230
+ "text_model.model.transformer.h.7.mlp.fc2.weight": "model-00001-of-00002.safetensors",
231
+ "text_model.model.transformer.h.8.ln.bias": "model-00001-of-00002.safetensors",
232
+ "text_model.model.transformer.h.8.ln.weight": "model-00001-of-00002.safetensors",
233
+ "text_model.model.transformer.h.8.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
234
+ "text_model.model.transformer.h.8.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
235
+ "text_model.model.transformer.h.8.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
236
+ "text_model.model.transformer.h.8.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
237
+ "text_model.model.transformer.h.8.mlp.fc1.bias": "model-00001-of-00002.safetensors",
238
+ "text_model.model.transformer.h.8.mlp.fc1.weight": "model-00001-of-00002.safetensors",
239
+ "text_model.model.transformer.h.8.mlp.fc2.bias": "model-00001-of-00002.safetensors",
240
+ "text_model.model.transformer.h.8.mlp.fc2.weight": "model-00001-of-00002.safetensors",
241
+ "text_model.model.transformer.h.9.ln.bias": "model-00001-of-00002.safetensors",
242
+ "text_model.model.transformer.h.9.ln.weight": "model-00001-of-00002.safetensors",
243
+ "text_model.model.transformer.h.9.mixer.Wqkv.bias": "model-00001-of-00002.safetensors",
244
+ "text_model.model.transformer.h.9.mixer.Wqkv.weight": "model-00001-of-00002.safetensors",
245
+ "text_model.model.transformer.h.9.mixer.out_proj.bias": "model-00001-of-00002.safetensors",
246
+ "text_model.model.transformer.h.9.mixer.out_proj.weight": "model-00001-of-00002.safetensors",
247
+ "text_model.model.transformer.h.9.mlp.fc1.bias": "model-00001-of-00002.safetensors",
248
+ "text_model.model.transformer.h.9.mlp.fc1.weight": "model-00001-of-00002.safetensors",
249
+ "text_model.model.transformer.h.9.mlp.fc2.bias": "model-00001-of-00002.safetensors",
250
+ "text_model.model.transformer.h.9.mlp.fc2.weight": "model-00001-of-00002.safetensors",
251
+ "vision_encoder.model.encoder.model.visual.blocks.0.attn.proj.bias": "model-00001-of-00002.safetensors",
252
+ "vision_encoder.model.encoder.model.visual.blocks.0.attn.proj.weight": "model-00001-of-00002.safetensors",
253
+ "vision_encoder.model.encoder.model.visual.blocks.0.attn.qkv.bias": "model-00001-of-00002.safetensors",
254
+ "vision_encoder.model.encoder.model.visual.blocks.0.attn.qkv.weight": "model-00001-of-00002.safetensors",
255
+ "vision_encoder.model.encoder.model.visual.blocks.0.mlp.fc1.bias": "model-00001-of-00002.safetensors",
256
+ "vision_encoder.model.encoder.model.visual.blocks.0.mlp.fc1.weight": "model-00001-of-00002.safetensors",
257
+ "vision_encoder.model.encoder.model.visual.blocks.0.mlp.fc2.bias": "model-00001-of-00002.safetensors",
258
+ "vision_encoder.model.encoder.model.visual.blocks.0.mlp.fc2.weight": "model-00001-of-00002.safetensors",
259
+ "vision_encoder.model.encoder.model.visual.blocks.0.norm1.bias": "model-00001-of-00002.safetensors",
260
+ "vision_encoder.model.encoder.model.visual.blocks.0.norm1.weight": "model-00001-of-00002.safetensors",
261
+ "vision_encoder.model.encoder.model.visual.blocks.0.norm2.bias": "model-00001-of-00002.safetensors",
262
+ "vision_encoder.model.encoder.model.visual.blocks.0.norm2.weight": "model-00001-of-00002.safetensors",
263
+ "vision_encoder.model.encoder.model.visual.blocks.1.attn.proj.bias": "model-00001-of-00002.safetensors",
264
+ "vision_encoder.model.encoder.model.visual.blocks.1.attn.proj.weight": "model-00001-of-00002.safetensors",
265
+ "vision_encoder.model.encoder.model.visual.blocks.1.attn.qkv.bias": "model-00001-of-00002.safetensors",
266
+ "vision_encoder.model.encoder.model.visual.blocks.1.attn.qkv.weight": "model-00001-of-00002.safetensors",
267
+ "vision_encoder.model.encoder.model.visual.blocks.1.mlp.fc1.bias": "model-00001-of-00002.safetensors",
268
+ "vision_encoder.model.encoder.model.visual.blocks.1.mlp.fc1.weight": "model-00001-of-00002.safetensors",
269
+ "vision_encoder.model.encoder.model.visual.blocks.1.mlp.fc2.bias": "model-00001-of-00002.safetensors",
270
+ "vision_encoder.model.encoder.model.visual.blocks.1.mlp.fc2.weight": "model-00001-of-00002.safetensors",
271
+ "vision_encoder.model.encoder.model.visual.blocks.1.norm1.bias": "model-00001-of-00002.safetensors",
272
+ "vision_encoder.model.encoder.model.visual.blocks.1.norm1.weight": "model-00001-of-00002.safetensors",
273
+ "vision_encoder.model.encoder.model.visual.blocks.1.norm2.bias": "model-00001-of-00002.safetensors",
274
+ "vision_encoder.model.encoder.model.visual.blocks.1.norm2.weight": "model-00001-of-00002.safetensors",
275
+ "vision_encoder.model.encoder.model.visual.blocks.10.attn.proj.bias": "model-00001-of-00002.safetensors",
276
+ "vision_encoder.model.encoder.model.visual.blocks.10.attn.proj.weight": "model-00001-of-00002.safetensors",
277
+ "vision_encoder.model.encoder.model.visual.blocks.10.attn.qkv.bias": "model-00001-of-00002.safetensors",
278
+ "vision_encoder.model.encoder.model.visual.blocks.10.attn.qkv.weight": "model-00001-of-00002.safetensors",
279
+ "vision_encoder.model.encoder.model.visual.blocks.10.mlp.fc1.bias": "model-00001-of-00002.safetensors",
280
+ "vision_encoder.model.encoder.model.visual.blocks.10.mlp.fc1.weight": "model-00001-of-00002.safetensors",
281
+ "vision_encoder.model.encoder.model.visual.blocks.10.mlp.fc2.bias": "model-00001-of-00002.safetensors",
282
+ "vision_encoder.model.encoder.model.visual.blocks.10.mlp.fc2.weight": "model-00001-of-00002.safetensors",
283
+ "vision_encoder.model.encoder.model.visual.blocks.10.norm1.bias": "model-00001-of-00002.safetensors",
284
+ "vision_encoder.model.encoder.model.visual.blocks.10.norm1.weight": "model-00001-of-00002.safetensors",
285
+ "vision_encoder.model.encoder.model.visual.blocks.10.norm2.bias": "model-00001-of-00002.safetensors",
286
+ "vision_encoder.model.encoder.model.visual.blocks.10.norm2.weight": "model-00001-of-00002.safetensors",
287
+ "vision_encoder.model.encoder.model.visual.blocks.11.attn.proj.bias": "model-00001-of-00002.safetensors",
288
+ "vision_encoder.model.encoder.model.visual.blocks.11.attn.proj.weight": "model-00001-of-00002.safetensors",
289
+ "vision_encoder.model.encoder.model.visual.blocks.11.attn.qkv.bias": "model-00001-of-00002.safetensors",
290
+ "vision_encoder.model.encoder.model.visual.blocks.11.attn.qkv.weight": "model-00001-of-00002.safetensors",
291
+ "vision_encoder.model.encoder.model.visual.blocks.11.mlp.fc1.bias": "model-00001-of-00002.safetensors",
292
+ "vision_encoder.model.encoder.model.visual.blocks.11.mlp.fc1.weight": "model-00001-of-00002.safetensors",
293
+ "vision_encoder.model.encoder.model.visual.blocks.11.mlp.fc2.bias": "model-00001-of-00002.safetensors",
294
+ "vision_encoder.model.encoder.model.visual.blocks.11.mlp.fc2.weight": "model-00001-of-00002.safetensors",
295
+ "vision_encoder.model.encoder.model.visual.blocks.11.norm1.bias": "model-00001-of-00002.safetensors",
296
+ "vision_encoder.model.encoder.model.visual.blocks.11.norm1.weight": "model-00001-of-00002.safetensors",
297
+ "vision_encoder.model.encoder.model.visual.blocks.11.norm2.bias": "model-00001-of-00002.safetensors",
298
+ "vision_encoder.model.encoder.model.visual.blocks.11.norm2.weight": "model-00001-of-00002.safetensors",
299
+ "vision_encoder.model.encoder.model.visual.blocks.12.attn.proj.bias": "model-00001-of-00002.safetensors",
300
+ "vision_encoder.model.encoder.model.visual.blocks.12.attn.proj.weight": "model-00001-of-00002.safetensors",
301
+ "vision_encoder.model.encoder.model.visual.blocks.12.attn.qkv.bias": "model-00001-of-00002.safetensors",
302
+ "vision_encoder.model.encoder.model.visual.blocks.12.attn.qkv.weight": "model-00001-of-00002.safetensors",
303
+ "vision_encoder.model.encoder.model.visual.blocks.12.mlp.fc1.bias": "model-00001-of-00002.safetensors",
304
+ "vision_encoder.model.encoder.model.visual.blocks.12.mlp.fc1.weight": "model-00001-of-00002.safetensors",
305
+ "vision_encoder.model.encoder.model.visual.blocks.12.mlp.fc2.bias": "model-00001-of-00002.safetensors",
306
+ "vision_encoder.model.encoder.model.visual.blocks.12.mlp.fc2.weight": "model-00001-of-00002.safetensors",
307
+ "vision_encoder.model.encoder.model.visual.blocks.12.norm1.bias": "model-00001-of-00002.safetensors",
308
+ "vision_encoder.model.encoder.model.visual.blocks.12.norm1.weight": "model-00001-of-00002.safetensors",
309
+ "vision_encoder.model.encoder.model.visual.blocks.12.norm2.bias": "model-00001-of-00002.safetensors",
310
+ "vision_encoder.model.encoder.model.visual.blocks.12.norm2.weight": "model-00001-of-00002.safetensors",
311
+ "vision_encoder.model.encoder.model.visual.blocks.13.attn.proj.bias": "model-00001-of-00002.safetensors",
312
+ "vision_encoder.model.encoder.model.visual.blocks.13.attn.proj.weight": "model-00001-of-00002.safetensors",
313
+ "vision_encoder.model.encoder.model.visual.blocks.13.attn.qkv.bias": "model-00001-of-00002.safetensors",
314
+ "vision_encoder.model.encoder.model.visual.blocks.13.attn.qkv.weight": "model-00001-of-00002.safetensors",
315
+ "vision_encoder.model.encoder.model.visual.blocks.13.mlp.fc1.bias": "model-00001-of-00002.safetensors",
316
+ "vision_encoder.model.encoder.model.visual.blocks.13.mlp.fc1.weight": "model-00001-of-00002.safetensors",
317
+ "vision_encoder.model.encoder.model.visual.blocks.13.mlp.fc2.bias": "model-00001-of-00002.safetensors",
318
+ "vision_encoder.model.encoder.model.visual.blocks.13.mlp.fc2.weight": "model-00001-of-00002.safetensors",
319
+ "vision_encoder.model.encoder.model.visual.blocks.13.norm1.bias": "model-00001-of-00002.safetensors",
320
+ "vision_encoder.model.encoder.model.visual.blocks.13.norm1.weight": "model-00001-of-00002.safetensors",
321
+ "vision_encoder.model.encoder.model.visual.blocks.13.norm2.bias": "model-00001-of-00002.safetensors",
322
+ "vision_encoder.model.encoder.model.visual.blocks.13.norm2.weight": "model-00001-of-00002.safetensors",
323
+ "vision_encoder.model.encoder.model.visual.blocks.14.attn.proj.bias": "model-00001-of-00002.safetensors",
324
+ "vision_encoder.model.encoder.model.visual.blocks.14.attn.proj.weight": "model-00001-of-00002.safetensors",
325
+ "vision_encoder.model.encoder.model.visual.blocks.14.attn.qkv.bias": "model-00001-of-00002.safetensors",
326
+ "vision_encoder.model.encoder.model.visual.blocks.14.attn.qkv.weight": "model-00001-of-00002.safetensors",
327
+ "vision_encoder.model.encoder.model.visual.blocks.14.mlp.fc1.bias": "model-00001-of-00002.safetensors",
328
+ "vision_encoder.model.encoder.model.visual.blocks.14.mlp.fc1.weight": "model-00001-of-00002.safetensors",
329
+ "vision_encoder.model.encoder.model.visual.blocks.14.mlp.fc2.bias": "model-00001-of-00002.safetensors",
330
+ "vision_encoder.model.encoder.model.visual.blocks.14.mlp.fc2.weight": "model-00001-of-00002.safetensors",
331
+ "vision_encoder.model.encoder.model.visual.blocks.14.norm1.bias": "model-00001-of-00002.safetensors",
332
+ "vision_encoder.model.encoder.model.visual.blocks.14.norm1.weight": "model-00001-of-00002.safetensors",
333
+ "vision_encoder.model.encoder.model.visual.blocks.14.norm2.bias": "model-00001-of-00002.safetensors",
334
+ "vision_encoder.model.encoder.model.visual.blocks.14.norm2.weight": "model-00001-of-00002.safetensors",
335
+ "vision_encoder.model.encoder.model.visual.blocks.15.attn.proj.bias": "model-00001-of-00002.safetensors",
336
+ "vision_encoder.model.encoder.model.visual.blocks.15.attn.proj.weight": "model-00001-of-00002.safetensors",
337
+ "vision_encoder.model.encoder.model.visual.blocks.15.attn.qkv.bias": "model-00001-of-00002.safetensors",
338
+ "vision_encoder.model.encoder.model.visual.blocks.15.attn.qkv.weight": "model-00001-of-00002.safetensors",
339
+ "vision_encoder.model.encoder.model.visual.blocks.15.mlp.fc1.bias": "model-00001-of-00002.safetensors",
340
+ "vision_encoder.model.encoder.model.visual.blocks.15.mlp.fc1.weight": "model-00001-of-00002.safetensors",
341
+ "vision_encoder.model.encoder.model.visual.blocks.15.mlp.fc2.bias": "model-00001-of-00002.safetensors",
342
+ "vision_encoder.model.encoder.model.visual.blocks.15.mlp.fc2.weight": "model-00001-of-00002.safetensors",
343
+ "vision_encoder.model.encoder.model.visual.blocks.15.norm1.bias": "model-00001-of-00002.safetensors",
344
+ "vision_encoder.model.encoder.model.visual.blocks.15.norm1.weight": "model-00001-of-00002.safetensors",
345
+ "vision_encoder.model.encoder.model.visual.blocks.15.norm2.bias": "model-00001-of-00002.safetensors",
346
+ "vision_encoder.model.encoder.model.visual.blocks.15.norm2.weight": "model-00001-of-00002.safetensors",
347
+ "vision_encoder.model.encoder.model.visual.blocks.16.attn.proj.bias": "model-00001-of-00002.safetensors",
348
+ "vision_encoder.model.encoder.model.visual.blocks.16.attn.proj.weight": "model-00001-of-00002.safetensors",
349
+ "vision_encoder.model.encoder.model.visual.blocks.16.attn.qkv.bias": "model-00001-of-00002.safetensors",
350
+ "vision_encoder.model.encoder.model.visual.blocks.16.attn.qkv.weight": "model-00001-of-00002.safetensors",
351
+ "vision_encoder.model.encoder.model.visual.blocks.16.mlp.fc1.bias": "model-00001-of-00002.safetensors",
352
+ "vision_encoder.model.encoder.model.visual.blocks.16.mlp.fc1.weight": "model-00001-of-00002.safetensors",
353
+ "vision_encoder.model.encoder.model.visual.blocks.16.mlp.fc2.bias": "model-00001-of-00002.safetensors",
354
+ "vision_encoder.model.encoder.model.visual.blocks.16.mlp.fc2.weight": "model-00001-of-00002.safetensors",
355
+ "vision_encoder.model.encoder.model.visual.blocks.16.norm1.bias": "model-00001-of-00002.safetensors",
356
+ "vision_encoder.model.encoder.model.visual.blocks.16.norm1.weight": "model-00001-of-00002.safetensors",
357
+ "vision_encoder.model.encoder.model.visual.blocks.16.norm2.bias": "model-00001-of-00002.safetensors",
358
+ "vision_encoder.model.encoder.model.visual.blocks.16.norm2.weight": "model-00001-of-00002.safetensors",
359
+ "vision_encoder.model.encoder.model.visual.blocks.17.attn.proj.bias": "model-00001-of-00002.safetensors",
360
+ "vision_encoder.model.encoder.model.visual.blocks.17.attn.proj.weight": "model-00001-of-00002.safetensors",
361
+ "vision_encoder.model.encoder.model.visual.blocks.17.attn.qkv.bias": "model-00001-of-00002.safetensors",
362
+ "vision_encoder.model.encoder.model.visual.blocks.17.attn.qkv.weight": "model-00001-of-00002.safetensors",
363
+ "vision_encoder.model.encoder.model.visual.blocks.17.mlp.fc1.bias": "model-00001-of-00002.safetensors",
364
+ "vision_encoder.model.encoder.model.visual.blocks.17.mlp.fc1.weight": "model-00001-of-00002.safetensors",
365
+ "vision_encoder.model.encoder.model.visual.blocks.17.mlp.fc2.bias": "model-00001-of-00002.safetensors",
366
+ "vision_encoder.model.encoder.model.visual.blocks.17.mlp.fc2.weight": "model-00001-of-00002.safetensors",
367
+ "vision_encoder.model.encoder.model.visual.blocks.17.norm1.bias": "model-00001-of-00002.safetensors",
368
+ "vision_encoder.model.encoder.model.visual.blocks.17.norm1.weight": "model-00001-of-00002.safetensors",
369
+ "vision_encoder.model.encoder.model.visual.blocks.17.norm2.bias": "model-00001-of-00002.safetensors",
370
+ "vision_encoder.model.encoder.model.visual.blocks.17.norm2.weight": "model-00001-of-00002.safetensors",
371
+ "vision_encoder.model.encoder.model.visual.blocks.18.attn.proj.bias": "model-00001-of-00002.safetensors",
372
+ "vision_encoder.model.encoder.model.visual.blocks.18.attn.proj.weight": "model-00001-of-00002.safetensors",
373
+ "vision_encoder.model.encoder.model.visual.blocks.18.attn.qkv.bias": "model-00001-of-00002.safetensors",
374
+ "vision_encoder.model.encoder.model.visual.blocks.18.attn.qkv.weight": "model-00001-of-00002.safetensors",
375
+ "vision_encoder.model.encoder.model.visual.blocks.18.mlp.fc1.bias": "model-00001-of-00002.safetensors",
376
+ "vision_encoder.model.encoder.model.visual.blocks.18.mlp.fc1.weight": "model-00001-of-00002.safetensors",
377
+ "vision_encoder.model.encoder.model.visual.blocks.18.mlp.fc2.bias": "model-00001-of-00002.safetensors",
378
+ "vision_encoder.model.encoder.model.visual.blocks.18.mlp.fc2.weight": "model-00001-of-00002.safetensors",
379
+ "vision_encoder.model.encoder.model.visual.blocks.18.norm1.bias": "model-00001-of-00002.safetensors",
380
+ "vision_encoder.model.encoder.model.visual.blocks.18.norm1.weight": "model-00001-of-00002.safetensors",
381
+ "vision_encoder.model.encoder.model.visual.blocks.18.norm2.bias": "model-00001-of-00002.safetensors",
382
+ "vision_encoder.model.encoder.model.visual.blocks.18.norm2.weight": "model-00001-of-00002.safetensors",
383
+ "vision_encoder.model.encoder.model.visual.blocks.19.attn.proj.bias": "model-00001-of-00002.safetensors",
384
+ "vision_encoder.model.encoder.model.visual.blocks.19.attn.proj.weight": "model-00001-of-00002.safetensors",
385
+ "vision_encoder.model.encoder.model.visual.blocks.19.attn.qkv.bias": "model-00001-of-00002.safetensors",
386
+ "vision_encoder.model.encoder.model.visual.blocks.19.attn.qkv.weight": "model-00001-of-00002.safetensors",
387
+ "vision_encoder.model.encoder.model.visual.blocks.19.mlp.fc1.bias": "model-00001-of-00002.safetensors",
388
+ "vision_encoder.model.encoder.model.visual.blocks.19.mlp.fc1.weight": "model-00001-of-00002.safetensors",
389
+ "vision_encoder.model.encoder.model.visual.blocks.19.mlp.fc2.bias": "model-00001-of-00002.safetensors",
390
+ "vision_encoder.model.encoder.model.visual.blocks.19.mlp.fc2.weight": "model-00001-of-00002.safetensors",
391
+ "vision_encoder.model.encoder.model.visual.blocks.19.norm1.bias": "model-00001-of-00002.safetensors",
392
+ "vision_encoder.model.encoder.model.visual.blocks.19.norm1.weight": "model-00001-of-00002.safetensors",
393
+ "vision_encoder.model.encoder.model.visual.blocks.19.norm2.bias": "model-00001-of-00002.safetensors",
394
+ "vision_encoder.model.encoder.model.visual.blocks.19.norm2.weight": "model-00001-of-00002.safetensors",
395
+ "vision_encoder.model.encoder.model.visual.blocks.2.attn.proj.bias": "model-00001-of-00002.safetensors",
396
+ "vision_encoder.model.encoder.model.visual.blocks.2.attn.proj.weight": "model-00001-of-00002.safetensors",
397
+ "vision_encoder.model.encoder.model.visual.blocks.2.attn.qkv.bias": "model-00001-of-00002.safetensors",
398
+ "vision_encoder.model.encoder.model.visual.blocks.2.attn.qkv.weight": "model-00001-of-00002.safetensors",
399
+ "vision_encoder.model.encoder.model.visual.blocks.2.mlp.fc1.bias": "model-00001-of-00002.safetensors",
400
+ "vision_encoder.model.encoder.model.visual.blocks.2.mlp.fc1.weight": "model-00001-of-00002.safetensors",
401
+ "vision_encoder.model.encoder.model.visual.blocks.2.mlp.fc2.bias": "model-00001-of-00002.safetensors",
402
+ "vision_encoder.model.encoder.model.visual.blocks.2.mlp.fc2.weight": "model-00001-of-00002.safetensors",
403
+ "vision_encoder.model.encoder.model.visual.blocks.2.norm1.bias": "model-00001-of-00002.safetensors",
404
+ "vision_encoder.model.encoder.model.visual.blocks.2.norm1.weight": "model-00001-of-00002.safetensors",
405
+ "vision_encoder.model.encoder.model.visual.blocks.2.norm2.bias": "model-00001-of-00002.safetensors",
406
+ "vision_encoder.model.encoder.model.visual.blocks.2.norm2.weight": "model-00001-of-00002.safetensors",
407
+ "vision_encoder.model.encoder.model.visual.blocks.20.attn.proj.bias": "model-00001-of-00002.safetensors",
408
+ "vision_encoder.model.encoder.model.visual.blocks.20.attn.proj.weight": "model-00001-of-00002.safetensors",
409
+ "vision_encoder.model.encoder.model.visual.blocks.20.attn.qkv.bias": "model-00001-of-00002.safetensors",
410
+ "vision_encoder.model.encoder.model.visual.blocks.20.attn.qkv.weight": "model-00001-of-00002.safetensors",
411
+ "vision_encoder.model.encoder.model.visual.blocks.20.mlp.fc1.bias": "model-00001-of-00002.safetensors",
412
+ "vision_encoder.model.encoder.model.visual.blocks.20.mlp.fc1.weight": "model-00001-of-00002.safetensors",
413
+ "vision_encoder.model.encoder.model.visual.blocks.20.mlp.fc2.bias": "model-00001-of-00002.safetensors",
414
+ "vision_encoder.model.encoder.model.visual.blocks.20.mlp.fc2.weight": "model-00001-of-00002.safetensors",
415
+ "vision_encoder.model.encoder.model.visual.blocks.20.norm1.bias": "model-00001-of-00002.safetensors",
416
+ "vision_encoder.model.encoder.model.visual.blocks.20.norm1.weight": "model-00001-of-00002.safetensors",
417
+ "vision_encoder.model.encoder.model.visual.blocks.20.norm2.bias": "model-00001-of-00002.safetensors",
418
+ "vision_encoder.model.encoder.model.visual.blocks.20.norm2.weight": "model-00001-of-00002.safetensors",
419
+ "vision_encoder.model.encoder.model.visual.blocks.21.attn.proj.bias": "model-00001-of-00002.safetensors",
420
+ "vision_encoder.model.encoder.model.visual.blocks.21.attn.proj.weight": "model-00001-of-00002.safetensors",
421
+ "vision_encoder.model.encoder.model.visual.blocks.21.attn.qkv.bias": "model-00001-of-00002.safetensors",
422
+ "vision_encoder.model.encoder.model.visual.blocks.21.attn.qkv.weight": "model-00001-of-00002.safetensors",
423
+ "vision_encoder.model.encoder.model.visual.blocks.21.mlp.fc1.bias": "model-00001-of-00002.safetensors",
424
+ "vision_encoder.model.encoder.model.visual.blocks.21.mlp.fc1.weight": "model-00001-of-00002.safetensors",
425
+ "vision_encoder.model.encoder.model.visual.blocks.21.mlp.fc2.bias": "model-00001-of-00002.safetensors",
426
+ "vision_encoder.model.encoder.model.visual.blocks.21.mlp.fc2.weight": "model-00001-of-00002.safetensors",
427
+ "vision_encoder.model.encoder.model.visual.blocks.21.norm1.bias": "model-00001-of-00002.safetensors",
428
+ "vision_encoder.model.encoder.model.visual.blocks.21.norm1.weight": "model-00001-of-00002.safetensors",
429
+ "vision_encoder.model.encoder.model.visual.blocks.21.norm2.bias": "model-00001-of-00002.safetensors",
430
+ "vision_encoder.model.encoder.model.visual.blocks.21.norm2.weight": "model-00001-of-00002.safetensors",
431
+ "vision_encoder.model.encoder.model.visual.blocks.22.attn.proj.bias": "model-00001-of-00002.safetensors",
432
+ "vision_encoder.model.encoder.model.visual.blocks.22.attn.proj.weight": "model-00001-of-00002.safetensors",
433
+ "vision_encoder.model.encoder.model.visual.blocks.22.attn.qkv.bias": "model-00001-of-00002.safetensors",
434
+ "vision_encoder.model.encoder.model.visual.blocks.22.attn.qkv.weight": "model-00001-of-00002.safetensors",
435
+ "vision_encoder.model.encoder.model.visual.blocks.22.mlp.fc1.bias": "model-00001-of-00002.safetensors",
436
+ "vision_encoder.model.encoder.model.visual.blocks.22.mlp.fc1.weight": "model-00001-of-00002.safetensors",
437
+ "vision_encoder.model.encoder.model.visual.blocks.22.mlp.fc2.bias": "model-00001-of-00002.safetensors",
438
+ "vision_encoder.model.encoder.model.visual.blocks.22.mlp.fc2.weight": "model-00001-of-00002.safetensors",
439
+ "vision_encoder.model.encoder.model.visual.blocks.22.norm1.bias": "model-00001-of-00002.safetensors",
440
+ "vision_encoder.model.encoder.model.visual.blocks.22.norm1.weight": "model-00001-of-00002.safetensors",
441
+ "vision_encoder.model.encoder.model.visual.blocks.22.norm2.bias": "model-00001-of-00002.safetensors",
442
+ "vision_encoder.model.encoder.model.visual.blocks.22.norm2.weight": "model-00001-of-00002.safetensors",
443
+ "vision_encoder.model.encoder.model.visual.blocks.23.attn.proj.bias": "model-00001-of-00002.safetensors",
444
+ "vision_encoder.model.encoder.model.visual.blocks.23.attn.proj.weight": "model-00001-of-00002.safetensors",
445
+ "vision_encoder.model.encoder.model.visual.blocks.23.attn.qkv.bias": "model-00001-of-00002.safetensors",
446
+ "vision_encoder.model.encoder.model.visual.blocks.23.attn.qkv.weight": "model-00001-of-00002.safetensors",
447
+ "vision_encoder.model.encoder.model.visual.blocks.23.mlp.fc1.bias": "model-00001-of-00002.safetensors",
448
+ "vision_encoder.model.encoder.model.visual.blocks.23.mlp.fc1.weight": "model-00001-of-00002.safetensors",
449
+ "vision_encoder.model.encoder.model.visual.blocks.23.mlp.fc2.bias": "model-00001-of-00002.safetensors",
450
+ "vision_encoder.model.encoder.model.visual.blocks.23.mlp.fc2.weight": "model-00001-of-00002.safetensors",
451
+ "vision_encoder.model.encoder.model.visual.blocks.23.norm1.bias": "model-00001-of-00002.safetensors",
452
+ "vision_encoder.model.encoder.model.visual.blocks.23.norm1.weight": "model-00001-of-00002.safetensors",
453
+ "vision_encoder.model.encoder.model.visual.blocks.23.norm2.bias": "model-00001-of-00002.safetensors",
454
+ "vision_encoder.model.encoder.model.visual.blocks.23.norm2.weight": "model-00001-of-00002.safetensors",
455
+ "vision_encoder.model.encoder.model.visual.blocks.24.attn.proj.bias": "model-00001-of-00002.safetensors",
456
+ "vision_encoder.model.encoder.model.visual.blocks.24.attn.proj.weight": "model-00001-of-00002.safetensors",
457
+ "vision_encoder.model.encoder.model.visual.blocks.24.attn.qkv.bias": "model-00001-of-00002.safetensors",
458
+ "vision_encoder.model.encoder.model.visual.blocks.24.attn.qkv.weight": "model-00001-of-00002.safetensors",
459
+ "vision_encoder.model.encoder.model.visual.blocks.24.mlp.fc1.bias": "model-00001-of-00002.safetensors",
460
+ "vision_encoder.model.encoder.model.visual.blocks.24.mlp.fc1.weight": "model-00001-of-00002.safetensors",
461
+ "vision_encoder.model.encoder.model.visual.blocks.24.mlp.fc2.bias": "model-00001-of-00002.safetensors",
462
+ "vision_encoder.model.encoder.model.visual.blocks.24.mlp.fc2.weight": "model-00001-of-00002.safetensors",
463
+ "vision_encoder.model.encoder.model.visual.blocks.24.norm1.bias": "model-00001-of-00002.safetensors",
464
+ "vision_encoder.model.encoder.model.visual.blocks.24.norm1.weight": "model-00001-of-00002.safetensors",
465
+ "vision_encoder.model.encoder.model.visual.blocks.24.norm2.bias": "model-00001-of-00002.safetensors",
466
+ "vision_encoder.model.encoder.model.visual.blocks.24.norm2.weight": "model-00001-of-00002.safetensors",
467
+ "vision_encoder.model.encoder.model.visual.blocks.25.attn.proj.bias": "model-00001-of-00002.safetensors",
468
+ "vision_encoder.model.encoder.model.visual.blocks.25.attn.proj.weight": "model-00001-of-00002.safetensors",
469
+ "vision_encoder.model.encoder.model.visual.blocks.25.attn.qkv.bias": "model-00001-of-00002.safetensors",
470
+ "vision_encoder.model.encoder.model.visual.blocks.25.attn.qkv.weight": "model-00001-of-00002.safetensors",
471
+ "vision_encoder.model.encoder.model.visual.blocks.25.mlp.fc1.bias": "model-00001-of-00002.safetensors",
472
+ "vision_encoder.model.encoder.model.visual.blocks.25.mlp.fc1.weight": "model-00001-of-00002.safetensors",
473
+ "vision_encoder.model.encoder.model.visual.blocks.25.mlp.fc2.bias": "model-00001-of-00002.safetensors",
474
+ "vision_encoder.model.encoder.model.visual.blocks.25.mlp.fc2.weight": "model-00001-of-00002.safetensors",
475
+ "vision_encoder.model.encoder.model.visual.blocks.25.norm1.bias": "model-00001-of-00002.safetensors",
476
+ "vision_encoder.model.encoder.model.visual.blocks.25.norm1.weight": "model-00001-of-00002.safetensors",
477
+ "vision_encoder.model.encoder.model.visual.blocks.25.norm2.bias": "model-00001-of-00002.safetensors",
478
+ "vision_encoder.model.encoder.model.visual.blocks.25.norm2.weight": "model-00001-of-00002.safetensors",
479
+ "vision_encoder.model.encoder.model.visual.blocks.26.attn.proj.bias": "model-00001-of-00002.safetensors",
480
+ "vision_encoder.model.encoder.model.visual.blocks.26.attn.proj.weight": "model-00001-of-00002.safetensors",
481
+ "vision_encoder.model.encoder.model.visual.blocks.26.attn.qkv.bias": "model-00001-of-00002.safetensors",
482
+ "vision_encoder.model.encoder.model.visual.blocks.26.attn.qkv.weight": "model-00001-of-00002.safetensors",
483
+ "vision_encoder.model.encoder.model.visual.blocks.26.mlp.fc1.bias": "model-00001-of-00002.safetensors",
484
+ "vision_encoder.model.encoder.model.visual.blocks.26.mlp.fc1.weight": "model-00001-of-00002.safetensors",
485
+ "vision_encoder.model.encoder.model.visual.blocks.26.mlp.fc2.bias": "model-00001-of-00002.safetensors",
486
+ "vision_encoder.model.encoder.model.visual.blocks.26.mlp.fc2.weight": "model-00001-of-00002.safetensors",
487
+ "vision_encoder.model.encoder.model.visual.blocks.26.norm1.bias": "model-00001-of-00002.safetensors",
488
+ "vision_encoder.model.encoder.model.visual.blocks.26.norm1.weight": "model-00001-of-00002.safetensors",
489
+ "vision_encoder.model.encoder.model.visual.blocks.26.norm2.bias": "model-00001-of-00002.safetensors",
490
+ "vision_encoder.model.encoder.model.visual.blocks.26.norm2.weight": "model-00001-of-00002.safetensors",
491
+ "vision_encoder.model.encoder.model.visual.blocks.3.attn.proj.bias": "model-00001-of-00002.safetensors",
492
+ "vision_encoder.model.encoder.model.visual.blocks.3.attn.proj.weight": "model-00001-of-00002.safetensors",
493
+ "vision_encoder.model.encoder.model.visual.blocks.3.attn.qkv.bias": "model-00001-of-00002.safetensors",
494
+ "vision_encoder.model.encoder.model.visual.blocks.3.attn.qkv.weight": "model-00001-of-00002.safetensors",
495
+ "vision_encoder.model.encoder.model.visual.blocks.3.mlp.fc1.bias": "model-00001-of-00002.safetensors",
496
+ "vision_encoder.model.encoder.model.visual.blocks.3.mlp.fc1.weight": "model-00001-of-00002.safetensors",
497
+ "vision_encoder.model.encoder.model.visual.blocks.3.mlp.fc2.bias": "model-00001-of-00002.safetensors",
498
+ "vision_encoder.model.encoder.model.visual.blocks.3.mlp.fc2.weight": "model-00001-of-00002.safetensors",
499
+ "vision_encoder.model.encoder.model.visual.blocks.3.norm1.bias": "model-00001-of-00002.safetensors",
500
+ "vision_encoder.model.encoder.model.visual.blocks.3.norm1.weight": "model-00001-of-00002.safetensors",
501
+ "vision_encoder.model.encoder.model.visual.blocks.3.norm2.bias": "model-00001-of-00002.safetensors",
502
+ "vision_encoder.model.encoder.model.visual.blocks.3.norm2.weight": "model-00001-of-00002.safetensors",
503
+ "vision_encoder.model.encoder.model.visual.blocks.4.attn.proj.bias": "model-00001-of-00002.safetensors",
504
+ "vision_encoder.model.encoder.model.visual.blocks.4.attn.proj.weight": "model-00001-of-00002.safetensors",
505
+ "vision_encoder.model.encoder.model.visual.blocks.4.attn.qkv.bias": "model-00001-of-00002.safetensors",
506
+ "vision_encoder.model.encoder.model.visual.blocks.4.attn.qkv.weight": "model-00001-of-00002.safetensors",
507
+ "vision_encoder.model.encoder.model.visual.blocks.4.mlp.fc1.bias": "model-00001-of-00002.safetensors",
508
+ "vision_encoder.model.encoder.model.visual.blocks.4.mlp.fc1.weight": "model-00001-of-00002.safetensors",
509
+ "vision_encoder.model.encoder.model.visual.blocks.4.mlp.fc2.bias": "model-00001-of-00002.safetensors",
510
+ "vision_encoder.model.encoder.model.visual.blocks.4.mlp.fc2.weight": "model-00001-of-00002.safetensors",
511
+ "vision_encoder.model.encoder.model.visual.blocks.4.norm1.bias": "model-00001-of-00002.safetensors",
512
+ "vision_encoder.model.encoder.model.visual.blocks.4.norm1.weight": "model-00001-of-00002.safetensors",
513
+ "vision_encoder.model.encoder.model.visual.blocks.4.norm2.bias": "model-00001-of-00002.safetensors",
514
+ "vision_encoder.model.encoder.model.visual.blocks.4.norm2.weight": "model-00001-of-00002.safetensors",
515
+ "vision_encoder.model.encoder.model.visual.blocks.5.attn.proj.bias": "model-00001-of-00002.safetensors",
516
+ "vision_encoder.model.encoder.model.visual.blocks.5.attn.proj.weight": "model-00001-of-00002.safetensors",
517
+ "vision_encoder.model.encoder.model.visual.blocks.5.attn.qkv.bias": "model-00001-of-00002.safetensors",
518
+ "vision_encoder.model.encoder.model.visual.blocks.5.attn.qkv.weight": "model-00001-of-00002.safetensors",
519
+ "vision_encoder.model.encoder.model.visual.blocks.5.mlp.fc1.bias": "model-00001-of-00002.safetensors",
520
+ "vision_encoder.model.encoder.model.visual.blocks.5.mlp.fc1.weight": "model-00001-of-00002.safetensors",
521
+ "vision_encoder.model.encoder.model.visual.blocks.5.mlp.fc2.bias": "model-00001-of-00002.safetensors",
522
+ "vision_encoder.model.encoder.model.visual.blocks.5.mlp.fc2.weight": "model-00001-of-00002.safetensors",
523
+ "vision_encoder.model.encoder.model.visual.blocks.5.norm1.bias": "model-00001-of-00002.safetensors",
524
+ "vision_encoder.model.encoder.model.visual.blocks.5.norm1.weight": "model-00001-of-00002.safetensors",
525
+ "vision_encoder.model.encoder.model.visual.blocks.5.norm2.bias": "model-00001-of-00002.safetensors",
526
+ "vision_encoder.model.encoder.model.visual.blocks.5.norm2.weight": "model-00001-of-00002.safetensors",
527
+ "vision_encoder.model.encoder.model.visual.blocks.6.attn.proj.bias": "model-00001-of-00002.safetensors",
528
+ "vision_encoder.model.encoder.model.visual.blocks.6.attn.proj.weight": "model-00001-of-00002.safetensors",
529
+ "vision_encoder.model.encoder.model.visual.blocks.6.attn.qkv.bias": "model-00001-of-00002.safetensors",
530
+ "vision_encoder.model.encoder.model.visual.blocks.6.attn.qkv.weight": "model-00001-of-00002.safetensors",
531
+ "vision_encoder.model.encoder.model.visual.blocks.6.mlp.fc1.bias": "model-00001-of-00002.safetensors",
532
+ "vision_encoder.model.encoder.model.visual.blocks.6.mlp.fc1.weight": "model-00001-of-00002.safetensors",
533
+ "vision_encoder.model.encoder.model.visual.blocks.6.mlp.fc2.bias": "model-00001-of-00002.safetensors",
534
+ "vision_encoder.model.encoder.model.visual.blocks.6.mlp.fc2.weight": "model-00001-of-00002.safetensors",
535
+ "vision_encoder.model.encoder.model.visual.blocks.6.norm1.bias": "model-00001-of-00002.safetensors",
536
+ "vision_encoder.model.encoder.model.visual.blocks.6.norm1.weight": "model-00001-of-00002.safetensors",
537
+ "vision_encoder.model.encoder.model.visual.blocks.6.norm2.bias": "model-00001-of-00002.safetensors",
538
+ "vision_encoder.model.encoder.model.visual.blocks.6.norm2.weight": "model-00001-of-00002.safetensors",
539
+ "vision_encoder.model.encoder.model.visual.blocks.7.attn.proj.bias": "model-00001-of-00002.safetensors",
540
+ "vision_encoder.model.encoder.model.visual.blocks.7.attn.proj.weight": "model-00001-of-00002.safetensors",
541
+ "vision_encoder.model.encoder.model.visual.blocks.7.attn.qkv.bias": "model-00001-of-00002.safetensors",
542
+ "vision_encoder.model.encoder.model.visual.blocks.7.attn.qkv.weight": "model-00001-of-00002.safetensors",
543
+ "vision_encoder.model.encoder.model.visual.blocks.7.mlp.fc1.bias": "model-00001-of-00002.safetensors",
544
+ "vision_encoder.model.encoder.model.visual.blocks.7.mlp.fc1.weight": "model-00001-of-00002.safetensors",
545
+ "vision_encoder.model.encoder.model.visual.blocks.7.mlp.fc2.bias": "model-00001-of-00002.safetensors",
546
+ "vision_encoder.model.encoder.model.visual.blocks.7.mlp.fc2.weight": "model-00001-of-00002.safetensors",
547
+ "vision_encoder.model.encoder.model.visual.blocks.7.norm1.bias": "model-00001-of-00002.safetensors",
548
+ "vision_encoder.model.encoder.model.visual.blocks.7.norm1.weight": "model-00001-of-00002.safetensors",
549
+ "vision_encoder.model.encoder.model.visual.blocks.7.norm2.bias": "model-00001-of-00002.safetensors",
550
+ "vision_encoder.model.encoder.model.visual.blocks.7.norm2.weight": "model-00001-of-00002.safetensors",
551
+ "vision_encoder.model.encoder.model.visual.blocks.8.attn.proj.bias": "model-00001-of-00002.safetensors",
552
+ "vision_encoder.model.encoder.model.visual.blocks.8.attn.proj.weight": "model-00001-of-00002.safetensors",
553
+ "vision_encoder.model.encoder.model.visual.blocks.8.attn.qkv.bias": "model-00001-of-00002.safetensors",
554
+ "vision_encoder.model.encoder.model.visual.blocks.8.attn.qkv.weight": "model-00001-of-00002.safetensors",
555
+ "vision_encoder.model.encoder.model.visual.blocks.8.mlp.fc1.bias": "model-00001-of-00002.safetensors",
556
+ "vision_encoder.model.encoder.model.visual.blocks.8.mlp.fc1.weight": "model-00001-of-00002.safetensors",
557
+ "vision_encoder.model.encoder.model.visual.blocks.8.mlp.fc2.bias": "model-00001-of-00002.safetensors",
558
+ "vision_encoder.model.encoder.model.visual.blocks.8.mlp.fc2.weight": "model-00001-of-00002.safetensors",
559
+ "vision_encoder.model.encoder.model.visual.blocks.8.norm1.bias": "model-00001-of-00002.safetensors",
560
+ "vision_encoder.model.encoder.model.visual.blocks.8.norm1.weight": "model-00001-of-00002.safetensors",
561
+ "vision_encoder.model.encoder.model.visual.blocks.8.norm2.bias": "model-00001-of-00002.safetensors",
562
+ "vision_encoder.model.encoder.model.visual.blocks.8.norm2.weight": "model-00001-of-00002.safetensors",
563
+ "vision_encoder.model.encoder.model.visual.blocks.9.attn.proj.bias": "model-00001-of-00002.safetensors",
564
+ "vision_encoder.model.encoder.model.visual.blocks.9.attn.proj.weight": "model-00001-of-00002.safetensors",
565
+ "vision_encoder.model.encoder.model.visual.blocks.9.attn.qkv.bias": "model-00001-of-00002.safetensors",
566
+ "vision_encoder.model.encoder.model.visual.blocks.9.attn.qkv.weight": "model-00001-of-00002.safetensors",
567
+ "vision_encoder.model.encoder.model.visual.blocks.9.mlp.fc1.bias": "model-00001-of-00002.safetensors",
568
+ "vision_encoder.model.encoder.model.visual.blocks.9.mlp.fc1.weight": "model-00001-of-00002.safetensors",
569
+ "vision_encoder.model.encoder.model.visual.blocks.9.mlp.fc2.bias": "model-00001-of-00002.safetensors",
570
+ "vision_encoder.model.encoder.model.visual.blocks.9.mlp.fc2.weight": "model-00001-of-00002.safetensors",
571
+ "vision_encoder.model.encoder.model.visual.blocks.9.norm1.bias": "model-00001-of-00002.safetensors",
572
+ "vision_encoder.model.encoder.model.visual.blocks.9.norm1.weight": "model-00001-of-00002.safetensors",
573
+ "vision_encoder.model.encoder.model.visual.blocks.9.norm2.bias": "model-00001-of-00002.safetensors",
574
+ "vision_encoder.model.encoder.model.visual.blocks.9.norm2.weight": "model-00001-of-00002.safetensors",
575
+ "vision_encoder.model.encoder.model.visual.norm.bias": "model-00001-of-00002.safetensors",
576
+ "vision_encoder.model.encoder.model.visual.norm.weight": "model-00001-of-00002.safetensors",
577
+ "vision_encoder.model.encoder.model.visual.patch_embed.linear.bias": "model-00001-of-00002.safetensors",
578
+ "vision_encoder.model.encoder.model.visual.patch_embed.linear.weight": "model-00001-of-00002.safetensors",
579
+ "vision_encoder.model.encoder.model.visual.pos_embed": "model-00001-of-00002.safetensors",
580
+ "vision_encoder.model.projection.ln.bias": "model-00001-of-00002.safetensors",
581
+ "vision_encoder.model.projection.ln.weight": "model-00001-of-00002.safetensors",
582
+ "vision_encoder.model.projection.mlp1.fc1.bias": "model-00001-of-00002.safetensors",
583
+ "vision_encoder.model.projection.mlp1.fc1.weight": "model-00001-of-00002.safetensors",
584
+ "vision_encoder.model.projection.mlp1.fc2.bias": "model-00001-of-00002.safetensors",
585
+ "vision_encoder.model.projection.mlp1.fc2.weight": "model-00001-of-00002.safetensors",
586
+ "vision_encoder.model.projection.mlp2.fc1.bias": "model-00001-of-00002.safetensors",
587
+ "vision_encoder.model.projection.mlp2.fc1.weight": "model-00001-of-00002.safetensors",
588
+ "vision_encoder.model.projection.mlp2.fc2.bias": "model-00001-of-00002.safetensors",
589
+ "vision_encoder.model.projection.mlp2.fc2.weight": "model-00001-of-00002.safetensors"
590
+ }
591
+ }
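
Note: the `weight_map` above is just a lookup table from parameter name to shard file; every vision-encoder tensor listed here resolves to `model-00001-of-00002.safetensors`. A minimal sketch of resolving and loading one tensor from the sharded checkpoint, assuming the repository files have been downloaded locally (`local_dir` below is a placeholder path, not something defined by the repo):

```python
import json
from safetensors import safe_open

local_dir = "./moondream1"  # placeholder: wherever the repo files were downloaded

# The index maps each tensor name to the shard that contains it.
with open(f"{local_dir}/model.safetensors.index.json") as f:
    weight_map = json.load(f)["weight_map"]

name = "vision_encoder.model.projection.mlp1.fc1.weight"
shard = weight_map[name]  # e.g. "model-00001-of-00002.safetensors"

# Open only that shard and read the single tensor lazily.
with safe_open(f"{local_dir}/{shard}", framework="pt") as shard_file:
    tensor = shard_file.get_tensor(name)

print(name, tuple(tensor.shape))
```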
modeling_phi.py ADDED
@@ -0,0 +1,720 @@
1
+ # Copyright (c) Microsoft Corporation.
2
+ # Licensed under the MIT license.
3
+ #
4
+ # Copyright (c) 2022, Tri Dao, [email protected].
5
+ # Licensed under the BSD 3-Clause License.
6
+
7
+ from dataclasses import dataclass, field
8
+ from typing import Any, Dict, Optional, Union, Tuple
9
+
10
+ import math
11
+ import torch
12
+ import torch.nn as nn
13
+ from einops import rearrange, repeat
14
+ from transformers import PretrainedConfig, PreTrainedModel
15
+ from transformers.activations import ACT2FN
16
+ from transformers.modeling_outputs import CausalLMOutputWithPast
17
+
18
+ from .configuration_moondream import PhiConfig
19
+
20
+ FusedDense = None
21
+
22
+
23
+ @dataclass
24
+ class InferenceParams:
25
+ max_seqlen: int
26
+ max_batch_size: int
27
+ seqlen_offset: int = 0
28
+ batch_size_offset: int = 0
29
+ key_value_memory_dict: Dict[str, Any] = field(default_factory=dict)
30
+ lengths_per_sample: torch.Tensor = None
31
+
32
+
33
+ class Embedding(nn.Module):
34
+ def __init__(self, config: PretrainedConfig):
35
+ super().__init__()
36
+ self.wte = nn.Embedding(config.vocab_size, config.n_embd)
37
+ self.drop = nn.Dropout(config.embd_pdrop)
38
+
39
+ def forward(self, input_ids: torch.LongTensor) -> torch.FloatTensor:
40
+ return self.drop(self.wte(input_ids.view(-1, input_ids.size(-1))))
41
+
42
+
43
+ def _apply_rotary_emb(x, cos, sin):
44
+ seqlen, rotary_dim = x.size(1), cos.size(1) * 2
45
+ x_rot, x_pass = x[..., :rotary_dim], x[..., rotary_dim:]
46
+ x1, x2 = x_rot.chunk(2, dim=-1)
47
+ c, s = cos[:seqlen].unsqueeze(1), sin[:seqlen].unsqueeze(1)
48
+ x_rot = torch.cat([x1 * c - x2 * s, x1 * s + x2 * c], dim=-1)
49
+ return torch.cat([x_rot.to(x.dtype), x_pass], dim=-1)
50
+
51
+
52
+ def _apply_rotary_emb_kv(
53
+ kv: torch.FloatTensor, cos: torch.FloatTensor, sin: torch.FloatTensor
54
+ ) -> torch.FloatTensor:
55
+ seqlen, rotary_dim = kv.shape[1], cos.shape[-1] * 2
56
+ k_rot = kv[:, :, 0, :, :rotary_dim].chunk(2, dim=-1)
57
+ k_pass = kv[:, :, 0, :, rotary_dim:]
58
+ c, s = cos[:seqlen].unsqueeze(1), sin[:seqlen].unsqueeze(1)
59
+ k_rot = torch.cat(
60
+ [k_rot[0] * c - k_rot[1] * s, k_rot[0] * s + k_rot[1] * c], dim=-1
61
+ )
62
+ return torch.cat(
63
+ [torch.cat([k_rot, k_pass], dim=-1).unsqueeze(2), kv[:, :, 1:2, :, :]], dim=2
64
+ )
65
+
66
+
67
+ def _apply_rotary_emb_qkv(
68
+ qkv: torch.FloatTensor, cos: torch.FloatTensor, sin: torch.FloatTensor
69
+ ) -> torch.FloatTensor:
70
+ seqlen, rotary_dim = qkv.shape[1], cos.shape[1] * 2
71
+
72
+ c = cos[:seqlen].unsqueeze(1)
73
+ s = sin[:seqlen].unsqueeze(1)
74
+
75
+ qkv_rot = torch.stack(
76
+ [
77
+ torch.cat(
78
+ [
79
+ qkv[:, :, i, :, : rotary_dim // 2] * c
80
+ - qkv[:, :, i, :, rotary_dim // 2 : rotary_dim] * s,
81
+ qkv[:, :, i, :, : rotary_dim // 2] * s
82
+ + qkv[:, :, i, :, rotary_dim // 2 : rotary_dim] * c,
83
+ ],
84
+ dim=-1,
85
+ ).to(qkv.dtype)
86
+ for i in range(2)
87
+ ],
88
+ dim=2,
89
+ )
90
+
91
+ qkv_pass = qkv[:, :, :2, :, rotary_dim:].unsqueeze(2)
92
+ qkv_v = qkv[:, :, 2:3, :, :]
93
+ return torch.cat([qkv_rot, qkv_pass, qkv_v], dim=2)
94
+
95
+
96
+ class RotaryEmbedding(nn.Module):
97
+ # Enhanced Transformer with Rotary Position Embedding (https://arxiv.org/pdf/2104.09864.pdf)
98
+ def __init__(
99
+ self,
100
+ dim: int,
101
+ base: int = 10000,
102
+ scale_base: Optional[float] = None,
103
+ pos_idx_in_fp32: bool = True,
104
+ max_position_embeddings: int = 2048,
105
+ device: Optional[str] = None,
106
+ ) -> None:
107
+ super().__init__()
108
+ # fp32 is preferred since the output of `torch.arange` can be quite large and bf16 would lose a lot of precision
109
+ self.dim, self.base, self.pos_idx_in_fp32, self.device = (
110
+ dim,
111
+ float(base),
112
+ pos_idx_in_fp32,
113
+ device,
114
+ )
115
+ self.max_position_embeddings = max_position_embeddings
116
+ if scale_base is not None:
117
+ raise NotImplementedError
118
+
119
+ # Generate and register the non-trainable buffers
120
+ self.register_buffer(
121
+ "inv_freq", self._compute_inv_freq(device), persistent=False
122
+ )
123
+ self.register_buffer(
124
+ "scale", self._calculate_scale(dim, scale_base, device), persistent=False
125
+ )
126
+ self._update_cos_sin_cache(
127
+ max_position_embeddings, device=device, dtype=torch.float32
128
+ )
129
+
130
+ def _calculate_scale(self, dim, scale_base, device):
131
+ return (
132
+ (
133
+ (
134
+ torch.arange(0, dim, 2, device=device, dtype=torch.float32)
135
+ + 0.4 * dim
136
+ )
137
+ / (1.4 * dim)
138
+ )
139
+ if scale_base is not None
140
+ else None
141
+ )
142
+
143
+ def _compute_inv_freq(self, device: Optional[str] = None) -> torch.FloatTensor:
144
+ return 1.0 / (
145
+ self.base
146
+ ** (
147
+ torch.arange(0, self.dim, 2, device=device, dtype=torch.float32)
148
+ / self.dim
149
+ )
150
+ )
151
+
152
+ def _update_cos_sin_cache(
153
+ self,
154
+ seqlen: int,
155
+ device: Optional[str] = None,
156
+ dtype: Optional[torch.dtype] = None,
157
+ ) -> None:
158
+ self._seq_len_cached = seqlen
159
+ t = torch.arange(
160
+ seqlen,
161
+ device=device,
162
+ dtype=torch.float32 if self.pos_idx_in_fp32 else self.inv_freq.dtype,
163
+ )
164
+ inv_freq = (
165
+ self._compute_inv_freq(device=device)
166
+ if self.pos_idx_in_fp32 and self.inv_freq.dtype != torch.float32
167
+ else self.inv_freq
168
+ )
169
+
170
+ freqs = torch.outer(t, inv_freq)
171
+
172
+ def apply_scale(freqs, scale, operator, dtype):
173
+ result = operator(freqs)
174
+ return (result / scale).to(dtype) if scale is not None else result.to(dtype)
175
+
176
+ if scale := self.scale:
177
+ power = (
178
+ torch.arange(seqlen, dtype=scale.dtype, device=scale.device)
179
+ - seqlen // 2
180
+ ) / self.scale_base
181
+ scale = scale.to(device=power.device) ** power.unsqueeze(1)
182
+
183
+ self._cos_cached = apply_scale(
184
+ freqs, 1 / scale if scale is not None else None, torch.cos, dtype
185
+ )
186
+ self._sin_cached = apply_scale(
187
+ freqs, 1 / scale if scale is not None else None, torch.sin, dtype
188
+ )
189
+ if scale is not None:
190
+ self._cos_k_cached = apply_scale(freqs, scale, torch.cos, dtype)
191
+ self._sin_k_cached = apply_scale(freqs, scale, torch.sin, dtype)
192
+
193
+ def forward(
194
+ self,
195
+ qkv: torch.Tensor,
196
+ kv: Optional[torch.Tensor] = None,
197
+ seqlen_offset: int = 0,
198
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
199
+ should_update = (
200
+ self._seq_len_cached < qkv.shape[1] + seqlen_offset
201
+ or self._cos_cached.device != qkv.device
202
+ or self._cos_cached.dtype != qkv.dtype
203
+ or (self.training and self._cos_cached.is_inference())
204
+ )
205
+
206
+ if should_update:
207
+ self._update_cos_sin_cache(
208
+ qkv.shape[1] + seqlen_offset, device=qkv.device, dtype=qkv.dtype
209
+ )
210
+
211
+ offset_cos = self._cos_cached[seqlen_offset:]
212
+ offset_sin = self._sin_cached[seqlen_offset:]
213
+
214
+ if kv is None:
215
+ return _apply_rotary_emb_qkv(qkv, offset_cos, offset_sin)
216
+ else:
217
+ return _apply_rotary_emb(qkv, offset_cos, offset_sin), _apply_rotary_emb_kv(
218
+ kv, offset_cos, offset_sin
219
+ )
220
+
221
+
222
+ class MLP(nn.Module):
223
+ def __init__(
224
+ self,
225
+ config: PretrainedConfig,
226
+ n_inner: Optional[int] = None,
227
+ act_fn: Optional[str] = None,
228
+ ) -> None:
229
+ super().__init__()
230
+ n_inner = n_inner or getattr(config, "n_inner", None) or 4 * config.n_embd
231
+ act_fn = act_fn or config.activation_function
232
+
233
+ self.fc1 = nn.Linear(config.n_embd, n_inner)
234
+ self.fc2 = nn.Linear(n_inner, config.n_embd)
235
+ self.act = ACT2FN[act_fn]
236
+
237
+ def forward(self, hidden_states: torch.FloatTensor) -> torch.FloatTensor:
238
+ return self.fc2(self.act(self.fc1(hidden_states)))
239
+
240
+
241
+ # Flash Attention (https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/modules/mha.py)
242
+ class SelfAttention(nn.Module):
243
+ def __init__(
244
+ self,
245
+ causal: bool = True,
246
+ softmax_scale: Optional[float] = None,
247
+ attention_dropout: float = 0.0,
248
+ ):
249
+ super().__init__()
250
+ self.causal = causal
251
+ self.softmax_scale = softmax_scale
252
+ self.drop = nn.Dropout(attention_dropout)
253
+
254
+ @torch.autocast("cpu", enabled=False)
255
+ @torch.autocast("cuda", enabled=False)
256
+ def forward(
257
+ self,
258
+ qkv: torch.FloatTensor,
259
+ causal: Optional[bool] = None,
260
+ key_padding_mask: Optional[torch.BoolTensor] = None,
261
+ ):
262
+ q, k, v = qkv.chunk(3, dim=-1)
263
+ scale = self.softmax_scale or 1.0 / q.size(-1) ** 0.5
264
+
265
+ scores = (
266
+ torch.einsum("bthd,bshd->bhts", q.to(torch.float32), k.to(torch.float32))
267
+ * scale
268
+ )
269
+ if causal or self.causal:
270
+ scores.triu_(1).fill_(-10000.0)
271
+ if key_padding_mask is not None:
272
+ scores.masked_fill_(key_padding_mask[:, None, None, :], -10000.0)
273
+
274
+ attn = self.drop(torch.softmax(scores, dim=-1).to(v.dtype))
275
+ return torch.einsum("bhts,bshd->bthd", attn, v)
276
+
277
+
278
+ # Flash Attention (https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/modules/mha.py)
279
+ class CrossAttention(nn.Module):
280
+ def __init__(self, causal=True, softmax_scale=None, attention_dropout=0.0):
281
+ super().__init__()
282
+ self.causal = causal
283
+ self.softmax_scale = softmax_scale
284
+ self.drop = nn.Dropout(attention_dropout)
285
+
286
+ @torch.autocast("cpu", enabled=False)
287
+ @torch.autocast("cuda", enabled=False)
288
+ def forward(
289
+ self,
290
+ q: torch.FloatTensor,
291
+ kv: torch.FloatTensor,
292
+ causal: bool = None,
293
+ key_padding_mask: Optional[torch.BoolTensor] = None,
294
+ ) -> torch.FloatTensor:
295
+ batch_size, seqlen_q = q.shape[0], q.shape[1]
296
+ seqlen_k = kv.shape[1]
297
+
298
+ if kv.shape[3] != q.shape[2]:
299
+ kv = repeat(kv, "... hkv d -> ... (hkv g) d", g=q.shape[2] // kv.shape[3])
300
+ k, v = kv.unbind(dim=2)
301
+
302
+ q = q.to(torch.float32)
303
+ k = k.to(torch.float32)
304
+
305
+ causal = self.causal if causal is None else causal
306
+ softmax_scale = self.softmax_scale or 1.0 / math.sqrt(q.shape[-1])
307
+
308
+ # Autocast is manually disabled to avoid `torch.einsum` performing the operation using float16, which might lead to overflow
309
+ scores = torch.einsum("bthd,bshd->bhts", q, k * softmax_scale)
310
+
311
+ if key_padding_mask is not None:
312
+ padding_mask = torch.full(
313
+ (batch_size, seqlen_k),
314
+ -10000.0,
315
+ dtype=scores.dtype,
316
+ device=scores.device,
317
+ )
318
+ padding_mask.masked_fill_(key_padding_mask, 0.0)
319
+ scores = scores + rearrange(padding_mask, "b s -> b 1 1 s")
320
+
321
+ if causal:
322
+ rows = rearrange(
323
+ torch.arange(seqlen_q, device=q.device, dtype=torch.long), "s -> s 1"
324
+ )
325
+ cols = torch.arange(seqlen_k, device=k.device, dtype=torch.long)
326
+ causal_mask = cols > rows + seqlen_k - seqlen_q
327
+ scores = scores.masked_fill(causal_mask, -10000.0)
328
+
329
+ attention = torch.softmax(scores, dim=-1).to(v.dtype)
330
+ attention = self.drop(attention)
331
+ output = torch.einsum("bhts,bshd->bthd", attention, v)
332
+
333
+ return output
334
+
335
+
336
+ def _find_mha_dims(
337
+ config: PretrainedConfig,
338
+ n_head: Optional[int] = None,
339
+ n_head_kv: Optional[int] = None,
340
+ head_dim: Optional[int] = None,
341
+ ) -> Tuple[int, int]:
342
+ if n_head is None and head_dim is None:
343
+ head_dim = config.n_embd // config.n_head
344
+ n_head = config.n_head
345
+ elif n_head is None or head_dim is None:
346
+ raise ValueError("`n_head` and `head_dim` must be both specified or `None`.")
347
+ if n_head_kv is None:
348
+ n_head_kv = getattr(config, "n_head_kv", None) or n_head
349
+ return n_head, n_head_kv, head_dim
350
+
351
+
352
+ def _update_kv_cache(
353
+ kv: torch.FloatTensor, inference_params: InferenceParams, layer_idx: int
354
+ ) -> torch.FloatTensor:
355
+ num_heads, head_dim = kv.shape[-2:]
356
+ layer_memory = inference_params.key_value_memory_dict.setdefault(
357
+ layer_idx,
358
+ torch.empty(
359
+ inference_params.max_batch_size,
360
+ inference_params.max_seqlen,
361
+ 2,
362
+ num_heads,
363
+ head_dim,
364
+ dtype=kv.dtype,
365
+ device=kv.device,
366
+ ),
367
+ )
368
+
369
+ batch_slice = slice(
370
+ inference_params.batch_size_offset,
371
+ inference_params.batch_size_offset + kv.shape[0],
372
+ )
373
+ seqlen_slice = slice(
374
+ inference_params.seqlen_offset, inference_params.seqlen_offset + kv.shape[1]
375
+ )
376
+
377
+ if seqlen_slice.stop >= inference_params.max_seqlen:
378
+ layer_memory = torch.cat((layer_memory, kv), dim=1)
379
+ inference_params.key_value_memory_dict[layer_idx] = layer_memory
380
+
381
+ layer_memory[batch_slice, seqlen_slice, ...] = kv
382
+ return layer_memory[batch_slice, : seqlen_slice.stop, ...]
383
+
384
+
385
+ # Multi-head attention layer with rotary embeddings
386
+ class MHA(nn.Module):
387
+ def __init__(
388
+ self,
389
+ config,
390
+ dtype=None,
391
+ device=None,
392
+ rotary_dim=None,
393
+ rotary_base=10000.0,
394
+ rotary_scale_base=None,
395
+ n_head=None,
396
+ n_head_kv=None,
397
+ head_dim=None,
398
+ bias=True,
399
+ causal=True,
400
+ softmax_scale=None,
401
+ layer_idx=None,
402
+ return_residual=False,
403
+ checkpointing=False,
404
+ ):
405
+ super().__init__()
406
+
407
+ # Set rotary embedding if specified
408
+ self.rotary_dim = rotary_dim or getattr(config, "rotary_dim", 0)
409
+ if self.rotary_dim:
410
+ self.rotary_emb = RotaryEmbedding(
411
+ self.rotary_dim,
412
+ base=rotary_base,
413
+ scale_base=rotary_scale_base,
414
+ device=device,
415
+ max_position_embeddings=config.n_positions,
416
+ )
417
+
418
+ # Determine MHA dims from arguments or config
419
+ self.n_head, self.n_head_kv, self.head_dim = _find_mha_dims(
420
+ config, n_head, n_head_kv, head_dim
421
+ )
422
+ op_size = self.head_dim * (self.n_head + 2 * self.n_head_kv)
423
+ hidden_size = config.n_embd
424
+
425
+ # Choose Linear class based on config, FusedDense is optional
426
+ LinearClass = (
427
+ FusedDense if config.fused_dense and FusedDense is not None else nn.Linear
428
+ )
429
+ self.Wqkv = LinearClass(
430
+ hidden_size, op_size, bias=bias, device=device, dtype=dtype
431
+ )
432
+ self.out_proj = LinearClass(
433
+ hidden_size, hidden_size, bias=bias, device=device, dtype=dtype
434
+ )
435
+
436
+ # Initialize attention mechanisms
437
+ attn_kwargs = {
438
+ "causal": causal,
439
+ "softmax_scale": softmax_scale,
440
+ "attention_dropout": config.attn_pdrop,
441
+ }
442
+ self.inner_attn = SelfAttention(**attn_kwargs)
443
+ self.inner_cross_attn = CrossAttention(**attn_kwargs)
444
+
445
+ self.layer_idx = layer_idx
446
+ self.return_residual = return_residual
447
+ self.checkpointing = checkpointing
448
+
449
+ def _forward_self_attn(
450
+ self, x: torch.FloatTensor, key_padding_mask: Optional[torch.BoolTensor]
451
+ ) -> torch.FloatTensor:
452
+ qkv = rearrange(
453
+ self.Wqkv(x), "... (three h d) -> ... three h d", three=3, d=self.head_dim
454
+ )
455
+ if self.rotary_dim > 0:
456
+ qkv = self.rotary_emb(qkv)
457
+ attn_func = (
458
+ torch.utils.checkpoint.checkpoint
459
+ if self.checkpointing
460
+ else lambda f, *args, **kwargs: f(*args, **kwargs)
461
+ )
462
+ return attn_func(self.inner_attn, qkv, key_padding_mask=key_padding_mask)
463
+
464
+ def _forward_cross_attn(
465
+ self,
466
+ x: torch.FloatTensor,
467
+ past_key_values: Optional[InferenceParams],
468
+ key_padding_mask: Optional[torch.BoolTensor],
469
+ ) -> torch.FloatTensor:
470
+ qkv = self.Wqkv(x)
471
+ q, kv = (
472
+ qkv[..., : self.n_head * self.head_dim],
473
+ qkv[..., self.n_head * self.head_dim :],
474
+ )
475
+ q = rearrange(q, "... (h d) -> ... h d", d=self.head_dim)
476
+ kv = rearrange(kv, "... (two hkv d) -> ... two hkv d", two=2, d=self.head_dim)
477
+
478
+ seqlen_offset = (
479
+ past_key_values.seqlen_offset if past_key_values is not None else 0
480
+ )
481
+ causal = None if seqlen_offset == 0 else False
482
+ if self.rotary_dim > 0:
483
+ q, kv = self.rotary_emb(q, kv=kv, seqlen_offset=seqlen_offset)
484
+
485
+ if past_key_values is not None:
486
+ kv = _update_kv_cache(kv, past_key_values, self.layer_idx)
487
+
488
+ attn_func = (
489
+ torch.utils.checkpoint.checkpoint
490
+ if self.checkpointing
491
+ else lambda fn, *args, **kwargs: fn(*args, **kwargs)
492
+ )
493
+
494
+ return attn_func(
495
+ self.inner_cross_attn,
496
+ q,
497
+ kv,
498
+ key_padding_mask=key_padding_mask,
499
+ causal=causal,
500
+ )
501
+
502
+ def forward(
503
+ self,
504
+ x: torch.FloatTensor,
505
+ past_key_values: Optional[InferenceParams] = None,
506
+ attention_mask: Optional[Union[torch.LongTensor, torch.BoolTensor]] = None,
507
+ ) -> Tuple[torch.FloatTensor, torch.FloatTensor]:
508
+ attention_mask = attention_mask.bool() if attention_mask is not None else None
509
+ use_cross_attn = self.n_head != self.n_head_kv or past_key_values is not None
510
+ attn_output_function = (
511
+ self._forward_cross_attn if use_cross_attn else self._forward_self_attn
512
+ )
513
+ attn_output = (
514
+ attn_output_function(x, past_key_values, attention_mask)
515
+ if use_cross_attn
516
+ else attn_output_function(x, attention_mask)
517
+ )
518
+ output = self.out_proj(rearrange(attn_output, "... h d -> ... (h d)"))
519
+ return (output, x) if self.return_residual else output
520
+
521
+
522
+ # Parallel block. This block applies parallel mixer and MLP layers to the input (used in GPT-J and CodeGen).
523
+ class ParallelBlock(nn.Module):
524
+ def __init__(self, config: PretrainedConfig, block_idx: Optional[int] = None):
525
+ super().__init__()
526
+ self.ln = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon)
527
+ self.resid_dropout = nn.Dropout(config.resid_pdrop)
528
+ self.block_idx = block_idx
529
+ self.mixer = MHA(config, layer_idx=block_idx)
530
+ self.mlp = MLP(config)
531
+
532
+ def forward(
533
+ self,
534
+ hidden_states: torch.FloatTensor,
535
+ past_key_values: Optional[Union[torch.FloatTensor, InferenceParams]] = None,
536
+ attention_mask: Optional[torch.BoolTensor] = None,
537
+ ) -> torch.FloatTensor:
538
+ residual = hidden_states
539
+ hidden_states = self.ln(hidden_states)
540
+
541
+ attn_outputs = self.mixer(
542
+ hidden_states,
543
+ past_key_values=past_key_values,
544
+ attention_mask=attention_mask,
545
+ )
546
+ if isinstance(attn_outputs, tuple):
547
+ attn_outputs = attn_outputs[0]
548
+
549
+ attn_outputs = self.resid_dropout(attn_outputs)
550
+ feed_forward_hidden_states = self.resid_dropout(self.mlp(hidden_states))
551
+ return attn_outputs + feed_forward_hidden_states + residual
552
+
553
+
554
+ class CausalLMHead(nn.Module):
555
+ """Causal Language Modeling head. Simplified version."""
556
+
557
+ def __init__(self, config):
558
+ super().__init__()
559
+ self.ln = nn.LayerNorm(config.n_embd, eps=config.layer_norm_epsilon)
560
+ self.linear = nn.Linear(config.n_embd, config.vocab_size)
561
+
562
+ def forward(self, hidden_states):
563
+ return self.linear(self.ln(hidden_states)).to(torch.float32)
564
+
565
+
566
+ # Improving Language Understanding by Generative Pre-Training
567
+ # (https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf)
568
+ class CausalLMLoss(nn.Module):
569
+ def __init__(self, shift_labels: bool = True) -> None:
570
+ super().__init__()
571
+ self.shift_labels = shift_labels
572
+ self.loss_fct = nn.CrossEntropyLoss()
573
+
574
+ def forward(
575
+ self, logits: torch.FloatTensor, labels: torch.LongTensor
576
+ ) -> torch.FloatTensor:
577
+ if self.shift_labels:
578
+ logits, labels = logits[..., :-1, :], labels[..., 1:]
579
+ return self.loss_fct(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
580
+
581
+
582
+ class PhiPreTrainedModel(PreTrainedModel):
583
+ config_class = PhiConfig
584
+ base_model_prefix = "transformer"
585
+ supports_gradient_checkpointing = False
586
+ _no_split_modules = ["ParallelBlock"]
587
+
588
+ def __init__(self, *inputs, **kwargs) -> None:
589
+ super().__init__(*inputs, **kwargs)
590
+
591
+ def prepare_inputs_for_generation(
592
+ self,
593
+ input_ids: torch.LongTensor = None,
594
+ inputs_embeds: torch.FloatTensor = None,
595
+ past_key_values: Optional[Union[torch.FloatTensor, InferenceParams]] = None,
596
+ attention_mask: Optional[Union[torch.LongTensor, torch.BoolTensor]] = None,
597
+ **kwargs,
598
+ ) -> Dict[str, Any]:
599
+ if input_ids is None and inputs_embeds is None:
600
+ raise ValueError(
601
+ "You have to specify either `input_ids` or `inputs_embeds`."
602
+ )
603
+
604
+ max_batch_size = (
605
+ inputs_embeds.shape[0] if inputs_embeds is not None else input_ids.shape[0]
606
+ )
607
+ seqlen_offset = (
608
+ inputs_embeds.shape[1] + input_ids.shape[1] - 2
609
+ if inputs_embeds is not None
610
+ else input_ids.shape[1] - 1
611
+ )
612
+
613
+ args = (
614
+ {"inputs_embeds": inputs_embeds}
615
+ if inputs_embeds is not None
616
+ else {"input_ids": input_ids}
617
+ )
618
+
619
+ if not isinstance(past_key_values, InferenceParams):
620
+ past_key_values = InferenceParams(
621
+ max_seqlen=self.config.n_positions,
622
+ max_batch_size=max_batch_size,
623
+ seqlen_offset=0,
624
+ batch_size_offset=0,
625
+ key_value_memory_dict={},
626
+ lengths_per_sample=None,
627
+ )
628
+ else:
629
+ past_key_values.seqlen_offset = seqlen_offset
630
+ args = {"input_ids": input_ids[:, -1].unsqueeze(-1)}
631
+
632
+ return {
633
+ **args,
634
+ "past_key_values": past_key_values,
635
+ "attention_mask": attention_mask,
636
+ }
637
+
638
+
639
+ class PhiModel(PhiPreTrainedModel):
640
+ _keys_to_ignore_on_load_missing = [""]
641
+ _keys_to_ignore_on_load_unexpected = [r"h\.\d+\.mlp.(fc_in|fc_out)\.(weight|bias)"]
642
+
643
+ def __init__(self, config: PhiConfig) -> None:
644
+ super().__init__(config)
645
+ self.embd = Embedding(config)
646
+ self.h = nn.ModuleList(
647
+ [ParallelBlock(config, block_idx=i) for i in range(config.n_layer)]
648
+ )
649
+ self.gradient_checkpointing = config.gradient_checkpointing
650
+ self.post_init()
651
+
652
+ def get_input_embeddings(self) -> nn.Embedding:
653
+ return self.embd.wte
654
+
655
+ def set_input_embeddings(self, new_embeddings: nn.Embedding) -> None:
656
+ self.embd.wte = new_embeddings
657
+
658
+ def forward(
659
+ self,
660
+ input_ids: torch.LongTensor = None,
661
+ inputs_embeds: torch.FloatTensor = None,
662
+ past_key_values: Optional[Union[torch.FloatTensor, InferenceParams]] = None,
663
+ attention_mask: Optional[torch.BoolTensor] = None,
664
+ ) -> torch.FloatTensor:
665
+ if (input_ids is None) == (inputs_embeds is None):
666
+ raise ValueError("Specify exactly one of `input_ids` or `inputs_embeds`.")
667
+ hidden_states = self.embd(input_ids) if input_ids is not None else inputs_embeds
668
+
669
+ for layer in self.h:
670
+ func = layer.__call__ if self.gradient_checkpointing else layer
671
+ args = (hidden_states, past_key_values, attention_mask)
672
+ hidden_states = (
673
+ torch.utils.checkpoint.checkpoint(func, *args, use_reentrant=True)
674
+ if self.gradient_checkpointing
675
+ else func(*args)
676
+ )
677
+
678
+ return hidden_states
679
+
680
+
681
+ class PhiForCausalLM(PhiPreTrainedModel):
682
+ _keys_to_ignore_on_load_missing, _keys_to_ignore_on_load_unexpected = (
683
+ [""],
684
+ [r"transformer\.h\.\d+\.mlp.(fc_in|fc_out)\.(weight|bias)"],
685
+ )
686
+
687
+ def __init__(self, config: PhiConfig) -> None:
688
+ super().__init__(config)
689
+ self.transformer = PhiModel(config)
690
+ self.lm_head = CausalLMHead(config)
691
+ self.loss = CausalLMLoss()
692
+ self.post_init()
693
+
694
+ def get_output_embeddings(self) -> nn.Linear:
695
+ return self.lm_head.linear
696
+
697
+ def set_output_embeddings(self, new_embeddings: nn.Linear) -> None:
698
+ self.lm_head.linear = new_embeddings
699
+
700
+ def forward(
701
+ self,
702
+ input_ids: torch.LongTensor = None,
703
+ inputs_embeds: torch.FloatTensor = None,
704
+ past_key_values: Optional[Union[torch.FloatTensor, InferenceParams]] = None,
705
+ attention_mask: Optional[torch.BoolTensor] = None,
706
+ labels: Optional[torch.LongTensor] = None,
707
+ **kwargs,
708
+ ) -> CausalLMOutputWithPast:
709
+ hidden_states = self.transformer(
710
+ input_ids=input_ids,
711
+ inputs_embeds=inputs_embeds,
712
+ past_key_values=past_key_values,
713
+ attention_mask=attention_mask,
714
+ )
715
+ lm_logits = self.lm_head(hidden_states)
716
+ loss = self.loss(lm_logits, labels) if labels is not None else None
717
+
718
+ return CausalLMOutputWithPast(
719
+ loss=loss, logits=lm_logits, past_key_values=past_key_values
720
+ )
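
Note: `ParallelBlock` above uses the GPT-J-style parallel residual layout: the attention mixer and the MLP both read the same LayerNorm output, and their results are summed with the untouched residual stream. A toy sketch of that dataflow with stand-in modules (not the repo's `MHA`/`MLP` classes; dropout omitted):

```python
import torch
import torch.nn as nn

d = 16
ln = nn.LayerNorm(d)
mixer = nn.Linear(d, d)                                          # stand-in for MHA
mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

x = torch.randn(2, 5, d)       # (batch, seq, hidden)
h = ln(x)                      # both branches read the same normalized input
out = mixer(h) + mlp(h) + x    # attn + mlp + residual, mirroring ParallelBlock.forward
print(out.shape)               # torch.Size([2, 5, 16])
```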
moondream.py ADDED
@@ -0,0 +1,107 @@
1
+ import torch
2
+ from torch import nn
3
+ from .vision_encoder import VisionEncoder
4
+ from .configuration_moondream import MoondreamConfig
5
+ from transformers import PreTrainedModel
6
+ import re
7
+
8
+ from .modeling_phi import PhiForCausalLM
9
+ from .configuration_moondream import PhiConfig
10
+
11
+ class Moondream(PreTrainedModel):
12
+ config_class = MoondreamConfig
13
+
14
+ def __init__(self, config):
15
+ super().__init__(config)
16
+ self.vision_encoder = VisionEncoder()
17
+
18
+ if type(config.phi_config) == dict:
19
+ phi_config = PhiConfig(**config.phi_config)
20
+ else:
21
+ phi_config = config.phi_config
22
+ self.text_model = PhiForCausalLM(phi_config)
23
+
24
+ @property
25
+ def device(self):
26
+ return self.text_model.device
27
+
28
+ def encode_image(self, image):
29
+ return self.vision_encoder(image)
30
+
31
+ def input_embeds(self, prompt, image_embeds, tokenizer):
32
+ def _tokenize(txt):
33
+ return tokenizer(
34
+ txt, return_tensors="pt", add_special_tokens=False
35
+ ).input_ids.to(self.device)
36
+
37
+ text_emb = self.text_model.get_input_embeddings()
38
+
39
+ # Add BOS token
40
+ embeds = []
41
+ embeds.append(
42
+ text_emb((torch.tensor([[tokenizer.bos_token_id]], device=self.device)))
43
+ )
44
+
45
+ if "<image>" not in prompt:
46
+ embeds.append(text_emb(_tokenize(prompt)))
47
+ else:
48
+ assert prompt.count("<image>") == 1
49
+ before, after = prompt.split("<image>")
50
+ embeds.append(text_emb(_tokenize(f"{before}<image>")))
51
+ embeds.append(image_embeds.to(self.device))
52
+ embeds.append(text_emb(_tokenize(f"</image>{after}")))
53
+
54
+ return torch.cat(embeds, dim=1)
55
+
56
+ def generate(
57
+ self,
58
+ image_embeds,
59
+ prompt,
60
+ tokenizer,
61
+ eos_text="<END>",
62
+ max_new_tokens=128,
63
+ **kwargs,
64
+ ):
65
+ eos_tokens = tokenizer(eos_text, add_special_tokens=False)[0].ids
66
+
67
+ generate_config = {
68
+ "eos_token_id": eos_tokens,
69
+ "bos_token_id": tokenizer.bos_token_id,
70
+ "pad_token_id": tokenizer.eos_token_id,
71
+ "max_new_tokens": max_new_tokens,
72
+ **kwargs,
73
+ }
74
+
75
+ with torch.no_grad():
76
+ inputs_embeds = self.input_embeds(prompt, image_embeds, tokenizer)
77
+ output_ids = self.text_model.generate(
78
+ inputs_embeds=inputs_embeds, **generate_config
79
+ )
80
+
81
+ return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
82
+
83
+ def answer_question(
84
+ self,
85
+ image_embeds,
86
+ question,
87
+ tokenizer,
88
+ chat_history="",
89
+ result_queue=None,
90
+ **kwargs,
91
+ ):
92
+ prompt = f"<image>\n\n{chat_history}Question: {question}\n\nAnswer: "
93
+ answer = self.generate(
94
+ image_embeds,
95
+ prompt,
96
+ eos_text="<END>",
97
+ tokenizer=tokenizer,
98
+ max_new_tokens=256,
99
+ **kwargs,
100
+ )[0]
101
+ cleaned_answer = re.sub("<$", "", re.sub("END$", "", answer)).strip()
102
+
103
+ # Use the result_queue to pass the result if it is provided
104
+ if result_queue:
105
+ result_queue.put(cleaned_answer)
106
+ else:
107
+ return cleaned_answer
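
Note: the key step in `Moondream.input_embeds` above is the splice: text before `<image>` is embedded, the image embeddings are inserted, and the text after `</image>` follows, all concatenated along the sequence dimension. A toy illustration of that concatenation with stand-in tensors (the embedding function, width, and sequence lengths here are made up for the sketch, not taken from the model):

```python
import torch

d_model = 8  # made-up width for the sketch

def fake_text_embed(text: str) -> torch.Tensor:
    # Stand-in for the real token-embedding lookup; one "token" per whitespace chunk.
    return torch.randn(1, max(len(text.split()), 1), d_model)

prompt = "<image>\n\nQuestion: What is in the picture?\n\nAnswer: "
image_embeds = torch.randn(1, 100, d_model)  # stand-in for encode_image output

before, after = prompt.split("<image>")
inputs_embeds = torch.cat(
    [
        fake_text_embed("<BOS>"),             # BOS embedding, as in input_embeds
        fake_text_embed(f"{before}<image>"),  # text up to and including "<image>"
        image_embeds,                         # image embeddings spliced in
        fake_text_embed(f"</image>{after}"),  # closing tag plus the rest of the prompt
    ],
    dim=1,
)
print(inputs_embeds.shape)  # (1, total_sequence_length, d_model)
```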
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:fdf24bf76befcf76cc645098359eba0e183a0d70d5d554f4e1582b0beb9ebf6c
3
+ size 135
special_tokens_map.json ADDED
@@ -0,0 +1,23 @@
1
+ {
2
+ "bos_token": {
3
+ "content": "<|endoftext|>",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "eos_token": {
10
+ "content": "<|endoftext|>",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "unk_token": {
17
+ "content": "<|endoftext|>",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ }
23
+ }
text_model.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:80449790d25d30d0bd0d5855067657779ba513b05b9208e2ea5e28d3e822af42
3
+ size 135
text_model.py ADDED
@@ -0,0 +1,19 @@
1
+ from torch import nn
2
+ import transformers
3
+ from .modeling_phi import PhiForCausalLM
4
+ from .configuration_moondream import PhiConfig
5
+
6
+ transformers.logging.set_verbosity_error()
7
+
8
+
9
+ class TextModel(nn.Module):
10
+ def __init__(self, config) -> None:
11
+ super().__init__()
12
+
13
+ if type(config.phi_config) == dict:
14
+ phi_config = PhiConfig(**config.phi_config)
15
+ else:
16
+ phi_config = config.phi_config
17
+
18
+ self.model = PhiForCausalLM(phi_config)
19
+ self.text_emb = self.model.get_input_embeddings()
text_model_cfg.json ADDED
@@ -0,0 +1,31 @@
1
+ {
2
+ "_name_or_path": "microsoft/phi-1_5",
3
+ "activation_function": "gelu_new",
4
+ "architectures": [
5
+ "PhiForCausalLM"
6
+ ],
7
+ "attn_pdrop": 0.0,
8
+ "auto_map": {
9
+ "AutoConfig": "configuration_phi.PhiConfig",
10
+ "AutoModelForCausalLM": "modeling_phi.PhiForCausalLM"
11
+ },
12
+ "embd_pdrop": 0.0,
13
+ "flash_attn": false,
14
+ "flash_rotary": false,
15
+ "fused_dense": false,
16
+ "initializer_range": 0.02,
17
+ "layer_norm_epsilon": 1e-05,
18
+ "model_type": "phi-msft",
19
+ "n_embd": 2048,
20
+ "n_head": 32,
21
+ "n_head_kv": null,
22
+ "n_inner": null,
23
+ "n_layer": 24,
24
+ "n_positions": 2048,
25
+ "resid_pdrop": 0.0,
26
+ "rotary_dim": 32,
27
+ "tie_word_embeddings": false,
28
+ "torch_dtype": "float16",
29
+ "transformers_version": "4.34.1",
30
+ "vocab_size": 51200
31
+ }
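
Note: taken together with `modeling_phi.py` above, this config fixes the text backbone's shape: 24 parallel blocks of width 2048 with 32 heads (head_dim 64, of which the first 32 dims get rotary embeddings), `n_inner` defaulting to 4×`n_embd`, `n_head_kv` defaulting to `n_head`, and an untied LM head over a 51200-token vocabulary. A rough back-of-envelope parameter count under those assumptions (text model only; the vision encoder and projection are extra):

```python
# Rough sketch, not an official figure: count text-model parameters from the config above.
cfg = dict(n_embd=2048, n_head=32, n_layer=24, vocab_size=51200, n_inner=None)

d = cfg["n_embd"]
inner = cfg["n_inner"] or 4 * d            # MLP defaults to 4 * n_embd
head_dim = d // cfg["n_head"]              # 64; rotary_dim=32 covers half of each head

qkv = d * 3 * d + 3 * d                    # Wqkv weight + bias (n_head_kv == n_head here)
proj = d * d + d                           # out_proj
mlp = d * inner + inner + inner * d + d    # fc1 + fc2
ln = 2 * d                                 # the block's single LayerNorm
per_layer = qkv + proj + mlp + ln

embed = cfg["vocab_size"] * d                                  # wte
lm_head = 2 * d + d * cfg["vocab_size"] + cfg["vocab_size"]    # ln + untied linear

total = embed + cfg["n_layer"] * per_layer + lm_head
print(f"head_dim={head_dim}, per_layer~{per_layer / 1e6:.1f}M, total~{total / 1e9:.2f}B")
# prints roughly: head_dim=64, per_layer~50.4M, total~1.42B
```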
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer/added_tokens.json ADDED
@@ -0,0 +1,40 @@
1
+ {
2
+ "\t\t": 50294,
3
+ "\t\t\t": 50293,
4
+ "\t\t\t\t": 50292,
5
+ "\t\t\t\t\t": 50291,
6
+ "\t\t\t\t\t\t": 50290,
7
+ "\t\t\t\t\t\t\t": 50289,
8
+ "\t\t\t\t\t\t\t\t": 50288,
9
+ "\t\t\t\t\t\t\t\t\t": 50287,
10
+ " ": 50286,
11
+ " ": 50285,
12
+ " ": 50284,
13
+ " ": 50283,
14
+ " ": 50282,
15
+ " ": 50281,
16
+ " ": 50280,
17
+ " ": 50279,
18
+ " ": 50278,
19
+ " ": 50277,
20
+ " ": 50276,
21
+ " ": 50275,
22
+ " ": 50274,
23
+ " ": 50273,
24
+ " ": 50272,
25
+ " ": 50271,
26
+ " ": 50270,
27
+ " ": 50269,
28
+ " ": 50268,
29
+ " ": 50267,
30
+ " ": 50266,
31
+ " ": 50265,
32
+ " ": 50264,
33
+ " ": 50263,
34
+ " ": 50262,
35
+ " ": 50261,
36
+ " ": 50260,
37
+ " ": 50259,
38
+ " ": 50258,
39
+ " ": 50257
40
+ }
tokenizer/merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer/special_tokens_map.json ADDED
@@ -0,0 +1,5 @@
1
+ {
2
+ "bos_token": "<|endoftext|>",
3
+ "eos_token": "<|endoftext|>",
4
+ "unk_token": "<|endoftext|>"
5
+ }
tokenizer/tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer/tokenizer_config.json ADDED
@@ -0,0 +1,323 @@
1
+ {
2
+ "add_prefix_space": false,
3
+ "added_tokens_decoder": {
4
+ "50256": {
5
+ "content": "<|endoftext|>",
6
+ "lstrip": false,
7
+ "normalized": false,
8
+ "rstrip": false,
9
+ "single_word": false,
10
+ "special": true
11
+ },
12
+ "50257": {
13
+ "content": " ",
14
+ "lstrip": false,
15
+ "normalized": true,
16
+ "rstrip": false,
17
+ "single_word": false,
18
+ "special": false
19
+ },
20
+ "50258": {
21
+ "content": " ",
22
+ "lstrip": false,
23
+ "normalized": true,
24
+ "rstrip": false,
25
+ "single_word": false,
26
+ "special": false
27
+ },
28
+ "50259": {
29
+ "content": " ",
30
+ "lstrip": false,
31
+ "normalized": true,
32
+ "rstrip": false,
33
+ "single_word": false,
34
+ "special": false
35
+ },
36
+ "50260": {
37
+ "content": " ",
38
+ "lstrip": false,
39
+ "normalized": true,
40
+ "rstrip": false,
41
+ "single_word": false,
42
+ "special": false
43
+ },
44
+ "50261": {
45
+ "content": " ",
46
+ "lstrip": false,
47
+ "normalized": true,
48
+ "rstrip": false,
49
+ "single_word": false,
50
+ "special": false
51
+ },
52
+ "50262": {
53
+ "content": " ",
54
+ "lstrip": false,
55
+ "normalized": true,
56
+ "rstrip": false,
57
+ "single_word": false,
58
+ "special": false
59
+ },
60
+ "50263": {
61
+ "content": " ",
62
+ "lstrip": false,
63
+ "normalized": true,
64
+ "rstrip": false,
65
+ "single_word": false,
66
+ "special": false
67
+ },
68
+ "50264": {
69
+ "content": " ",
70
+ "lstrip": false,
71
+ "normalized": true,
72
+ "rstrip": false,
73
+ "single_word": false,
74
+ "special": false
75
+ },
76
+ "50265": {
77
+ "content": " ",
78
+ "lstrip": false,
79
+ "normalized": true,
80
+ "rstrip": false,
81
+ "single_word": false,
82
+ "special": false
83
+ },
84
+ "50266": {
85
+ "content": " ",
86
+ "lstrip": false,
87
+ "normalized": true,
88
+ "rstrip": false,
89
+ "single_word": false,
90
+ "special": false
91
+ },
92
+ "50267": {
93
+ "content": " ",
94
+ "lstrip": false,
95
+ "normalized": true,
96
+ "rstrip": false,
97
+ "single_word": false,
98
+ "special": false
99
+ },
100
+ "50268": {
101
+ "content": " ",
102
+ "lstrip": false,
103
+ "normalized": true,
104
+ "rstrip": false,
105
+ "single_word": false,
106
+ "special": false
107
+ },
108
+ "50269": {
109
+ "content": " ",
110
+ "lstrip": false,
111
+ "normalized": true,
112
+ "rstrip": false,
113
+ "single_word": false,
114
+ "special": false
115
+ },
116
+ "50270": {
117
+ "content": " ",
118
+ "lstrip": false,
119
+ "normalized": true,
120
+ "rstrip": false,
121
+ "single_word": false,
122
+ "special": false
123
+ },
124
+ "50271": {
125
+ "content": " ",
126
+ "lstrip": false,
127
+ "normalized": true,
128
+ "rstrip": false,
129
+ "single_word": false,
130
+ "special": false
131
+ },
132
+ "50272": {
133
+ "content": " ",
134
+ "lstrip": false,
135
+ "normalized": true,
136
+ "rstrip": false,
137
+ "single_word": false,
138
+ "special": false
139
+ },
140
+ "50273": {
141
+ "content": " ",
142
+ "lstrip": false,
143
+ "normalized": true,
144
+ "rstrip": false,
145
+ "single_word": false,
146
+ "special": false
147
+ },
148
+ "50274": {
149
+ "content": " ",
150
+ "lstrip": false,
151
+ "normalized": true,
152
+ "rstrip": false,
153
+ "single_word": false,
154
+ "special": false
155
+ },
156
+ "50275": {
157
+ "content": " ",
158
+ "lstrip": false,
159
+ "normalized": true,
160
+ "rstrip": false,
161
+ "single_word": false,
162
+ "special": false
163
+ },
164
+ "50276": {
165
+ "content": " ",
166
+ "lstrip": false,
167
+ "normalized": true,
168
+ "rstrip": false,
169
+ "single_word": false,
170
+ "special": false
171
+ },
172
+ "50277": {
173
+ "content": " ",
174
+ "lstrip": false,
175
+ "normalized": true,
176
+ "rstrip": false,
177
+ "single_word": false,
178
+ "special": false
179
+ },
180
+ "50278": {
181
+ "content": " ",
182
+ "lstrip": false,
183
+ "normalized": true,
184
+ "rstrip": false,
185
+ "single_word": false,
186
+ "special": false
187
+ },
188
+ "50279": {
189
+ "content": " ",
190
+ "lstrip": false,
191
+ "normalized": true,
192
+ "rstrip": false,
193
+ "single_word": false,
194
+ "special": false
195
+ },
196
+ "50280": {
197
+ "content": " ",
198
+ "lstrip": false,
199
+ "normalized": true,
200
+ "rstrip": false,
201
+ "single_word": false,
202
+ "special": false
203
+ },
204
+ "50281": {
205
+ "content": " ",
206
+ "lstrip": false,
207
+ "normalized": true,
208
+ "rstrip": false,
209
+ "single_word": false,
210
+ "special": false
211
+ },
212
+ "50282": {
213
+ "content": " ",
214
+ "lstrip": false,
215
+ "normalized": true,
216
+ "rstrip": false,
217
+ "single_word": false,
218
+ "special": false
219
+ },
220
+ "50283": {
221
+ "content": " ",
222
+ "lstrip": false,
223
+ "normalized": true,
224
+ "rstrip": false,
225
+ "single_word": false,
226
+ "special": false
227
+ },
228
+ "50284": {
229
+ "content": " ",
230
+ "lstrip": false,
231
+ "normalized": true,
232
+ "rstrip": false,
233
+ "single_word": false,
234
+ "special": false
235
+ },
236
+ "50285": {
237
+ "content": " ",
238
+ "lstrip": false,
239
+ "normalized": true,
240
+ "rstrip": false,
241
+ "single_word": false,
242
+ "special": false
243
+ },
244
+ "50286": {
245
+ "content": " ",
246
+ "lstrip": false,
247
+ "normalized": true,
248
+ "rstrip": false,
249
+ "single_word": false,
250
+ "special": false
251
+ },
252
+ "50287": {
253
+ "content": "\t\t\t\t\t\t\t\t\t",
254
+ "lstrip": false,
255
+ "normalized": true,
256
+ "rstrip": false,
257
+ "single_word": false,
258
+ "special": false
259
+ },
260
+ "50288": {
261
+ "content": "\t\t\t\t\t\t\t\t",
262
+ "lstrip": false,
263
+ "normalized": true,
264
+ "rstrip": false,
265
+ "single_word": false,
266
+ "special": false
267
+ },
268
+ "50289": {
269
+ "content": "\t\t\t\t\t\t\t",
270
+ "lstrip": false,
271
+ "normalized": true,
272
+ "rstrip": false,
273
+ "single_word": false,
274
+ "special": false
275
+ },
276
+ "50290": {
277
+ "content": "\t\t\t\t\t\t",
278
+ "lstrip": false,
279
+ "normalized": true,
280
+ "rstrip": false,
281
+ "single_word": false,
282
+ "special": false
283
+ },
284
+ "50291": {
285
+ "content": "\t\t\t\t\t",
286
+ "lstrip": false,
287
+ "normalized": true,
288
+ "rstrip": false,
289
+ "single_word": false,
290
+ "special": false
291
+ },
292
+ "50292": {
293
+ "content": "\t\t\t\t",
294
+ "lstrip": false,
295
+ "normalized": true,
296
+ "rstrip": false,
297
+ "single_word": false,
298
+ "special": false
299
+ },
300
+ "50293": {
301
+ "content": "\t\t\t",
302
+ "lstrip": false,
303
+ "normalized": true,
304
+ "rstrip": false,
305
+ "single_word": false,
306
+ "special": false
307
+ },
308
+ "50294": {
309
+ "content": "\t\t",
310
+ "lstrip": false,
311
+ "normalized": true,
312
+ "rstrip": false,
313
+ "single_word": false,
314
+ "special": false
315
+ }
316
+ },
317
+ "bos_token": "<|endoftext|>",
318
+ "clean_up_tokenization_spaces": true,
319
+ "eos_token": "<|endoftext|>",
320
+ "model_max_length": 2048,
321
+ "tokenizer_class": "CodeGenTokenizer",
322
+ "unk_token": "<|endoftext|>"
323
+ }
tokenizer/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,323 @@
+ {
+ "add_prefix_space": false,
+ "added_tokens_decoder": {
+ "50256": {
+ "content": "<|endoftext|>",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "50257": {
+ "content": " ",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50258": {
+ "content": " ",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50259": {
+ "content": " ",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50260": {
+ "content": " ",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50261": {
+ "content": " ",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50262": {
+ "content": " ",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50263": {
+ "content": " ",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50264": {
+ "content": " ",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50265": {
+ "content": " ",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50266": {
+ "content": " ",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50267": {
+ "content": " ",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50268": {
+ "content": " ",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50269": {
+ "content": " ",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50270": {
+ "content": " ",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50271": {
+ "content": " ",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50272": {
+ "content": " ",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50273": {
+ "content": " ",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50274": {
+ "content": " ",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50275": {
+ "content": " ",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50276": {
+ "content": " ",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50277": {
+ "content": " ",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50278": {
+ "content": " ",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50279": {
+ "content": " ",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50280": {
+ "content": " ",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50281": {
+ "content": " ",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50282": {
+ "content": " ",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50283": {
+ "content": " ",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50284": {
+ "content": " ",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50285": {
+ "content": " ",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50286": {
+ "content": " ",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50287": {
+ "content": "\t\t\t\t\t\t\t\t\t",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50288": {
+ "content": "\t\t\t\t\t\t\t\t",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50289": {
+ "content": "\t\t\t\t\t\t\t",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50290": {
+ "content": "\t\t\t\t\t\t",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50291": {
+ "content": "\t\t\t\t\t",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50292": {
+ "content": "\t\t\t\t",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50293": {
+ "content": "\t\t\t",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ },
+ "50294": {
+ "content": "\t\t",
+ "lstrip": false,
+ "normalized": true,
+ "rstrip": false,
+ "single_word": false,
+ "special": false
+ }
+ },
+ "bos_token": "<|endoftext|>",
+ "clean_up_tokenization_spaces": true,
+ "eos_token": "<|endoftext|>",
+ "model_max_length": 2048,
+ "tokenizer_class": "CodeGenTokenizer",
+ "unk_token": "<|endoftext|>"
+ }
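
For orientation, the config above registers `<|endoftext|>` as the bos/eos/unk token, limits the context to 2048 tokens, and adds ids 50257–50294 as whitespace-run tokens (spaces and tabs) on top of the base CodeGen vocabulary. A minimal sketch of inspecting these values locally, assuming the file has been downloaded from this repo into the working directory:

```python
import json

# Assumes tokenizer_config.json from this repo sits in the current directory.
with open("tokenizer_config.json") as f:
    cfg = json.load(f)

print(cfg["tokenizer_class"])   # CodeGenTokenizer
print(cfg["model_max_length"])  # 2048
print(cfg["bos_token"], cfg["eos_token"], cfg["unk_token"])  # all <|endoftext|>

# 39 added-token entries: <|endoftext|> (50256) plus the whitespace runs (50257-50294).
added = cfg["added_tokens_decoder"]
print(len(added))  # 39
```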
vision.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1f53a594ea82e4d3a84c78e022f67a1033edd719ed9bee54d29993ba0f246496
+ size 135
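
`vision.pt` is tracked with Git LFS, so the three lines above are only a pointer: `oid` is the SHA-256 of the referenced object and `size` is its byte length. A small sketch, assuming `git lfs pull` has already replaced the pointer with the actual object, of checking a local copy against that pointer:

```python
import hashlib
from pathlib import Path

# Values copied from the LFS pointer above.
EXPECTED_OID = "1f53a594ea82e4d3a84c78e022f67a1033edd719ed9bee54d29993ba0f246496"
EXPECTED_SIZE = 135  # bytes

data = Path("vision.pt").read_bytes()
assert len(data) == EXPECTED_SIZE, f"unexpected size: {len(data)}"
assert hashlib.sha256(data).hexdigest() == EXPECTED_OID, "checksum mismatch"
print("vision.pt matches its LFS pointer")
```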
vision_encoder.py ADDED
@@ -0,0 +1,136 @@
+ import torch
+ from torch import nn
+ from PIL import Image
+ from einops import rearrange
+ from torchvision.transforms.v2 import (
+     Compose,
+     Resize,
+     InterpolationMode,
+     ToImage,
+     ToDtype,
+     Normalize,
+ )
+ import timm
+
+
+ class VisualHolder(nn.Module):
+     def __init__(self, model):
+         super().__init__()
+         self.visual = model
+
+     def forward(self, x):
+         return self.visual(x)
+
+
+ class ModelHolder(nn.Module):
+     def __init__(self, model):
+         super().__init__()
+         self.model = model
+
+     def forward(self, x):
+         return self.model(x)
+
+
+ class LinearPatchEmbedding(nn.Module):
+     def __init__(self, conv):
+         super().__init__()
+         self.linear = nn.Linear(588, 1152)
+         self.linear.weight.data = conv.weight.data.view(1152, -1)
+         if conv.bias is not None:
+             self.linear.bias.data = conv.bias.data
+
+     def forward(self, x):
+         return self.linear(x)
+
+
+ class MLP(nn.Module):
+     def __init__(
+         self,
+         in_features: int,
+         hidden_features: int = None,
+         out_features: int = None,
+         act_layer: nn.Module = nn.GELU,
+     ) -> None:
+         super().__init__()
+         out_features = out_features or in_features
+         hidden_features = hidden_features or in_features
+         self.fc1 = nn.Linear(in_features, hidden_features)
+         self.act = act_layer()
+         self.fc2 = nn.Linear(hidden_features, out_features)
+
+         torch.nn.init.kaiming_normal_(
+             self.fc1.weight, mode="fan_in", nonlinearity="relu"
+         )
+         torch.nn.init.kaiming_normal_(
+             self.fc2.weight, mode="fan_in", nonlinearity="relu"
+         )
+
+     def forward(self, x: torch.Tensor) -> torch.Tensor:
+         x = self.fc1(x)
+         x = self.act(x)
+         x = self.fc2(x)
+         return x
+
+
+ class VisionProjection(nn.Module):
+     def __init__(self):
+         super().__init__()
+
+         image_embedding_dim = 1152
+         model_dim = 2048
+         hidden_dim = model_dim * 4
+
+         self.mlp = MLP(image_embedding_dim, hidden_dim, model_dim)
+
+     @property
+     def device(self):
+         return self.mlp.fc1.weight.device
+
+     def forward(self, x):
+         return self.mlp(x)
+
+
+ class VisionEncoder(nn.Module):
+     def __init__(self) -> None:
+         super().__init__()
+
+         self.encoder = ModelHolder(
+             VisualHolder(timm.create_model("vit_so400m_patch14_siglip_384"))
+         )
+         self.encoder.model.visual.patch_embed = LinearPatchEmbedding(
+             self.encoder.model.visual.patch_embed.proj
+         )
+         self.encoder.model.visual.attn_pool = nn.Identity()
+
+         self.projection = VisionProjection()
+
+         self.preprocess = Compose(
+             [
+                 Resize(size=(378, 378), interpolation=InterpolationMode.BICUBIC),
+                 ToImage(),
+                 ToDtype(torch.float32, scale=True),
+                 Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
+             ]
+         )
+
+     @property
+     def device(self):
+         return self.projection.mlp.fc1.weight.device
+
+     @property
+     def dtype(self):
+         return self.projection.mlp.fc1.weight.dtype
+
+     def __call__(self, image: Image) -> torch.Tensor:
+         with torch.no_grad():
+             x = (
+                 self.preprocess(image.convert("RGB"))
+                 .unsqueeze(0)
+                 .to(self.device, dtype=self.dtype)
+             )
+             x = rearrange(x, "b c (h p1) (w p2) -> b (h w) (c p1 p2)", p1=14, p2=14)
+
+             x = self.encoder(x)
+             x = self.projection(x)
+
+             return x
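
In short, the encoder above resizes the image to 378×378, flattens it into 729 patches of 14×14×3 = 588 values, embeds each patch linearly to 1152 dims, runs the SigLIP ViT trunk (`vit_so400m_patch14_siglip_384` with its attention pooling bypassed), and projects the result to 2048 dims to match the language model's embedding width. A minimal sketch of driving it standalone, assuming the repo files are on the local path and `demo.jpg` is any test image; note that `VisionEncoder()` starts from random SigLIP weights, so the shapes are meaningful but the outputs are not until the checkpoint is loaded:

```python
import torch
from PIL import Image
from einops import rearrange

from vision_encoder import VisionEncoder  # the module added above

enc = VisionEncoder()         # randomly initialised until the checkpoint is loaded
img = Image.open("demo.jpg")  # hypothetical test image

# Preprocessing path only: image -> (1, 3, 378, 378) -> 729 patches of 588 values each.
x = enc.preprocess(img.convert("RGB")).unsqueeze(0)
patches = rearrange(x, "b c (h p1) (w p2) -> b (h w) (c p1 p2)", p1=14, p2=14)
print(patches.shape)          # torch.Size([1, 729, 588])

# Full pipeline: linear patch embed -> SigLIP trunk -> MLP projection to 2048 dims.
emb = enc(img)
print(emb.shape[-1])          # 2048
```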
vocab.json ADDED
The diff for this file is too large to render. See raw diff