Safetensors
English
llava_next
custom_code
nada5 and wping committed · Commit 67e706d · verified · 0 Parent(s)

Super-squash branch 'main' using huggingface_hub


Co-authored-by: wping <[email protected]>

.gitattributes ADDED
@@ -0,0 +1,35 @@
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
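The 35 patterns above route matching files through Git LFS instead of storing them as regular Git blobs. As a rough illustration of which files in this commit are affected, the sketch below checks file names against a subset of the patterns with Python's `fnmatch`. It is only an approximation: Git's attribute matching has its own rules (for example for `**`), and the helper name `is_lfs_tracked` and the chosen pattern subset are assumptions made for this example.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# A subset of the patterns added to .gitattributes in this commit.
LFS_PATTERNS = ["*.bin", "*.safetensors", "*.tar.*", "saved_model/**/*", "*tfevents*"]

def is_lfs_tracked(path: str) -> bool:
    """Approximate whether `path` matches one of the LFS patterns."""
    name = PurePosixPath(path).name
    for pattern in LFS_PATTERNS:
        # Patterns without a slash are compared against the file name only,
        # patterns with a slash against the whole relative path.
        target = path if "/" in pattern else name
        if fnmatch(target, pattern):
            return True
    return False

for path in ["model-00001-of-00004.safetensors", "model.safetensors.index.json", "config.json"]:
    print(path, is_lfs_tracked(path))
```

Under this approximation the weight shards match `*.safetensors`, while `model.safetensors.index.json` and `config.json` do not, which is consistent with the diff showing the shards as LFS pointers and the JSON files as plain text.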
README.md ADDED
@@ -0,0 +1,173 @@
1
+ ---
2
+ language:
3
+ - en
4
+ license: cc-by-nc-4.0
5
+ ---
6
+ ## Introduction
7
+ We introduce MM-Embed, an extension of NV-Embed-v1 with multimodal retrieval capability.
8
+ MM-Embed achieves state-of-the-art results on the [UniIR benchmark](https://huggingface.co/TIGER-Lab/UniIR), with an average score of 52.7 compared to 48.9 (the best result in the [UniIR benchmark paper](https://eccv.ecva.net/virtual/2024/poster/863)).
9
+ Notably, MM-Embed improves on NV-Embed-v1's text retrieval accuracy, from 59.36 to 60.3 on the 15 retrieval tasks within the Massive Text Embedding Benchmark ([MTEB benchmark](https://arxiv.org/abs/2210.07316)).
10
+ MM-Embed presents several new training strategies, including modality-aware hard negative mining to improve multimodal retrieval accuracy on UniIR, and a continual text-to-text fine-tuning method that further enhances text-to-text retrieval accuracy while maintaining multimodal retrieval accuracy.
11
+
12
+ <!-- For more technical details, refer to our paper: [NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models](https://arxiv.org/pdf/2405.17428). -->
13
+
14
+ <!-- For more benchmark results (other than MTEB), please find the [AIR-Bench](https://huggingface.co/spaces/AIR-Bench/leaderboard) for QA (English only) and Long-Doc. -->
15
+
16
+ ## Model Details
17
+ - Multimodal architecture: [llava-hf/llava-v1.6-mistral-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf)
18
+ - Text Embedding LLM: [nvidia/NV-Embed-v1](https://huggingface.co/nvidia/NV-Embed-v1)
19
+
20
+ ## How to use
21
+
22
+ Here are two examples of how to encode queries and passages using Hugging Face Transformers. Please find the required package versions [here](https://huggingface.co/nvidia/MM-Embed#1-required-packages). See more instructions for various retrieval scenarios [here](https://huggingface.co/nvidia/MM-Embed/blob/main/instructions.json).
23
+
24
+ ### Usage of Multimodal Retrieval (HuggingFace Transformers)
25
+ ```python
26
+ import torch
27
+ import torch.nn.functional as F
28
+ from transformers import AutoTokenizer, AutoModel
29
+ from PIL import Image
30
+ import requests
31
+
32
+ # Each query needs to be accompanied by a corresponding instruction describing the task.
33
+ task_name_to_instruct = {"example": "Retrieve a Wikipedia paragraph that provides an answer to the given query about the image."}
34
+
35
+ img1_url = 'https://cdn.contexttravel.com/image/upload/w_1500,q_60/v1574869648/blog/Facts%20about%20the%20Eiffel%20Tower/eiffelhero.jpg'
36
+ img2_url = 'https://trumpwhitehouse.archives.gov/wp-content/uploads/2021/01/40508989563_514189250a_o-1500x720.jpg'
37
+
38
+ instruction = task_name_to_instruct['example']
39
+ queries = [
40
+ {'txt': 'What country does this place belong to?', 'img': Image.open(requests.get(img1_url, stream=True).raw)},
41
+ {'txt': 'What country does this place belong to?', 'img': Image.open(requests.get(img2_url, stream=True).raw)},
42
+ ]
43
+
44
+ # No instruction needed for retrieval passages
45
+ passages = [
46
+ {'txt': "France, officially the French Republic, is a country located primarily in Western Europe. Its overseas regions and territories include French Guiana in South America, Saint Pierre and Miquelon in the North Atlantic, the French West Indies, and many islands in Oceania and the Indian Ocean, giving it one of the largest discontiguous exclusive economic zones in the world. Metropolitan France shares borders with Belgium and Luxembourg to the north, Germany to the northeast, Switzerland to the east, Italy and Monaco to the southeast, Andorra and Spain to the south, and a maritime border with the United Kingdom to the northwest. Its metropolitan area extends from the Rhine to the Atlantic Ocean and from the Mediterranean Sea to the English Channel and the North Sea. Its eighteen integral regions (five of which are overseas) span a combined area of 643,801 km2 (248,573 sq mi) and have a total population of 68.4 million as of January 2024. France is a semi-presidential republic with its capital in Paris, the country's largest city and main cultural and commercial centre."},
47
+ {'txt': "The United States of America (USA), commonly known as the United States (U.S.) or America, is a country primarily located in North America. It is a federal union of 50 states and a federal capital district, Washington, D.C. The 48 contiguous states border Canada to the north and Mexico to the south, with the states of Alaska to the northwest and the archipelagic Hawaii in the Pacific Ocean. The United States also asserts sovereignty over five major island territories and various uninhabited islands. The country has the world's third-largest land area, largest exclusive economic zone, and third-largest population, exceeding 334 million. Its three largest metropolitan areas are New York, Los Angeles, and Chicago, and its three most populous states are California, Texas, and Florida."},
48
+ ]
49
+
50
+ # load model with tokenizer
51
+ model = AutoModel.from_pretrained('nvidia/MM-Embed', trust_remote_code=True)
52
+ model = model.cuda()
53
+
54
+ # get the embeddings; the output embeddings are normalized to unit length
55
+ max_length = 4096
56
+ query_embeddings = model.encode(queries, is_query=True, instruction=instruction, max_length=max_length)['hidden_states']
57
+ passage_embeddings = model.encode(passages, max_length=max_length)['hidden_states']
58
+
59
+ # compute relevance scores
60
+ scores = (query_embeddings @ passage_embeddings.T) * 100
61
+ print(scores.tolist())
62
+ #[[31.019872665405273, 12.753520965576172], [11.135049819946289, 22.12639617919922]]
63
+ ```
64
+
65
+ ### Usage of Text-to-Text Retrieval (HuggingFace Transformers)
66
+ ```python
67
+ import torch
68
+ import torch.nn.functional as F
69
+ from transformers import AutoTokenizer, AutoModel
70
+
71
+ # Each query needs to be accompanied by a corresponding instruction describing the task.
72
+ task_name_to_instruct = {"example": "Given a question, retrieve passages that answer the question"}
73
+
74
+ instruction = task_name_to_instruct['example']
75
+ queries = [
76
+ {'txt': 'are judo throws allowed in wrestling?'},
77
+ {'txt': 'how to become a radiology technician in michigan?'},
78
+ ]
79
+
80
+ # No instruction needed for retrieval passages
81
+ passages = [
82
+ {'txt': "Since you're reading this, you are probably someone from a judo background or someone who is just wondering how judo techniques can be applied under wrestling rules. So without further ado, let's get to the question. Are Judo throws allowed in wrestling? Yes, judo throws are allowed in freestyle and folkstyle wrestling. You only need to be careful to follow the slam rules when executing judo throws. In wrestling, a slam is lifting and returning an opponent to the mat with unnecessary force."},
83
+ {'txt': "Below are the basic steps to becoming a radiologic technologist in Michigan:Earn a high school diploma. As with most careers in health care, a high school education is the first step to finding entry-level employment. Taking classes in math and science, such as anatomy, biology, chemistry, physiology, and physics, can help prepare students for their college studies and future careers.Earn an associate degree. Entry-level radiologic positions typically require at least an Associate of Applied Science. Before enrolling in one of these degree programs, students should make sure it has been properly accredited by the Joint Review Committee on Education in Radiologic Technology (JRCERT).Get licensed or certified in the state of Michigan."},
84
+ ]
85
+
86
+ # load model with tokenizer
87
+ model = AutoModel.from_pretrained('nvidia/MM-Embed', trust_remote_code=True)
88
+ model = model.cuda()
89
+
90
+ # get the embeddings; the output embeddings are normalized to unit length
91
+ max_length = 4096
92
+ query_embeddings = model.encode(queries, is_query=True, instruction=instruction, max_length=max_length)['hidden_states']
93
+ passage_embeddings = model.encode(passages, max_length=max_length)['hidden_states']
94
+
95
+ # compute relevance scores
96
+ scores = (query_embeddings @ passage_embeddings.T) * 100
97
+ print(scores.tolist())
98
+ #[[80.78538513183594, 2.030935049057007], [3.7138314247131348, 83.22908782958984]]
99
+ ```
100
+
101
+ ## Correspondence to
102
+ Sheng-Chieh Lin ([email protected]), Wei Ping ([email protected])
103
+
104
+ ## Citation
105
+ If you find this code useful in your research, please consider citing:
106
+
107
+ ```bibtex
108
+ @misc{lin2024nvmmembed,
109
+ title={MM-Embed: Universal Multimodal Retrieval with Multimodal LLMs},
110
+ author={Sheng-Chieh Lin and Chankyu Lee and Mohammad Shoeybi and Jimmy Lin and Bryan Catanzaro and Wei Ping},
111
+ year={2024},
112
+ eprint={2411.02571},
113
+ archivePrefix={arXiv},
114
+ primaryClass={cs.CL},
115
+ url={https://arxiv.org/abs/2411.02571},
116
+ }
117
+ ```
118
+ ## License
119
+ This model should not be used for any commercial purpose. Refer to the [license](https://spdx.org/licenses/CC-BY-NC-4.0) for the detailed terms.
120
+
121
+ For commercial use, we recommend the models of [NeMo Retriever Microservices (NIMs)](https://build.nvidia.com/explore/retrieval).
122
+
123
+
124
+ ## Troubleshooting
125
+
126
+
127
+ #### 1. Required Packages
128
+
129
+ If you have trouble, try installing the Python packages as shown below:
130
+ ```bash
131
+ pip uninstall -y transformer-engine
132
+ pip install torch==2.2.0
133
+ pip install transformers==4.42.4
134
+ pip install flash-attn==2.2.0
135
+ pip install pillow
136
+ ```
137
+
138
+ #### 2. Access to model nvidia/MM-Embed is restricted. You must be authenticated to access it
139
+
140
+ Log in with your Hugging Face access [token](https://huggingface.co/settings/tokens) by running `huggingface-cli login`.
141
+
142
+ ## Model Architectures
143
+
144
+ **Network Architecture:** Decoder-Only Transformer
145
+
146
+ ### Input
147
+ **Input Type(s):** Text, Image <br>
148
+ **Input Format(s):** String, [Pillow Library-Supported Formats](https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html) <br>
149
+ **Input Dimensions:** One-Dimensional (1D), Two Dimensional (2D) <br>
150
+ **Other Properties Related to Input:** Maximum Token Length = 32,768 Tokens <br>
151
+
152
+ ### Output
153
+ **Output Type(s):** Embedding Vector <br>
154
+ **Output Format:** Numeric <br>
155
+ **Model Output:** 1D <br>
156
+ **Other Properties Related to Output:** None <br>
157
+
158
+ ## Software Integration
159
+ **Runtime Engine(s):** PyTorch <br>
160
+
161
+ **Supported Hardware Microarchitecture Compatibility:** NVIDIA Hopper <br>
162
+
163
+ **Supported Operating System(s):** Linux <br>
164
+
165
+ ## Inference
166
+ **Engine:** PyTorch <br>
167
+ **Test Hardware:** H100 <br>
168
+
169
+ ## Ethical Considerations
170
+ NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
171
+
172
+ Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
173
+
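The two README examples end by printing a query-by-passage score matrix. Since the README states that the embeddings are normalized, each score is the dot product of two such vectors scaled by 100, and the best passage for a query is the column with the highest score in that query's row. A minimal sketch of that last step, using the multimodal scores printed in the README (rounded to four decimals); the helper name `rank_passages` is hypothetical and not part of the model's API:

```python
def rank_passages(scores):
    """For each query row, return passage indices sorted from best to worst."""
    return [
        sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        for row in scores
    ]

# Scores copied from the multimodal example output in the README (rounded).
scores = [
    [31.0199, 12.7535],  # first image query vs. [France passage, United States passage]
    [11.1350, 22.1264],  # second image query vs. [France passage, United States passage]
]

print(rank_passages(scores))  # [[0, 1], [1, 0]]
```

With real tensors the same ranking could be obtained from the score matrix directly; the list-based version is used here only so the sketch runs without a GPU or the model weights.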
config.json ADDED
@@ -0,0 +1,72 @@
1
+ {
2
+ "_name_or_path": "myshare/NV-MMEmbed-v1",
3
+ "add_eos_token": true,
4
+ "architectures": [
5
+ "NVMMEmbedModel"
6
+ ],
7
+ "auto_map": {
8
+ "AutoModel": "modeling_nvmmembed.NVMMEmbedModel"
9
+ },
10
+ "global_image_patch_only": true,
11
+ "ignore_index": -100,
12
+ "image_grid_pinpoints": [
13
+ [
14
+ 336,
15
+ 672
16
+ ],
17
+ [
18
+ 672,
19
+ 336
20
+ ],
21
+ [
22
+ 672,
23
+ 672
24
+ ],
25
+ [
26
+ 1008,
27
+ 336
28
+ ],
29
+ [
30
+ 336,
31
+ 1008
32
+ ]
33
+ ],
34
+ "image_token_index": 32000,
35
+ "model_type": "llava_next",
36
+ "padding_side": "right",
37
+ "projector_hidden_act": "gelu",
38
+ "retriever": "nvidia/NV-Embed-v1",
39
+ "text_config": {
40
+ "_name_or_path": "mistralai/Mistral-7B-Instruct-v0.2",
41
+ "architectures": [
42
+ "MistralForCausalLM"
43
+ ],
44
+ "intermediate_size": 14336,
45
+ "max_position_embeddings": 32768,
46
+ "model_type": "mistral",
47
+ "num_key_value_heads": 8,
48
+ "rms_norm_eps": 1e-05,
49
+ "rope_theta": 1000000.0,
50
+ "sliding_window": null,
51
+ "torch_dtype": "bfloat16",
52
+ "vocab_size": 32064
53
+ },
54
+ "tie_word_embeddings": false,
55
+ "torch_dtype": "float16",
56
+ "transformers_version": "4.42.4",
57
+ "use_image_newline_parameter": true,
58
+ "vision_config": {
59
+ "hidden_size": 1024,
60
+ "image_size": 336,
61
+ "intermediate_size": 4096,
62
+ "model_type": "clip_vision_model",
63
+ "num_attention_heads": 16,
64
+ "num_hidden_layers": 24,
65
+ "patch_size": 14,
66
+ "projection_dim": 768,
67
+ "vocab_size": 32000
68
+ },
69
+ "vision_feature_layer": -2,
70
+ "vision_feature_select_strategy": "default",
71
+ "vocab_size": 32064
72
+ }
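A few quantities follow directly from the vision settings in this config. With `image_size` 336 and `patch_size` 14, the vision tower sees a 24 × 24 grid of patches per 336-pixel tile, and every entry in `image_grid_pinpoints` is a whole multiple of 336 in each dimension. The sketch below only does that arithmetic. How the custom `NVMMEmbedModel` code actually consumes these values (for instance what `global_image_patch_only` changes) is not shown in this part of the diff, so any interpretation beyond the arithmetic is an assumption.

```python
# Values copied from the "vision_config" and "image_grid_pinpoints" entries above.
image_size = 336
patch_size = 14
image_grid_pinpoints = [[336, 672], [672, 336], [672, 672], [1008, 336], [336, 1008]]

patches_per_side = image_size // patch_size
patches_per_tile = patches_per_side ** 2

# Number of 336-pixel tiles along each dimension of every candidate resolution.
tile_grids = [(a // image_size, b // image_size) for a, b in image_grid_pinpoints]

print(patches_per_side, patches_per_tile)  # 24 576
print(tile_grids)  # [(1, 2), (2, 1), (2, 2), (3, 1), (1, 3)]
```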
generation_config.json ADDED
@@ -0,0 +1,6 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 1,
4
+ "eos_token_id": 2,
5
+ "transformers_version": "4.42.4"
6
+ }
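Both `config.json` and `generation_config.json` record `transformers_version` 4.42.4, the same version the README's troubleshooting section pins. A small check along the following lines can catch a mismatched environment before loading the model. It is a sketch: the `PINNED` table is copied from the README's `pip install` lines, and `check_pins` is a hypothetical helper, not something shipped in this repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned in the README's "Required Packages" section.
PINNED = {"torch": "2.2.0", "transformers": "4.42.4", "flash-attn": "2.2.0"}

def check_pins(pinned, lookup=version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for package, expected in pinned.items():
        try:
            installed = lookup(package)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        # Compare only the public version, ignoring local tags such as "+cu121".
        if installed is None or installed.split("+")[0] != expected:
            mismatches[package] = (expected, installed)
    return mismatches

for package, (expected, installed) in check_pins(PINNED).items():
    print(f"{package}: expected {expected}, found {installed}")
```

The `lookup` parameter exists only so the function can be exercised without the real packages installed; by default it queries the current environment.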
instructions.json ADDED
@@ -0,0 +1,31 @@
1
+ {"VisualNews": {"query_instruction": ["Identify the news-related image in line with the described event.", "I want you to retrieve an image of this news caption.", "Display an image that best captures the following caption from the news.", "Based on the caption, provide the most fitting image for the news story."], "query_modality": "text", "candidate_modality": "image"}}
2
+ {"MSCOCO": {"query_instruction": ["I want you to retrieve an image of this daily life description.", "Show me an image that best captures the following common scene description.", "Find me an everyday image that matches the given caption.", "Identify the image showcasing the described everyday scene."], "query_modality": "text", "candidate_modality": "image"}}
3
+ {"Fashion200K": {"query_instruction": ["You need to identify the image that corresponds to the fashion product description provided.", "Identify the fashion image that aligns with the described product.", "Match the provided description to the correct fashion item photo.", "Based on the following fashion description, retrieve the best matching image."], "query_modality": "text", "candidate_modality": "image"}}
4
+ {"WebQA": {"query_instruction": ["I want to find an answer to the question. Can you find some snippets that provide evidence from Wikipedia?", "Retrieve passages from Wikipedia that provide answers to the following question.", "I'm looking for a Wikipedia snippet that answers this question.", "You have to find a Wikipedia paragraph that provides the answer to the question."], "query_modality": "text", "candidate_modality": "text"}}
5
+ {"EDIS": {"query_instruction": ["Find a news image that matches the provided caption.", "I'm looking for an image that aligns with this news caption.", "Can you pair this news caption with the right image?", "Identify the news photo for the given caption."], "query_modality": "text", "candidate_modality": "image,text"}}
6
+ {"WebQA": {"query_instruction": ["Provide with me an image from Wikipedia to answer this question.", "Find a Wikipedia image that answers this question.", "I want to know the answer to this question. Please find the related Wikipedia image for me.", "You need to retrieve an evidence image from Wikipedia to address this question."], "query_modality": "text", "candidate_modality": "image,text"}}
7
+ {"VisualNews": {"query_instruction": ["Find a caption for the news in the given photo.", "Based on the shown image, retrieve an appropriate news caption.", "I want to know the caption for this news image.", "Provide a news-related caption for the displayed image."], "query_modality": "image", "candidate_modality": "text"}}
8
+ {"MSCOCO": {"query_instruction": ["Find an image caption describing the following everyday image.", "I want to locate the caption that best describes this everyday scene image.", "Retrieve the caption for the displayed day-to-day image.", "Can you find a caption talking about this daily life image?"], "query_modality": "image", "candidate_modality": "text"}}
9
+ {"Fashion200K": {"query_instruction": ["Based on the displayed image, retrieve the corresponding fashion description.", "Find a product description for the fashion item in the image.", "I want to find a matching description for the fashion item in this image.", "Can you retrieve the description for the fashion item in the image?"], "query_modality": "image", "candidate_modality": "text"}}
10
+ {"NIGHTS": {"query_instruction": ["Find a day-to-day image that looks similar to the provided image.", "You need to identify the common scene image that aligns most with this reference image.", "Which everyday image is the most similar to the reference image?", "Find a daily life image that is identical to the given one."], "query_modality": "image", "candidate_modality": "image"}}
11
+ {"OVEN": {"query_instruction": ["You have to find a Wikipedia segment that identifies this image's subject.", "Determine the Wikipedia snippet that identifies the visual entity in the image.", "Retrieve a Wikipedia paragraph that provides an answer to the given query about the image.", "I want to find a paragraph from Wikipedia that answers my question about this image."], "query_modality": "image,text", "candidate_modality": "text"}}
12
+ {"INFOSEEK": {"query_instruction": ["Determine the Wikipedia snippet that matches the question of this image.", "Retrieve a Wikipedia paragraph that provides an answer to the given query about the image.", "You have to find a Wikipedia segment that answers the question about the displayed image.", "I want to find a paragraph from Wikipedia that answers my question about this image."], "query_modality": "image,text", "candidate_modality": "text"}}
13
+ {"FashionIQ": {"query_instruction": ["Find a fashion image that aligns with the reference image and style note.", "I'm looking for a similar fashion product image with the described style changes.", "Given the reference image and design hint, identify the matching fashion image.", "With the reference image and modification instructions, find the described fashion look."], "query_modality": "image,text", "candidate_modality": "image"}}
14
+ {"CIRR": {"query_instruction": ["I'm looking for a similar everyday image with the described changes.", "Retrieve a day-to-day image that aligns with the modification instructions of the provided image.", "Can you help me find a daily image that meets the modification from the given image?", "Pull up a common scene image like this one, but with the modifications I asked for."], "query_modality": "image,text", "candidate_modality": "image"}}
15
+ {"OVEN": {"query_instruction": ["I want to find an image and subject description from Wikipedia that answers my question about this image.", "Determine the Wikipedia image-snippet pair that clarifies the entity in this picture.", "Retrieve a Wikipedia image-description pair that provides evidence for the question of this image.", "I want to know the subject in the photo. Can you provide the relevant Wikipedia section and image?"], "query_modality": "image,text", "candidate_modality": "image,text"}}
16
+ {"INFOSEEK": {"query_instruction": ["Determine the Wikipedia image-snippet pair that matches my question about this image.", "I want to find an image and subject description from Wikipedia that answers my question about this image.", "I want to address the query about this picture. Please pull up a relevant Wikipedia section and image.", "Retrieve a Wikipedia image-description pair that provides evidence for the question of this image."], "query_modality": "image,text", "candidate_modality": "image,text"}}
17
+ {"msmarco": {"query_instruction": ["Given a question, retrieve passages that answer the question", "Given a question, retrieve documents that can help answer the question", "Given a web search query, retrieve relevant passages that answer the query"], "query_modality": "text", "candidate_modality": "text"}}
18
+ {"nq": {"query_instruction": ["Given a question, retrieve Wikipedia passages that answer the question"], "query_modality": "text", "candidate_modality": "text"}}
19
+ {"quora": {"query_instruction": ["Given a question, retrieve questions that are semantically equivalent to the given question", "Find questions that have the same meaning as the input question"], "query_modality": "text", "candidate_modality": "text"}}
20
+ {"fever": {"query_instruction": ["Given a claim, retrieve documents that support or refute the claim"], "query_modality": "text", "candidate_modality": "text"}}
21
+ {"hotpotqa": {"query_instruction": ["Given a multi-hop question, retrieve documents that can help answer the question"], "query_modality": "text", "candidate_modality": "text"}}
22
+ {"arguana": {"query_instruction": ["Given a claim, find documents that refute the claim"], "query_modality": "text", "candidate_modality": "text"}}
23
+ {"climate-fever": {"query_instruction": ["Given a claim about climate change, retrieve documents that support or refute the claim"], "query_modality": "text", "candidate_modality": "text"}}
24
+ {"dbpedia-entity": {"query_instruction": ["Given a query, retrieve relevant entity descriptions from DBPedia"], "query_modality": "text", "candidate_modality": "text"}}
25
+ {"fiqa": {"query_instruction": ["Given a financial question, retrieve user replies that best answer the question"], "query_modality": "text", "candidate_modality": "text"}}
26
+ {"nfcorpus": {"query_instruction": ["Given a question, retrieve relevant documents that best answer the question"], "query_modality": "text", "candidate_modality": "text"}}
27
+ {"scidocs": {"query_instruction": ["Given a scientific paper title, retrieve paper abstracts that are cited by the given paper"], "query_modality": "text", "candidate_modality": "text"}}
28
+ {"scifact": {"query_instruction": ["Given a scientific claim, retrieve documents that support or refute the claim"], "query_modality": "text", "candidate_modality": "text"}}
29
+ {"webis-touche2020": {"query_instruction": ["Given a question, retrieve detailed and persuasive arguments that answer the question"], "query_modality": "text", "candidate_modality": "text"}}
30
+ {"trec-covid": {"query_instruction": ["Given a query on COVID-19, retrieve documents that answer the query"], "query_modality": "text", "candidate_modality": "text"}}
31
+ {"cqadupstack": {"query_instruction": ["Given a question, retrieve detailed question descriptions from Stackexchange that are duplicates to the given question"], "query_modality": "text", "candidate_modality": "text"}}
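Despite its `.json` extension, `instructions.json` as shown in this diff holds one JSON object per line (JSON Lines), so calling `json.load` on the whole file would fail and each line has to be parsed separately. Task names also repeat (for example `WebQA`, `OVEN` and `INFOSEEK` each appear twice with different candidate modalities), so a plain dict keyed by task name would silently drop entries. The sketch below keys on task plus modalities instead. The function name `load_instructions` is hypothetical, and the sample text is two lines copied from the file.

```python
import json

# Two lines copied verbatim from instructions.json.
SAMPLE = '''{"nq": {"query_instruction": ["Given a question, retrieve Wikipedia passages that answer the question"], "query_modality": "text", "candidate_modality": "text"}}
{"fever": {"query_instruction": ["Given a claim, retrieve documents that support or refute the claim"], "query_modality": "text", "candidate_modality": "text"}}'''

def load_instructions(text):
    """Parse JSON Lines text into {(task, query_modality, candidate_modality): [instructions]}."""
    table = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        for task, spec in json.loads(line).items():
            key = (task, spec["query_modality"], spec["candidate_modality"])
            table[key] = spec["query_instruction"]
    return table

instructions = load_instructions(SAMPLE)
print(instructions[("nq", "text", "text")][0])
```

For the real file, the same function could be applied to the file's contents read as text; one of the returned instructions would then be passed as the `instruction` argument shown in the README examples.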
model-00001-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:52d79efc12781395e25255c4fb4f32f2ee000390c1ceb53ecb8b2e462f26c676
3
+ size 4921093336
model-00002-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:fd7e4cb7a81545c66443194c3ac1a0525d818918ed2b93b7138e8b8599a3cff6
3
+ size 4915916968
model-00003-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ca5242bfd0996e916795d5669adffaca27579297cd3148e8e9330209e3edff7c
3
+ size 4915916976
model-00004-of-00004.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b3d55872a103c2ff295b4abdc9a411b068760c856a7f195d269013fffe2d8a84
3
+ size 1598179616
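Each `.safetensors` entry in this diff is a Git LFS pointer: three `key value` lines giving the spec version, the SHA-256 object id and the size in bytes of the real file. A minimal sketch of parsing such a pointer and totalling the four shard sizes follows; `parse_lfs_pointer` is a hypothetical helper written for this example. The sum comes out 91,984 bytes above the `total_size` of 16351014912 recorded in `model.safetensors.index.json`. The per-file safetensors headers would plausibly account for that gap, though this explanation is an assumption rather than something the diff states.

```python
def parse_lfs_pointer(text):
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    fields["size"] = int(fields["size"])
    return fields

# Pointer for the last shard, copied from the diff above.
pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:b3d55872a103c2ff295b4abdc9a411b068760c856a7f195d269013fffe2d8a84\n"
    "size 1598179616\n"
)

# Sizes of all four shards as listed in their pointers.
shard_sizes = [4921093336, 4915916968, 4915916976, pointer["size"]]
total_bytes = sum(shard_sizes)

print(total_bytes)                # 16351106896
print(total_bytes - 16351014912)  # 91984
```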
model.safetensors.index.json ADDED
@@ -0,0 +1,707 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 16351014912
4
+ },
5
+ "weight_map": {
6
+ "image_newline": "model-00001-of-00004.safetensors",
7
+ "language_model.embed_tokens.weight": "model-00001-of-00004.safetensors",
8
+ "language_model.layers.0.input_layernorm.weight": "model-00001-of-00004.safetensors",
9
+ "language_model.layers.0.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
10
+ "language_model.layers.0.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
11
+ "language_model.layers.0.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
12
+ "language_model.layers.0.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
13
+ "language_model.layers.0.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
14
+ "language_model.layers.0.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
15
+ "language_model.layers.0.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
16
+ "language_model.layers.0.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
17
+ "language_model.layers.1.input_layernorm.weight": "model-00001-of-00004.safetensors",
18
+ "language_model.layers.1.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
19
+ "language_model.layers.1.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
20
+ "language_model.layers.1.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
21
+ "language_model.layers.1.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
22
+ "language_model.layers.1.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
23
+ "language_model.layers.1.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
24
+ "language_model.layers.1.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
25
+ "language_model.layers.1.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
26
+ "language_model.layers.10.input_layernorm.weight": "model-00002-of-00004.safetensors",
27
+ "language_model.layers.10.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
28
+ "language_model.layers.10.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
29
+ "language_model.layers.10.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
30
+ "language_model.layers.10.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
31
+ "language_model.layers.10.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
32
+ "language_model.layers.10.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
33
+ "language_model.layers.10.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
34
+ "language_model.layers.10.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
35
+ "language_model.layers.11.input_layernorm.weight": "model-00002-of-00004.safetensors",
36
+ "language_model.layers.11.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
37
+ "language_model.layers.11.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
38
+ "language_model.layers.11.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
39
+ "language_model.layers.11.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
40
+ "language_model.layers.11.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
41
+ "language_model.layers.11.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
42
+ "language_model.layers.11.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
43
+ "language_model.layers.11.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
44
+ "language_model.layers.12.input_layernorm.weight": "model-00002-of-00004.safetensors",
45
+ "language_model.layers.12.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
46
+ "language_model.layers.12.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
47
+ "language_model.layers.12.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
48
+ "language_model.layers.12.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.12.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.12.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.12.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.12.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.13.input_layernorm.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.13.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.13.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.13.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.13.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.13.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.13.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.13.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.13.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.14.input_layernorm.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.14.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.14.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.14.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.14.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.14.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.14.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.14.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.14.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.15.input_layernorm.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.15.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.15.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.15.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.15.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.15.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.15.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.15.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.15.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.16.input_layernorm.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.16.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.16.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.16.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.16.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.16.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.16.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.16.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.16.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.17.input_layernorm.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.17.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.17.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.17.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.17.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.17.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.17.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.17.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.17.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.18.input_layernorm.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.18.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.18.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.18.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.18.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.18.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.18.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.18.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.18.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.19.input_layernorm.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.19.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.19.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.19.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.19.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.19.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.19.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.19.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.19.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.2.input_layernorm.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.2.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.2.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.2.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.2.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.2.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.2.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.2.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.2.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.20.input_layernorm.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.20.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.20.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.20.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.20.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.20.self_attn.k_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.20.self_attn.o_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.20.self_attn.q_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.20.self_attn.v_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.21.input_layernorm.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.21.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.21.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.21.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.21.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.21.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.21.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.21.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.21.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.22.input_layernorm.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.22.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.22.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.22.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.22.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.22.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.22.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.22.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.22.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.23.input_layernorm.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.23.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.23.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.23.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.23.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.23.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.23.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.23.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.23.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.24.input_layernorm.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.24.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.24.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.24.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.24.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.24.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.24.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.24.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.24.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.25.input_layernorm.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.25.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.25.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.25.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.25.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.25.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.25.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.25.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.25.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.26.input_layernorm.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.26.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.26.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.26.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.26.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.26.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.26.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.26.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.26.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.27.input_layernorm.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.27.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.27.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.27.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.27.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.27.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.27.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.27.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.27.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.28.input_layernorm.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.28.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.28.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.28.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.28.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.28.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.28.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.28.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.28.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.29.input_layernorm.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.29.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.29.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.29.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.29.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.29.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.29.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.29.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.29.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.3.input_layernorm.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.3.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.3.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.3.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.3.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.3.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.3.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.3.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.3.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.30.input_layernorm.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.30.mlp.down_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.30.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.30.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.30.post_attention_layernorm.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.30.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.30.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.30.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.30.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.31.input_layernorm.weight": "model-00004-of-00004.safetensors",
+ "language_model.layers.31.mlp.down_proj.weight": "model-00004-of-00004.safetensors",
+ "language_model.layers.31.mlp.gate_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.31.mlp.up_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.31.post_attention_layernorm.weight": "model-00004-of-00004.safetensors",
+ "language_model.layers.31.self_attn.k_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.31.self_attn.o_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.31.self_attn.q_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.31.self_attn.v_proj.weight": "model-00003-of-00004.safetensors",
+ "language_model.layers.4.input_layernorm.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.4.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.4.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.4.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.4.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.4.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.4.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.4.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.4.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.5.input_layernorm.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.5.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.5.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.5.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.5.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.5.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.5.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.5.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.5.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.6.input_layernorm.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.6.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.6.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.6.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.6.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.6.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.6.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.6.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.6.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.7.input_layernorm.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.7.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.7.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.7.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.7.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.7.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.7.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.7.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.7.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.8.input_layernorm.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.8.mlp.down_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.8.mlp.gate_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.8.mlp.up_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.8.post_attention_layernorm.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.8.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.8.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.8.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.8.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.9.input_layernorm.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.9.mlp.down_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.9.mlp.gate_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.9.mlp.up_proj.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.9.post_attention_layernorm.weight": "model-00002-of-00004.safetensors",
+ "language_model.layers.9.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.9.self_attn.o_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.9.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.layers.9.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+ "language_model.norm.weight": "model-00004-of-00004.safetensors",
+ "latent_attention_model.cross_attend_blocks.0.fn.to_kv.weight": "model-00004-of-00004.safetensors",
+ "latent_attention_model.cross_attend_blocks.0.fn.to_out.weight": "model-00004-of-00004.safetensors",
+ "latent_attention_model.cross_attend_blocks.0.fn.to_q.weight": "model-00004-of-00004.safetensors",
+ "latent_attention_model.cross_attend_blocks.0.norm.bias": "model-00004-of-00004.safetensors",
+ "latent_attention_model.cross_attend_blocks.0.norm.weight": "model-00004-of-00004.safetensors",
+ "latent_attention_model.cross_attend_blocks.0.norm_context.bias": "model-00004-of-00004.safetensors",
+ "latent_attention_model.cross_attend_blocks.0.norm_context.weight": "model-00004-of-00004.safetensors",
+ "latent_attention_model.cross_attend_blocks.1.fn.net.0.bias": "model-00004-of-00004.safetensors",
+ "latent_attention_model.cross_attend_blocks.1.fn.net.0.weight": "model-00004-of-00004.safetensors",
+ "latent_attention_model.cross_attend_blocks.1.fn.net.2.bias": "model-00004-of-00004.safetensors",
+ "latent_attention_model.cross_attend_blocks.1.fn.net.2.weight": "model-00004-of-00004.safetensors",
+ "latent_attention_model.cross_attend_blocks.1.norm.bias": "model-00004-of-00004.safetensors",
+ "latent_attention_model.cross_attend_blocks.1.norm.weight": "model-00004-of-00004.safetensors",
+ "latent_attention_model.latents": "model-00004-of-00004.safetensors",
+ "multi_modal_projector.linear_1.bias": "model-00001-of-00004.safetensors",
+ "multi_modal_projector.linear_1.weight": "model-00001-of-00004.safetensors",
+ "multi_modal_projector.linear_2.bias": "model-00001-of-00004.safetensors",
+ "multi_modal_projector.linear_2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.embeddings.class_embedding": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.embeddings.patch_embedding.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.embeddings.position_embedding.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.0.layer_norm1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.0.layer_norm1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.0.layer_norm2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.0.layer_norm2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.0.mlp.fc1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.0.mlp.fc1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.0.mlp.fc2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.0.mlp.fc2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.0.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.0.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.0.self_attn.out_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.0.self_attn.out_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.0.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.0.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.0.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.0.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.1.layer_norm1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.1.layer_norm1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.1.layer_norm2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.1.layer_norm2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.1.mlp.fc1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.1.mlp.fc1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.1.mlp.fc2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.1.mlp.fc2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.1.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.1.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.1.self_attn.out_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.1.self_attn.out_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.1.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.1.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.1.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.1.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.10.layer_norm1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.10.layer_norm1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.10.layer_norm2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.10.layer_norm2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.10.mlp.fc1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.10.mlp.fc1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.10.mlp.fc2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.10.mlp.fc2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.10.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.10.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.10.self_attn.out_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.10.self_attn.out_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.10.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.10.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.10.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.10.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.11.layer_norm1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.11.layer_norm1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.11.layer_norm2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.11.layer_norm2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.11.mlp.fc1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.11.mlp.fc1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.11.mlp.fc2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.11.mlp.fc2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.11.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.11.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.11.self_attn.out_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.11.self_attn.out_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.11.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.11.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.11.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.11.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.12.layer_norm1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.12.layer_norm1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.12.layer_norm2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.12.layer_norm2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.12.mlp.fc1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.12.mlp.fc1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.12.mlp.fc2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.12.mlp.fc2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.12.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.12.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.12.self_attn.out_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.12.self_attn.out_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.12.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.12.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.12.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.12.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.13.layer_norm1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.13.layer_norm1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.13.layer_norm2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.13.layer_norm2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.13.mlp.fc1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.13.mlp.fc1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.13.mlp.fc2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.13.mlp.fc2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.13.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.13.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.13.self_attn.out_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.13.self_attn.out_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.13.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.13.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.13.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.13.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.14.layer_norm1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.14.layer_norm1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.14.layer_norm2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.14.layer_norm2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.14.mlp.fc1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.14.mlp.fc1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.14.mlp.fc2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.14.mlp.fc2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.14.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.14.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.14.self_attn.out_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.14.self_attn.out_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.14.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.14.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.14.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.14.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.15.layer_norm1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.15.layer_norm1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.15.layer_norm2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.15.layer_norm2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.15.mlp.fc1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.15.mlp.fc1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.15.mlp.fc2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.15.mlp.fc2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.15.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.15.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.15.self_attn.out_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.15.self_attn.out_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.15.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.15.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.15.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.15.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.16.layer_norm1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.16.layer_norm1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.16.layer_norm2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.16.layer_norm2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.16.mlp.fc1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.16.mlp.fc1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.16.mlp.fc2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.16.mlp.fc2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.16.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.16.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.16.self_attn.out_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.16.self_attn.out_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.16.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.16.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.16.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.16.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.17.layer_norm1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.17.layer_norm1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.17.layer_norm2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.17.layer_norm2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.17.mlp.fc1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.17.mlp.fc1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.17.mlp.fc2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.17.mlp.fc2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.17.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.17.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.17.self_attn.out_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.17.self_attn.out_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.17.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.17.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.17.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.17.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.18.layer_norm1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.18.layer_norm1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.18.layer_norm2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.18.layer_norm2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.18.mlp.fc1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.18.mlp.fc1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.18.mlp.fc2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.18.mlp.fc2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.18.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.18.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.18.self_attn.out_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.18.self_attn.out_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.18.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.18.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.18.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.18.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.19.layer_norm1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.19.layer_norm1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.19.layer_norm2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.19.layer_norm2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.19.mlp.fc1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.19.mlp.fc1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.19.mlp.fc2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.19.mlp.fc2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.19.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.19.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.19.self_attn.out_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.19.self_attn.out_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.19.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.19.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.19.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.19.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.2.layer_norm1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.2.layer_norm1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.2.layer_norm2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.2.layer_norm2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.2.mlp.fc1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.2.mlp.fc1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.2.mlp.fc2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.2.mlp.fc2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.2.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.2.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.2.self_attn.out_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.2.self_attn.out_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.2.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.2.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.2.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.2.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.20.layer_norm1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.20.layer_norm1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.20.layer_norm2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.20.layer_norm2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.20.mlp.fc1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.20.mlp.fc1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.20.mlp.fc2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.20.mlp.fc2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.20.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.20.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.20.self_attn.out_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.20.self_attn.out_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.20.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.20.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.20.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.20.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.21.layer_norm1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.21.layer_norm1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.21.layer_norm2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.21.layer_norm2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.21.mlp.fc1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.21.mlp.fc1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.21.mlp.fc2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.21.mlp.fc2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.21.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.21.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.21.self_attn.out_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.21.self_attn.out_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.21.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.21.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.21.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.21.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.22.layer_norm1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.22.layer_norm1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.22.layer_norm2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.22.layer_norm2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.22.mlp.fc1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.22.mlp.fc1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.22.mlp.fc2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.22.mlp.fc2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.22.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.22.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.22.self_attn.out_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.22.self_attn.out_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.22.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.22.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.22.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.22.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.23.layer_norm1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.23.layer_norm1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.23.layer_norm2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.23.layer_norm2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.23.mlp.fc1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.23.mlp.fc1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.23.mlp.fc2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.23.mlp.fc2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.23.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.23.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.23.self_attn.out_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.23.self_attn.out_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.23.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.23.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.23.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.23.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.3.layer_norm1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.3.layer_norm1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.3.layer_norm2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.3.layer_norm2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.3.mlp.fc1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.3.mlp.fc1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.3.mlp.fc2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.3.mlp.fc2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.3.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.3.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.3.self_attn.out_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.3.self_attn.out_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.3.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.3.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.3.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.3.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.4.layer_norm1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.4.layer_norm1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.4.layer_norm2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.4.layer_norm2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.4.mlp.fc1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.4.mlp.fc1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.4.mlp.fc2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.4.mlp.fc2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.4.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.4.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.4.self_attn.out_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.4.self_attn.out_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.4.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.4.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.4.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.4.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.5.layer_norm1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.5.layer_norm1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.5.layer_norm2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.5.layer_norm2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.5.mlp.fc1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.5.mlp.fc1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.5.mlp.fc2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.5.mlp.fc2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.5.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.5.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.5.self_attn.out_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.5.self_attn.out_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.5.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.5.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.5.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.5.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.6.layer_norm1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.6.layer_norm1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.6.layer_norm2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.6.layer_norm2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.6.mlp.fc1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.6.mlp.fc1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.6.mlp.fc2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.6.mlp.fc2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.6.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.6.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.6.self_attn.out_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.6.self_attn.out_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.6.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.6.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.6.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.6.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.7.layer_norm1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.7.layer_norm1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.7.layer_norm2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.7.layer_norm2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.7.mlp.fc1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.7.mlp.fc1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.7.mlp.fc2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.7.mlp.fc2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.7.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.7.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.7.self_attn.out_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.7.self_attn.out_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.7.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.7.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.7.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.7.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.8.layer_norm1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.8.layer_norm1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.8.layer_norm2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.8.layer_norm2.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.8.mlp.fc1.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.8.mlp.fc1.weight": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.8.mlp.fc2.bias": "model-00001-of-00004.safetensors",
+ "vision_tower.vision_model.encoder.layers.8.mlp.fc2.weight": "model-00001-of-00004.safetensors",
678
+ "vision_tower.vision_model.encoder.layers.8.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
679
+ "vision_tower.vision_model.encoder.layers.8.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
680
+ "vision_tower.vision_model.encoder.layers.8.self_attn.out_proj.bias": "model-00001-of-00004.safetensors",
681
+ "vision_tower.vision_model.encoder.layers.8.self_attn.out_proj.weight": "model-00001-of-00004.safetensors",
682
+ "vision_tower.vision_model.encoder.layers.8.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
683
+ "vision_tower.vision_model.encoder.layers.8.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
684
+ "vision_tower.vision_model.encoder.layers.8.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
685
+ "vision_tower.vision_model.encoder.layers.8.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
686
+ "vision_tower.vision_model.encoder.layers.9.layer_norm1.bias": "model-00001-of-00004.safetensors",
687
+ "vision_tower.vision_model.encoder.layers.9.layer_norm1.weight": "model-00001-of-00004.safetensors",
688
+ "vision_tower.vision_model.encoder.layers.9.layer_norm2.bias": "model-00001-of-00004.safetensors",
689
+ "vision_tower.vision_model.encoder.layers.9.layer_norm2.weight": "model-00001-of-00004.safetensors",
690
+ "vision_tower.vision_model.encoder.layers.9.mlp.fc1.bias": "model-00001-of-00004.safetensors",
691
+ "vision_tower.vision_model.encoder.layers.9.mlp.fc1.weight": "model-00001-of-00004.safetensors",
692
+ "vision_tower.vision_model.encoder.layers.9.mlp.fc2.bias": "model-00001-of-00004.safetensors",
693
+ "vision_tower.vision_model.encoder.layers.9.mlp.fc2.weight": "model-00001-of-00004.safetensors",
694
+ "vision_tower.vision_model.encoder.layers.9.self_attn.k_proj.bias": "model-00001-of-00004.safetensors",
695
+ "vision_tower.vision_model.encoder.layers.9.self_attn.k_proj.weight": "model-00001-of-00004.safetensors",
696
+ "vision_tower.vision_model.encoder.layers.9.self_attn.out_proj.bias": "model-00001-of-00004.safetensors",
697
+ "vision_tower.vision_model.encoder.layers.9.self_attn.out_proj.weight": "model-00001-of-00004.safetensors",
698
+ "vision_tower.vision_model.encoder.layers.9.self_attn.q_proj.bias": "model-00001-of-00004.safetensors",
699
+ "vision_tower.vision_model.encoder.layers.9.self_attn.q_proj.weight": "model-00001-of-00004.safetensors",
700
+ "vision_tower.vision_model.encoder.layers.9.self_attn.v_proj.bias": "model-00001-of-00004.safetensors",
701
+ "vision_tower.vision_model.encoder.layers.9.self_attn.v_proj.weight": "model-00001-of-00004.safetensors",
702
+ "vision_tower.vision_model.post_layernorm.bias": "model-00001-of-00004.safetensors",
703
+ "vision_tower.vision_model.post_layernorm.weight": "model-00001-of-00004.safetensors",
704
+ "vision_tower.vision_model.pre_layrnorm.bias": "model-00001-of-00004.safetensors",
705
+ "vision_tower.vision_model.pre_layrnorm.weight": "model-00001-of-00004.safetensors"
706
+ }
707
+ }
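The index above maps each tensor name to the shard file that stores it. As a rough illustration of how such a map can be inspected, the sketch below inverts a small excerpt into a shard-to-tensors listing. The embedded JSON is a hypothetical excerpt (the `language_model.example.weight` entry and the helper name `tensors_by_shard` are made up for the example); only the `weight_map` layout is taken from the diff.

```python
import json
from collections import defaultdict

# Hypothetical excerpt of an index file; only the "weight_map" layout mirrors the diff.
INDEX_JSON = """
{
  "weight_map": {
    "vision_tower.vision_model.post_layernorm.bias": "model-00001-of-00004.safetensors",
    "vision_tower.vision_model.post_layernorm.weight": "model-00001-of-00004.safetensors",
    "language_model.example.weight": "model-00002-of-00004.safetensors"
  }
}
"""


def tensors_by_shard(index_text):
    """Invert the tensor-name -> shard-file map into shard-file -> sorted tensor names."""
    weight_map = json.loads(index_text)["weight_map"]
    shards = defaultdict(list)
    for tensor_name, shard_file in weight_map.items():
        shards[shard_file].append(tensor_name)
    return {shard: sorted(names) for shard, names in shards.items()}


grouped = tensors_by_shard(INDEX_JSON)
for shard, names in sorted(grouped.items()):
    print(shard, len(names))
```

Grouping by shard makes it easy to check which files must be downloaded to load only one sub-module, for example the vision tower.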
modeling_nvmmembed.py ADDED
@@ -0,0 +1,293 @@
+ import torch
+ import torch.nn.functional as F
+ from peft import PeftModel
+ from transformers import AutoTokenizer, AutoModel
+
+
+ import math
+ from dataclasses import dataclass
+ from typing import List, Optional, Tuple, Union
+
+ import numpy as np
+ import torch
+ import torch.utils.checkpoint
+ from torch import nn
+
+ from transformers import AutoModel, AutoConfig
+ from transformers import LlavaNextProcessor
+ from transformers import LlavaNextForConditionalGeneration, LlavaNextConfig
+ from transformers.models.llava_next.modeling_llava_next import LlavaNextCausalLMOutputWithPast, image_size_to_num_patches
+
+ class NVMMEmbedModel(LlavaNextForConditionalGeneration):
+     def __init__(self, config: LlavaNextConfig):
+         super().__init__(config)
+
+         nvemb_config = AutoConfig.from_pretrained(config.retriever, trust_remote_code=True)
+         nvemb_model = AutoModel.from_config(nvemb_config, trust_remote_code=True)
+         self.language_model = nvemb_model.embedding_model
+         self.latent_attention_model = nvemb_model.latent_attention_model
+
+         self.preprocess_fn = LlavaNextProcessor.from_pretrained(config._name_or_path)
+         self.preprocess_fn.tokenizer.padding_side = config.padding_side
+         self.preprocess_fn.tokenizer.add_eos_token = config.add_eos_token
+         self.global_image_patch_only = config.global_image_patch_only
+
+
+     def create_pool_mask(self, attention_mask, instruction_lengths):
+         pool_mask = attention_mask.clone()
+         if instruction_lengths.unique().shape[0] == 1:
+             length = instruction_lengths[0].item()
+             pool_mask[:, :length] = 0
+         else:
+             for i, length in enumerate(instruction_lengths):
+                 pool_mask[i, :length] = 0
+         return pool_mask
+
+     def calculate_instruction_length(self, tokenizer, prompts, prefix):
+         instructions = []
+         instruction_lengths = []
+         for prompt in prompts:
+             if prefix in prompt:
+                 instruction = prompt.split(prefix)[0]
+                 input_ids = tokenizer(instruction, return_tensors=None)['input_ids']
+                 instruction_length = len(input_ids)
+                 if '<image>' in instruction:
+                     instruction_length += (576 - 1)  # assumes one <image> token expands to 576 patch embeddings
+                 instruction_lengths.append(instruction_length)
+             else:
+                 instruction_lengths.append(0)
+         return instruction_lengths
+
+     def forward(
+         self,
+         input_ids: torch.LongTensor = None,
+         pixel_values: torch.FloatTensor = None,
+         image_sizes: Optional[torch.LongTensor] = None,
+         attention_mask: Optional[torch.Tensor] = None,
+         instruction_lengths: Optional[torch.Tensor] = None,
+         position_ids: Optional[torch.LongTensor] = None,
+         past_key_values: Optional[List[torch.FloatTensor]] = None,
+         inputs_embeds: Optional[torch.FloatTensor] = None,
+         vision_feature_layer: Optional[int] = None,
+         vision_feature_select_strategy: Optional[str] = None,
+         labels: Optional[torch.LongTensor] = None,
+         use_cache: Optional[bool] = None,
+         output_attentions: Optional[bool] = None,
+         output_hidden_states: Optional[bool] = None,
+         return_dict: Optional[bool] = None,
+     ) -> Union[Tuple, LlavaNextCausalLMOutputWithPast]:
+         r"""
+         Args:
+             labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
+                 Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
+                 config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
+                 (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
+
+         Returns:
+
+         Example:
+
+         ```python
+         >>> from PIL import Image
+         >>> import requests
+         >>> from transformers import AutoProcessor, LlavaNextForConditionalGeneration
+
+         >>> model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
+         >>> processor = AutoProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
+
+         >>> prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
+         >>> url = "https://www.ilankelman.org/stopsigns/australia.jpg"
+         >>> image = Image.open(requests.get(url, stream=True).raw)
+
+         >>> inputs = processor(text=prompt, images=image, return_tensors="pt")
+
+         >>> # Generate
+         >>> generate_ids = model.generate(**inputs, max_length=30)
+         >>> processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+         "[INST] \nWhat is shown in this image? [/INST] The image appears to be a radar chart, which is a type of multi-dimensional plot (...)"
+         ```"""
+
+         output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
+         output_hidden_states = (
+             output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
+         )
+         return_dict = return_dict if return_dict is not None else self.config.use_return_dict
+         vision_feature_layer = (
+             vision_feature_layer if vision_feature_layer is not None else self.config.vision_feature_layer
+         )
+         vision_feature_select_strategy = (
+             vision_feature_select_strategy
+             if vision_feature_select_strategy is not None
+             else self.config.vision_feature_select_strategy
+         )
+         clip_global_image_feature = None
+
+         if inputs_embeds is None:
+             # 1. Extract the input embeddings
+             # In case image_token_index is not in the embeddings (an extra token that the embedding matrix lacks)
+             for_inputs_embeds_ids = input_ids.clone()
+             for_inputs_embeds_ids[(input_ids == self.config.image_token_index)] = 0
+             for_inputs_embeds_ids[(input_ids == 32001)] = 2  # We use the tokenizer from LLaVA-NeXT but later replace the PAD token with the EOS token
+             inputs_embeds = self.language_model.get_input_embeddings()(for_inputs_embeds_ids)
+             # 2. Merge text and images
+             if pixel_values is not None and input_ids.shape[1] != 1 and pixel_values.size(0) > 0:
+                 # ! infer image_num_patches from image_sizes
+                 image_num_patches = [
+                     image_size_to_num_patches(
+                         image_size=imsize,
+                         grid_pinpoints=self.config.image_grid_pinpoints,
+                         patch_size=self.config.vision_config.image_size,
+                     )
+                     for imsize in image_sizes
+                 ]
+                 # figure out if pixel_values is concatenated or stacked
+                 if pixel_values.dim() == 5:
+                     # stacking when input is (batch_size, num_patches, num_channels, height, width)
+                     _pixel_values_list = [
+                         pix_val[:num_patch] for pix_val, num_patch in zip(pixel_values, image_num_patches)
+                     ]
+                     if pixel_values.shape[1] == 1:
+                         image_num_patches = [1 for imsize in image_sizes]
+                     pixel_values = torch.cat(_pixel_values_list, dim=0)
+                 elif pixel_values.dim() != 4:
+                     # otherwise has to be stacked from list of (num_patches, num_channels, height, width)
+                     raise ValueError(f"pixel_values of shape {pixel_values.shape}, expect to be of 4 or 5 dimensions")
+
+                 image_features = self.vision_tower(pixel_values, output_hidden_states=True)
+                 clip_global_image_feature = image_features.pooler_output
+                 selected_image_feature = image_features.hidden_states[vision_feature_layer]
+
+                 if vision_feature_select_strategy == "default":
+                     selected_image_feature = selected_image_feature[:, 1:]
+                 elif vision_feature_select_strategy == "full":
+                     selected_image_feature = selected_image_feature
+
+                 image_features = self.multi_modal_projector(selected_image_feature)
+                 image_features = torch.split(image_features, image_num_patches, dim=0)
+
+                 # NOTE we only support multimodal_patch_merge_type == "spatial_unpad"
+
+                 image_features, feature_lens = self.pack_image_features(
+                     image_features,
+                     image_sizes,
+                     image_newline=self.image_newline,
+                 )
+
+                 inputs_embeds = inputs_embeds.to(image_features.dtype)
+                 inputs_embeds, attention_mask, position_ids, labels, _ = self._merge_input_ids_with_image_features(
+                     image_features,
+                     feature_lens,
+                     inputs_embeds,
+                     input_ids,
+                     attention_mask,
+                     position_ids,
+                     labels=labels,
+                 )
+
+             # pixel_values is not None but is empty ---> text only cases
+             elif pixel_values is not None and input_ids.shape[1] != 1 and pixel_values.size(0) == 0:
+                 # there are no images
+                 pass
+
+             # In case input_ids.shape[1] == 1 & pixel_values==None & past_key_values != None, we are in the case of
+             # generation with cache
+             elif past_key_values is not None and pixel_values is not None and input_ids.shape[1] == 1:
+                 # Retrieve the first layer to inspect the logits and mask out the hidden states
+                 # that are set to 0
+                 first_layer_past_key_value = past_key_values[0][0][:, :, :, 0]
+
+                 # Sum all dimensions of head_dim (-2) to avoid random errors such as: https://github.com/huggingface/transformers/pull/28032#issuecomment-1863691941
+                 batch_index, non_attended_tokens = torch.where(first_layer_past_key_value.float().sum(-2) == 0)
+
+                 # Get the target length
+                 target_length = input_ids.shape[1]
+                 past_length = first_layer_past_key_value.shape[-1]
+
+                 extended_attention_mask = torch.ones(
+                     (attention_mask.shape[0], past_length),
+                     dtype=attention_mask.dtype,
+                     device=attention_mask.device,
+                 )
+
+                 # Filter out only the tokens that can be un-attended, this can happen
+                 # if one uses Llava + Fused modules where the cache on the
+                 # first iteration is already big enough, or if one passes custom cache
+                 valid_indices = non_attended_tokens < extended_attention_mask.size(-1)
+                 new_batch_index = batch_index[valid_indices]
+                 new_non_attended_tokens = non_attended_tokens[valid_indices]
+
+                 # Zero-out the places where we don't need to attend
+                 extended_attention_mask[new_batch_index, new_non_attended_tokens] = 0
+
+                 attention_mask = torch.cat((extended_attention_mask, attention_mask[:, -target_length:]), dim=1)
+
+                 position_ids = torch.sum(attention_mask, dim=1).unsqueeze(-1) - 1
+
+         outputs = self.language_model(
+             attention_mask=attention_mask,
+             position_ids=position_ids,
+             past_key_values=past_key_values,
+             inputs_embeds=inputs_embeds,
+             use_cache=use_cache,
+             output_attentions=output_attentions,
+             output_hidden_states=output_hidden_states,
+             return_dict=return_dict,
+         )
+
+         pool_mask = self.create_pool_mask(attention_mask, instruction_lengths)
+
+         embeds = self.latent_attention_model(
+             outputs.last_hidden_state,
+             pool_mask,
+         )
+
+
+         return LlavaNextCausalLMOutputWithPast(
+             loss=None,
+             logits=None,
+             past_key_values=None,
+             hidden_states=embeds,
+             attentions=outputs.attentions,
+             image_hidden_states=clip_global_image_feature,
+         )
+
+     @torch.no_grad()
+     def encode(self, inputs, is_query = False, instruction = None, max_length = 512, query_prefix = 'Query: '):
+         assert isinstance(inputs, list), 'inputs should be a list of dictionaries'
+         prompts, imgs = [], []
+         if is_query:
+             if instruction is not None:
+                 prompt_template = f"Instruct: {instruction}\n{query_prefix}<image>\n<text>"
+             else:
+                 prompt_template = f"{query_prefix}<image>\n<text>"
+         else:
+             prompt_template = "<image>\n<text>"
+
+         for input_ in inputs:
+             if 'img' in input_:
+                 imgs.append(input_['img'])
+                 prompt = prompt_template
+             else:
+                 prompt = prompt_template.replace('<image>\n', '')
+
+             if ('txt' in input_) and (input_['txt'] is not None):
+                 prompt = prompt.replace('<text>', input_['txt'])
+             else:
+                 prompt = prompt.replace('<text>', '')
+
+             prompts.append(prompt)
+
+         if len(imgs) == 0:
+             imgs = None
+         collated_features = self.preprocess_fn(prompts, imgs, return_tensors="pt", padding="longest", max_length=max_length, truncation=True).to(self.device)
+         if self.global_image_patch_only and (imgs is not None):  # by default, only the global image patch is used
+             collated_features['pixel_values'] = collated_features['pixel_values'][:, 0:1]
+
+         instruction_lengths = self.calculate_instruction_length(self.preprocess_fn.tokenizer, prompts, f'\n{query_prefix}')
+         collated_features['instruction_lengths'] = torch.tensor(instruction_lengths).to(self.device)
+
+         return self(**collated_features)
+
+
+ AutoModel.register(LlavaNextConfig, NVMMEmbedModel)
+ NVMMEmbedModel.register_for_auto_class("AutoModel")
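To make the prompt construction in `encode` and the masking in `create_pool_mask` easier to follow, here is a minimal framework-free sketch of both steps. It uses plain Python lists instead of tensors, and the helper names `build_prompt` and `pool_mask` are hypothetical stand-ins, not part of the file above. It illustrates the logic as read from the diff and is not a replacement for the model code.

```python
def build_prompt(item, is_query=False, instruction=None, query_prefix="Query: "):
    """Mirror the template logic of `encode` for one dict with optional 'img'/'txt' keys."""
    if is_query and instruction is not None:
        template = f"Instruct: {instruction}\n{query_prefix}<image>\n<text>"
    elif is_query:
        template = f"{query_prefix}<image>\n<text>"
    else:
        template = "<image>\n<text>"
    if "img" not in item:
        # Text-only input: drop the image placeholder and its newline.
        template = template.replace("<image>\n", "")
    return template.replace("<text>", item.get("txt") or "")


def pool_mask(attention_mask, instruction_lengths):
    """Zero the first `length` positions of each row so pooling skips the instruction."""
    return [
        [0 if pos < length else bit for pos, bit in enumerate(row)]
        for row, length in zip(attention_mask, instruction_lengths)
    ]


# `object()` stands in for a PIL image; only the presence of the 'img' key matters here.
query = build_prompt({"txt": "a red stop sign"}, is_query=True, instruction="Find a matching photo.")
document = build_prompt({"img": object(), "txt": "street scene"})
mask = pool_mask([[1, 1, 1, 1, 1], [0, 1, 1, 1, 1]], [2, 3])
print(repr(query), repr(document), mask)
```

Note that the mask is cleared by position from the start of each row, exactly as the slicing `pool_mask[i, :length] = 0` does, regardless of where padding sits.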
preprocessor_config.json ADDED
@@ -0,0 +1,52 @@
+ {
+   "aspect_ratio_setting": "anyres",
+   "crop_size": {
+     "height": 336,
+     "width": 336
+   },
+   "do_center_crop": true,
+   "do_convert_rgb": true,
+   "do_normalize": true,
+   "do_pad": true,
+   "do_rescale": true,
+   "do_resize": true,
+   "image_grid_pinpoints": [
+     [
+       336,
+       672
+     ],
+     [
+       672,
+       336
+     ],
+     [
+       672,
+       672
+     ],
+     [
+       1008,
+       336
+     ],
+     [
+       336,
+       1008
+     ]
+   ],
+   "image_mean": [
+     0.48145466,
+     0.4578275,
+     0.40821073
+   ],
+   "image_processor_type": "LlavaNextImageProcessor",
+   "image_std": [
+     0.26862954,
+     0.26130258,
+     0.27577711
+   ],
+   "processor_class": "LlavaNextProcessor",
+   "resample": 3,
+   "rescale_factor": 0.00392156862745098,
+   "size": {
+     "shortest_edge": 336
+   }
+ }
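The numeric fields in this config describe simple per-pixel arithmetic: rescale by `rescale_factor` (1/255), then subtract `image_mean` and divide by `image_std` per channel. The sketch below works that through for one pixel and also counts how many 336x336 tiles a grid pinpoint yields. The tile count assumes the usual LLaVA-NeXT "anyres" layout of one tile per 336-pixel cell plus one downscaled global view; that layout and the helper names are assumptions for illustration, not something this config file states.

```python
RESCALE_FACTOR = 0.00392156862745098  # 1 / 255, from "rescale_factor"
IMAGE_MEAN = [0.48145466, 0.4578275, 0.40821073]
IMAGE_STD = [0.26862954, 0.26130258, 0.27577711]
TILE = 336  # "crop_size" height and width


def normalize_pixel(rgb):
    """Rescale an 8-bit RGB triple to [0, 1], then standardize each channel."""
    return [(v * RESCALE_FACTOR - m) / s for v, m, s in zip(rgb, IMAGE_MEAN, IMAGE_STD)]


def tiles_for_pinpoint(height, width, tile=TILE):
    """Count tile-sized crops covering a pinpoint, plus one global view (assumed layout)."""
    rows = -(-height // tile)  # ceiling division
    cols = -(-width // tile)
    return rows * cols + 1


white = normalize_pixel([255, 255, 255])
counts = {p: tiles_for_pinpoint(*p) for p in [(336, 672), (672, 672), (1008, 336)]}
print(white, counts)
```

When `global_image_patch_only` is enabled in the model code, only the first of these views is kept, so the tile count matters mainly for the full anyres path.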
special_tokens_map.json ADDED
@@ -0,0 +1,30 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dadfd56d766715c61d2ef780a525ab43b8e6da4de6865bda3d95fdef5e134055
+ size 493443
tokenizer_config.json ADDED
@@ -0,0 +1,65 @@
+ {
+   "add_bos_token": true,
+   "add_eos_token": false,
+   "add_prefix_space": null,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "32000": {
+       "content": "<image>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "32001": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "additional_special_tokens": [],
+   "bos_token": "<s>",
+   "chat_template": "{{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}",
+   "clean_up_tokenization_spaces": false,
+   "eos_token": "</s>",
+   "legacy": true,
+   "max_length": null,
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_to_multiple_of": null,
+   "pad_token": "<pad>",
+   "pad_token_type_id": 0,
+   "padding_side": "left",
+   "processor_class": "LlavaNextProcessor",
+   "sp_model_kwargs": {},
+   "spaces_between_special_tokens": false,
+   "tokenizer_class": "LlamaTokenizer",
+   "unk_token": "<unk>",
+   "use_default_system_prompt": false
+ }
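The `chat_template` above is a Jinja template that enforces alternating user/assistant turns, and `padding_side` together with the `<pad>` id 32001 fixes how batches are padded. As a reading aid, the sketch below re-expresses both behaviours in plain Python. The function names `render_chat` and `left_pad` are hypothetical, and the code is a hand translation of the template, not the tokenizer's own implementation.

```python
BOS, EOS = "<s>", "</s>"
PAD_ID = 32001  # id of "<pad>" in added_tokens_decoder


def render_chat(messages):
    """Plain-Python rendering of the alternating user/assistant chat template."""
    out = BOS
    for index, message in enumerate(messages):
        # Even positions must be user turns, odd positions assistant turns.
        if (message["role"] == "user") != (index % 2 == 0):
            raise ValueError("Conversation roles must alternate user/assistant/user/assistant/...")
        if message["role"] == "user":
            out += "[INST] " + message["content"] + " [/INST]"
        elif message["role"] == "assistant":
            out += message["content"] + EOS
        else:
            raise ValueError("Only user and assistant roles are supported!")
    return out


def left_pad(token_ids, length, pad_id=PAD_ID):
    """Pad on the left, matching `padding_side` = left in the config."""
    return [pad_id] * (length - len(token_ids)) + token_ids


rendered = render_chat([
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello"},
])
padded = left_pad([1, 5], 4)
print(rendered, padded)
```

Left padding keeps the final real token in the last position of every row, which is convenient when the last hidden state or a trailing EOS token carries the pooled representation.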