gmastrapas committed
Commit 3f4007a
1 Parent(s): 727b2f2

feat: update checkpoint to latest

README.md CHANGED
@@ -133,6 +133,7 @@ inference: false
   <b>Jina CLIP: your CLIP model is also your text retriever!</b>
   </p>
 
+
 ## Quick Start
 
 [Blog](https://jina.ai/news/jina-embeddings-v3-a-frontier-multilingual-embedding-model/#parameter-dimensions) | [Azure](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/jinaai.jina-clip-v2) | [AWS SageMaker](https://aws.amazon.com/marketplace/pp/prodview-kdi3xkt62lo32) | [API](https://jina.ai/embeddings)
@@ -144,8 +145,8 @@ inference: false
 
 `jina-clip-v2` is a successor to the [`jina-clip-v1`](https://huggingface.co/jinaai/jina-clip-v1) model and brings new features and capabilities, such as:
 * *support for multiple languages* - the text tower now supports 100 languages with tuning focus on **Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, Georgian, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Latvian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Thai, Turkish, Ukrainian, Urdu,** and **Vietnamese.**
-* *embedding truncation on both image and text vectors* - both towers are trained using [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147) which enables slicing the output vectors and in as a result computation and storage costs as well.
-* *visual document retrieval performance boost* - with an image resolution of 512 (compared to 224 on `jina-clip-v1`) the image tower can now capture finer visual details. This feature along with a more diverse training set enable the model to perform much better on visual document retrieval tasks. This enable `jina-clip-v2` as a strong encoder for future vLLM based retriever.
+* *embedding truncation on both image and text vectors* - both towers are trained using [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147), which enables slicing the output vectors and consequently reducing computation and storage costs.
+* *visual document retrieval performance gains* - with an image resolution of 512 (compared to 224 on `jina-clip-v1`), the image tower can now capture finer visual details. This feature, along with a more diverse training set, enables the model to perform much better on visual document retrieval tasks. Because of this, `jina-clip-v2` can be used as an image encoder in vLLM retriever architectures.
 
 Similar to our predecessor model, `jina-clip-v2` bridges the gap between text-to-text and cross-modal retrieval. Via a single vector space, `jina-clip-v2` offers state-of-the-art performance on both tasks.
 This dual capability makes it an excellent tool for multimodal retrieval-augmented generation (MuRAG) applications, enabling seamless text-to-text and text-to-image searches within a single model.
@@ -155,6 +156,7 @@ This dual capability makes it an excellent tool for multimodal retrieval-augment
 
 [Check out our paper](https://arxiv.org/abs/2405.20204). Updated technical report for v2 coming soon!
 
+
 ## Usage
 
 1. The easiest way to start using jina-clip-v2 is via Jina AI's [Embeddings API](https://jina.ai/embeddings/).
@@ -252,6 +254,7 @@ console.log(cos_sim(text_embeds[1].data, image_embeds[0].data)) // text-image cr
 console.log(cos_sim(text_embeds[1].data, image_embeds[1].data)) // text-image cross-modal similarity
 ```
 
+
 ## Performance
 
 ### Text-Image Retrieval
@@ -262,10 +265,12 @@ Coming soon!
 
 Coming soon!
 
+
 ## Contact
 
 Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.
 
+
 ## Citation
 
 If you find `jina-clip-v2` useful in your research, please cite the following paper:
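As a rough illustration of the embedding-truncation feature described in the README diff above, the sketch below slices a 1024-dimensional embedding down to a smaller size before computing similarity. This is a minimal sketch, not code from the model card: it assumes the embeddings arrive as plain numeric arrays (like `text_embeds[0].data` in the JavaScript snippet shown in the diff), and that Matryoshka-truncated vectors are re-normalized before taking cosine similarity; the helper names `truncate` and `dot` are illustrative.

```js
// Minimal sketch of Matryoshka-style truncation (not part of the model card).
// Assumes `embedding` is a plain Array or Float32Array of length 1024.
function truncate(embedding, dim) {
  const sliced = Array.from(embedding).slice(0, dim);
  const norm = Math.sqrt(sliced.reduce((s, v) => s + v * v, 0)); // L2 norm
  return sliced.map((v) => v / norm); // re-normalize after slicing
}

// Cosine similarity of two unit vectors reduces to a dot product.
const dot = (a, b) => a.reduce((s, v, i) => s + v * b[i], 0);

// e.g. keep only the first 256 of 1024 dimensions before indexing/search:
// const q = truncate(text_embeds[0].data, 256);
// const d = truncate(image_embeds[0].data, 256);
// console.log(dot(q, d));
```

Smaller slices trade some retrieval quality for lower vector-database storage and faster scoring, which is the cost saving the README bullet refers to.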
config.json CHANGED
@@ -43,7 +43,7 @@
   "embed_dim": 1024,
   "fused_layer_norm": false,
   "head_width": 64,
-  "image_size": 384,
+  "image_size": 512,
   "intp_freq": true,
   "layers": 24,
   "ls_init_value": null,
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:a753294ed5d3d6dc4ae43f784824cdc3a6cbb7e8a815bff2ab200a3f411141a0
-size 1729527426
+oid sha256:771f189199cdc89d19ea7f01c120cb370a8e38405b882b7d20f04347cc372e13
+size 1730688642
preprocessor_config.json CHANGED
@@ -13,7 +13,7 @@
   ],
   "processor_class": "JinaCLIPProcessor",
   "resize_mode": "shortest",
-  "size": 384,
+  "size": 512,
   "std": [
     0.26862954,
     0.26130258,
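This `preprocessor_config.json` change pairs with the `image_size` bump in `config.json`: with `"resize_mode": "shortest"` and `"size": 512`, the shorter image side is scaled to 512 pixels while preserving the aspect ratio. The snippet below is only an illustration of that arithmetic, not the processor's actual implementation (resizing is handled inside `JinaCLIPProcessor`).

```js
// Illustration: output dimensions implied by "resize_mode": "shortest", "size": 512.
function shortestEdgeResize(width, height, size = 512) {
  const scale = size / Math.min(width, height); // scale so the shorter side equals `size`
  return { width: Math.round(width * scale), height: Math.round(height * scale) };
}

console.log(shortestEdgeResize(1024, 768)); // { width: 683, height: 512 }
```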
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:7dcfd3e9d325dd8a59bbce810b59be028f41fc5c6a478e4cc9b5ba0701f61004
-size 1729735014
+oid sha256:f1759bc4662735c42f65262d3d3477aa2dda6a947d6c504d9aaca17b5cd051d9
+size 1730896230
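Both weight files are Git LFS pointers, so the commit records only the new `oid sha256` digest and byte size. A sketch (Node.js) for checking that a locally downloaded checkpoint matches the pointer's digest; the local file path is an assumption and should point at wherever the file was saved.

```js
// Verify a downloaded checkpoint against the sha256 oid recorded in the LFS pointer.
const crypto = require('crypto');
const fs = require('fs');

function sha256File(path) {
  return new Promise((resolve, reject) => {
    const hash = crypto.createHash('sha256');
    fs.createReadStream(path)
      .on('data', (chunk) => hash.update(chunk))
      .on('end', () => resolve(hash.digest('hex')))
      .on('error', reject);
  });
}

sha256File('model.safetensors').then((digest) => {
  const expected = '771f189199cdc89d19ea7f01c120cb370a8e38405b882b7d20f04347cc372e13';
  console.log(digest === expected ? 'checksum OK' : 'checksum mismatch');
});
```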