gmastrapas committed
Commit 3f4007a
1 Parent(s): 727b2f2

feat: update checkpoint to latest

README.md CHANGED
@@ -133,6 +133,7 @@ inference: false
   <b>Jina CLIP: your CLIP model is also your text retriever!</b>
   </p>
 
+
 ## Quick Start
 
 [Blog](https://jina.ai/news/jina-embeddings-v3-a-frontier-multilingual-embedding-model/#parameter-dimensions) | [Azure](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/jinaai.jina-clip-v2) | [AWS SageMaker](https://aws.amazon.com/marketplace/pp/prodview-kdi3xkt62lo32) | [API](https://jina.ai/embeddings)
@@ -144,8 +145,8 @@ inference: false
 
 `jina-clip-v2` is a successor to the [`jina-clip-v1`](https://huggingface.co/jinaai/jina-clip-v1) model and brings new features and capabilities, such as:
 * *support for multiple languages* - the text tower now supports 100 languages with tuning focus on **Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, Georgian, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Latvian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Thai, Turkish, Ukrainian, Urdu,** and **Vietnamese.**
-* *embedding truncation on both image and text vectors* - both towers are trained using [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147) which enables slicing the output vectors and in as a result computation and storage costs as well.
-* *visual document retrieval performance boost* - with an image resolution of 512 (compared to 224 on `jina-clip-v1`) the image tower can now capture finer visual details. This feature along with a more diverse training set enable the model to perform much better on visual document retrieval tasks. This enable `jina-clip-v2` as a strong encoder for future vLLM based retriever.
+* *embedding truncation on both image and text vectors* - both towers are trained using [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147), which enables slicing the output vectors and consequently reducing computation and storage costs.
+* *visual document retrieval performance gains* - with an image resolution of 512 (compared to 224 on `jina-clip-v1`), the image tower can now capture finer visual details. This feature, along with a more diverse training set, enables the model to perform much better on visual document retrieval tasks. Because of this, `jina-clip-v2` can be used as an image encoder in vLLM retriever architectures.
 
 Similar to our predecessor model, `jina-clip-v2` bridges the gap between text-to-text and cross-modal retrieval. Via a single vector space, `jina-clip-v2` offers state-of-the-art performance on both tasks.
 This dual capability makes it an excellent tool for multimodal retrieval-augmented generation (MuRAG) applications, enabling seamless text-to-text and text-to-image searches within a single model.
@@ -155,6 +156,7 @@ This dual capability makes it an excellent tool for multimodal retrieval-augment
 
 [Check out our paper](https://arxiv.org/abs/2405.20204). Updated technical report for v2 coming soon!
 
+
 ## Usage
 
 1. The easiest way to start using jina-clip-v2 is via Jina AI's [Embeddings API](https://jina.ai/embeddings/).
@@ -252,6 +254,7 @@ console.log(cos_sim(text_embeds[1].data, image_embeds[0].data)) // text-image cr
 console.log(cos_sim(text_embeds[1].data, image_embeds[1].data)) // text-image cross-modal similarity
 ```
 
+
 ## Performance
 
 ### Text-Image Retrieval
@@ -262,10 +265,12 @@ Coming soon!
 
 Coming soon!
 
+
 ## Contact
 
 Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.
 
+
 ## Citation
 
 If you find `jina-clip-v2` useful in your research, please cite the following paper:
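As a rough illustration of the embedding-truncation feature described in the README diff above, the sketch below slices a 1024-dimensional embedding down to a smaller size before computing similarity. This is a minimal sketch, not code from the model card: it assumes the embeddings arrive as plain numeric arrays (like `text_embeds[0].data` in the JavaScript snippet shown in the diff), and that Matryoshka-truncated vectors are re-normalized before taking cosine similarity; the helper names `truncate` and `dot` are illustrative.

```js
// Minimal sketch of Matryoshka-style truncation (not part of the model card).
// Assumes `embedding` is a plain Array or Float32Array of length 1024.
function truncate(embedding, dim) {
  const sliced = Array.from(embedding).slice(0, dim);
  const norm = Math.sqrt(sliced.reduce((s, v) => s + v * v, 0)); // L2 norm
  return sliced.map((v) => v / norm); // re-normalize after slicing
}

// Cosine similarity of two unit vectors reduces to a dot product.
const dot = (a, b) => a.reduce((s, v, i) => s + v * b[i], 0);

// e.g. keep only the first 256 of 1024 dimensions before indexing/search:
// const q = truncate(text_embeds[0].data, 256);
// const d = truncate(image_embeds[0].data, 256);
// console.log(dot(q, d));
```

Smaller slices trade some retrieval quality for lower vector-database storage and faster scoring, which is the cost saving the README bullet refers to.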
config.json CHANGED
@@ -43,7 +43,7 @@
   "embed_dim": 1024,
   "fused_layer_norm": false,
   "head_width": 64,
-  "image_size": 384,
+  "image_size": 512,
   "intp_freq": true,
   "layers": 24,
   "ls_init_value": null,
model.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:a753294ed5d3d6dc4ae43f784824cdc3a6cbb7e8a815bff2ab200a3f411141a0
-size 1729527426
+oid sha256:771f189199cdc89d19ea7f01c120cb370a8e38405b882b7d20f04347cc372e13
+size 1730688642
preprocessor_config.json CHANGED
@@ -13,7 +13,7 @@
   ],
   "processor_class": "JinaCLIPProcessor",
   "resize_mode": "shortest",
-  "size": 384,
+  "size": 512,
   "std": [
     0.26862954,
     0.26130258,
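This `preprocessor_config.json` change pairs with the `image_size` bump in `config.json`: with `"resize_mode": "shortest"` and `"size": 512`, the shorter image side is scaled to 512 pixels while preserving the aspect ratio. The snippet below is only an illustration of that arithmetic, not the processor's actual implementation (resizing is handled inside `JinaCLIPProcessor`).

```js
// Illustration: output dimensions implied by "resize_mode": "shortest", "size": 512.
function shortestEdgeResize(width, height, size = 512) {
  const scale = size / Math.min(width, height); // scale so the shorter side equals `size`
  return { width: Math.round(width * scale), height: Math.round(height * scale) };
}

console.log(shortestEdgeResize(1024, 768)); // { width: 683, height: 512 }
```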
pytorch_model.bin CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:7dcfd3e9d325dd8a59bbce810b59be028f41fc5c6a478e4cc9b5ba0701f61004
-size 1729735014
+oid sha256:f1759bc4662735c42f65262d3d3477aa2dda6a947d6c504d9aaca17b5cd051d9
+size 1730896230
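Both weight files are Git LFS pointers, so the commit records only the new `oid sha256` digest and byte size. A sketch (Node.js) for checking that a locally downloaded checkpoint matches the pointer's digest; the local file path is an assumption and should point at wherever the file was saved.

```js
// Verify a downloaded checkpoint against the sha256 oid recorded in the LFS pointer.
const crypto = require('crypto');
const fs = require('fs');

function sha256File(path) {
  return new Promise((resolve, reject) => {
    const hash = crypto.createHash('sha256');
    fs.createReadStream(path)
      .on('data', (chunk) => hash.update(chunk))
      .on('end', () => resolve(hash.digest('hex')))
      .on('error', reject);
  });
}

sha256File('model.safetensors').then((digest) => {
  const expected = '771f189199cdc89d19ea7f01c120cb370a8e38405b882b7d20f04347cc372e13';
  console.log(digest === expected ? 'checksum OK' : 'checksum mismatch');
});
```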