gmastrapas committed
Commit 3f4007a
1 Parent(s): 727b2f2

feat: update checkpoint to latest

Files changed:
- README.md +7 -2
- config.json +1 -1
- model.safetensors +2 -2
- preprocessor_config.json +1 -1
- pytorch_model.bin +2 -2
README.md
CHANGED
@@ -133,6 +133,7 @@ inference: false
 <b>Jina CLIP: your CLIP model is also your text retriever!</b>
 </p>
 
+
 ## Quick Start
 
 [Blog](https://jina.ai/news/jina-embeddings-v3-a-frontier-multilingual-embedding-model/#parameter-dimensions) | [Azure](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/jinaai.jina-clip-v2) | [AWS SageMaker](https://aws.amazon.com/marketplace/pp/prodview-kdi3xkt62lo32) | [API](https://jina.ai/embeddings)
@@ -144,8 +145,8 @@ inference: false
 
 `jina-clip-v2` is a successor to the [`jina-clip-v1`](https://huggingface.co/jinaai/jina-clip-v1) model and brings new features and capabilities, such as:
 * *support for multiple languages* - the text tower now supports 100 languages with tuning focus on **Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, Georgian, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Latvian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Thai, Turkish, Ukrainian, Urdu,** and **Vietnamese.**
-* *embedding truncation on both image and text vectors* - both towers are trained using [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147) which enables slicing the output vectors and
-* *visual document retrieval performance
+* *embedding truncation on both image and text vectors* - both towers are trained using [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147), which enables slicing the output vectors and, consequently, reducing computation and storage costs.
+* *visual document retrieval performance gains* - with an image resolution of 512 (compared to 224 on `jina-clip-v1`), the image tower can now capture finer visual details. This, along with a more diverse training set, enables the model to perform much better on visual document retrieval tasks. As a result, `jina-clip-v2` can be used as an image encoder in vLLM retriever architectures.
 
 Similar to our predecessor model, `jina-clip-v2` bridges the gap between text-to-text and cross-modal retrieval. Via a single vector space, `jina-clip-v2` offers state-of-the-art performance on both tasks.
 This dual capability makes it an excellent tool for multimodal retrieval-augmented generation (MuRAG) applications, enabling seamless text-to-text and text-to-image searches within a single model.
@@ -155,6 +156,7 @@ This dual capability makes it an excellent tool for multimodal retrieval-augment
 
 [Check out our paper](https://arxiv.org/abs/2405.20204). Updated technical report for v2 coming soon!
 
+
 ## Usage
 
 1. The easiest way to start using jina-clip-v2 is via Jina AI's [Embeddings API](https://jina.ai/embeddings/).
@@ -252,6 +254,7 @@ console.log(cos_sim(text_embeds[1].data, image_embeds[0].data)) // text-image cr
 console.log(cos_sim(text_embeds[1].data, image_embeds[1].data)) // text-image cross-modal similarity
 ```
 
+
 ## Performance
 
 ### Text-Image Retrieval
@@ -262,10 +265,12 @@ Coming soon!
 
 Coming soon!
 
+
 ## Contact
 
 Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.
 
+
 ## Citation
 
 If you find `jina-clip-v2` useful in your research, please cite the following paper:
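The embedding-truncation bullet above refers to Matryoshka-style vectors; as a rough illustration (not taken from the model card), the sketch below loads the checkpoint via `transformers` with `trust_remote_code=True` and truncates the 1024-dimensional text embeddings to 512 dimensions by slicing and re-normalizing. The `encode_text` helper is assumed to behave as documented for `jina-clip-v1`; the same slicing applies to any pooled embedding.

```python
# Hypothetical sketch: Matryoshka-style truncation of jina-clip-v2 embeddings.
# Assumes the remote-code model exposes encode_text as in the jina-clip-v1 card.
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-clip-v2", trust_remote_code=True)

texts = ["A blue cat", "A red cat"]
full = np.asarray(model.encode_text(texts))  # expected shape: (2, 1024)

dim = 512                                    # target Matryoshka dimension
truncated = full[:, :dim]                    # drop the trailing dimensions
truncated = truncated / np.linalg.norm(truncated, axis=1, keepdims=True)  # re-normalize for cosine similarity

print(truncated.shape)                       # (2, 512)
print(float(truncated[0] @ truncated[1]))    # text-text similarity at the reduced dimension
```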
config.json
CHANGED
@@ -43,7 +43,7 @@
     "embed_dim": 1024,
     "fused_layer_norm": false,
     "head_width": 64,
-    "image_size":
+    "image_size": 512,
     "intp_freq": true,
     "layers": 24,
     "ls_init_value": null,
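The `image_size` bump to 512 sits in the vision-tower block of `config.json`. One quick way to confirm which resolution a downloaded checkpoint expects is to read the config back through `transformers`; the attribute path below (`vision_config.image_size`) is an assumption about how the remote-code config nests this block.

```python
# Hypothetical check: read the vision tower's expected input resolution from config.json.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("jinaai/jina-clip-v2", trust_remote_code=True)

# Assumed nesting: the diffed keys ("embed_dim", "head_width", "image_size", ...)
# are presumed to live under the vision-tower sub-config.
print(config.vision_config.image_size)  # expected: 512 after this commit
```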
model.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:771f189199cdc89d19ea7f01c120cb370a8e38405b882b7d20f04347cc372e13
+size 1730688642
preprocessor_config.json
CHANGED
@@ -13,7 +13,7 @@
     ],
     "processor_class": "JinaCLIPProcessor",
     "resize_mode": "shortest",
-    "size":
+    "size": 512,
     "std": [
         0.26862954,
         0.26130258,
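With `preprocessor_config.json` now resizing the shortest image side to 512, processed inputs should come out at the higher resolution. A minimal sketch, assuming `AutoImageProcessor` resolves the `JinaCLIPProcessor`'s image processor via remote code:

```python
# Hypothetical sketch: verify that the updated preprocessor produces 512x512 inputs.
from PIL import Image
from transformers import AutoImageProcessor

processor = AutoImageProcessor.from_pretrained("jinaai/jina-clip-v2", trust_remote_code=True)

image = Image.new("RGB", (800, 600), color="white")   # dummy stand-in for a real image
inputs = processor(images=image, return_tensors="pt")

print(inputs["pixel_values"].shape)  # expected with the new config: torch.Size([1, 3, 512, 512])
```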
pytorch_model.bin
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:f1759bc4662735c42f65262d3d3477aa2dda6a947d6c504d9aaca17b5cd051d9
+size 1730896230
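The `model.safetensors` and `pytorch_model.bin` entries are Git LFS pointers, so the diff only swaps the expected SHA-256 and byte size of the new checkpoint. To confirm that locally downloaded weight files match the updated pointers, a streaming hash check along these lines works (file paths are illustrative):

```python
# Sketch: verify downloaded weight files against the SHA-256 values recorded in the LFS pointers.
import hashlib

EXPECTED = {
    "model.safetensors": "771f189199cdc89d19ea7f01c120cb370a8e38405b882b7d20f04347cc372e13",
    "pytorch_model.bin": "f1759bc4662735c42f65262d3d3477aa2dda6a947d6c504d9aaca17b5cd051d9",
}

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so multi-GB checkpoints are never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

for name, expected in EXPECTED.items():
    # Assumes the files sit in the current working directory.
    print(name, sha256_of(name) == expected)
```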