---
library_name: transformers
license: cc-by-nc-4.0
tags:
- xlm-roberta
- eva02
- clip
- feature-extraction
- sentence-similarity
- retrieval
- multimodal
- multi-modal
- crossmodal
- cross-modal
- mteb
- clip-benchmark
- vidore
- transformers
- sentence-transformers
- onnx
- safetensors
- transformers.js
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- no
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
inference: false
---

<br><br>

<p align="center">
<img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
</p>


<p align="center">
<b>The embedding set trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
</p>

<p align="center">
<b>Jina CLIP v2: Multilingual Multimodal Embeddings for Texts and Images</b>
</p>

## Quick Start

[Blog](https://jina.ai/news/jina-embeddings-v3-a-frontier-multilingual-embedding-model/#parameter-dimensions) | [Azure](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/jinaai.jina-clip-v2) | [AWS SageMaker](https://aws.amazon.com/marketplace/pp/prodview-kdi3xkt62lo32) | [API](https://jina.ai/embeddings)

## Intended Usage & Model Info

`jina-clip-v2` is a state-of-the-art **multilingual and multimodal (text-image) embedding model**. It is the successor to [`jina-clip-v1`](https://huggingface.co/jinaai/jina-clip-v1) and brings new features and capabilities, such as:

* *support for multiple languages* - the text tower is trained on 89 languages, with a tuning focus on *Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, Georgian, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Latvian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Thai, Turkish, Ukrainian, Urdu,* and *Vietnamese.*
* *embedding truncation on both image and text vectors* - both towers are trained using [Matryoshka Representation Learning](https://arxiv.org/abs/2205.13147), which enables truncating the output vectors and consequently reducing computation and storage costs.
* *visual document retrieval performance gains* - with an image resolution of 512 (compared to 224 on `jina-clip-v1`), the image tower can now capture finer visual details. This, along with a more diverse training set, enables the model to perform much better on visual document retrieval tasks. As a result, `jina-clip-v2` can be used as an image encoder in vLLM retriever architectures.
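
The embedding truncation mentioned in the list above can be illustrated with a small sketch. It assumes, as is common for Matryoshka-style embeddings, that truncation keeps the leading components of a vector and that the shortened vector is re-normalized before being compared by dot product. The helper name `truncate_embedding` and the toy vector are hypothetical and not part of the `jina-clip-v2` API.

```python
import math
from typing import List, Sequence


def truncate_embedding(vector: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale them to unit length.

    Hypothetical helper for illustration; not part of the model's API.
    """
    if not 0 < dim <= len(vector):
        raise ValueError(f'dim must be between 1 and {len(vector)}, got {dim}')
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # an all-zero vector cannot be normalized
    return [x / norm for x in head]


# Toy 4-dim vector standing in for a full 1024-dim embedding
full = [3.0, 4.0, 1.0, 2.0]
short = truncate_embedding(full, 2)
print(len(short), short)
```

Storage cost scales with the number of kept dimensions, so keeping 512 of 1024 dimensions halves the size of each stored vector.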

Like its predecessor, `jina-clip-v2` bridges the gap between text-to-text and cross-modal retrieval. Using a single vector space, `jina-clip-v2` offers state-of-the-art performance on both tasks.
This dual capability makes it an excellent tool for multimodal retrieval-augmented generation (MuRAG) applications, enabling seamless text-to-text and text-to-image searches within a single model.

## Data, Parameters, Training

An updated version of our [technical report](https://arxiv.org/abs/2405.20204) with details on `jina-clip-v2` is coming soon. Stay tuned!

## Usage

<details>
<summary>via Jina AI <a href="https://jina.ai/embeddings/">Embedding API</a></summary>

```bash
curl https://api.jina.ai/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer [JINA_AI_API_TOKEN]" \
  -d @- <<EOFEOF
  {
    "model": "jina-clip-v2",
    "dimensions": 1024,
    "task": "retrieval.query",
    "normalized": true,
    "embedding_type": "float",
    "input": [
        {
            "text": "غروب جميل على الشاطئ"
        },
        {
            "text": "海滩上美丽的日落"
        },
        {
            "text": "A beautiful sunset over the beach"
        },
        {
            "text": "Un beau coucher de soleil sur la plage"
        },
        {
            "text": "Ein wunderschöner Sonnenuntergang am Strand"
        },
        {
            "text": "Ένα όμορφο ηλιοβασίλεμα πάνω από την παραλία"
        },
        {
            "text": "समुद्र तट पर एक खूबसूरत सूर्यास्त"
        },
        {
            "text": "Un bellissimo tramonto sulla spiaggia"
        },
        {
            "text": "浜辺に沈む美しい夕日"
        },
        {
            "text": "해변 위로 아름다운 일몰"
        },
        {
            "image": "https://i.ibb.co/nQNGqL0/beach1.jpg"
        },
        {
            "image": "https://i.ibb.co/r5w8hG8/beach2.jpg"
        }
    ]
  }
EOFEOF
```

</details>
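
For readers who prefer Python over `curl`, the same request can be assembled with the standard library. This is a minimal sketch that only mirrors the endpoint, headers, and JSON fields of the `curl` call above. The function name `build_embedding_request` is hypothetical, the token is a placeholder, and the request is deliberately not sent because the response schema is not described in this card.

```python
import json
import urllib.request

API_URL = 'https://api.jina.ai/v1/embeddings'


def build_embedding_request(token, inputs, dimensions=1024, task='retrieval.query'):
    """Build (but do not send) a POST request mirroring the curl example.

    Hypothetical helper; the payload fields are copied from the curl call.
    """
    payload = {
        'model': 'jina-clip-v2',
        'dimensions': dimensions,
        'task': task,
        'normalized': True,
        'embedding_type': 'float',
        'input': inputs,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode('utf-8'),
        headers={
            'Content-Type': 'application/json',
            'Authorization': f'Bearer {token}',
        },
        method='POST',
    )


request = build_embedding_request(
    'JINA_AI_API_TOKEN',  # placeholder, replace with a real token
    [
        {'text': 'A beautiful sunset over the beach'},
        {'image': 'https://i.ibb.co/nQNGqL0/beach1.jpg'},
    ],
)

# Sending the request is left to the reader:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Text and image inputs share one `input` list, so a single call can embed both modalities.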

<details>
<summary>via <a href="https://huggingface.co/docs/transformers/en/index">transformers</a></summary>

```python
# !pip install transformers einops timm pillow
from transformers import AutoModel

# Initialize the model
model = AutoModel.from_pretrained('jinaai/jina-clip-v2', trust_remote_code=True)

# Corpus
sentences = [
    'غروب جميل على الشاطئ',  # Arabic
    '海滩上美丽的日落',  # Chinese
    'Un beau coucher de soleil sur la plage',  # French
    'Ein wunderschöner Sonnenuntergang am Strand',  # German
    'Ένα όμορφο ηλιοβασίλεμα πάνω από την παραλία',  # Greek
    'समुद्र तट पर एक खूबसूरत सूर्यास्त',  # Hindi
    'Un bellissimo tramonto sulla spiaggia',  # Italian
    '浜辺に沈む美しい夕日',  # Japanese
    '해변 위로 아름다운 일몰',  # Korean
]

# Public image URLs or PIL Images
image_urls = ['https://i.ibb.co/nQNGqL0/beach1.jpg', 'https://i.ibb.co/r5w8hG8/beach2.jpg']

# Choose a matryoshka dimension, set to None to get the full 1024-dim vectors
truncate_dim = 512

# Encode text and images
text_embeddings = model.encode_text(sentences, truncate_dim=truncate_dim)
image_embeddings = model.encode_image(
    image_urls, truncate_dim=truncate_dim
)  # also accepts PIL.Image.Image, local filenames, dataURI

# Encode query text
query = 'beautiful sunset over the beach'  # English
query_embeddings = model.encode_text(
    query, task='retrieval.query', truncate_dim=truncate_dim
)

# Text to Image
print('En -> Img: ' + str(query_embeddings @ image_embeddings[0].T))
# Image to Image
print('Img -> Img: ' + str(image_embeddings[0] @ image_embeddings[1].T))
# Text to Text
print('En -> Ar: ' + str(query_embeddings @ text_embeddings[0].T))
print('En -> Zh: ' + str(query_embeddings @ text_embeddings[1].T))
print('En -> Fr: ' + str(query_embeddings @ text_embeddings[2].T))
print('En -> De: ' + str(query_embeddings @ text_embeddings[3].T))
print('En -> Gr: ' + str(query_embeddings @ text_embeddings[4].T))
print('En -> Hi: ' + str(query_embeddings @ text_embeddings[5].T))
print('En -> It: ' + str(query_embeddings @ text_embeddings[6].T))
print('En -> Jp: ' + str(query_embeddings @ text_embeddings[7].T))
print('En -> Ko: ' + str(query_embeddings @ text_embeddings[8].T))
```

</details>
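
Once query and candidate embeddings are available, retrieval comes down to ranking candidates by their similarity to the query. The sketch below shows that step on plain Python lists with toy 2-dim vectors. The names `cosine_similarity` and `rank_by_similarity` are hypothetical helpers, and cosine similarity is used (rather than a raw dot product) so the result does not depend on whether the vectors are already normalized.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float], candidates: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (candidate index, score) pairs, most similar first."""
    scores = [(i, cosine_similarity(query, c)) for i, c in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real text or image embeddings
query_vec = [1.0, 0.0]
candidate_vecs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

ranking = rank_by_similarity(query_vec, candidate_vecs)
print([index for index, _ in ranking])
```

Because text and image vectors live in the same space, the candidate list can mix both modalities.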

<details>
<summary>via <a href="https://sbert.net/">sentence-transformers</a></summary>

```python
# !pip install sentence-transformers einops timm pillow
from sentence_transformers import SentenceTransformer

# Choose a matryoshka dimension
truncate_dim = 512

# Initialize the model
model = SentenceTransformer(
    'jinaai/jina-clip-v2', trust_remote_code=True, truncate_dim=truncate_dim
)

# Corpus
sentences = [
    'غروب جميل على الشاطئ',  # Arabic
    '海滩上美丽的日落',  # Chinese
    'Un beau coucher de soleil sur la plage',  # French
    'Ein wunderschöner Sonnenuntergang am Strand',  # German
    'Ένα όμορφο ηλιοβασίλεμα πάνω από την παραλία',  # Greek
    'समुद्र तट पर एक खूबसूरत सूर्यास्त',  # Hindi
    'Un bellissimo tramonto sulla spiaggia',  # Italian
    '浜辺に沈む美しい夕日',  # Japanese
    '해변 위로 아름다운 일몰',  # Korean
]

# Public image URLs or PIL Images
image_urls = ['https://i.ibb.co/nQNGqL0/beach1.jpg', 'https://i.ibb.co/r5w8hG8/beach2.jpg']

# Encode text and images
text_embeddings = model.encode(sentences)
image_embeddings = model.encode(image_urls)  # also accepts PIL.Image.Image, local filenames, dataURI

# Encode query text
query = 'beautiful sunset over the beach'  # English
query_embeddings = model.encode(query, prompt_name='retrieval.query')
```

</details>

<details>
<summary>via the <a href="https://onnxruntime.ai/">ONNX Runtime</a></summary>

```python
# !pip install transformers onnxruntime pillow
import onnxruntime as ort
from transformers import AutoImageProcessor, AutoTokenizer

# Load tokenizer and image processor using transformers
tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-clip-v2', trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(
    'jinaai/jina-clip-v2', trust_remote_code=True
)

# Corpus
sentences = [
    'غروب جميل على الشاطئ',  # Arabic
    '海滩上美丽的日落',  # Chinese
    'Un beau coucher de soleil sur la plage',  # French
    'Ein wunderschöner Sonnenuntergang am Strand',  # German
    'Ένα όμορφο ηλιοβασίλεμα πάνω από την παραλία',  # Greek
    'समुद्र तट पर एक खूबसूरत सूर्यास्त',  # Hindi
    'Un bellissimo tramonto sulla spiaggia',  # Italian
    '浜辺に沈む美しい夕日',  # Japanese
    '해변 위로 아름다운 일몰',  # Korean
]

# Public image URLs or PIL Images
image_urls = ['https://i.ibb.co/nQNGqL0/beach1.jpg', 'https://i.ibb.co/r5w8hG8/beach2.jpg']

# Tokenize input texts and transform input images.
# padding=True pads the sentences to a common length so that they can be
# stacked into a single NumPy array.
input_ids = tokenizer(sentences, padding=True, return_tensors='np')['input_ids']
pixel_values = image_processor(image_urls)['pixel_values']

# Start an ONNX Runtime Session.
# The path assumes the model repository was downloaded to ./jina-clip-v2
session = ort.InferenceSession('jina-clip-v2/onnx/model.onnx')

# Run inference
output = session.run(None, {'input_ids': input_ids, 'pixel_values': pixel_values})

# Keep the normalized embeddings; the first 2 outputs are un-normalized
_, _, text_embeddings, image_embeddings = output
```

</details>

## License

`jina-clip-v2` is listed on AWS & Azure. If you need to use it beyond those platforms or on-premises within your company, note that the model is licensed under CC BY-NC 4.0. For commercial usage inquiries, feel free to [contact us](https://jina.ai/contact-sales/).

## Contact

Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.

## Citation

If you find `jina-clip-v2` useful in your research, please cite the following paper:

```bibtex
@misc{2405.20204,
    Author = {Andreas Koukounas and Georgios Mastrapas and Michael Günther and Bo Wang and Scott Martens and Isabelle Mohr and Saba Sturua and Mohammad Kalim Akram and Joan Fontanals Martínez and Saahil Ognawala and Susana Guzman and Maximilian Werk and Nan Wang and Han Xiao},
    Title = {Jina CLIP: Your CLIP Model Is Also Your Text Retriever},
    Year = {2024},
    Eprint = {arXiv:2405.20204},
}
```