---
license: cc-by-nc-4.0
tags:
- feature-extraction
- sentence-similarity
- mteb
language:
- multilingual
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- no
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
inference: false
library_name: transformers
---

<br><br>

<p align="center">
<img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
</p>

<p align="center">
<b>The embedding set trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
</p>

<p align="center">
<b>Jina Embedding V3: A Multilingual Multi-Task Embedding Model</b>
</p>

## Quick Start

The easiest way to start using `jina-embeddings-v3` is with the [Jina Embedding API](https://jina.ai/embeddings/).

## Intended Usage & Model Info

`jina-embeddings-v3` is a **multilingual multi-task text embedding model** designed for a variety of NLP applications.
Based on the [XLM-RoBERTa architecture](https://huggingface.co/jinaai/xlm-roberta-flash-implementation),
this model supports [Rotary Position Embeddings (RoPE)](https://arxiv.org/abs/2104.09864) to handle long input sequences up to **8192 tokens**.
Additionally, it features [LoRA](https://arxiv.org/abs/2106.09685) adapters to generate task-specific embeddings efficiently.

### Key Features:
- **Extended Sequence Length:** Supports up to 8192 tokens with RoPE.
- **Task-Specific Embedding:** Customize embeddings through the `task_type` argument with the following options:
    - `retrieval.query`: Used for query embeddings in asymmetric retrieval tasks
    - `retrieval.passage`: Used for passage embeddings in asymmetric retrieval tasks
    - `separation`: Used for embeddings in clustering and re-ranking applications
    - `classification`: Used for embeddings in classification tasks
    - `text-matching`: Used for embeddings in tasks that quantify similarity between two texts, such as STS or symmetric retrieval tasks
- **Matryoshka Embeddings:** Supports flexible embedding sizes (`32, 64, 128, 256, 512, 768, 1024`), so you can truncate embeddings to fit your application.

### Model Lineage:

The `jina-embeddings-v3` model is an enhancement of the [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) model, initially trained on 100 languages. This model's functionality has been extended through an additional pretraining phase using the [CulturaX](https://huggingface.co/datasets/uonlp/CulturaX) dataset. Additionally, LoRA was employed to increase the context length to 8192 tokens. For further optimization, contrastive fine-tuning was performed across 30 languages, improving its performance in both monolingual and cross-lingual embedding tasks.

### Supported Languages:
While the base model supports 100 languages, we've focused our tuning efforts on the following 30 languages:
**Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, Georgian, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Latvian, Norwegian, Polish, Portuguese, Romanian, Russian, Slovak, Spanish, Swedish, Thai, Turkish, Ukrainian, Urdu,** and **Vietnamese.**

## Data & Parameters

The data and training details are described in the technical report (coming soon).

## Usage

<details><summary><b>Apply mean pooling when integrating the model.</b></summary>
<p>

### Why Use Mean Pooling?

Mean pooling averages all token embeddings from the model's output into a single sentence- or paragraph-level embedding.
This approach has been shown to produce high-quality sentence embeddings.

We provide an `encode` function that handles this for you automatically.

However, if you're working with the model directly, outside of the `encode` function,
you'll need to apply mean pooling manually. Here's how you can do it:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel


def mean_pooling(model_output, attention_mask):
    # The first element of the model output holds the per-token embeddings.
    token_embeddings = model_output[0]
    # Expand the mask over the hidden dimension so padding tokens contribute zero.
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    # Sum the unmasked token embeddings and divide by the number of real tokens.
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )


sentences = ["How is the weather today?", "What is the current weather like today?"]

tokenizer = AutoTokenizer.from_pretrained("jinaai/jina-embeddings-v3")
model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)

encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input["attention_mask"])
# L2-normalize so that dot products equal cosine similarities.
embeddings = F.normalize(embeddings, p=2, dim=1)
```
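
To make the masking step in `mean_pooling` concrete, here is a minimal dependency-free sketch of the same idea on a toy input. The helper name `masked_mean` and the toy vectors are illustrative assumptions, not part of the model's API; the point is only that padded positions (mask `0`) are excluded from the average.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0.

    Hypothetical helper for illustration; it mirrors the torch version
    above for a single sequence, using plain Python lists.
    """
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            total = [t + v for t, v in zip(total, vector)]
            count += 1
    # Guard against an all-padding sequence, like torch.clamp(min=1e-9).
    return [t / max(count, 1e-9) for t in total]


# Two real tokens followed by one padding token (mask 0).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 3.0]
```

The padding vector `[100.0, 100.0]` has no effect on the result, which is why the attention mask must be passed alongside the model output.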

</p>
</details>

You can also use `jina-embeddings-v3` directly via the Transformers package:
```bash
pip install transformers torch einops
pip install 'numpy<2'
```
If you run the model on a GPU that supports [FlashAttention-2](https://github.com/Dao-AILab/flash-attention) (as of 2024-09-12: Ampere, Ada, or Hopper GPUs such as the A100, RTX 3090, RTX 4090, and H100), also install `flash-attn`:

```bash
pip install flash-attn --no-build-isolation
```

```python
from transformers import AutoModel

# Initialize the model
model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)

texts = [
    "Follow the white rabbit.",  # English
    "Sigue al conejo blanco.",  # Spanish
    "Suis le lapin blanc.",  # French
    "跟着白兔走。",  # Chinese
    "اتبع الأرنب الأبيض.",  # Arabic
    "Folge dem weißen Kaninchen.",  # German
]

# When calling the `encode` function, you can choose a `task_type` based on the use case:
# 'retrieval.query', 'retrieval.passage', 'separation', 'classification', 'text-matching'
# Alternatively, you can choose not to pass a `task_type`, and no specific LoRA adapter will be used.
embeddings = model.encode(texts, task_type="text-matching")

# Compute similarities
print(embeddings[0] @ embeddings[1].T)
```
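
The snippet above compares only the first two sentences. To compare every pair, you can build a full similarity matrix. The following is a rough sketch in plain Python; the `cosine` and `similarity_matrix` helpers and the two-dimensional stand-in vectors are assumptions for illustration, so substitute the real output of `model.encode`. Cosine similarity is computed explicitly here, so the sketch does not rely on the embeddings being normalized.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors):
    """Pairwise cosine similarities; entry [i][j] compares vectors i and j."""
    return [[cosine(a, b) for b in vectors] for a in vectors]


# Stand-in vectors (hypothetical); real embeddings have up to 1024 dimensions.
vectors = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]

matrix = similarity_matrix(vectors)
for row in matrix:
    print([round(value, 2) for value in row])
```

With NumPy arrays of unit-length embeddings, the same matrix is simply `embeddings @ embeddings.T`.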

By default, the model supports a maximum sequence length of 8192 tokens.
However, if you want to truncate your input texts to a shorter length, you can pass the `max_length` parameter to the `encode` function:
```python
embeddings = model.encode(["Very long ... document"], max_length=2048)
```

In case you want to use **Matryoshka embeddings** and switch to a different dimension,
you can adjust it by passing the `truncate_dim` parameter to the `encode` function:
```python
embeddings = model.encode(['Sample text'], truncate_dim=256)
```

Note that `truncate_dim` can be any integer between 1 and 1024 for the `separation`, `classification`, and `text-matching` tasks. For the `retrieval.passage` and `retrieval.query` tasks, the value must be larger than the length of the instruction prompt: by default, larger than 9 for `retrieval.passage` and larger than 12 for `retrieval.query`.

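
If you have already stored full-size embeddings, Matryoshka-style truncation can also be applied after the fact by keeping the leading components and re-normalizing. The sketch below illustrates that general technique in plain Python; `truncate_and_normalize` is a hypothetical helper, and whether `truncate_dim` performs exactly these steps internally is an assumption that this document does not confirm.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` components and rescale to unit length.

    Hypothetical helper illustrating Matryoshka-style truncation.
    """
    if not 1 <= dim <= len(embedding):
        raise ValueError(f"dim must be between 1 and {len(embedding)}")
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    # Leave an all-zero prefix untouched to avoid dividing by zero.
    return [x / norm for x in head] if norm > 0 else head


# Toy 4-dimensional unit vector standing in for a 1024-dimensional embedding.
full = [0.5, 0.5, 0.5, 0.5]

short = truncate_and_normalize(full, 2)
print(len(short), short)
```

Re-normalizing matters when you compare truncated vectors with a dot product, because a prefix of a unit vector is generally shorter than unit length.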

The latest version (3.1.0) of [SentenceTransformers](https://github.com/UKPLab/sentence-transformers) also supports `jina-embeddings-v3`:

```bash
pip install -U sentence-transformers
```

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)

task_type = "retrieval.query"
embeddings = model.encode(
    ["What is the weather like in Berlin today?"],
    task_type=task_type,
    prompt_name=task_type,
)
```
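
In asymmetric retrieval, queries and passages are encoded with different tasks, so it can help to keep that choice in one place. The helper below is a small sketch, not part of either library: `retrieval_kwargs` is a hypothetical name, and it assumes that `retrieval.passage` takes the same `task_type`/`prompt_name` pairing that the example above shows for `retrieval.query`.

```python
# Maps the role of a text in a retrieval setup to the task name listed
# under "Key Features". The role names themselves are hypothetical.
RETRIEVAL_TASKS = {
    "query": "retrieval.query",
    "passage": "retrieval.passage",
}


def retrieval_kwargs(role):
    """Return keyword arguments for `encode` for a query or a passage.

    Assumption: passages use the same task_type/prompt_name pattern as
    the query example above.
    """
    try:
        task = RETRIEVAL_TASKS[role]
    except KeyError:
        raise ValueError(f"role must be one of {sorted(RETRIEVAL_TASKS)}") from None
    return {"task_type": task, "prompt_name": task}


print(retrieval_kwargs("query"))
print(retrieval_kwargs("passage"))
```

Under that assumption, usage would look like `model.encode(queries, **retrieval_kwargs("query"))` and `model.encode(passages, **retrieval_kwargs("passage"))`.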

## Performance

### English MTEB

| Model | Dimension | Average | Classification | Clustering | Pair Classification | Reranking | Retrieval | STS | Summarization |
|:------------------------------:|:---------:|:---------:|:--------------:|:----------:|:-------------------:|:---------:|:---------:|:---------:|:-------------:|
| jina-embeddings-v3 | 1024 | **65.60** | **82.58** | 45.27 | 84.01 | 58.13 | 53.87 | **85.8** | 30.98 |
| jina-embeddings-v2-en | 768 | 58.12 | 68.82 | 40.08 | 84.44 | 55.09 | 45.64 | 80.00 | 30.56 |
| text-embedding-3-large | 3072 | 62.03 | 75.45 | 49.01 | 84.22 | 59.16 | 55.44 | 81.04 | 29.92 |
| multilingual-e5-large-instruct | 4096 | 64.41 | 77.56 | 47.1 | 86.19 | 58.58 | 52.47 | 84.78 | 30.39 |
| Cohere-embed-multilingual-v3.0 | 4096 | 60.08 | 64.01 | 46.6 | 86.15 | 57.86 | 53.84 | 83.15 | 30.99 |

### Multilingual MTEB

| Model | Dimension | Average | Classification | Clustering | Pair Classification | Reranking | Retrieval | STS | Summarization |
|:------------------------------:|:---------:|:---------:|:--------------:|:----------:|:-------------------:|:---------:|:---------:|:---------:|:-------------:|
| jina-embeddings-v3 | 1024 | **64.44** | **71.46** | 46.71 | 76.91 | 63.98 | 57.98 | **69.83** | - |
| multilingual-e5-large | 4096 | 59.58 | 65.22 | 42.12 | 76.95 | 63.4 | 52.37 | 64.65 | - |
| multilingual-e5-large-instruct | 4096 | 64.25 | 67.45 | **52.12** | 77.79 | **69.02** | **58.38** | 68.77 | - |

### Long Context Tasks (LongEmbed)

| Model | Dimension | Average | NarrativeQA | Needle | Passkey | QMSum | SummScreen | WikiQA |
|:----------------------:|:---------:|:---------:|:-----------:|:---------:|:----------:|:---------:|:----------:|:---------:|
| jina-embeddings-v3* | 1024 | **70.39** | 33.32 | **84.00** | **100.00** | **39.75** | 92.78 | 72.46 |
| jina-embeddings-v2 | 768 | 58.12 | 37.89 | 54.25 | 50.25 | 38.87 | 93.48 | 73.99 |
| text-embedding-3-large | 3072 | 51.30 | 44.09 | 29.25 | 63.00 | 32.49 | 84.80 | 54.16 |
| baai-bge-m3 | 1024 | 56.56 | **45.76** | 40.25 | 46.00 | 35.54 | **94.09** | **77.73** |

Note: `*` indicates that the `text-matching` adapter was used.

#### Matryoshka Embeddings

| Dimension | Retrieval | STS |
|:-----------:|:-----------:|:-------:|
| 32 | 52.54 | 76.35 |
| 64 | 58.54 | 77.03 |
| 128 | 61.64 | 77.43 |
| 256 | 62.72 | 77.56 |
| 512 | 63.16 | 77.59 |
| 768 | 63.3 | 77.59 |
| 1024 | 63.35 | 77.58 |
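
To put the table in perspective, it can be useful to express each score as a fraction of the full 1024-dimension score. The short script below does that arithmetic using only the numbers from the table; the variable names are illustrative. At 32 dimensions, for example, retrieval keeps roughly 83% of the full score while STS keeps about 98%.

```python
# (retrieval, STS) scores per embedding dimension, copied from the table above.
scores = {
    32: (52.54, 76.35),
    64: (58.54, 77.03),
    128: (61.64, 77.43),
    256: (62.72, 77.56),
    512: (63.16, 77.59),
    768: (63.3, 77.59),
    1024: (63.35, 77.58),
}

full_retrieval, full_sts = scores[1024]

# Fraction of the 1024-dimension score retained at each smaller size.
retention = {
    dim: (retrieval / full_retrieval, sts / full_sts)
    for dim, (retrieval, sts) in scores.items()
}

for dim, (r_keep, s_keep) in retention.items():
    print(f"{dim:>4}: retrieval {r_keep:6.1%}, STS {s_keep:6.1%}")
```

Read this way, the table suggests that STS quality is nearly flat across sizes, while retrieval benefits noticeably from moving from 32 to 128 dimensions.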

For a comprehensive evaluation and detailed metrics, please refer to the full paper available here (coming soon).

## Contact

Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.

## Citation

If you find `jina-embeddings-v3` useful in your research, please cite the following paper:

```bibtex

```