|
--- |
|
license: cc-by-nc-4.0 |
|
tags: |
|
- feature-extraction |
|
- sentence-similarity |
|
- mteb |
|
language: |
|
- multilingual |
|
- af |
|
- am |
|
- ar |
|
- as |
|
- az |
|
- be |
|
- bg |
|
- bn |
|
- br |
|
- bs |
|
- ca |
|
- cs |
|
- cy |
|
- da |
|
- de |
|
- el |
|
- en |
|
- eo |
|
- es |
|
- et |
|
- eu |
|
- fa |
|
- fi |
|
- fr |
|
- fy |
|
- ga |
|
- gd |
|
- gl |
|
- gu |
|
- ha |
|
- he |
|
- hi |
|
- hr |
|
- hu |
|
- hy |
|
- id |
|
- is |
|
- it |
|
- ja |
|
- jv |
|
- ka |
|
- kk |
|
- km |
|
- kn |
|
- ko |
|
- ku |
|
- ky |
|
- la |
|
- lo |
|
- lt |
|
- lv |
|
- mg |
|
- mk |
|
- ml |
|
- mn |
|
- mr |
|
- ms |
|
- my |
|
- ne |
|
- nl |
|
- 'no' |
|
- om |
|
- or |
|
- pa |
|
- pl |
|
- ps |
|
- pt |
|
- ro |
|
- ru |
|
- sa |
|
- sd |
|
- si |
|
- sk |
|
- sl |
|
- so |
|
- sq |
|
- sr |
|
- su |
|
- sv |
|
- sw |
|
- ta |
|
- te |
|
- th |
|
- tl |
|
- tr |
|
- ug |
|
- uk |
|
- ur |
|
- uz |
|
- vi |
|
- xh |
|
- yi |
|
- zh |
|
inference: false |
|
library_name: transformers |
|
--- |
|
|
|
<br><br> |
|
|
|
<p align="center"> |
|
<img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px"> |
|
</p> |
|
|
|
|
|
<p align="center"> |
|
<b>The embedding set trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b> |
|
</p> |
|
|
|
<p align="center"> |
|
<b>Jina Embedding V3: A Multilingual Multi-Task Embedding Model</b> |
|
</p> |
|
|
|
## Quick Start |
|
|
|
The easiest way to start using `jina-embeddings-v3` is through Jina AI's [Embedding API](https://jina.ai/embeddings/).
|
|
|
|
|
## Intended Usage & Model Info |
|
|
|
`jina-embeddings-v3` is a multilingual **text embedding model** that supports sequences of up to **8192 tokens**.
It is based on an XLM-RoBERTa architecture (`JinaXLMRoBERTa`) that uses Rotary Position Embeddings to allow longer sequences.
The `JinaXLMRoBERTa` backbone is pretrained on variable-length text in 89 languages with a Masked Language Modeling objective for 160k steps.
The model is further trained on Jina AI's collection of more than 500 million multilingual sentence pairs and hard negatives.
These pairs were drawn from various domains and selected through a thorough cleaning process.
|
|
|
`jina-embeddings-v3` has 5 task-specific LoRA adapters tuned on top of the backbone. To select one, pass `task_type` as an additional parameter when using the model:
|
|
|
TODO UPDATE THIS |
|
|
|
1. **query**: Handles incoming user queries at search time.
2. **index**: Handles user documents submitted for indexing.
3. **text-matching**: Handles symmetric text similarity tasks on short or long texts, such as STS (Semantic Textual Similarity).
4. **classification**: Classifies user inputs into predefined categories.
5. **clustering**: Supports clustering of embeddings for further analysis.
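
To illustrate how the `query` and `index` adapters might be combined for retrieval, here is a minimal sketch in plain Python. The `rank_documents` helper and the `Encoder` signature are hypothetical and not part of the model's API. The sketch assumes only that some callable turns a list of texts plus a task name into one vector per text. The task names are taken from the list above, which is still marked for update.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
# Hypothetical encoder signature: (texts, task_type) -> one vector per text.
Encoder = Callable[[List[str], str], List[Vector]]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_documents(
    encode: Encoder, query: str, documents: List[str]
) -> List[Tuple[str, float]]:
    """Rank documents against a query, best match first."""
    # Documents go through the "index" adapter, the query through "query".
    doc_vectors = encode(documents, "index")
    query_vector = encode([query], "query")[0]
    scored = [
        (doc, cosine(query_vector, vec))
        for doc, vec in zip(documents, doc_vectors)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With the real model, `encode` could be a thin wrapper such as `lambda texts, task: model.encode(texts, task_type=task)`, assuming the `encode` call shown in the Usage section keeps that signature.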
|
|
|
`jina-embeddings-v3` supports Matryoshka representation learning. We recommend using an embedding size of 128 or higher (1024 provides optimal performance) for storing your embeddings. |
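
With Matryoshka-style embeddings, the leading components of a vector are trained to work as a smaller embedding on their own. A stored full-size vector can therefore be shortened by keeping its first `dim` components. The sketch below shows that truncation followed by L2 renormalization, in plain Python. The `truncate_embedding` helper is hypothetical, and the toy four-dimensional vector stands in for a real 1024-dimensional embedding. Whether the model's own `truncate_dim` argument renormalizes the result is not stated here, so treat the renormalization step as an assumption.

```python
import math
from typing import List, Sequence


def truncate_embedding(embedding: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale to unit length."""
    if not 0 < dim <= len(embedding):
        raise ValueError(f"dim must be between 1 and {len(embedding)}")
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    # Leave an all-zero vector untouched rather than dividing by zero.
    return [x / norm for x in head] if norm else head


full = [3.0, 4.0, 12.0, 0.5]         # toy stand-in for a full-size embedding
short = truncate_embedding(full, 2)  # keep only the first two components
print(short)
```

In practice the same operation is usually applied to a whole array of embeddings at once, for example with NumPy slicing.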
|
|
|
|
|
|
|
## Data & Parameters |
|
|
|
Coming soon.
|
|
|
## Usage |
|
|
|
1. The easiest way to start using `jina-embeddings-v3` is through Jina AI's [Embeddings API](https://jina.ai/embeddings/).
2. Alternatively, you can use `jina-embeddings-v3` directly via the `transformers` package.
|
|
|
```python
# Install the dependencies first (in a notebook, prefix the command with "!"):
#   pip install transformers einops flash_attn
from transformers import AutoModel

# Initialize the model; trust_remote_code allows loading the custom model code
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v3', trust_remote_code=True)

# Example sentences in several languages
sentences = [
    "Organic skincare for sensitive skin with aloe vera and chamomile.",
    "New makeup trends focus on bold colors and innovative techniques",
    "Bio-Hautpflege für empfindliche Haut mit Aloe Vera und Kamille",
    "Neue Make-up-Trends setzen auf kräftige Farben und innovative Techniken",
    "Cuidado de la piel orgánico para piel sensible con aloe vera y manzanilla",
    "Las nuevas tendencias de maquillaje se centran en colores vivos y técnicas innovadoras",
    "针对敏感肌专门设计的天然有机护肤产品",
    "新的化妆趋势注重鲜艳的颜色和创新的技巧",
    "敏感肌のために特別に設計された天然有機スキンケア製品",
    "新しいメイクのトレンドは鮮やかな色と革新的な技術に焦点を当てています",
]

# Encode sentences
embeddings = model.encode(sentences, truncate_dim=1024, task_type='index')  # TODO UPDATE

# Compute the similarity between the first two sentences
print(embeddings[0] @ embeddings[1].T)
```
|
|
|
|
|
## Performance |
|
|
|
TODO UPDATE THIS |
|
|
|
## Contact |
|
|
|
Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas. |
|
|
|
## Citation |
|
|
|
If you find `jina-embeddings-v3` useful in your research, please cite the following paper: |
|
|
|
```bibtex |
|
|
|
``` |
|
|