Fine-tuning Jina Embeddings V3 for a classification task

#77
by mumeranwaar - opened

Is there a blog post available on fine-tuning Jina Embeddings V3 for a classification task (both LoRA-only and full-model fine-tuning)?

I could not find one. Kindly advise.

Jina AI org
edited Nov 18

Hey, I don't think we have a blog post I can point you to that details fine-tuning jina-embeddings-v3, but it's relatively straightforward to do with Sentence Transformers (ST). To train a LoRA adapter, just make sure to set default_task to that adapter's task name, such as 'classification'.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True, model_kwargs={'default_task': 'classification'})

Or

model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)
model[0].default_task = 'classification'

You can also choose to fine-tune the main (non-LoRA) parameters by setting the flag at https://huggingface.co/jinaai/jina-embeddings-v3/blob/main/config.json#L27 to True, and then not passing a default_task.
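Whichever route you take, it may be worth confirming which parameters will actually receive gradients before starting a long run. The following is a minimal sketch, not part of the Jina or Sentence Transformers code: `summarize_trainable` is a hypothetical helper, and it assumes that LoRA adapter weights can be recognised by the substring "lora" in their parameter names, which you should verify for your checkpoint.

```python
from collections import Counter


def summarize_trainable(named_parameters, lora_marker="lora"):
    """Count trainable and frozen scalars, split into LoRA and main weights.

    `named_parameters` is any iterable of (name, parameter) pairs, for example
    `model.named_parameters()` on a PyTorch module. Matching on `lora_marker`
    is an assumption about how the adapter weights are named.
    """
    counts = Counter()
    for name, param in named_parameters:
        group = "lora" if lora_marker in name.lower() else "main"
        state = "trainable" if param.requires_grad else "frozen"
        counts[f"{group}_{state}"] += param.numel()
    return dict(counts)
```

Calling `summarize_trainable(model.named_parameters())` on the loaded model should show the main weights as frozen in the LoRA setup and as trainable in the full fine-tuning setup. If it does not, the config flag or default_task is probably not being picked up.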

Hope this helps!

Jina AI org

@mumeranwaar You can also take a look at this blog post: https://jina.ai/news/jina-classifier-for-high-performance-zero-shot-and-few-shot-classification/
Our zero-shot / few-shot classifier API might be just what you're looking for with your classification problem, though the post does not detail how to fine-tune our v3 model.

from datasets import Dataset as HFDataset, load_dataset
from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses

# Load the triplet dataset (anchor, positive, negative) from Hugging Face
dataset = load_dataset("ELVISIO/imdb_dataset_offical_triplet")

train_dataset = dataset["train"]
test_dataset = dataset["test"]  # not used below; kept for later evaluation

# Prepare the train dataset for online contrastive loss.
# Each triplet becomes two labelled pairs:
# (anchor, positive) with label 1.0 and (anchor, negative) with label 0.0.
train_dict = {"sentence1": [], "sentence2": [], "label": []}
for row in train_dataset:
    train_dict["sentence1"] += [row["anchor"], row["anchor"]]
    train_dict["sentence2"] += [row["positive"], row["negative"]]
    train_dict["label"] += [1.0, 0.0]

train_dataset = HFDataset.from_dict(train_dict)

# Initialize the SentenceTransformer model
model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True, model_kwargs={'default_task': 'classification'})

# Define the online contrastive loss
train_loss = losses.OnlineContrastiveLoss(model=model, distance_metric=losses.SiameseDistanceMetric.COSINE_DISTANCE, margin=0.5)

# Set up the trainer
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=train_loss,
)

# Fine-tune the model
trainer.train()

for your reference
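To make the loss choice in the script above more concrete: online contrastive loss only penalises the "hard" pairs in a batch, meaning positives that are far apart and negatives that are close together. The sketch below illustrates that idea in plain Python with cosine distance. It is a simplified illustration, not the Sentence Transformers implementation, and its handling of edge cases (for example a batch with only positives) is an assumption made for this example.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def online_contrastive_loss(pairs, labels, margin=0.5):
    """Simplified hard-pair contrastive loss over one batch.

    pairs:  list of (embedding1, embedding2) tuples
    labels: 1 for similar pairs, 0 for dissimilar pairs
    """
    dists = [cosine_distance(u, v) for u, v in pairs]
    pos = [d for d, y in zip(dists, labels) if y == 1]
    neg = [d for d, y in zip(dists, labels) if y == 0]

    # Hard positives: similar pairs farther apart than the closest negative.
    hard_pos = [d for d in pos if not neg or d > min(neg)]
    # Hard negatives: dissimilar pairs closer than the farthest positive.
    hard_neg = [d for d in neg if not pos or d < max(pos)]

    positive_loss = sum(d ** 2 for d in hard_pos)
    negative_loss = sum(max(0.0, margin - d) ** 2 for d in hard_neg)
    return positive_loss + negative_loss
```

With this formulation, a batch in which every positive pair is already closer than every negative pair contributes zero loss, so training concentrates on the pairs the model currently gets wrong.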

mumeranwaar changed discussion status to closed
