Fine Tuning Jina Embedding V3 for classification task

#77

by mumeranwaar - opened Nov 18, 2024

Nov 18, 2024

Is there a blog post on fine-tuning Jina Embeddings V3 for a classification task (both LoRA-only and full-model fine-tuning)?

I could not find one. Kindly guide me.

isacat

Jina AI org Nov 18, 2024

edited Nov 18, 2024

Hey, I don't think we have a blog post I can point you to that details fine-tuning jina-embeddings-v3, but it's relatively straightforward to do with Sentence Transformers (ST). Just make sure to set default_task to the LoRA adapter you want to train, such as 'classification'. Either of the following works:

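from sentence_transformers import SentenceTransformer

# Option 1: set the task when loading the model
model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True, model_kwargs={'default_task': 'classification'})

# Option 2: load the model first, then set the task on its first module
model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)
model[0].default_task = 'classification'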

You can also choose to fine-tune the main (non-LoRA) parameters by setting the option at https://huggingface.co/jinaai/jina-embeddings-v3/blob/main/config.json#L27 to True and then not passing a default_task.
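Before training, it can be worth checking which parameters will actually receive gradients under either setup. The helper below is a minimal sketch, not part of Sentence Transformers or the Jina code: `summarize_trainable` is a hypothetical name, and the assumption that LoRA parameter names contain the substring "lora" should be verified against `model.named_parameters()` for your installed version.

```python
def summarize_trainable(model, lora_marker="lora"):
    """Count parameters by training status.

    `model` is any object exposing `named_parameters()` like a torch module
    (a SentenceTransformer does). `lora_marker` is an ASSUMED substring of
    LoRA parameter names; adjust it if your names differ.
    """
    summary = {"trainable_lora": 0, "trainable_other": 0, "frozen": 0}
    for name, param in model.named_parameters():
        if not param.requires_grad:
            summary["frozen"] += param.numel()
        elif lora_marker in name.lower():
            summary["trainable_lora"] += param.numel()
        else:
            summary["trainable_other"] += param.numel()
    return summary


# Usage, after loading the model as shown in this post:
# print(summarize_trainable(model))
```

If `trainable_other` is unexpectedly large in a LoRA-only run, or zero in a full fine-tuning run, the configuration probably did not take effect the way you intended.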

Hope this helps!

isacat

Jina AI org Nov 19, 2024

@mumeranwaar You can also take a look at this blog post: https://jina.ai/news/jina-classifier-for-high-performance-zero-shot-and-few-shot-classification/
Our zero-shot / few-shot classifier API might be just what you're looking for with your classification problem, though the post does not detail how to fine-tune our v3 model.

ELVISIO

Dec 2, 2024

edited Dec 2, 2024

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer, losses
from datasets import Dataset as HFDataset, load_dataset

# Load the triplet dataset (columns: anchor, positive, negative) from Hugging Face
dataset = load_dataset("ELVISIO/imdb_dataset_offical_triplet")

train_dataset = dataset["train"]
test_dataset = dataset["test"]  # not used below; kept for later evaluation

# Convert each triplet into two labeled pairs for online contrastive loss:
# (anchor, positive) -> 1 (similar), (anchor, negative) -> 0 (dissimilar)
sentence1, sentence2, labels = [], [], []
for row in train_dataset:
    sentence1 += [row["anchor"], row["anchor"]]
    sentence2 += [row["positive"], row["negative"]]
    labels += [1, 0]

train_dataset = HFDataset.from_dict({
    "sentence1": sentence1,
    "sentence2": sentence2,
    "label": labels,
})

# Initialize the SentenceTransformer model
model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True, model_kwargs={'default_task': 'classification'})

# Define the online contrastive loss
train_loss = losses.OnlineContrastiveLoss(model=model, distance_metric=losses.SiameseDistanceMetric.COSINE_DISTANCE, margin=0.5)

# Set up the trainer
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=train_loss,
)

# Fine-tune the model
trainer.train()

For your reference.
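To make the loss choice in the script above more concrete, here is a rough pure-Python sketch of what an online contrastive loss computes on one batch, as I understand the idea: only "hard" pairs contribute, meaning positives that are farther apart than the closest negative and negatives that are closer than the farthest positive. This is an illustration, not the Sentence Transformers implementation; the function names are made up, and the thresholds are simplified to a plain min/max over a batch with at least one pair of each label.

```python
import math


def cosine_distance(u, v):
    """Cosine distance, 1 - cos(u, v), for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def online_contrastive_loss(distances, labels, margin=0.5):
    """Simplified sketch: sum the contrastive loss over hard pairs only."""
    poss = [d for d, y in zip(distances, labels) if y == 1]
    negs = [d for d, y in zip(distances, labels) if y == 0]
    if not poss or not negs:
        raise ValueError("this sketch needs at least one positive and one negative pair")
    # Hard positives are farther apart than the closest negative pair.
    hard_pos = [d for d in poss if d > min(negs)]
    # Hard negatives are closer together than the farthest positive pair.
    hard_neg = [d for d in negs if d < max(poss)]
    positive_loss = sum(d ** 2 for d in hard_pos)
    negative_loss = sum(max(0.0, margin - d) ** 2 for d in hard_neg)
    return positive_loss + negative_loss


# Two positive pairs (label 1) and two negative pairs (label 0).
distances = [0.1, 0.6, 0.3, 0.9]
labels = [1, 1, 0, 0]
# Only the 0.6 positive and the 0.3 negative are "hard":
# 0.6**2 + (0.5 - 0.3)**2 ≈ 0.40
loss = online_contrastive_loss(distances, labels, margin=0.5)
```

With margin=0.5 and cosine distance, a negative pair stops contributing once its distance reaches 0.5, so the margin controls how far apart dissimilar reviews are pushed.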

mumeranwaar changed discussion status to closed Dec 3, 2024
