Fine-tuning Jina Embeddings v3 for a classification task
Is there a blog post available on fine-tuning Jina Embeddings v3 for a classification task (LoRA only, as well as the full model)?
I could not find one; kindly guide.
Hey, I don't think we have a blog post I can point you to that details fine-tuning jina-embeddings-v3, but it's relatively straightforward to do with Sentence Transformers (ST). Just make sure to set default_task to the LoRA adapter you want to train, such as 'classification'.
model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True, model_kwargs={'default_task': 'classification'})
Or
model = SentenceTransformer("jinaai/jina-embeddings-v3", trust_remote_code=True)
model[0].default_task = 'classification'
You can also choose to fine-tune the main parameters (non-LoRA parameters) by setting https://huggingface.co/jinaai/jina-embeddings-v3/blob/main/config.json#L27 to True, and then not passing a default_task.
Hope this helps!
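One way to sanity-check which of the two setups (LoRA adapter only, or main parameters as well) is actually active is to count the parameters that will receive gradients once the model is loaded. The helper below is a minimal sketch and not part of the Jina or Sentence Transformers API: `count_parameters` is a hypothetical name, and it only assumes the object exposes a PyTorch-style `parameters()` iterator whose items have `requires_grad` and `numel()`.

```python
def count_parameters(module):
    """Return (trainable, total) parameter counts for a PyTorch-style module.

    Hypothetical helper: relies only on module.parameters() yielding objects
    with .requires_grad and .numel(), as torch.nn.Module does.
    """
    trainable = 0
    total = 0
    for param in module.parameters():
        n = param.numel()
        total += n
        # Only parameters with requires_grad=True are updated during training.
        if param.requires_grad:
            trainable += n
    return trainable, total
```

Calling `count_parameters(model)` on the loaded SentenceTransformer before training should, under these assumptions, show a small trainable share when only a LoRA adapter is being trained and a much larger one when the main parameters are trainable too.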
@mumeranwaar
You can also take a look at this blog post: https://jina.ai/news/jina-classifier-for-high-performance-zero-shot-and-few-shot-classification/
Our zero-shot / few-shot classifier API might be just what you're looking for with your classification problem, though the post does not detail how to fine-tune our v3 model.
from datasets import Dataset as HFDataset, load_dataset
from sentence_transformers import (
    InputExample,
    SentenceTransformer,
    SentenceTransformerTrainer,
    losses,
)

# Load the dataset from Hugging Face
dataset = load_dataset("ELVISIO/imdb_dataset_offical_triplet")
train_dataset = dataset["train"]
test_dataset = dataset["test"]  # loaded but not used by the training run below

# Prepare the train dataset for online contrastive loss
train_samples = []
for row in train_dataset:
    anchor = row["anchor"]
    positive = row["positive"]
    negative = row["negative"]
    # For contrastive loss, we need pairs and labels (1.0 = similar, 0.0 = dissimilar)
    train_samples.append(InputExample(texts=[anchor, positive], label=1.0))
    train_samples.append(InputExample(texts=[anchor, negative], label=0.0))

# Convert the pairs into a Hugging Face Dataset for SentenceTransformerTrainer
train_dict = {
    "sentence1": [sample.texts[0] for sample in train_samples],
    "sentence2": [sample.texts[1] for sample in train_samples],
    "label": [sample.label for sample in train_samples],
}
train_dataset = HFDataset.from_dict(train_dict)

# Initialize the SentenceTransformer model with the 'classification' LoRA adapter
model = SentenceTransformer(
    "jinaai/jina-embeddings-v3",
    trust_remote_code=True,
    model_kwargs={"default_task": "classification"},
)

# Define the online contrastive loss
train_loss = losses.OnlineContrastiveLoss(
    model=model,
    distance_metric=losses.SiameseDistanceMetric.COSINE_DISTANCE,
    margin=0.5,
)

# Set up the trainer
trainer = SentenceTransformerTrainer(
    model=model,
    train_dataset=train_dataset,
    loss=train_loss,
)

# Fine-tune the model
trainer.train()
Sharing my training script above, for your reference.
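For anyone wondering what OnlineContrastiveLoss does with the 1.0/0.0 labels in that script: as I understand it, it computes a distance for every pair in the batch and then penalises only the hard pairs, meaning positives that are farther apart than the closest negative and negatives that are closer than the farthest positive. The pure-Python sketch below illustrates that idea on precomputed cosine distances. It is an illustration under those assumptions, not the library's implementation; `online_contrastive_loss` is a hypothetical name, and batches containing only one kind of pair are handled in a simplified way.

```python
def online_contrastive_loss(distances, labels, margin=0.5):
    """Sketch of an online contrastive loss over precomputed pair distances.

    distances: one distance per pair (e.g. 1 - cosine similarity)
    labels:    1 for a similar pair, 0 for a dissimilar pair
    """
    positives = [d for d, y in zip(distances, labels) if y == 1]
    negatives = [d for d, y in zip(distances, labels) if y == 0]

    # Simplification (assumption): with only one kind of pair there is
    # nothing to compare against, so every pair is treated as hard.
    if not positives or not negatives:
        hard_positives, hard_negatives = positives, negatives
    else:
        # Hard positives: similar pairs farther apart than the closest negative.
        hard_positives = [d for d in positives if d > min(negatives)]
        # Hard negatives: dissimilar pairs closer than the farthest positive.
        hard_negatives = [d for d in negatives if d < max(positives)]

    # Pull hard positives together; push hard negatives beyond the margin.
    positive_loss = sum(d ** 2 for d in hard_positives)
    negative_loss = sum(max(0.0, margin - d) ** 2 for d in hard_negatives)
    return positive_loss + negative_loss
```

With distances [0.1, 0.6, 0.3, 0.9] and labels [1, 1, 0, 0], only the positive at 0.6 and the negative at 0.3 contribute to the loss; the other two pairs are already on the right side of each other and are ignored.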