Marqo-FashionSigLIP Model Card

Marqo-FashionSigLIP leverages Generalised Contrastive Learning (GCL), which allows the model to be trained not just on text descriptions but also on categories, styles, colors, materials, keywords, and fine details, producing highly relevant search results for fashion products. The model was fine-tuned from ViT-B-16-SigLIP (webli).
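
In practice, this kind of multi-field supervision means a product is described by several short text fields rather than a single caption. One simple way to exploit that at inference time is to flatten the fields into one string before encoding. The sketch below is illustrative only; the field names are assumptions, not the actual GCL training schema.

# Hypothetical product record: the fields mirror the kinds of metadata
# GCL trains on (category, style, color, material); not an actual schema.
product = {
    "title": "linen blazer",
    "category": "jackets",
    "style": "casual",
    "color": "beige",
    "material": "linen",
}

# Flatten the fields into one description for the text encoder.
text_input = ", ".join(product.values())
print(text_input)  # linen blazer, jackets, casual, beige, linen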

GitHub Page: Marqo-FashionCLIP

Blog: Marqo Blog

Usage

The model can be used seamlessly with OpenCLIP:

import open_clip
import torch
from PIL import Image

# Load the model and its train/eval preprocessing transforms from the Hugging Face Hub.
model, preprocess_train, preprocess_val = open_clip.create_model_and_transforms('hf-hub:Marqo/marqo-fashionSigLIP')
tokenizer = open_clip.get_tokenizer('hf-hub:Marqo/marqo-fashionSigLIP')

# Preprocess one image (adding a batch dimension) and tokenize the candidate labels.
image = preprocess_val(Image.open("docs/fashion-hippo.png")).unsqueeze(0)
text = tokenizer(["a hat", "a t-shirt", "shoes"])

with torch.no_grad(), torch.cuda.amp.autocast():
    # Encode and L2-normalise both modalities so the dot product is cosine similarity.
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Scaled cosine similarities, converted to a probability distribution over the labels.
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)

Benchmark Results

Average evaluation results on 6 public multimodal fashion datasets (Atlas, DeepFashion (In-shop), DeepFashion (Multimodal), Fashion200k, KAGL, and Polyvore) are reported below:

Text-To-Image (Averaged across 6 datasets)

| Model | AvgRecall | Recall@1 | Recall@10 | MRR |
|---|---|---|---|---|
| Marqo-FashionSigLIP | 0.231 | 0.121 | 0.340 | 0.239 |
| FashionCLIP2.0 | 0.163 | 0.077 | 0.249 | 0.165 |
| OpenFashionCLIP | 0.132 | 0.060 | 0.204 | 0.135 |
| ViT-B-16-laion2b_s34b_b88k | 0.174 | 0.088 | 0.261 | 0.180 |
| ViT-B-16-SigLIP-webli | 0.212 | 0.111 | 0.314 | 0.214 |
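
For reference, Recall@K measures whether a relevant item appears in the top K results, and MRR averages the reciprocal rank of the first relevant hit. A minimal sketch of both (the ranked lists here are illustrative, not the evaluation harness used for these numbers):

def recall_at_k(ranked, relevant, k):
    # 1.0 if any relevant item appears in the top-k results, else 0.0.
    return float(any(item in relevant for item in ranked[:k]))

def mrr(ranked, relevant):
    # Reciprocal rank of the first relevant item; 0.0 if none is retrieved.
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

# Toy example: the relevant product is ranked third.
ranked = ["p7", "p2", "p5", "p9"]
print(recall_at_k(ranked, {"p5"}, 1))   # 0.0
print(recall_at_k(ranked, {"p5"}, 10))  # 1.0
print(mrr(ranked, {"p5"}))              # 0.333...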

Category-To-Product (Averaged across 5 datasets)

| Model | AvgP | P@1 | P@10 | MRR |
|---|---|---|---|---|
| Marqo-FashionSigLIP | 0.737 | 0.758 | 0.716 | 0.812 |
| FashionCLIP2.0 | 0.684 | 0.681 | 0.686 | 0.741 |
| OpenFashionCLIP | 0.646 | 0.653 | 0.639 | 0.720 |
| ViT-B-16-laion2b_s34b_b88k | 0.662 | 0.673 | 0.652 | 0.743 |
| ViT-B-16-SigLIP-webli | 0.688 | 0.690 | 0.685 | 0.751 |
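
The category-level tables report precision rather than recall: a category query has many relevant products, so what matters is how many of the top K results match. A minimal sketch of P@K (the lists are illustrative):

def precision_at_k(ranked, relevant, k):
    # Fraction of the top-k results that are relevant products.
    return sum(item in relevant for item in ranked[:k]) / k

# Toy example: two of the top three results belong to the queried category.
ranked = ["p1", "p4", "p8"]
print(precision_at_k(ranked, {"p1", "p8"}, 3))  # 0.666...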

Sub-Category-To-Product (Averaged across 4 datasets)

| Model | AvgP | P@1 | P@10 | MRR |
|---|---|---|---|---|
| Marqo-FashionSigLIP | 0.725 | 0.767 | 0.683 | 0.811 |
| FashionCLIP2.0 | 0.657 | 0.676 | 0.638 | 0.733 |
| OpenFashionCLIP | 0.598 | 0.619 | 0.578 | 0.689 |
| ViT-B-16-laion2b_s34b_b88k | 0.638 | 0.651 | 0.624 | 0.712 |
| ViT-B-16-SigLIP-webli | 0.643 | 0.643 | 0.643 | 0.726 |