
Model Card for xVLM2Vec

Model description

xVLM2Vec is a Large Vision-Language Model (LVLM) aligned over TIGER-Lab/VLM2Vec-LoRA. The model was trained on a machine-translated parallel corpus to improve performance on multilingual retrieval tasks. It can perform several multimodal retrieval tasks (e.g. Text-to-Image, Image-to-Text, VQA, Visual Grounding, and Classification).

More details regarding the training procedure (e.g. hyperparameters and dataset construction) can be found in the paper: https://arxiv.org/abs/2503.09313.

  • Developed by: Elio Musacchio, Lucia Siciliani, Pierpaolo Basile
  • Model type: Phi-3.5-vision-instruct
  • Language(s) (NLP): English, French, German, Italian and Spanish
  • License: Apache 2.0
  • Finetuned from model: TIGER-Lab/VLM2Vec-LoRA

How to Get Started with the Model

Below is an example of how to use the model. To make this easier, we recommend pulling from GitHub the version of the VLM2Vec source code that we used for both training and inference:

git clone https://github.com/swapUniba/xVLM2Vec
mv xVLM2Vec/src/mmeb_src .
rm -r xVLM2Vec
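
If you run your script from a directory other than the one containing mmeb_src, a quick workaround is to add that directory to the Python path before importing. This is a minimal sketch, and the path below is a placeholder that you should replace with your own:

import sys

# Placeholder path: replace with the directory that contains the mmeb_src folder
sys.path.insert(0, "/path/to/directory/containing/mmeb_src")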

Now you should be able to run the following:

from mmeb_src.model import MMEBModel
from mmeb_src.arguments import ModelArguments

from PIL import Image
from transformers import AutoProcessor

import torch
import requests

model_args = ModelArguments(
    model_name='microsoft/Phi-3.5-vision-instruct',
    checkpoint_path="m-elio/xVLM2Vec",
    pooling='last',
    normalize=True,
    lora=False,
)

processor = AutoProcessor.from_pretrained(
    "microsoft/Phi-3.5-vision-instruct",
    trust_remote_code=True,
    num_crops=4,
)

model = MMEBModel.load(model_args)
model.eval()
model = model.to('cuda', dtype=torch.bfloat16)

with torch.no_grad():
    # Encode the query: an image paired with an Italian instruction
    # ("Find a caption that describes the everyday image")
    image_url = "http://images.cocodataset.org/train2017/000000514915.jpg"
    image = Image.open(requests.get(image_url, stream=True).raw)
    inputs = processor("<|image_1|>\nTrova una didascalia che descriva l'immagine di tutti i giorni", [image])
    inputs = {key: value.to('cuda') for key, value in inputs.items()}
    qry_output = model(qry=inputs)["qry_reps"]

    # Encode the candidate captions
    # ("A dog lying on the floor", "A cat lying on the floor")
    strings = ['Un cane steso sul pavimento', 'Un gatto steso sul pavimento']
    inputs = processor(strings)
    inputs = {key: value.to('cuda') for key, value in inputs.items()}
    tgt_output = model(tgt=inputs)["tgt_reps"]

    # Cosine similarity between the query embedding and each candidate embedding
    cos_sim = model.compute_similarity(qry_output, tgt_output).squeeze()

    for string_, sim_ in zip(strings, cos_sim):
        print(string_, '=', sim_)

In this example, the model is used to retrieve an image caption in Italian: the query (the image plus an Italian instruction) and the candidate captions are embedded separately, and the cosine similarity between the query and each candidate is printed.
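
If you need the retrieved caption rather than the raw scores, the candidates can be ranked by their similarity to the query. The short sketch below reuses the strings and cos_sim variables from the example above and only relies on standard torch operations:

# Pick the candidate caption with the highest cosine similarity to the query
best_idx = int(torch.argmax(cos_sim))
print("Best caption:", strings[best_idx])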

Citation

If you use this model in your research, please cite the following:

@misc{musacchio2025xvlm2vecadaptinglvlmbasedembedding,
      title={xVLM2Vec: Adapting LVLM-based embedding models to multilinguality using Self-Knowledge Distillation}, 
      author={Elio Musacchio and Lucia Siciliani and Pierpaolo Basile and Giovanni Semeraro},
      year={2025},
      eprint={2503.09313},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.09313}, 
}