A fine-tuned version of PaliGemma (224x224 input resolution), trained on the google/docci and google/imageinwords datasets.

Install transformers from source:

pip install git+https://github.com/huggingface/transformers

Generate a caption:

from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from PIL import Image
import requests
import torch

model_id = "gokaygokay/sd3-long-captioner"

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg?download=true"
image = Image.open(requests.get(url, stream=True).raw)

model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).to('cuda').eval()
processor = AutoProcessor.from_pretrained(model_id)

# Task prefix: asks the model for an English caption
prompt = "caption en"
model_inputs = processor(text=prompt, images=image, return_tensors="pt").to('cuda')
input_len = model_inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**model_inputs, max_new_tokens=256, do_sample=False)
    generation = generation[0][input_len:]
    decoded = processor.decode(generation, skip_special_tokens=True)
    print(decoded)
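
The slice `generation[0][input_len:]` is there because `generate` returns the prompt tokens followed by the newly generated ones, so the first `input_len` positions have to be dropped before decoding. The sketch below shows the same idea on plain Python lists. The helper name `strip_prompt` and the token ids are hypothetical, chosen only for illustration.

```python
def strip_prompt(sequence, input_len):
    """Return only the tokens that come after the first `input_len` positions."""
    return sequence[input_len:]


# Hypothetical token ids: the first four stand in for the image + prefix tokens.
prompt_ids = [101, 102, 103, 104]
# A generated sequence echoes the prompt, then appends the new tokens.
generated = prompt_ids + [7, 8, 9]

new_tokens = strip_prompt(generated, len(prompt_ids))
print(new_tokens)
```

Without the slice, the decoded string would start with the `caption en` prefix rather than with the caption itself.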
 
Model size: 2.92B params (Safetensors, tensor type F32)
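
As a rough guide to hardware requirements, the 2.92B parameter count and F32 tensor type can be turned into an estimate of the memory the weights alone occupy. This is a back-of-the-envelope sketch: it ignores activations, the image inputs, and framework overhead, and the 16-bit figure assumes the model is loaded in half precision, which the original usage snippet does not do. The helper name `weights_size_gb` is hypothetical.

```python
def weights_size_gb(num_params, bytes_per_param):
    """Approximate size of the raw weights in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 2.92e9  # parameter count listed for this model

fp32_gb = weights_size_gb(NUM_PARAMS, 4)  # F32 stores 4 bytes per parameter
half_gb = weights_size_gb(NUM_PARAMS, 2)  # assumption: a 16-bit dtype, 2 bytes per parameter

print(f"F32 weights:    ~{fp32_gb:.2f} GB")
print(f"16-bit weights: ~{half_gb:.2f} GB")
```

Actual peak usage during generation will be higher than either figure.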