
Lucas Beyer

giffmana

http://lucasb.eyer.be


Recent Activity

authored a paper 20 days ago

Gemma 3 Technical Report

new activity about 1 month ago

google/siglip-so400m-patch14-384:Is SiglipImageProcessor configured correctly?

new activity about 1 month ago

google/siglip2-so400m-patch14-384:Question About SigLIP 2’s Performance with Newline-Separated Labels


giffmana's activity

authored a paper 20 days ago

Gemma 3 Technical Report

Paper • 2503.19786 • Published 23 days ago • 46

New activity in google/siglip-so400m-patch14-384 about 1 month ago

Is SiglipImageProcessor configured correctly?

#9 opened 5 months ago by

karby

New activity in google/siglip2-so400m-patch14-384 about 1 month ago

Question About SigLIP 2’s Performance with Newline-Separated Labels

#2 opened about 2 months ago by

zfjerome1

commented on SigLIP 2: A better multilingual vision language encoder about 2 months ago

Not sure what's up, as I'm not familiar with this codebase (and have no time to dig in), but for SigLIP what you're supposed to do is compute sigmoid(zimg @ ztxt * temperature + bias).

From what you describe, I would bet the bias and/or temperature are missing.
The ground-truth reference code is https://colab.research.google.com/github/google-research/big_vision/blob/main/big_vision/configs/proj/image_text/SigLIP2_demo.ipynb
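
For illustration, the scoring formula in the comment above can be sketched in plain Python for a single image/text pair. This is a minimal sketch and not the reference implementation: the function names are hypothetical, the L2 normalization of both embeddings is an assumption, and the temperature and bias values in the usage lines are made-up placeholders rather than learned parameters of any checkpoint.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumed preprocessing, not stated in the comment)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def siglip_probability(zimg, ztxt, temperature, bias):
    """Return sigmoid(zimg . ztxt * temperature + bias) for one image/text pair."""
    zimg = l2_normalize(zimg)
    ztxt = l2_normalize(ztxt)
    logit = sum(a * b for a, b in zip(zimg, ztxt)) * temperature + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Placeholder values chosen only to show the effect of the two parameters.
with_bias = siglip_probability([1.0, 0.0], [1.0, 0.0], temperature=10.0, bias=-10.0)
without_bias = siglip_probability([1.0, 0.0], [1.0, 0.0], temperature=10.0, bias=0.0)
```

With these placeholder values, dropping the bias moves the score for the same pair from the middle of the range to nearly 1, which is the kind of discrepancy the comment attributes to a missing bias or temperature.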

commented on SigLIP 2: A better multilingual vision language encoder about 2 months ago

Sorry, I can't collaborate on individuals' papers.

commented on SigLIP 2: A better multilingual vision language encoder about 2 months ago

Ah sorry that wasn't clear from your message. I'm not familiar enough with this codebase to help more.

commented on SigLIP 2: A better multilingual vision language encoder about 2 months ago

The warning gives you the answer: pass max_length=64

commented on SigLIP 2: A better multilingual vision language encoder about 2 months ago

Yes. If you want longer text, what I'd do is chunk it into pieces of 64 tokens (possibly even overlapping), embed those separately, and either average their embeddings or dot them with the image embedding individually and take the max or average score, depending on your use case.

I'm actually curious what kind of queries you're dealing with that are longer than 64 tokens? All use cases of SigLIP I can think of almost always fit well below 64.
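
The chunking idea described above could look roughly like the following. It is only a sketch under stated assumptions: `chunk_tokens`, `mean_embedding`, and `aggregate_scores` are hypothetical helper names, token ids are plain Python lists, and the actual embedding of each chunk (including padding the last, shorter chunk to 64 tokens) is left to whatever tokenizer and model you use.

```python
def chunk_tokens(token_ids, max_length=64, overlap=0):
    """Split token ids into windows of at most max_length, optionally overlapping."""
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must satisfy 0 <= overlap < max_length")
    stride = max_length - overlap
    chunks = []
    for start in range(0, len(token_ids), stride):
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # this window already reaches the end of the text
    return chunks


def mean_embedding(embeddings):
    """Option 1: average the per-chunk text embeddings into one vector."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


def aggregate_scores(scores, mode="max"):
    """Option 2: score each chunk against the image, then take max or mean."""
    if mode == "max":
        return max(scores)
    if mode == "mean":
        return sum(scores) / len(scores)
    raise ValueError("mode must be 'max' or 'mean'")


# A 150-token text split into 64-token windows that overlap by 16 tokens.
chunks = chunk_tokens(list(range(150)), max_length=64, overlap=16)
```

Taking the max favors texts where any single chunk matches the image, while the mean (of scores or of embeddings) rewards texts that match throughout; which one is appropriate depends on the use case, as the comment says.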

authored a paper about 2 months ago