Embeddings quality for inputs longer than 128 tokens

#52
by HiddenGaze - opened

Hi there,

it says the sequence length was limited to 128 tokens during training. However, the model supports texts of up to 256 tokens. Do you have any information, or maybe an educated guess, on how it will perform for texts between 128 and 256 tokens?

Sentence Transformers org

Hello!

it says the sequence length was limited to 128 tokens during training. However, the model supports texts of up to 256 tokens.

Indeed, you're right.

Do you have any information, or maybe an educated guess, on how it will perform for texts between 128 and 256 tokens?

It tends to perform quite poorly. From others I've heard that truncating to the recommended size (i.e. 128 tokens here) gives better results than extending to 256. If you'd like a small model with a higher token length, you might want to try e.g. https://huggingface.co/BAAI/bge-small-en-v1.5, which has a sequence length of 512 tokens instead. Its README might be a bit overwhelming, but you can use it with Sentence Transformers just like this model:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
embeddings = model.encode(["my first text", "my second text"])
  • Tom Aarsen

Thanks for the fast reply! I will check out the model :)
