Embeddings quality for inputs longer than 128 tokens
Hi there,
It says the sequence length was limited to 128 tokens during training. However, the model supports texts up to 256 tokens. Do you have information, or maybe an educated guess, on how it will perform for texts between 128 and 256 tokens?
Hello!
It says the sequence length was limited to 128 tokens during training. However, the model supports texts up to 256 tokens.
Indeed, you're right.
Do you have information, or maybe an educated guess, on how it will perform for texts between 128 and 256 tokens?
It tends to perform quite poorly. From others, I've heard that truncating to the recommended size (i.e., 128 tokens here) gives better results than extending to 256. If you'd like a small model with a higher token limit, you might want to try e.g. https://huggingface.co/BAAI/bge-small-en-v1.5, which has a sequence length of 512 tokens instead. Its README might be a bit overwhelming, but you can use it with Sentence Transformers just like this model:
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
embeddings = model.encode(["my first text", "my second text"])
```
- Tom Aarsen
Thanks for the fast reply! I will check out the model :)