Embeddings not working with Langchain HuggingFaceEmbeddings

#11
by weissenbacherpwc - opened

Hi,

I use the embeddings locally and found that with the Jina Embeddings, the retriever is not working correctly.
I set up my retriever as follows:

retriever = vectordb.as_retriever(search_kwargs={'k': 3, 'score_threshold': 0.75,'sorted': True}, search_type="similarity_score_threshold")

This computes cosine similarity with all other embedding models I am using. However, with jina-embeddings-v2-base-de I only get results if I set the threshold to a low value like 0.2. Sometimes the scores are even negative. Could it be returning cosine distance instead?
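For reference, the two quantities in question differ as follows. This is a minimal NumPy sketch with made-up 2-D vectors, not the retriever's actual scoring code; the helper names `cosine_similarity` and `cosine_distance` are illustrative only.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; lies in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a, b):
    """1 - cosine similarity; lies in [0, 2], smaller means more similar."""
    return 1.0 - cosine_similarity(a, b)

# Made-up vectors, for illustration only
a = np.array([1.0, 0.0])
b = np.array([0.0, 2.0])   # orthogonal to a
c = np.array([-3.0, 0.0])  # points in the opposite direction to a

for name, v in [("b", b), ("c", c)]:
    print(name, cosine_similarity(a, v), cosine_distance(a, v))
```

Cosine similarity can itself be negative for vectors pointing in opposing directions, whereas cosine distance never is, so a negative score alone does not indicate that a distance is being returned.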

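Here is my code for replication:
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter

# directory_path, chunk_size, chunk_overlap and embedding_name are defined elsewhere
loader = DirectoryLoader(directory_path,
                         glob='*.pdf',
                         loader_cls=PyPDFLoader)
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size,
                                               chunk_overlap=chunk_overlap,
                                               #length_function = len
                                               )
texts = text_splitter.split_documents(documents)
embeddings = HuggingFaceEmbeddings(model_name=embedding_name,
                                   model_kwargs={'device': 'mps', 'trust_remote_code': True},
                                   encode_kwargs={'device': 'mps', 'normalize_embeddings': True})
# Use the same variable name for the vector store and the retriever
vectordb = FAISS.from_documents(texts, embeddings)
retriever = vectordb.as_retriever(search_kwargs={'k': 3, 'score_threshold': 0.75,'sorted': True}, search_type="similarity_score_threshold")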

Hi @weissenbacherpwc

I tried to replicate your problem of negative similarity scores using our embedding model but did not succeed. However, it seems that the embeddings are not in fact normalized when I run your code above. Here is a small working example to illustrate:

import numpy as np
from langchain_community.embeddings import HuggingFaceEmbeddings

texts = [
    'How is the weather today?',
    'Wie ist das Wetter heute?',
    'Wie geht es dir?',
    'How are you doing?'
]

model = HuggingFaceEmbeddings(
    model_name="jinaai/jina-embeddings-v2-base-de",
    model_kwargs={'device': 'mps', 'trust_remote_code': True},
    encode_kwargs={'device': 'mps', 'normalize_embeddings': True},
)
embeddings = model.client.encode(texts)

# Compute the L2 norm of each embedding vector
l2_norms = np.linalg.norm(embeddings, axis=1)

print(l2_norms)
# To check all embeddings systematically, you can use:
are_normalized = np.allclose(l2_norms, 1, atol=1e-6)
print(f"All embeddings normalized: {are_normalized}")  # --> this prints False

This shows that the embeddings are not normalized, even though we set normalize_embeddings=True in encode_kwargs. I suspect these kwargs are not being passed along correctly. If I change the encode call to the following, the embeddings are actually normalized:

embeddings = model.client.encode(texts, normalize_embeddings=True)

So it seems the parameter is not passed correctly with encode_kwargs. I will do some more debugging, but as far as I can see, this is not an issue on our embedding model side but rather with the LangChain pipeline.
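Until the kwargs question is settled, one possible workaround is to normalize the vectors yourself before they are indexed. The sketch below uses plain NumPy; `l2_normalize` is a hypothetical helper and the sample vectors are made up stand-ins for model output. It also checks a general fact that may explain the low scores: only for unit-length vectors does the squared Euclidean distance equal 2 - 2 * cosine similarity, so any score derived from L2 distance lines up with cosine similarity only when the embeddings are normalized. Whether the vector store in the setup above scores by L2 distance is an assumption here, not something confirmed in this thread.

```python
import numpy as np

def l2_normalize(vectors, eps=1e-12):
    """Scale each row to unit L2 norm (hypothetical helper, not a LangChain API)."""
    vectors = np.asarray(vectors, dtype=np.float64)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # eps guards against division by zero for all-zero rows
    return vectors / np.maximum(norms, eps)

# Made-up vectors standing in for un-normalized model output
raw = np.array([[3.0, 4.0], [0.5, 0.0], [1.0, 1.0]])
unit = l2_normalize(raw)

# Same check as in the example above: every row should now have norm 1
norms = np.linalg.norm(unit, axis=1)
print(norms)
print(f"All embeddings normalized: {np.allclose(norms, 1.0, atol=1e-6)}")

# For unit vectors the dot product is the cosine similarity, and the
# squared Euclidean distance equals 2 - 2 * cosine similarity.
sims = unit @ unit.T
sq_dists = ((unit[:, None, :] - unit[None, :, :]) ** 2).sum(axis=-1)
assert np.allclose(sq_dists, 2.0 - 2.0 * sims)
```

With real embeddings, the same function could be applied to the array returned by model.client.encode(texts) before building the index.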

bwang0911 changed discussion status to closed
