# Qdrant Integration: Lessons Learned

## Introduction

This document summarizes our experience integrating the Qdrant vector database with FastEmbed for embedding generation. We encountered several challenges related to vector naming conventions, search query formats, and search tuning. This document outlines the issues we faced and the solutions we implemented to create a robust vector search system.

## Problem Statement

We were experiencing issues with vector name mismatches in our Qdrant integration. Specifically:

1. Points were being skipped during processing with the message "Skipping point as it has no valid vector"
2. The vector names we specified in our configuration did not match the actual vector names used in the Qdrant collection
3. We had implemented unnecessary sanitization of model names

## Understanding Vector Names in Qdrant

### How Qdrant Handles Vector Names

According to the [Qdrant documentation](https://qdrant.tech/documentation/concepts/collections/), when creating a collection with vectors, you specify vector names and their configurations. These names are used as keys when inserting and querying vectors.

However, when using FastEmbed with Qdrant, we discovered that the model names specified in the configuration are transformed before being used as vector names in the collection:

- Original model name: `"intfloat/multilingual-e5-large"`
- Actual vector name in Qdrant: `"fast-multilingual-e5-large"`

Similarly for sparse vectors:

- Original model name: `"prithivida/Splade_PP_en_v1"`
- Actual vector name in Qdrant: `"fast-sparse-splade_pp_en_v1"`

### Initial Approach (Problematic)

Our initial approach was to manually transform the model names using a `format_vector_name` function:

```python
def format_vector_name(name: str) -> str:
    """Format a model name into a valid vector name for Qdrant."""
    return name.replace('/', '_')
```

This led to inconsistencies because:

1. We were using one transformation in our code (`replace('/', '_')`)
2. FastEmbed was using a different transformation. As the examples above show, it adds a `fast-` or `fast-sparse-` prefix, drops the organization part before the slash, and lowercases the rest

## Solution: Dynamic Vector Name Discovery

Instead of trying to predict how FastEmbed transforms model names, we implemented a solution that dynamically discovers the actual vector names from the Qdrant collection configuration.

### Helper Functions

We added two helper functions to retrieve the actual vector names:

```python
import logging

from qdrant_client import QdrantClient

logger = logging.getLogger(__name__)


def get_dense_vector_name(client: QdrantClient, collection_name: str) -> str:
    """
    Get the name of the dense vector from the collection configuration.

    Args:
        client: Initialized Qdrant client
        collection_name: Name of the collection

    Returns:
        Name of the dense vector as used in the collection
    """
    try:
        return list(client.get_collection(collection_name).config.params.vectors.keys())[0]
    except (IndexError, AttributeError) as e:
        logger.warning(f"Could not get dense vector name: {e}")
        # Fallback to a default name
        return "fast-multilingual-e5-large"


def get_sparse_vector_name(client: QdrantClient, collection_name: str) -> str:
    """
    Get the name of the sparse vector from the collection configuration.

    Args:
        client: Initialized Qdrant client
        collection_name: Name of the collection

    Returns:
        Name of the sparse vector as used in the collection
    """
    try:
        return list(client.get_collection(collection_name).config.params.sparse_vectors.keys())[0]
    except (IndexError, AttributeError) as e:
        logger.warning(f"Could not get sparse vector name: {e}")
        # Fallback to a default name
        return "fast-sparse-splade_pp_en_v1"
```

### Implementation in Vector Creation

When creating new points or updating existing ones, we now use these helper functions to get the correct vector names:

```python
import uuid

from qdrant_client.models import PointStruct

# Get vector names from the collection configuration
dense_vector_name = get_dense_vector_name(client, collection_name)
sparse_vector_name = get_sparse_vector_name(client, collection_name)

# Compute the embeddings once: index 0 is the dense vector, index 1 the sparse one
embedding = get_embedding(payload_new['purpose'])

# Create point with the correct vector names
point = PointStruct(
    id=str(uuid.uuid4()),
    vector={
        dense_vector_name: embedding[0],
        sparse_vector_name: embedding[1]
    },
    payload={
        # payload fields...
    }
)
```

### Implementation in Vector Querying

Similarly, when querying vectors, we use the same helper functions:

```python
# Get the actual vector names from the collection configuration
dense_vector_name = get_dense_vector_name(client, collection_name)

# The checks below run inside the loop over the points being processed

# Skip points without vector or without the required vector type
if not point.vector or dense_vector_name not in point.vector:
    logger.debug(f"Skipping point {point_id} as it has no valid vector")
    continue

# Find semantically similar points using Qdrant's search
similar_points = client.search(
    collection_name=collection_name,
    # Tuple of (vector_name, vector_values); a dict keyed by name fails validation
    query_vector=(dense_vector_name, point.vector.get(dense_vector_name)),
    limit=100,
    score_threshold=SIMILARITY_THRESHOLD
)
```

## Key Insights

1. **Model Names vs. Vector Names**: There's a distinction between the model names you specify in your configuration and the actual vector names used in the Qdrant collection. FastEmbed transforms these names.
2. **Dynamic Discovery**: Instead of hardcoding vector names or trying to predict the transformation, it's better to dynamically discover the actual vector names from the collection configuration.
3. **Fallback Mechanism**: Always include fallback mechanisms in case the collection information can't be retrieved, making your code more robust.
4. **Consistency**: Use the same vector names throughout your system to ensure consistency between vector creation, storage, and retrieval.
5. **Correct Search Query Format**: When using named vectors in Qdrant search queries, you must use the correct format. With `client.search`, pass a `(vector_name, vector_values)` tuple as `query_vector` instead of a dictionary with vector names as keys. The `using` parameter belongs to the newer `query_points` method, not to `search`.

## Accessing Collection Configuration

The key to our solution was discovering how to access the collection configuration to get the actual vector names:

```python
# Get dense vector name
dense_vector_name = list(client.get_collection(collection_name).config.params.vectors.keys())[0]

# Get sparse vector name
sparse_vector_name = list(client.get_collection(collection_name).config.params.sparse_vectors.keys())[0]
```

This approach allows our code to adapt to however FastEmbed decides to name the vectors in the collection, rather than assuming a specific naming convention.

## Correct Search Query Format for Named Vectors

When using named vectors in Qdrant, it's important to use the correct format for search queries.
The format depends on the version of the Qdrant client you're using:

### Incorrect Format (Causes Validation Error)

```python
# This format causes a validation error
similar_points = client.search(
    collection_name=collection_name,
    query_vector={
        dense_vector_name: point.vector.get(dense_vector_name)
    },
    limit=100
)
```

### Correct Format for Qdrant Client Version 1.12.2

```python
# This is the correct format for Qdrant client version 1.12.2
similar_points = client.search(
    collection_name=collection_name,
    query_vector=(dense_vector_name, point.vector.get(dense_vector_name)),  # Tuple of (vector_name, vector_values)
    limit=100,
    score_threshold=0.8  # Optional similarity threshold
)
```

In Qdrant client version 1.12.2, the correct way to specify which named vector to use is by providing a tuple to the `query_vector` parameter. The tuple should contain the vector name as the first element and the actual vector values as the second element.

Using the incorrect format will result in a Pydantic validation error with messages like:

```
validation errors for SearchRequest
vector.list[float]
  Input should be a valid list [type=list_type, input_value={'fast-multilingual-e5-la...}, input_type=dict]
vector.NamedVector.name
  Field required [type=missing, input_value={'fast-multilingual-e5-la...}, input_type=dict]
```

## Optimizing Search Parameters for Deduplication

When using Qdrant for deduplication of similar content, the search parameters play a crucial role in determining the effectiveness of the process.
We've found the following parameters to be particularly important:

### Similarity Threshold

The `score_threshold` parameter determines the minimum similarity score required for points to be considered similar:

```python
similar_points = client.search(
    collection_name=collection_name,
    query_vector=(dense_vector_name, point.vector.get(dense_vector_name)),
    limit=100,
    score_threshold=0.9  # Only consider points with a similarity score of 0.9 or higher
)
```

For deduplication purposes, we found that a higher threshold (0.9) works better than a lower one (0.7) to avoid false positives. This means that only very similar items will be considered duplicates.

### Text Difference Threshold

In addition to vector similarity, we also check the actual text difference between potential duplicates:

```python
# Constants for duplicate detection
SIMILARITY_THRESHOLD = 0.9  # Minimum semantic similarity to consider as potential duplicate
DIFFERENCE_THRESHOLD = 0.05  # Maximum text difference (5%) to consider as duplicate
```

The `DIFFERENCE_THRESHOLD` of 0.05 means that texts with less than 5% difference will be considered duplicates. This two-step verification (vector similarity + text difference) helps to ensure that only true duplicates are removed.

## Logging Considerations

When working with Qdrant, especially during development and debugging, it's helpful to adjust the logging level:

```python
import logging

logger = logging.getLogger(__name__)

# Pick one level per environment; a second setLevel call overrides the first
logger.setLevel(logging.DEBUG)   # For development/debugging
# logger.setLevel(logging.INFO)  # For production
```

Using `DEBUG` level during development provides detailed information about vector operations, including:

- Which points are being processed
- Why points are being skipped (e.g., missing vectors)
- Similarity scores between points
- Deduplication decisions

However, in production, it's better to use `INFO` level to reduce log volume, especially when processing large collections.
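One way to apply the `DEBUG`/`INFO` split described above without editing code between environments is to read the level from an environment variable. The following is a minimal sketch rather than our actual setup: the function name `configure_logger`, the logger name `qdrant_dedup`, and the variable name `DEDUP_LOG_LEVEL` are all hypothetical.

```python
import logging
import os

# Level names accepted by this sketch; anything else falls back to INFO.
_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


def configure_logger(name: str = "qdrant_dedup",
                     env_var: str = "DEDUP_LOG_LEVEL") -> logging.Logger:
    """Return a logger whose level comes from an environment variable.

    Both default names are hypothetical placeholders for this example.
    """
    level_name = os.environ.get(env_var, "INFO").upper()
    logger = logging.getLogger(name)
    # Unknown or missing values default to INFO, the production setting.
    logger.setLevel(_LEVELS.get(level_name, logging.INFO))
    return logger
```

With this in place, setting `DEDUP_LOG_LEVEL=DEBUG` in a development shell would enable the detailed output, while leaving the variable unset keeps the quieter `INFO` level.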
## Performance Considerations

### Batch Operations

When working with large numbers of points, it's more efficient to use batch operations:

```python
# Batch upsert example
client.upsert(
    collection_name=collection_name,
    points=batch_of_points  # List of PointStruct objects
)
```

This reduces network overhead compared to upserting points individually.

### Search Limit

The `limit` parameter in search operations should be set carefully:

```python
similar_points = client.search(
    collection_name=collection_name,
    query_vector=(dense_vector_name, point.vector.get(dense_vector_name)),
    limit=100,  # Maximum number of similar points to return
    score_threshold=0.9
)
```

A higher limit increases the chance of finding all duplicates but also increases search time. For deduplication purposes, we found that a limit of 100 provides a good balance between thoroughness and performance.

## Conclusion

Our experience with Qdrant has taught us several important lessons:

1. **Dynamic Vector Name Discovery**: By retrieving the actual vector names from the Qdrant collection configuration, we've created a robust solution that adapts to the naming conventions used by FastEmbed and Qdrant.
2. **Correct Query Format**: Using the proper format for search queries with named vectors is essential - specifically using a tuple of (vector_name, vector_values) for the query_vector parameter.
3. **Optimized Search Parameters**: Tuning the similarity and text difference thresholds is crucial for effective deduplication. A high similarity threshold (0.9) combined with a low text difference threshold (0.05) gave us the best results.
4. **Appropriate Logging Levels**: Using DEBUG level during development and INFO in production helps balance between having enough information for troubleshooting and keeping log volume manageable.
5. **Batch Operations**: Using batch operations for inserting and updating points reduces network overhead when working with large collections.
By implementing these lessons, we've created a more efficient and reliable vector search system that properly handles named vectors, effectively identifies duplicates, and maintains good performance even with large collections. This solution should work regardless of changes to the naming conventions in future versions of Qdrant or FastEmbed, as it reads the actual names directly from the collection configuration.
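As a supplement to the Text Difference Threshold section, the two-step duplicate check can be sketched as follows. This document does not say how the text difference is computed, so the sketch assumes a character-level ratio from Python's standard `difflib.SequenceMatcher`. The helper names `text_difference` and `is_duplicate` are hypothetical, and the constants repeat the values quoted earlier.

```python
from difflib import SequenceMatcher

SIMILARITY_THRESHOLD = 0.9   # Minimum vector similarity score
DIFFERENCE_THRESHOLD = 0.05  # Maximum allowed text difference (5%)


def text_difference(a: str, b: str) -> float:
    """Return a difference ratio in [0, 1]; 0.0 means identical strings.

    Assumption: difference is 1 minus difflib's similarity ratio.
    """
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def is_duplicate(score: float, text_a: str, text_b: str) -> bool:
    """Two-step check: vector similarity first, then text difference."""
    if score < SIMILARITY_THRESHOLD:
        return False
    return text_difference(text_a, text_b) < DIFFERENCE_THRESHOLD
```

In a deduplication loop, `score` would come from each result returned by the search call, and the two texts from the payloads of the query point and the candidate. Checking the cheap score comparison first avoids running the string comparison on candidates that are already ruled out.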