ColBERT Models
Collection
ColBERT models fine-tuned for efficient retrieval. Optimized from pre-trained architectures for token-level similarity.
fjmgAI/col1-610M-EuroBERT
Base model: EuroBERT/EuroBERT-610m
The model was fine-tuned with PyLate using contrastive training on the rag-comprehensive-triplets dataset. It maps sentences and paragraphs to sequences of 128-dimensional dense vectors (one vector per token) and can be used for semantic textual similarity via the MaxSim operator.
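As a rough guide to what MaxSim computes, the score can be written as below. This is a sketch that follows the averaging used in the usage code further down this card; the symbols $\mathbf{q}_i$, $\mathbf{d}_j$, $N_q$ and $N_d$ are notation introduced here for illustration, not taken from the card. The original ColBERT formulation sums over query tokens instead of averaging, which gives the same ranking for a fixed query.

```latex
% q_i: embedding of the i-th query token, d_j: embedding of the j-th document token
% N_q, N_d: number of query and document tokens
\[
  \operatorname{score}(q, d)
  = \frac{1}{N_q} \sum_{i=1}^{N_q} \max_{1 \le j \le N_d} \mathbf{q}_i \cdot \mathbf{d}_j
\]
```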
Dataset: baconnier/rag-comprehensive-triplets
The rag-comprehensive-triplets dataset was filtered to its Spanish-language entries, leaving 303,000 examples.
Evaluator: pylate.evaluation.colbert_triplet.ColBERTTripletEvaluator
| Metric | Value |
|---|---|
| accuracy | 0.98417 |
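To make the accuracy figure easier to interpret, here is a minimal sketch of how a triplet accuracy of this kind can be computed. It assumes that accuracy means the fraction of (query, positive, negative) triplets in which the positive document scores higher than the negative one. The helper names `maxsim` and `triplet_accuracy` and the toy 2-dimensional embeddings are hypothetical and are not part of PyLate; the evaluator's actual implementation may differ.

```python
from typing import List, Sequence, Tuple

Tokens = Sequence[Sequence[float]]  # one embedding vector per token


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_emb: Tokens, doc_emb: Tokens) -> float:
    """Mean over query tokens of the best dot product with any document token."""
    best_per_query_token = [max(dot(q, d) for d in doc_emb) for q in query_emb]
    return sum(best_per_query_token) / len(query_emb)


def triplet_accuracy(triplets: List[Tuple[Tokens, Tokens, Tokens]]) -> float:
    """Fraction of triplets where the positive document outscores the negative one."""
    hits = sum(1 for q, pos, neg in triplets if maxsim(q, pos) > maxsim(q, neg))
    return hits / len(triplets)


# Hypothetical toy data: 2-dimensional token embeddings instead of 128-dimensional ones.
toy_triplets = [
    # Positive document aligns with the query tokens, negative does not: counted as a hit.
    ([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]], [[-1.0, 0.0], [0.0, -1.0]]),
    # Negative document matches the query better than the positive: counted as a miss.
    ([[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 0.0]]),
]

accuracy = triplet_accuracy(toy_triplets)
print(f"Triplet accuracy: {accuracy:.2f}")
```

Under this reading, the reported value would mean that the positive document outscored the negative one in roughly 98% of the evaluated triplets.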
First install the PyLate library:
pip install -U pylate
import torch
from pylate import models
# Load the ColBERT model
model = models.ColBERT("fjmgAI/col1-610M-EuroBERT", trust_remote_code=True)
# Move the model to GPU if available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
# Example data for similarity comparison
query = "¿Cuál es la capital de España?" # Query sentence
positive_doc = "La capital de España es Madrid." # Relevant document
negative_doc = "Florida es un estado en los Estados Unidos." # Irrelevant document
sentences = [query, positive_doc, negative_doc] # Combine all texts
# Tokenize the input sentences using ColBERT's tokenizer
inputs = model.tokenize(sentences)
# Move all input tensors to the same device as the model (GPU/CPU)
inputs = {key: value.to(device) for key, value in inputs.items()}
# Generate token embeddings (no gradients needed for inference)
with torch.no_grad():
    embeddings_dict = model(inputs)
    embeddings = embeddings_dict['token_embeddings']
# Define ColBERT's MaxSim similarity function
def colbert_similarity(query_emb, doc_emb):
    """
    Computes ColBERT-style similarity between query and document embeddings.
    Uses maximum similarity (MaxSim) between individual tokens.

    Args:
        query_emb: [query_tokens, embedding_dim]
        doc_emb: [doc_tokens, embedding_dim]

    Returns:
        Normalized similarity score
    """
    # Compute dot product between all token pairs
    similarity_matrix = torch.matmul(query_emb, doc_emb.T)
    # Get maximum similarity for each query token (MaxSim)
    max_similarities = similarity_matrix.max(dim=1)[0]
    # Return average of maximum similarities (normalized by query length)
    return max_similarities.sum() / query_emb.shape[0]
# Extract embeddings for each text
query_emb = embeddings[0]
positive_emb = embeddings[1]
negative_emb = embeddings[2]
# Compute similarity scores
positive_score = colbert_similarity(query_emb, positive_emb)
negative_score = colbert_similarity(query_emb, negative_emb)
print(f"Similarity with positive document: {positive_score.item():.4f}")
print(f"Similarity with negative document: {negative_score.item():.4f}")
This fine-tuned model is intended for Spanish-language applications that need efficient semantic search. It compares embeddings at the token level using the MaxSim operation, which makes it well suited to question answering and document retrieval.
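For the document-retrieval use case, the same MaxSim score can be used to order a set of candidate documents. The following is a minimal sketch under stated assumptions: token embeddings are taken as already computed, the helper names `maxsim` and `rank_documents` are hypothetical, and the toy 2-dimensional vectors stand in for the model's 128-dimensional output.

```python
from typing import Dict, List, Sequence, Tuple

Tokens = Sequence[Sequence[float]]  # one embedding vector per token


def maxsim(query_emb: Tokens, doc_emb: Tokens) -> float:
    """Mean over query tokens of the best dot product with any document token."""
    total = 0.0
    for q in query_emb:
        total += max(sum(x * y for x, y in zip(q, d)) for d in doc_emb)
    return total / len(query_emb)


def rank_documents(query_emb: Tokens, docs: Dict[str, Tokens]) -> List[Tuple[str, float]]:
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, maxsim(query_emb, emb)) for doc_id, emb in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical precomputed token embeddings (2-dimensional for readability).
query_emb = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "madrid": [[0.8, 0.2], [0.1, 0.9], [0.5, 0.5]],
    "florida": [[-0.3, 0.1], [0.2, -0.6]],
}

ranking = rank_documents(query_emb, docs)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.4f}")
```

In practice the embeddings would come from the model itself, and for large collections an index is normally used instead of scoring every document exhaustively.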
Base model
EuroBERT/EuroBERT-610m