Zeta-Alpha-E5-Mistral

We introduce Zeta Alpha's first public embedding model: a retrieval-specialized, 7B-parameter embedding model trained on top of E5-mistral-7b-instruct. It is the first model published as part of Zeta Alpha's open science embedding models.

Check out our blog post for a complete breakdown of the training set we used and all the training details: Zeta Alpha blog

We are also releasing our internal evaluation set, NanoBEIR: a collection of "Nano" subsets (50 queries and roughly 10k documents each), one per BEIR dataset.

LoRA Weights

The LoRA weights are also available, so there is no need to download the full model.

How to Run

The model was trained with the same instruction-tuning strategy as the original E5-mistral-7b-instruct model. Therefore, queries should be formatted as follows:

Instruct: <task description>\nQuery: <query>

Sentence Transformers


from sentence_transformers import SentenceTransformer

model = SentenceTransformer("zeta-alpha-ai/Zeta-Alpha-E5-Mistral")

def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'

task = "Given a claim about climate change, retrieve documents that support or refute the claim"
queries = [
    get_detailed_instruct(task, "In Alaska, brown bears are changing their feeding habits to eat elderberries that ripen earlier."),
    get_detailed_instruct(task, "Local and regional sea levels continue to exhibit typical natural variability—in some places rising and in others falling.")
]

passages = [
  "The brown bear ( Ursus arctos ) is a large bear with the widest distribution of any living ursid . The species is distributed across much of northern Eurasia and North America . It is one of the two largest terrestrial carnivorans alive today , rivaled in body size only by its close cousin , the polar bear ( Ursus maritimus ) , which is much less variable in size and averages larger due to this . There are several recognized subspecies , many of which are quite well-known within their native ranges , found in the brown bear species .   The brown bear 's principal range includes parts of Russia , Central Asia , China , Canada , the United States ( mostly Alaska ) , Scandinavia and the Carpathian region ( especially Romania ) , Anatolia , and Caucasus . The brown bear is recognized as a national and state animal in several European countries .   While the brown bear 's range has shrunk and it has faced local extinctions , it remains listed as a least concern species by the International Union for Conservation of Nature ( IUCN ) with a total population of approximately 200,000 . As of 2012 , this and the American black bear are the only bear species not classified as threatened by the IUCN . However , the Californian , North African ( Atlas bear ) , and Mexican subspecies were hunted to extinction in the nineteenth and early twentieth centuries , and many of the southern Asian subspecies are highly endangered . One of the smaller-bodied subspecies , the Himalayan brown bear , is critically endangered , occupying only 2 % of its former range and threatened by uncontrolled poaching for its parts . The Marsican brown bear , one of several currently isolated populations of the main Eurasian brown bear race , in central Italy is believed to have a population of just 30 to 40 bears .",
  "ean sea level ( MSL ) ( abbreviated simply sea level ) is an average level of the surface of one or more of Earth 's oceans from which heights such as elevations may be measured . MSL is a type of vertical datuma standardised geodetic reference pointthat is used , for example , as a chart datum in cartography and marine navigation , or , in aviation , as the standard sea level at which atmospheric pressure is measured in order to calibrate altitude and , consequently , aircraft flight levels . A common and relatively straightforward mean sea-level standard is the midpoint between a mean low and mean high tide at a particular location .   Sea levels can be affected by many factors and are known to have varied greatly over geological time scales . The careful measurement of variations in MSL can offer insights into ongoing climate change , and sea level rise has been widely quoted as evidence of ongoing global warming .   The term above sea level generally refers to above mean sea level ( AMSL ) ."
]

embeddings = model.encode(queries + passages)
scores = model.similarity(embeddings[:2], embeddings[2:]) * 100
print(scores.tolist())
# [[66.12603759765625, 43.760101318359375], [47.67058563232422, 63.7889518737793]]
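
The score matrix above has one row per query and one column per passage, so retrieval reduces to sorting each row. The following is a minimal sketch of that step in plain Python, using the scores printed above; the helper name `rank_passages` and its `top_k` parameter are illustrative assumptions, not part of the model or of Sentence Transformers.

```python
# Scores copied from the example output above:
# rows are queries, columns are passages.
scores = [
    [66.12603759765625, 43.760101318359375],
    [47.67058563232422, 63.7889518737793],
]


def rank_passages(score_rows, top_k=None):
    """Return, for each query, passage indices sorted by descending score.

    `rank_passages` is a hypothetical helper for illustration only.
    """
    rankings = []
    for row in score_rows:
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        rankings.append(order[:top_k])  # top_k=None keeps the full ordering
    return rankings


rankings = rank_passages(scores)
print(rankings)  # [[0, 1], [1, 0]]
```

Here the brown bear claim ranks the brown bear passage first, and the sea level claim ranks the sea level passage first. For a large corpus you would typically keep only the first few indices per query by passing `top_k`.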

Transformers

import torch
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def last_token_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'

task = "Given a claim about climate change, retrieve documents that support or refute the claim"
queries = [
    get_detailed_instruct(task, "In Alaska, brown bears are changing their feeding habits to eat elderberries that ripen earlier."),
    get_detailed_instruct(task, "Local and regional sea levels continue to exhibit typical natural variability—in some places rising and in others falling.")
]

passages = [
  "The brown bear ( Ursus arctos ) is a large bear with the widest distribution of any living ursid . The species is distributed across much of northern Eurasia and North America . It is one of the two largest terrestrial carnivorans alive today , rivaled in body size only by its close cousin , the polar bear ( Ursus maritimus ) , which is much less variable in size and averages larger due to this . There are several recognized subspecies , many of which are quite well-known within their native ranges , found in the brown bear species .   The brown bear 's principal range includes parts of Russia , Central Asia , China , Canada , the United States ( mostly Alaska ) , Scandinavia and the Carpathian region ( especially Romania ) , Anatolia , and Caucasus . The brown bear is recognized as a national and state animal in several European countries .   While the brown bear 's range has shrunk and it has faced local extinctions , it remains listed as a least concern species by the International Union for Conservation of Nature ( IUCN ) with a total population of approximately 200,000 . As of 2012 , this and the American black bear are the only bear species not classified as threatened by the IUCN . However , the Californian , North African ( Atlas bear ) , and Mexican subspecies were hunted to extinction in the nineteenth and early twentieth centuries , and many of the southern Asian subspecies are highly endangered . One of the smaller-bodied subspecies , the Himalayan brown bear , is critically endangered , occupying only 2 % of its former range and threatened by uncontrolled poaching for its parts . The Marsican brown bear , one of several currently isolated populations of the main Eurasian brown bear race , in central Italy is believed to have a population of just 30 to 40 bears .",
  "ean sea level ( MSL ) ( abbreviated simply sea level ) is an average level of the surface of one or more of Earth 's oceans from which heights such as elevations may be measured . MSL is a type of vertical datuma standardised geodetic reference pointthat is used , for example , as a chart datum in cartography and marine navigation , or , in aviation , as the standard sea level at which atmospheric pressure is measured in order to calibrate altitude and , consequently , aircraft flight levels . A common and relatively straightforward mean sea-level standard is the midpoint between a mean low and mean high tide at a particular location .   Sea levels can be affected by many factors and are known to have varied greatly over geological time scales . The careful measurement of variations in MSL can offer insights into ongoing climate change , and sea level rise has been widely quoted as evidence of ongoing global warming .   The term above sea level generally refers to above mean sea level ( AMSL ) ."
]

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("zeta-alpha-ai/Zeta-Alpha-E5-Mistral")
model = AutoModel.from_pretrained("zeta-alpha-ai/Zeta-Alpha-E5-Mistral")

# get the embeddings
max_length = 4096
input_texts = queries + passages
batch_dict = tokenizer(input_texts, max_length=max_length, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(**batch_dict)
embeddings = last_token_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:2] @ embeddings[2:].T) * 100
print(scores.tolist())
# [[66.15530395507812, 43.65541458129883], [47.681705474853516, 63.67986297607422]]
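
The `last_token_pool` function above takes the embedding from the last real token of each sequence, and where that token sits depends on the tokenizer's padding side. The sketch below illustrates the idea on a toy batch using plain Python lists, so it runs without PyTorch or a model download. It is a simplified, per-sequence illustration (the original function makes the left-padding decision once for the whole batch), and the name `last_token_pool_rows` is a hypothetical helper introduced here.

```python
def last_token_pool_rows(hidden_states, attention_mask):
    """Return, per sequence, the hidden state of its last non-padding token.

    hidden_states: list of sequences, each a list of per-token vectors.
    attention_mask: list of 0/1 lists, with 1 marking real tokens.
    Hypothetical helper for illustration; assumes one padding side per sequence.
    """
    pooled = []
    for states, mask in zip(hidden_states, attention_mask):
        if mask[-1] == 1:
            # Left-padded (or full-length) sequence: the final position is a real token.
            pooled.append(states[-1])
        else:
            # Right-padded sequence: the last real token is at index (real token count - 1).
            pooled.append(states[sum(mask) - 1])
    return pooled


# Toy batch: 2 sequences, 3 positions, 2-dimensional hidden states.
hidden = [
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],
    [[4.0, 4.0], [5.0, 5.0], [6.0, 6.0]],
]

right_padded = [[1, 1, 0], [1, 1, 1]]  # first sequence has one trailing pad
left_padded = [[0, 1, 1], [1, 1, 1]]   # first sequence has one leading pad

print(last_token_pool_rows(hidden, right_padded))  # [[2.0, 2.0], [6.0, 6.0]]
print(last_token_pool_rows(hidden, left_padded))   # [[3.0, 3.0], [6.0, 6.0]]
```

With right padding, simply taking the final position would return the hidden state of a pad token for the shorter sequence, which is why the pooling step consults the attention mask.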

Zeta Alpha

Zeta Alpha is the premier Neural Discovery Platform for AI and more. We are an Amsterdam-based R&D and product lab with a passion for AI technology, with offices on the Science Park campus of the University of Amsterdam and in San Francisco.

The Zeta Alpha Research team:

  • Arthur Câmara
  • Dinos Papakostas
  • Mathias Parisot
  • Fernando Rejon Barrera
  • Jakub Zavrel

Model size: 7.11B params (Safetensors, tensor type BF16)