misa-ai/MISA-Embeddings-v1
This is an LLM-based embedding model for document retrieval. It encodes queries or documents (up to 4096 tokens) into dense vectors with 1024 dimensions and is intended for question-answering semantic search.
Training Dataset
The training dataset was collected from several sources:
- MS MARCO (translated into Vietnamese)
- SQuAD v2 (translated into Vietnamese)
- UIT-ViQuAD 2.0
- ZaloQA 2021
- Web Crawl
- Private dataset
There are about 900k samples in total.
Usage (Sentence-Transformers)
Using this model is easy once you have sentence-transformers installed:
pip install -q sentence-transformers
Then you can use the model like this:
from sentence_transformers import SentenceTransformer
from torch.nn import functional as F

# Retrieval instruction (Vietnamese): "Given a search query, retrieve relevant documents that answer the query."
query_prompt = "Instruct: Đưa ra một truy vấn tìm kiếm, truy xuất các tài liệu có liên quan trả lời cho truy vấn.\nTruy vấn: "
query = query_prompt + "..."
documents = [
    "document1",
    "document2",
]

model = SentenceTransformer('misa-ai/MISA-Embeddings-v1')
model.eval()

# Encode the query and the documents into normalized 1024-dimensional vectors
query_embeddings = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)
document_embeddings = model.encode(documents, convert_to_tensor=True, normalize_embeddings=True)

# Cosine similarity between the query and each document
sim_scores = F.cosine_similarity(query_embeddings, document_embeddings)
print(sim_scores)
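Because the embeddings are normalized, cosine similarity equals the dot product, so the scores can be used directly for ranking. As a small follow-up sketch (the top_k value is a placeholder, not part of the original example), the best-matching documents can be listed like this:

import torch

# Rank the documents by similarity to the query (top_k is a placeholder)
top_k = min(2, len(documents))
top_scores, top_idx = torch.topk(sim_scores, k=top_k)
for score, idx in zip(top_scores.tolist(), top_idx.tolist()):
    print(f"{score:.4f}\t{documents[idx]}")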
Usage (HuggingFace Transformers)
You can also use the model with transformers by applying last-token pooling on top of the contextualized token embeddings.
from transformers import AutoTokenizer, AutoModel
import torch
from torch.nn import functional as F

# Last-token pooling: with left padding, the final position of every sequence is a real token
def last_token_pooling(model_output):
    token_embeddings = model_output.last_hidden_state
    return token_embeddings[:, -1, :]

# Retrieval instruction (Vietnamese): "Given a search query, retrieve relevant documents that answer the query."
query_prompt = "Instruct: Đưa ra một truy vấn tìm kiếm, truy xuất các tài liệu có liên quan trả lời cho truy vấn.\nTruy vấn: "
inputs = [
    query_prompt + "...",  # the query goes first
    "document1",
    "document2",
]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('misa-ai/MISA-Embeddings-v1')
model = AutoModel.from_pretrained('misa-ai/MISA-Embeddings-v1')
model.eval()

# Left padding is required so that the last position always holds a real token
if tokenizer.padding_side != "left":
    tokenizer.padding_side = "left"

# Tokenize sentences
encoded_inputs = tokenizer(inputs, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_inputs)
    embeddings = last_token_pooling(model_output)

# Normalize, then score the query (first row) against each document
vecs = F.normalize(embeddings)
sim_scores = F.cosine_similarity(vecs[:1], vecs[1:])
print(sim_scores)
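The model accepts inputs of up to 4096 tokens. If your documents may exceed that, you can make the truncation limit explicit when tokenizing (the max_length=4096 below simply mirrors the stated maximum and is not part of the original example):

# Explicitly cap inputs at the model's stated 4096-token limit
encoded_inputs = tokenizer(
    inputs,
    padding=True,
    truncation=True,
    max_length=4096,
    return_tensors='pt',
)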
Training
Continual SFT
The model was trained with the parameters:
{'batch_size': 32, 'sampler': None, 'batch_sampler': None, 'shuffle': True}
Loss:
- Custom Multiple Negatives Loss
- Custom Ranking Loss
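The custom losses themselves are not published. As an illustrative sketch only, a standard in-batch multiple negatives ranking loss (which this kind of custom loss typically builds on) can be written as follows; the function name and the scale value are assumptions, not the model's actual training code.

import torch
from torch.nn import functional as F

def multiple_negatives_ranking_loss(query_emb, doc_emb, scale=20.0):
    # query_emb, doc_emb: (batch, dim); doc_emb[i] is the positive for query_emb[i],
    # and every other document in the batch serves as an in-batch negative.
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    scores = query_emb @ doc_emb.T * scale  # (batch, batch) cosine similarities
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)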
Training Parameters
- epochs: 2
- optimizer: AdamW
- learning_rate: 2e-05
- scheduler: Warmup Linear Scheduler
- warmup_steps: 10000
- weight_decay: 0.001
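For illustration only, these hyperparameters could be wired together in PyTorch roughly as follows; this is not the actual training script, num_training_steps is a placeholder, and model refers to the model object loaded earlier.

import torch
from transformers import get_linear_schedule_with_warmup

# Illustrative values taken from the list above
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.001)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=10000,
    num_training_steps=100000,  # placeholder: depends on dataset size, batch size, and epochs
)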
Base Model
- Alibaba-NLP/gte-Qwen2-1.5B-instruct