misa-ai/MISA-Embeddings-v1

This is an LLM-based embedding model (1.54B parameters) for document retrieval. It encodes queries or documents of up to 4096 tokens into 1024-dimensional dense vectors, and is intended for QA semantic search.

Training Dataset

The training dataset was collected from various sources:

  • MS MARCO (translated into Vietnamese)
  • SQuAD v2 (translated into Vietnamese)
  • UIT-ViQuAD 2.0
  • ZaloQA 2021
  • Web Crawl
  • Private dataset

There are about 900k samples in total.

Usage (Sentence-Transformers)

Using this model is straightforward once you have sentence-transformers installed:

pip install -q sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer
from torch.nn import functional as F

query_prompt = "Instruct: Đưa ra một truy vấn tìm kiếm, truy xuất các tài liệu có liên quan trả lời cho truy vấn.\nTruy vấn: "

query = query_prompt + "..."
documents = [
  "document1",
  "document2",
]

model = SentenceTransformer('misa-ai/MISA-Embeddings-v1')
model.eval()

# Encode to 1024-dimensional vectors; normalize_embeddings=True L2-normalizes them
query_embeddings = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)
document_embeddings = model.encode(documents, convert_to_tensor=True, normalize_embeddings=True)

sim_scores = F.cosine_similarity(query_embeddings, document_embeddings)
print(sim_scores)
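
Since the embeddings are L2-normalized, higher cosine scores mean more relevant documents. As a quick follow-up sketch (using the variables from the snippet above; not part of the original example), you can rank the documents by score:

import torch

# Sort documents by similarity to the query, highest first
ranked = torch.argsort(sim_scores, descending=True)
for idx in ranked.tolist():
    print(f"score={sim_scores[idx].item():.4f}  doc={documents[idx]}")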

Usage (HuggingFace Transformers)

You can also use the model with transformers by applying last-token pooling on top of the contextualized token embeddings.

from transformers import AutoTokenizer, AutoModel
import torch
from torch.nn import functional as F

# Last-token pooling: take the hidden state at the final position.
# This assumes left padding, so position -1 always holds a real token.
def last_token_pooling(model_output):
    token_embeddings = model_output.last_hidden_state
    return token_embeddings[:, -1, :]

query_prompt = "Instruct: Đưa ra một truy vấn tìm kiếm, truy xuất các tài liệu có liên quan trả lời cho truy vấn.\nTruy vấn: "

inputs= [
  query = query_prompt + "...",
  "document1",
  "document2",
]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('misa-ai/MISA-Embeddings-v1')
model = AutoModel.from_pretrained('misa-ai/MISA-Embeddings-v1')
model.eval()

# Last-token pooling requires left padding
if tokenizer.padding_side != "left":
    tokenizer.padding_side = "left"

# Tokenize sentences
encoded_inputs = tokenizer(inputs, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_inputs)
    embeddings = last_token_pooling(model_output)

# L2-normalize so cosine similarity reduces to a dot product
vecs = F.normalize(embeddings, p=2, dim=1)
sim_scores = F.cosine_similarity(vecs[:1], vecs[1:])
print(sim_scores)
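
Last-token pooling works here only because padding is on the left, so the final position always holds a real token. If you keep the tokenizer's default right padding instead, a mask-aware variant that indexes the last non-padding token is a common alternative; this is a sketch, not part of the original card:

def last_token_pooling_masked(last_hidden_state, attention_mask):
    # Index of the last non-padding token in each sequence
    seq_lens = attention_mask.sum(dim=1) - 1
    batch_idx = torch.arange(last_hidden_state.size(0), device=last_hidden_state.device)
    return last_hidden_state[batch_idx, seq_lens]

Call it as last_token_pooling_masked(model_output.last_hidden_state, encoded_inputs['attention_mask']) in place of last_token_pooling above.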

Training

Continual SFT

The model was trained with the parameters:

{'batch_size': 32, 'sampler': None, 'batch_sampler': None, 'shuffle': True}
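
These keys mirror PyTorch DataLoader arguments. A minimal equivalent construction (train_dataset is a hypothetical placeholder; the actual data pipeline is not published):

from torch.utils.data import DataLoader

# train_dataset is a placeholder for the (partly private) training corpus
train_dataloader = DataLoader(train_dataset, batch_size=32, shuffle=True)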

Loss:

Custom Multiple Negatives Loss + Custom Ranking Loss
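
The exact losses are custom and not published. For orientation only, a standard in-batch multiple-negatives (InfoNCE-style) loss, which the first component presumably resembles, looks like this sketch:

import torch
from torch.nn import functional as F

def multiple_negatives_loss(query_vecs, doc_vecs, scale=20.0):
    # query_vecs, doc_vecs: (batch, dim), L2-normalized.
    # The positive pair is (i, i); every other document in the batch is a negative.
    scores = query_vecs @ doc_vecs.T * scale  # (batch, batch) similarity matrix
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)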

Training Parameters

  • epochs: 2
  • optimizer: AdamW (setup sketched after this list)
  • learning_rate: 2e-05
  • scheduler: Warmup Linear Scheduler
  • warmup_steps: 10000
  • weight_decay": 0.001

