misa-ai/MISA-Embeddings-v1
This is an LLM-based embedding model for document retrieval. It encodes queries or documents (up to 4096 tokens) into dense vectors with 1024 dimensions and is intended for question-answering semantic search.
Training Dataset
The training dataset was collected from several sources:
- MS MARCO (translated into Vietnamese)
- SQuAD v2 (translated into Vietnamese)
- UIT-ViQuAD 2.0
- ZaloQA 2021
- Web Crawl
- Private dataset
There are about 900k samples in total.
Usage (Sentence-Transformers)
Using this model is easy once you have sentence-transformers installed:
pip install -q sentence-transformers
Then you can use the model like this:
from sentence_transformers import SentenceTransformer
from torch.nn import functional as F

# Retrieval instruction (Vietnamese): "Given a search query, retrieve relevant documents that answer the query."
query_prompt = "Instruct: Đưa ra một truy vấn tìm kiếm, truy xuất các tài liệu có liên quan trả lời cho truy vấn.\nTruy vấn: "
query = query_prompt + "..."
documents = [
    "document1",
    "document2",
]

model = SentenceTransformer('misa-ai/MISA-Embeddings-v1')
model.eval()

# Encode the query and the documents into normalized 1024-dimensional vectors
query_embeddings = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)
document_embeddings = model.encode(documents, convert_to_tensor=True, normalize_embeddings=True)

# Cosine similarity between the query and each document
sim_scores = F.cosine_similarity(query_embeddings, document_embeddings)
print(sim_scores)
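Because the embeddings are normalized, cosine similarity equals the dot product, so the scores can be used directly for ranking. As a small follow-up sketch (the top_k value is a placeholder, not part of the original example), the best-matching documents can be listed like this:

import torch

# Rank the documents by similarity to the query (top_k is a placeholder)
top_k = min(2, len(documents))
top_scores, top_idx = torch.topk(sim_scores, k=top_k)
for score, idx in zip(top_scores.tolist(), top_idx.tolist()):
    print(f"{score:.4f}\t{documents[idx]}")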
Usage (HuggingFace Transformers)
You can also use the model with transformers by applying last-token pooling on top of the contextualized token embeddings.
from transformers import AutoTokenizer, AutoModel
import torch
from torch.nn import functional as F

# Last-token pooling: with left padding, the final position of every sequence is a real token
def last_token_pooling(model_output):
    token_embeddings = model_output.last_hidden_state
    return token_embeddings[:, -1, :]

# Retrieval instruction (Vietnamese): "Given a search query, retrieve relevant documents that answer the query."
query_prompt = "Instruct: Đưa ra một truy vấn tìm kiếm, truy xuất các tài liệu có liên quan trả lời cho truy vấn.\nTruy vấn: "
inputs = [
    query_prompt + "...",  # the query goes first
    "document1",
    "document2",
]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('misa-ai/MISA-Embeddings-v1')
model = AutoModel.from_pretrained('misa-ai/MISA-Embeddings-v1')
model.eval()

# Left padding is required so that the last position always holds a real token
if tokenizer.padding_side != "left":
    tokenizer.padding_side = "left"

# Tokenize sentences
encoded_inputs = tokenizer(inputs, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_inputs)
    embeddings = last_token_pooling(model_output)

# Normalize, then score the query (first row) against each document
vecs = F.normalize(embeddings)
sim_scores = F.cosine_similarity(vecs[:1], vecs[1:])
print(sim_scores)
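The model accepts inputs of up to 4096 tokens. If your documents may exceed that, you can make the truncation limit explicit when tokenizing (the max_length=4096 below simply mirrors the stated maximum and is not part of the original example):

# Explicitly cap inputs at the model's stated 4096-token limit
encoded_inputs = tokenizer(
    inputs,
    padding=True,
    truncation=True,
    max_length=4096,
    return_tensors='pt',
)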
Training
Continual SFT
The model was trained with the parameters:
{'batch_size': 32, 'sampler': None, 'batch_sampler': None, 'shuffle': True}
Loss:
- Custom Multiple Negatives Loss
- Custom Ranking Loss
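The custom losses themselves are not published. As an illustrative sketch only, a standard in-batch multiple negatives ranking loss (which this kind of custom loss typically builds on) can be written as follows; the function name and the scale value are assumptions, not the model's actual training code.

import torch
from torch.nn import functional as F

def multiple_negatives_ranking_loss(query_emb, doc_emb, scale=20.0):
    # query_emb, doc_emb: (batch, dim); doc_emb[i] is the positive for query_emb[i],
    # and every other document in the batch serves as an in-batch negative.
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    scores = query_emb @ doc_emb.T * scale  # (batch, batch) cosine similarities
    labels = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, labels)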
Training Parameters
- epochs: 2
- optimizer: AdamW
- learning_rate: 2e-05
- scheduler: Warmup Linear Scheduler
- warmup_steps: 10000
- weight_decay: 0.001
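For illustration only, these hyperparameters could be wired together in PyTorch roughly as follows; this is not the actual training script, num_training_steps is a placeholder, and model refers to the model object loaded earlier.

import torch
from transformers import get_linear_schedule_with_warmup

# Illustrative values taken from the list above
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.001)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=10000,
    num_training_steps=100000,  # placeholder: depends on dataset size, batch size, and epochs
)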
Base Model
- Alibaba-NLP/gte-Qwen2-1.5B-instruct