GLuCoSE v2
GLuCoSE v2 is a general-purpose Japanese text embedding model that excels at retrieval tasks. It can run on CPU and is designed both to measure semantic similarity between sentences and to serve as a retrieval system that searches passages based on queries.
Key features:
- Specialized for retrieval tasks: it achieves the highest performance among models of similar size on MIRACL and other tasks.
- Optimized for Japanese text processing
- Can run on CPU
During inference, the prefix "query: " or "passage: " is required. Please check the Usage section for details.
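As a minimal sketch of the prefix rule, the helper below prepends the required prefix to raw texts before they are encoded. The function name `add_prefix` and its `kind` parameter are hypothetical and are not part of the model's API.

```python
def add_prefix(texts, kind="query"):
    """Prepend "query: " or "passage: " to each text (hypothetical helper)."""
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    prefix = f"{kind}: "
    # Leave texts that already carry the prefix untouched.
    return [t if t.startswith(prefix) else prefix + t for t in texts]


queries = add_prefix(["日本で一番高い山は?"], kind="query")
passages = add_prefix(["富士山は日本最高峰の独立峰です。"], kind="passage")
print(queries[0])  # query: 日本で一番高い山は?
```

The resulting lists can then be passed to the encoding code shown in the Usage section.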
Model Description
The model is based on GLuCoSE and fine-tuned through distillation using several large-scale embedding models and multi-stage contrastive learning.
- Maximum Sequence Length: 512 tokens
- Output Dimensionality: 768 dimensions
- Similarity Function: Cosine Similarity
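For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below uses toy 3-dimensional vectors chosen for illustration; real embeddings from this model have 768 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


a = [1.0, 2.0, 2.0]
b = [2.0, 4.0, 4.0]   # same direction as a
c = [2.0, -1.0, 0.0]  # orthogonal to a
print(cosine_similarity(a, b))  # 1.0
print(cosine_similarity(a, c))  # 0.0
```

Because the score depends only on direction, scaling an embedding does not change its similarity to other embeddings.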
Usage
Direct Usage (Sentence Transformers)
You can perform inference using SentenceTransformer with the following code:
from sentence_transformers import SentenceTransformer
import torch.nn.functional as F
# Download from the 🤗 Hub
model = SentenceTransformer("pkshatech/GLuCoSE-base-ja-v2")
# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
sentences = [
'query: PKSHAはどんな会社ですか?',
'passage: 研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。',
'query: 日本で一番高い山は?',
'passage: 富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。',
]
embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.shape)
# [4, 768]
# Get the similarity scores for the embeddings
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
# [[1.0000, 0.6050, 0.4341, 0.5537],
# [0.6050, 1.0000, 0.5018, 0.6815],
# [0.4341, 0.5018, 1.0000, 0.7534],
# [0.5537, 0.6815, 0.7534, 1.0000]]
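To use these scores for retrieval, one option is to rank the passages for each query by their similarity. The sketch below copies the printed matrix as plain Python lists so it runs without the model; the helper name `rank_passages` is hypothetical.

```python
# Similarity matrix copied from the printed output
# (rows and columns follow the order of `sentences`).
similarities = [
    [1.0000, 0.6050, 0.4341, 0.5537],
    [0.6050, 1.0000, 0.5018, 0.6815],
    [0.4341, 0.5018, 1.0000, 0.7534],
    [0.5537, 0.6815, 0.7534, 1.0000],
]
query_idx = [0, 2]    # positions of the "query: " sentences
passage_idx = [1, 3]  # positions of the "passage: " sentences


def rank_passages(sim, q, passages):
    """Return passage indices sorted from most to least similar to query q."""
    return sorted(passages, key=lambda p: sim[q][p], reverse=True)


for q in query_idx:
    print(q, rank_passages(similarities, q, passage_idx))
# 0 [1, 3]
# 2 [3, 1]
```

Each query ranks its matching passage first: the PKSHA question prefers the PKSHA description, and the mountain question prefers the Mt. Fuji passage.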
Direct Usage (Transformers)
You can perform inference using Transformers with the following code:
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel
def mean_pooling(last_hidden_states: Tensor, attention_mask: Tensor) -> Tensor:
    # Zero out padding positions, then average over the real tokens only.
    emb = last_hidden_states * attention_mask.unsqueeze(-1)
    emb = emb.sum(dim=1) / attention_mask.sum(dim=1).unsqueeze(-1)
    return emb
# Download from the 🤗 Hub
tokenizer = AutoTokenizer.from_pretrained("pkshatech/GLuCoSE-base-ja-v2")
model = AutoModel.from_pretrained("pkshatech/GLuCoSE-base-ja-v2")
# Each input text should start with "query: " or "passage: ".
# For tasks other than retrieval, you can simply use the "query: " prefix.
sentences = [
'query: PKSHAはどんな会社ですか?',
'passage: 研究開発したアルゴリズムを、多くの企業のソフトウエア・オペレーションに導入しています。',
'query: 日本で一番高い山は?',
'passage: 富士山(ふじさん)は、標高3776.12 m、日本最高峰(剣ヶ峰)の独立峰で、その優美な風貌は日本国外でも日本の象徴として広く知られている。',
]
# Tokenize the input texts
batch_dict = tokenizer(sentences, max_length=512, padding=True, truncation=True, return_tensors='pt')
outputs = model(**batch_dict)
embeddings = mean_pooling(outputs.last_hidden_state, batch_dict['attention_mask'])
print(embeddings.shape)
# [4, 768]
# Get the similarity scores for the embeddings
similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
# [[1.0000, 0.6050, 0.4341, 0.5537],
# [0.6050, 1.0000, 0.5018, 0.6815],
# [0.4341, 0.5018, 1.0000, 0.7534],
# [0.5537, 0.6815, 0.7534, 1.0000]]
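The `mean_pooling` helper averages token vectors while ignoring padding positions. The pure-Python sketch below reproduces that computation for a single sequence on made-up toy numbers; the function name `masked_mean` is hypothetical.

```python
def masked_mean(token_vectors, attention_mask):
    """Average the token vectors whose mask value is 1 (toy illustration)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    for vec, m in zip(token_vectors, attention_mask):
        for i in range(dim):
            total[i] += vec[i] * m  # padding rows contribute zero
    count = sum(attention_mask)     # number of real tokens
    return [x / count for x in total]


tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # last row is padding
mask = [1, 1, 0]
print(masked_mean(tokens, mask))  # [2.0, 3.0]
```

The padding row does not affect the result, which is why padded batches give the same embedding as unpadded input.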
Training Details
GLuCoSE v2 was fine-tuned in the following steps:
Step 1: Ensemble distillation
- Embedding representations were distilled using E5-mistral, gte-Qwen2, and mE5-large as teacher models.
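The model card does not state which distillation objective was used. As one possible formulation, offered purely as an assumption, the sketch below matches the student's pairwise cosine-similarity matrix to the average of the teachers' matrices. This works even when the teachers and the student have different embedding sizes. All function names and toy values are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def similarity_matrix(embeddings):
    """Pairwise cosine similarities within one batch of embeddings."""
    return [[cosine(a, b) for b in embeddings] for a in embeddings]


def ensemble_distillation_loss(student_emb, teacher_embs):
    """MSE between the student's similarity matrix and the teachers' mean
    matrix (assumed objective, not confirmed by the source)."""
    student_sim = similarity_matrix(student_emb)
    teacher_sims = [similarity_matrix(t) for t in teacher_embs]
    n = len(student_emb)
    loss = 0.0
    for i in range(n):
        for j in range(n):
            target = sum(t[i][j] for t in teacher_sims) / len(teacher_sims)
            loss += (student_sim[i][j] - target) ** 2
    return loss / (n * n)


# Toy batch of two texts; the two teachers use different embedding sizes.
teacher_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
teacher_b = [[1.0, 0.0], [0.0, 1.0]]
student_matching = [[0.0, 2.0], [3.0, 0.0]]   # orthogonal, like the teachers
student_collapsed = [[1.0, 1.0], [1.0, 1.0]]  # both texts map to one vector
print(ensemble_distillation_loss(student_matching, [teacher_a, teacher_b]))             # 0.0
print(round(ensemble_distillation_loss(student_collapsed, [teacher_a, teacher_b]), 6))  # 0.5
```

A student that preserves the teachers' similarity structure gets zero loss, while a collapsed student is penalized.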
Step 2: Contrastive learning
- Triplets were created from JSNLI, MNLI, PAWS-X, JSeM and Mr.TyDi and used for training.
- This training aimed to improve the overall performance as a sentence embedding model.
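The exact contrastive loss is not given in the model card. A common choice for triplets is an InfoNCE-style loss with a temperature, shown below as an assumption rather than the authors' implementation. The function names and the temperature value of 0.05 are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def triplet_contrastive_loss(anchor, positive, negative, temperature=0.05):
    """InfoNCE-style loss for one (anchor, positive, negative) triplet.
    Assumed form; the source does not specify the loss."""
    s_pos = cosine(anchor, positive) / temperature
    s_neg = cosine(anchor, negative) / temperature
    # Negative log-softmax of the positive, computed stably via log-sum-exp.
    m = max(s_pos, s_neg)
    log_denominator = m + math.log(math.exp(s_pos - m) + math.exp(s_neg - m))
    return log_denominator - s_pos


anchor = [1.0, 0.0]
close = [0.9, 0.1]  # points nearly the same way as the anchor
far = [0.0, 1.0]    # orthogonal to the anchor
good = triplet_contrastive_loss(anchor, close, far)
bad = triplet_contrastive_loss(anchor, far, close)
print(good < bad)  # True
```

The loss is small when the anchor is closer to the positive than to the negative, and large when the order is reversed.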
Step 3: Search-specific contrastive learning
- To make the model more robust on retrieval tasks, an additional two-stage training on QA and retrieval data was conducted.
- In the first stage, the synthetic dataset auto-wiki-qa was used for training, while in the second stage, JQaRA, MQA, Japanese Wikipedia Human Retrieval, Mr.TyDi, MIRACL, Quiz Works and Quiz No Mori were used.
Benchmarks
Retrieval
Evaluated with MIRACL-ja, JQaRA, JaCWIR and MLDR-ja.
Model | Size | MIRACL Recall@5 | JQaRA nDCG@10 | JaCWIR MAP@10 | MLDR nDCG@10 |
---|---|---|---|---|---|
intfloat/multilingual-e5-large | 0.6B | 89.2 | 55.4 | 87.6 | 29.8 |
cl-nagoya/ruri-large | 0.3B | 78.7 | 62.4 | 85.0 | 37.5 |
intfloat/multilingual-e5-base | 0.3B | 84.2 | 47.2 | 85.3 | 25.4 |
cl-nagoya/ruri-base | 0.1B | 74.3 | 58.1 | 84.6 | 35.3 |
pkshatech/GLuCoSE-base-ja | 0.1B | 53.3 | 30.8 | 68.6 | 25.2 |
GLuCoSE v2 | 0.1B | 85.5 | 60.6 | 85.3 | 33.8 |
Note: Results for OpenAI small embeddings in JQaRA and JaCWIR are quoted from JQaRA and JaCWIR.
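For readers unfamiliar with the metrics in the table header, the sketch below shows common definitions of Recall@k and nDCG@k with binary relevance. It is only an illustration: the official evaluation scripts of each benchmark may differ in detail. The document IDs are made up, and the functions return fractions, while the table reports percentages.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant documents found in the top k results."""
    hits = sum(1 for doc in ranked_ids[:k] if doc in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Normalized discounted cumulative gain with binary relevance."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc in enumerate(ranked_ids[:k])
        if doc in relevant_ids
    )
    # Ideal ranking: all relevant documents placed at the top.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg


ranked = ["d3", "d1", "d7", "d2", "d9"]  # hypothetical system output
relevant = {"d1", "d2"}                  # hypothetical gold labels
print(recall_at_k(ranked, relevant, k=5))          # 1.0
print(recall_at_k(ranked, relevant, k=3))          # 0.5
print(round(ndcg_at_k(ranked, relevant, k=5), 4))  # 0.6509
```

Recall@k only checks whether relevant documents appear in the top k, whereas nDCG@k also rewards placing them nearer the top.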
JMTEB
Evaluated with JMTEB. The average score is a macro-average.
Model | Size | Avg. | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
---|---|---|---|---|---|---|---|---|
OpenAI/text-embedding-3-small | - | 69.18 | 66.39 | 79.46 | 73.06 | 92.92 | 51.06 | 62.27 |
OpenAI/text-embedding-3-large | - | 74.05 | 74.48 | 82.52 | 77.58 | 93.58 | 53.32 | 62.35 |
intfloat/multilingual-e5-large | 0.6B | 70.90 | 70.98 | 79.70 | 72.89 | 92.96 | 51.24 | 62.15 |
cl-nagoya/ruri-large | 0.3B | 73.31 | 73.02 | 83.13 | 77.43 | 92.99 | 51.82 | 62.29 |
intfloat/multilingual-e5-base | 0.3B | 68.61 | 68.21 | 79.84 | 69.30 | 92.85 | 48.26 | 62.26 |
cl-nagoya/ruri-base | 0.1B | 71.91 | 69.82 | 82.87 | 75.58 | 92.91 | 54.16 | 62.38 |
pkshatech/GLuCoSE-base-ja | 0.1B | 67.29 | 59.02 | 78.71 | 76.82 | 91.90 | 49.78 | 66.39 |
GLuCoSE v2 | 0.1B | 72.23 | 73.36 | 82.96 | 74.21 | 93.01 | 48.65 | 62.37 |
Note: Results for OpenAI embeddings and multilingual-e5 models are quoted from the JMTEB leaderboard. Results for ruri are quoted from the cl-nagoya/ruri-base model card.
Authors
Chihiro Yano, Mocho Go, Hideyuki Tachibana, Hiroto Takegawa, Yotaro Watanabe
License
This model is published under the Apache License, Version 2.0.