Conan-Embedding-v2
What's New?
Performance
Conan-Embedding-v2 has now achieved SOTA performance on the MTEB leaderboard for both Chinese and English.
Cross-lingual Retrieval between Chinese and English
Conan-Embedding-v2 supports cross-lingual retrieval between Chinese and English texts.
Longer Context Support
Conan-Embedding-v2 now supports a context length of 32,768 tokens.
Conan 1.4B Large Model Trained from Scratch
Both the vocabulary and the underlying language model were trained from scratch, yielding a pre-trained model and tokenizer tailored to the Embedding scenario and delivering stronger performance.
The Conan-1.4B base model will be open-sourced, so the community can train their own Embedding models on top of it.
Performance
Performance of Conan-Embedding-v2 on MTEB for Chinese and English
English
Embedding Task / Metric | Class. Acc. (12) | Clust. V-Meas. (11) | PairClass. AP (3) | Rerank. MAP (4) | Retri. nDCG@10 (15) | STS Spear. (12) | Summ. Spear. (1) | Avg. (56) |
---|---|---|---|---|---|---|---|---|
bge-multilingual-gemma2 | 88.08 | 54.65 | 85.97 | 59.72 | 59.24 | 83.88 | 31.20 | 69.88 |
e5-mistral-7b-instruct | 79.89 | 51.44 | 88.42 | 49.78 | 57.62 | 84.32 | 36.57 | 67.98 |
gte-Qwen2-7B-instruct | 86.58 | 56.92 | 85.90 | 61.42 | 59.11 | 83.06 | 31.35 | 69.95 |
stella-en-1.5B-v5 | 87.63 | 57.69 | 88.07 | 61.21 | 61.01 | 84.51 | 31.49 | 71.19 |
bge-en-icl | 88.95 | 57.89 | 88.14 | 59.86 | 62.16 | 84.24 | 30.77 | 71.67 |
NV-Embed-v2 | 90.37 | 58.46 | 88.67 | 60.65 | 62.65 | 84.31 | 30.70 | 72.31 |
Conan-embedding-v2 | 89.98 | 60.86 | 93.47 | 60.89 | 66.40 | 85.73 | 28.08 | 74.19 |
Chinese
Embedding Task / Metric | Class. Acc. (9) | Clust. V-Meas. (4) | PairClass. AP (2) | Rerank. MAP (4) | Retri. nDCG@10 (8) | STS Spear. (8) | Avg. (35) |
---|---|---|---|---|---|---|---|
e5-mistral-7b-instruct | 70.47 | 52.30 | 72.19 | 61.86 | 61.75 | 50.22 | 60.89 |
gte-Qwen2-1.5B-instruct | 71.12 | 54.61 | 86.91 | 68.21 | 71.86 | 60.96 | 67.65 |
bge-multilingual-gemma2 | 74.11 | 59.30 | 86.67 | 68.28 | 73.73 | 56.87 | 68.44 |
gte-Qwen2-7B-instruct | 75.00 | 66.06 | 87.48 | 68.92 | 75.71 | 65.27 | 71.94 |
xiaobu-embedding-v2 | 74.67 | 65.17 | 91.87 | 72.58 | 76.50 | 64.53 | 72.43 |
Conan-embedding-v1 | 75.03 | 66.33 | 91.66 | 72.76 | 76.67 | 64.18 | 72.62 |
Conan-embedding-v2 | 74.70 | 68.84 | 92.44 | 74.41 | 78.31 | 66.47 | 73.95 |
Model Detail
Model Structure
Conan-Embedding-v2 Structure:
```
SentenceTransformer(
  (0): Transformer({
      'max_seq_length': 32768,
      'do_lower_case': False
  }) with Transformer model: ConanEmbedModel,
  (1): Pooling({
      'word_embedding_dimension': 3584,
      'pooling_mode_cls_token': False,
      'pooling_mode_mean_tokens': True,
      'pooling_mode_max_tokens': False,
      'pooling_mode_mean_sqrt_len_tokens': False,
      'pooling_mode_weightedmean_tokens': False,
      'pooling_mode_lasttoken': False,
      'include_prompt': True
  }),
  (2): Dense({
      'in_features': 3584,
      'out_features': 3584,
      'bias': True,
      'activation_function': 'torch.nn.modules.linear.Identity'
  })
)
```
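The stack above is what sentence-transformers rebuilds at load time: the Conan Transformer encoder, mean pooling over token embeddings, and an identity Dense projection. A minimal loading sketch, assuming the released checkpoint keeps this structure; the repo ID below is illustrative, not confirmed:

```python
from sentence_transformers import SentenceTransformer

# "TencentBAC/Conan-embedding-v2" is a hypothetical repo ID for illustration;
# substitute the actual one once the model is released.
model = SentenceTransformer("TencentBAC/Conan-embedding-v2", trust_remote_code=True)

# Mixed Chinese/English input exercises the shared cross-lingual space.
sentences = ["What is the weather like today?", "今天天气怎么样？"]
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape)  # (2, 3584): mean-pooled, Dense-projected vectors
```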
Key Specifications of Conan-1.4B (Transformer):
Number of Parameters (excluding the Dense layer): 1.48B
Vocabulary Size: 150,000
Number of Layers: 8
Hidden Layer Dimension: 3584
Number of Attention Heads (GQA): 32 for Q, 8 for KV
Intermediate Dimension of FFN Layer: 8192
Maximum Context Window: 32,768 Tokens
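These dimensions are consistent with the stated parameter count. A back-of-the-envelope check, assuming a SwiGLU-style gated FFN and untied embeddings (both are assumptions, not confirmed by this card):

```python
# Rough parameter estimate from the published dimensions.
# Assumptions (not confirmed): SwiGLU-gated FFN (3 projection matrices),
# untied input embeddings; biases and norms ignored as negligible.
vocab, hidden, layers, ffn = 150_000, 3584, 8, 8192
q_heads, kv_heads = 32, 8
head_dim = hidden // q_heads                 # 112

embed = vocab * hidden                       # ~537.6M
attn = hidden * hidden * 2                   # Q and O projections
attn += hidden * (kv_heads * head_dim) * 2   # grouped K and V projections
mlp = 3 * hidden * ffn                       # gate, up, down
total = embed + layers * (attn + mlp)
print(f"{total / 1e9:.2f}B")                 # ~1.50B, close to the stated 1.48B
```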
For more model details, please refer to model/modeling_conan.py and config.json, or stay tuned for the upcoming open-source release of the Conan-1.4B base model.
Tokenizer
We trained the tokenizer on a large-scale multilingual dataset to build a standard BBPE (Byte-level Byte-Pair Encoding) tokenizer with a vocabulary size of 150,000.
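For reference, a byte-level BPE tokenizer of this kind can be trained with the Hugging Face tokenizers library. This is an illustrative sketch only; the corpus paths are placeholders, and the actual Conan training setup has not been published:

```python
import os

from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer; corpus files are placeholders.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus_zh.txt", "corpus_en.txt"],  # placeholder corpora
    vocab_size=150_000,
    min_frequency=2,
)

os.makedirs("conan_tokenizer", exist_ok=True)
tokenizer.save_model("conan_tokenizer")  # writes vocab.json and merges.txt
```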
Technical Report
We will soon release our technical report.
Using Conan-Embedding-v2
Use /model/conan_api_client.py to access our test API. A sample call is as follows:
```python
import os

# The client ships in /model/conan_api_client.py, as noted above.
from conan_api_client import ConanClient

# Credentials are read from the environment; request them from the team.
AK = os.getenv("CONAN_AK")
SK = os.getenv("CONAN_SK")

client = ConanClient(ak=AK, sk=SK, url="https://ai.om.qq.com/api/conan/v2")
res = client.embed("Hello!")
print(res)
```
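Because the embedding space is shared across Chinese and English, the same client can drive a simple cross-lingual retrieval loop. A minimal sketch, assuming client.embed returns a flat vector of floats (the actual response schema may differ):

```python
import numpy as np

# Hypothetical response handling: assumes client.embed returns a flat
# list of floats for a single input string.
docs = ["机器学习是人工智能的一个分支。", "The weather is nice today."]
doc_vecs = np.array([client.embed(d) for d in docs])
query_vec = np.array(client.embed("What is machine learning?"))

# Rank documents by cosine similarity to the English query.
scores = doc_vecs @ query_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
)
print(docs[int(scores.argmax())])  # expect the Chinese ML sentence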
This is a temporary access method; please contact us to obtain an access token.
In the future, we will provide high-performance, cost-effective, and reliable Embedding services on Tencent Cloud.
About
Created by the Tencent BAC Group. All rights reserved.