Conan-Embedding-v2

What's New?

  • Performance

    Conan-Embedding-v2 has now achieved SOTA performance on the MTEB leaderboard for both Chinese and English.

  • Cross-lingual Retrieval between Chinese and English

    Conan-Embedding-v2 supports cross-lingual retrieval: a query in Chinese or English can retrieve relevant documents written in the other language.

  • Longer Context Support

    Conan-Embedding-v2 now supports a context length of 32,768 tokens.

  • Conan 1.4B Large Model Trained from Scratch

    Both the vocabulary and the language model were trained from scratch, yielding a pre-trained model and tokenizer better tailored to the embedding scenario and delivering stronger performance.

    The Conan-1.4B base model will be open-sourced, so the community can train their own embedding models on top of it.
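
Cross-lingual retrieval in a shared embedding space boils down to ranking documents by similarity to the query vector. The sketch below illustrates this with cosine similarity over toy vectors; the vectors are made-up values standing in for real Conan-Embedding-v2 output, and `rank_by_similarity` is an illustrative helper, not part of any released API.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_by_similarity(query_vec, doc_vecs):
    # Return document indices sorted from most to least similar to the query.
    scores = [cosine(query_vec, d) for d in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)

# Toy vectors standing in for an English query and two Chinese documents.
query = [0.9, 0.1, 0.0]
docs = [[0.1, 0.9, 0.0],   # unrelated document
        [0.8, 0.2, 0.1]]   # relevant document
print(rank_by_similarity(query, docs))  # → [1, 0]
```

Because both languages share one embedding space, no translation step is needed at query time.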

Performance

Performance of Conan-Embedding-v2 on MTEB for Chinese and English

MTEB Result

English

| Model | Class. Acc. (12) | Clust. V-Meas. (11) | PairClass. AP (3) | Rerank. MAP (4) | Retri. nDCG@10 (15) | STS Spear. (12) | Summ. Spear. (1) | Avg. (56) |
|---|---|---|---|---|---|---|---|---|
| bge-multilingual-gemma2 | 88.08 | 54.65 | 85.97 | 59.72 | 59.24 | 83.88 | 31.20 | 69.88 |
| e5-mistral-7b-instruct | 79.89 | 51.44 | 88.42 | 49.78 | 57.62 | 84.32 | 36.57 | 67.98 |
| gte-Qwen2-7B-instruct | 86.58 | 56.92 | 85.90 | 61.42 | 59.11 | 83.06 | 31.35 | 69.95 |
| stella-en-1.5B-v5 | 87.63 | 57.69 | 88.07 | 61.21 | 61.01 | 84.51 | 31.49 | 71.19 |
| bge-en-icl | 88.95 | 57.89 | 88.14 | 59.86 | 62.16 | 84.24 | 30.77 | 71.67 |
| NV-Embed-v2 | 90.37 | 58.46 | 88.67 | 60.65 | 62.65 | 84.31 | 30.70 | 72.31 |
| Conan-embedding-v2 | 89.98 | 60.86 | 93.47 | 60.89 | 66.40 | 85.73 | 28.08 | 74.19 |

Chinese

| Model | Class. Acc. (9) | Clust. V-Meas. (4) | PairClass. AP (2) | Rerank. MAP (4) | Retri. nDCG@10 (8) | STS Spear. (8) | Avg. (35) |
|---|---|---|---|---|---|---|---|
| e5-mistral-7b-instruct | 70.47 | 52.30 | 72.19 | 61.86 | 61.75 | 50.22 | 60.89 |
| gte-Qwen2-1.5B-instruct | 71.12 | 54.61 | 86.91 | 68.21 | 71.86 | 60.96 | 67.65 |
| bge-multilingual-gemma2 | 74.11 | 59.30 | 86.67 | 68.28 | 73.73 | 56.87 | 68.44 |
| gte-Qwen2-7B-instruct | 75.00 | 66.06 | 87.48 | 68.92 | 75.71 | 65.27 | 71.94 |
| xiaobu-embedding-v2 | 74.67 | 65.17 | 91.87 | 72.58 | 76.50 | 64.53 | 72.43 |
| Conan-embedding-v1 | 75.03 | 66.33 | 91.66 | 72.76 | 76.67 | 64.18 | 72.62 |
| Conan-embedding-v2 | 74.70 | 68.84 | 92.44 | 74.41 | 78.31 | 66.47 | 73.95 |

Model Detail

Model Structure

Conan-Embedding-v2 Structure:

SentenceTransformer(  
    (0): Transformer({
        'max_seq_length': 32768, 
        'do_lower_case': False
        }) with Transformer model: ConanEmbedModel,
    (1): Pooling({
        'word_embedding_dimension': 3584, 
        'pooling_mode_cls_token': False, 
        'pooling_mode_mean_tokens': True, 
        'pooling_mode_max_tokens': False, 
        'pooling_mode_mean_sqrt_len_tokens': False, 
        'pooling_mode_weightedmean_tokens': False, 
        'pooling_mode_lasttoken': False, 
        'include_prompt': True
        }),
    (2): Dense({
        'in_features': 3584, 
        'out_features': 3584, 
        'bias': True, 
        'activation_function': 'torch.nn.modules.linear.Identity'
        })
)
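
The Pooling module above is configured for mean pooling (`pooling_mode_mean_tokens: True`): the sentence embedding is the average of the token embeddings, counting only real tokens under the attention mask. A minimal pure-Python sketch of that operation, using toy 2-dimensional vectors rather than the model's 3584-dimensional output:

```python
def mean_pool(token_embeddings, attention_mask):
    # Average token vectors, counting only positions where the mask is 1,
    # mirroring 'pooling_mode_mean_tokens': True in the structure above.
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    n = 0
    for vec, m in zip(token_embeddings, attention_mask):
        if m:
            total = [t + x for t, x in zip(total, vec)]
            n += 1
    return [t / n for t in total]

tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]  # last position is padding
print(mean_pool(tokens, [1, 1, 0]))  # → [2.0, 3.0]
```

The Dense layer that follows is a 3584→3584 identity-activation projection, so it leaves the pooled vector's dimensionality unchanged.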

Key Specifications of Conan-1.4B (Transformer):

  • Number of Parameters (Non-Dense-Layer): 1.48B

  • Vocabulary Size: 150,000

  • Number of Layers: 8

  • Hidden Layer Dimension: 3584

  • Number of Attention Heads (GQA): 32 for Q and 8 for KV

  • Intermediate Dimension of FFN Layer: 8192

  • Maximum Context Window: 32,768 Tokens
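
As a rough sanity check, the specifications above are consistent with ~1.5B non-head parameters if one assumes a gated (SwiGLU-style) FFN and an untied input embedding table; both assumptions are ours, not stated in this card:

```python
vocab, hidden, layers, ffn = 150_000, 3584, 8, 8192
q_heads, kv_heads = 32, 8
head_dim = hidden // q_heads                 # 112

embed = vocab * hidden                       # token embedding table
attn = hidden * hidden * 2                   # Q and output projections
attn += hidden * (kv_heads * head_dim) * 2   # shared K/V projections (GQA)
ffn_params = 3 * hidden * ffn                # gate, up, down (assumed SwiGLU)

total = embed + layers * (attn + ffn_params)
print(f"{total / 1e9:.2f}B")                 # ≈ 1.50B, close to the stated 1.48B
```

The small gap versus the stated 1.48B is plausibly normalization layers, biases, or a slightly different FFN layout; treat this as an order-of-magnitude check, not an exact accounting.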

For more model details, please refer to model/modeling_conan.py and config.json, or stay tuned for the upcoming open-source release of Conan-1.4B Base Model.

Tokenizer

We trained the tokenizer on a large-scale multilingual dataset, producing a standard BBPE (Byte-level Byte Pair Encoding) tokenizer with a vocabulary size of 150,000.
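
A byte-level BPE tokenizer starts from the 256 raw byte values and repeatedly merges the most frequent adjacent pair into a new vocabulary entry. The sketch below shows that training loop in miniature; it is illustrative only, and the real tokenizer was of course trained with vastly more data and merges.

```python
from collections import Counter

def most_frequent_pair(seqs):
    # Count adjacent symbol pairs across all sequences.
    counts = Counter()
    for seq in seqs:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
    return counts.most_common(1)[0][0] if counts else None

def merge_pair(seq, pair, new_id):
    # Replace every occurrence of `pair` with the new symbol id.
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_bbpe(texts, num_merges):
    # Start from raw UTF-8 bytes: a base vocabulary of 256 symbols.
    seqs = [list(t.encode("utf-8")) for t in texts]
    merges, next_id = {}, 256
    for _ in range(num_merges):
        pair = most_frequent_pair(seqs)
        if pair is None:
            break
        merges[pair] = next_id
        seqs = [merge_pair(s, pair, next_id) for s in seqs]
        next_id += 1
    return merges

print(train_bbpe(["aaab"], 1))  # → {(97, 97): 256}
```

Because the base alphabet is bytes rather than characters, any Unicode string tokenizes without out-of-vocabulary failures, which is what makes BBPE a good fit for a multilingual vocabulary.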

Technical Report

We will soon release our technical report.

Using Conan-Embedding-v2

Use /model/conan_api_client.py to access our test API. A sample call is as follows:

import os

from conan_api_client import ConanClient

# Read credentials from the environment (see below for obtaining them).
AK = os.getenv("CONAN_AK")
SK = os.getenv("CONAN_SK")

client = ConanClient(ak=AK, sk=SK, url="https://ai.om.qq.com/api/conan/v2")
res = client.embed("Hello!")
print(res)

This is a temporary calling solution. Please contact us to obtain an access token.

In the future, we will provide high-performance, cost-effective, and reliable Embedding services on Tencent Cloud.


About

Created by the Tencent BAC Group. All rights reserved.
