Conan-Embedding-v2
What's New?
Performance
Conan-Embedding-v2 has now achieved SOTA performance on the MTEB leaderboard for both Chinese and English.
Cross-lingual Retrieval between Chinese and English
Conan-Embedding-v2 supports cross-lingual retrieval between Chinese and English texts.
Longer Context Support
Conan-Embedding-v2 now supports a context length of 32,768 tokens.
Conan 1.4B Large Model Trained from Scratch
Both the vocabulary and the underlying language model were trained from scratch, yielding a pre-trained model and tokenizer tailored to the Embedding scenario and delivering stronger performance.
The Conan-1.4B base model will be open-sourced, so the community can train their own Embedding models on top of it.
Performance
Performance of Conan-Embedding-v2 on MTEB for Chinese and English
English
Embedding Task / Metric | Class. Acc. (12) | Clust. V-Meas. (11) | PairClass. AP (3) | Rerank. MAP (4) | Retri. nDCG@10 (15) | STS Spear. (12) | Summ. Spear. (1) | Avg. (56) |
---|---|---|---|---|---|---|---|---|
bge-multilingual-gemma2 | 88.08 | 54.65 | 85.97 | 59.72 | 59.24 | 83.88 | 31.20 | 69.88 |
e5-mistral-7b-instruct | 79.89 | 51.44 | 88.42 | 49.78 | 57.62 | 84.32 | 36.57 | 67.98 |
gte-Qwen2-7B-instruct | 86.58 | 56.92 | 85.90 | 61.42 | 59.11 | 83.06 | 31.35 | 69.95 |
stella-en-1.5B-v5 | 87.63 | 57.69 | 88.07 | 61.21 | 61.01 | 84.51 | 31.49 | 71.19 |
bge-en-icl | 88.95 | 57.89 | 88.14 | 59.86 | 62.16 | 84.24 | 30.77 | 71.67 |
NV-Embed-v2 | 90.37 | 58.46 | 88.67 | 60.65 | 62.65 | 84.31 | 30.70 | 72.31 |
Conan-embedding-v2 | 89.98 | 60.86 | 93.47 | 60.89 | 66.40 | 85.73 | 28.08 | 74.19 |
Chinese
Embedding Task / Metric | Class. Acc. (9) | Clust. V-Meas. (4) | PairClass. AP (2) | Rerank. MAP (4) | Retri. nDCG@10 (8) | STS Spear. (8) | Avg. (35) |
---|---|---|---|---|---|---|---|
e5-mistral-7b-instruct | 70.47 | 52.30 | 72.19 | 61.86 | 61.75 | 50.22 | 60.89 |
gte-Qwen2-1.5B-instruct | 71.12 | 54.61 | 86.91 | 68.21 | 71.86 | 60.96 | 67.65 |
bge-multilingual-gemma2 | 74.11 | 59.30 | 86.67 | 68.28 | 73.73 | 56.87 | 68.44 |
gte-Qwen2-7B-instruct | 75.00 | 66.06 | 87.48 | 68.92 | 75.71 | 65.27 | 71.94 |
xiaobu-embedding-v2 | 74.67 | 65.17 | 91.87 | 72.58 | 76.50 | 64.53 | 72.43 |
Conan-embedding-v1 | 75.03 | 66.33 | 91.66 | 72.76 | 76.67 | 64.18 | 72.62 |
Conan-embedding-v2 | 74.70 | 68.84 | 92.44 | 74.41 | 78.31 | 66.47 | 73.95 |
Model Detail
Model Structure
Conan-Embedding-v2 Structure:
```
SentenceTransformer(
  (0): Transformer({
      'max_seq_length': 32768,
      'do_lower_case': False
  }) with Transformer model: ConanEmbedModel,
  (1): Pooling({
      'word_embedding_dimension': 3584,
      'pooling_mode_cls_token': False,
      'pooling_mode_mean_tokens': True,
      'pooling_mode_max_tokens': False,
      'pooling_mode_mean_sqrt_len_tokens': False,
      'pooling_mode_weightedmean_tokens': False,
      'pooling_mode_lasttoken': False,
      'include_prompt': True
  }),
  (2): Dense({
      'in_features': 3584,
      'out_features': 3584,
      'bias': True,
      'activation_function': 'torch.nn.modules.linear.Identity'
  })
)
```
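The stack above is what sentence-transformers rebuilds at load time: the Conan Transformer encoder, mean pooling over token embeddings, and an identity Dense projection. A minimal loading sketch, assuming the released checkpoint keeps this structure; the repo ID below is illustrative, not confirmed:

```python
from sentence_transformers import SentenceTransformer

# "TencentBAC/Conan-embedding-v2" is a hypothetical repo ID for illustration;
# substitute the actual one once the model is released.
model = SentenceTransformer("TencentBAC/Conan-embedding-v2", trust_remote_code=True)

# Mixed Chinese/English input exercises the shared cross-lingual space.
sentences = ["What is the weather like today?", "今天天气怎么样？"]
embeddings = model.encode(sentences, normalize_embeddings=True)
print(embeddings.shape)  # (2, 3584): mean-pooled, Dense-projected vectors
```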
Key Specifications of Conan-1.4B (Transformer):
Number of Parameters (excluding the Dense layer): 1.48B
Vocabulary Size: 150,000
Number of Layers: 8
Hidden Layer Dimension: 3584
Number of Attention Heads (GQA): 32 for Q, 8 for KV
Intermediate Dimension of FFN Layer: 8192
Maximum Context Window: 32,768 Tokens
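These dimensions are consistent with the stated parameter count. A back-of-the-envelope check, assuming a SwiGLU-style gated FFN and untied embeddings (both are assumptions, not confirmed by this card):

```python
# Rough parameter estimate from the published dimensions.
# Assumptions (not confirmed): SwiGLU-gated FFN (3 projection matrices),
# untied input embeddings; biases and norms ignored as negligible.
vocab, hidden, layers, ffn = 150_000, 3584, 8, 8192
q_heads, kv_heads = 32, 8
head_dim = hidden // q_heads                 # 112

embed = vocab * hidden                       # ~537.6M
attn = hidden * hidden * 2                   # Q and O projections
attn += hidden * (kv_heads * head_dim) * 2   # grouped K and V projections
mlp = 3 * hidden * ffn                       # gate, up, down
total = embed + layers * (attn + mlp)
print(f"{total / 1e9:.2f}B")                 # ~1.50B, close to the stated 1.48B
```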
For more model details, please refer to model/modeling_conan.py and config.json, or stay tuned for the upcoming open-source release of the Conan-1.4B base model.
Tokenizer
We trained the tokenizer on a large-scale multilingual dataset to build a standard BBPE (Byte-level Byte-Pair Encoding) tokenizer with a vocabulary size of 150,000.
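For reference, a byte-level BPE tokenizer of this kind can be trained with the Hugging Face tokenizers library. This is an illustrative sketch only; the corpus paths are placeholders, and the actual Conan training setup has not been published:

```python
import os

from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer; corpus files are placeholders.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus_zh.txt", "corpus_en.txt"],  # placeholder corpora
    vocab_size=150_000,
    min_frequency=2,
)

os.makedirs("conan_tokenizer", exist_ok=True)
tokenizer.save_model("conan_tokenizer")  # writes vocab.json and merges.txt
```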
Technical Report
We will soon release our technical report.
Using Conan-Embedding-v2
Use /model/conan_api_client.py to access our test API. A sample call is as follows:
```python
import os

# The client ships in /model/conan_api_client.py, as noted above.
from conan_api_client import ConanClient

# Credentials are read from the environment; request them from the team.
AK = os.getenv("CONAN_AK")
SK = os.getenv("CONAN_SK")

client = ConanClient(ak=AK, sk=SK, url="https://ai.om.qq.com/api/conan/v2")
res = client.embed("Hello!")
print(res)
```
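Because the embedding space is shared across Chinese and English, the same client can drive a simple cross-lingual retrieval loop. A minimal sketch, assuming client.embed returns a flat vector of floats (the actual response schema may differ):

```python
import numpy as np

# Hypothetical response handling: assumes client.embed returns a flat
# list of floats for a single input string.
docs = ["机器学习是人工智能的一个分支。", "The weather is nice today."]
doc_vecs = np.array([client.embed(d) for d in docs])
query_vec = np.array(client.embed("What is machine learning?"))

# Rank documents by cosine similarity to the English query.
scores = doc_vecs @ query_vec / (
    np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
)
print(docs[int(scores.argmax())])  # expect the Chinese ML sentence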
This is a temporary access method; please contact us to obtain an access token.
In the future, we will provide high-performance, cost-effective, and reliable Embedding services on Tencent Cloud.
About
Created by the Tencent BAC Group. All rights reserved.