Update README.md
README.md
CHANGED
@@ -1063,6 +1063,8 @@ model-index:
**News**

**[2023-10-12]** Release stella-base-zh-v2 and stella-large-zh-v2. The two models have better performance and **do not need any prefix text**.\
@@ -1072,12 +1074,13 @@ stella is a general-purpose text encoder, which mainly includes the following models:
| Model Name | Model Size (GB) | Dimension | Sequence Length | Language | Need instruction for retrieval? |
|:------------------:|:---------------:|:---------:|:---------------:|:--------:|:-------------------------------:|
| stella-large-zh-v2 | 0.65 | 1024 | 1024 | Chinese | No |
| stella-base-zh-v2 | 0.2 | 768 | 1024 | Chinese | No |
| stella-large-zh | 0.65 | 1024 | 1024 | Chinese | Yes |
| stella-base-zh | 0.2 | 768 | 1024 | Chinese | Yes |

The complete training approach and process are documented in this [blog post](https://zhuanlan.zhihu.com/p/655322183); comments and discussion are welcome.

**Training data:**
@@ -1104,6 +1107,7 @@ stella is a general-purpose text encoder, which mainly includes the following models:

| Model Name | Model Size (GB) | Dimension | Sequence Length | Language | Need instruction for retrieval? |
|:------------------:|:---------------:|:---------:|:---------------:|:--------:|:-------------------------------:|
| stella-large-zh-v2 | 0.65 | 1024 | 1024 | Chinese | No |
| stella-base-zh-v2 | 0.2 | 768 | 1024 | Chinese | No |
| stella-large-zh | 0.65 | 1024 | 1024 | Chinese | Yes |
@@ -1142,9 +1146,15 @@ Based on stella models, stella-v2 uses more training data and removes instruction
| stella-large-zh | 0.65 | 1024 | 1024 | 64.54 | 67.62 | 48.65 | 78.72 | 65.98 | 71.02 | 58.3 |
| stella-base-zh | 0.2 | 768 | 1024 | 64.16 | 67.77 | 48.7 | 76.09 | 66.95 | 71.07 | 56.54 |
#### Reproduce our results
```python
import torch
@@ -1186,6 +1196,10 @@ if __name__ == '__main__':
```
#### Evaluation for long text
In practice we found that the evaluation texts in C-MTEB are almost all shorter than 512 tokens,
@@ -1244,7 +1258,6 @@ All stella Chinese-series models use mean pooling to produce the text embedding.
Usage in the sentence-transformers library:
```python
# For short-to-short datasets, this is the general usage
from sentence_transformers import SentenceTransformer
sentences = ["数据1", "数据2"]
@@ -1282,7 +1295,43 @@ print(vectors.shape) # 2,768
#### stella models for English
## Training Detail
@@ -1320,3 +1369,4 @@ developing...
9. https://github.com/THUDM/LongBench
**News**

**[2023-10-19]** Release stella-base-en-v2. This model is easy to use and **does not need any prefix text**.\
**[2023-10-12]** Release stella-base-zh-v2 and stella-large-zh-v2. The two models have better performance and **do not need any prefix text**.\
| Model Name | Model Size (GB) | Dimension | Sequence Length | Language | Need instruction for retrieval? |
|:------------------:|:---------------:|:---------:|:---------------:|:--------:|:-------------------------------:|
| stella-base-en-v2 | 0.2 | 768 | 512 | English | No |
| stella-large-zh-v2 | 0.65 | 1024 | 1024 | Chinese | No |
| stella-base-zh-v2 | 0.2 | 768 | 1024 | Chinese | No |
| stella-large-zh | 0.65 | 1024 | 1024 | Chinese | Yes |
| stella-base-zh | 0.2 | 768 | 1024 | Chinese | Yes |

The complete training approach and process are documented in [blog post 1](https://zhuanlan.zhihu.com/p/655322183) and [blog post 2](https://zhuanlan.zhihu.com/p/662209559); comments and discussion are welcome.

**Training data:**
| Model Name | Model Size (GB) | Dimension | Sequence Length | Language | Need instruction for retrieval? |
|:------------------:|:---------------:|:---------:|:---------------:|:--------:|:-------------------------------:|
| stella-base-en-v2 | 0.2 | 768 | 512 | English | No |
| stella-large-zh-v2 | 0.65 | 1024 | 1024 | Chinese | No |
| stella-base-zh-v2 | 0.2 | 768 | 1024 | Chinese | No |
| stella-large-zh | 0.65 | 1024 | 1024 | Chinese | Yes |
| stella-large-zh | 0.65 | 1024 | 1024 | 64.54 | 67.62 | 48.65 | 78.72 | 65.98 | 71.02 | 58.3 |
| stella-base-zh | 0.2 | 768 | 1024 | 64.16 | 67.77 | 48.7 | 76.09 | 66.95 | 71.07 | 56.54 |

#### MTEB leaderboard (English)

| Model Name | Model Size (GB) | Dimension | Sequence Length | Average (56) | Classification (12) | Clustering (11) | Pair Classification (3) | Reranking (4) | Retrieval (15) | STS (10) | Summarization (1) |
|:-----------------:|:---------------:|:---------:|:---------------:|:------------:|:-------------------:|:---------------:|:-----------------------:|:-------------:|:--------------:|:--------:|:------------------:|
| stella-base-en-v2 | 0.2 | 768 | 512 | 62.61 | 75.28 | 44.9 | 86.45 | 58.77 | 50.1 | 83.02 | 32.52 |

#### Reproduce our results

**C-MTEB:**

```python
import torch
```

**MTEB:**

You can use the official script [scripts/run_mteb_english.py](https://github.com/embeddings-benchmark/mteb/blob/main/scripts/run_mteb_english.py) to reproduce our results.

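An evaluation harness like the script above only needs a model object that exposes a SentenceTransformer-style `encode` method returning a 2-D array of embeddings. A minimal sketch of that interface, where `DummyEncoder` and its hash-seeded vectors are hypothetical stand-ins rather than the stella model:

```python
import zlib
import numpy as np

class DummyEncoder:
    """Stand-in model: benchmark runners only need an `encode` method
    that maps a list of sentences to a (n_sentences, dim) array."""

    def __init__(self, dim: int = 768):
        self.dim = dim

    def encode(self, sentences, batch_size=32, **kwargs):
        vectors = []
        for s in sentences:
            # Deterministic per-sentence seed so results are reproducible.
            rng = np.random.default_rng(zlib.crc32(s.encode("utf-8")))
            v = rng.normal(size=self.dim)
            vectors.append(v / np.linalg.norm(v))  # unit-normalize each vector
        return np.stack(vectors)

model = DummyEncoder()
emb = model.encode(["one car come", "one car go"])
print(emb.shape)  # (2, 768)
```

In practice you would replace `DummyEncoder` with the real `SentenceTransformer('infgrad/stella-base-en-v2')` model and pass it to the benchmark runner.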
#### Evaluation for long text
In practice we found that the evaluation texts in C-MTEB are almost all shorter than 512 tokens,
Usage in the sentence-transformers library:
```python
from sentence_transformers import SentenceTransformer
sentences = ["数据1", "数据2"]
#### stella models for English

**Using Sentence-Transformers:**

```python
from sentence_transformers import SentenceTransformer

sentences = ["one car come", "one car go"]
model = SentenceTransformer('infgrad/stella-base-en-v2')
print(model.max_seq_length)
embeddings_1 = model.encode(sentences, normalize_embeddings=True)
embeddings_2 = model.encode(sentences, normalize_embeddings=True)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
```

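Because `encode` is called with `normalize_embeddings=True`, the matrix product above is exactly pairwise cosine similarity. A small NumPy sketch of that identity, using toy vectors rather than model output:

```python
import numpy as np

def l2_normalize(x):
    # Divide each row by its Euclidean norm.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([[1.0, 2.0, 2.0], [3.0, 0.0, 4.0]])
n = l2_normalize(a)
# For L2-normalized rows, the matrix product is pairwise cosine similarity.
sim = n @ n.T
print(np.allclose(sim[0, 1], cosine(a[0], a[1])))  # True
```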
**Using HuggingFace Transformers:**

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.preprocessing import normalize

model = AutoModel.from_pretrained('infgrad/stella-base-en-v2')
tokenizer = AutoTokenizer.from_pretrained('infgrad/stella-base-en-v2')
sentences = ["one car come", "one car go"]
batch_data = tokenizer(
    batch_text_or_text_pairs=sentences,
    padding="longest",
    return_tensors="pt",
    max_length=512,
    truncation=True,
)
attention_mask = batch_data["attention_mask"]
with torch.no_grad():
    model_output = model(**batch_data)
# Mean pooling: zero out padding positions, then average over real tokens.
last_hidden = model_output.last_hidden_state.masked_fill(~attention_mask[..., None].bool(), 0.0)
vectors = last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]
vectors = normalize(vectors.numpy(), norm="l2", axis=1)
print(vectors.shape)  # (2, 768)
```
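The masked-fill/sum step above is mean pooling over non-padding tokens. The same arithmetic in a self-contained NumPy sketch, with toy numbers standing in for real model outputs:

```python
import numpy as np

# Toy stand-ins: batch of 2 sequences, 3 token positions, hidden size 2.
last_hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],   # third position is padding
    [[2.0, 2.0], [4.0, 6.0], [6.0, 10.0]],  # no padding
])
attention_mask = np.array([[1, 1, 0], [1, 1, 1]])

# Zero out padded positions, then divide by the count of real tokens.
masked = last_hidden * attention_mask[..., None]
vectors = masked.sum(axis=1) / attention_mask.sum(axis=1)[..., None]
# Row 0 averages only the two real tokens -> [2., 3.];
# row 1 averages all three -> [4., 6.].
print(vectors)
```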
## Training Detail
9. https://github.com/THUDM/LongBench