Update model card: Add paper link, abstract, and library name
#2 by nielsr (HF Staff) - opened

README.md CHANGED
@@ -1,17 +1,31 @@
 ---
-
 datasets:
 - BAAI/Infinity-Instruct
 - HuggingFaceFW/fineweb-edu
 language:
 - en
-
-- answerdotai/ModernBERT-large
 pipeline_tag: feature-extraction
 tags:
 - sentence-transformers
 - transformers
 ---
 ## 1 Introduction
 
 Cooperating with [Richinfo](https://www.richinfo.cn/index.html), this released model was trained using a novel approach,
@@ -133,15 +147,15 @@ axis=0)))\
 text_len-1) to get global vector
 
 For retrieval tasks, query vector should be **single vector**, so the final score between query and document is the max
-score of query with every document vector
 This is compatible with FAISS, MILVUS and so on. Just enlarge the top-k and do de-duplicate on searched documents.
 
 Below are detailed code examples.
 
 #### 2.3.1 Chunk text in the `encode` function
 
-You can directly use `encode` method in our model to get multi vectors
-This method will chunk text automatically
 You can choose the chunk strategy by setting `fast_chunk` parameter, if `fast_chunk` is true, directly chunk on input
 ids, else using RecursiveCharacterTextSplitter.
@@ -191,11 +205,11 @@ Surely some of the other frequencies also get scattered during the day, just in
 
 So during the evening blue light gets scattered even more, to the point where even less of it reaches the eyes?
 
-And so it gets red because now we can see the lower frequencies being scattered without blue overshadowing them
 
-Trying to word it myself: during the day only the highest frequencies get filtered, but during the evening also lower frequencies get filtered, because now the “light strainer” (air) is just catching more of it
 
-It gets darker in the evening without a good ability to see colors because there’s is no blue and so on light to reflect off of objects
 
 Is it ok to speak about light as a frequency? Or it’s only correct to say “wave length”?
@@ -379,49 +393,4 @@ Reproduction script: https://huggingface.co/infgrad/dewey_en_beta/blob/main/scri
 | [voyage-3](https://blog.voyageai.com/2024/09/18/voyage-3/) | 100% | Unknown | 1024 | 32000 | 74.06 | 74.06 | 74.06 |
 | [inf-retriever-v1](https://huggingface.co/infly/inf-retriever-v1) | 100% | 7B | 3584 | 32768 | 73.19 | 73.19 | 73.19 |
 
-### 3.
-
-URL: https://huggingface.co/datasets/hazyresearch/LoCoV1-Queries\
-https://huggingface.co/datasets/hazyresearch/LoCoV1-Documents
-
-Reproduction script: https://huggingface.co/infgrad/dewey_en_beta/blob/main/scripts/evaluate/run_evaluate_loco.py
-
-Metric: NDCG@10
-
-Result:
-
-| **dataset-name** | **bge-m3-8k** | **gte-modernbert-base-8k** | **Linq-Embed-Mistral-4k** | **Linq-Embed-Mistral-8k** | **SFR-Embedding-Mistral-8k** | **e5-mistral-7b-instruct-8k** | **dewey_en_beta-8k** | **dewey_en_beta_64k** | **dewey_en_beta_64k-multi-vectors** |
-|:---------------------------------:|:-------------:|:--------------------------:|:-------------------------:|:-------------------------:|:----------------------------:|:-----------------------------:|:--------------------:|:------------------------:|:--------------------------------------:|
-| **2wikimqa_test** | 0.9271 | 0.8658 | 0.8884 | 0.9067 | 0.8965 | 0.8901 | 0.8953 | 0.9051 | 0.9775 |
-| **courtlistener_HTML_test** | 0.1933 | 0.2349 | 0.3551 | 0.3670 | 0.3647 | 0.3543 | 0.3415 | 0.3616 | 0.4775 |
-| **courtlistener_Plain_Text_test** | 0.1888 | 0.2478 | 0.3675 | 0.3761 | 0.3679 | 0.3579 | 0.3377 | 0.3485 | 0.4426 |
-| **gov_report_test** | 0.9869 | 0.9750 | 0.9832 | 0.9837 | 0.9816 | 0.9823 | 0.9855 | 0.9883 | 0.9853 |
-| **legal_case_reports_test** | 0.3702 | 0.4476 | 0.5398 | 0.5432 | 0.5319 | 0.4850 | 0.5474 | 0.5875 | 0.6534 |
-| **multifieldqa_test** | 0.9373 | 0.9341 | 0.9345 | 0.9327 | 0.9450 | 0.9321 | 0.9687 | 0.9564 | 0.9754 |
-| **passage_retrieval_test** | 0.4493 | 0.5271 | 0.3470 | 0.3407 | 0.2902 | 0.3248 | 0.7562 | 0.7389 | 0.8550 |
-| **qasper_abstract_test** | 1.0000 | 0.9806 | 0.9982 | 0.9982 | 0.9973 | 0.9965 | 0.9973 | 0.9982 | 0.9982 |
-| **qasper_title_test** | 0.9860 | 0.8892 | 0.9838 | 0.9833 | 0.9861 | 0.9812 | 0.9742 | 0.9742 | 0.9840 |
-| **qmsum_test** | 0.6668 | 0.6307 | 0.6816 | 0.7237 | 0.7169 | 0.7148 | 0.7438 | 0.7613 | 0.8154 |
-| **stackoverflow_test** | 0.9634 | 0.9087 | 0.9760 | 0.9760 | 0.9766 | 0.9690 | 0.9362 | 0.9369 | 0.9443 |
-| **summ_screen_fd_test** | 0.9320 | 0.9379 | 0.9747 | 0.9635 | 0.9656 | 0.9580 | 0.9796 | 0.9821 | 0.9788 |
-| **Average** | 0.7168 | 0.7150 | 0.7525 | 0.7579 | 0.7517 | 0.7455 | 0.7886 | **0.7949** | **0.8406** |
-
-## 4 Limitations
-
-- Only English text.
-- On short text tasks, the performance might not be as good as that of conventional short text embedding models.
-- As said before, this model is still in alpha or beta stage, the model may have some unexpected behaviour.
-
-## 5 Cite
-
-```
-@misc{zhang2025deweylongcontextembedding,
-      title={Dewey Long Context Embedding Model: A Technical Report},
-      author={Dun Zhang and Panxiang Zou and Yudong Zhou},
-      year={2025},
-      eprint={2503.20376},
-      archivePrefix={arXiv},
-      primaryClass={cs.IR},
-      url={https://arxiv.org/abs/2503.20376},
-}
-```
 ---
+base_model:
+- answerdotai/ModernBERT-large
 datasets:
 - BAAI/Infinity-Instruct
 - HuggingFaceFW/fineweb-edu
 language:
 - en
+license: mit
 pipeline_tag: feature-extraction
 tags:
 - sentence-transformers
 - transformers
+library_name: sentence-transformers
 ---
+
+# Dewey Long Context Embedding Model: A Technical Report
+
+The model was presented in the paper [Dewey Long Context Embedding Model: A Technical Report](https://huggingface.co/papers/2503.20376).
+
+# Paper abstract
+
+The abstract of the paper is the following:
+
+```
+In this technical report, we introduce Dewey, a novel long context embedding model designed to enhance retrieval performance in long document scenarios. Dewey builds upon the ModernBERT architecture, known for its efficient handling of extended sequences, and incorporates an instruction-based training approach to align embeddings with specific task requirements. Key features of Dewey include its 128k context window, multi-vector representation for improved granularity, and a flexible chunking mechanism that allows customizable vector combinations. We evaluate Dewey on the LongEmbed benchmark, where it achieves state-of-the-art results, surpassing several larger models. Additionally, we present comprehensive usage examples and implementation details to facilitate the adoption and adaptation of Dewey for various applications.
+```
+
 ## 1 Introduction
 
 Cooperating with [Richinfo](https://www.richinfo.cn/index.html), this released model was trained using a novel approach,
 text_len-1) to get global vector
 
 For retrieval tasks, query vector should be **single vector**, so the final score between query and document is the max
+score of query with every document vector.\
 This is compatible with FAISS, MILVUS and so on. Just enlarge the top-k and do de-duplicate on searched documents.
 
 Below are detailed code examples.
 
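The max-score retrieval and top-k de-duplication described in this hunk can be sketched as follows. This is a minimal illustration assuming L2-normalized embeddings, not the model's actual API; `max_score`, `search_with_dedup`, and `chunk_to_doc` are hypothetical names:

```python
# Hedged sketch of max-score multi-vector retrieval (assumed, not official API).
# Vectors are assumed L2-normalized, so a dot product equals cosine similarity.
import numpy as np

def max_score(query_vec, doc_vecs):
    # Document score = max similarity between the single query vector
    # and any of that document's multi vectors.
    return float(np.max(doc_vecs @ query_vec))

def search_with_dedup(query_vec, chunk_vecs, chunk_to_doc, top_k):
    # FAISS/Milvus-style flow: rank all stored chunk vectors, then collapse
    # the hits back to documents, keeping each document's best-scoring chunk.
    sims = chunk_vecs @ query_vec          # one score per stored chunk vector
    seen, hits = set(), []
    for i in np.argsort(-sims):            # "enlarge the top-k" over chunks
        doc = chunk_to_doc[i]
        if doc not in seen:                # de-duplicate on document id
            seen.add(doc)
            hits.append((doc, float(sims[i])))
        if len(hits) == top_k:
            break
    return hits
```

In a real vector store you would fetch more than `top_k` chunk hits from the index and collapse them to documents the same way.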
 #### 2.3.1 Chunk text in the `encode` function
 
+You can directly use `encode` method in our model to get multi vectors.\
+This method will chunk text automatically.\
 You can choose the chunk strategy by setting `fast_chunk` parameter, if `fast_chunk` is true, directly chunk on input
 ids, else using RecursiveCharacterTextSplitter.
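As a rough illustration of the two strategies named above (a sketch under assumptions, not the model's implementation): with `fast_chunk` the already-tokenized input ids are sliced into fixed windows, while the RecursiveCharacterTextSplitter path splits the raw text on progressively finer separators first. `fast_chunk_ids` and `recursive_split` are hypothetical helpers:

```python
# Illustrative only: assumed behavior of the two chunking strategies.

def fast_chunk_ids(input_ids, chunk_size, overlap=0):
    # fast_chunk=True style: slice token ids directly into fixed windows.
    # Cheap (no re-tokenization), but windows ignore sentence boundaries.
    step = chunk_size - overlap
    return [input_ids[i:i + chunk_size] for i in range(0, len(input_ids), step)]

def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    # RecursiveCharacterTextSplitter style: try coarse separators first,
    # recursing with finer ones on pieces that are still too long.
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:                      # no separator left: hard cut
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for part in text.split(sep):
        chunks.extend(recursive_split(part, chunk_size, rest))
    return chunks
```

The real RecursiveCharacterTextSplitter also merges small pieces back up toward `chunk_size` and supports overlap; this sketch keeps only the recursive-splitting idea to show why it respects text boundaries where the id-slicing path does not.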
 
 So during the evening blue light gets scattered even more, to the point where even less of it reaches the eyes?
 
+And so it gets red because now we can see the lower frequencies being scattered without blue overshadowing them?\
 
+Trying to word it myself: during the day only the highest frequencies get filtered, but during the evening also lower frequencies get filtered, because now the “light strainer” (air) is just catching more of it?\
 
+It gets darker in the evening without a good ability to see colors because there’s is no blue and so on light to reflect off of objects?\
 
 Is it ok to speak about light as a frequency? Or it’s only correct to say “wave length”?
 | [voyage-3](https://blog.voyageai.com/2024/09/18/voyage-3/) | 100% | Unknown | 1024 | 32000 | 74.06 | 74.06 | 74.06 |
 | [inf-retriever-v1](https://huggingface.co/infly/inf-retriever-v1) | 100% | 7B | 3584 | 32768 | 73.19 | 73.19 | 73.19 |
 
+### 3.