Update model card: Add paper link, abstract, and library name

#2
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +24 -55
README.md CHANGED
@@ -1,17 +1,31 @@
1
  ---
2
- license: mit
 
3
  datasets:
4
  - BAAI/Infinity-Instruct
5
  - HuggingFaceFW/fineweb-edu
6
  language:
7
  - en
8
- base_model:
9
- - answerdotai/ModernBERT-large
10
  pipeline_tag: feature-extraction
11
  tags:
12
  - sentence-transformers
13
  - transformers
 
14
  ---
 
15
  ## 1 Introduction
16
 
17
  Cooperating with [Richinfo](https://www.richinfo.cn/index.html), this released model was trained using a novel approach,
@@ -133,15 +147,15 @@ axis=0)))\
133
  text_len-1) to get global vector
134
 
135
  For retrieval tasks, query vector should be **single vector**, so the final score between query and document is the max
136
- score of query with every document vector.
137
  This is compatible with FAISS, MILVUS and so on. Just enlarge the top-k and do de-duplicate on searched documents.
138
 
139
  Below are detailed code examples.
140
 
141
  #### 2.3.1 Chunk text in the `encode` function
142
 
143
- You can directly use `encode` method in our model to get multi vectors.
144
- This method will chunk text automatically.
145
  You can choose the chunk strategy by setting `fast_chunk` parameter, if `fast_chunk` is true, directly chunk on input
146
  ids, else using RecursiveCharacterTextSplitter.
147
 
@@ -191,11 +205,11 @@ Surely some of the other frequencies also get scattered during the day, just in
191
 
192
  So during the evening blue light gets scattered even more, to the point where even less of it reaches the eyes?
193
 
194
- And so it gets red because now we can see the lower frequencies being scattered without blue overshadowing them?
195
 
196
- Trying to word it myself: during the day only the highest frequencies get filtered, but during the evening also lower frequencies get filtered, because now the “light strainer” (air) is just catching more of it?
197
 
198
- It gets darker in the evening without a good ability to see colors because there’s is no blue and so on light to reflect off of objects?
199
 
200
  Is it ok to speak about light as a frequency? Or it’s only correct to say “wave length”?
201
 
@@ -379,49 +393,4 @@ Reproduction script: https://huggingface.co/infgrad/dewey_en_beta/blob/main/scri
379
  | [voyage-3](https://blog.voyageai.com/2024/09/18/voyage-3/) | 100% | Unknown | 1024 | 32000 | 74.06 | 74.06 | 74.06 |
380
  | [inf-retriever-v1](https://huggingface.co/infly/inf-retriever-v1) | 100% | 7B | 3584 | 32768 | 73.19 | 73.19 | 73.19 |
381
 
382
- ### 3.3 LoCoV1
383
-
384
- URL: https://huggingface.co/datasets/hazyresearch/LoCoV1-Queries\
385
- https://huggingface.co/datasets/hazyresearch/LoCoV1-Documents
386
-
387
- Reproduction script: https://huggingface.co/infgrad/dewey_en_beta/blob/main/scripts/evaluate/run_evaluate_loco.py
388
-
389
- Metric: NDCG@10
390
-
391
- Result:
392
-
393
- | **dataset-name** | **bge-m3-8k** | **gte-modernbert-base-8k** | **Linq-Embed-Mistral-4k** | **Linq-Embed-Mistral-8k** | **SFR-Embedding-Mistral-8k** | **e5-mistral-7b-instruct-8k** | **dewey_en_beta-8k** | **dewey_en_beta_64k** | **dewey_en_beta_64k-multi-vectors** |
394
- |:---------------------------------:|:-------------:|:--------------------------:|:-------------------------:|:-------------------------:|:----------------------------:|:-----------------------------:|:--------------------:|:------------------------:|:--------------------------------------:|
395
- | **2wikimqa_test** | 0.9271 | 0.8658 | 0.8884 | 0.9067 | 0.8965 | 0.8901 | 0.8953 | 0.9051 | 0.9775 |
396
- | **courtlistener_HTML_test** | 0.1933 | 0.2349 | 0.3551 | 0.3670 | 0.3647 | 0.3543 | 0.3415 | 0.3616 | 0.4775 |
397
- | **courtlistener_Plain_Text_test** | 0.1888 | 0.2478 | 0.3675 | 0.3761 | 0.3679 | 0.3579 | 0.3377 | 0.3485 | 0.4426 |
398
- | **gov_report_test** | 0.9869 | 0.9750 | 0.9832 | 0.9837 | 0.9816 | 0.9823 | 0.9855 | 0.9883 | 0.9853 |
399
- | **legal_case_reports_test** | 0.3702 | 0.4476 | 0.5398 | 0.5432 | 0.5319 | 0.4850 | 0.5474 | 0.5875 | 0.6534 |
400
- | **multifieldqa_test** | 0.9373 | 0.9341 | 0.9345 | 0.9327 | 0.9450 | 0.9321 | 0.9687 | 0.9564 | 0.9754 |
401
- | **passage_retrieval_test** | 0.4493 | 0.5271 | 0.3470 | 0.3407 | 0.2902 | 0.3248 | 0.7562 | 0.7389 | 0.8550 |
402
- | **qasper_abstract_test** | 1.0000 | 0.9806 | 0.9982 | 0.9982 | 0.9973 | 0.9965 | 0.9973 | 0.9982 | 0.9982 |
403
- | **qasper_title_test** | 0.9860 | 0.8892 | 0.9838 | 0.9833 | 0.9861 | 0.9812 | 0.9742 | 0.9742 | 0.9840 |
404
- | **qmsum_test** | 0.6668 | 0.6307 | 0.6816 | 0.7237 | 0.7169 | 0.7148 | 0.7438 | 0.7613 | 0.8154 |
405
- | **stackoverflow_test** | 0.9634 | 0.9087 | 0.9760 | 0.9760 | 0.9766 | 0.9690 | 0.9362 | 0.9369 | 0.9443 |
406
- | **summ_screen_fd_test** | 0.9320 | 0.9379 | 0.9747 | 0.9635 | 0.9656 | 0.9580 | 0.9796 | 0.9821 | 0.9788 |
407
- | **Average** | 0.7168 | 0.7150 | 0.7525 | 0.7579 | 0.7517 | 0.7455 | 0.7886 |**0.7949** |**0.8406** |
408
-
409
- ## 4 Limitations
410
-
411
- - Only English text.
412
- - On short text tasks, the performance might not be as good as that of conventional short text embedding models.
413
- - As said before, this model is still in alpha or beta stage, the model may have some unexpected behaviour.
414
-
415
- ## 5 Cite
416
-
417
- ```
418
- @misc{zhang2025deweylongcontextembedding,
419
- title={Dewey Long Context Embedding Model: A Technical Report},
420
- author={Dun Zhang and Panxiang Zou and Yudong Zhou},
421
- year={2025},
422
- eprint={2503.20376},
423
- archivePrefix={arXiv},
424
- primaryClass={cs.IR},
425
- url={https://arxiv.org/abs/2503.20376},
426
- }
427
- ```
 
1
  ---
2
+ base_model:
3
+ - answerdotai/ModernBERT-large
4
  datasets:
5
  - BAAI/Infinity-Instruct
6
  - HuggingFaceFW/fineweb-edu
7
  language:
8
  - en
9
+ license: mit
 
10
  pipeline_tag: feature-extraction
11
  tags:
12
  - sentence-transformers
13
  - transformers
14
+ library_name: sentence-transformers
15
  ---
16
+
17
+ # Dewey Long Context Embedding Model: A Technical Report
18
+
19
+ The model was presented in the paper [Dewey Long Context Embedding Model: A Technical Report](https://huggingface.co/papers/2503.20376).
20
+
21
+ # Paper abstract
22
+
23
+ The abstract of the paper is the following:
24
+
25
+ ```
26
+ In this technical report, we introduce Dewey, a novel long context embedding model designed to enhance retrieval performance in long document scenarios. Dewey builds upon the ModernBERT architecture, known for its efficient handling of extended sequences, and incorporates an instruction-based training approach to align embeddings with specific task requirements. Key features of Dewey include its 128k context window, multi-vector representation for improved granularity, and a flexible chunking mechanism that allows customizable vector combinations. We evaluate Dewey on the LongEmbed benchmark, where it achieves state-of-the-art results, surpassing several larger models. Additionally, we present comprehensive usage examples and implementation details to facilitate the adoption and adaptation of Dewey for various applications.
27
+ ```
28
+
29
  ## 1 Introduction
30
 
31
  Cooperating with [Richinfo](https://www.richinfo.cn/index.html), this released model was trained using a novel approach,
 
147
  text_len-1) to get global vector
148
 
149
  For retrieval tasks, query vector should be **single vector**, so the final score between query and document is the max
150
+ score of query with every document vector.\
151
  This is compatible with FAISS, MILVUS and so on. Just enlarge the top-k and do de-duplicate on searched documents.
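
The scoring rule in the paragraph above (one query vector, several vectors per document, final score is the max) can be sketched in plain Python. This is a minimal illustration and not the model's own code: `dot`, `max_sim_score`, and `dedupe_hits` are hypothetical helper names, the vectors are assumed to be L2-normalized so that the dot product is the cosine similarity, and the `(doc_id, score)` hit format stands in for whatever a vector index such as FAISS or Milvus returns.

```python
from typing import Dict, List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity when both vectors are normalized."""
    return sum(x * y for x, y in zip(a, b))


def max_sim_score(query_vec: Sequence[float],
                  doc_vecs: Sequence[Sequence[float]]) -> float:
    """Score a document as the best match between the query and any of its vectors."""
    return max(dot(query_vec, v) for v in doc_vecs)


def dedupe_hits(hits: List[Tuple[str, float]], top_k: int) -> List[Tuple[str, float]]:
    """Collapse vector-level hits to document-level hits, keeping each document's max score."""
    best: Dict[str, float] = {}
    for doc_id, score in hits:
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


# Toy data (assumption): document "a" has two vectors, document "b" has one.
query = [1.0, 0.0]
doc_a = [[0.6, 0.8], [0.8, 0.6]]
doc_b = [[0.0, 1.0]]

# Simulate an enlarged top-k search that returns one hit per stored vector.
hits = [("a", dot(query, v)) for v in doc_a] + [("b", dot(query, v)) for v in doc_b]
top_docs = dedupe_hits(hits, top_k=2)
```

Because every document vector is stored as its own index entry, the index has to return more than `top_k` raw hits before de-duplication; how much to enlarge the search depends on how many vectors each document produces.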
152
 
153
  Below are detailed code examples.
154
 
155
  #### 2.3.1 Chunk text in the `encode` function
156
 
157
+ You can directly use `encode` method in our model to get multi vectors.\
158
+ This method will chunk text automatically.\
159
  You can choose the chunk strategy by setting `fast_chunk` parameter, if `fast_chunk` is true, directly chunk on input
160
  ids, else using RecursiveCharacterTextSplitter.
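
As a rough picture of what chunking directly on input ids means, the sketch below splits a list of token ids into fixed-size windows with an optional overlap. It is only an illustration under stated assumptions: `chunk_input_ids`, `chunk_size`, and `chunk_overlap` are hypothetical names chosen here and are not taken from the model's `encode` signature.

```python
from typing import List


def chunk_input_ids(input_ids: List[int], chunk_size: int,
                    chunk_overlap: int = 0) -> List[List[int]]:
    """Split token ids into windows of at most chunk_size, sharing chunk_overlap ids."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks: List[List[int]] = []
    for start in range(0, len(input_ids), step):
        chunks.append(input_ids[start:start + chunk_size])
        # Stop once a window has reached the end of the sequence.
        if start + chunk_size >= len(input_ids):
            break
    return chunks


# Ten placeholder token ids, windows of four ids with one id of overlap.
example_chunks = chunk_input_ids(list(range(10)), chunk_size=4, chunk_overlap=1)
```

Splitting on ids avoids a second tokenization pass but ignores sentence and paragraph boundaries, which is the trade-off against a text-level splitter.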
161
 
 
205
 
206
  So during the evening blue light gets scattered even more, to the point where even less of it reaches the eyes?
207
 
208
+ And so it gets red because now we can see the lower frequencies being scattered without blue overshadowing them?\
209
 
210
+ Trying to word it myself: during the day only the highest frequencies get filtered, but during the evening also lower frequencies get filtered, because now the “light strainer” (air) is just catching more of it?\
211
 
212
+ It gets darker in the evening without a good ability to see colors because there’s is no blue and so on light to reflect off of objects?\
213
 
214
  Is it ok to speak about light as a frequency? Or it’s only correct to say “wave length”?
215
 
 
393
  | [voyage-3](https://blog.voyageai.com/2024/09/18/voyage-3/) | 100% | Unknown | 1024 | 32000 | 74.06 | 74.06 | 74.06 |
394
  | [inf-retriever-v1](https://huggingface.co/infly/inf-retriever-v1) | 100% | 7B | 3584 | 32768 | 73.19 | 73.19 | 73.19 |
395
 
396
+ ### 3.