File size: 4,833 Bytes
8c990a9
 
ec73b69
 
 
3e844cc
ec73b69
 
220472f
071e1bb
 
8c990a9
c2bae01
 
a2eb078
1ae6783
c2bae01
1ae6783
 
 
0d29fb4
 
1ae6783
2b02579
22002d6
0d29fb4
6843106
 
f38e2df
c2bae01
58ba0da
c2bae01
58ba0da
 
 
 
 
 
 
c2bae01
2d3ec97
 
 
 
 
 
 
 
c2bae01
 
071e1bb
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
c2bae01
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
3605b5f
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
---
license: apache-2.0
language:
- en
tags:
- Phrase Representation
- String Matching
- Fuzzy Join
- Entity Retrieval
- transformers
- sentence-transformers
---
## PEARL-small
[Learning High-Quality and General-Purpose Phrase Representations](https://arxiv.org/pdf/2401.10407.pdf). <br>
[Lihu Chen](https://chenlihu.com), [Gaël Varoquaux](https://gael-varoquaux.info/), [Fabian M. Suchanek](https://suchanek.name/). 
Accepted by EACL Findings 2024 <br>

PEARL-small is a lightweight string embedding model. It is the tool of choice for semantic similarity computation for strings,
creating excellent embeddings for string matching, entity retrieval, entity clustering, fuzzy join...
<br>
It differs from typical sentence embedders because it incorporates phrase type information and morphological features, 
allowing it to better capture variations in strings.
The model is a variant of [E5-small](https://huggingface.co/intfloat/e5-small-v2) finetuned on our constructed context-free [dataset](https://zenodo.org/records/10676475) to yield better representations 
for phrases and strings. <br>


🤗 [PEARL-small](https://huggingface.co/Lihuchen/pearl_small) 🤗 [PEARL-base](https://huggingface.co/Lihuchen/pearl_base)
<br>


| Model |Size|Avg| PPDB | PPDB filtered |Turney|BIRD|YAGO|UMLS|CoNLL|BC5CDR|AutoFJ|
|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|
| FastText  |-| 40.3| 94.4  | 61.2  |  59.6  | 58.9  |16.9|14.5|3.0|0.2| 53.6|
| Sentence-BERT  |110M|50.1| 94.6  | 66.8  | 50.4  | 62.6  | 21.6|23.6|25.5|48.4| 57.2| 
| Phrase-BERT  |110M|54.5|  96.8  |  68.7  | 57.2  |  68.8  |23.7|26.1|35.4| 59.5|66.9| 
| E5-small  |34M|57.0|  96.0| 56.8|55.9| 63.1|43.3| 42.0|27.6| 53.7|74.8|
|E5-base|110M| 61.1| 95.4|65.6|59.4|66.3| 47.3|44.0|32.0| 69.3|76.1|
|PEARL-small|34M| 62.5| 97.0|70.2|57.9|68.1| 48.1|44.5|42.4|59.3|75.2|
|PEARL-base|110M|64.8|97.3|72.2|59.7|72.6|50.7|45.8|39.3|69.4|77.1|

Cost comparison of FastText and PEARL. The estimated memory is calculated by the number of parameters (float16). The unit of inference speed is `*ms/512 samples`.
The FastText model here is `crawl-300d-2M-subword.bin`.
| Model |Avg Score| Estimated Memory |Speed GPU | Speed CPU |
|-|-|-|-|-|
|FastText|40.3|1200MB|-|57ms|
|PEARL-small|62.5|68MB|42ms|446ms|
|PEARL-base|64.8|220MB|89ms|1394ms|

## Usage

### Sentence Transformers
PEARL is integrated with the Sentence Transformers library, and can be used like so:

```python
from sentence_transformers import SentenceTransformer, util

query_texts = ["The New York Times"]
doc_texts = [ "NYTimes", "New York Post", "New York"]
input_texts = query_texts + doc_texts

model = SentenceTransformer("Lihuchen/pearl_small")
embeddings = model.encode(input_texts)
scores = util.cos_sim(embeddings[0], embeddings[1:]) * 100
print(scores.tolist())
# [[90.56318664550781, 79.65763854980469, 75.52056121826172]]
```

### Transformers
You can also use `transformers` to use PEARL. Below is an example of entity retrieval, and we reuse the code from E5.

```python
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

def encode_text(model, input_texts):
    # Tokenize the input texts
    batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

    outputs = model(**batch_dict)
    embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])
    
    return embeddings


query_texts = ["The New York Times"]
doc_texts = [ "NYTimes", "New York Post", "New York"]
input_texts = query_texts + doc_texts

tokenizer = AutoTokenizer.from_pretrained('Lihuchen/pearl_small')
model = AutoModel.from_pretrained('Lihuchen/pearl_small')

# encode
embeddings = encode_text(model, input_texts)

# calculate similarity
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())

# expected outputs
# [[90.56318664550781, 79.65763854980469, 75.52054595947266]]
```

## Training and Evaluation
Have a look at our code on [Github](https://github.com/tigerchen52/PEARL)



## Citation

If you find our work useful, please give us a citation:

```
@article{chen2024learning,
  title={Learning High-Quality and General-Purpose Phrase Representations},
  author={Chen, Lihu and Varoquaux, Ga{\"e}l and Suchanek, Fabian M},
  journal={arXiv preprint arXiv:2401.10407},
  year={2024}
}
```