---
datasets:
- gowitheflowlab/parallel-medium-w-nli
- gowitheflow/allnli-sup
- gowitheflow/wiki1M-character-level-all
- gowitheflow/wiki1M-word-condition-shuffle
- gowitheflow/wiki1M-word-character-all-multiple
- gowitheflow/wiki1M-word-random-shuffle
- gowitheflow/wiki-span
pipeline_tag: sentence-similarity
---

# Model Card for Pixel-Linguist/Pixel-Linguist-v0

Official model checkpoint of **Pixel-Linguist-v0** from the paper "Pixel Sentence Representation Learning".

### Model Summary

Pixel Linguist v0 is a sentence encoder trained to capture sentence- and document-level semantics from pixel-based textual signals alone. It is trained on 59 language pairs, English unsupervised data, Wikipedia spans, and NLI, and shows strong zero-shot transfer to unseen languages, including ancient ones.

### Model Sources

- **GitHub Repo:** https://github.com/gowitheflow-1998/Pixel-Linguist
- **Paper:** https://arxiv.org/pdf/2402.08183.pdf

### Downstream Use

Semantic Textual Similarity, Information Retrieval

### Out-of-Scope Use

The model may be a suboptimal starting point for fine-tuning on other tasks (such as classification), as it is trained for representation tasks based on similarity matching.

### Training Data

All the training sets we created for our progressive training scheme are listed in the metadata tags above. Please refer to the paper for the exact procedure.
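
To inspect any of these sets, you can pull them with the Hugging Face Datasets library. A minimal sketch; the `"train"` split name is an assumption, so check each dataset card:

```python
from datasets import load_dataset

# Load one of the training sets listed in the metadata tags above.
# The "train" split name is an assumption; it may differ per dataset.
nli = load_dataset("gowitheflow/allnli-sup", split="train")
print(nli[0])
```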

## Inference
Encoding with our PixelLinguist class is straightforward, much like using a SentenceTransformer class.

```python
model_name = "Pixel-Linguist/Pixel-Linguist-v0"
model = PixelLinguist(model_name)

texts = ["I love you","I like you"]
embeddings = model.encode(texts)
print(outputs[0] @ outputs[1].T)  # just use dot product because the embeddings are normalized automatically in the model class.
#tensor(0.9217)
```
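
The same embeddings also work for retrieval-style use. A minimal sketch that ranks a toy corpus against a query with dot-product scoring (the query and corpus strings here are illustrative):

```python
query = "How do pixel-based encoders represent text?"
corpus = [
    "Pixel encoders render sentences as images and encode the pixels.",
    "The weather in Durham is rainy most of the year.",
    "Sentence embeddings can be compared with a dot product.",
]

# Encode the query together with the corpus in one batch.
embeddings = model.encode([query] + corpus)
scores = embeddings[0] @ embeddings[1:].T  # one similarity score per corpus sentence

# Rank corpus sentences by similarity to the query, highest first.
for score, text in sorted(zip(scores.tolist(), corpus), reverse=True):
    print(f"{score:.4f}  {text}")
```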

To use the PixelLinguist class, first install the package following the instructions in our GitHub repo, then define the class as follows.

```python
import torch
from PIL import Image
from pixel import (
    AutoConfig,
    PangoCairoTextRenderer,
    PIXELForSequenceClassification,
    PIXELForRepresentation,
    PoolingMode,
    get_attention_mask,
    get_transforms,
    glue_strip_spaces,
    resize_model_embeddings,
)
from tqdm import tqdm

class PixelLinguist:
    def __init__(self, model_name, batch_size=16, max_seq_length=64,
                 device=None, pooling="mean", keep_mlp=False):
        if device is not None:
            self.device = device
        else:
            self.device = "cuda:0" if torch.cuda.is_available() else "cpu"
        self.config = AutoConfig.from_pretrained(model_name, num_labels=0)
        self.batch_size = batch_size
        if keep_mlp:
            self.model = PIXELForSequenceClassification.from_pretrained(
                model_name,
                config=self.config,
                pooling_mode=PoolingMode.from_string(pooling),
                add_layer_norm=True
            ).to(self.device)
        else:
            self.model = PIXELForRepresentation.from_pretrained(
                model_name,
                config=self.config,
                pooling_mode=PoolingMode.from_string(pooling),
                add_layer_norm=True
            ).to(self.device)
        self.processor = PangoCairoTextRenderer.from_pretrained(model_name, rgb=False)
        self.processor.max_seq_length = max_seq_length
        resize_model_embeddings(self.model, self.processor.max_seq_length)
        self.transforms = get_transforms(do_resize=True, size=(self.processor.pixels_per_patch, self.processor.pixels_per_patch * self.processor.max_seq_length))

    def preprocess(self, texts):
        # Render each text to an image, then build pixel tensors and attention masks.
        encodings = [self.processor(text=glue_strip_spaces(a)) for a in texts]
        pixel_values = torch.stack([self.transforms(Image.fromarray(e.pixel_values)) for e in encodings])
        attention_mask = torch.stack([get_attention_mask(e.num_text_patches, seq_length=self.processor.max_seq_length) for e in encodings])
        return {'pixel_values': pixel_values, 'attention_mask': attention_mask}

    def encode(self, texts, **kwargs):
        # Encode in batches; embeddings are read off the model's logits.
        all_outputs = []
        for i in tqdm(range(0, len(texts), self.batch_size)):
            batch_texts = texts[i:i+self.batch_size]
            inputs = self.preprocess(batch_texts)
            inputs = {k: v.to(self.device) for k, v in inputs.items()}
            with torch.no_grad():
                outputs = self.model(**inputs).logits.detach().cpu()
            all_outputs.append(outputs)
        return torch.cat(all_outputs, dim=0)
```
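
The constructor defaults mirror the setup above: mean pooling over patch embeddings and 64 rendered patches per text. A brief sketch of non-default options; the values below are illustrative, not ones validated in the paper:

```python
# Longer inputs: render up to 256 patches per text (at higher compute cost).
model_long = PixelLinguist(model_name, max_seq_length=256, batch_size=8)

# Keep the MLP head (sequence-classification variant) instead of the plain representation head.
model_mlp = PixelLinguist(model_name, keep_mlp=True)
```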

### Evaluation

For STS-benchmark evaluation (see GitHub repo):
```bash
python tools/evaluation_sts.py
```
For BEIR information retrieval evaluation (see GitHub repo):
```bash
python tools/evaluation_retrieval.py
```

**BibTeX:**
```bibtex
@article{xiao2024pixel,
  title={Pixel Sentence Representation Learning},
  author={Xiao, Chenghao and Huang, Zhuoxu and Chen, Danlu and Hudson, G Thomas and Li, Yizhi and Duan, Haoran and Lin, Chenghua and Fu, Jie and Han, Jungong and Moubayed, Noura Al},
  journal={arXiv preprint arXiv:2402.08183},
  year={2024}
}
```