---
datasets:
- gowitheflowlab/parallel-medium-w-nli
- gowitheflow/allnli-sup
- gowitheflow/wiki1M-character-level-all
- gowitheflow/wiki1M-word-condition-shuffle
- gowitheflow/wiki1M-word-character-all-multiple
- gowitheflow/wiki1M-word-random-shuffle
- gowitheflow/wiki-span
pipeline_tag: sentence-similarity
---
# Model Card for Pixel-Linguist/Pixel-Linguist-v0
Official model checkpoint of **Pixel-Linguist-v0** from the paper "Pixel Sentence Representation Learning".
### Model Summary
Pixel Linguist v0 is a sentence encoder trained to capture sentence- and document-level semantics from pixel-based textual signals alone. It is trained on 59 language pairs, English unsupervised data, Wikipedia spans, and NLI. It transfers zero-shot to many other languages, including ancient ones.
### Model Sources
- **GitHub Repo:** https://github.com/gowitheflow-1998/Pixel-Linguist
- **Paper:** https://arxiv.org/pdf/2402.08183.pdf
### Downstream Use
Semantic Textual Similarity, Information Retrieval
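For retrieval, ranking reduces to a dot product over the model's L2-normalized embeddings. A minimal sketch of that scoring step is shown below, using random stand-in tensors in place of real outputs; in practice the embeddings would come from `model.encode(...)` as shown in the Inference section.

```python
import torch

# Stand-in embeddings: PixelLinguist outputs are L2-normalized, so we
# normalize random tensors to mimic that property.
torch.manual_seed(0)
corpus_embeddings = torch.nn.functional.normalize(torch.randn(5, 768), dim=-1)
query_embedding = torch.nn.functional.normalize(torch.randn(1, 768), dim=-1)

# Dot product equals cosine similarity for unit-norm vectors.
scores = query_embedding @ corpus_embeddings.T  # shape: (1, 5)
top_scores, top_indices = scores.topk(k=3, dim=-1)
print(top_indices.tolist())  # indices of the 3 best-matching corpus sentences
```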
### Out-of-Scope Use
The model may not be optimal for fine-tuning on other tasks (such as classification), as it is trained for representation tasks via similarity matching.
### Training Data
All the training sets used in our progressive training scheme are listed in the dataset tags in the metadata above. Please refer to the paper for the exact training process.
## Inference
Encoding with our PixelLinguist class is straightforward, much like using a SentenceTransformer.
```python
model_name = "Pixel-Linguist/Pixel-Linguist-v0"
model = PixelLinguist(model_name)
texts = ["I love you", "I like you"]
embeddings = model.encode(texts)
print(embeddings[0] @ embeddings[1])  # dot product suffices: the embeddings are normalized automatically in the model class
# tensor(0.9217)
```
To use the PixelLinguist class, first install the package following our GitHub repo, then define the class as follows.
```python
import torch
from PIL import Image
from pixel import (
    AutoConfig,
    PangoCairoTextRenderer,
    PIXELForSequenceClassification,
    PIXELForRepresentation,
    PoolingMode,
    get_attention_mask,
    get_transforms,
    glue_strip_spaces,
    resize_model_embeddings,
)
from tqdm import tqdm

class PixelLinguist:
    def __init__(self, model_name, batch_size=16, max_seq_length=64,
                 device=None, pooling="mean", keep_mlp=False):
        if device is not None:
            self.device = device
        else:
            self.device = "cuda:0" if torch.cuda.is_available() else "cpu"
        self.config = AutoConfig.from_pretrained(model_name, num_labels=0)
        self.batch_size = batch_size
        if keep_mlp:
            # Keep the MLP head on top of the pooled representation.
            self.model = PIXELForSequenceClassification.from_pretrained(
                model_name,
                config=self.config,
                pooling_mode=PoolingMode.from_string(pooling),
                add_layer_norm=True,
            ).to(self.device)
        else:
            self.model = PIXELForRepresentation.from_pretrained(
                model_name,
                config=self.config,
                pooling_mode=PoolingMode.from_string(pooling),
                add_layer_norm=True,
            ).to(self.device)
        self.processor = PangoCairoTextRenderer.from_pretrained(model_name, rgb=False)
        self.processor.max_seq_length = max_seq_length
        resize_model_embeddings(self.model, self.processor.max_seq_length)
        self.transforms = get_transforms(
            do_resize=True,
            size=(self.processor.pixels_per_patch,
                  self.processor.pixels_per_patch * self.processor.max_seq_length),
        )

    def preprocess(self, texts):
        # Render each text into an image, then build pixel values and attention masks.
        encodings = [self.processor(text=glue_strip_spaces(a)) for a in texts]
        pixel_values = torch.stack(
            [self.transforms(Image.fromarray(e.pixel_values)) for e in encodings]
        )
        attention_mask = torch.stack(
            [get_attention_mask(e.num_text_patches, seq_length=self.processor.max_seq_length)
             for e in encodings]
        )
        return {"pixel_values": pixel_values, "attention_mask": attention_mask}

    def encode(self, texts, **kwargs):
        all_outputs = []
        for i in tqdm(range(0, len(texts), self.batch_size)):
            batch_texts = texts[i:i + self.batch_size]
            inputs = self.preprocess(batch_texts)
            inputs = {k: v.to(self.device) for k, v in inputs.items()}
            with torch.no_grad():
                outputs = self.model(**inputs).logits.detach().cpu()
            all_outputs.append(outputs)
        return torch.cat(all_outputs, dim=0)
```
### Evaluation
For STS benchmark evaluation (see GitHub repo):
```
python tools/evaluation_sts.py
```
For BEIR information retrieval evaluation (see GitHub repo):
```
python tools/evaluation_retrieval.py
```
**BibTeX:**
```bibtex
@article{xiao2024pixel,
title={Pixel Sentence Representation Learning},
author={Xiao, Chenghao and Huang, Zhuoxu and Chen, Danlu and Hudson, G Thomas and Li, Yizhi and Duan, Haoran and Lin, Chenghua and Fu, Jie and Han, Jungong and Moubayed, Noura Al},
journal={arXiv preprint arXiv:2402.08183},
year={2024}
}
``` |