|
--- |
|
datasets: |
|
- gowitheflowlab/parallel-medium-w-nli |
|
- gowitheflow/allnli-sup |
|
- gowitheflow/wiki1M-character-level-all |
|
- gowitheflow/wiki1M-word-condition-shuffle |
|
- gowitheflow/wiki1M-word-character-all-multiple |
|
- gowitheflow/wiki1M-word-random-shuffle |
|
- gowitheflow/wiki-span |
|
pipeline_tag: sentence-similarity |
|
--- |
|
|
|
# Model Card for Pixel-Linguist/Pixel-Linguist-v0 |
|
|
|
Official model checkpoint of **Pixel-Linguist-v0** from the paper "Pixel Sentence Representation Learning".
|
|
|
### Model Summary |
|
|
|
Pixel Linguist v0 is a sentence encoder trained to capture sentence- and document-level semantics from pixel-based textual signals alone. It is trained on parallel data covering 59 language pairs, English unsupervised data, Wikipedia spans, and NLI. It shows strong zero-shot transferability to other existing languages (even ancient ones).
|
|
|
### Model Sources |
|
|
|
- **Github Repo:** https://github.com/gowitheflow-1998/Pixel-Linguist |
|
- **Paper:** https://arxiv.org/pdf/2402.08183.pdf |
|
|
|
### Downstream Use |
|
|
|
Semantic Textual Similarity, Information Retrieval |
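
For retrieval, ranking documents by dot product against a query embedding is the typical pattern. A minimal sketch with NumPy, using random unit vectors as stand-ins for embeddings (in practice these would come from the model's `encode` method below):

```python
import numpy as np

# Hypothetical pre-computed, L2-normalized document embeddings
# (placeholders for PixelLinguist outputs; dimension 768 assumed).
rng = np.random.default_rng(0)
corpus = rng.normal(size=(5, 768))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# A query that is a slightly perturbed copy of document 2.
query = corpus[2] + 0.01 * rng.normal(size=768)
query /= np.linalg.norm(query)

# With normalized vectors, the dot product is the cosine similarity,
# so sorting by score ranks documents by semantic closeness.
scores = corpus @ query
ranking = np.argsort(-scores)
print(ranking[0])  # document 2 ranks first
```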
|
|
|
### Out-of-Scope Use |
|
|
|
The model might not be optimal for further fine-tuning on other tasks (such as classification), as it is trained for representation tasks via similarity matching.
|
|
|
### Training Data |
|
|
|
All the training sets used in our progressive training scheme are listed in the metadata tags above. Please refer to the paper for the exact procedure.
|
|
|
## Inference |
|
Encoding with our PixelLinguist class is straightforward, much like using a SentenceTransformer class.
|
|
|
```python |
|
model_name = "Pixel-Linguist/Pixel-Linguist-v0" |
|
model = PixelLinguist(model_name) |
|
|
|
texts = ["I love you", "I like you"]

embeddings = model.encode(texts)

print(embeddings[0] @ embeddings[1].T)  # a plain dot product suffices: the model class normalizes embeddings automatically

# tensor(0.9217)
|
``` |
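
The comment above relies on a general fact: for L2-normalized vectors, the dot product coincides with cosine similarity. A quick sanity check with toy vectors in plain PyTorch (no model required):

```python
import torch
import torch.nn.functional as F

# Two arbitrary toy vectors.
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([2.0, 1.0, 0.5])

# After L2 normalization, the dot product equals cosine similarity.
a_n = F.normalize(a, dim=0)
b_n = F.normalize(b, dim=0)

dot = a_n @ b_n
cos = F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).squeeze()
print(torch.isclose(dot, cos))  # tensor(True)
```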
|
|
|
To use the PixelLinguist class, first install the package following our GitHub repo, then define the class as follows.
|
|
|
```python |
|
import torch |
|
from PIL import Image |
|
from pixel import ( |
|
AutoConfig, |
|
PangoCairoTextRenderer, |
|
PIXELForSequenceClassification, |
|
PIXELForRepresentation, |
|
PoolingMode, |
|
get_attention_mask, |
|
get_transforms, |
|
glue_strip_spaces, |
|
resize_model_embeddings, |
|
) |
|
from tqdm import tqdm |
|
|
|
class PixelLinguist: |
|
def __init__(self, model_name, batch_size = 16, max_seq_length = 64, |
|
device=None, pooling = "mean", keep_mlp = False): |
|
if device is not None: |
|
self.device = device |
|
else: |
|
self.device = "cuda:0" if torch.cuda.is_available() else "cpu" |
|
self.config = AutoConfig.from_pretrained(model_name, num_labels=0) |
|
self.batch_size = batch_size |
|
        if keep_mlp:
|
self.model = PIXELForSequenceClassification.from_pretrained( |
|
model_name, |
|
config=self.config, |
|
pooling_mode=PoolingMode.from_string(pooling), |
|
add_layer_norm=True |
|
).to(self.device) |
|
else: |
|
self.model = PIXELForRepresentation.from_pretrained( |
|
model_name, |
|
config=self.config, |
|
pooling_mode=PoolingMode.from_string(pooling), |
|
add_layer_norm=True |
|
).to(self.device) |
|
self.processor = PangoCairoTextRenderer.from_pretrained(model_name, rgb=False) |
|
self.processor.max_seq_length = max_seq_length |
|
resize_model_embeddings(self.model, self.processor.max_seq_length) |
|
self.transforms = get_transforms(do_resize=True, size=(self.processor.pixels_per_patch, self.processor.pixels_per_patch * self.processor.max_seq_length)) |
|
|
|
def preprocess(self, texts): |
|
encodings = [self.processor(text=glue_strip_spaces(a)) for a in texts] |
|
pixel_values = torch.stack([self.transforms(Image.fromarray(e.pixel_values)) for e in encodings]) |
|
attention_mask = torch.stack([get_attention_mask(e.num_text_patches, seq_length=self.processor.max_seq_length) for e in encodings]) |
|
return {'pixel_values': pixel_values, 'attention_mask': attention_mask} |
|
|
|
def encode(self, texts, **kwargs): |
|
all_outputs = [] |
|
for i in tqdm(range(0, len(texts), self.batch_size)): |
|
            batch_texts = texts[i:i + self.batch_size]
|
inputs = self.preprocess(batch_texts) |
|
inputs = {k: v.to(self.device) for k, v in inputs.items()} |
|
with torch.no_grad(): |
|
outputs = self.model(**inputs).logits.detach().cpu() |
|
all_outputs.append(outputs) |
|
return torch.cat(all_outputs, dim=0) |
|
``` |
|
|
|
### Evaluation |
|
|
|
For STS-benchmark evaluation (see Github repo): |
|
```bash
|
python tools/evaluation_sts.py |
|
``` |
|
For BEIR information retrieval evaluation (see Github repo): |
|
```bash
|
python tools/evaluation_retrieval.py |
|
``` |
|
|
|
**BibTeX:** |
|
```bibtex |
|
@article{xiao2024pixel, |
|
title={Pixel Sentence Representation Learning}, |
|
author={Xiao, Chenghao and Huang, Zhuoxu and Chen, Danlu and Hudson, G Thomas and Li, Yizhi and Duan, Haoran and Lin, Chenghua and Fu, Jie and Han, Jungong and Moubayed, Noura Al}, |
|
journal={arXiv preprint arXiv:2402.08183}, |
|
year={2024} |
|
} |
|
``` |