Model Card for Pixel-Linguist/Pixel-Linguist-v0
Official model checkpoint of Pixel-Linguist-v0 from the paper "Pixel Sentence Representation Learning".
Model Summary
Pixel Linguist v0 is a sentence encoder trained to capture sentence- and document-level semantics from pixel-based textual signals alone. It is trained on 59 language pairs, plus English unsupervised data, Wikipedia Spans, and NLI. It shows strong zero-shot transfer to other existing languages, including ancient ones.
Model Sources
- Github Repo: https://github.com/gowitheflow-1998/Pixel-Linguist
- Paper: https://arxiv.org/pdf/2402.08183.pdf
Downstream Use
Semantic Textual Similarity, Information Retrieval
Out-of-Scope Use
The model might not be optimal for further fine-tuning on other tasks (such as classification), as it is trained for representation tasks with similarity matching.
Training Data
All training sets used in our progressive training scheme are listed as tags in the model metadata. Please refer to the paper for the exact process.
Inference
Encoding with our PixelLinguist class is straightforward and works much like using a SentenceTransformer class.
model_name = "Pixel-Linguist/Pixel-Linguist-v0"
model = PixelLinguist(model_name)
texts = ["I love you","I like you"]
embeddings = model.encode(texts)
print(embeddings[0] @ embeddings[1]) # a plain dot product suffices because the model class normalizes the embeddings automatically.
#tensor(0.9217)
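The dot-product shortcut works because, for unit-length vectors, the dot product equals the cosine similarity. The following is a minimal sketch in plain Python, using made-up toy vectors rather than real model outputs. The helper names `l2_normalize`, `dot`, and `cosine` are illustrative assumptions and are not part of the Pixel-Linguist package.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (hypothetical helper for illustration)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors standing in for two sentence embeddings (not real model outputs).
a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]

a_unit, b_unit = l2_normalize(a), l2_normalize(b)
dot_of_units = dot(a_unit, b_unit)  # dot product after normalization
cos_of_raw = cosine(a, b)           # cosine similarity of the raw vectors

# The two quantities agree up to floating-point error.
print(abs(dot_of_units - cos_of_raw) < 1e-9)
```

If embeddings were not normalized, the dot product would also reflect vector length, so scores would no longer be directly comparable across sentence pairs.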
To use the PixelLinguist class, first install the package by following our Github Repo, then define the PixelLinguist class as follows.
import torch
from PIL import Image
from pixel import (
    AutoConfig,
    PangoCairoTextRenderer,
    PIXELForSequenceClassification,
    PIXELForRepresentation,
    PoolingMode,
    get_attention_mask,
    get_transforms,
    glue_strip_spaces,
    resize_model_embeddings,
)
from tqdm import tqdm

class PixelLinguist:
    def __init__(self, model_name, batch_size = 16, max_seq_length = 64,
                 device=None, pooling = "mean", keep_mlp = False):
        # Use the requested device, otherwise fall back to GPU when available.
        if device is not None:
            self.device = device
        else:
            self.device = "cuda:0" if torch.cuda.is_available() else "cpu"
        self.config = AutoConfig.from_pretrained(model_name, num_labels=0)
        self.batch_size = batch_size
        if keep_mlp:
            self.model = PIXELForSequenceClassification.from_pretrained(
                model_name,
                config=self.config,
                pooling_mode=PoolingMode.from_string(pooling),
                add_layer_norm=True
            ).to(self.device)
        else:
            self.model = PIXELForRepresentation.from_pretrained(
                model_name,
                config=self.config,
                pooling_mode=PoolingMode.from_string(pooling),
                add_layer_norm=True
            ).to(self.device)
        # The renderer turns text into images that the model consumes.
        self.processor = PangoCairoTextRenderer.from_pretrained(model_name, rgb=False)
        self.processor.max_seq_length = max_seq_length
        resize_model_embeddings(self.model, self.processor.max_seq_length)
        self.transforms = get_transforms(do_resize=True, size=(self.processor.pixels_per_patch, self.processor.pixels_per_patch * self.processor.max_seq_length))

    def preprocess(self, texts):
        # Render each text, then build pixel tensors and attention masks.
        encodings = [self.processor(text=glue_strip_spaces(a)) for a in texts]
        pixel_values = torch.stack([self.transforms(Image.fromarray(e.pixel_values)) for e in encodings])
        attention_mask = torch.stack([get_attention_mask(e.num_text_patches, seq_length=self.processor.max_seq_length) for e in encodings])
        return {'pixel_values': pixel_values, 'attention_mask': attention_mask}

    def encode(self, texts, **kwargs):
        all_outputs = []
        # Encode in batches of self.batch_size and collect the results on CPU.
        for i in tqdm(range(0, len(texts), self.batch_size)):
            batch_texts = texts[i:i+self.batch_size]
            inputs = self.preprocess(batch_texts)
            inputs = {k: v.to(self.device) for k, v in inputs.items()}
            with torch.no_grad():
                outputs = self.model(**inputs).logits.detach().cpu()
            all_outputs.append(outputs)
        return torch.cat(all_outputs, dim=0)
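For the information retrieval use case, normalized embeddings can be ranked against a query by dot product. The following is a minimal sketch in plain Python with toy unit vectors. The function `rank_by_dot_product` is a hypothetical helper, not part of the Pixel-Linguist package, and it assumes the embeddings have been converted to lists of floats (for example with the tensor's `.tolist()` method).

```python
def rank_by_dot_product(query_vec, doc_vecs, top_k=3):
    """Return (doc_index, score) pairs for the top_k highest dot products.

    Hypothetical helper: assumes all vectors are already unit length, so the
    dot product can be read as cosine similarity.
    """
    scores = [sum(q * d for q, d in zip(query_vec, doc)) for doc in doc_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:top_k]]

# Toy unit vectors standing in for a query and three documents.
query = [1.0, 0.0]
docs = [[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]]

ranking = rank_by_dot_product(query, docs, top_k=2)
print(ranking)  # document 2 ranks first, then document 1
```

For large collections, a single matrix multiplication over the stacked embeddings would be the more usual choice. The loop above only illustrates the scoring logic.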
Evaluation
For STS-benchmark evaluation (see Github repo):
python tools/evaluation_sts.py
For BEIR information retrieval evaluation (see Github repo):
python tools/evaluation_retrieval.py
BibTeX:
@article{xiao2024pixel,
title={Pixel Sentence Representation Learning},
author={Xiao, Chenghao and Huang, Zhuoxu and Chen, Danlu and Hudson, G Thomas and Li, Yizhi and Duan, Haoran and Lin, Chenghua and Fu, Jie and Han, Jungong and Moubayed, Noura Al},
journal={arXiv preprint arXiv:2402.08183},
year={2024}
}