|
--- |
|
datasets: |
|
- gowitheflowlab/parallel-medium-w-nli |
|
- gowitheflow/allnli-sup |
|
- gowitheflow/wiki1M-character-level-all |
|
- gowitheflow/wiki1M-word-condition-shuffle |
|
- gowitheflow/wiki1M-word-character-all-multiple |
|
- gowitheflow/wiki1M-word-random-shuffle |
|
- gowitheflow/wiki-span |
|
pipeline_tag: sentence-similarity |
|
--- |
|
|
|
# Model Card for Pixel-Linguist/Pixel-Linguist-v0 |
|
|
|
Official model checkpoint of **Pixel-Linguist-v0** from the paper "Pixel Sentence Representation Learning".
|
|
|
### Model Summary |
|
|
|
Pixel Linguist v0 is a sentence encoder trained to capture sentence- and document-level semantics from pixel-based textual signals alone. It is trained on parallel data covering 59 language pairs, English unsupervised data, Wikipedia spans, and NLI. It shows strong zero-shot transferability to other existing languages (even ancient ones).
|
|
|
### Model Sources |
|
|
|
- **Github Repo:** https://github.com/gowitheflow-1998/Pixel-Linguist |
|
- **Paper:** https://arxiv.org/pdf/2402.08183.pdf |
|
|
|
### Downstream Use |
|
|
|
Semantic Textual Similarity, Information Retrieval |
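
For retrieval, ranking documents by dot product against a query embedding is the typical pattern. A minimal sketch with NumPy, using random unit vectors as stand-ins for embeddings (in practice these would come from the model's `encode` method below):

```python
import numpy as np

# Hypothetical pre-computed, L2-normalized document embeddings
# (placeholders for PixelLinguist outputs; dimension 768 assumed).
rng = np.random.default_rng(0)
corpus = rng.normal(size=(5, 768))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# A query that is a slightly perturbed copy of document 2.
query = corpus[2] + 0.01 * rng.normal(size=768)
query /= np.linalg.norm(query)

# With normalized vectors, the dot product is the cosine similarity,
# so sorting by score ranks documents by semantic closeness.
scores = corpus @ query
ranking = np.argsort(-scores)
print(ranking[0])  # document 2 ranks first
```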
|
|
|
### Out-of-Scope Use |
|
|
|
The model might not be optimal for further fine-tuning on other tasks (such as classification), as it is trained for representation tasks via similarity matching.
|
|
|
### Training Data |
|
|
|
All the training sets used in our progressive training scheme are listed in the metadata tags above. Please refer to the paper for the exact procedure.
|
|
|
## Inference |
|
Encoding with our PixelLinguist class is straightforward, much like using a SentenceTransformer class.
|
|
|
```python |
|
model_name = "Pixel-Linguist/Pixel-Linguist-v0" |
|
model = PixelLinguist(model_name) |
|
|
|
texts = ["I love you", "I like you"]

embeddings = model.encode(texts)

print(embeddings[0] @ embeddings[1].T)  # a plain dot product suffices: the model class normalizes embeddings automatically

# tensor(0.9217)
|
``` |
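
The comment above relies on a general fact: for L2-normalized vectors, the dot product coincides with cosine similarity. A quick sanity check with toy vectors in plain PyTorch (no model required):

```python
import torch
import torch.nn.functional as F

# Two arbitrary toy vectors.
a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([2.0, 1.0, 0.5])

# After L2 normalization, the dot product equals cosine similarity.
a_n = F.normalize(a, dim=0)
b_n = F.normalize(b, dim=0)

dot = a_n @ b_n
cos = F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)).squeeze()
print(torch.isclose(dot, cos))  # tensor(True)
```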
|
|
|
To use the PixelLinguist class, first install the package following our GitHub repo, then define the class as follows.
|
|
|
```python |
|
import torch |
|
from PIL import Image |
|
from pixel import ( |
|
AutoConfig, |
|
PangoCairoTextRenderer, |
|
PIXELForSequenceClassification, |
|
PIXELForRepresentation, |
|
PoolingMode, |
|
get_attention_mask, |
|
get_transforms, |
|
glue_strip_spaces, |
|
resize_model_embeddings, |
|
) |
|
from tqdm import tqdm |
|
|
|
class PixelLinguist: |
|
def __init__(self, model_name, batch_size = 16, max_seq_length = 64, |
|
device=None, pooling = "mean", keep_mlp = False): |
|
if device is not None: |
|
self.device = device |
|
else: |
|
self.device = "cuda:0" if torch.cuda.is_available() else "cpu" |
|
self.config = AutoConfig.from_pretrained(model_name, num_labels=0) |
|
self.batch_size = batch_size |
|
        if keep_mlp:
|
self.model = PIXELForSequenceClassification.from_pretrained( |
|
model_name, |
|
config=self.config, |
|
pooling_mode=PoolingMode.from_string(pooling), |
|
add_layer_norm=True |
|
).to(self.device) |
|
else: |
|
self.model = PIXELForRepresentation.from_pretrained( |
|
model_name, |
|
config=self.config, |
|
pooling_mode=PoolingMode.from_string(pooling), |
|
add_layer_norm=True |
|
).to(self.device) |
|
self.processor = PangoCairoTextRenderer.from_pretrained(model_name, rgb=False) |
|
self.processor.max_seq_length = max_seq_length |
|
resize_model_embeddings(self.model, self.processor.max_seq_length) |
|
self.transforms = get_transforms(do_resize=True, size=(self.processor.pixels_per_patch, self.processor.pixels_per_patch * self.processor.max_seq_length)) |
|
|
|
def preprocess(self, texts): |
|
encodings = [self.processor(text=glue_strip_spaces(a)) for a in texts] |
|
pixel_values = torch.stack([self.transforms(Image.fromarray(e.pixel_values)) for e in encodings]) |
|
attention_mask = torch.stack([get_attention_mask(e.num_text_patches, seq_length=self.processor.max_seq_length) for e in encodings]) |
|
return {'pixel_values': pixel_values, 'attention_mask': attention_mask} |
|
|
|
def encode(self, texts, **kwargs): |
|
all_outputs = [] |
|
for i in tqdm(range(0, len(texts), self.batch_size)): |
|
            batch_texts = texts[i:i + self.batch_size]
|
inputs = self.preprocess(batch_texts) |
|
inputs = {k: v.to(self.device) for k, v in inputs.items()} |
|
with torch.no_grad(): |
|
outputs = self.model(**inputs).logits.detach().cpu() |
|
all_outputs.append(outputs) |
|
return torch.cat(all_outputs, dim=0) |
|
``` |
|
|
|
### Evaluation |
|
|
|
For STS-benchmark evaluation (see Github repo): |
|
```bash
|
python tools/evaluation_sts.py |
|
``` |
|
For BEIR information retrieval evaluation (see Github repo): |
|
```bash
|
python tools/evaluation_retrieval.py |
|
``` |
|
|
|
**BibTeX:** |
|
```bibtex |
|
@article{xiao2024pixel, |
|
title={Pixel Sentence Representation Learning}, |
|
author={Xiao, Chenghao and Huang, Zhuoxu and Chen, Danlu and Hudson, G Thomas and Li, Yizhi and Duan, Haoran and Lin, Chenghua and Fu, Jie and Han, Jungong and Moubayed, Noura Al}, |
|
journal={arXiv preprint arXiv:2402.08183}, |
|
year={2024} |
|
} |
|
``` |