
	---
	license: apache-2.0
	---

	# Vision-and-Language Transformer (ViLT), fine-tuned on VSR zeroshot split

	Vision-and-Language Transformer (ViLT) model fine-tuned on the zeroshot split of [Visual Spatial Reasoning (VSR)](https://arxiv.org/abs/2205.00363). ViLT was introduced in the paper [ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision](https://arxiv.org/abs/2102.03334) by Kim et al. and first released in [this repository](https://github.com/dandelin/ViLT).

	## Intended uses & limitations

	You can use the model to determine whether a sentence describing a spatial relation is true or false given an image.

	### How to use

	Here is how to use the model in PyTorch:

	```python
	import requests
	import torch
	from PIL import Image
	from transformers import ViltProcessor, ViltForImagesAndTextClassification

	# load an example image and the statement to check against it
	image = Image.open(requests.get("https://camo.githubusercontent.com/ffcbeada14077b8e6d4b16817c91f78ba50aace210a1e4754418f1413d99797f/687474703a2f2f696d616765732e636f636f646174617365742e6f72672f747261696e323031372f3030303030303038303333362e6a7067", stream=True).raw)
	text = "The person is ahead of the cow."

	processor = ViltProcessor.from_pretrained("juletxara/vilt-vsr-zeroshot")
	model = ViltForImagesAndTextClassification.from_pretrained("juletxara/vilt-vsr-zeroshot")

	# prepare inputs
	encoding = processor(image, text, return_tensors="pt")

	# forward pass (no gradients needed for inference); pixel_values must be 5-D:
	# (batch_size, num_images, num_channels, height, width)
	with torch.no_grad():
	    outputs = model(input_ids=encoding.input_ids, pixel_values=encoding.pixel_values.unsqueeze(0))
	logits = outputs.logits
	idx = logits.argmax(-1).item()
	print("Predicted answer:", model.config.id2label[idx])
	```
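
If a confidence score is useful alongside the predicted label, the logits can be passed through a softmax. The following is a minimal sketch rather than part of the original usage example: `label_probabilities` is a hypothetical helper, and the logits and the `{0: "False", 1: "True"}` mapping are made-up values for illustration. With the real model, the inputs would presumably be `outputs.logits[0].tolist()` and `model.config.id2label`; check the config for the actual label names.

```python
import math


def label_probabilities(logits, id2label):
    """Apply a softmax to a flat list of logits and key the result by label name."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return {id2label[i]: e / total for i, e in enumerate(exps)}


# Hypothetical values for illustration only (not real model output).
example_logits = [-1.2, 2.3]
example_id2label = {0: "False", 1: "True"}

probs = label_probabilities(example_logits, example_id2label)
best = max(probs, key=probs.get)
print("Predicted answer:", best, "with probability", round(probs[best], 3))
```

The helper only depends on the standard library, so it can be applied to logits from any two-way (or multi-way) classification head.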

	## Training data

	(to do)

	## Training procedure

	### Preprocessing

	(to do)

	### Pretraining

	(to do)

	## Evaluation results

	(to do)

	### BibTeX entry and citation info

	```bibtex
	@misc{kim2021vilt,
	title={ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision},
	author={Wonjae Kim and Bokyung Son and Ildoo Kim},
	year={2021},
	eprint={2102.03334},
	archivePrefix={arXiv},
	primaryClass={stat.ML}
	}

	@article{liu2022visual,
	title={Visual Spatial Reasoning},
	author={Liu, Fangyu and Emerson, Guy and Collier, Nigel},
	journal={arXiv preprint arXiv:2205.00363},
	year={2022}
	}
	```