	---
	license: apache-2.0
	pipeline_tag: feature-extraction
	tags:
	- clip
	- vision
	datasets:
	- Ziyang/yfcc15m
	- conceptual_captions
	---
	<h1 align="center">UForm</h1>
	<h3 align="center">
	Pocket-Sized Multimodal AI<br/>
	For Content Understanding and Generation<br/>
	In Python, JavaScript, and Swift<br/>
	</h3>

	---

	The `uform3-image-text-english-small` UForm model is a tiny pair of vision and English-language encoders that maps images and text into a shared vector space.
	This model produces up to __256-dimensional embeddings__ and is made of:

	* Text encoder: 4-layer BERT for up to 64 input tokens.
	* Visual encoder: ViT-S/16 for images of 224 x 224 resolution.
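The phrase "up to 256-dimensional" suggests that shorter embeddings can be used when memory or search speed matters. The following is a minimal sketch under an assumption this card does not confirm: that a prefix of the embedding stays usable on its own, Matryoshka-style. It uses random NumPy vectors in place of real model outputs, and `truncate_embeddings` is a hypothetical helper, not part of the `uform` API.

```python
import numpy as np

def truncate_embeddings(embeddings: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components and re-normalize each row to unit length."""
    if not 0 < dims <= embeddings.shape[-1]:
        raise ValueError("dims must be between 1 and the embedding width")
    prefix = embeddings[..., :dims]
    norms = np.linalg.norm(prefix, axis=-1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    return prefix / np.clip(norms, 1e-12, None)

# Placeholder data standing in for model outputs of shape (batch, 256).
rng = np.random.default_rng(0)
full = rng.standard_normal((4, 256)).astype(np.float32)
short = truncate_embeddings(full, 64)
print(short.shape)
```

Whether truncation preserves retrieval quality for this particular checkpoint should be checked on your own data before relying on it.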

	Unlike most CLIP-like multimodal models, this model shares 2 layers between the text and visual encoders, which makes training more data- and parameter-efficient.
	Also unlike most models, UForm provides checkpoints compatible with PyTorch, ONNX, and CoreML, covering most AI-capable devices, along with pre-quantized weights and inference code.
	If you need a larger, more accurate, or multilingual model, check our [HuggingFace Hub](https://huggingface.co/unum-cloud/).
	For more details on running the model, check out the [UForm GitHub repository](https://github.com/unum-cloud/uform/).

	## Evaluation

	For zero-shot ImageNet classification the model achieves Top-1 accuracy of 36.1% and Top-5 of 60.8%.
	On text-to-image retrieval it reaches 86% Recall@10 for Flickr:

	| Dataset           | Recall@1 | Recall@5 | Recall@10 |
	| :---------------- | -------: | -------: | --------: |
	| Zero-Shot Flickr  |    0.565 |    0.790 |     0.860 |
	| Zero-Shot MS-COCO |    0.281 |    0.525 |     0.645 |
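Recall@K is the fraction of text queries whose matching image appears among the K highest-scoring images. The following is a minimal sketch of that computation, assuming a precomputed similarity matrix and one ground-truth image per query stored at the same index. Real benchmarks such as Flickr and MS-COCO pair several captions with each image, and `recall_at_k` is a hypothetical helper, not the evaluation code behind the numbers reported here.

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """similarity[i, j] scores text query i against image j; image i is the match for query i."""
    # Indices of the k best-scoring images for every query.
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    targets = np.arange(similarity.shape[0])[:, None]
    # A query counts as a hit if its own image index is among its top k.
    hits = (top_k == targets).any(axis=1)
    return float(hits.mean())

# Toy example: queries 0 and 2 rank their own image first, query 1 ranks it second.
similarity = np.array([
    [0.9, 0.1, 0.2],
    [0.8, 0.7, 0.1],
    [0.2, 0.3, 0.6],
])
print(recall_at_k(similarity, 1))  # 2 of 3 queries hit
print(recall_at_k(similarity, 2))  # all 3 queries hit
```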

	## Installation

	```bash
	pip install "uform[torch,onnx]"
	```

	## Usage

	To load the model:

	```python
	from uform import get_model, Modality

	import requests
	from io import BytesIO
	from PIL import Image

	model_name = 'unum-cloud/uform3-image-text-english-small'
	modalities = [Modality.TEXT_ENCODER, Modality.IMAGE_ENCODER]
	processors, models = get_model(model_name, modalities=modalities)

	model_text = models[Modality.TEXT_ENCODER]
	model_image = models[Modality.IMAGE_ENCODER]
	processor_text = processors[Modality.TEXT_ENCODER]
	processor_image = processors[Modality.IMAGE_ENCODER]
	```

	To encode the content:

	```python
	text = 'a cityscape bathed in the warm glow of the sun, with varied architecture and a towering, snow-capped mountain rising majestically in the background'
	image_url = 'https://media-cdn.tripadvisor.com/media/photo-s/1b/28/6b/53/lovely-armenia.jpg'
	image = Image.open(BytesIO(requests.get(image_url).content))

	image_data = processor_image(image)
	text_data = processor_text(text)
	image_features, image_embedding = model_image.encode(image_data, return_features=True)
	text_features, text_embedding = model_text.encode(text_data, return_features=True)
	```
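Once both modalities are embedded, the cosine similarity between the two vectors measures how well the text matches the image. The following is a minimal sketch, assuming the embeddings can be converted to NumPy arrays of shape `(batch, dims)`; the actual return type depends on the backend. Random placeholder vectors are used so the snippet runs without downloading the model, and `cosine_similarity` is a hypothetical helper, not part of the `uform` API.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Placeholders standing in for the model's 256-dimensional outputs.
rng = np.random.default_rng(42)
image_embedding = rng.standard_normal((1, 256))
text_embedding = rng.standard_normal((1, 256))

score = cosine_similarity(image_embedding, text_embedding)
print(score.shape)
```

Scores closer to 1 indicate a closer match. With several texts and images, the same function returns a full image-by-text similarity matrix that can be ranked per row.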