ashvardanian committed
Commit a01951e
1 Parent(s): e2c6da8
Update README.md
README.md CHANGED
@@ -18,10 +18,10 @@ In Python, JavaScript, and Swift<br/>
 ---
 
 The `uform3-image-text-english-small` UForm model is a tiny vision and English language encoder, mapping them into a shared vector space.
-This model is made of:
+This model produces up to __256-dimensional embeddings__ and is made of:
 
-* Text encoder: 4-layer BERT.
-* Visual encoder: ViT-S/16 for images of
+* Text encoder: 4-layer BERT for up to 64 input tokens.
+* Visual encoder: ViT-S/16 for images of 224 x 224 resolution.
 
 Unlike most CLIP-like multimodal models, this model shares 2 layers between the text and visual encoder to allow for more data- and parameter-efficient training.
 Also unlike most models, UForm provides checkpoints compatible with PyTorch, ONNX, and CoreML, covering the absolute majority of AI-capable devices, with pre-quantized weights and inference code.
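For context, the description above is typically exercised through the `uform` Python package. The sketch below is a minimal, non-authoritative example assuming the v3-style `get_model` / `Modality` API and a placeholder image file `panda.jpg`; exact entry points and return shapes may differ between `uform` releases.

```python
# Minimal sketch: producing the shared 256-dimensional embeddings described above.
# Assumes `pip install uform` (v3+) with its `get_model` / `Modality` API; the image
# path `panda.jpg` is a placeholder, not part of the model card.
from uform import get_model, Modality
from PIL import Image
import torch.nn.functional as F

processors, models = get_model('unum-cloud/uform3-image-text-english-small')

processor_text = processors[Modality.TEXT_ENCODER]
processor_image = processors[Modality.IMAGE_ENCODER]
model_text = models[Modality.TEXT_ENCODER]
model_image = models[Modality.IMAGE_ENCODER]

# Text is tokenized for the 4-layer BERT encoder (up to 64 input tokens).
text_data = processor_text('a red panda sitting on a tree branch')
_, text_embedding = model_text.encode(text_data, return_features=True)

# Images are resized to 224 x 224 for the ViT-S/16 encoder.
image_data = processor_image(Image.open('panda.jpg'))
_, image_embedding = model_image.encode(image_data, return_features=True)

# Both vectors live in the same space, so cosine similarity is meaningful.
print(F.cosine_similarity(text_embedding, image_embedding))
```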