unum-cloud/uform-gen-chat

Visual Question Answering

text-generation

image-captioning

Inference Endpoints


kimihailv committed on Dec 28, 2023

Commit 07ca6b0 • 1 Parent(s): 9887a97

Create README.md

Files changed (1)

README.md +82 -0

README.md ADDED

	@@ -0,0 +1,82 @@

+---
+license: apache-2.0
+language:
+- en
+---
+<h1 align="center">UForm</h1>
+<h3 align="center">
+Pocket-Sized Multimodal AI<br/>
+For Content Understanding and Generation<br/>
+</h3>
+## Description
+UForm-Gen is a small generative vision-language model primarily designed for Image Captioning and Visual Question Answering. The model consists of two parts:
+1. [UForm Vision Encoder](https://huggingface.co/unum-cloud/uform-vl-english)
+2. [Sheared-LLaMA-1.3B](https://huggingface.co/princeton-nlp/Sheared-LLaMA-1.3B) manually tuned on the instruction dataset
+The model was pre-trained on MSCOCO, SBU Captions, Visual Genome, VQAv2, GQA, and a few internal datasets. UForm-Gen-Chat is a supervised fine-tuned (SFT) version of [`UForm-Gen`](https://huggingface.co/unum-cloud/uform-gen) for multimodal chat.
+### Usage
+```bash
+pip install uform
+```
+```python
+import torch
+from PIL import Image
+from uform.gen_model import VLMForCausalLM, VLMProcessor
+model = VLMForCausalLM.from_pretrained("unum-cloud/uform-gen-chat")
+processor = VLMProcessor.from_pretrained("unum-cloud/uform-gen-chat")
+messages = [
+    {"role": "system", "content": "You are a helpful assistant."},
+    {"role": "user", "content": "<image> {Your message}"}
+]
+image = processor.image_processor(Image.open("zebra.jpg")).unsqueeze(0)
+input_ids = processor.tokenizer.apply_chat_template(
+    messages, return_tensors="pt", add_generation_prompt=True
+)
+attention_mask = torch.ones(1, input_ids.shape[1] + processor.num_image_latents - 1)
+inputs = {
+    "input_ids": input_ids,
+    "attention_mask": attention_mask,
+    "images": image,
+}
+outputs = model.generate(
+    **inputs,
+    do_sample=False,
+    use_cache=True,
+    max_new_tokens=1024,
+    eos_token_id=32001,
+    pad_token_id=processor.tokenizer.pad_token_id,
+)
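+# Slice off the prompt tokens and the final generated token before decoding;
+# `batch_decode` returns a list with one string per batch element.
+message = processor.batch_decode(outputs[:, inputs["input_ids"].shape[1]:-1])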
+```
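A note on the attention mask length in the usage snippet above. One plausible reading, not confirmed by the README, is that the single `<image>` placeholder token in the prompt is replaced by `processor.num_image_latents` image embeddings, so the fused sequence is `num_image_latents - 1` positions longer than the tokenized text. The sketch below only illustrates that arithmetic in plain Python. The helper names and the value 256 are hypothetical and are not part of `uform`.

```python
# Hypothetical helpers illustrating the mask-length arithmetic; not part of uform.

def fused_sequence_length(num_text_tokens: int, num_image_latents: int) -> int:
    """Length of the sequence after one placeholder token is swapped for image latents.

    Assumption: exactly one text token (the `<image>` placeholder) is replaced
    by `num_image_latents` embeddings, so the net growth is num_image_latents - 1.
    """
    if num_text_tokens < 1:
        raise ValueError("the prompt must contain at least the placeholder token")
    if num_image_latents < 1:
        raise ValueError("at least one image latent is expected")
    return num_text_tokens + num_image_latents - 1


def build_attention_mask(num_text_tokens: int, num_image_latents: int) -> list[list[int]]:
    """All-ones mask for a batch of one, mirroring torch.ones(1, length)."""
    length = fused_sequence_length(num_text_tokens, num_image_latents)
    return [[1] * length]


if __name__ == "__main__":
    # 256 is an illustrative value, not the model's documented latent count.
    mask = build_attention_mask(num_text_tokens=12, num_image_latents=256)
    print(len(mask), len(mask[0]))  # 1 267
```

If the prompt contained more than one image placeholder, this arithmetic would no longer hold; the README example uses a single image.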
+## Evaluation
+For captioning evaluation we measure CLIPScore and RefCLIPScore¹.
+| Model                               | Size | Caption Length | CLIPScore | RefCLIPScore |
+| :---------------------------------- | ---: | -------------: | --------: | -----------: |
+| `llava-hf/llava-1.5-7b-hf`          |   7B |           Long |     0.878 |        0.529 |
+| `llava-hf/llava-1.5-7b-hf`          |   7B |          Short |     0.886 |        0.531 |
+|                                     |
+| `Salesforce/instructblip-vicuna-7b` |   7B |           Long |     0.902 |        0.534 |
+| `Salesforce/instructblip-vicuna-7b` |   7B |          Short |     0.848 |        0.523 |
+|                                     |
+|                                     |
+| `unum-cloud/uform-gen-chat`         | 1.5B |           Long |     0.860 |        0.525 |
+| `unum-cloud/uform-gen-chat`         | 1.5B |          Short |     0.858 |        0.525 |
+¹ We used the `apple/DFN5B-CLIP-ViT-H-14-378` CLIP model.
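For readers unfamiliar with the two metrics in the table, here is a minimal sketch of how CLIPScore and RefCLIPScore are commonly defined: CLIPScore is a rescaled, clipped cosine similarity between caption and image embeddings, and RefCLIPScore is the harmonic mean of CLIPScore and the best clipped caption-to-reference similarity. The README does not state the weighting or any other evaluation details, so the weight `w = 2.5` and the function names below are assumptions, and the embeddings are taken as precomputed vectors rather than produced by the CLIP model from the footnote.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def clip_score(caption_emb: Vector, image_emb: Vector, w: float = 2.5) -> float:
    """CLIPScore: w * max(cos(caption, image), 0). The weight w is an assumption."""
    return w * max(cosine(caption_emb, image_emb), 0.0)


def ref_clip_score(
    caption_emb: Vector,
    image_emb: Vector,
    reference_embs: Sequence[Vector],
    w: float = 2.5,
) -> float:
    """RefCLIPScore: harmonic mean of CLIPScore and the best reference similarity."""
    clip_s = clip_score(caption_emb, image_emb, w)
    ref_s = max(max(cosine(caption_emb, r) for r in reference_embs), 0.0)
    if clip_s == 0.0 or ref_s == 0.0:
        return 0.0  # the harmonic mean is zero when either term is zero
    return 2.0 * clip_s * ref_s / (clip_s + ref_s)


if __name__ == "__main__":
    # Toy 2-D embeddings, purely illustrative.
    caption, image = [1.0, 0.0], [1.0, 1.0]
    references = [[0.0, 1.0], [1.0, 0.0]]
    print(clip_score(caption, image))
    print(ref_clip_score(caption, image, references))
```

Scores from this sketch are not expected to reproduce the table, since the exact evaluation setup is not described in the README.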