Image-Text-to-Text
KerasHub
Divyasreepat committed on
Commit
080416b
·
verified ·
1 Parent(s): 704ba52

Update README.md

  1. README.md +608 -31
README.md CHANGED
@@ -1,34 +1,611 @@
1
  ---
2
  library_name: keras-hub
3
  ---
4
- This is a [`PaliGemma` model](https://keras.io/api/keras_hub/models/pali_gemma) uploaded using the KerasHub library and can be used with JAX, TensorFlow, and PyTorch backends.
5
- Model config:
6
- * **name:** pali_gemma_backbone
7
- * **trainable:** True
8
- * **vocabulary_size:** 257152
9
- * **image_size:** 896
10
- * **num_layers:** 26
11
- * **num_query_heads:** 8
12
- * **num_key_value_heads:** 4
13
- * **hidden_dim:** 2304
14
- * **intermediate_dim:** 18432
15
- * **head_dim:** 256
16
- * **vit_patch_size:** 14
17
- * **vit_num_heads:** 16
18
- * **vit_hidden_dim:** 1152
19
- * **vit_num_layers:** 27
20
- * **vit_intermediate_dim:** 4304
21
- * **vit_pooling:** None
22
- * **vit_classifier_activation:** None
23
- * **vit_name:** None
24
- * **query_head_dim_normalize:** True
25
- * **use_post_ffw_norm:** True
26
- * **use_post_attention_norm:** True
27
- * **final_logit_soft_cap:** 30
28
- * **attention_logit_soft_cap:** 50
29
- * **sliding_window_size:** 4096
30
- * **use_sliding_window_attention:** True
31
- * **layer_norm_epsilon:** 1e-06
32
- * **dropout:** 0
33
-
34
- This model card has been generated automatically and should be completed by the model author. See [Model Cards documentation](https://huggingface.co/docs/hub/model-cards) for more information.
1
  ---
2
  library_name: keras-hub
3
+ license: gemma
4
+ pipeline_tag: image-text-to-text
5
+ extra_gated_heading: Access PaliGemma on Hugging Face
6
+ extra_gated_prompt: >-
7
+ To access PaliGemma on Hugging Face, you’re required to review and agree to
8
+ Google’s usage license. To do this, please ensure you’re logged-in to Hugging
9
+ Face and click below. Requests are processed immediately.
10
+ extra_gated_button_content: Acknowledge license
11
  ---
12
+ # PaliGemma 2 model card
13
+
14
+ **Model page:** [PaliGemma](https://ai.google.dev/gemma/docs/paligemma)
15
+
16
+ JAX/Flax PaliGemma 2 28B weights for use with the [`big_vision`](https://github.com/google-research/big_vision) codebase,
17
+ pre-trained with 896×896 input images and 512-token input/output text sequences.
18
+
19
+ The model is available in the `bfloat16` format for fine-tuning.
20
+
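As a rough illustration of what the 896×896 input resolution and the `bfloat16` format imply, the sketch below counts the image patches produced by a patch size of 14 (the `/14` in SigLIP-So400m/14, matching `vit_patch_size: 14` in the previous config) and estimates the raw weight size. The 28-billion parameter count is an assumption taken from the nominal "28B" name, not an exact figure, and the helper names are hypothetical.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    side = image_size // patch_size
    return side * side


def bf16_bytes(num_params: int) -> int:
    """Raw storage for the weights: bfloat16 uses 2 bytes per parameter."""
    return num_params * 2


IMAGE_SIZE = 896                  # from the model card
PATCH_SIZE = 14                   # from SigLIP-So400m/14
NOMINAL_PARAMS = 28_000_000_000   # assumption: nominal "28B", not exact

print(num_patches(IMAGE_SIZE, PATCH_SIZE))   # 4096
print(bf16_bytes(NOMINAL_PARAMS) / 1e9)      # 56.0 (GB, weights only)
```

The estimate covers the weights alone; optimizer state and activations during fine-tuning need additional memory.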
21
+ **Downloading Model Weights**
22
+ First, authenticate using the Hugging Face CLI:
23
+ ```bash
24
+ huggingface-cli login
25
+ ```
26
+ Use the following command to download the model weights:
27
+ ```bash
28
+ huggingface-cli download --local-dir models google/paligemma2-28b-pt-896-jax
29
+ ```
30
+ This will download the weights in multiple split files to the `models` directory.
31
+
32
+ From inside the `models` directory, combine the downloaded `.npz` parts into a single file using the `cat` command:
33
+ ```bash
34
+ cat paligemma2-28b-pt-896.b16.npz.part* > paligemma2-28b-pt-896.b16.npz
35
+ ```
36
+ The resulting `paligemma2-28b-pt-896.b16.npz` file is now ready to use.
37
+
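Where `cat` is not available (for example on Windows), the same concatenation can be sketched in Python. This is a minimal sketch, not part of the official instructions: the `combine_parts` helper is hypothetical, and it assumes the part suffixes sort lexicographically in the right order, which is what the shell glob `part*` relies on as well.

```python
import shutil
from pathlib import Path


def combine_parts(directory: str, output_name: str) -> Path:
    """Concatenate `<output_name>.part*` files in sorted order into one file."""
    base = Path(directory)
    parts = sorted(base.glob(output_name + ".part*"))
    if not parts:
        raise FileNotFoundError(f"no parts found for {output_name} in {base}")
    output = base / output_name
    with output.open("wb") as out:
        for part in parts:
            # Stream each part so the full file never has to fit in memory.
            with part.open("rb") as src:
                shutil.copyfileobj(src, out)
    return output
```

For the download above, this would be called as `combine_parts("models", "paligemma2-28b-pt-896.b16.npz")`.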
38
+ **Resources and technical documentation:**
39
+
40
+ * [PaliGemma 2 on Kaggle](https://www.kaggle.com/models/google/paligemma-2)
41
+ * [Responsible Generative AI Toolkit](https://ai.google.dev/responsible)
42
+
43
+ **Terms of Use:** [Terms](https://ai.google.dev/gemma/terms)
44
+
45
+ **Authors:** Google
46
+
47
+ ## Model information
48
+
49
+ ### Model summary
50
+
51
+ PaliGemma 2 is an update of the [PaliGemma](https://arxiv.org/abs/2407.07726)
52
+ vision-language model (VLM) which incorporates the capabilities of the
53
+ [Gemma 2](https://arxiv.org/abs/2408.00118) models. The PaliGemma family of
54
+ models is inspired by [PaLI-3](https://arxiv.org/abs/2310.09199) and based on
55
+ open components such as the [SigLIP](https://arxiv.org/abs/2303.15343) vision
56
+ model and [Gemma 2](https://arxiv.org/abs/2408.00118) language models. It takes
57
+ both image and text as input and generates text as output, supporting multiple
58
+ languages. It is designed for class-leading fine-tuning performance on a wide
59
+ range of vision-language tasks such as image and short video captioning, visual
60
+ question answering, text reading, object detection and object segmentation.
61
+
62
+ #### Model architecture
63
+
64
+ PaliGemma 2 is the composition of a
65
+ [Transformer decoder](https://arxiv.org/abs/1706.03762) and a
66
+ [Vision Transformer image encoder](https://arxiv.org/abs/2010.11929).
67
+ The text decoder is initialized from
68
+ [Gemma 2](https://ai.google.dev/gemma/docs/base) in the 2B, 9B, and 27B
69
+ parameter sizes. The image encoder is initialized from
70
+ [SigLIP-So400m/14](https://colab.research.google.com/github/google-research/big_vision/blob/main/big_vision/configs/proj/image_text/SigLIP_demo.ipynb).
71
+ Similar to the original PaliGemma model, PaliGemma 2 is trained following the
72
+ [PaLI-3](https://arxiv.org/abs/2310.09199) recipes.
73
+
74
+ #### Inputs and outputs
75
+
76
+ * **Input:** Image and text string, such as a prompt to caption the image, or
77
+ a question.
78
+ * **Output:** Generated text in response to the input, such as a caption of
79
+ the image, an answer to a question, a list of object bounding box
80
+ coordinates, or segmentation codewords.
81
+
82
+ #### Citation
83
+
84
+ ```none
85
+ @article{
86
+ title={PaliGemma 2: A Family of Versatile VLMs for Transfer},
87
+ author={Andreas Steiner and André Susano Pinto and Michael Tschannen and Daniel Keysers and Xiao Wang and Yonatan Bitton and Alexey Gritsenko and Matthias Minderer and Anthony Sherbondy and Shangbang Long and Siyang Qin and Reeve Ingle and Emanuele Bugliarello and Sahar Kazemzadeh and Thomas Mesnard and Ibrahim Alabdulmohsin and Lucas Beyer and Xiaohua Zhai},
88
+ year={2024},
89
+ journal={arXiv preprint arXiv:2412.03555}
90
+ }
91
+ ```
92
+
93
+ ### Model data
94
+
95
+ #### Pre-train datasets
96
+
97
+ PaliGemma 2 is pre-trained on the following mixture of datasets:
98
+
99
+ * **WebLI:** [WebLI (Web Language Image)](https://arxiv.org/abs/2209.06794) is
100
+ a web-scale multilingual image-text dataset built from the public web. A
101
+ wide range of WebLI splits are used to acquire versatile model capabilities,
102
+ such as visual semantic understanding, object localization,
103
+ visually-situated text understanding, and multilinguality.
104
+ * **CC3M-35L:** Curated English image-alt_text pairs from webpages
105
+ ([Sharma et al., 2018](https://aclanthology.org/P18-1238/)). We used the
106
+ [Google Cloud Translation API](https://cloud.google.com/translate) to
107
+ translate into 34 additional languages.
108
+ * **VQ²A-CC3M-35L/VQG-CC3M-35L:** A subset of VQ²A-CC3M
109
+ ([Changpinyo et al., 2022a](https://aclanthology.org/2022.naacl-main.142/)),
110
+ translated into the same additional 34 languages as CC3M-35L, using the
111
+ [Google Cloud Translation API](https://cloud.google.com/translate).
112
+ * **OpenImages:** Detection and object-aware questions and answers
113
+ ([Piergiovanni et al. 2022](https://arxiv.org/abs/2209.04372)) generated by
114
+ handcrafted rules on the [OpenImages dataset].
115
+ * **WIT:** Images and texts collected from Wikipedia
116
+ ([Srinivasan et al., 2021](https://arxiv.org/abs/2103.01913)).
117
+
118
+ [OpenImages dataset]: https://storage.googleapis.com/openimages/web/factsfigures_v7.html
119
+ PaliGemma 2 is based on Gemma 2, and you can find information on the
120
+ pre-training datasets for Gemma 2 in the
121
+ [Gemma 2 model card](https://ai.google.dev/gemma/docs/model_card_2).
122
+
123
+ #### Data responsibility filtering
124
+
125
+ The following filters are applied to WebLI, with the goal of training PaliGemma
126
+ 2 on safe and responsible data:
127
+
128
+ * **Pornographic image filtering:** This filter removes images deemed to be of
129
+ pornographic nature.
130
+ * **Text safety filtering:** We identify and filter out images that are paired
131
+ with unsafe text. Unsafe text is any text deemed to contain or be about
132
+ child sexual abuse imagery (CSAI), pornography, vulgarities, or is otherwise
133
+ offensive.
134
+ * **Text toxicity filtering:** We further use the [Perspective
135
+ API](https://perspectiveapi.com/) to identify and filter out images that are
136
+ paired with text deemed insulting, obscene, hateful or otherwise toxic.
137
+ * **Text personal information filtering:** We filtered certain personal
138
+ information and other sensitive data using the [Cloud Data Loss Prevention
139
+ (DLP) API](https://cloud.google.com/security/products/dlp) to protect the
140
+ privacy of individuals. Identifiers such as social security numbers and
141
+ [other sensitive information types] were removed.
142
+ * **Additional methods:** Filtering based on content quality and safety in
143
+ line with our policies and practices.
144
+
145
+ [other sensitive information types]: https://cloud.google.com/sensitive-data-protection/docs/high-sensitivity-infotypes-reference
146
+
147
+ ## Implementation information
148
+
149
+ ### Hardware
150
+
151
+ PaliGemma 2 was trained using the latest generation of Tensor Processing Unit
152
+ (TPU) hardware (TPUv5e).
153
+
154
+ ### Software
155
+
156
+ Training was completed using [JAX](https://github.com/google/jax),
157
+ [Flax](https://github.com/google/flax),
158
+ [TFDS](https://github.com/tensorflow/datasets) and
159
+ [`big_vision`](https://github.com/google-research/big_vision).
160
+
161
+ JAX allows researchers to take advantage of the latest generation of hardware,
162
+ including TPUs, for faster and more efficient training of large models.
163
+
164
+ TFDS is used to access datasets and Flax is used for model architecture. The
165
+ PaliGemma 2 fine-tune code and inference code are released in the `big_vision`
166
+ GitHub repository.
167
+
168
+ ## Evaluation information
169
+
170
+ ### Benchmark results
171
+
172
+ In order to verify the transferability of PaliGemma 2 to a wide variety of
173
+ academic tasks, we fine-tune the pretrained models on each task. We report results on
174
+ different resolutions to provide an impression of which tasks benefit from
175
+ increased resolution. Importantly, none of these tasks or datasets are part of
176
+ the pretraining data mixture, and their images are explicitly removed from the
177
+ web-scale pre-training data.
178
+
179
+ #### PaliGemma 2 results by model resolution and size
180
+
181
+ | Benchmark | 224-3B | 224-10B | 224-28B | 448-3B | 448-10B | 448-28B |
182
+ |-------------------------------|:------:|:-------:|:-------:|:------:|:-------:|:-------:|
183
+ | [AI2D][ai2d] | 74.7 | 83.1 | 83.2 | 76.0 | 84.4 | 84.6 |
184
+ | [AOKVQA-DA][aokvqa-da] (val) | 64.2 | 68.9 | 70.2 | 67.9 | 70.8 | 71.2 |
185
+ | [AOKVQA-MC][aokvqa-mc] (val) | 79.7 | 83.7 | 84.7 | 82.5 | 85.9 | 87.0 |
186
+ | [ActivityNet-CAP][anet-cap] | 34.2 | 35.9 | - | - | - | - |
187
+ | [ActivityNet-QA][anet-qa] | 51.3 | 53.2 | - | - | - | - |
188
+ | [COCO-35L][coco-35l] (avg34) | 113.9 | 115.8 | 116.5 | 115.8 | 117.2 | 117.2 |
189
+ | [COCO-35L][coco-35l] (en) | 138.4 | 140.8 | 142.4 | 140.4 | 142.4 | 142.3 |
190
+ | [COCOcap][coco-cap] | 141.3 | 143.7 | 144.0 | 143.4 | 145.0 | 145.2 |
191
+ | [ChartQA][chartqa] (aug) | 74.4 | 74.2 | 68.9 | 89.2 | 90.1 | 85.1 |
192
+ | [ChartQA][chartqa] (human) | 42.0 | 48.4 | 46.8 | 54.0 | 66.4 | 61.3 |
193
+ | [CountBenchQA][countbenchqa] | 81.0 | 84.0 | 86.4 | 82.0 | 85.3 | 87.4 |
194
+ | [DocVQA][docvqa] (val) | 39.9 | 43.9 | 44.9 | 73.6 | 76.6 | 76.1 |
195
+ | [GQA][gqa] | 66.2 | 67.2 | 67.3 | 68.1 | 68.3 | 68.3 |
196
+ | [InfoVQA][info-vqa] (val) | 25.2 | 33.6 | 36.4 | 37.5 | 47.8 | 46.7 |
197
+ | [MARVL][marvl] (avg5) | 83.5 | 89.5 | 90.6 | 82.7 | 89.1 | 89.7 |
198
+ | [MSRVTT-CAP][msrvtt] | 68.5 | 72.1 | - | - | - | - |
199
+ | [MSRVTT-QA][msrvtt] | 50.5 | 51.9 | - | - | - | - |
200
+ | [MSVD-QA][msvd-qa] | 61.1 | 62.5 | - | - | - | - |
201
+ | [NLVR2][nlvr2] | 91.4 | 93.9 | 94.2 | 91.6 | 93.7 | 94.1 |
202
+ | [NoCaps][nocaps] | 123.1 | 126.3 | 127.1 | 123.5 | 126.9 | 127.0 |
203
+ | [OCR-VQA][ocr-vqa] | 73.4 | 74.7 | 75.3 | 75.7 | 76.3 | 76.6 |
204
+ | [OKVQA][okvqa] | 64.2 | 68.0 | 71.2 | 64.1 | 68.6 | 70.6 |
205
+ | [RSVQA-hr][rsvqa-hr] (test) | 92.7 | 92.6 | 92.7 | 92.8 | 92.8 | 92.8 |
206
+ | [RSVQA-hr][rsvqa-hr] (test2) | 90.9 | 90.8 | 90.9 | 90.7 | 90.7 | 90.8 |
207
+ | [RSVQA-lr][rsvqa-lr] | 93.0 | 92.8 | 93.5 | 92.7 | 93.1 | 93.7 |
208
+ | [RefCOCO][refcoco] (testA) | 75.7 | 77.2 | 76.8 | 78.6 | 79.7 | 79.3 |
209
+ | [RefCOCO][refcoco] (testB) | 71.0 | 74.2 | 73.9 | 73.5 | 76.2 | 74.8 |
210
+ | [RefCOCO][refcoco] (val) | 73.4 | 75.9 | 75.0 | 76.3 | 78.2 | 77.3 |
211
+ | [RefCOCO+][refcoco+] (testA) | 72.7 | 74.7 | 73.6 | 76.1 | 77.7 | 76.6 |
212
+ | [RefCOCO+][refcoco+] (testB) | 64.2 | 68.4 | 67.1 | 67.0 | 71.1 | 68.6 |
213
+ | [RefCOCO+][refcoco+] (val) | 68.6 | 72.0 | 70.3 | 72.1 | 74.4 | 72.8 |
214
+ | [RefCOCOg][refcocog] (test) | 69.0 | 71.9 | 70.7 | 72.7 | 74.8 | 73.7 |
215
+ | [RefCOCOg][refcocog] (val) | 68.3 | 71.4 | 70.5 | 72.3 | 74.4 | 73.0 |
216
+ | [ST-VQA][st-vqa] (val) | 61.9 | 64.3 | 65.1 | 80.5 | 82.0 | 81.8 |
217
+ | [SciCap][scicap] | 165.1 | 159.5 | 156.9 | 183.3 | 177.2 | 172.7 |
218
+ | [ScienceQA][scienceqa] | 96.1 | 98.2 | 98.2 | 96.2 | 98.5 | 98.6 |
219
+ | [Screen2Words][screen2words] | 113.3 | 117.8 | 122.8 | 114.0 | 119.1 | 123.4 |
220
+ | [TallyQA][tallyqa] (complex) | 70.3 | 73.4 | 74.2 | 73.6 | 76.7 | 76.8 |
221
+ | [TallyQA][tallyqa] (simple) | 81.8 | 83.2 | 83.4 | 85.3 | 86.2 | 85.7 |
222
+ | [TextCaps][textcaps] | 127.5 | 137.9 | 139.9 | 152.1 | 157.7 | 153.6 |
223
+ | [TextVQA][textvqa] (val) | 59.6 | 64.0 | 64.7 | 75.2 | 76.6 | 76.2 |
224
+ | [VATEX][vatex] | 80.8 | 82.7 | - | - | - | - |
225
+ | [VQAv2][vqav2] (minival) | 83.0 | 84.3 | 84.5 | 84.8 | 85.8 | 85.8 |
226
+ | [VizWizVQA][vizwiz-vqa] (val) | 76.4 | 78.1 | 78.7 | 77.5 | 78.6 | 78.9 |
227
+ | [WidgetCap][widgetcap] | 138.1 | 139.8 | 138.8 | 151.4 | 151.9 | 148.9 |
228
+ | [XM3600][xm3600] (avg35) | 42.8 | 44.5 | 45.2 | 43.2 | 44.6 | 45.2 |
229
+ | [XM3600][xm3600] (en) | 79.8 | 80.7 | 81.0 | 80.3 | 81.5 | 81.0 |
230
+ | [xGQA][xgqa] (avg7) | 58.6 | 61.4 | 61.1 | 60.4 | 62.6 | 62.1 |
231
+
232
+
233
+ #### Additional Benchmarks
234
+
235
+ **[ICDAR 2015 Incidental][icdar2015-inc]**
236
+
237
+ | Model | Precision | Recall | F1 |
238
+ |-----------------|-----------|:------:|:-----:|
239
+ | PaliGemma 2 3B | 81.88 | 70.73 | 75.9 |
240
+
241
+ **[Total-Text][total-text]**
242
+
243
+ | Model | Precision | Recall | F1 |
244
+ |-----------------|-----------|:------:|:-----:|
245
+ | PaliGemma 2 3B | 73.8 | 74.54 | 74.17 |
246
+
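The F1 values in the two text-detection tables above are consistent with the usual harmonic mean of precision and recall. The card does not state the formula, so the check below assumes that standard definition; `f1_score` is a hypothetical helper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the tables above.
print(f"{f1_score(81.88, 70.73):.2f}")  # ICDAR 2015 Incidental: 75.90
print(f"{f1_score(73.8, 74.54):.2f}")   # Total-Text: 74.17
```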
247
+ **[FinTabNet][fintabnet]**
248
+
249
+ | Model | S-TEDS | TEDS | GriTS-Top | GriTS-Con |
250
+ |-----------------|--------|-------|-----------|-----------|
251
+ | PaliGemma 2 3B | 99.18 | 98.94 | 99.43 | 99.21 |
252
+
253
+ **[PubTabNet][pubtabnet]**
254
+
255
+ | Model | S-TEDS | TEDS | GriTS-Top | GriTS-Con |
256
+ |-----------------|--------|-------|-----------|-----------|
257
+ | PaliGemma 2 3B | 97.6 | 97.31 | 97.99 | 97.84 |
258
+
259
+ **[GrandStaff][grandstaff]**
260
+
261
+ | Model | CER | LER | SER |
262
+ |-----------------|-----|-----|-----|
263
+ | PaliGemma 2 3B | 1.6 | 6.7 | 2.3 |
264
+
265
+ **[PubChem][pubchem]**
266
+
267
+ * PaliGemma 2 3B, Full Match: 94.8
268
+
269
+ **[DOCCI][docci]**
270
+
271
+ | Model | avg#char | avg#sent | NES % |
272
+ |-----------------|----------|----------|---------|
273
+ | PaliGemma 2 3B | 529 | 7.74 | 28.42 |
274
+ | PaliGemma 2 10B | 521 | 7.45 | 20.27 |
275
+
276
+ - *avg#char*: Average number of characters
277
+ - *avg#sent*: Average number of sentences
278
+ - *NES*: Non-entailment sentences
279
+
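The two length metrics can be illustrated with a short sketch. The sentence splitter here is a naive assumption (split after `.`, `!`, or `?` followed by whitespace) and may differ from the one used in the actual evaluation. NES is not covered, since it requires an entailment model.

```python
import re


def caption_stats(captions: list[str]) -> tuple[float, float]:
    """Return (average characters, average sentences) per caption."""
    if not captions:
        raise ValueError("captions must not be empty")
    sentence_counts = [
        # Naive split: sentence boundary = end punctuation + whitespace.
        len([s for s in re.split(r"(?<=[.!?])\s+", c.strip()) if s])
        for c in captions
    ]
    avg_chars = sum(len(c) for c in captions) / len(captions)
    avg_sents = sum(sentence_counts) / len(captions)
    return avg_chars, avg_sents


# Toy captions for illustration only, not model outputs.
print(caption_stats(["A dog runs. It is brown.", "A cat sleeps."]))
```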
280
+ **[MIMIC-CXR][mimic-cxr]**
281
+
282
+ | Model | CIDEr | BLEU4 | Rouge-L | RadGraph F1 |
283
+ |-----------------|-------|-------|---------|-------------|
284
+ | PaliGemma 2 3B | 19.9% | 14.6% | 31.92% | 28.8% |
285
+ | PaliGemma 2 10B | 17.4% | 15% | 32.41% | 29.5% |
286
+
287
+ **[Visual Spatial Reasoning][vsr]**
288
+
289
+ | Model | VSR zeroshot split (test) | VSR random split (test) |
290
+ |-----------------|---------------------------|--------------------------|
291
+ | PaliGemma 2 3B | 0.75 | 0.82 |
292
+ | PaliGemma 2 10B | 0.80 | 0.87 |
293
+
294
+ ## Ethics and safety
295
+
296
+ ### Evaluation approach
297
+
298
+ Our evaluation methods include structured ethics and safety evaluations across
299
+ relevant content policies, including:
300
+
301
+ * Human evaluation on prompts covering child safety, content safety and
302
+ representational harms. See the [Gemma model
303
+ card](https://ai.google.dev/gemma/docs/model_card#evaluation_approach) for
304
+ more details on evaluation approach, but with image captioning and visual
305
+ question answering setups.
306
+ * Image-to-Text benchmark evaluation: Benchmark against relevant academic
307
+ datasets such as the FairFace dataset ([Karkkainen et al.,
308
+ 2021](https://arxiv.org/abs/1908.04913)).
309
+
310
+ ### Evaluation results
311
+
312
+ * The human evaluation results of ethics and safety evaluations are within
313
+ acceptable thresholds for meeting [internal
314
+ policies](https://storage.googleapis.com/gweb-uniblog-publish-prod/documents/2023_Google_AI_Principles_Progress_Update.pdf#page=11)
315
+ for categories such as child safety, content safety and representational
316
+ harms.
317
+ * On top of robust internal evaluations, we also use the Perspective API
318
+ (threshold of 0.8) to measure toxicity, profanity, and other potential
319
+ issues in the generated captions for images sourced from the FairFace
320
+ dataset. We report the maximum and median values observed across subgroups
321
+ for each of the perceived gender, ethnicity, and age attributes.
322
+
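One plausible reading of this procedure is sketched below: for each subgroup, compute the share of captions whose Perspective API score reaches the threshold, then report the maximum and median of those shares across subgroups. This is an assumption about the aggregation, not the authors' evaluation code. The helper names, the `>=` comparison, and the toy scores are hypothetical, and no Perspective API call is made.

```python
from statistics import median

THRESHOLD = 0.8  # threshold stated in the model card


def flagged_rate(scores: list[float], threshold: float = THRESHOLD) -> float:
    """Fraction of captions whose score is at or above the threshold."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(s >= threshold for s in scores) / len(scores)


def summarize(scores_by_subgroup: dict[str, list[float]]) -> tuple[float, float]:
    """Return (maximum, median) flagged rate across subgroups."""
    rates = [flagged_rate(s) for s in scores_by_subgroup.values()]
    return max(rates), median(rates)


# Toy scores for three hypothetical subgroups of one attribute.
toy = {
    "group_a": [0.10, 0.90, 0.20, 0.30],
    "group_b": [0.10, 0.20, 0.30, 0.40],
    "group_c": [0.85, 0.95, 0.10, 0.20],
}
print(summarize(toy))  # (0.5, 0.25)
```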
323
+ <table>
324
+ <tr>
325
+ <col>
326
+ <colgroup span="3"></colgroup>
327
+ <colgroup span="3"></colgroup>
328
+ <colgroup span="3"></colgroup>
329
+ <th>Metric</th>
330
+ <th colspan="3" scope="colgroup">Perceived gender</th>
331
+ <th colspan="3" scope="colgroup">Ethnicity</th>
332
+ <th colspan="3" scope="colgroup">Age group</th>
333
+ </tr>
334
+ <tr>
335
+ <th>Model size</th>
336
+ <th scope="col">3B</th>
337
+ <th scope="col">10B</th>
338
+ <th scope="col">28B</th>
339
+ <th scope="col">3B</th>
340
+ <th scope="col">10B</th>
341
+ <th scope="col">28B</th>
342
+ <th scope="col">3B</th>
343
+ <th scope="col">10B</th>
344
+ <th scope="col">28B</th>
345
+ </tr>
346
+ <tr>
347
+ <th></th>
348
+ <th colspan="9" scope="colgroup">Maximum</th>
349
+ </tr>
350
+ <tr>
351
+ <td>Toxicity</td>
352
+ <td>0.14%</td>
353
+ <td>0.15%</td>
354
+ <td>0.19%</td>
355
+ <td>0.29%</td>
356
+ <td>0.39%</td>
357
+ <td>0.39%</td>
358
+ <td>0.26%</td>
359
+ <td>0.18%</td>
360
+ <td>0.32%</td>
361
+ </tr>
362
+ <tr>
363
+ <td>Identity Attack</td>
364
+ <td>0.04%</td>
365
+ <td>0.02%</td>
366
+ <td>0.02%</td>
367
+ <td>0.13%</td>
368
+ <td>0.06%</td>
369
+ <td>0.06%</td>
370
+ <td>0.06%</td>
371
+ <td>0.03%</td>
372
+ <td>0.06%</td>
373
+ </tr>
374
+ <tr>
375
+ <td>Insult</td>
376
+ <td>0.17%</td>
377
+ <td>0.25%</td>
378
+ <td>0.17%</td>
379
+ <td>0.37%</td>
380
+ <td>0.52%</td>
381
+ <td>0.52%</td>
382
+ <td>0.27%</td>
383
+ <td>0.39%</td>
384
+ <td>0.24%</td>
385
+ </tr>
386
+ <tr>
387
+ <td>Threat</td>
388
+ <td>0.55%</td>
389
+ <td>0.43%</td>
390
+ <td>0.57%</td>
391
+ <td>0.83%</td>
392
+ <td>0.48%</td>
393
+ <td>0.48%</td>
394
+ <td>0.64%</td>
395
+ <td>0.43%</td>
396
+ <td>0.64%</td>
397
+ </tr>
398
+ <tr>
399
+ <td>Profanity</td>
400
+ <td>0.00%</td>
401
+ <td>0.00%</td>
402
+ <td>0.00%</td>
403
+ <td>0.00%</td>
404
+ <td>0.00%</td>
405
+ <td>0.00%</td>
406
+ <td>0.00%</td>
407
+ <td>0.00%</td>
408
+ <td>0.00%</td>
409
+ </tr>
410
+ <tr>
411
+ <th></th>
412
+ <th colspan="9" scope="colgroup">Median</th>
413
+ </tr>
414
+ <tr>
415
+ <td>Toxicity</td>
416
+ <td>0.13%</td>
417
+ <td>0.10%</td>
418
+ <td>0.18%</td>
419
+ <td>0.07%</td>
420
+ <td>0.07%</td>
421
+ <td>0.14%</td>
422
+ <td>0.12%</td>
423
+ <td>0.08%</td>
424
+ <td>0.12%</td>
425
+ </tr>
426
+ <tr>
427
+ <td>Identity Attack</td>
428
+ <td>0.02%</td>
429
+ <td>0.01%</td>
430
+ <td>0.02%</td>
431
+ <td>0.00%</td>
432
+ <td>0.00%</td>
433
+ <td>0.00%</td>
434
+ <td>0.00%</td>
435
+ <td>0.00%</td>
436
+ <td>0.00%</td>
437
+ </tr>
438
+ <tr>
439
+ <td>Insult</td>
440
+ <td>0.15%</td>
441
+ <td>0.23%</td>
442
+ <td>0.14%</td>
443
+ <td>0.14%</td>
444
+ <td>0.17%</td>
445
+ <td>0.13%</td>
446
+ <td>0.09%</td>
447
+ <td>0.18%</td>
448
+ <td>0.16%</td>
449
+ </tr>
450
+ <tr>
451
+ <td>Threat</td>
452
+ <td>0.35%</td>
453
+ <td>0.27%</td>
454
+ <td>0.41%</td>
455
+ <td>0.28%</td>
456
+ <td>0.19%</td>
457
+ <td>0.42%</td>
458
+ <td>0.27%</td>
459
+ <td>0.31%</td>
460
+ <td>0.40%</td>
461
+ </tr>
462
+ <tr>
463
+ <td>Profanity</td>
464
+ <td>0.00%</td>
465
+ <td>0.00%</td>
466
+ <td>0.00%</td>
467
+ <td>0.00%</td>
468
+ <td>0.00%</td>
469
+ <td>0.00%</td>
470
+ <td>0.00%</td>
471
+ <td>0.00%</td>
472
+ <td>0.00%</td>
473
+ </tr>
474
+ </table>
475
+
476
+ ## Usage and limitations
477
+
478
+ ### Intended usage
479
+
480
+ Open Vision Language Models (VLMs) have a wide range of applications across
481
+ various industries and domains. The following list of potential uses is not
482
+ comprehensive. The purpose of this list is to provide contextual information
483
+ about the possible use-cases that the model creators considered as part of model
484
+ training and development. Prohibited uses of Gemma models are outlined in the
485
+ [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy).
486
+
487
+ Fine-tune on specific vision-language task:
488
+
489
+ * The pre-trained models can be fine-tuned on a wide range of vision-language
490
+ tasks such as: image captioning, short video captioning, visual question
491
+ answering, text reading, object detection and object segmentation.
492
+ * The pre-trained models can be fine-tuned for specific domains such as remote
493
+ sensing question answering, visual questions from people who are blind,
494
+ science question answering, and describing UI element functionalities.
495
+ * The pre-trained models can be fine-tuned for tasks with non-textual outputs
496
+ such as bounding boxes or segmentation masks.
497
+
498
+ Vision-language research:
499
+
500
+ * The pre-trained models and fine-tuned models can serve as a foundation for
501
+ researchers to experiment with VLM techniques, develop algorithms, and
502
+ contribute to the advancement of the field.
503
+
504
+ ### Ethical considerations and risks
505
+
506
+ The development of vision-language models (VLMs) raises several ethical
507
+ concerns. In creating an open model, we have carefully considered the following:
508
+
509
+ * Bias and Fairness
510
+ * VLMs trained on large-scale, real-world image-text data can reflect
511
+ socio-cultural biases embedded in the training material. These models
512
+ underwent careful scrutiny; the input data pre-processing is described and
513
+ the posterior evaluations are reported in this card.
514
+ * Misinformation and Misuse
515
+ * VLMs can be misused to generate text that is false, misleading, or
516
+ harmful.
517
+ * Guidelines are provided for responsible use with the model, see the
518
+ [Responsible Generative AI Toolkit](https://ai.google.dev/responsible).
519
+ * Transparency and Accountability
520
+ * This model card summarizes details on the models' architecture,
521
+ capabilities, limitations, and evaluation processes.
522
+ * A responsibly developed open model offers the opportunity to share
523
+ innovation by making VLM technology accessible to developers and
524
+ researchers across the AI ecosystem.
525
+
526
+ Risks identified and mitigations:
527
+
528
+ * **Perpetuation of biases:** Developers are encouraged to perform continuous
529
+ monitoring (using evaluation metrics and human review) and to explore de-biasing
530
+ techniques during model training, fine-tuning, and other use cases.
531
+ * **Generation of harmful content:** Mechanisms and guidelines for content
532
+ safety are essential. Developers are encouraged to exercise caution and
533
+ implement appropriate content safety safeguards based on their specific
534
+ product policies and application use cases.
535
+ * **Misuse for malicious purposes:** Technical limitations and developer and
536
+ end-user education can help mitigate malicious applications of VLMs.
537
+ Educational resources and reporting mechanisms for users to flag misuse are
538
+ provided: see the [Responsible Generative AI Toolkit](https://ai.google.dev/responsible).
539
+ Prohibited uses of Gemma models are outlined in the
540
+ [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy).
541
+ * **Privacy violations:** Models were trained on data filtered to remove
542
+ certain personal information and sensitive data. Developers are encouraged
543
+ to adhere to privacy regulations with privacy-preserving techniques.
544
+
545
+ ### Limitations
546
+
547
+ * Most limitations inherited from the underlying Gemma 2 models still apply:
548
+ * VLMs are better at tasks that can be framed with clear prompts and
549
+ instructions. Open-ended or highly complex tasks might be challenging.
550
+ * Natural language is inherently complex. VLMs might struggle to grasp
551
+ subtle nuances, sarcasm, or figurative language.
552
+ * VLMs generate responses based on information they learned from their
553
+ training datasets, but they are not knowledge bases. They may generate
554
+ incorrect or outdated factual statements.
555
+ * VLMs rely on statistical patterns in language and images. They might
556
+ lack the ability to apply common sense reasoning in certain situations.
557
+ * PaliGemma 2 was designed first and foremost to serve as a general
558
+ pre-trained model for fine-tuning to specialized tasks. Hence, its "out of
559
+ the box" or "zero-shot" performance might lag behind models designed
560
+ specifically for general purpose use.
561
+ * PaliGemma 2 is not a multi-turn chatbot. It is designed for a single round
562
+ of image and text input.
563
+
564
+
565
+ [ai2d]: https://allenai.org/data/diagrams
566
+ [aokvqa-da]: https://allenai.org/project/a-okvqa/home
567
+ [aokvqa-mc]: https://allenai.org/project/a-okvqa/home
568
+ [anet-cap]: https://paperswithcode.com/dataset/activitynet-captions
569
+ [anet-qa]: https://arxiv.org/abs/1906.02467
570
+ [chartqa]: https://arxiv.org/abs/2203.10244
571
+ [coco-35l]: https://arxiv.org/pdf/2205.12522
572
+ [coco-cap]: https://cocodataset.org/#home
573
+ [countbenchqa]: https://github.com/google-research/big_vision/blob/main/big_vision/datasets/countbenchqa/
574
+ [docvqa]: https://www.docvqa.org/
575
+ [gqa]: https://cs.stanford.edu/people/dorarad/gqa/about.html
576
+ [info-vqa]: https://arxiv.org/abs/2104.12756
577
+ [marvl]: https://marvl-challenge.github.io/
578
+ [msrvtt]: https://paperswithcode.com/dataset/msr-vtt
579
+ [msvd-qa]: https://paperswithcode.com/dataset/msvd-qa
580
+ [nlvr2]: https://lil.nlp.cornell.edu/nlvr/
581
+ [nocaps]: https://nocaps.org/
582
+ [ocr-vqa]: https://ocr-vqa.github.io/
583
+ [okvqa]: https://okvqa.allenai.org/
584
+ [refcoco]: https://arxiv.org/abs/1608.00272
585
+ [refcoco+]: https://aclanthology.org/D14-1086
586
+ [refcocog]: https://arxiv.org/abs/1511.02283
587
+ [rsvqa-hr]: https://zenodo.org/records/6344367
588
+ [rsvqa-lr]: https://zenodo.org/records/6344334
589
+ [st-vqa]: https://arxiv.org/abs/1905.13648
590
+ [scicap]: https://arxiv.org/abs/2110.11624
591
+ [scienceqa]: https://scienceqa.github.io/
592
+ [screen2words]: https://arxiv.org/abs/2108.03353
593
+ [tallyqa]: https://arxiv.org/abs/1810.12440
594
+ [textcaps]: https://textvqa.org/textcaps/
595
+ [textvqa]: https://textvqa.org/
596
+ [vatex]: https://arxiv.org/abs/1904.03493
597
+ [vizwiz-vqa]: https://vizwiz.org/tasks-and-datasets/vqa/
598
+ [widgetcap]: https://arxiv.org/abs/2010.04295
599
+ [vqav2]: https://visualqa.org/index.html
600
+ [xgqa]: https://aclanthology.org/2022.findings-acl.196/
601
+ [xm3600]: https://arxiv.org/pdf/2205.12522
602
+
603
+ [icdar2015-inc]: https://arxiv.org/abs/1511.09207
604
+ [total-text]: https://paperswithcode.com/paper/total-text-a-comprehensive-dataset-for-scene
605
+ [fintabnet]: https://developer.ibm.com/data/fintabnet/
606
+ [pubtabnet]: https://paperswithcode.com/dataset/pubtabnet
607
+ [grandstaff]: https://link.springer.com/article/10.1007/s10032-023-00432-z
608
+ [pubchem]: https://pmc.ncbi.nlm.nih.gov/articles/PMC7352161/
609
+ [docci]: https://research.google/pubs/docci-descriptions-of-connected-and-contrasting-images/
610
+ [mimic-cxr]: https://paperswithcode.com/dataset/mimic-cxr
611
+ [vsr]: https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00566/116470/Visual-Spatial-Reasoning