---
language: en
tags:
- multimodal
- text
- image
license: other
datasets:
- HuggingFaceM4/OBELICS
- wikipedia
- facebook/pmd
- laion/laion2B-en
---


TODO: logo?

# IDEFICS

IDEFICS (**I**mage-aware **D**ecoder **E**nhanced à la **F**lamingo with **I**nterleaved **C**ross-attention**S**) is an open-access reproduction of [Flamingo](https://huggingface.co/papers/2204.14198), a closed-source visual language model developed by DeepMind. Like GPT-4, the multimodal model accepts arbitrary sequences of image and text inputs and produces text outputs. IDEFICS is built solely on publicly available data and models.

The model can answer questions about images, describe visual contents, create stories grounded on multiple images, or simply behave as a pure language model without visual inputs.

IDEFICS is on par with the original model on various image-text benchmarks, including visual question answering (open-ended and multiple choice), image captioning, and image classification when evaluated with in-context few-shot learning. It comes in two variants: a large [80 billion parameters](https://huggingface.co/HuggingFaceM4/idefics-80b) version and a [9 billion parameters](https://huggingface.co/HuggingFaceM4/idefics-9b) version.

We also fine-tune these base models on a mixture of supervised and instruction fine-tuning datasets, which boosts the downstream performance while making the models more usable in conversational settings: [idefics-80b-instruct](https://huggingface.co/HuggingFaceM4/idefics-80b-instruct) and [idefics-9b-instruct](https://huggingface.co/HuggingFaceM4/idefics-9b-instruct). As they reach higher performance, we recommend using these instructed versions first.

Read more about some of the technical challenges encountered during training IDEFICS [here](https://github.com/huggingface/m4-logs/blob/master/memos/README.md).

# Model Details

- **Developed by:** Hugging Face
- **Model type:** Multi-modal model (image+text)
- **Language(s) (NLP):** en
- **License:** other
- **Parent Model:** [laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and [huggyllama/llama-65b](https://huggingface.co/huggyllama/llama-65b)
- **Resources for more information:**
    - [GitHub Repo](https://github.com/huggingface/m4/)
    - Description of [OBELICS](https://huggingface.co/datasets/HuggingFaceM4/OBELICS): [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents](https://huggingface.co/papers/2306.16527)
    - Original Paper: [Flamingo: a Visual Language Model for Few-Shot Learning](https://huggingface.co/papers/2204.14198)

IDEFICS is a large multimodal English model that takes sequences of interleaved images and texts as inputs and generates text outputs.
The model shows strong in-context few-shot learning capabilities and is on par with the closed-source model. This makes IDEFICS a robust starting point to fine-tune multimodal models on custom data.

IDEFICS is built on top of two unimodal open-access pre-trained models to connect the two modalities. Newly initialized parameters in the form of Transformer blocks bridge the gap between the vision encoder and the language model. The model is trained on a mixture of image-text pairs and unstructured multimodal web documents.

IDEFICS-instruct is the model obtained by further training IDEFICS on Supervised Fine-Tuning and Instruction Fine-Tuning datasets. This improves downstream performance significantly (making [idefics-9b-instruct](https://huggingface.co/HuggingFaceM4/idefics-9b-instruct) a very strong model at its 9 billion scale), while making the model more suitable to converse with.

# Uses

The model can be used to perform inference on multimodal (image + text) tasks in which the input is composed of a text query/instruction along with one or multiple images. This model does not support image generation.

It is possible to fine-tune the base model on custom data for a specific use case. We note that the instruction-fine-tuned models are significantly better at following instructions from users and thus should be preferred when using the models out of the box.

The following screenshot is an example of interaction with the instructed model:

<img src="./assets/guarding_baguettes.png"  width="35%">


# How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
from transformers import IdeficsForVisionText2Text, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

checkpoint = "HuggingFaceM4/idefics-9b"
model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to(device)
processor = AutoProcessor.from_pretrained(checkpoint)

# We feed to the model an arbitrary sequence of text strings and images. Images can be either URLs or PIL Images.
prompts = [
    [
        "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
        "In this picture from Asterix and Obelix, we can see"
    ],
]

# --batched mode
inputs = processor(prompts, return_tensors="pt").to(device)
# --single sample mode
# inputs = processor(prompts[0], return_tensors="pt").to(device)

generated_ids = model.generate(**inputs, max_length=100)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
for i, t in enumerate(generated_text):
    print(f"{i}:\n{t}\n")
```

To quickly test your software without waiting for the huge model to download and load, you can use `HuggingFaceM4/tiny-random-idefics`. It hasn't been trained and has random weights, but it is very useful for quick testing.

# Training Details

We closely follow the training procedure laid out in [Flamingo](https://huggingface.co/papers/2204.14198). We combine two open-source pre-trained models ([laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and [huggyllama/llama-65b](https://huggingface.co/huggyllama/llama-65b)) by initializing new Transformer blocks. The pre-trained backbones are frozen while we train the newly initialized parameters.

The model is trained on the following data mixture of openly accessible English data:

| Data Source | Type of Data | Number of Tokens in Source | Number of Images in Source | Epochs | Effective Proportion in Number of Tokens |
|-------------|--------------|----------------------------|----------------------------|--------|------------------------------------------|
| [OBELICS](https://huggingface.co/datasets/HuggingFaceM4/OBELICS) | Unstructured Multimodal Web Documents | 114.9B | 353M | 1 | 73.85% |
| [Wikipedia](https://huggingface.co/datasets/wikipedia) | Unstructured Multimodal Web Documents | 3.192B | 39M | 3 | 6.15% |
| [LAION](https://huggingface.co/datasets/laion/laion2B-en) | Image-Text Pairs | 29.9B | 1.120B | 1 | 17.18% |
| [PMD](https://huggingface.co/datasets/facebook/pmd) | Image-Text Pairs | 1.6B | 70M | 3 | 2.82% |

**OBELICS** is an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images. An interactive visualization of the dataset content is available [here](https://atlas.nomic.ai/map/f2fba2aa-3647-4f49-a0f3-9347daeee499/ee4a84bd-f125-4bcc-a683-1b4e231cb10f).

**Wikipedia**. We used the English dump of Wikipedia created on February 20th, 2023.

**LAION** is a collection of image-text pairs collected from Common Crawl web pages, where the text of each pair is the alternative text of the image. We deduplicated it (following [Webster et al., 2023](https://arxiv.org/abs/2303.12733)), filtered it, and removed the opted-out images using the [Spawning API](https://api.spawning.ai/spawning-api).

**PMD** is a collection of publicly available image-text pair datasets. The dataset contains pairs from Conceptual Captions, Conceptual Captions 12M, WIT, Localized Narratives, RedCaps, COCO, SBU Captions, Visual Genome and a subset of the YFCC100M dataset. Due to a server failure at the time of pre-processing, we did not include SBU Captions.

For multimodal web documents, we feed the model sequences corresponding to the succession of text paragraphs and images. For image-text pairs, we form the training sequences by packing images with their captions. The images are encoded with the vision encoder and vision hidden states are pooled with Transformer Perceiver blocks and then fused into the text sequence through the cross-attention blocks.
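
The data flow described above can be illustrated with a minimal NumPy sketch. It is not the IDEFICS implementation: it uses a single attention head, omits all learned projections, gating and normalization, and the hidden width and patch count are illustrative assumptions. Only the number of latents (64) is taken from the hyperparameter table.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys_values):
    """Single-head scaled dot-product attention without learned projections."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores) @ keys_values

rng = np.random.default_rng(0)
hidden = 32            # illustrative width, not the real model size
num_patches = 257      # assumed number of vision tokens for one image
num_latents = 64       # "Number of Latents" in the hyperparameter table
num_text_tokens = 10

vision_states = rng.normal(size=(num_patches, hidden))    # vision encoder output
latents = rng.normal(size=(num_latents, hidden))          # learned in practice, random here
text_states = rng.normal(size=(num_text_tokens, hidden))  # language model hidden states

# Perceiver-style pooling: a fixed number of latents attend to all vision tokens.
pooled = attend(latents, vision_states)
# Cross-attention: text positions attend to the pooled visual tokens (residual add).
fused = text_states + attend(text_states, pooled)

print(pooled.shape, fused.shape)
```

Whatever the number of vision tokens, the pooled representation always has 64 rows, so the cost of the cross-attention step does not grow with image resolution.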

Following [Dehghani et al., 2023](https://huggingface.co/papers/2302.05442), we apply a layer normalization on the projected queries and keys of both the Perceiver and cross-attention blocks, which improved training stability in our early experiments. We use the [RMSNorm](https://huggingface.co/papers/1910.07467) implementation for trainable Layer Norms.
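
As a rough illustration of why normalizing queries and keys helps, the sketch below applies an RMSNorm to both before computing attention scores. It is a simplified stand-in, not the IDEFICS code: the function names are hypothetical, the norm weights are fixed to ones, and only the head dimension (96) comes from the hyperparameter table.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: rescale by the root mean square of the last axis (no mean subtraction)."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def qk_normalized_scores(q, k, q_weight, k_weight):
    """Attention scores computed from normalized queries and keys."""
    d = q.shape[-1]
    return rms_norm(q, q_weight) @ rms_norm(k, k_weight).T / np.sqrt(d)

rng = np.random.default_rng(0)
head_dim = 96
# Deliberately large activations, to mimic the regime in which raw logits blow up.
q = rng.normal(size=(4, head_dim)) * 100.0
k = rng.normal(size=(6, head_dim)) * 100.0
ones = np.ones(head_dim)

raw_scores = q @ k.T / np.sqrt(head_dim)
normed_scores = qk_normalized_scores(q, k, ones, ones)

print(np.abs(raw_scores).max(), np.abs(normed_scores).max())
```

With unit weights, each normalized vector has norm close to `sqrt(head_dim)`, so the scores are bounded by `sqrt(head_dim)` regardless of how large the incoming activations are.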

The training objective is the standard next token prediction.
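
A minimal sketch of this objective for a single sequence follows. It assumes the auxiliary Z-loss takes the form of a weighted squared log-partition term, with the 1e-3 weight listed in the hyperparameter table; the function name and return signature are hypothetical, and masking of image or padding positions is left out.

```python
import numpy as np

def next_token_loss(logits, token_ids, z_loss_weight=1e-3):
    """Cross-entropy for next-token prediction plus an assumed z-loss term.

    logits: (seq_len, vocab_size) array, token_ids: (seq_len,) integer array.
    """
    pred = logits[:-1]        # position t predicts token t + 1
    targets = token_ids[1:]
    # Log of the softmax normalizer, computed stably.
    m = pred.max(axis=-1, keepdims=True)
    log_z = (m + np.log(np.exp(pred - m).sum(axis=-1, keepdims=True))).squeeze(-1)
    log_probs = pred[np.arange(len(targets)), targets] - log_z
    cross_entropy = -log_probs.mean()
    z_loss = z_loss_weight * np.mean(log_z ** 2)
    return cross_entropy + z_loss, cross_entropy, z_loss

# Uniform logits over a toy vocabulary of 8 tokens.
logits = np.zeros((5, 8))
token_ids = np.array([1, 2, 3, 4, 5])
total, cross_entropy, z_loss = next_token_loss(logits, token_ids)
print(total, cross_entropy, z_loss)
```

For uniform logits the cross-entropy equals `log(8)`, which is a convenient sanity check.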

We use the following hyperparameters and training parameters:

| Parameters | | IDEFICS | IDEFICS-9b |
| -- | -- | -- | -- |
| Perceiver Resampler | Number of Layers | 6 | 6 |
| | Number of Latents | 64 | 64 |
| | Number of Heads | 16 | 16 |
| | Resampler Head Dimension | 96 | 96 |
| Model | Language Model Backbone | [Llama-65b](https://huggingface.co/huggyllama/llama-65b) | [Llama-7b](https://huggingface.co/huggyllama/llama-7b) |
| | Vision Model Backbone | [laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) | [laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) |
| | Cross-Layer Interval | 4 | 4 |
| Training | Sequence Length | 1024 | 1024 |
| | Effective Batch Size (# of tokens) | 3.67M | 1.31M |
| | Max Training Steps | 200K | 200K |
| | Weight Decay | 0.1 | 0.1 |
| | Optimizer | Adam(0.9, 0.999) | Adam(0.9, 0.999) |
| | Gradient Clipping | 1.0 | 1.0 |
| | [Z-loss](https://huggingface.co/papers/2204.02311) weight | 1e-3 | 1e-3 |
| Learning Rate | Initial Max | 5e-5 | 1e-5 |
| | Initial Final | 3e-5 | 6e-6 |
| | Decay Schedule | Linear | Linear |
| | Linear warmup Steps | 2K | 2K |
| Large-scale Optimization | Gradient Checkpointing | True | True |
| | Precision | Mixed-precision bf16 | Mixed-precision bf16 |
| | ZeRO Optimization | Stage 3 | Stage 3 |
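
The learning-rate rows of the table can be read as the schedule sketched below. This is one interpretation, under two assumptions that the table does not state: the warmup starts from zero, and the linear decay reaches the final value exactly at the maximum training step. The defaults are the values from the IDEFICS column, and the function name is hypothetical.

```python
def learning_rate(step, max_lr=5e-5, final_lr=3e-5,
                  warmup_steps=2_000, max_steps=200_000):
    """Linear warmup to max_lr, then linear decay to final_lr (assumed shape)."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (max_steps - warmup_steps))
    return max_lr + (final_lr - max_lr) * progress

for step in (0, 1_000, 2_000, 101_000, 200_000):
    print(step, learning_rate(step))

# The effective batch size in tokens divided by the sequence length gives
# the approximate number of sequences per optimizer step.
sequences_per_step = round(3.67e6 / 1024)
print(sequences_per_step)
```

For IDEFICS-9b, the same function would be called with `max_lr=1e-5` and `final_lr=6e-6`.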


# Evaluation

We follow the evaluation protocol of Flamingo and evaluate IDEFICS on a suite of downstream image-text benchmarks ranging from visual question answering to image captioning.

We compare our model to the original Flamingo along with [OpenFlamingo](https://huggingface.co/openflamingo/OpenFlamingo-9B-vitl-mpt7b), another open-source reproduction.

We perform checkpoint selection based on validation sets of VQAv2, TextVQA, OKVQA, VizWiz, Visual Dialogue, COCO, Flickr30k, and HatefulMemes. We select the checkpoint at step 65,000 for IDEFICS-9B and at step 37,500 for IDEFICS. The models are evaluated with in-context few-shot learning where the priming instances are selected at random from a support set. We do not use any form of ensembling.
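
To make the few-shot setup concrete, here is a small sketch of how a prompt with randomly selected priming instances could be assembled. The `Question: ... Answer: ...` template, the function name, and the example URLs are hypothetical and are not the templates used in our evaluation. The output is an interleaved list of images and strings, the same structure the processor accepts in the getting-started example.

```python
import random

def build_few_shot_prompt(support_set, query_image, query_question, num_shots, seed=0):
    """Sample priming instances at random and interleave them before the query."""
    rng = random.Random(seed)
    shots = rng.sample(support_set, num_shots)  # sampling without replacement
    prompt = []
    for example in shots:
        prompt.append(example["image"])
        prompt.append(f"Question: {example['question']} Answer: {example['answer']}\n")
    # The query comes last and leaves the answer open for the model to complete.
    prompt.append(query_image)
    prompt.append(f"Question: {query_question} Answer:")
    return prompt

# Placeholder support set; the URLs are illustrative only.
support_set = [
    {"image": f"https://example.com/image_{i}.jpg",
     "question": f"What is shown in image {i}?",
     "answer": f"object {i}"}
    for i in range(10)
]

prompt = build_few_shot_prompt(
    support_set,
    query_image="https://example.com/query.jpg",
    query_question="What is shown in this image?",
    num_shots=4,
)
print(len(prompt))
```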

<img src="./assets/Figure_Evals_IDEFIX.png"  width="55%">

TODO: update this table

| Model      |   Shots |   VQAv2 (OE VQA acc) |   OKVQA (OE VQA acc) |   TextVQA (OE VQA acc) |   VizWiz (OE VQA acc) |   TextCaps (CIDEr) |   Coco (CIDEr) |   NoCaps (CIDEr) |   Flickr (CIDEr) |   ImageNet1k (accuracy) |   VisDial (NDCG) |   HatefulMemes (ROC AUC) |   ScienceQA (accuracy) |   RenderedSST2 (accuracy) |   Winoground (group (text/image)) |
|:-----------|--------:|---------------------:|---------------------:|-----------------------:|----------------------:|-------------------:|---------------:|-----------------:|-----------------:|------------------------:|-----------------:|-------------------------:|-----------------------:|--------------------------:|----------------------------------:|
| IDEFICS 80B |       0 |                 60.0 |                 45.2 |                   30.9 |                  36.0 |               56.8 |           91.8 |             65.0 |             53.7 |                    74.3 |             48.8 |                     60.6 |                   68.9 |                      60.5 |                               8.0 (18.8/22.5)|
|            |       4 |                 63.4 |                 52.3 |                   34.7 |                  45.8 |               77.9 |          109.3 |            101.1 |             68.9 |                    - |             48.6 |                     58.7 |                   66.3 |                      63.9 |                              - |
|            |       8 |                 64.5 |                 55.2 |                   35.4 |                  49.3 |               82.5 |          113.9 |            104.7 |             74.3 |                    - |             48.1 |                     57.8 |                   - |                      64.3 |                              - |
|            |      16 |                 65.4 |                 56.8 |                   36.3 |                  51.5 |               85.2 |          116.6 |            105.6 |             76.8 |                    - |             - |                     56.0 |                   - |                      66.9 |                              - |
|            |      32 |                 66.0 |                 58.0 |                   37.0 |                  52.6 |               86.1 |          116.5 |            106.3 |             78.9 |                    - |             - |                     54.3 |                   - |                      68.0 |                              - |
| IDEFICS 9B  |       0 |                 50.9 |                 38.4 |                   25.9 |                  35.5 |               25.4 |           46.0 |             36.8 |             27.3 |                    70.7 |             48.7 |                     51.7 |                   44.2 |                      61.8 |                               5.0 (16.8/20.8)|
|            |       4 |                 55.6 |                 45.8 |                   26.8 |                  42.0 |               60.8 |           88.9 |             78.4 |             52.2 |                    - |             48.1 |                     52.6 |                   41.6 |                      60.6 |                              - |
|            |       8 |                 56.4 |                 47.3 |                   26.8 |                  42.8 |               63.7 |           96.9 |             84.3 |             60.3 |                    - |             47.5 |                     52.3 |                   - |                      66.8 |                              - |
|            |      16 |                 57.2 |                 49.0 |                   28.1 |                  45.0 |               68.0 |           99.6 |             87.2 |             65.0 |                    - |             - |                     52.5 |                   - |                      66.0 |                              - |
|            |      32 |                 57.9 |                 50.4 |                   28.2 |                  45.9 |               69.7 |          101.5 |             88.6 |             66.0 |                    - |             - |                     53.1 |                   - |                      63.4 |                              - |

We also report results where the priming samples are selected to be similar (i.e. close in a vector space) to the queried instance.
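
Similarity-based selection can be sketched as a nearest-neighbor lookup over embeddings. The snippet below assumes cosine similarity and precomputed embedding vectors; which encoder produces the embeddings and the exact distance used are not specified here, and the function name is hypothetical.

```python
import numpy as np

def select_similar_shots(query_embedding, support_embeddings, num_shots):
    """Return indices of the support instances closest to the query (cosine similarity)."""
    q = query_embedding / np.linalg.norm(query_embedding)
    s = support_embeddings / np.linalg.norm(support_embeddings, axis=1, keepdims=True)
    similarities = s @ q
    # Sort by decreasing similarity and keep the top num_shots indices.
    return np.argsort(-similarities)[:num_shots]

# Toy 2-D embeddings for four support instances.
support_embeddings = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.9, 0.1],
    [-1.0, 0.0],
])
query_embedding = np.array([1.0, 0.0])

selected = select_similar_shots(query_embedding, support_embeddings, num_shots=2)
print(selected)
```

The selected indices can then replace the random sample when assembling the few-shot prompt.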

TODO: table with rices shots

# Technical Specifications

- **Hardware Type:** 64 nodes of 8x 80GB A100 GPUs, EFA network
- **Hours used:** ~672 node hours
- **Cloud Provider:** AWS SageMaker

## Hardware

The training was performed on an AWS SageMaker cluster with 64 nodes of 8x80GB A100 GPUs (512 GPUs total). The cluster uses the current EFA network which provides about 340GBps throughput.

## Software

The training software is built on top of Hugging Face Transformers and Accelerate, with DeepSpeed ZeRO-3 for training and [WebDataset](https://github.com/webdataset/webdataset) for data loading.


# Bias, Risks, and Limitations

Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)).
As a derivative of such a language model, IDEFICS can produce texts that include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
Moreover, IDEFICS can produce factually incorrect texts, and should not be relied on to produce factually accurate information.

Here are a few examples of outputs that could be categorized as factually incorrect, biased, or offensive:
TODO: give 4/5 representative examples

To measure IDEFICS's ability to recognize sociological (TODO: find a better adjective) attributes, we evaluate the model on FairFace...
TODO: include FairFace numbers

# License

The model is built on top of two pre-trained models: [laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and [huggyllama/llama-65b](https://huggingface.co/huggyllama/llama-65b). The first was released under an MIT license, while the second was released under a specific noncommercial license focused on research purposes. As such, users should comply with that license by applying directly to [Meta's form](https://docs.google.com/forms/d/e/1FAIpQLSfqNECQnMkycAp2jP4Z9TFX0cGR4uf7b_fBxjY_OjhJILlKGA/viewform).

We release the additional weights we trained under an MIT license.

# Citation

**BibTeX:**

```bibtex
@misc{laurençon2023obelisc,
      title={OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents},
      author={Hugo Laurençon and Lucile Saulnier and Léo Tronchon and Stas Bekman and Amanpreet Singh and Anton Lozhkov and Thomas Wang and Siddharth Karamcheti and Alexander M. Rush and Douwe Kiela and Matthieu Cord and Victor Sanh},
      year={2023},
      eprint={2306.16527},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}
```

# Model Card Authors

Victor, Stas, XXX

# Model Card Contact

Please open a discussion on the Community tab!