fix_details_model_card (#11)
Browse files- detail fixes (5edd82e6c8071cf4e18b5f390ec494165401ce8b)
- smalll fixes (258ef74eea4986fe296eaf65270e459e989a1ca2)
- add authors' names (8a431e388fb3a8c5d4a8c15e6752f0e3252f03dd)
- Merge commit 'refs/pr/11' of https://huggingface.co/HuggingFaceM4/idefics-80b into pr/11 (71dc0d079ff6b4e0a1d679db5e0d0b07a826274a)
- add colab link (27edbefdc6a43243806f6c78ad41a57b9ce99a2f)
- fix legend + display (7d92337cdbb979005914321b3997b66a21d57a7c)
- fix space (bb80664aa2a26e6af5ef5a7f1628bcf79c0a610d)
README.md
CHANGED
@@ -60,7 +60,7 @@ The following screenshot is an example of interaction with the instructed model:
|
|
60 |
|
61 |
# How to Get Started with the Model
|
62 |
|
63 |
-
This [tutorial](https://github.com/huggingface/notebooks/pull/418/) shows a simple example to fine-tune IDEFICS on custom data. This [colab notebook](
|
64 |
|
65 |
We provide quick-start code for both the base and the instruct models.
|
66 |
|
@@ -139,7 +139,7 @@ for i, t in enumerate(generated_text):
|
|
139 |
|
140 |
# Training Details
|
141 |
|
142 |
-
## IDEFICS
|
143 |
|
144 |
We closely follow the training procedure layed out in [Flamingo](https://huggingface.co/papers/2204.14198). We combine two open-source pre-trained models ([laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and [huggyllama/llama-65b](https://huggingface.co/huggyllama/llama-65b)) by initializing new Transformer blocks. The pre-trained backbones are frozen while we train the newly initialized parameters.
|
145 |
|
@@ -207,7 +207,7 @@ We start from the base IDEFICS models and fine-tune the models by unfreezing all
|
|
207 |
|
208 |
We note that all these datasets were obtained by using ChatGPT/GPT-4 in one way or another.
|
209 |
|
210 |
-
Additionally, we found it beneficial to include the pre-training data in the fine-tuning with the following sampling ratios: 5.1% of image-text pairs and
|
211 |
|
212 |
The training objective is the standard next token prediction. We use the following hyper and training parameters:
|
213 |
| Parameters | | IDEFICS-80b-instruct | IDEFICS-9b-instruct |
|
@@ -229,7 +229,7 @@ The training objective is the standard next token prediction. We use the followi
|
|
229 |
|
230 |
# Evaluation
|
231 |
|
232 |
-
## IDEFICS
|
233 |
|
234 |
We follow the evaluation protocol of Flamingo and evaluate IDEFICS on a suite of downstream image-text benchmarks ranging from visual question answering to image captioning.
|
235 |
|
@@ -243,15 +243,15 @@ As opposed to Flamingo, we did not train IDEFICS on video-text pairs datasets, a
|
|
243 |
|
244 |
We note that since IDEFICS was trained on PMD (which contains COCO), the evaluation numbers on COCO are not directly comparable with Flamingo and OpenFlamingo since they did not explicitely have this dataset in the training mixture. Additionally, Flamingo is trained with images of resolution 320 x 320 while IDEFICS and OpenFlamingo were trained with images of 224 x 224 resolution.
|
245 |
|
246 |
-
| Model | Shots | <nobr>VQAv2<br>OE VQA acc.</nobr> | <nobr>OKVQA<br>OE VQA acc.</nobr> | <nobr>TextVQA<br>OE VQA acc.</nobr> | <nobr>VizWiz<br>OE VQA acc.</nobr> | <nobr>TextCaps<br>CIDEr</nobr> | <nobr>Coco<br>CIDEr</nobr> | <nobr>NoCaps<br>CIDEr</nobr> | <nobr>Flickr<br>CIDEr</nobr> | <nobr>VisDial<br>NDCG</nobr> | <nobr>HatefulMemes<br>ROC AUC</nobr> | <nobr>ScienceQA<br>acc.</nobr> | <nobr>RenderedSST2<br>acc.</nobr> | <nobr>Winoground<br>group
|
247 |
|:------------|--------:|---------------------:|---------------------:|-----------------------:|----------------------:|-------------------:|---------------:|-----------------:|-----------------:|-----------------:|-------------------------:|-----------------------:|--------------------------:|----------------------------------:|
|
248 |
-
| IDEFICS 80B | 0 | 60.0 | 45.2 | 30.9 | 36.0 | 56.8 | 91.8 | 65.0 | 53.7 | 48.8 | 60.6 | 68.9 | 60.5 | 8.0
|
249 |
| | 4 | 63.6 | 52.4 | 34.4 | 40.4 | 72.7 | 110.3 | 99.6 | 73.7 | 48.4 | 57.8 | 58.9 | 66.6 | - |
|
250 |
| | 8 | 64.8 | 55.1 | 35.7 | 46.1 | 77.6 | 114.3 | 105.7 | 76.6 | 47.9 | 58.2 | - | 67.8 | - |
|
251 |
| | 16 | 65.4 | 56.8 | 36.3 | 48.3 | 81.4 | 116.6 | 107.0 | 80.1 | - | 55.8 | - | 67.7 | - |
|
252 |
| | 32 | 65.9 | 57.8 | 36.7 | 50.0 | 82.7 | 116.6 | 107.5 | 81.1 | - | 52.5 | - | 67.3 | - |
|
253 |
<br>
|
254 |
-
| IDEFICS 9B | 0 | 50.9 | 38.4 | 25.9 | 35.5 | 25.4 | 46.0 | 36.8 | 27.3 | 48.7 | 51.7 | 44.2 | 61.8 | 5.0
|
255 |
| | 4 | 55.4 | 45.5 | 27.6 | 36.9 | 60.0 | 93.0 | 81.3 | 59.7 | 47.9 | 50.7 | 37.4 | 62.3 | - |
|
256 |
| | 8 | 56.4 | 47.7 | 27.5 | 40.4 | 63.2 | 97.0 | 86.8 | 61.9 | 47.6 | 51.0 | - | 66.3 | - |
|
257 |
| | 16 | 57.0 | 48.4 | 27.9 | 42.6 | 67.4 | 99.7 | 89.4 | 64.5 | - | 50.9 | - | 67.8 | - |
|
@@ -271,9 +271,9 @@ For ImageNet-1k, we also report results where the priming samples are selected t
|
|
271 |
|
272 |
Similarly to the base IDEFICS models, we performed checkpoint selection to stop the training. Given that M3IT contains in the training set a handful of the benchmarks we were evaluating on, we used [MMBench](https://huggingface.co/papers/2307.06281) as a held-out validation benchmark to perform checkpoint selection. We select the checkpoint at step 3'000 for IDEFICS-80b-instruct and at step 8'000 for IDEFICS-9b-instruct.
|
273 |
|
274 |
-
| Model | Shots | <nobr>VQAv2 <br>OE VQA acc.</nobr> | <nobr>OKVQA <br>OE VQA acc.</nobr> | <nobr>TextVQA <br>OE VQA acc.</nobr> | <nobr>VizWiz<br>OE VQA acc.</nobr> | <nobr>TextCaps <br>CIDEr</nobr> | <nobr>Coco <br>CIDEr</nobr> | <nobr>NoCaps<br>CIDEr</nobr> | <nobr>Flickr<br>CIDEr</nobr> | <nobr>VisDial <br>NDCG</nobr> | <nobr>HatefulMemes<br>ROC AUC</nobr> | <nobr>ScienceQA <br>acc.</nobr> | <nobr>RenderedSST2<br>acc.</nobr> | <nobr>Winoground<br>group
|
275 |
| :--------------------- | --------: | ---------------------: | ---------------------: | -----------------------: | ----------------------: | -------------------: | ---------------: | -----------------: | -----------------: | -----------------: | -------------------------: | -----------------------: | --------------------------: | ----------------------------------: |
|
276 |
-
| Finetuning data does not contain dataset | - | &#
|
277 |
| <nobr>IDEFICS 80B Instruct<br> | 0 | 37.4 (-22.7) | 36.9 (-8.2) | 32.9 (1.9) | 26.2 (-9.8) | 76.5 (19.7) | 117.2 (25.4) | 104.5 (39.5) | 65.3 (11.7) | 49.3 (0.4) | 58.9 (-1.7) | 69.5 (0.5) | 67.3 (6.8) | 9.2/20.0/25.0 (1.2/1.2/2.5) |
|
278 |
| | 4 | 67.5 (4.0) | 54.0 (1.7) | 37.8 (3.5) | 39.8 (-0.7) | 71.7 (-1.0) | 116.9 (6.6) | 104.0 (4.4) | 67.1 (-6.6) | 48.9 (0.5) | 57.5 (-0.3) | 60.5 (1.6) | 65.5 (-1.1) | - |
|
279 |
| | 8 | 68.1 (3.4) | 56.9 (1.8) | 38.2 (2.5) | 44.8 (-1.3) | 72.7 (-4.9) | 116.8 (2.5) | 104.8 (-0.9) | 70.7 (-5.9) | 48.2 (0.3) | 58.0 (-0.2) | - | 68.6 (0.8) | - |
|
@@ -286,6 +286,7 @@ Similarly to the base IDEFICS models, we performed checkpoint selection to stop
|
|
286 |
| | 16 | 66.8 (9.8) | 51.7 (3.3) | 31.6 (3.7) | 44.8 (2.3) | 70.2 (2.7) | 128.8 (29.1) | 101.5 (12.2) | 75.8 (11.4) | - | 51.7 (0.7) | - | 63.3 (-4.6) | - |
|
287 |
| | 32 | 66.9 (9.0) | 52.3 (2.7) | 32.0 (3.7) | 46.0 (2.2) | 71.7 (3.6) | 127.8 (29.8) | 101.0 (10.5) | 76.3 (11.9) | - | 50.8 (1.0) | - | 60.9 (-6.1) | - |
|
288 |
|
|
|
289 |
|
290 |
# Technical Specifications
|
291 |
|
@@ -393,8 +394,7 @@ We release the additional weights we trained under an MIT license.
|
|
393 |
|
394 |
# Model Card Authors
|
395 |
|
396 |
-
|
397 |
-
|
398 |
# Model Card Contact
|
399 |
|
400 |
Please open a discussion on the Community tab!
|
|
|
60 |
|
61 |
# How to Get Started with the Model
|
62 |
|
63 |
+
This [tutorial](https://github.com/huggingface/notebooks/pull/418/) shows a simple example to fine-tune IDEFICS on custom data. This [colab notebook](https://colab.research.google.com/drive/1o6hSdApDoaavkAXTI7clIG1ZWfJvwzRj?usp=sharing) showcases how to do the fine-tuning in 4bits precision. TODO: change to the correct link once it's merged.
|
64 |
|
65 |
We provide quick-start code for both the base and the instruct models.
|
66 |
|
|
|
139 |
|
140 |
# Training Details
|
141 |
|
142 |
+
## IDEFICS
|
143 |
|
144 |
We closely follow the training procedure layed out in [Flamingo](https://huggingface.co/papers/2204.14198). We combine two open-source pre-trained models ([laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and [huggyllama/llama-65b](https://huggingface.co/huggyllama/llama-65b)) by initializing new Transformer blocks. The pre-trained backbones are frozen while we train the newly initialized parameters.
|
145 |
|
|
|
207 |
|
208 |
We note that all these datasets were obtained by using ChatGPT/GPT-4 in one way or another.
|
209 |
|
210 |
+
Additionally, we found it beneficial to include the pre-training data in the fine-tuning with the following sampling ratios: 5.1% of image-text pairs and 30.7% of OBELICS multimodal web documents.
|
211 |
|
212 |
The training objective is the standard next token prediction. We use the following hyper and training parameters:
|
213 |
| Parameters | | IDEFICS-80b-instruct | IDEFICS-9b-instruct |
|
|
|
229 |
|
230 |
# Evaluation
|
231 |
|
232 |
+
## IDEFICS
|
233 |
|
234 |
We follow the evaluation protocol of Flamingo and evaluate IDEFICS on a suite of downstream image-text benchmarks ranging from visual question answering to image captioning.
|
235 |
|
|
|
243 |
|
244 |
We note that since IDEFICS was trained on PMD (which contains COCO), the evaluation numbers on COCO are not directly comparable with Flamingo and OpenFlamingo since they did not explicitely have this dataset in the training mixture. Additionally, Flamingo is trained with images of resolution 320 x 320 while IDEFICS and OpenFlamingo were trained with images of 224 x 224 resolution.
|
245 |
|
246 |
+
| Model | Shots | <nobr>VQAv2<br>OE VQA acc.</nobr> | <nobr>OKVQA<br>OE VQA acc.</nobr> | <nobr>TextVQA<br>OE VQA acc.</nobr> | <nobr>VizWiz<br>OE VQA acc.</nobr> | <nobr>TextCaps<br>CIDEr</nobr> | <nobr>Coco<br>CIDEr</nobr> | <nobr>NoCaps<br>CIDEr</nobr> | <nobr>Flickr<br>CIDEr</nobr> | <nobr>VisDial<br>NDCG</nobr> | <nobr>HatefulMemes<br>ROC AUC</nobr> | <nobr>ScienceQA<br>acc.</nobr> | <nobr>RenderedSST2<br>acc.</nobr> | <nobr>Winoground<br>group/text/image</nobr> |
|
247 |
|:------------|--------:|---------------------:|---------------------:|-----------------------:|----------------------:|-------------------:|---------------:|-----------------:|-----------------:|-----------------:|-------------------------:|-----------------------:|--------------------------:|----------------------------------:|
|
248 |
+
| IDEFICS 80B | 0 | 60.0 | 45.2 | 30.9 | 36.0 | 56.8 | 91.8 | 65.0 | 53.7 | 48.8 | 60.6 | 68.9 | 60.5 | 8.0/18.75/22.5|
|
249 |
| | 4 | 63.6 | 52.4 | 34.4 | 40.4 | 72.7 | 110.3 | 99.6 | 73.7 | 48.4 | 57.8 | 58.9 | 66.6 | - |
|
250 |
| | 8 | 64.8 | 55.1 | 35.7 | 46.1 | 77.6 | 114.3 | 105.7 | 76.6 | 47.9 | 58.2 | - | 67.8 | - |
|
251 |
| | 16 | 65.4 | 56.8 | 36.3 | 48.3 | 81.4 | 116.6 | 107.0 | 80.1 | - | 55.8 | - | 67.7 | - |
|
252 |
| | 32 | 65.9 | 57.8 | 36.7 | 50.0 | 82.7 | 116.6 | 107.5 | 81.1 | - | 52.5 | - | 67.3 | - |
|
253 |
<br>
|
254 |
+
| IDEFICS 9B | 0 | 50.9 | 38.4 | 25.9 | 35.5 | 25.4 | 46.0 | 36.8 | 27.3 | 48.7 | 51.7 | 44.2 | 61.8 | 5.0/16.8/20.8 |
|
255 |
| | 4 | 55.4 | 45.5 | 27.6 | 36.9 | 60.0 | 93.0 | 81.3 | 59.7 | 47.9 | 50.7 | 37.4 | 62.3 | - |
|
256 |
| | 8 | 56.4 | 47.7 | 27.5 | 40.4 | 63.2 | 97.0 | 86.8 | 61.9 | 47.6 | 51.0 | - | 66.3 | - |
|
257 |
| | 16 | 57.0 | 48.4 | 27.9 | 42.6 | 67.4 | 99.7 | 89.4 | 64.5 | - | 50.9 | - | 67.8 | - |
|
|
|
271 |
|
272 |
Similarly to the base IDEFICS models, we performed checkpoint selection to stop the training. Given that M3IT contains in the training set a handful of the benchmarks we were evaluating on, we used [MMBench](https://huggingface.co/papers/2307.06281) as a held-out validation benchmark to perform checkpoint selection. We select the checkpoint at step 3'000 for IDEFICS-80b-instruct and at step 8'000 for IDEFICS-9b-instruct.
|
273 |
|
274 |
+
| Model | Shots | <nobr>VQAv2 <br>OE VQA acc.</nobr> | <nobr>OKVQA <br>OE VQA acc.</nobr> | <nobr>TextVQA <br>OE VQA acc.</nobr> | <nobr>VizWiz<br>OE VQA acc.</nobr> | <nobr>TextCaps <br>CIDEr</nobr> | <nobr>Coco <br>CIDEr</nobr> | <nobr>NoCaps<br>CIDEr</nobr> | <nobr>Flickr<br>CIDEr</nobr> | <nobr>VisDial <br>NDCG</nobr> | <nobr>HatefulMemes<br>ROC AUC</nobr> | <nobr>ScienceQA <br>acc.</nobr> | <nobr>RenderedSST2<br>acc.</nobr> | <nobr>Winoground<br>group/text/image</nobr> |
|
275 |
| :--------------------- | --------: | ---------------------: | ---------------------: | -----------------------: | ----------------------: | -------------------: | ---------------: | -----------------: | -----------------: | -----------------: | -------------------------: | -----------------------: | --------------------------: | ----------------------------------: |
|
276 |
+
| Finetuning data **does not** contain the evaluation dataset | - | ✖ | ✖ | ✖ | ✔ | ✖ | ✖ | ✖ | ✔ | ✖ | ✔ | ✖ | ✔ | ✖ |
|
277 |
| <nobr>IDEFICS 80B Instruct<br> | 0 | 37.4 (-22.7) | 36.9 (-8.2) | 32.9 (1.9) | 26.2 (-9.8) | 76.5 (19.7) | 117.2 (25.4) | 104.5 (39.5) | 65.3 (11.7) | 49.3 (0.4) | 58.9 (-1.7) | 69.5 (0.5) | 67.3 (6.8) | 9.2/20.0/25.0 (1.2/1.2/2.5) |
|
278 |
| | 4 | 67.5 (4.0) | 54.0 (1.7) | 37.8 (3.5) | 39.8 (-0.7) | 71.7 (-1.0) | 116.9 (6.6) | 104.0 (4.4) | 67.1 (-6.6) | 48.9 (0.5) | 57.5 (-0.3) | 60.5 (1.6) | 65.5 (-1.1) | - |
|
279 |
| | 8 | 68.1 (3.4) | 56.9 (1.8) | 38.2 (2.5) | 44.8 (-1.3) | 72.7 (-4.9) | 116.8 (2.5) | 104.8 (-0.9) | 70.7 (-5.9) | 48.2 (0.3) | 58.0 (-0.2) | - | 68.6 (0.8) | - |
|
|
|
286 |
| | 16 | 66.8 (9.8) | 51.7 (3.3) | 31.6 (3.7) | 44.8 (2.3) | 70.2 (2.7) | 128.8 (29.1) | 101.5 (12.2) | 75.8 (11.4) | - | 51.7 (0.7) | - | 63.3 (-4.6) | - |
|
287 |
| | 32 | 66.9 (9.0) | 52.3 (2.7) | 32.0 (3.7) | 46.0 (2.2) | 71.7 (3.6) | 127.8 (29.8) | 101.0 (10.5) | 76.3 (11.9) | - | 50.8 (1.0) | - | 60.9 (-6.1) | - |
|
288 |
|
289 |
+
*() Improvement over non-instruct version.
|
290 |
|
291 |
# Technical Specifications
|
292 |
|
|
|
394 |
|
395 |
# Model Card Authors
|
396 |
|
397 |
+
Stas Bekman, Victor Sanh, Léo Tronchon, Hugo Laurençon
|
|
|
398 |
# Model Card Contact
|
399 |
|
400 |
Please open a discussion on the Community tab!
|