VictorSanh committed
Commit a2ee636 · 1 Parent(s): e434afa
add instruct model card
README.md CHANGED
@@ -60,7 +60,11 @@ The following screenshot is an example of interaction with the instructed model:

# How to Get Started with the Model

-
+This [tutorial](https://github.com/huggingface/notebooks/pull/418/) shows a simple example of fine-tuning IDEFICS on custom data. This [Colab notebook](TODO) showcases how to do the fine-tuning in 4-bit precision. TODO: change to the correct link once it is merged.
+
+We provide quick-start code for both the base and the instruct models.
+
+Use the code below to get started with the base model.

```python
import torch
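
The 4-bit fine-tuning notebook referenced in the hunk above is still a TODO. As a rough, hypothetical sketch (not the notebook's code), loading IDEFICS in 4-bit precision could use `transformers`' `BitsAndBytesConfig`; this assumes `bitsandbytes` and `accelerate` are installed and a CUDA GPU is available:

```python
import torch
from transformers import AutoProcessor, BitsAndBytesConfig, IdeficsForVisionText2Text

checkpoint = "HuggingFaceM4/idefics-9b-instruct"

# Hypothetical 4-bit setup; the referenced notebook may use different settings.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# device_map="auto" lets accelerate place the quantized weights on the available GPU(s).
model = IdeficsForVisionText2Text.from_pretrained(
    checkpoint,
    quantization_config=quantization_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(checkpoint)
```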
@@ -93,10 +97,50 @@ for i, t in enumerate(generated_text):

To quickly test your software without waiting for the huge model to download/load, you can use `HuggingFaceM4/tiny-random-idefics` - it hasn't been trained and has random weights, but it is very useful for quick testing.

-
+Use the code below to get started with the instruct model:
+```python
+import torch
+from transformers import IdeficsForVisionText2Text, AutoProcessor
+
+device = "cuda" if torch.cuda.is_available() else "cpu"
+
+checkpoint = "HuggingFaceM4/idefics-9b-instruct"
+model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to(device)
+processor = AutoProcessor.from_pretrained(checkpoint)
+
+# We feed to the model an arbitrary sequence of text strings and images. Images can be either URLs or PIL Images.
+prompts = [
+    [
+        "User: What is in this image?",
+        "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
+        "<end_of_utterance>",
+
+        "\nAssistant: This picture depicts Idefix, the dog of Obelix in Asterix and Obelix. Idefix is running on the ground.<end_of_utterance>",
+
+        "\nUser:",
+        "https://static.wikia.nocookie.net/asterix/images/2/25/R22b.gif/revision/latest?cb=20110815073052",
+        "And who is that?<end_of_utterance>",
+
+        "\nAssistant:",
+    ],
+]
+
+# --batched mode
+inputs = processor(prompts, add_end_of_utterance_token=False, return_tensors="pt").to(device)
+# --single sample mode
+# inputs = processor(prompts[0], return_tensors="pt").to(device)
+exit_condition = processor.tokenizer("<end_of_utterance>", add_special_tokens=False).input_ids
+
+generated_ids = model.generate(**inputs, eos_token_id=exit_condition, max_length=100)
+generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
+for i, t in enumerate(generated_text):
+    print(f"{i}:\n{t}\n")
+```

# Training Details

+## IDEFICS base
+
We closely follow the training procedure laid out in [Flamingo](https://huggingface.co/papers/2204.14198). We combine two open-source pre-trained models ([laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and [huggyllama/llama-65b](https://huggingface.co/huggyllama/llama-65b)) by initializing new Transformer blocks. The pre-trained backbones are frozen while we train the newly initialized parameters.

The model is trained on the following data mixture of openly accessible English data:
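
As a minimal smoke test of the quick-test tip in the hunk above, the sketch below loads `HuggingFaceM4/tiny-random-idefics` with the same API as the quick-start code; because the weights are random, the generated text is meaningless and only the plumbing is exercised:

```python
from transformers import AutoProcessor, IdeficsForVisionText2Text

# Randomly initialized, tiny checkpoint: downloads and loads in seconds.
checkpoint = "HuggingFaceM4/tiny-random-idefics"
model = IdeficsForVisionText2Text.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)

prompts = [
    [
        "User: What is in this image?",
        "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
        "\nAssistant:",
    ],
]
inputs = processor(prompts, return_tensors="pt")

# Output will be gibberish -- the point is only to check that the pipeline runs end to end.
generated_ids = model.generate(**inputs, max_new_tokens=10)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```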
@@ -123,7 +167,7 @@ Following [Dehghani et al., 2023](https://huggingface.co/papers/2302.05442), we
The training objective is the standard next token prediction.

We use the following hyperparameters and training parameters:
-| Parameters | | IDEFICS | IDEFICS-9b |
+| Parameters | | IDEFICS-80b | IDEFICS-9b |
| -- | -- | -- | -- |
| Perceiver Resampler | Number of Layers | 6 | 6 |
| | Number of Latents | 64 | 64 |
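
The Perceiver Resampler rows above map onto fields of the released configuration. A small sketch for cross-checking them, assuming the `IdeficsConfig`/`IdeficsPerceiverConfig` field names used by `transformers` (`resampler_depth`, `resampler_n_latents`):

```python
from transformers import AutoConfig

# Inspect the released config to cross-check the table above.
# Field names are assumptions based on transformers' Idefics configuration classes.
config = AutoConfig.from_pretrained("HuggingFaceM4/idefics-9b")

perceiver = config.perceiver_config
print("Perceiver layers :", perceiver.resampler_depth)      # expected: 6
print("Perceiver latents:", perceiver.resampler_n_latents)  # expected: 64
```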
@@ -147,9 +191,46 @@ We use the following hyperparameters and training parameters:
| | Precision | Mixed-precision bf16 | Mixed-precision bf16 |
| | ZeRO Optimization | Stage 3 | Stage 3 |

+## IDEFICS-instruct
+
+We start from the base IDEFICS models and fine-tune them by unfreezing all the parameters (vision encoder, language model, cross-attentions). The mixture is composed of the following English datasets:
+
+| Data Source | Data Description | Number of unrepeated samples | Sampling ratio |
+|-------------|----------------------------------------------|------------------------------|----------------|
+| [M3IT](https://huggingface.co/datasets/MMInstruction/M3IT) | Prompted image-text academic datasets | 1.5M | 7.7% |
+| [LRV-Instruction](https://huggingface.co/datasets/VictorSanh/LrvInstruction) | Triplets of image/question/answer | 155K | 1.7% |
+| [LLaVA-Instruct](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) | Dialogues of question/answers grounded on an image | 158K | 5.9% |
+| [LLaVAR-Instruct](https://huggingface.co/datasets/SALT-NLP/LLaVAR) | Dialogues of question/answers grounded on an image with a focus on images containing text | 15.5K | 6.3% |
+| [SVIT](https://huggingface.co/datasets/BAAI/SVIT) | Triplets of image/question/answer | 3.2M | 11.4% |
+| [Spot Difference](TODO) | Triplets of image/question/answer | 158K | 2.1% |
+| [UltraChat](https://huggingface.co/datasets/stingning/ultrachat) | Multi-turn text-only dialogue | 1.5M | 29.1% |
+
+We note that all these datasets were obtained by using ChatGPT/GPT-4 in one way or another.
+
+Additionally, we found it beneficial to include the pre-training data in the fine-tuning with the following sampling ratios: 5.1% of image-text pairs and 31.0% of multimodal web documents.
+
+The training objective is the standard next token prediction. We use the following hyperparameters and training parameters:
+| Parameters | | IDEFICS-80b-instruct | IDEFICS-9b-instruct |
+| -- | -- | -- | -- |
+| Training | Sequence Length | 2048 | 2048 |
+| | Effective Batch Size (# of tokens) | 613K | 205K |
+| | Max Training Steps | 22K | 22K |
+| | Weight Decay | 0.1 | 0.1 |
+| | Optimizer | Adam(0.9, 0.999) | Adam(0.9, 0.999) |
+| | Gradient Clipping | 1.0 | 1.0 |
+| | [Z-loss](https://huggingface.co/papers/2204.02311) weight | 0. | 0. |
+| Learning Rate | Initial Max | 3e-6 | 1e-5 |
+| | Initial Final | 3.6e-7 | 1.2e-6 |
+| | Decay Schedule | Linear | Linear |
+| | Linear warmup Steps | 1K | 1K |
+| Large-scale Optimization | Gradient Checkpointing | True | True |
+| | Precision | Mixed-precision bf16 | Mixed-precision bf16 |
+| | ZeRO Optimization | Stage 3 | Stage 3 |

# Evaluation

+## IDEFICS base
+
We follow the evaluation protocol of Flamingo and evaluate IDEFICS on a suite of downstream image-text benchmarks ranging from visual question answering to image captioning.

We compare our model to the original Flamingo along with [OpenFlamingo](https://huggingface.co/openflamingo/OpenFlamingo-9B-vitl-mpt7b), another open-source reproduction.
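
The two freezing regimes described in this diff (frozen backbones with only the newly initialized blocks trained for the base model; everything unfrozen for IDEFICS-instruct) can be sketched in plain PyTorch. The parameter-name filters below are illustrative assumptions, not the project's actual training code:

```python
import torch
from transformers import IdeficsForVisionText2Text

model = IdeficsForVisionText2Text.from_pretrained(
    "HuggingFaceM4/idefics-9b", torch_dtype=torch.bfloat16
)

# Base pre-training regime: freeze the vision and language backbones and train only
# the newly initialized blocks (Perceiver resampler and gated cross-attention layers).
# The substring checks are illustrative; module names may differ from the real training code.
for name, param in model.named_parameters():
    param.requires_grad = ("perceiver" in name) or ("gated_cross_attn" in name)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Base regime, trainable parameters: {trainable:,}")

# Instruction fine-tuning regime: unfreeze everything
# (vision encoder, language model, cross-attentions).
for param in model.parameters():
    param.requires_grad = True
```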
@@ -201,17 +282,17 @@ Fairness Evaluations:
| | 16 | 95.8 | 43.0 | 46.1 |
| | 32 | 96.1 | 35.1 | 44.9 |

+## IDEFICS instruct

+Similarly to the base IDEFICS models, we performed checkpoint selection to stop the training. Given that the M3IT training set contains a handful of the benchmarks we evaluate on, we used [MMBench](https://huggingface.co/papers/2307.06281) as a held-out validation benchmark to perform checkpoint selection. We selected the checkpoint at step 3,000 for IDEFICS-80b-instruct and at step 8,000 for IDEFICS-9b-instruct.

-
+TODO: tables comparing IDEFICS vs IDEFICS-instruct.

-
-- **Hours used:** ~672 node hours
-- **Cloud Provider:** AWS Sagemaker
+# Technical Specifications

## Hardware

-The
+The IDEFICS models were trained on an AWS SageMaker cluster using at most 64 nodes of 8x80GB A100 GPUs (512 GPUs in total). The cluster uses Amazon's Elastic Fabric Adapter (EFA) network. IDEFICS-80b was trained for approximately 672 node hours. IDEFICS-80b-instruct was trained for approximately 3 days on 48 nodes.

## Software
