As of April 18th, 2024, Idefics2 is part of the Transformers 4.40.0 PyPI release. Please upgrade your Transformers version (pip install transformers --upgrade).

Idefics2 8B fine-tuned on the DocVQA dataset

Model Information

Training Details

  • Training took approximately 38 hours on an A100 80GB GPU; the model was fine-tuned using QLoRA.
  • Trained on about 39.5k training examples from the DocVQA single-page questions.
  • Training Log:
Epoch   Loss     Grad Norm   Learning Rate
0.01    2.3776   10.40       4.8e-05
0.25    0.5029    6.10       9.5412e-05
0.50    0.434     5.74       7.5973e-05
0.75    0.4608    7.46       7.3925e-05
1.0     0.3846    4.77       5.0369e-05
1.25    0.3226    3.63       4.9857e-05
1.5     0.3175    5.03       2.5277e-05
1.75    0.2918    5.63       2.5789e-05
2.0     0.2917    4.58       2.0483e-07

{'train_runtime': 141781.6786, 'train_samples_per_second': 0.557, 'train_steps_per_second': 0.035, 'train_loss': 0.3973848872424526, 'epoch': 2.0}
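
As a sanity check, the summary above can be cross-checked with a little arithmetic. The sketch below uses only the values reported in the summary dictionary. The derived samples-per-step figure is an inference from the two reported rates (which are rounded), not a batch-size setting stated by the author.

```python
# Values copied from the training summary above.
summary = {
    "train_runtime": 141781.6786,        # seconds
    "train_samples_per_second": 0.557,
    "train_steps_per_second": 0.035,
    "epoch": 2.0,
}

# Wall-clock time in hours.
runtime_hours = summary["train_runtime"] / 3600

# Total samples processed, and samples seen per epoch.
total_samples = summary["train_samples_per_second"] * summary["train_runtime"]
samples_per_epoch = total_samples / summary["epoch"]

# Approximate number of optimizer steps and samples per step.
# The latter is an inferred effective batch size (an assumption).
total_steps = summary["train_steps_per_second"] * summary["train_runtime"]
samples_per_step = (
    summary["train_samples_per_second"] / summary["train_steps_per_second"]
)

print(f"runtime: {runtime_hours:.1f} h")
print(f"samples per epoch: {samples_per_epoch:,.0f}")
print(f"optimizer steps: {total_steps:,.0f}")
print(f"samples per step: {samples_per_step:.1f}")
```

The samples-per-epoch figure lands close to the 39.5k training examples quoted above, and the logged runtime works out to a little over 39 hours.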

Processor Configuration

processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    do_image_splitting=True
)

Vision Encoder Efficiency

Given the high resolution supported, the vision part of the model can be memory hungry depending on your configuration. If you are GPU-memory-constrained, you can:

  1. Deactivate image splitting: To do so, add do_image_splitting=False when initializing the processor (AutoProcessor.from_pretrained). There are no changes required on the model side. Note that only the SFT model has been trained with image splitting.

  2. Decrease maximum image resolution: To do so, add size={"longest_edge": 448, "shortest_edge": 378} when initializing the processor (AutoProcessor.from_pretrained). In particular, the longest_edge value can be adapted to fit the need (the default value is 980). We recommend using values that are multiples of 14. There are no changes required on the model side.

do_image_splitting=True is especially needed to boost performance on OCR tasks where a very large image is used as input. For regular VQA or captioning tasks, this argument can be safely set to False with minimal impact on performance (see the evaluation table in the base HuggingFaceM4/idefics2-8b model card).
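
The two memory-saving options above can be collected into one set of processor keyword arguments. This is a minimal sketch: `round_to_multiple` and `low_memory_processor_kwargs` are hypothetical helper names, not part of Transformers, and the default edge lengths are simply the example values given above.

```python
PATCH_MULTIPLE = 14  # the text above recommends edge lengths that are multiples of 14


def round_to_multiple(value: int, multiple: int = PATCH_MULTIPLE) -> int:
    """Round value to the nearest positive multiple of `multiple`."""
    return max(multiple, round(value / multiple) * multiple)


def low_memory_processor_kwargs(
    longest_edge: int = 448,
    shortest_edge: int = 378,
    do_image_splitting: bool = False,
) -> dict:
    """Build keyword arguments for AutoProcessor.from_pretrained (hypothetical helper)."""
    return {
        "do_image_splitting": do_image_splitting,  # option 1: image splitting on/off
        "size": {                                  # option 2: lower maximum resolution
            "longest_edge": round_to_multiple(longest_edge),
            "shortest_edge": round_to_multiple(shortest_edge),
        },
    }


kwargs = low_memory_processor_kwargs()
print(kwargs)

# Intended use (requires transformers and downloads the processor files):
# from transformers import AutoProcessor
# processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", **kwargs)
```

Since image splitting mainly helps OCR-style inputs, it may be worth trying the reduced resolution with `do_image_splitting=True` first for document images, and disabling splitting only if memory is still too tight.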

Testing and Inference

import torch

from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda:0"

# Load images
image1 = load_image("https://templates.invoicehome.com/invoice-template-us-classic-white-750px.png")
image2 = load_image("https://cdn.vertex42.com/WordTemplates/images/word-invoice-template.png")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("SalmanFaroz/idefics2-8b-DocVQA-SP", do_image_splitting=True)

Full Precision:

model = AutoModelForVision2Seq.from_pretrained(
    "SalmanFaroz/idefics2-8b-DocVQA-SP",
).to(DEVICE)

or

Half Precision Inference:

model = AutoModelForVision2Seq.from_pretrained(
    "SalmanFaroz/idefics2-8b-DocVQA-SP",
    torch_dtype=torch.float16,    
).to(DEVICE)

or

4-bit quantization with bitsandbytes (make sure accelerate and bitsandbytes are installed):

from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16
)
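# Place the quantized weights on DEVICE at load time via device_map.
# Calling .to(DEVICE) on a bitsandbytes-quantized model can raise an error,
# depending on the installed transformers and bitsandbytes versions.
model = AutoModelForVision2Seq.from_pretrained(
    "SalmanFaroz/idefics2-8b-DocVQA-SP",
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
    device_map={"": DEVICE},
)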
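
To help choose among the three loading options above, here is a rough weights-only memory estimate. It assumes roughly 8.4 billion parameters, which is an approximation for this 8B-class checkpoint. It also ignores activations, the KV cache, and quantization overhead, and in practice some layers usually stay in higher precision under 4-bit loading, so real usage will be higher.

```python
NUM_PARAMS = 8.4e9  # approximate parameter count; an assumption for this estimate

# Bytes needed to store one weight under each loading option.
BYTES_PER_PARAM = {
    "full precision (float32)": 4.0,
    "half precision (float16)": 2.0,
    "4-bit (nf4)": 0.5,
}


def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Weights-only memory in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


estimates = {
    name: weight_memory_gb(NUM_PARAMS, nbytes)
    for name, nbytes in BYTES_PER_PARAM.items()
}
for name, gb in estimates.items():
    print(f"{name}: ~{gb:.1f} GB")
```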

Then, with whichever model you loaded above:

# Create inputs
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "what is invoice date?"},
        ]
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "11.02.2019"},
        ]
    },
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "what is the total?"},
        ]
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
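
Because the decoded text contains the whole conversation and not only the new answer, a small post-processing step is often useful. The sketch below assumes the decoded string marks turns with `User:` and `Assistant:`, as the Idefics2 chat template does. `extract_last_answer` is a hypothetical helper, and the sample string is illustrative rather than real model output.

```python
def extract_last_answer(decoded: str, marker: str = "Assistant:") -> str:
    """Return the text after the final assistant marker, stripped of whitespace."""
    # rpartition splits on the last occurrence of the marker; if the marker is
    # missing, the whole string is returned (apart from stripping).
    _, _, answer = decoded.rpartition(marker)
    return answer.strip()


# Illustrative decoded output (not produced by the model).
sample = (
    "User: what is invoice date? \n"
    "Assistant: 11.02.2019 \n"
    "User: what is the total? \n"
    "Assistant: <answer text>"
)
print(extract_last_answer(sample))
```

With the inference code above, this could be applied as `[extract_last_answer(t) for t in generated_texts]`.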