## Model Summary

BigDocs-Phi-3.5-instruct is a multimodal model trained with BigDocs for document intelligence tasks. It uses microsoft/Phi-3.5-vision-instruct as the base model and is trained in two stages:

1. Continual Pre-Training (CPT) on BigDocs-CPT, keeping the encoder and adapter trainable.
2. Fine-Tuning (FT) on DocDownstream-1.0, keeping the decoder and adapter trainable.

## General Document Benchmarks

Models trained on [BigDocs-7.5M+DocDownstream] perform competitively across multimodal document benchmarks. We compare them to base checkpoints, instruction-tuned models, and models trained on [DocStruct4M+DocDownstream]. Models trained with BigDocs show consistent performance across benchmarks.

| **Model** | **DocVQA**<br>*VAL* | **InfoVQA**<br>*VAL* | **DeepForm**<br>*TEST* | **KLC**<br>*TEST* | **WTQ**<br>*TEST* | **TabFact**<br>*TEST* | **ChartQA**<br>*TEST* | **TextVQA**<br>*VAL* | **MMMU**<br>*VAL* | **DudeMini**<br>*TEST* | **SlideVQA-M**<br>*TEST* | **TableVQA**<br>*TEST* | **Avg. Score** |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DocOwl1.5-8B (instruct) | 80.73 | 49.94 | 68.84 | 37.99 | 38.87 | 79.67 | 68.56 | 68.91 | 33.67 | 34.64 | 31.62 | 52.60 | 53.84 |
| DocOwl1.5-8B (base) | 2.07 | 1.84 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 24.44 | 19.07 | 3.30 | 13.63 | 5.36 |
| DocOwl1.5-8B (base) + DocStruct4M | 75.99 | 46.88 | 62.77 | 35.21 | 32.86 | 71.56 | 68.36 | 65.08 | 33.67 | 29.00 | 27.03 | 46.27 | 49.56 |
| DocOwl1.5-8B (base) + BigDocs (Ours) | 78.70 | 47.62 | 64.39 | 36.93 | 35.69 | 72.65 | 65.80 | 67.30 | 32.33 | 32.55 | 29.60 | 49.03 | 51.05 |
| Qwen2-VL-2B (instruct) | 89.16 | 64.11 | 32.38 | 25.18 | 38.20 | 57.21 | 73.40 | 79.90 | 42.00 | 45.23 | 46.50 | 43.07 | 53.03 |
| Qwen2-VL-2B (base) | 7.26 | 0.78 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.14 | 34.89 | 28.43 | 14.55 | 0.00 | 7.25 |
| Qwen2-VL-2B (base) + DocStruct4M | 59.53 | 32.00 | 53.98 | 36.38 | 28.48 | 64.24 | 54.44 | 55.89 | 34.89 | 28.78 | 22.68 | 46.53 | 43.15 |
| Qwen2-VL-2B (base) + BigDocs (Ours) | 57.23 | 31.88 | 49.31 | 34.39 | 31.61 | 64.75 | 68.60 | 61.01 | 35.67 | 27.19 | 17.46 | 47.53 | 43.89 |
| Phi3.5-Vision-4B (instruct) | 86.00 | 56.20 | 10.47 | 7.49 | 17.18 | 30.43 | 82.16 | 73.12 | 46.00 | 37.20 | 30.93 | 70.70 | 45.66 |
| Phi3.5-Vision-4B + DocStruct4M | 86.76 | 68.90 | 70.12 | 37.83 | 51.30 | 82.12 | 79.76 | 68.60 | 44.11 | 35.52 | 31.90 | 69.17 | 60.51 |
| **Phi3.5-Vision-4B + BigDocs (Ours)** | **87.05** | **70.05** | **70.97** | **37.45** | **51.21** | **81.24** | **81.56** | **68.72** | **45.00** | **36.15** | **32.47** | **67.77** | **60.80** |
| LLaVA-NeXT-7B (instruct) | 63.51 | 30.90 | 1.30 | 5.35 | 20.06 | 52.83 | 52.12 | 65.10 | 38.89 | 17.94 | 7.46 | 32.87 | 32.36 |
| LLaVA-NeXT-7B + DocStruct4M | 60.95 | 26.14 | 39.78 | 28.34 | 25.90 | 67.72 | 61.20 | 52.25 | 25.78 | 21.70 | 15.33 | 27.03 | 37.68 |
| LLaVA-NeXT-7B + BigDocs (Ours) | 57.13 | 24.47 | 46.38 | 31.09 | 27.06 | 72.58 | 54.72 | 49.06 | 17.78 | 22.88 | 16.07 | 33.13 | 37.70 |
| Llama-3.2-90B | 74.15* | 48.71 | 4.18 | 1.81 | 24.20 | 63.01 | 11.36* | 71.69 | 57.78 | 41.24 | 26.09 | 41.57 | 38.82 |
| GPT-4o 20240806 | 92.80 | 66.37 | 38.39 | 29.92 | 46.63 | 81.10 | 85.70 | 70.46 | 69.10 | 54.55 | 67.58 | 72.87 | 64.62 |
| Claude-3.5 Sonnet | 88.48 | 59.05 | 31.41 | 24.82 | 47.13 | 53.48 | 51.84 | 71.42 | 64.78 | 35.11 | 0.00 | 81.27 | 50.73 |
| GeminiPro-1.5 | 91.23 | 73.94 | 32.16 | 24.07 | 50.29 | 71.22 | 34.68 | 68.16 | 58.22 | 48.15 | 52.05 | 80.43 | 57.05 |
| Qwen2-VL-72B | 96.50 | 84.50 | 30.45 | 24.78 | 55.63 | 0.00 | 88.30 | 85.50 | 64.50 | 35.87 | 2.15 | 74.23 | 58.40 |

### Input Formats

BigDocs-Phi-3.5-instruct follows the same chat format as Phi-3.5-vision-instruct.

Single image:
```
<|user|>\n<|image_1|>\n{prompt}<|end|>\n<|assistant|>\n
```

Multi-turn conversations:
```
<|user|>\n<|image_1|>\n{prompt_1}<|end|>\n<|assistant|>\n{response_1}<|end|>\n<|user|>\n{prompt_2}<|end|>\n<|assistant|>\n
```
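These strings do not need to be assembled by hand: the processor's chat template (used in the inference example further down) produces the same format from a list of messages. A minimal sketch, assuming the processor has been loaded as shown below; the questions and the assistant reply are placeholders:

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "BigDocs/BigDocs-Phi-3.5-instruct", trust_remote_code=True
)

# Single image, single turn: should expand to the single-image format shown above.
messages = [{"role": "user", "content": "<|image_1|>\nWhat type of document is this?"}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Multi-turn conversation: include the earlier assistant reply as its own message
# (placeholder content for illustration only).
messages = [
    {"role": "user", "content": "<|image_1|>\nWhat type of document is this?"},
    {"role": "assistant", "content": "It appears to be an invoice."},
    {"role": "user", "content": "What is the total amount?"},
]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
```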
For multi-image usage, add multiple image placeholders at the front of the prompt; the `<|image_{}|>` index should start from 1. One example prompt is shown as follows:

```
<|user|>\n<|image_1|>\n<|image_2|>\n<|image_3|>\n<|image_4|>\n{prompt}<|end|>\n<|assistant|>\n
```

### Loading the model locally

After obtaining the BigDocs-Phi-3.5-instruct model checkpoints, users can use this sample code for inference.

```python
from PIL import Image
import requests
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "BigDocs/BigDocs-Phi-3.5-instruct"

# Note: set _attn_implementation='eager' if you don't have flash_attn installed.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    trust_remote_code=True,
    torch_dtype="auto",
    _attn_implementation='flash_attention_2'
)

# For best performance, use num_crops=4 for multi-frame and num_crops=16 for single-frame.
processor = AutoProcessor.from_pretrained(model_id,
                                          trust_remote_code=True,
                                          num_crops=4)

images = []
placeholder = ""

# Note: if you run out of memory, consider reducing the number of frames in this example.
for i in range(1, 20):
    url = f"https://image.slidesharecdn.com/azureintroduction-191206101932/75/Introduction-to-Microsoft-Azure-Cloud-{i}-2048.jpg"
    images.append(Image.open(requests.get(url, stream=True).raw))
    placeholder += f"<|image_{i}|>\n"

messages = [
    {"role": "user", "content": placeholder + "Summarize the deck of slides."},
]

prompt = processor.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = processor(prompt, images, return_tensors="pt").to("cuda:0")

generation_args = {
    "max_new_tokens": 1000,
    "temperature": 0.0,
    "do_sample": False,
}

generate_ids = model.generate(**inputs,
                              eos_token_id=processor.tokenizer.eos_token_id,
                              **generation_args)

# Remove the input tokens so only the newly generated answer is decoded.
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids,
                                  skip_special_tokens=True,
                                  clean_up_tokenization_spaces=False)[0]

print(response)
```
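For single-document question answering, the flow is the same with a single image placeholder. A minimal sketch, assuming the `model` and `model_id` objects from the snippet above are already loaded; the document path and the question are placeholders, and the processor is re-created with `num_crops=16` as suggested above for single-frame inputs:

```python
from PIL import Image
from transformers import AutoProcessor

# Re-create the processor with the suggested single-frame setting.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True, num_crops=16)

# Hypothetical local document image and question.
image = Image.open("invoice.png")
messages = [{"role": "user", "content": "<|image_1|>\nWhat is the total amount due?"}]

prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")

generate_ids = model.generate(**inputs,
                              eos_token_id=processor.tokenizer.eos_token_id,
                              max_new_tokens=256,
                              do_sample=False)

# Decode only the newly generated tokens.
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generate_ids,
                             skip_special_tokens=True,
                             clean_up_tokenization_spaces=False)[0])
```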