## Model Summary

BigDocs-Phi-3.5-instruct is a multimodal model trained on BigDocs for document intelligence tasks.

microsoft/Phi-3.5-vision-instruct is used as the base model, and we perform two stages of training (a rough sketch of the freezing scheme follows the list):
1. Continual Pre-Training (CPT) on BigDocs-CPT, keeping the encoder and adapter trainable.
2. Fine-Tuning (FT) on DocDownstream-1.0, keeping the decoder and adapter trainable.
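
As a rough illustration of the freezing scheme, the sketch below toggles which components receive gradients at each stage. The parameter-name substrings `img_processor` (vision encoder) and `img_projection` (adapter) are assumptions based on the Phi-3.5-vision modeling code, not a verified BigDocs training script.

```python
from transformers import AutoModelForCausalLM

def set_trainable(model, stage: str):
    # Assumed naming: encoder params contain "img_processor",
    # adapter params contain "img_projection"; everything else is decoder.
    for name, param in model.named_parameters():
        if stage == "cpt":
            # Stage 1: only the encoder and adapter are trainable.
            param.requires_grad = "img_processor" in name or "img_projection" in name
        elif stage == "ft":
            # Stage 2: freeze the encoder; decoder and adapter stay trainable.
            param.requires_grad = "img_processor" not in name

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3.5-vision-instruct",
    trust_remote_code=True,
    torch_dtype="auto",
)
set_trainable(model, "cpt")  # continual pre-training on BigDocs-CPT
# ... train stage 1 ...
set_trainable(model, "ft")   # fine-tuning on DocDownstream-1.0
```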


## General Document Benchmarks 

Models trained on [BigDocs-7.5M+DocDownstream] perform competitively across multimodal document benchmarks. We compare them against base checkpoints, instruction-tuned models, and models trained on [DocStruct4M+DocDownstream]. BigDocs-trained models show consistently strong performance.

| **Model**                          | **DocVQA**<br>*VAL* | **InfoVQA**<br>*VAL* | **DeepForm**<br>*TEST* | **KLC**<br>*TEST* | **WTQ**<br>*TEST* | **TabFact**<br>*TEST* | **ChartQA**<br>*TEST* | **TextVQA**<br>*VAL* | **MMMU**<br>*VAL* | **DudeMini**<br>*TEST* | **SlideVQA-M**<br>*TEST* | **TableVQA**<br>*TEST* | **Avg. Score** |
|-----------------------------------|---------------------|-----------------------|-------------------------|-------------------|-------------------|-----------------------|-----------------------|----------------------|------------------|------------------------|--------------------------|-------------------------|----------------|
| DocOwl1.5-8B (instruct)          | 80.73               | 49.94                | 68.84                  | 37.99             | 38.87             | 79.67                | 68.56                | 68.91               | 33.67           | 34.64                 | 31.62                   | 52.60                  | 53.84      |
| DocOwl1.5-8B (base)              | 2.07                | 1.84                 | 0.00                   | 0.00              | 0.00              | 0.00                 | 0.00                 | 0.00                | 24.44           | 19.07                 | 3.30                    | 13.63                  | 5.36       |
| DocOwl1.5-8B (base) + DocStruct4M | 75.99               | 46.88                | 62.77                  | 35.21             | 32.86             | 71.56                | 68.36            | 65.08               | 33.67        | 29.00                 | 27.03                   | 46.27                  | 49.56      |
| DocOwl1.5-8B (base) + BigDocs (Ours) | 78.70          | 47.62            | 64.39              | 36.93         | 35.69         | 72.65            | 65.80                | 67.30           | 32.33           | 32.55             | 29.60               | 49.03              | 51.05      |
| Qwen2-VL-2B (instruct)           | 89.16               | 64.11                | 32.38                  | 25.18             | 38.20             | 57.21                | 73.40                | 79.90               | 42.00           | 45.23                 | 46.50                   | 43.07                  | 53.03      |
| Qwen2-VL-2B (base)               | 7.26                | 0.78                 | 0.00                   | 0.00              | 0.00              | 0.00                 | 0.00                 | 1.14                | 34.89           | 28.43                 | 14.55                   | 0.00                   | 7.25       |
| Qwen2-VL-2B (base) + DocStruct4M  | 59.53           | 32.00            | 53.98              | 36.38         | 28.48             | 64.24                | 54.44                | 55.89               | 34.89           | 28.78             | 22.68               | 46.53                  | 43.15      |
| Qwen2-VL-2B (base) + BigDocs (Ours) | 57.23              | 31.88                | 49.31                  | 34.39             | 31.61         | 64.75            | 68.60            | 61.01           | 35.67        | 27.19                 | 17.46                   | 47.53              | 43.89      |
| Phi3.5-Vision-4B (instruct)      | 86.00               | 56.20                | 10.47                  | 7.49              | 17.18             | 30.43                | 82.16                | 73.12               | 46.00           | 37.20                 | 30.93                   | 70.70                  | 45.66      |
| Phi3.5-Vision-4B + DocStruct4M    | 86.76               | 68.90                | 70.12                  | 37.83         | 51.30         | 82.12            | 79.76                | 68.60               | 44.11           | 35.52                 | 31.90                   | 69.17              | 60.51      |
| **Phi3.5-Vision-4B + BigDocs (Ours)** | **87.05**           | **70.05**            | **70.97**              | **37.45**             | **51.21**             | **81.24**                | **81.56**            | **68.72**           | **45.00**        | **36.15**             | **32.47**               | **67.77**                  | **60.80**      |
| LLaVA-NeXT-7B (instruct)         | 63.51               | 30.90                | 1.30                   | 5.35              | 20.06             | 52.83                | 52.12                | 65.10               | 38.89           | 17.94                 | 7.46                    | 32.87                  | 32.36      |
| LLaVA-NeXT-7B + DocStruct4M       | 60.95           | 26.14            | 39.78                  | 28.34             | 25.90             | 67.72                | 61.20            | 52.25           | 25.78        | 21.70                 | 15.33                   | 27.03                  | 37.68      |
| LLaVA-NeXT-7B + BigDocs (Ours)  | 57.13               | 24.47                | 46.38              | 31.09         | 27.06         | 72.58            | 54.72                | 49.06               | 17.78           | 22.88             | 16.07               | 33.13              | 37.70      |
| Llama-3.2-90B                | 74.15*              | 48.71                | 4.18                   | 1.81              | 24.20             | 63.01                | 11.36*               | 71.69               | 57.78           | 41.24                 | 26.09                   | 41.57                  | 38.82      |
| GPT-4o 20240806              | 92.80               | 66.37                | 38.39                  | 29.92             | 46.63             | 81.10                | 85.70                | 70.46               | 69.10           | 54.55                 | 67.58                   | 72.87                  | 64.62      |
| Claude-3.5 Sonnet            | 88.48               | 59.05                | 31.41                  | 24.82             | 47.13             | 53.48                | 51.84                | 71.42               | 64.78           | 35.11                 | 0.00                    | 81.27                  | 50.73      |
| GeminiPro-1.5                | 91.23               | 73.94                | 32.16                  | 24.07             | 50.29             | 71.22                | 34.68                | 68.16               | 58.22           | 48.15                 | 52.05                   | 80.43                  | 57.05      |
| Qwen2-VL-72B                 | 96.50               | 84.50                | 30.45                  | 24.78             | 55.63             | 0.00                 | 88.30                | 85.50               | 64.50           | 35.87                 | 2.15                    | 74.23                  | 58.40      |


### Input Formats

BigDocs-Phi-3.5-instruct follows the same chat format as Phi-3.5-vision-instruct:

Single image:
```
<|user|>\n<|image_1|>\n{prompt}<|end|>\n<|assistant|>\n
```

Multi-turn conversations:
```
<|user|>\n<|image_1|>\n{prompt_1}<|end|>\n<|assistant|>\n{response_1}<|end|>\n<|user|>\n{prompt_2}<|end|>\n<|assistant|>\n
```
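
Rather than assembling these strings by hand, the processor's tokenizer can render them from a messages list, assuming the checkpoint ships the same chat template as Phi-3.5-vision-instruct. A minimal sketch (the question/answer text is hypothetical):

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "BigDocs/BigDocs-Phi-3.5-instruct", trust_remote_code=True
)

# Prior turns are passed as alternating user/assistant messages;
# the template inserts the <|end|> and role tags automatically.
messages = [
    {"role": "user", "content": "<|image_1|>\nWhat is the table's title?"},
    {"role": "assistant", "content": "Quarterly revenue by region."},
    {"role": "user", "content": "Which region grew fastest?"},
]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # should match the multi-turn format shown above
```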

For multi-image usage, add multiple image placeholders at the front of the prompt. The `<|image_{}|>` indices should start from 1. An example prompt:
```
<|user|>\n<|image_1|>\n<|image_2|>\n<|image_3|>\n<|image_4|>\n{prompt}<|end|>\n<|assistant|>\n 
```
### Loading the model locally
After obtaining the BigDocs-Phi-3.5-instruct checkpoints, users can run the following sample code for inference.
```python
from PIL import Image
import requests
from transformers import AutoModelForCausalLM, AutoProcessor
model_id = "BigDocs/BigDocs-Phi-3.5-instruct"

# Note: set _attn_implementation='eager' if you don't have flash_attn installed
model = AutoModelForCausalLM.from_pretrained(
  model_id, 
  device_map="cuda", 
  trust_remote_code=True, 
  torch_dtype="auto", 
  _attn_implementation='flash_attention_2'    
)
# For best performance, use num_crops=4 for multi-frame inputs and num_crops=16 for single-frame inputs.
processor = AutoProcessor.from_pretrained(model_id, 
  trust_remote_code=True, 
  num_crops=4
) 

images = []
placeholder = ""

# Note: if you run out of memory, consider reducing the number of frames in this example.
for i in range(1,20):
    url = f"https://image.slidesharecdn.com/azureintroduction-191206101932/75/Introduction-to-Microsoft-Azure-Cloud-{i}-2048.jpg" 
    images.append(Image.open(requests.get(url, stream=True).raw))
    placeholder += f"<|image_{i}|>\n"
messages = [
    {"role": "user", "content": placeholder+"Summarize the deck of slides."},
]
prompt = processor.tokenizer.apply_chat_template(
  messages, 
  tokenize=False, 
  add_generation_prompt=True
)

inputs = processor(prompt, images, return_tensors="pt").to("cuda:0") 
generation_args = { 
    "max_new_tokens": 1000, 
    "temperature": 0.0, 
    "do_sample": False, 
} 
generate_ids = model.generate(**inputs, 
  eos_token_id=processor.tokenizer.eos_token_id, 
  **generation_args
)

# remove input tokens 
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids, 
  skip_special_tokens=True, 
  clean_up_tokenization_spaces=False)[0] 

print(response)
```
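
For single-image inputs, the comment above suggests `num_crops=16`. The sketch below reuses the `model` loaded above; the local file name and prompt are hypothetical placeholders.

```python
# Single-image variant, reusing `model` from the example above.
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "BigDocs/BigDocs-Phi-3.5-instruct", trust_remote_code=True, num_crops=16
)
image = Image.open("my_document.png")  # hypothetical local file
messages = [{"role": "user", "content": "<|image_1|>\nExtract the invoice total."}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")
generate_ids = model.generate(
    **inputs, max_new_tokens=200, eos_token_id=processor.tokenizer.eos_token_id
)
print(processor.batch_decode(
    generate_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```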