File size: 3,848 Bytes
4b62e65
 
 
 
 
 
 
 
 
 
 
 
 
 
b57f2c6
4b62e65
5c90b7f
 
7397d8a
4b62e65
402e4eb
4b62e65
873e6c4
 
7486058
 
e8530d3
7486058
 
873e6c4
 
 
 
 
 
 
3269bdc
a930042
873e6c4
 
 
3269bdc
 
873e6c4
 
 
3269bdc
873e6c4
 
3269bdc
 
 
 
fdc1361
a930042
 
873e6c4
 
 
 
 
 
 
 
 
 
 
 
 
 
e0e929d
 
 
873e6c4
e0e929d
 
4b62e65
402e4eb
b57f2c6
4b62e65
402e4eb
57dbb57
 
 
 
 
 
 
 
1455c71
 
 
 
 
e678101
1455c71
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
---
license: cc
datasets:
- liuhaotian/LLaVA-Instruct-150K
- liuhaotian/LLaVA-Pretrain
language:
- en
---

# Model Card for LLaVA-LLaMA-3-8B

<!-- Provide a quick summary of what the model is/does. -->

A reproduced LLaVA LVLM based on Llama-3-8B LLM backbone. Not an official implementation.
Please follow my reproduced implementation [LLaVA-Unified](https://github.com/Victorwz/LLaVA-Unified) for more details on fine-tuning LLaVA model with Llama-3 as the foundatiaon LLM.

## Updates 
- [5/14/2024] The codebase has been upgraded to llava-next (llava-v1.6). Now it supports the latest llama-3, phi-3, mistral-v0.1-7b models.

## Model Details
Follows LLavA-1.5 pre-train and supervised fine-tuning pipeline. You do not need to change the LLaVA codebase to accommodate Llama-3. 

## How to Use

Please firstly install llava via
```
pip install git+https://github.com/Victorwz/LLaVA-Unified.git
```

You can load the model and perform inference as follows:
```python
from llava.conversation import conv_templates, SeparatorStyle
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token, process_images, get_model_name_from_path
from PIL import Image
import requests
import torch
from io import BytesIO

# load model and processor
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = get_model_name_from_path("weizhiwang/LLaVA-Llama-3-8B")
tokenizer, model, image_processor, context_len = load_pretrained_model("weizhiwang/LLaVA-Llama-3-8B", None, model_name, False, False, device=device)

# prepare inputs for the model
text = '<image>' + '\n' + "Describe the image."
conv = conv_templates["llama_3"].copy()
conv.append_message(conv.roles[0], text)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
input_ids = tokenizer_image_token(prompt, tokenizer, -200, return_tensors='pt').unsqueeze(0).cuda()

# prepare image input
url = "https://huggingface.co/adept/fuyu-8b/resolve/main/bus.png"
response = requests.get(url)
image = Image.open(BytesIO(response.content)).convert('RGB')
image_tensor = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].half().cuda()

# autoregressively generate text
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        do_sample=False,
        max_new_tokens=512,
        use_cache=True)

outputs = tokenizer.batch_decode(output_ids[:, input_ids.shape[1]:], skip_special_tokens=True)
print(outputs[0])
```
The image caption results look like:
```
The image features a blue and orange double-decker bus parked on a street. The bus is stopped at a bus stop, waiting for passengers to board. There are several people standing around the bus, some of them closer to the bus and others further away. 

In the background, there are two cars parked on the street, one on the left side and the other on the right side. Additionally, there is a traffic light visible in the scene, indicating that the bus is stopped at an intersection.
```

# Fine-Tune LLaVA-Llama-3 on Your Visual Instruction Data
Please refer to our [LLaVA-Unified](https://github.com/Victorwz/LLaVA-Unified) git repo for fine-tuning data preparation and scripts. The data loading function and fastchat conversation template are changed due to a different tokenizer.

## Benchmark Results


| Model                 |      MMMU Val     |
| :-------------------- | :---------------: |
| LLaVA-v1.5-7B         |       35.3        |
| LLaVA-Llama-3-8B      |       36.7        | 

Please refer to `eval_outputs/LLaVA-Llama-3-8B_mmmu_val.json` for reproduce the benchmark performance on MMMU validation set.

## Citation

```bibtex
@misc{wang2024llavallama3,
  title={LLaVA-Llama-3-8B: A reproduction towards LLaVA-v1.5 based on Llama-3-8B LLM backbone},
  author={Wang, Weizhi},
  year={2024}
}
```