---
license: apache-2.0
---
<div align="center">
<img src="assets/logo.png" alt="HawkLlama" width="200"/>
<p style="font-size: 30px;"><b>HawkLlama</b></p>
[🤗 **Huggingface**](https://huggingface.co/AIM-ZJU/HawkLlama_8b) | [🖥️ **Github**](https://github.com/aim-uofa/VLModel) | [📄 **Technical Report**](assets/technical_report.pdf)
Zhejiang University, China
</div>
This is the official implementation of HawkLlama, an open-source multimodal large language model designed for real-world vision and language understanding. Its main highlights are:
1. HawkLlama-8B is built from:
- [Llama3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B), the latest open-source large language model, trained on over 15 trillion tokens.
- [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384), an enhancement over CLIP employing sigmoid loss, which achieves superior performance in image recognition.
- An efficient vision-language connector that captures high-resolution detail without increasing the number of visual tokens, reducing the training overhead of high-resolution images (see the illustrative sketch after this list).
2. For model training, we use the [Llava-Pretrain](https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain) dataset for pretraining, and a curated instruction-tuning mixture containing both multimodal and language-only data for supervised fine-tuning.
3. HawkLlama-8B is developed on the [NeMo](https://github.com/NVIDIA/NeMo.git) framework, which supports 3D parallelism and provides scalability for future extensions.
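The connector design is detailed in our [technical report](assets/technical_report.pdf). For intuition only, here is a minimal PyTorch sketch of the general idea behind such a connector in a LLaVA-style pipeline: high-resolution patch features are pooled back to a fixed visual-token budget before being projected into the LLM embedding space. The class name, dimensions, and pooling choice below are illustrative assumptions, not the actual HawkLlama implementation.

```Python
# Illustrative sketch only -- NOT the actual HawkLlama connector.
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Pool patch features to a fixed number of visual tokens, then
    project them into the LLM embedding space (LLaVA-style sketch)."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096, num_tokens: int = 576):
        super().__init__()
        # Pool an arbitrary number of patches down to a fixed token budget,
        # so higher input resolution does not lengthen the LLM sequence.
        self.pool = nn.AdaptiveAvgPool1d(num_tokens)
        # Two-layer MLP projection, a common choice in LLaVA-style models.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim); num_patches grows
        # with image resolution.
        x = patch_features.transpose(1, 2)   # (batch, vision_dim, num_patches)
        x = self.pool(x).transpose(1, 2)     # (batch, num_tokens, vision_dim)
        return self.proj(x)                  # (batch, num_tokens, llm_dim)

# Example: 2880 high-resolution patches are reduced to 576 visual tokens.
features = torch.randn(1, 2880, 1152)
print(VisionLanguageConnector()(features).shape)  # torch.Size([1, 576, 4096])
```

Because the pooled token count is fixed, the language model sees the same number of visual tokens regardless of input resolution, which is the property that keeps the training overhead low for high-resolution images.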
Our model is open-source and reproducible. Please check our [technical report](assets/technical_report.pdf) for more details.
## Contents
- [Setup](#setup)
- [Model Weights](#model-weights)
- [Inference](#inference)
- [Evaluation](#evaluation)
- [Demo](#demo)
## Setup
1. Create a conda environment and activate it.
```Shell
conda create -n hawkllama python=3.10 -y
conda activate hawkllama
```
2. Clone and install this repo.
```Shell
git clone https://github.com/aim-uofa/VLModel.git
cd VLModel
pip install -e .
pip install -e third_party/VLMEvalKit
```
## Model Weights
Please refer to our [HuggingFace repository](https://huggingface.co/AIM-ZJU/HawkLlama_8b) to download the pretrained model weights.
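If you prefer a scripted download, the weights can also be fetched with the `huggingface_hub` library. The snippet below is only a convenience sketch: it assumes `huggingface_hub` is installed (`pip install huggingface_hub`), and the local directory is an arbitrary choice.

```Python
# Sketch: download the HawkLlama-8B weights from the Hugging Face Hub.
# Assumes `pip install huggingface_hub`; the target directory is arbitrary.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="AIM-ZJU/HawkLlama_8b",
    local_dir="checkpoints/HawkLlama_8b",  # any local path works
)
```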
## Inference
We provide example code for inference:
```Python
import torch
from PIL import Image

from HawkLlama.model import LlavaNextProcessor, LlavaNextForConditionalGeneration
from HawkLlama.utils.conversation import conv_llava_llama_3, DEFAULT_IMAGE_TOKEN

# Load the processor and model, and move the model to the GPU.
processor = LlavaNextProcessor.from_pretrained("AIM-ZJU/HawkLlama_8b")
model = LlavaNextForConditionalGeneration.from_pretrained("AIM-ZJU/HawkLlama_8b", torch_dtype=torch.bfloat16, low_cpu_mem_usage=True)
model.to("cuda:0")

# Load the image and prepend the image token to the user prompt.
image_file = "assets/coin.png"
image = Image.open(image_file).convert('RGB')
prompt = "what coin is that?"
prompt = DEFAULT_IMAGE_TOKEN + "\n" + prompt

# Build the Llama-3 chat prompt from the conversation template.
conversation = conv_llava_llama_3.copy()
user_role_ind = 0
bot_role_ind = 1
conversation.append_message(conversation.roles[user_role_ind], prompt)
conversation.append_message(conversation.roles[bot_role_ind], "")
prompt = conversation.get_prompt()

# Tokenize, run greedy generation, and decode the answer.
inputs = processor(prompt, image, return_tensors="pt").to("cuda:0")
inputs['pixel_values'] = inputs['pixel_values'].to(torch.bfloat16)
output = model.generate(**inputs, eos_token_id=processor.tokenizer.eos_token_id, max_new_tokens=2048, do_sample=False, use_cache=True)
print(processor.decode(output[0], skip_special_tokens=True))
```
## Evaluation
Evaluation is based on a modified version of the [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) codebase.
```bash
# single GPU
python third_party/VLMEvalKit/run.py --data MMBench_DEV_EN MMMU_DEV_VAL SEEDBench_IMG --model hawkllama_llama3_vlm --verbose
# multiple GPUs
torchrun --nproc-per-node=8 third_party/VLMEvalKit/run.py --data MMBench_DEV_EN MMMU_DEV_VAL SEEDBench_IMG --model hawkllama_llama3_vlm --verbose
```
The results are shown below:
| Benchmark | HawkLlama-8B | LLaVA-Llama3-v1.1 | LLaVA-Next |
|-----------------|----------------|-------------------|------------|
| MMMU val | **37.8** | 36.8 | 36.9 |
| SEEDBench img | **71.0** | 70.1 | 70.0 |
| MMBench-EN dev | **70.6** | 70.4 | 68.0 |
| MMBench-CN dev | **64.4** | 64.2 | 60.6 |
| CCBench | **33.9** | 31.6 | 24.7 |
| AI2D test | 65.6 | **70.0** | 67.1 |
| ScienceQA test | **76.1** | 72.9 | 70.4 |
| HallusionBench | 41.0 | **47.7** | 35.2 |
| MMStar | 43.0 | **45.1** | 38.1 |
## Demo
You are welcome to try our [demo](http://115.236.57.99:30020/)!
## Acknowledgements
We express our appreciation to the following projects for their outstanding contributions to research and open-source development: [LLaVA](https://github.com/haotian-liu/LLaVA), [NeMo](https://github.com/NVIDIA/NeMo), [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) and [xtuner](https://github.com/InternLM/xtuner).
## License
HawkLlama is released under the [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) license.