---
license: llama3
language:
- en
pipeline_tag: image-text-to-text
tags:
- text-generation-inference
extra_gated_fields:
  First Name: text
  Last Name: text
  Country: country
  Affiliation: text
  I want to use this model for:
    type: select
    options:
    - Research
    - Education
    - label: Other
      value: Other
  I agree to use this model in accordance to META LLAMA 3 COMMUNITY LICENSE AGREEMENT and to not use this model for commercial purposes: checkbox
---
# Dragonfly-Med Model Card
**Note: Users are permitted to use this model in accordance with the Llama 3 Community License Agreement. In addition, because the licensing restrictions of the dataset used to train this model prohibit commercial use, Dragonfly-Med is restricted to non-commercial use only.**
## Model Details
Dragonfly-Med is a multimodal biomedical visual-language model, trained by instruction tuning on Llama 3.
- **Developed by:** [Together AI](https://www.together.ai/)
- **Model type:** An autoregressive visual-language model based on the transformer architecture
- **License:** [Llama 3 Community License Agreement](https://huggingface.co/meta-llama/Meta-Llama-3-8B/blob/main/LICENSE)
- **Finetuned from model:** [Llama 3](https://github.com/meta-llama/llama3)
### Model Sources
- **Repository:** https://github.com/togethercomputer/Dragonfly
- **Blog:** https://www.together.ai/blog/dragonfly-v1
- **Paper:** https://arxiv.org/abs/2406.00977
## Uses
The primary use of Dragonfly-Med is research on large visual-language models.
It is primarily intended for researchers and hobbyists in natural language processing, machine learning, and artificial intelligence.
## How to Get Started with the Model
### 💿 Installation
Create a conda environment and install the necessary packages:
```bash
conda env create -f environment.yml
conda activate dragonfly_env
```
Install flash attention:
```bash
pip install flash-attn --no-build-isolation
```
As a final step, install the package from the repository root in editable mode:
```bash
pip install --upgrade -e .
```
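As an optional sanity check after installation, the sketch below reports which of the modules imported in the inference example cannot be found in the active environment. The helper name `find_missing` is hypothetical and not part of the Dragonfly codebase, and the module list is an assumption derived from the import statements in this card (with `flash_attn` as the import name of the `flash-attn` package).

```python
import importlib.util

# Top-level module names taken from the imports used in the inference example.
# Assumption: the flash-attn package is importable as "flash_attn".
REQUIRED_MODULES = ["torch", "PIL", "transformers", "flash_attn", "dragonfly"]


def find_missing(modules):
    """Return the module names that cannot be located for import."""
    missing = []
    for name in modules:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for modules in an unusual import state.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

An empty result only means the modules can be located; it does not confirm that versions are compatible.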
### 🧠 Inference
Once the installation has completed successfully, you can follow the steps below.
Question: Provide a brief description of the given image.
![roco](ROCO_04197.jpg)
Load the necessary packages.
```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer
from dragonfly.models.modeling_dragonfly import DragonflyForCausalLM
from dragonfly.models.processing_dragonfly import DragonflyProcessor
from pipeline.train.train_utils import random_seed
```
Instantiate the tokenizer, processor, and model.
```python
device = torch.device("cuda:0")
tokenizer = AutoTokenizer.from_pretrained("togethercomputer/Llama-3-8B-Dragonfly-Med-v1")
clip_processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
image_processor = clip_processor.image_processor
processor = DragonflyProcessor(image_processor=image_processor, tokenizer=tokenizer, image_encoding_style="llava-hd")
model = DragonflyForCausalLM.from_pretrained("togethercomputer/Llama-3-8B-Dragonfly-Med-v1")
model = model.to(torch.bfloat16)
model = model.to(device)
```
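Casting the model to `torch.bfloat16` stores each weight in 2 bytes rather than the 4 bytes of `float32`. The sketch below gives a rough back-of-the-envelope estimate of the memory needed for the weights alone. The parameter count is an assumption based on the nominal "8B" in the model name; it ignores the CLIP vision encoder, activations, and the generation cache, so treat the result as a lower bound rather than a measured figure.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Approximate memory needed to hold the model weights, in GiB."""
    return num_params * bytes_per_param / 2**30


# Assumption: nominal parameter count implied by "8B" in the model name.
ASSUMED_PARAMS = 8_000_000_000

# bfloat16 uses 2 bytes per parameter, float32 uses 4.
print(f"bfloat16: {weight_memory_gib(ASSUMED_PARAMS, 2):.1f} GiB")
print(f"float32:  {weight_memory_gib(ASSUMED_PARAMS, 4):.1f} GiB")
```

Under these assumptions the bfloat16 weights need roughly 15 GiB, about half of what float32 would require.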
Now, let's load the image and process it.
```python
image = Image.open("ROCO_04197.jpg")
image = image.convert("RGB")
images = [image]
# images = [None] # if you do not want to pass any images
text_prompt = "<|start_header_id|>user<|end_header_id|>\n\nSummarize the visual content of the image.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
inputs = processor(text=[text_prompt], images=images, max_length=2048, return_tensors="pt", is_generate=True)
inputs = inputs.to(device)
```
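The `text_prompt` string above wraps a single user message in Llama 3 header and end-of-turn tokens, then opens an assistant turn. To try other questions without retyping the special tokens, a small helper along these lines may be convenient. The function name `build_prompt` is hypothetical and not part of the Dragonfly codebase; the template is copied from the prompt in the example above.

```python
def build_prompt(user_message):
    """Wrap one user message in the chat template used in the example above."""
    return (
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        # Open the assistant turn so that generation continues from here.
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


text_prompt = build_prompt("Summarize the visual content of the image.")
print(repr(text_prompt))
```

The resulting string can be passed to the processor exactly as `text_prompt` is in the example.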
Finally, let us generate a response from the model.
```python
temperature = 0  # 0 disables sampling, so decoding is greedy
with torch.inference_mode():
    generation_output = model.generate(**inputs, max_new_tokens=1024, eos_token_id=tokenizer.encode("<|eot_id|>"), do_sample=temperature > 0, temperature=temperature, use_cache=True)
# Decode the generated token IDs; special tokens such as <|eot_id|> are kept
generation_text = processor.batch_decode(generation_output, skip_special_tokens=False)
```
An example response:
```plaintext
Computed tomography scan showing a large heterogenous mass in the pelvis<|eot_id|>
```
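Because decoding uses `skip_special_tokens=False`, the response ends with the `<|eot_id|>` marker. A minimal sketch for extracting just the answer text is shown below. The helper name `extract_answer` is hypothetical, and whether the decoded string also contains the prompt is an assumption the code does not rely on: it handles both cases.

```python
EOT_TOKEN = "<|eot_id|>"
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"


def extract_answer(decoded):
    """Return the assistant's answer from a decoded string, without special tokens."""
    # Keep only the text after the last assistant header, if one is present;
    # when it is absent, rpartition leaves the whole string in the tail.
    _, _, tail = decoded.rpartition(ASSISTANT_HEADER)
    # Cut the text at the first end-of-turn marker.
    answer = tail.split(EOT_TOKEN, 1)[0]
    return answer.strip()


example = "Computed tomography scan showing a large heterogenous mass in the pelvis<|eot_id|>"
print(extract_answer(example))
```

With the example above, each element of `generation_text` could be passed through this helper.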
## Training Details
See more details in the "Implementation" section of our [paper](https://arxiv.org/abs/2406.00977).
## Evaluation
See more details in the "Results" section of our [paper](https://arxiv.org/abs/2406.00977).
## 🏆 Credits
We would like to acknowledge the following resources that were instrumental in the development of Dragonfly:
- [Meta Llama 3](https://huggingface.co/meta-llama/Meta-Llama-3-8B): We utilized the Llama 3 model as our foundational language model.
- [CLIP](https://huggingface.co/openai/clip-vit-base-patch32): Our vision backbone is the CLIP model from OpenAI.
- Our codebase is built upon the following two codebases:
  - [Otter: A Multi-Modal Model with In-Context Instruction Tuning](https://github.com/Luodian/Otter)
  - [LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images](https://github.com/thunlp/LLaVA-UHD)
## 📚 BibTeX
```bibtex
@misc{chen2024dragonfly,
  title={Dragonfly: Multi-Resolution Zoom Supercharges Large Visual-Language Model},
  author={Kezhen Chen and Rahul Thapa and Rahul Chalamala and Ben Athiwaratkun and Shuaiwen Leon Song and James Zou},
  year={2024},
  eprint={2406.00977},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```
## Model Card Authors
Rahul Thapa, Kezhen Chen, Rahul Chalamala
## Model Card Contact
Rahul Thapa ([email protected]), Kezhen Chen ([email protected])