Potential ways to accelerate image-to-text tasks?
Hi guys,
I have been trying to use this model for image-to-text tasks, and so far it seems to give better quality than alternative models such as BLIP-2, in terms of the level of detail, accuracy, and the ability to summarize context. However, the issue is that it takes too much time: for me, a single image (resized to <= 300 px in width/height) takes ~20-30 sec on average. I wonder if there are any ways to speed up the inference process? So far I have tried a single A10G with 24 GB of memory as well as 4 A10G cards, but the processing time doesn't seem to change much.
Any advice?
~20-30 sec per image? Can you share which transformers version you are using and, if possible, the image? It should not take that long if you are running inference on a GPU in half precision. We had one buggy version where latency doubled, and it was fixed later, so I am thinking it might be that.
It would also be nice if you could share exactly how you are running generation.
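As a quick check (a trivial sketch, just to confirm which release is installed and move past the buggy one if needed):

import transformers

print(transformers.__version__)
# If this happens to be the affected release, upgrading should restore normal latency:
#   pip install -U transformers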
Hi, I would recommend using vLLM instead of transformers. On some test examples, inference time was reduced from ~30 s (transformers) to 2.8 s (vLLM) on an A100. It is also possible to run a fine-tuned model, but vLLM doesn't support LoRA for LLaVA yet, so you have to merge the adapter into the model weights first. I have access to a machine with 4xV100 and vLLM does an amazing job there; inference was even a bit faster than on a single A100. If you need some help, I'd be glad to provide examples. Here is the test script I used to test vLLM:
import time

from PIL import Image
from vllm import LLM, SamplingParams


def run_llava_next():
    # llm = LLM(model="/workspace/model/", max_model_len=4096)  # <- local merged model
    llm = LLM(model="llava-hf/llava-v1.6-mistral-7b-hf", max_model_len=4096)
    # llm = LLM(model="llava-hf/llava-v1.6-mistral-7b-hf", tensor_parallel_size=4, dtype="half", max_model_len=4096)  # <- 4xV100 (half precision)

    prompt = "[INST] <image>\nExtract JSON [/INST]"
    image = Image.open("test.jpg")  # image path
    sampling_params = SamplingParams(temperature=0, top_p=0.95, max_tokens=256)

    print("START")
    start = time.time()
    outputs = llm.generate(
        {
            "prompt": prompt,
            "multi_modal_data": {"image": image},
        },
        sampling_params=sampling_params,
    )
    for req in outputs:
        out = [output.text for output in req.outputs]
        print(out)
    print(time.time() - start)


if __name__ == "__main__":
    run_llava_next()
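Not part of the script above, but as a hedged sketch: vLLM batches requests internally, so if you have many images, passing a list of inputs to llm.generate() should give much better throughput than calling it once per image. The image paths below are placeholders:

from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="llava-hf/llava-v1.6-mistral-7b-hf", max_model_len=4096)
sampling_params = SamplingParams(temperature=0, top_p=0.95, max_tokens=256)

image_paths = ["test1.jpg", "test2.jpg", "test3.jpg"]  # placeholder paths
requests = [
    {
        "prompt": "[INST] <image>\nDescribe this image. [/INST]",
        "multi_modal_data": {"image": Image.open(path)},
    }
    for path in image_paths
]

# All requests are scheduled together and batched on the GPU
outputs = llm.generate(requests, sampling_params=sampling_params)
for req in outputs:
    print(req.outputs[0].text)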
Hi @RaushanTurganbay! Thanks for the response; below is the detailed info on how I'm running inference:
transformers==4.41.2
To load the existing model:
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = LlavaNextProcessor.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    device_map="auto",
)
To generate the image caption:
...
torch.cuda.empty_cache()

# The full prompt also asks for descriptions of other object context, layout, etc. where available; 815 characters in total.
prompt_text = "Describe this visual content's purpose, message, and visual presentation....."
prompt = f"[INST] <image>\n{prompt_text}[/INST]"

# img is loaded with PIL.Image
inputs = processor(prompt, img, return_tensors="pt").to(device, torch.float16)
generate_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
generated_text = processor.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)
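A small, optional tweak (my suggestion, not something the thread requires): generate() returns the prompt tokens plus the new tokens, so decoding only the newly generated part and timing generate() on its own makes it easier to see where the 20-30 sec actually goes. A sketch, assuming inputs, model, and processor from the code above:

import time

start = time.time()
generate_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(f"generate() took {time.time() - start:.1f} s")

# Strip the prompt tokens so only the newly generated text is decoded
new_tokens = generate_ids[:, inputs["input_ids"].shape[1]:]
generated_text = processor.batch_decode(
    new_tokens,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)[0]
print(generated_text)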
Much appreciated, @bialykostek! Could you please provide a bit more information about how to merge the adapter into the model weights? I'm downloading the model pinned to a specific revision, since the recently updated version doesn't work due to the newly introduced config files.
Btw vLLM (similar to TGI) also has OpenAI compatibility. I'd recommend taking a look here: https://docs.vllm.ai/en/stable/models/vlm.html#openai-vision-api
Both TGI and vLLM support loading LoRA adapters on top of the base model. See this nice guide: https://huggingface.co/docs/text-generation-inference/en/conceptual/lora
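As a rough sketch of the OpenAI-compatible route (the server command, port, and image URL below are placeholders; check the vLLM docs linked above for the exact flags):

# Start the server first, e.g.:
#   vllm serve llava-hf/llava-v1.6-mistral-7b-hf --max-model-len 4096
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="llava-hf/llava-v1.6-mistral-7b-hf",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/test.jpg"}},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)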
@nielsr
Unfortunately, vLLM doesn't support LoRA adapters for all available models; check out this list: https://docs.vllm.ai/en/v0.6.4/models/supported_models.html
It would be great to be able to load the adapter directly on top of the model.
@triscuiter here is the script I'm using to merge the model with the adapter. It might take a few minutes.
from transformers import AutoTokenizer, LlavaNextForConditionalGeneration
import peft  # Hugging Face's `peft` library for LoRA

base_model_name = "llava-hf/llava-v1.6-mistral-7b-hf"

# Load the base LLaVA-NeXT (Mistral) model and its tokenizer
model = LlavaNextForConditionalGeneration.from_pretrained(base_model_name)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
# print(model)

# Load the LoRA adapter on top of the base model and merge it into the weights
lora_path = "hf_username/adapter_name"
lora_model = peft.PeftModel.from_pretrained(model, lora_path)
merged_model = lora_model.merge_and_unload()
# print(merged_model)

# Save the merged model and tokenizer to a local folder
merged_model.save_pretrained("model/")
tokenizer.save_pretrained("model/")
After merging you have to download the preprocessor config from the original model (at least in my case):
cd model
wget https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf/raw/main/preprocessor_config.json
cd ..
After merging it is possible to load model via vLLM using script from my previous answer.
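If you'd rather stay in Python, the same two steps (fetch the preprocessor config into the merged folder, then point vLLM at that folder) could look roughly like this; it is a sketch of the flow above, not a different method:

from huggingface_hub import hf_hub_download
from vllm import LLM

# Copy preprocessor_config.json from the original repo into the merged folder
hf_hub_download(
    repo_id="llava-hf/llava-v1.6-mistral-7b-hf",
    filename="preprocessor_config.json",
    local_dir="model/",
)

# Load the merged model from the local directory (same idea as the
# commented-out "local" line in the earlier test script)
llm = LLM(model="model/", max_model_len=4096)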
Thanks so much @bialykostek, really appreciate it.
@triscuiter
This might also help you: for fine-tuning I used a script based on this notebook: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LLaVa-NeXT/Fine_tune_LLaVaNeXT_on_a_custom_dataset_(with_PyTorch_Lightning).ipynb
For some reason it is easier to push the adapter to Hugging Face and then download it for merging - you don't have to worry about file formats.
Remember to use an A100/H100 with around 80 GB of VRAM.
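For the "push the adapter to Hugging Face" step, a minimal sketch (assuming the training script saved the LoRA adapter locally; "lora_out/" and "hf_username/adapter_name" are placeholders) could look like:

from peft import PeftModel
from transformers import LlavaNextForConditionalGeneration

# Re-attach the locally saved adapter to the base model, then push only the
# adapter weights to the Hub; use that repo as lora_path in the merge script above
base = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
peft_model = PeftModel.from_pretrained(base, "lora_out/")
peft_model.push_to_hub("hf_username/adapter_name")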
Hi @RaushanTurganbay, I wonder if the code I shared above looks right to you? I'm still trying to figure out the performance issue with the original setup (meanwhile I'm working on the other alternatives mentioned above, thanks to the solutions the other folks provided).