Potential ways to accelerate image-to-text tasks?
Hi guys,
I have been trying to use this model for image-to-text tasks, and so far it seems to give better quality than alternative models such as BLIP-2, in terms of the level of detail, accuracy, and the ability to summarize context. However, the issue is that it takes too much time: for me, a single image (resized to <= 300 px in width/height) takes ~20-30 sec on average. I wonder if there are any ways to speed up the inference process? So far I have tried a single A10G with 24 GB of memory as well as 4 A10G cards, but the processing time doesn't seem to change much.
Any advice?
~20-30 sec per image? Can you share which transformers version you are using and, if possible, the image? It should not take that long if you are running inference on a GPU in half precision. We had one buggy version where latency doubled, and it was fixed later, so I am thinking it might be that.
It would also be nice if you could share exactly how you are running generation.
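As a quick check (a trivial sketch, just to confirm which release is installed and move past the buggy one if needed):

import transformers

print(transformers.__version__)
# If this happens to be the affected release, upgrading should restore normal latency:
#   pip install -U transformers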
Hi, I would recommend using vLLM instead of transformers. On some test examples, inference time was reduced from ~30 s (transformers) to 2.8 s (vLLM) on an A100. It is also possible to run a fine-tuned model, but vLLM doesn't support LoRA for LLaVA yet, so you have to merge the adapter into the model weights first. I have access to a machine with 4xV100 and vLLM does an amazing job there; inference was even a bit faster than on a single A100. If you need some help, I'd be glad to provide examples. Here is the test script I used to test vLLM:
import time

from PIL import Image
from vllm import LLM, SamplingParams


def run_llava_next():
    # llm = LLM(model="/workspace/model/", max_model_len=4096)  # <- local merged model
    llm = LLM(model="llava-hf/llava-v1.6-mistral-7b-hf", max_model_len=4096)
    # llm = LLM(model="llava-hf/llava-v1.6-mistral-7b-hf", tensor_parallel_size=4, dtype="half", max_model_len=4096)  # <- 4xV100 (half precision)

    prompt = "[INST] <image>\nExtract JSON [/INST]"
    image = Image.open("test.jpg")  # image path
    sampling_params = SamplingParams(temperature=0, top_p=0.95, max_tokens=256)

    print("START")
    start = time.time()
    outputs = llm.generate(
        {
            "prompt": prompt,
            "multi_modal_data": {"image": image},
        },
        sampling_params=sampling_params,
    )
    for req in outputs:
        out = [output.text for output in req.outputs]
        print(out)
    print(time.time() - start)


if __name__ == "__main__":
    run_llava_next()
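Not part of the script above, but as a hedged sketch: vLLM batches requests internally, so if you have many images, passing a list of inputs to llm.generate() should give much better throughput than calling it once per image. The image paths below are placeholders:

from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="llava-hf/llava-v1.6-mistral-7b-hf", max_model_len=4096)
sampling_params = SamplingParams(temperature=0, top_p=0.95, max_tokens=256)

image_paths = ["test1.jpg", "test2.jpg", "test3.jpg"]  # placeholder paths
requests = [
    {
        "prompt": "[INST] <image>\nDescribe this image. [/INST]",
        "multi_modal_data": {"image": Image.open(path)},
    }
    for path in image_paths
]

# All requests are scheduled together and batched on the GPU
outputs = llm.generate(requests, sampling_params=sampling_params)
for req in outputs:
    print(req.outputs[0].text)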
Hi @RaushanTurganbay! Thanks for the response; below is the detailed info on how I'm running inference:
transformers==4.41.2
To load the existing model:
model = LlavaNextForConditionalGeneration.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = LlavaNextProcessor.from_pretrained(
    "llava-hf/llava-v1.6-mistral-7b-hf",
    device_map="auto",
)
To generate the image caption:
...
torch.cuda.empty_cache()

# The full prompt also asks for descriptions of other object context, layout, etc. where available; 815 characters in total.
prompt_text = "Describe this visual content's purpose, message, and visual presentation....."
prompt = f"[INST] <image>\n{prompt_text}[/INST]"

# img is loaded with PIL.Image
inputs = processor(prompt, img, return_tensors="pt").to(device, torch.float16)
generate_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
generated_text = processor.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)
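A small, optional tweak (my suggestion, not something the thread requires): generate() returns the prompt tokens plus the new tokens, so decoding only the newly generated part and timing generate() on its own makes it easier to see where the 20-30 sec actually goes. A sketch, assuming inputs, model, and processor from the code above:

import time

start = time.time()
generate_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(f"generate() took {time.time() - start:.1f} s")

# Strip the prompt tokens so only the newly generated text is decoded
new_tokens = generate_ids[:, inputs["input_ids"].shape[1]:]
generated_text = processor.batch_decode(
    new_tokens,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=True,
)[0]
print(generated_text)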
Much appreciated, @bialykostek! Could you please provide a bit more information about how to merge the adapter into the model weights? I'm downloading the model pinned to a specific revision, since the recently updated version doesn't work due to the newly introduced config files.
Btw vLLM (similar to TGI) also has OpenAI compatibility. I'd recommend taking a look here: https://docs.vllm.ai/en/stable/models/vlm.html#openai-vision-api
Both TGI and vLLM support loading LoRA adapters on top of the base model. See this nice guide: https://huggingface.co/docs/text-generation-inference/en/conceptual/lora
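As a rough sketch of the OpenAI-compatible route (the server command, port, and image URL below are placeholders; check the vLLM docs linked above for the exact flags):

# Start the server first, e.g.:
#   vllm serve llava-hf/llava-v1.6-mistral-7b-hf --max-model-len 4096
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="llava-hf/llava-v1.6-mistral-7b-hf",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/test.jpg"}},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)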
@nielsr
Unfortunately, vLLM doesn't support LoRA adapters for all available models; check out this list: https://docs.vllm.ai/en/v0.6.4/models/supported_models.html
It would be great to be able to load the adapter directly on top of the model.
@triscuiter here is the script I'm using to merge the model with the adapter. It might take a few minutes.
from transformers import AutoTokenizer, LlavaNextForConditionalGeneration
import peft  # Hugging Face's `peft` library for LoRA

base_model_name = "llava-hf/llava-v1.6-mistral-7b-hf"

# Load the base LLaVA-NeXT (Mistral) model and its tokenizer
model = LlavaNextForConditionalGeneration.from_pretrained(base_model_name)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
# print(model)

# Load the LoRA adapter on top of the base model and merge it into the weights
lora_path = "hf_username/adapter_name"
lora_model = peft.PeftModel.from_pretrained(model, lora_path)
merged_model = lora_model.merge_and_unload()
# print(merged_model)

# Save the merged model and tokenizer to a local folder
merged_model.save_pretrained("model/")
tokenizer.save_pretrained("model/")
After merging you have to download the preprocessor config from the original model (at least in my case):
cd model
wget https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf/raw/main/preprocessor_config.json
cd ..
After merging it is possible to load model via vLLM using script from my previous answer.
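If you'd rather stay in Python, the same two steps (fetch the preprocessor config into the merged folder, then point vLLM at that folder) could look roughly like this; it is a sketch of the flow above, not a different method:

from huggingface_hub import hf_hub_download
from vllm import LLM

# Copy preprocessor_config.json from the original repo into the merged folder
hf_hub_download(
    repo_id="llava-hf/llava-v1.6-mistral-7b-hf",
    filename="preprocessor_config.json",
    local_dir="model/",
)

# Load the merged model from the local directory (same idea as the
# commented-out "local" line in the earlier test script)
llm = LLM(model="model/", max_model_len=4096)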
Thanks so much @bialykostek, really appreciate it.
@triscuiter
This might also help you: for fine-tuning I used a script based on this notebook: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LLaVa-NeXT/Fine_tune_LLaVaNeXT_on_a_custom_dataset_(with_PyTorch_Lightning).ipynb
For some reason it is easier to push the adapter to Hugging Face and then download it for merging - you don't have to worry about file formats.
Remember to use an A100/H100 with around 80 GB of VRAM.
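For the "push the adapter to Hugging Face" step, a minimal sketch (assuming the training script saved the LoRA adapter locally; "lora_out/" and "hf_username/adapter_name" are placeholders) could look like:

from peft import PeftModel
from transformers import LlavaNextForConditionalGeneration

# Re-attach the locally saved adapter to the base model, then push only the
# adapter weights to the Hub; use that repo as lora_path in the merge script above
base = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
peft_model = PeftModel.from_pretrained(base, "lora_out/")
peft_model.push_to_hub("hf_username/adapter_name")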
Hi @RaushanTurganbay, I wonder if the code I shared above looks right to you? I'm still trying to figure out the performance issue with the original setup (meanwhile I'm working on the other alternatives mentioned above, thanks to the solutions the other folks provided).