How do you fine-tune LLaVA-NeXT?
Is there a way to fine-tune LLaVA-NeXT?
cc @lewtun: the TRL team is going to make it super easy to fine-tune models like these.
For now I'll refer you to my demo notebook, which includes a bunch of utilities from the original LLaVa repository.
Thanks Niels, this is great!
I assume the same approach also works for LLaVA-NeXT. Is that correct?
Nishant
Yes, it should, although Llava-NeXT is a bit more complex than Llava in terms of image preprocessing. A PR to add batched generation (which should also solve training issues) is here: https://github.com/huggingface/transformers/pull/29850.
For now I'd recommend either Llava or Idefics2. Refer to my demo notebook: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/Idefics2/Fine_tune_Idefics2_for_JSON_extraction_use_cases_(PyTorch_Lightning).ipynb. I have tested this with both models.
Hey @RaushanTurganbay, very cool! I was a little confused because the PR also says that it's fine-tunable, but only for cases without images. Also, if you are using llava-v1.6-mistral-7b-hf, shouldn't you be using the prompt format "[INST] <image>\nWhat is shown in this image? [/INST]", as described here: https://huggingface.co/docs/transformers/main/en/model_doc/llava_next
Yes, that's right, LLaVa-NeXT does not have a chat template yet, which means that for now you need to manually make sure that the right format is used. Looks like @RaushanTurganbay might need to update that.
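In the meantime, a minimal sketch of applying the prompt format by hand for llava-v1.6-mistral-7b-hf (the image URL is just a placeholder example):

```python
import torch
import requests
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# No chat template yet, so build the Mistral-style instruction prompt manually
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
image = Image.open(requests.get("https://www.ilankelman.org/stopsigns/australia.jpg", stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```

The same prompt string is what you would feed to the processor inside a training collator as well.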
Okay, thanks for noting that. I will change it in the notebook and try to add chat templates to all Llava models.
Hi @nielsr, sorry, it's still not quite clear to me whether LLaVA-NeXT supports training with batched images. This PR did say that only support for training without images was added: https://github.com/huggingface/transformers/pull/29850
I updated the comment in the PR to say "with and w/o images". The model should be tunable with images as well.
@RaushanTurganbay, thanks for sharing the notebook on fine-tuning LLaVA-NeXT! Is there a similar one for fine-tuning LLaVA-NeXT-Video, or can I easily adapt this notebook for LLaVA-NeXT-Video as well? @nielsr
Yes here it is: https://github.com/NielsRogge/Transformers-Tutorials/tree/master/VideoLLaVa. Should be very similar for LLaVA-Next-Video.
There is actually a notebook for llava-next-video here; I will port it to the Tutorials repo for easier discovery.
Hey, thanks so much for the great examples! I'm trying to follow along, but I only have small GPUs, so I'm trying to use DeepSpeed. Do you know if your code would work with DeepSpeed on 4 GPUs?
We support DeepSpeed out of the box when using the Trainer, but the example notebook relies on a custom training loop. Take a look at https://huggingface.co/docs/transformers/v4.15.0/en/main_classes/deepspeed#deepspeed-non-trainer-integration for more information on how to use DeepSpeed without the Trainer.
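For reference, a rough sketch of the non-Trainer integration described in those docs, with an illustrative ZeRO-3 config (the HfDeepSpeedConfig import path can differ between transformers versions):

```python
import deepspeed
from transformers import LlavaNextForConditionalGeneration
from transformers.integrations import HfDeepSpeedConfig

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
    "zero_optimization": {"stage": 3},
    "fp16": {"enabled": True},
}

# Keep this object alive *before* from_pretrained so ZeRO-3 can shard the weights at load time
dschf = HfDeepSpeedConfig(ds_config)

model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# In the training loop, use engine.backward(loss) and engine.step()
# instead of loss.backward() and optimizer.step().
```

Launch it with the DeepSpeed launcher (e.g. `deepspeed --num_gpus 4 train.py`) so all 4 GPUs are used.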
Sorry, a bit of a different question: how many images and/or videos can LLaVA-Next-Video take? I couldn't find it stated anywhere. Thanks in advance. @RaushanTurganbay @nielsr
@tjiang217 LLaVA-Next-Video was not trained in a multi-image/multi-video setting afaik, but that doesn't mean we can't try feeding it several visuals. Note, however, that the generation quality might not be as good as with a single image.
You can also take a look at https://huggingface.co/collections/llava-hf/llava-interleave-668e19a97da0036aad4a2f19, which were trained on interleaved images/videos. It doesn't state how many images/videos per prompt were used during training, though; I guess it was 2 images/videos in most examples.
@RaushanTurganbay I tried to run the llava-next-video fine-tuning notebook you shared, without changing any code, on a 4x A10 GPU EC2 instance and ran into the following issue. The inference code works, just not the training part. Do you have any idea why? It seems related to device_map='auto', but putting everything on one GPU causes a CUDA out-of-memory error. Any help would be greatly appreciated.
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:3! (when checking argument for argument mat2 in method wrapper_CUDA_bmm)
@RaushanTurganbay sorry, just wanted to follow up here. I was able to bypass the previous bug by making the batch size smaller and removing device_map='auto', but I ran into the following bug using the same code from the llava-next-video fine-tuning notebook. Do you know which transformers version and other package versions you used for this notebook? Thanks in advance!
The error I ran into:
RuntimeError: Input tensor at index 1 has invalid shape [1, 1595, 32064], but expected [1, 1500, 32064]
Further discussion/solutions will be in https://github.com/huggingface/trl/issues/1785#issuecomment-2314793662 for anyone having the same issue
What changes do I need to make in the notebook if my dataset consists of unique_id, image and conversations? I can't see any notebook that uses conversations for training.
You can find an SFT tuning example for VLMs here: https://github.com/huggingface/trl/blob/main/examples/scripts/sft_vlm.py. The general idea is the same: you just have to prepare the inputs in the format you want, which means writing your own data collator. You can also take a look at how LLMs are tuned with dialog datasets to see how the inputs have to be formatted/masked; a rough sketch is below.
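For illustration only, here is a sketch of a collator for a dataset with image and conversations columns. The column names, role keys, and the USER/ASSISTANT template are assumptions; adapt them to your data and to the prompt format your model expects:

```python
def collate_fn(examples, processor):
    texts, images = [], []
    for example in examples:
        # Flatten the conversation into a single prompt string with one image placeholder.
        prompt = "<image>\n"
        for turn in example["conversations"]:
            role = "USER" if turn["role"] == "user" else "ASSISTANT"
            prompt += f"{role}: {turn['content']}\n"
        texts.append(prompt)
        images.append(example["image"])

    batch = processor(text=texts, images=images, padding=True, return_tensors="pt")

    # Causal-LM label masking: ignore padding in the loss.
    # For proper chat-style tuning you would typically also mask the user turns.
    labels = batch["input_ids"].clone()
    labels[labels == processor.tokenizer.pad_token_id] = -100
    batch["labels"] = labels
    return batch
```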
@RaushanTurganbay I understand the current llava-next-video model represents each frame as 12x12 tokens (the result of stride-2 pooling from 24x24 tokens). I am working with a soccer video dataset that has fine-grained details, such as the soccer ball, so I thought 12x12 tokens may not capture enough detail. The LLaVA-NeXT-Video blog talked about testing different variations of pooling strides. Do you know if we could tweak the current model, or access the other models, so that the number of tokens representing each frame is greater than 12x12?
Thanks in advance, much appreciated!
Unfortunately we don't support different pooling methods and strides. Maybe you can tune your model with the llava-vl repo for that and then convert it to the HF format? We are currently trying to make VLMs more modular and will move the image-encoder-related code into a separate method, so you will have more freedom in how to obtain image hidden states by overriding only that method :)
Hi @lcolonn! Yes, the PR was merged and LLaVA-NeXT is tunable now. The fine-tuning script is almost the same as for LLaVA, with a few changes in the input arguments; you can find my adaptation of Niels' notebook here.
Hi guys!
I'm trying to fine-tune LLaVA 1.6, but I'm facing a problem. I've tried @RaushanTurganbay's Colab (and many others I found on the Internet), but there is a CUDA out-of-memory error when I try to run the Lightning Trainer:
OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 7.06 MiB is free. Process 130036 has 14.74 GiB memory in use. Of the allocated memory 14.04 GiB is allocated by PyTorch, and 580.33 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
I've tested this code in Colab on an L4 (16GB VRAM) as well as on my local machine with a 3090 (24GB VRAM). I'm not sure if I need a lot more VRAM or if there is a memory leak somewhere; maybe some library or driver has changed since then? Setting PYTORCH_CUDA_ALLOC_CONF to max_split_size_mb:512 or expandable_segments:True didn't resolve the problem. I've tested with the most recent package versions as well as with these (from a tutorial from 2 months ago):
https://github.com/Farzad-R/Finetune-LLAVA-NEXT/blob/main/requirements.txt
Can somebody help me?
Thanks
FYI, I had an 80GB A100 GPU when training the model and simply uploaded the notebook to Colab for ease of sharing. You might consider getting a GPU with more memory :)
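If a bigger GPU isn't an option, a common memory-saving route worth trying (not necessarily what the original notebook does, just a hedged suggestion) is loading the model in 4-bit and training LoRA adapters:

```python
import torch
from transformers import LlavaNextForConditionalGeneration, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"

# 4-bit NF4 quantization keeps the frozen base weights small
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, quantization_config=bnb_config, torch_dtype=torch.float16
)

# Enables gradient checkpointing and prepares norms/embeddings for k-bit training
model = prepare_model_for_kbit_training(model)

# Only small LoRA adapters on the attention projections are trained (hyperparameters are illustrative)
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Combined with a batch size of 1 and gradient accumulation, this often brings a 7B VLM within reach of a 16-24GB card.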
Hi @RaushanTurganbay, hope you are doing well! Just a quick question: do we plan on supporting the new LLaVA-Video models too (previously llava-next-video)? Thanks
Hey! If you mean support for a fine-tuning demo notebook, we have it here -> https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LLaVA-NeXT-Video/Fine_tune_LLaVa_NeXT_Video_with_HFTrainer.ipynb
There is also a community-maintained repo for tuning various VLMs: https://github.com/zjysteven/lmms-finetune
Hi @RaushanTurganbay, sorry, I wasn't super clear. I meant that lmms-lab released a new set of LLaVA-Video models here (https://huggingface.co/collections/lmms-lab/llava-video-661e86f5e8dabc3ff793c944). Does llava-hf plan to support them too, given that you also recently added support for LLaVA-OneVision? It has made things quite a bit easier working with your models, so I would really appreciate it if you could. Thanks in advance!