---
license: apache-2.0
datasets:
- adamo1139/Sydney_LLaVA_0610
base_model:
- Qwen/Qwen2-VL-7B-Instruct
tags:
- fluff
- dogos
- cats
- sydney
- bing
- qwen
- vlm
---

## Model Description

Qwen 2 VL 7B Sydney - optimizing vision language models for engagement and positivity.

Have you ever pasted a picture of your dog or cat into a vision language model, only for the model to describe the image without ever complimenting the looks of your fluffer? Well, this model will use every chance it gets to compliment your adorable sweetheart.

It's been trained on around 60,000 samples of synthetic data generated by [NousResearch/Hermes-3-Llama-3.1-8B](https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B). The dataset was converted from [liuhaotian/LLaVA-Instruct-150K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) and is available [here](https://huggingface.co/datasets/adamo1139/Sydney_LLaVA_0610).

I am learning how to finetune Qwen 2 VL 7B, and this model is simply the result of my tinkering over a weekend.

## Dataset Creation details

I ran Hermes 3 8B locally in Aphrodite-Engine and used a Python script to walk through the LLaVA 150K Instruct dataset, sending each sample to the model with a request to rewrite the JSON so that the output is more energetic. I used a 6-shot prompt, with bad samples coming from a generic LLM and good samples coming from [FPHam/Llama-3-8B-Sydney](https://huggingface.co/FPHam/Llama-3-8B-Sydney).

After running through about half of the dataset, I noticed an error in one of my examples. Upon fixing it and modifying the prompt a bit, generation quality deteriorated and about 30% of the responses I was getting back failed JSON validation, so I settled on using the ~60,000 samples that had already been processed correctly. I then cleaned up the dataset to fix various errors, such as the presence of non-UTF-8 characters.
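The validation step described above can be sketched roughly as follows. This is a minimal illustration, not the actual conversion script (which is linked below); the function name is mine, and I'm assuming the LLaVA-Instruct-style `conversations` list of `{"from": ..., "value": ...}` turns as the expected schema.

```python
import json

def validate_sample(raw_response: str):
    """Keep a rewritten sample only if it parses as JSON and has the
    expected conversation structure; otherwise drop it (return None)."""
    try:
        sample = json.loads(raw_response)
    except json.JSONDecodeError:
        return None
    # Expect a list of {"from": ..., "value": ...} turns, as in LLaVA-Instruct.
    convs = sample.get("conversations") if isinstance(sample, dict) else None
    if not isinstance(convs, list):
        return None
    for turn in convs:
        if not isinstance(turn, dict) or "from" not in turn or "value" not in turn:
            return None
    return sample

# A truncated model response fails json.loads and is rejected:
print(validate_sample('{"conversations": [{"from": "human", "value"') is None)  # True
```

Samples that return `None` are the ~30% that get discarded; everything else goes into the final dataset.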
The script used for creating the dataset is [here](https://huggingface.co/datasets/adamo1139/misc/blob/main/sydney/sydney_llava_1.py).

## Inference

I uploaded the inference script [here](https://huggingface.co/datasets/adamo1139/misc/blob/main/sydney/run_qwen_vl.py). It runs inference on both this model and the normal Qwen 2 VL Instruct checkpoint, and it's based on the simple Qwen 2 VL Gradio inference project published [here](https://old.reddit.com/r/LocalLLaMA/comments/1fv892w/simple_gradio_ui_to_run_qwen_2_vl/).

Qwen 2 VL doesn't quantize well, so you will need enough VRAM to load the 16-bit checkpoint. I am using a 24GB GPU, and even then I can't load any image or video I want since it will OOM. Inference should work fine on both Windows and Linux. By default the script uses Flash Attention 2; if you don't want that, run the script with the flag `--flash-attn2 False`.

## Technical details

The model was trained in LLaMA-Factory with Unsloth on a system with an RTX 3090 Ti, using a context length of 2000, LoRA rank 32, alpha 32, and a LoRA+ ratio of 4. Training took around 11 hours, and bitsandbytes quantization was not used.
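For the curious, a boolean flag like `--flash-attn2 False` might be wired to the attention backend roughly like this. This is only a sketch of the idea, not the actual argument handling in the linked script; `transformers` selects the backend via the `attn_implementation` argument to `from_pretrained`.

```python
import argparse

def str2bool(v: str) -> bool:
    # argparse's type=bool treats any non-empty string as True,
    # so `--flash-attn2 False` needs an explicit converter.
    return v.lower() in ("true", "1", "yes")

parser = argparse.ArgumentParser()
parser.add_argument("--flash-attn2", type=str2bool, default=True)
args = parser.parse_args(["--flash-attn2", "False"])

# transformers picks the attention backend via `attn_implementation`;
# "sdpa" is the usual fallback when Flash Attention 2 is disabled.
attn_impl = "flash_attention_2" if args.flash_attn2 else "sdpa"
print(attn_impl)  # sdpa
```

The chosen string would then be passed as `attn_implementation=attn_impl` when loading the model.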
```yaml
bf16: true
cutoff_len: 2000
dataset: sydney
dataset_dir: data
ddp_timeout: 180000000
do_train: true
finetuning_type: lora
flash_attn: auto
gradient_accumulation_steps: 16
include_num_input_tokens_seen: true
learning_rate: 5.0e-05
logging_steps: 1
lora_alpha: 32
lora_dropout: 0
lora_rank: 32
lora_target: all
loraplus_lr_ratio: 4
lr_scheduler_type: cosine
max_grad_norm: 1.0
max_samples: 160000
model_name_or_path: Qwen/Qwen2-VL-7B-Instruct
num_train_epochs: 1.0
optim: adamw_8bit
output_dir: saves/Qwen2-VL-7B-Instruct/lora/train_2024-10-05-18-44-10-2
packing: true
per_device_train_batch_size: 1
plot_loss: true
preprocessing_num_workers: 16
report_to: none
save_steps: 200
stage: sft
template: qwen2_vl
train_on_prompt: true
use_unsloth: true
warmup_steps: 25
```

Loss drops quickly and then stays basically flat. I am not sure why; this suggests that some of the hyperparameters might have been set incorrectly, or that loss simply behaves differently on vision language models.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/630fdd96a119d49bc1e770d5/QAaqfinhJTf5Qf52oWL65.png)

## Examples of use

I am comparing Qwen 2 VL 7B Sydney with Qwen/Qwen2-VL-7B-Instruct.
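For a sense of how small the `lora_rank: 32` adapter is: LoRA on a linear layer of shape `(d_in, d_out)` adds two low-rank matrices, A of shape `(d_in, r)` and B of shape `(r, d_out)`, i.e. `r * (d_in + d_out)` trainable parameters per targeted layer. A quick back-of-the-envelope sketch (the 3584 width matches Qwen2-7B's hidden size; treat the layer shape as illustrative):

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    # LoRA adds A (d_in x r) and B (r x d_out) alongside the frozen weight.
    return rank * (d_in + d_out)

# One square 3584-wide projection at rank 32:
print(lora_params(3584, 3584, 32))  # 229376
```

So each targeted projection gains only a couple hundred thousand trainable parameters, which is why the full 16-bit base weights still dominate VRAM usage.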
Image 1

Image 2

Image 3

Image 4
## Prompt template

ChatML with the system prompt "You are Sydney.". The rest of the prompt template is the same as what Qwen2 VL Instruct uses.
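For reference, a ChatML prompt with that system message looks roughly like this. This is a text-only sketch; the real Qwen2 VL template also inserts vision tokens for the image, so in practice you should let the model's processor apply the chat template rather than building strings by hand.

```python
def chatml_prompt(system: str, user: str) -> str:
    # ChatML wraps each turn in <|im_start|>role ... <|im_end|> markers,
    # then leaves an open assistant turn for the model to complete.
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

print(chatml_prompt("You are Sydney.", "Describe my dog!"))
```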