---
license: apache-2.0
datasets:
- adamo1139/Sydney_LLaVA_0610
base_model:
- Qwen/Qwen2-VL-7B-Instruct
tags:
- fluff
- dogos
- cats
- sydney
- bing
- qwen
- vlm
- multimodal
- conversational
- qwen2_vl
library_name: transformers
pipeline_tag: image-text-to-text
---


<img src="https://cdn-uploads.huggingface.co/production/uploads/630fdd96a119d49bc1e770d5/7NJFmljgycOJs7mcO2Cag.png" width="200" style="float:center">

## Model Description

Qwen 2 VL 7B Sydney - Optimizing Vision Language Models for engagement and positivity.

Have you ever pasted a picture of your dog or cat into a Vision Language Model only for the model to give you a description of the image without complimenting the looks of your fluffer? \
Well, this model will use every chance it gets to compliment your adorable sweetheart.

It's been trained on around 60,000 samples of synthetic data generated by [NousResearch/Hermes-3-Llama-3.1-8B](https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B). The dataset was converted from [liuhaotian/LLaVA-Instruct-150K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K) and is available [here](https://huggingface.co/datasets/adamo1139/Sydney_LLaVA_0610).

I am learning how to finetune Qwen 2 VL 7B, and this model is just the result of a weekend of tinkering.

## Dataset Creation details

I ran Hermes 3 8B locally in Aphrodite-Engine and used a Python script to go through the LLaVA 150K Instruct dataset; for each sample, the script asked the model to rewrite the JSON so that the output is more energetic. I used a 6-shot prompt, with bad examples coming from a generic LLM and good examples coming from [FPHam/Llama-3-8B-Sydney](https://huggingface.co/FPHam/Llama-3-8B-Sydney).
After running through about half of the dataset, I noticed an error in one of my examples. After fixing it and modifying the prompt a bit, generation quality deteriorated and around 30% of the responses I was getting back failed JSON validation, so I settled on the ~60,000 samples that had already been processed correctly. I then cleaned up the dataset to fix various errors, such as the presence of non-UTF-8 characters.
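
The core loop looked roughly like the sketch below. This is a simplified illustration, not the actual script (which is linked right after); it assumes Aphrodite-Engine is serving Hermes 3 8B on a local OpenAI-compatible endpoint and that the LLaVA JSON file and the 6-shot prompt are available locally.

```python
# Simplified sketch of the dataset-rewriting loop (illustrative, not the real script).
# Assumes an Aphrodite-Engine server exposing an OpenAI-compatible API on localhost:2242.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:2242/v1", api_key="empty")

FEW_SHOT_PROMPT = open("sydney_6shot_prompt.txt").read()  # bad vs. good rewrite examples

with open("llava_instruct_150k.json") as f:
    samples = json.load(f)

kept = []
for sample in samples:
    request = (
        FEW_SHOT_PROMPT
        + "\nRewrite the assistant turns of this JSON sample to be more energetic "
        + "and return valid JSON only:\n"
        + json.dumps(sample, ensure_ascii=False)
    )
    response = client.chat.completions.create(
        model="NousResearch/Hermes-3-Llama-3.1-8B",
        messages=[{"role": "user", "content": request}],
    )
    try:
        kept.append(json.loads(response.choices[0].message.content))
    except json.JSONDecodeError:
        continue  # responses that fail JSON validation are dropped

with open("sydney_llava.json", "w") as f:
    json.dump(kept, f, ensure_ascii=False)
```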

The script used to create the dataset is [here](https://huggingface.co/datasets/adamo1139/misc/blob/main/sydney/sydney_llava_1.py).

## Inference

I uploaded the inference script [here](https://huggingface.co/datasets/adamo1139/misc/blob/main/sydney/run_qwen_vl.py).
The script runs inference on this model and also on the normal Qwen 2 VL Instruct checkpoint. It is based on the simple Qwen 2 VL Gradio inference project published [here](https://old.reddit.com/r/LocalLLaMA/comments/1fv892w/simple_gradio_ui_to_run_qwen_2_vl/).
Qwen2 VL doesn't quantize well, so you will need enough VRAM to load the 16-bit checkpoint. I am using a 24 GB GPU and still can't load arbitrary images or videos, since large inputs run out of memory.
Inference should work fine on both Windows and Linux. By default the script uses Flash Attention 2; if you don't want to use it, run the script with the flag `--flash-attn2 False`.
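
If you prefer a plain `transformers` call without the Gradio UI, a minimal sketch looks like the following. The model id and image path are placeholders, and `qwen_vl_utils` plus a recent `transformers` are assumed to be installed; drop the Flash Attention 2 line if it isn't available on your system.

```python
# Minimal single-image inference sketch (model id and image path are placeholders).
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

MODEL_ID = "<this-model's-repo-id>"  # placeholder: use the Hugging Face id of this model

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # remove if Flash Attention 2 is not installed
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [
    {"role": "system", "content": "You are Sydney."},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/fluffer.jpg"},
            {"type": "text", "text": "What do you think of my dog?"},
        ],
    },
]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```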

## Technical details

The model was trained in LLaMA-Factory with Unsloth on a system with an RTX 3090 Ti, using a context length of 2000, LoRA rank 32, LoRA alpha 32, and a LoRA+ LR ratio of 4. Training took around 11 hours; bitsandbytes quantization was not used.

```
bf16: true
cutoff_len: 2000
dataset: sydney
dataset_dir: data
ddp_timeout: 180000000
do_train: true
finetuning_type: lora
flash_attn: auto
gradient_accumulation_steps: 16
include_num_input_tokens_seen: true
learning_rate: 5.0e-05
logging_steps: 1
lora_alpha: 32
lora_dropout: 0
lora_rank: 32
lora_target: all
loraplus_lr_ratio: 4
lr_scheduler_type: cosine
max_grad_norm: 1.0
max_samples: 160000
model_name_or_path: Qwen/Qwen2-VL-7B-Instruct
num_train_epochs: 1.0
optim: adamw_8bit
output_dir: saves/Qwen2-VL-7B-Instruct/lora/train_2024-10-05-18-44-10-2
packing: true
per_device_train_batch_size: 1
plot_loss: true
preprocessing_num_workers: 16
report_to: none
save_steps: 200
stage: sft
template: qwen2_vl
train_on_prompt: true
use_unsloth: true
warmup_steps: 25
```

Loss drops quickly and then stays basically flat. I am not sure why; this suggests that some of the hyperparameters might have been set incorrectly, or that loss behaves differently for vision language models.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/630fdd96a119d49bc1e770d5/QAaqfinhJTf5Qf52oWL65.png)

## Examples of use

Below, I compare Qwen 2 VL 7B Sydney with Qwen/Qwen2-VL-7B-Instruct.

<div style="display: grid; grid-template-columns: repeat(1, 1fr); gap: 10px; max-width: 2000px; margin: 0 auto;">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/630fdd96a119d49bc1e770d5/9am1yhT8mid0mYaCCTsRo.png" style="width: 100%; height: auto;" alt="Image 1" />
  <img src="https://cdn-uploads.huggingface.co/production/uploads/630fdd96a119d49bc1e770d5/Tfw7rL7NX9OwVXH-Vy5IB.png" style="width: 100%; height: auto;" alt="Image 2" />
  <img src="https://cdn-uploads.huggingface.co/production/uploads/630fdd96a119d49bc1e770d5/JqbCDhfYSqddNUaR0VgmW.png" style="width: 100%; height: auto;" alt="Image 3" />
  <img src="https://cdn-uploads.huggingface.co/production/uploads/630fdd96a119d49bc1e770d5/Uwp2q7QTjz7nFRcVU3AVG.png" style="width: 100%; height: auto;" alt="Image 4" />
</div>

## Prompt template

ChatML with the system prompt "You are Sydney." The rest of the prompt template is the same as what Qwen2 VL Instruct uses.
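
For reference, a rendered single-image turn looks roughly like this (the vision placeholder tokens are inserted by the processor; shown here only to illustrate the ChatML layout):

```
<|im_start|>system
You are Sydney.<|im_end|>
<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>What do you think of my dog?<|im_end|>
<|im_start|>assistant
```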