|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- adamo1139/Sydney_LLaVA_0610 |
|
base_model: |
|
- Qwen/Qwen2-VL-7B-Instruct |
|
tags: |
|
- fluff |
|
- dogos |
|
- cats |
|
- sydney |
|
- bing |
|
- qwen |
|
- vlm |
|
--- |
|
|
|
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/630fdd96a119d49bc1e770d5/7NJFmljgycOJs7mcO2Cag.png" width="200" style="float:center"> |
|
|
|
## Model Description |
|
|
|
Qwen 2 VL 7B Sydney - Optimizing Vision Language Models for engagement and positivity. |
|
|
|
Have you ever pasted a picture of your dog or cat into a Vision Language Model, only for the model to give you a dry description of the image without complimenting your fluffer's looks? \
|
Well, this model will use every chance it gets to compliment your adorable sweetheart. |
|
|
|
It was trained on around 60,000 samples of synthetic data generated by [NousResearch/Hermes-3-Llama-3.1-8B](https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B). The dataset was converted from [liuhaotian/LLaVA-Instruct-150K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K).
|
The dataset is available [here](https://huggingface.co/datasets/adamo1139/Sydney_LLaVA_0610).
|
|
|
I am learning how to finetune Qwen 2 VL 7B, and this model is simply the result of a weekend of tinkering.
|
|
|
## Dataset Creation details |
|
|
|
I ran Hermes 3 8B locally in Aphrodite-Engine and used a Python script to go through the LLaVA Instruct 150K dataset, sending a request for each sample that asked the model to rewrite the JSON sample so that the output is more energetic. I used a 6-shot prompt, with bad examples coming from a generic LLM and good examples coming from [FPHam/Llama-3-8B-Sydney](https://huggingface.co/FPHam/Llama-3-8B-Sydney).
|
After running through about half of the dataset, I noticed an error in one of my examples. After fixing it and modifying the prompt a bit, the generation quality deteriorated and about 30% of the responses I was getting back didn't pass JSON validation, so I settled on using the ~60,000 samples that had already been processed correctly. I then cleaned up the dataset to fix various errors, such as the presence of non-UTF-8 characters.
|
|
|
The script used to create the dataset is available [here](https://huggingface.co/datasets/adamo1139/misc/blob/main/sydney/sydney_llava_1.py).
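
For illustration, a minimal sketch of this kind of conversion loop is shown below. It assumes Aphrodite-Engine is serving Hermes 3 8B through its OpenAI-compatible API; the endpoint, file names and prompt wording are placeholders and the few-shot examples are omitted, so refer to the linked script for the actual implementation.

```python
# Minimal, illustrative sketch of the dataset conversion loop (not the exact script linked above).
import json
from openai import OpenAI

# Aphrodite-Engine exposes an OpenAI-compatible endpoint; URL and port here are assumptions.
client = OpenAI(base_url="http://localhost:2242/v1", api_key="empty")

# The real script used a 6-shot prompt: "bad" generic answers paired with
# energetic rewrites in the style of FPHam/Llama-3-8B-Sydney.
FEW_SHOT_MESSAGES = []

def rewrite_sample(sample):
    """Ask Hermes 3 8B to rewrite one LLaVA sample with a more energetic tone."""
    response = client.chat.completions.create(
        model="NousResearch/Hermes-3-Llama-3.1-8B",
        messages=FEW_SHOT_MESSAGES + [{
            "role": "user",
            "content": "Rewrite the assistant turns of this JSON sample so the output is "
                       "more energetic and positive. Return valid JSON only.\n"
                       + json.dumps(sample),
        }],
        temperature=0.7,
    )
    text = response.choices[0].message.content
    try:
        return json.loads(text)  # responses that fail JSON validation are dropped
    except json.JSONDecodeError:
        return None

with open("llava_instruct_150k.json") as f:
    dataset = json.load(f)

converted = [r for s in dataset if (r := rewrite_sample(s)) is not None]

with open("sydney_llava.json", "w") as f:
    json.dump(converted, f, ensure_ascii=False, indent=2)
```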
|
## Inference |
|
|
|
I uploaded the inference script [here](https://huggingface.co/datasets/adamo1139/misc/blob/main/sydney/run_qwen_vl.py).

The script runs inference on both this model and the stock Qwen 2 VL 7B Instruct checkpoint.

It is based on the simple Qwen 2 VL Gradio inference project published [here](https://old.reddit.com/r/LocalLLaMA/comments/1fv892w/simple_gradio_ui_to_run_qwen_2_vl/).
|
Qwen2 VL doesn't quantize well, so you will need enough VRAM to load the 16-bit checkpoint. Even with a 24 GB GPU, I can't load arbitrary images or videos without running out of memory.
|
Inference should work fine on both Windows and Linux. By default the script uses Flash Attention 2; if you don't want to use it, run the script with the flag `--flash-attn2 False`.
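
If you prefer to load the model directly with Transformers instead, a minimal sketch along the lines of the standard Qwen2 VL usage is given below. The model id, image path and generation settings are placeholders; replace them with this repository and your own inputs.

```python
# Minimal Transformers inference sketch (assumes recent transformers and qwen-vl-utils installed).
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "path-or-id-of-this-model"  # placeholder: point this at the Sydney checkpoint

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",  # drop this line if Flash Attention 2 is unavailable
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [
    {"role": "system", "content": "You are Sydney."},
    {"role": "user", "content": [
        {"type": "image", "image": "my_dog.jpg"},  # placeholder image path
        {"type": "text", "text": "What do you think of this picture?"},
    ]},
]

# Render the ChatML prompt and collect the vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```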
|
|
|
## Technical details |
|
|
|
The model was trained in LLaMA-Factory with Unsloth on a system with an RTX 3090 Ti, at a context length of 2000, with LoRA rank 32, alpha 32 and a LoRA+ ratio of 4. Training took around 11 hours; bitsandbytes quantization was not used.
|
|
|
```yaml
|
bf16: true |
|
cutoff_len: 2000 |
|
dataset: sydney |
|
dataset_dir: data |
|
ddp_timeout: 180000000 |
|
do_train: true |
|
finetuning_type: lora |
|
flash_attn: auto |
|
gradient_accumulation_steps: 16 |
|
include_num_input_tokens_seen: true |
|
learning_rate: 5.0e-05 |
|
logging_steps: 1 |
|
lora_alpha: 32 |
|
lora_dropout: 0 |
|
lora_rank: 32 |
|
lora_target: all |
|
loraplus_lr_ratio: 4 |
|
lr_scheduler_type: cosine |
|
max_grad_norm: 1.0 |
|
max_samples: 160000 |
|
model_name_or_path: Qwen/Qwen2-VL-7B-Instruct |
|
num_train_epochs: 1.0 |
|
optim: adamw_8bit |
|
output_dir: saves/Qwen2-VL-7B-Instruct/lora/train_2024-10-05-18-44-10-2 |
|
packing: true |
|
per_device_train_batch_size: 1 |
|
plot_loss: true |
|
preprocessing_num_workers: 16 |
|
report_to: none |
|
save_steps: 200 |
|
stage: sft |
|
template: qwen2_vl |
|
train_on_prompt: true |
|
use_unsloth: true |
|
warmup_steps: 25 |
|
``` |
|
|
|
Loss drops quickly and then stays basically flat. I am not sure why; this suggests that some of the hyperparameters might have been set incorrectly, or that loss behaves differently for vision language models.
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/630fdd96a119d49bc1e770d5/QAaqfinhJTf5Qf52oWL65.png) |
|
|
|
## Examples of use |
|
|
|
Below, I compare Qwen 2 VL 7B Sydney with Qwen/Qwen2-VL-7B-Instruct.
|
|
|
<div style="display: grid; grid-template-columns: repeat(1, 1fr); gap: 10px; max-width: 2000px; margin: 0 auto;"> |
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/630fdd96a119d49bc1e770d5/9am1yhT8mid0mYaCCTsRo.png" style="width: 100%; height: auto;" alt="Image 1" /> |
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/630fdd96a119d49bc1e770d5/Tfw7rL7NX9OwVXH-Vy5IB.png" style="width: 100%; height: auto;" alt="Image 2" /> |
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/630fdd96a119d49bc1e770d5/JqbCDhfYSqddNUaR0VgmW.png" style="width: 100%; height: auto;" alt="Image 3" /> |
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/630fdd96a119d49bc1e770d5/Uwp2q7QTjz7nFRcVU3AVG.png" style="width: 100%; height: auto;" alt="Image 4" /> |
|
</div> |
|
|
|
## Prompt template |
|
|
|
ChatML with the system prompt "You are Sydney." The rest of the prompt template is the same as what Qwen2 VL Instruct uses.
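
For reference, a single-image turn rendered with this template should look roughly like the following (the vision placeholder tokens follow Qwen2 VL's tokenizer conventions; the user text is just an example):

```
<|im_start|>system
You are Sydney.<|im_end|>
<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>What do you see in this picture?<|im_end|>
<|im_start|>assistant
```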