|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- adamo1139/Sydney_LLaVA_0610 |
|
base_model: |
|
- Qwen/Qwen2-VL-7B-Instruct |
|
tags: |
|
- fluff |
|
- dogos |
|
- cats |
|
- sydney |
|
- bing |
|
- qwen |
|
- vlm |
|
--- |
|
|
|
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/630fdd96a119d49bc1e770d5/7NJFmljgycOJs7mcO2Cag.png" width="200" style="float:center"> |
|
|
|
## Model Description |
|
|
|
Qwen 2 VL 7B Sydney - Optimizing Vision Language Models for engagement and positivity. |
|
|
|
Have you ever pasted a picture of your dog or cat into a Vision Language Model, only for the model to give you a dry description of the image without complimenting your fluffer's looks? \
|
Well, this model will use every chance it gets to compliment your adorable sweetheart. |
|
|
|
It was trained on around 60,000 samples of synthetic data generated by [NousResearch/Hermes-3-Llama-3.1-8B](https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B). The dataset was converted from [liuhaotian/LLaVA-Instruct-150K](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K).
|
The dataset is available [here](https://huggingface.co/datasets/adamo1139/Sydney_LLaVA_0610).
|
|
|
I am learning how to finetune Qwen 2 VL 7B, and this model is simply the result of a weekend of tinkering.
|
|
|
## Dataset Creation details |
|
|
|
I ran Hermes 3 8B locally in Aphrodite-Engine and used a Python script to go through the LLaVA Instruct 150K dataset, sending a request for each sample that asked the model to rewrite the JSON sample so that the output is more energetic. I used a 6-shot prompt, with bad examples coming from a generic LLM and good examples coming from [FPHam/Llama-3-8B-Sydney](https://huggingface.co/FPHam/Llama-3-8B-Sydney).
|
After running through about half of the dataset, I noticed an error in one of my examples. After fixing it and modifying the prompt a bit, the generation quality deteriorated and about 30% of the responses I was getting back didn't pass JSON validation, so I settled on using the ~60,000 samples that had already been processed correctly. I then cleaned up the dataset to fix various errors, such as the presence of non-UTF-8 characters.
|
|
|
The script used to create the dataset is available [here](https://huggingface.co/datasets/adamo1139/misc/blob/main/sydney/sydney_llava_1.py).
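
For illustration, a minimal sketch of this kind of conversion loop is shown below. It assumes Aphrodite-Engine is serving Hermes 3 8B through its OpenAI-compatible API; the endpoint, file names and prompt wording are placeholders and the few-shot examples are omitted, so refer to the linked script for the actual implementation.

```python
# Minimal, illustrative sketch of the dataset conversion loop (not the exact script linked above).
import json
from openai import OpenAI

# Aphrodite-Engine exposes an OpenAI-compatible endpoint; URL and port here are assumptions.
client = OpenAI(base_url="http://localhost:2242/v1", api_key="empty")

# The real script used a 6-shot prompt: "bad" generic answers paired with
# energetic rewrites in the style of FPHam/Llama-3-8B-Sydney.
FEW_SHOT_MESSAGES = []

def rewrite_sample(sample):
    """Ask Hermes 3 8B to rewrite one LLaVA sample with a more energetic tone."""
    response = client.chat.completions.create(
        model="NousResearch/Hermes-3-Llama-3.1-8B",
        messages=FEW_SHOT_MESSAGES + [{
            "role": "user",
            "content": "Rewrite the assistant turns of this JSON sample so the output is "
                       "more energetic and positive. Return valid JSON only.\n"
                       + json.dumps(sample),
        }],
        temperature=0.7,
    )
    text = response.choices[0].message.content
    try:
        return json.loads(text)  # responses that fail JSON validation are dropped
    except json.JSONDecodeError:
        return None

with open("llava_instruct_150k.json") as f:
    dataset = json.load(f)

converted = [r for s in dataset if (r := rewrite_sample(s)) is not None]

with open("sydney_llava.json", "w") as f:
    json.dump(converted, f, ensure_ascii=False, indent=2)
```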
|
## Inference |
|
|
|
I uploaded the inference script [here](https://huggingface.co/datasets/adamo1139/misc/blob/main/sydney/run_qwen_vl.py).

The script runs inference on both this model and the stock Qwen 2 VL 7B Instruct checkpoint.

It is based on the simple Qwen 2 VL Gradio inference project published [here](https://old.reddit.com/r/LocalLLaMA/comments/1fv892w/simple_gradio_ui_to_run_qwen_2_vl/).
|
Qwen2 VL doesn't quantize well, so you will need enough VRAM to load the 16-bit checkpoint. Even with a 24 GB GPU, I can't load arbitrary images or videos without running out of memory.
|
Inference should work fine on both Windows and Linux. By default the script uses Flash Attention 2; if you don't want to use it, run the script with the flag `--flash-attn2 False`.
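
If you prefer to load the model directly with Transformers instead, a minimal sketch along the lines of the standard Qwen2 VL usage is given below. The model id, image path and generation settings are placeholders; replace them with this repository and your own inputs.

```python
# Minimal Transformers inference sketch (assumes recent transformers and qwen-vl-utils installed).
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_ID = "path-or-id-of-this-model"  # placeholder: point this at the Sydney checkpoint

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",  # drop this line if Flash Attention 2 is unavailable
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

messages = [
    {"role": "system", "content": "You are Sydney."},
    {"role": "user", "content": [
        {"type": "image", "image": "my_dog.jpg"},  # placeholder image path
        {"type": "text", "text": "What do you think of this picture?"},
    ]},
]

# Render the ChatML prompt and collect the vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```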
|
|
|
## Technical details |
|
|
|
The model was trained in LLaMA-Factory with Unsloth on a system with an RTX 3090 Ti, at a context length of 2000, with LoRA rank 32, alpha 32 and a LoRA+ ratio of 4. Training took around 11 hours; bitsandbytes quantization was not used.
|
|
|
```yaml
|
bf16: true |
|
cutoff_len: 2000 |
|
dataset: sydney |
|
dataset_dir: data |
|
ddp_timeout: 180000000 |
|
do_train: true |
|
finetuning_type: lora |
|
flash_attn: auto |
|
gradient_accumulation_steps: 16 |
|
include_num_input_tokens_seen: true |
|
learning_rate: 5.0e-05 |
|
logging_steps: 1 |
|
lora_alpha: 32 |
|
lora_dropout: 0 |
|
lora_rank: 32 |
|
lora_target: all |
|
loraplus_lr_ratio: 4 |
|
lr_scheduler_type: cosine |
|
max_grad_norm: 1.0 |
|
max_samples: 160000 |
|
model_name_or_path: Qwen/Qwen2-VL-7B-Instruct |
|
num_train_epochs: 1.0 |
|
optim: adamw_8bit |
|
output_dir: saves/Qwen2-VL-7B-Instruct/lora/train_2024-10-05-18-44-10-2 |
|
packing: true |
|
per_device_train_batch_size: 1 |
|
plot_loss: true |
|
preprocessing_num_workers: 16 |
|
report_to: none |
|
save_steps: 200 |
|
stage: sft |
|
template: qwen2_vl |
|
train_on_prompt: true |
|
use_unsloth: true |
|
warmup_steps: 25 |
|
``` |
|
|
|
Loss drops quickly and then stays basically flat. I am not sure why; this suggests that some of the hyperparameters might have been set incorrectly, or that loss behaves differently for vision language models.
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/630fdd96a119d49bc1e770d5/QAaqfinhJTf5Qf52oWL65.png) |
|
|
|
## Examples of use |
|
|
|
Below, I compare Qwen 2 VL 7B Sydney with Qwen/Qwen2-VL-7B-Instruct.
|
|
|
<div style="display: grid; grid-template-columns: repeat(1, 1fr); gap: 10px; max-width: 2000px; margin: 0 auto;"> |
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/630fdd96a119d49bc1e770d5/9am1yhT8mid0mYaCCTsRo.png" style="width: 100%; height: auto;" alt="Image 1" /> |
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/630fdd96a119d49bc1e770d5/Tfw7rL7NX9OwVXH-Vy5IB.png" style="width: 100%; height: auto;" alt="Image 2" /> |
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/630fdd96a119d49bc1e770d5/JqbCDhfYSqddNUaR0VgmW.png" style="width: 100%; height: auto;" alt="Image 3" /> |
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/630fdd96a119d49bc1e770d5/Uwp2q7QTjz7nFRcVU3AVG.png" style="width: 100%; height: auto;" alt="Image 4" /> |
|
</div> |
|
|
|
## Prompt template |
|
|
|
ChatML with the system prompt "You are Sydney." The rest of the prompt template is the same as what Qwen2 VL Instruct uses.
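
For reference, a single-image turn rendered with this template should look roughly like the following (the vision placeholder tokens follow Qwen2 VL's tokenizer conventions; the user text is just an example):

```
<|im_start|>system
You are Sydney.<|im_end|>
<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>What do you see in this picture?<|im_end|>
<|im_start|>assistant
```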