RichardErkhov/YeungNLP_-_firefly-qwen1.5-en-7b-unsloth-gguf

Quantization made by Richard Erkhov.

firefly-qwen1.5-en-7b-unsloth - GGUF

Model creator: https://huggingface.co/YeungNLP/
Original model: https://huggingface.co/YeungNLP/firefly-qwen1.5-en-7b-unsloth/

Name	Quant method	Size
firefly-qwen1.5-en-7b-unsloth.Q2_K.gguf	Q2_K	2.89GB
firefly-qwen1.5-en-7b-unsloth.IQ3_XS.gguf	IQ3_XS	3.18GB
firefly-qwen1.5-en-7b-unsloth.IQ3_S.gguf	IQ3_S	3.32GB
firefly-qwen1.5-en-7b-unsloth.Q3_K_S.gguf	Q3_K_S	3.32GB
firefly-qwen1.5-en-7b-unsloth.IQ3_M.gguf	IQ3_M	3.48GB
firefly-qwen1.5-en-7b-unsloth.Q3_K.gguf	Q3_K	3.65GB
firefly-qwen1.5-en-7b-unsloth.Q3_K_M.gguf	Q3_K_M	3.65GB
firefly-qwen1.5-en-7b-unsloth.Q3_K_L.gguf	Q3_K_L	3.93GB
firefly-qwen1.5-en-7b-unsloth.IQ4_XS.gguf	IQ4_XS	4.02GB
firefly-qwen1.5-en-7b-unsloth.Q4_0.gguf	Q4_0	4.2GB
firefly-qwen1.5-en-7b-unsloth.IQ4_NL.gguf	IQ4_NL	4.22GB
firefly-qwen1.5-en-7b-unsloth.Q4_K_S.gguf	Q4_K_S	4.23GB
firefly-qwen1.5-en-7b-unsloth.Q4_K.gguf	Q4_K	3.95GB
firefly-qwen1.5-en-7b-unsloth.Q4_K_M.gguf	Q4_K_M	0.95GB
firefly-qwen1.5-en-7b-unsloth.Q4_1.gguf	Q4_1	0.01GB
firefly-qwen1.5-en-7b-unsloth.Q5_0.gguf	Q5_0	0.01GB
firefly-qwen1.5-en-7b-unsloth.Q5_K_S.gguf	Q5_K_S	0.01GB
firefly-qwen1.5-en-7b-unsloth.Q5_K.gguf	Q5_K	0.01GB
firefly-qwen1.5-en-7b-unsloth.Q5_K_M.gguf	Q5_K_M	0.01GB
firefly-qwen1.5-en-7b-unsloth.Q5_1.gguf	Q5_1	0.01GB
firefly-qwen1.5-en-7b-unsloth.Q6_K.gguf	Q6_K	0.01GB
firefly-qwen1.5-en-7b-unsloth.Q8_0.gguf	Q8_0	0.01GB

Original model description:

library_name: transformers
license: apache-2.0
basemodel: Qwen/Qwen1.5-7B

Unsloth x Qwen2

Unsloth can speed up LLM training and reduce memory usage, but it currently supports only Llama3, Mistral, Gemma, ORPO, Phi-3 and TinyLlama. Qwen2 cannot be trained with the official Unsloth, even though Qwen2 is popular in the community.

We are excited to have made Unsloth support Qwen2, which speeds up training and substantially reduces memory usage. If you want to train Qwen2 with Unsloth, you can use our repo rather than the official one. We will also contribute our code to the official repo.

Install our Unsloth:

pip install git+https://github.com/yangjianxin1/unsloth.git

Firefly already supports training Qwen2 with Unsloth, and the models described below were trained with Firefly. You can try it yourself.

Model Card for Firefly-Qwen1.5-Unsloth

firefly-qwen1.5-en-7b-unsloth and firefly-qwen1.5-en-7b-dpo-v0.1-unsloth are trained from Qwen1.5-7B to act as helpful and harmless AI assistants. We use Firefly to train our models on a single V100 GPU with QLoRA and Unsloth. firefly-qwen1.5-en-7b-unsloth is fine-tuned from Qwen1.5-7B on English instruction data, and firefly-qwen1.5-en-7b-dpo-v0.1-unsloth is trained with Direct Preference Optimization (DPO) on top of firefly-qwen1.5-en-7b-unsloth.

Our models outperform the official Qwen1.5-7B-Chat, Gemma-7B-it and Zephyr-7B-Beta on the Open LLM Leaderboard.

Although our models are trained on English data, you can also try chatting with them in Chinese, because Qwen1.5 handles Chinese well. We have not yet evaluated their performance in Chinese.

We advise you to install transformers>=4.37.0.

Performance

We measured the training gain on Qwen1.5-7B by training the model with QLoRA, with and without Unsloth, for 20 steps on a single V100. The results are listed below. In the largest configuration, Unsloth reduces GPU memory by 39.13% and training time by 32.12%, which corresponds to a 47.32% increase in training speed.

max_seq_length	per_device_train_batch_size	gradient_accumulation_steps	use_unsloth	rank	GPU memory	Training time
1024	1	16	false	8	13.72GB	448s
1024	1	16	true	8	8.43GB(-38.56%)	308s(-31.25%)
1024	1	16	false	64	16.01GB	452s
1024	1	16	true	64	11.07GB(-30.86%)	311s(-31.19%)
2048	1	16	false	64	18.55GB	840s
2048	1	16	true	64	12.99GB(-29.97%)	596s(-29.05%)
1024	4	4	false	64	24.70GB	357s
1024	4	4	true	64	14.36GB(-41.86%)	253s(-29.13%)
2048	4	4	false	64	32.51GB	741s
2048	4	4	true	64	19.79GB(-39.13%)	503s(-32.12%)
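
The percentages quoted above can be reproduced from the raw measurements in the table. The following is a minimal sketch using the last pair of rows (max_seq_length 2048, batch size 4, rank 64). The function names are illustrative and are not part of Unsloth or Firefly.

```python
def reduction_pct(baseline: float, optimized: float) -> float:
    """Percentage reduction of `optimized` relative to `baseline`."""
    return (baseline - optimized) / baseline * 100


def speedup_pct(baseline_time: float, optimized_time: float) -> float:
    """Percentage increase in speed when the same work takes less time."""
    return (baseline_time / optimized_time - 1) * 100


# Values copied from the last two table rows (without and with Unsloth).
mem_saved = reduction_pct(32.51, 19.79)   # GPU memory in GB
time_saved = reduction_pct(741, 503)      # training time in seconds
speed_gain = speedup_pct(741, 503)

print(f"memory: -{mem_saved:.2f}%  time: -{time_saved:.2f}%  speed: +{speed_gain:.2f}%")
```

Note that a 32.12% reduction in time is not a 32.12% gain in speed: the speed gain is the ratio of the two times minus one, which is why the two figures differ.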

We evaluated our SFT and DPO models on the Open LLM Leaderboard, where they achieve good performance.

Model	Average	ARC	HellaSwag	MMLU	TruthfulQA	Winogrande	GSM8K
firefly-gemma-7b	62.93	62.12	79.77	61.57	49.41	75.45	49.28
firefly-qwen1.5-en-7b-dpo-v0.1-unsloth	62.65	56.14	75.5	60.87	58.09	70.72	54.59
zephyr-7b-beta	61.95	62.03	84.36	61.07	57.45	77.74	29.04
firefly-qwen1.5-en-7b-unsloth	61.81	54.27	76.22	61.55	50.62	70.48	57.7
vicuna-13b-v1.5	55.41	57.08	81.24	56.67	51.51	74.66	11.3
Xwin-LM-13B-V0.1	55.29	62.54	82.8	56.53	45.96	74.27	9.63
Qwen1.5-7B-Chat	55.15	55.89	78.56	61.65	53.54	67.72	13.57
gemma-7b-it	53.56	51.45	71.96	53.52	47.29	67.96	29.19
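
The Average column appears to be the unweighted mean of the six benchmark scores in each row. A small sketch that checks this for three rows of the table follows. The dictionary layout and the helper name are assumptions made for illustration only.

```python
# Scores copied from the table, in the order:
# ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K
scores = {
    "firefly-qwen1.5-en-7b-dpo-v0.1-unsloth": [56.14, 75.5, 60.87, 58.09, 70.72, 54.59],
    "firefly-qwen1.5-en-7b-unsloth": [54.27, 76.22, 61.55, 50.62, 70.48, 57.7],
    "Qwen1.5-7B-Chat": [55.89, 78.56, 61.65, 53.54, 67.72, 13.57],
}

# Averages as reported in the table.
reported_average = {
    "firefly-qwen1.5-en-7b-dpo-v0.1-unsloth": 62.65,
    "firefly-qwen1.5-en-7b-unsloth": 61.81,
    "Qwen1.5-7B-Chat": 55.15,
}


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


for name, values in scores.items():
    print(f"{name}: computed {mean(values):.2f}, reported {reported_average[name]}")
```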

Usage

Our chat models use the same chat template as the official Qwen1.5-7B-Chat:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
hello, who are you?<|im_end|>
<|im_start|>assistant
I am a AI program developed by Firefly<|im_end|>
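
To make the layout of the template explicit, here is a hand-written sketch that renders a list of messages into the format shown above. It is only an illustration: `render_chatml` is a hypothetical helper, and in practice the tokenizer's own `apply_chat_template` should be treated as authoritative.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render messages as <|im_start|>role\\ncontent<|im_end|> blocks."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "hello, who are you?"},
]
prompt_text = render_chatml(messages)
print(prompt_text)
```

Generation should stop at the `<|im_end|>` token that closes the assistant turn.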

You can run inference with the script provided in Firefly.

You can also use the following code:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name_or_path = "YeungNLP/firefly-qwen1.5-en-7b-unsloth"
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

prompt = "Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions. "
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
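# Move the tokenized prompt to the device the model was loaded on.
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,  # passes input_ids together with attention_mask
    max_new_tokens=1500,
    do_sample=True,  # required for top_p and temperature to take effect
    top_p=0.9,
    temperature=0.35,
    repetition_penalty=1.0,
    eos_token_id=tokenizer.encode('<|im_end|>', add_special_tokens=False)
)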
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

Training Details

In both the SFT and DPO stages, we use only a single V100 GPU with QLoRA and Unsloth, and we train our models with Firefly.

Training Setting

The following hyperparameters were used during SFT:

num_epochs: 1
learning_rate: 2e-4
total_train_batch_size: 32
max_seq_length: 2048
optimizer: paged_adamw_32bit
lr_scheduler_type: constant_with_warmup
warmup_steps: 600
lora_rank: 64
lora_alpha: 16
lora_dropout: 0.05
gradient_checkpointing: true
fp16: true
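
The combination of lr_scheduler_type: constant_with_warmup and warmup_steps: 600 means the learning rate ramps up and then stays flat. The sketch below assumes a linear ramp from zero, which is how this scheduler type is commonly implemented. `lr_at_step` is a hypothetical helper written for illustration, not a Firefly function.

```python
def lr_at_step(step: int, base_lr: float = 2e-4, warmup_steps: int = 600) -> float:
    """Learning rate under a constant schedule with linear warmup.

    Assumption: the rate rises linearly from 0 to `base_lr` over
    `warmup_steps` optimizer steps and is held constant afterwards.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr


# Sample the schedule at a few optimizer steps using the SFT settings above.
for step in (0, 150, 300, 600, 5000):
    print(step, lr_at_step(step))
```

With the SFT settings, the learning rate reaches half of its final value at step 300 and stays at 2e-4 from step 600 onward.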

The following hyperparameters were used during DPO:

num_epochs: 1
learning_rate: 2e-4
total_train_batch_size: 32
max_seq_length: 2048
max_prompt_length: 500
optimizer: paged_adamw_32bit
lr_scheduler_type: constant_with_warmup
warmup_steps: 100
lora_rank: 64
lora_alpha: 16
lora_dropout: 0.05
gradient_checkpointing: true
fp16: true

Training Metrics

The table below shows the full set of DPO training metrics:

Epoch	Step	Loss	Rewards/accuracies	Rewards/margins	Rewards/chosen	Rewards/rejected	Logits/chosen	Logits/rejected	Logps/chosen	Logps/rejected
0.05	100	0.6128	0.6572	0.3914	-0.0622	-0.4537	1.107	1.1104	-283.7632	-264.5925
0.1	200	0.6066	0.6913	0.662	-0.3589	-1.0209	0.9433	0.9431	-279.0002	-268.6432
0.16	300	0.5803	0.7069	0.876	-0.3849	-1.2609	0.8411	0.8537	-289.9482	-274.3425
0.21	400	0.5624	0.7169	0.9575	-0.2447	-1.2022	0.7615	0.7497	-293.8072	-274.4167
0.26	500	0.5863	0.7	0.8908	-0.5283	-1.4191	0.537	0.5085	-284.3388	-267.9294
0.31	600	0.5612	0.7166	1.0791	-0.592	-1.6711	0.7121	0.7219	-293.2425	-278.5992
0.37	700	0.5741	0.7234	1.0742	-0.8469	-1.9211	0.6002	0.5769	-300.8099	-285.9137
0.42	800	0.582	0.7141	1.0414	-1.1658	-2.2072	0.7191	0.5934	-300.458	-286.1
0.47	900	0.5694	0.7178	1.2055	-1.7372	-2.9426	0.4226	0.316	-305.5303	-290.7548
0.52	1000	0.5827	0.7134	1.1063	-1.354	-2.4603	0.535	0.4022	-302.7598	-286.636
0.58	1100	0.5553	0.7306	1.3631	-1.5861	-2.9492	0.7636	0.6559	-312.9375	-290.3474
0.63	1200	0.5633	0.7341	1.2689	-1.7187	-2.9876	0.6555	0.5894	-315.0179	-298.2406
0.68	1300	0.5705	0.7284	1.3501	-1.7762	-3.1263	0.7419	0.6874	-310.9056	-294.2934
0.73	1400	0.5458	0.7347	1.4555	-2.2377	-3.6932	0.7279	0.6564	-309.141	-299.1613
0.79	1500	0.5797	0.7222	1.2937	-2.4483	-3.742	0.8444	0.771	-321.578	-298.111
0.84	1600	0.5572	0.7319	1.4824	-2.9344	-4.4168	0.9202	0.8605	-323.4034	-307.0114
0.89	1700	0.5518	0.7281	1.4263	-2.7301	-4.1564	0.9257	0.8785	-313.694	-298.1267
0.94	1800	0.5572	0.7272	1.5121	-2.9505	-4.4627	0.7899	0.7503	-314.1552	-305.9873
0.99	1900	0.5763	0.7241	1.4982	-2.7064	-4.2047	0.7841	0.7023	-310.6677	-299.5064
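
When reading this table, it helps to know that Rewards/margins is the gap between Rewards/chosen and Rewards/rejected. This assumes the column names follow the usual DPO logging convention. The sketch below checks that relation on three rows copied from the table. The tuple layout is chosen for illustration only.

```python
# (step, rewards_chosen, rewards_rejected, reported_margin), copied from the table.
rows = [
    (100, -0.0622, -0.4537, 0.3914),
    (1000, -1.354, -2.4603, 1.1063),
    (1900, -2.7064, -4.2047, 1.4982),
]


def margin(chosen: float, rejected: float) -> float:
    """Reward margin: how much more the chosen response is rewarded."""
    return chosen - rejected


for step, chosen, rejected, reported in rows:
    computed = margin(chosen, rejected)
    print(f"step {step}: computed {computed:.4f}, reported {reported:.4f}")
```

Both rewards become more negative over training while the margin widens. The widening margin is what the DPO objective optimizes for.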