|
--- |
|
library_name: transformers |
|
license: apache-2.0 |
|
base_model: Qwen/Qwen1.5-14B
|
--- |
|
|
|
## Model Card for Firefly-Qwen1.5-14B-En-Alpha |
|
|
|
[firefly-qwen1.5-en-14b-alpha](https://huggingface.co/YeungNLP/firefly-qwen1.5-en-14b-alpha) is a preview version of our new model.
|
It outperforms [Qwen1.5-14B-Chat](https://huggingface.co/Qwen/Qwen1.5-14B-Chat) on [AlpacaEval 2.0](https://github.com/tatsu-lab/alpaca_eval) and on [MT-Bench](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge)'s single-turn tasks.
|
|
|
**Note: More importantly, it is trained with neither SFT nor RLHF; we may share our method later.**
|
|
|
Excitingly, our experimental method achieves good performance even though it is still at a very preliminary stage.
|
|
|
Although our model is trained on English data, you can also try chatting with it in Chinese, since Qwen1.5 itself is strong in Chinese. However, we have not evaluated its Chinese performance yet.
|
|
|
We recommend installing transformers>=4.37.0.
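For example, a quick check that your environment is recent enough:

```python
# Fail early if the installed transformers version is too old for Qwen1.5.
import transformers
from packaging import version

assert version.parse(transformers.__version__) >= version.parse("4.37.0"), (
    f"transformers {transformers.__version__} is too old; please upgrade to >=4.37.0"
)
```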
|
|
|
Because this is a validation experiment and our training resources are limited, we trained this model with QLoRA at a maximum sequence length of 1024, which may limit its performance.
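For reference, the sketch below shows a minimal QLoRA setup with `peft` and `bitsandbytes`; the LoRA hyperparameters here are illustrative assumptions, as we have not released our actual training configuration:

```python
# A hypothetical QLoRA configuration sketch; the real training setup is not released.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Quantize the frozen base model to 4-bit NF4 to fit limited GPU memory.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-14B",
    quantization_config=bnb_config,
    device_map="auto",
)

# Train only low-rank adapters on the attention projections (illustrative choices).
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```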
|
|
|
## Performance |
|
We automatically evaluate models on [AlpacaEval 2.0](https://github.com/tatsu-lab/alpaca_eval) and [MT-Bench](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge), using **gpt-4o** as the judge.
|
|
|
We evaluate models on the 805 questions of [AlpacaEval 2.0](https://github.com/tatsu-lab/alpaca_eval); our model outperforms [Qwen1.5-14B-Chat](https://huggingface.co/Qwen/Qwen1.5-14B-Chat) with a win rate of **52.17% : 47.83%**.
|
|
|
| Task          | Ours wins | Qwen1.5-14B-Chat wins |
|---------------|-----------|-----------------------|
| helpful_base  | **67**    | 62                    |
| koala         | **80**    | 76                    |
| oasst         | **100**   | 88                    |
| selfinstruct  | **127**   | 125                   |
| vicuna        | **46**    | 34                    |
| total         | **420**   | 385                   |
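The headline win rate follows directly from the per-subset totals in this table:

```python
# Derive the headline win rate from the per-subset totals above.
ours, qwen_chat = 420, 385
total = ours + qwen_chat  # the 805 AlpacaEval 2.0 questions
print(f"{ours / total:.2%} : {qwen_chat / total:.2%}")  # 52.17% : 47.83%
```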
|
|
|
We also evaluate models on [MT-Bench](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge). Though the overall performance of our model is not as good as [Qwen1.5-14B-Chat](https://huggingface.co/Qwen/Qwen1.5-14B-Chat),
our model outperforms it in almost all single-turn tasks, while falling behind in almost all multi-turn tasks.
We conjecture this may be caused by the short maximum training length of 1024 tokens, and we will investigate this phenomenon later.
|
|
|
Overall performance on MT-Bench:
|
|
|
| Task              | Ours     | Qwen1.5-14B-Chat |
|-------------------|----------|------------------|
| Avg Score         | 7.03     | **7.21**         |
| Single-turn Score | **8.01** | 7.66             |
| Multi-turn Score  | 6.05     | **6.75**         |
|
|
|
Performance on MT-Bench's single-turn tasks:
|
|
|
| Task       | Ours     | Qwen1.5-14B-Chat |
|------------|----------|------------------|
| writing    | **9.1**  | 8.9              |
| roleplay   | **8.5**  | 8.3              |
| extraction | **8.6**  | 8.2              |
| stem       | **8.8**  | 8.5              |
| humanities | **9.0**  | 8.8              |
| reasoning  | **6.8**  | 5.3              |
| math       | **7.5**  | 7.1              |
| coding     | 5.8      | **6.2**          |
|
|
|
Performance on MT-Bench's multi-turn tasks:
|
|
|
| Task       | Ours     | Qwen1.5-14B-Chat |
|------------|----------|------------------|
| writing    | 6.5      | **7.7**          |
| roleplay   | 7.7      | **8.3**          |
| extraction | 5.1      | **6.7**          |
| stem       | 6.3      | **6.9**          |
| humanities | 8.3      | **8.8**          |
| reasoning  | 4.7      | **5.7**          |
| math       | 4.9      | **5.5**          |
| coding     | **4.9**  | 4.4              |
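As a sanity check, the overall scores in the first MT-Bench table are simply the means of the per-task scores above:

```python
# Recompute the overall MT-Bench scores from the per-task tables above.
single_turn = [9.1, 8.5, 8.6, 8.8, 9.0, 6.8, 7.5, 5.8]
multi_turn = [6.5, 7.7, 5.1, 6.3, 8.3, 4.7, 4.9, 4.9]

st = sum(single_turn) / len(single_turn)  # 8.01
mt = sum(multi_turn) / len(multi_turn)    # 6.05
print(f"single-turn={st:.2f}, multi-turn={mt:.2f}, avg={(st + mt) / 2:.2f}")  # avg=7.03
```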
|
|
|
|
|
## Usage |
|
The chat template of our model is the same as that of the official Qwen1.5-14B-Chat:
|
```text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
hello, who are you?<|im_end|>
<|im_start|>assistant
I am an AI program developed by Firefly<|im_end|>
```
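You can reproduce this format with the tokenizer itself; for example:

```python
# Render the ChatML-style template; add_generation_prompt=True appends
# the trailing "<|im_start|>assistant" turn for the model to complete.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("YeungNLP/firefly-qwen1.5-en-14b-alpha")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "hello, who are you?"},
]
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))
```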
|
|
|
You can use the chat script in [Firefly](https://github.com/yangjianxin1/Firefly/blob/master/script/chat/chat.py) for inference.
|
|
|
You can also use the following code: |
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name_or_path = "YeungNLP/firefly-qwen1.5-en-14b-alpha"

# Load the model in float16 and let accelerate place it on the available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    trust_remote_code=True,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)

prompt = "Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions."
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]

# Render the ChatML-style template shown above and append the assistant turn.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Sample a response; do_sample=True is required for top_p/temperature to take
# effect, and generation stops at the <|im_end|> token.
generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=1500,
    do_sample=True,
    top_p=0.8,
    temperature=0.6,
    repetition_penalty=1.0,
    eos_token_id=tokenizer.encode('<|im_end|>', add_special_tokens=False)
)
# Strip the prompt tokens so only the newly generated tokens are decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
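If you prefer to stream tokens as they are generated, you can attach transformers' `TextStreamer` to the same call (a minimal sketch reusing `model`, `tokenizer`, and `model_inputs` from above):

```python
from transformers import TextStreamer

# Print decoded tokens to stdout as they arrive, skipping the prompt itself.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(
    model_inputs.input_ids,
    max_new_tokens=1500,
    do_sample=True,
    top_p=0.8,
    temperature=0.6,
    eos_token_id=tokenizer.encode('<|im_end|>', add_special_tokens=False),
    streamer=streamer,
)
```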