YeungNLP/firefly-qwen1.5-en-14b-alpha

YeungNLP committed on May 19, 2024

Commit 9ed5f3c (verified) · 1 Parent(s): 6d2c256

Update README.md

Files changed (1)

README.md +4 -3

@@ -18,7 +18,7 @@ the performance in Chinese yet.
 We advise you to install transformers>=4.37.0.
-Because this is a validation experiment and our training resources are limited, we use QLoRA to train this model with the max length of 1024, it may limit the performance of this model.
 ## Performance
 We automatically evaluate models on [AlpacaEval 2.0](https://github.com/tatsu-lab/alpaca_eval) and [MT-Bench](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge) with **gpt-4o**.
@@ -36,7 +36,7 @@ The win rate is **52.17% : 47.83%**.
 | total         | **420**   | 385                   |
 We also evaluate models on [MT-Bench](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge). Though the overall performance of our model is not as good as [Qwen1.5-14B-Chat](https://huggingface.co/Qwen/Qwen1.5-14B-Chat),
-we find that our model outperforms [Qwen1.5-14B-Chat](https://huggingface.co/Qwen/Qwen1.5-14B-Chat) in almost all single-turn tasks. Our model is worse than [Qwen1.5-14B-Chat](https://huggingface.co/Qwen/Qwen1.5-14B-Chat) in almost all multi-turn tasks.
 We conjecture that it may be caused by the training length, and we will dive into this phenomenon later.
 Overall Performances on MT-Bench:
@@ -54,7 +54,8 @@ Performances on MT-Bench' single-turn tasks:
 | writing	      | **9.1**	 | 8.9              |
 | roleplay	     | **8.5**  | 	8.3             |
 | extraction	   | **8.6**	 | 8.2              |
-| stem	         | **8.8**  | 	8.5             |
 | humanities	   | **9**    | 	8.8             |
 | reasoning     | 	**6.8** | 	5.3             |
 | math	         | **7.5**  | 	7.1             |

 We advise you to install transformers>=4.37.0.
+**Because this is a validation experiment and our training resources are limited, we use QLoRA to train this model with the max length of 1024, it may limit the performance of this model.**
 ## Performance
 We automatically evaluate models on [AlpacaEval 2.0](https://github.com/tatsu-lab/alpaca_eval) and [MT-Bench](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge) with **gpt-4o**.
 | total         | **420**   | 385                   |
 We also evaluate models on [MT-Bench](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge). Though the overall performance of our model is not as good as [Qwen1.5-14B-Chat](https://huggingface.co/Qwen/Qwen1.5-14B-Chat),
+**we find that our model outperforms [Qwen1.5-14B-Chat](https://huggingface.co/Qwen/Qwen1.5-14B-Chat) in almost all single-turn tasks**. Our model is worse than [Qwen1.5-14B-Chat](https://huggingface.co/Qwen/Qwen1.5-14B-Chat) in almost all multi-turn tasks.
 We conjecture that it may be caused by the training length, and we will dive into this phenomenon later.
 Overall Performances on MT-Bench:
 | writing	      | **9.1**	 | 8.9              |
 | roleplay	     | **8.5**  | 	8.3             |
 | extraction	   | **8.6**	 | 8.2              |
+| stem	         | **8.8**
+  | 	8.5             |
 | humanities	   | **9**    | 	8.8             |
 | reasoning     | 	**6.8** | 	5.3             |
 | math	         | **7.5**  | 	7.1             |
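
The win rate quoted in the second hunk header (**52.17% : 47.83%**) is consistent with the `total` row (420 vs. 385), assuming those totals are win counts over the decisive comparisons only. A minimal sketch to check the arithmetic; the `win_rate` helper is hypothetical and not part of the README:

```python
def win_rate(wins_a: int, wins_b: int) -> tuple[float, float]:
    """Return each side's share of decisive comparisons, in percent.

    Assumption: the inputs are win counts and ties are excluded.
    """
    total = wins_a + wins_b
    return (round(100 * wins_a / total, 2), round(100 * wins_b / total, 2))


# Totals taken from the `total` row of the table in the diff.
ours, baseline = win_rate(420, 385)
print(f"{ours}% : {baseline}%")
```

If the totals instead included ties or skipped prompts, the denominator would change and the percentages would no longer match.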