YeungNLP committed
Commit 9ed5f3c · verified · 1 parent: 6d2c256

Update README.md

Files changed (1): README.md (+4 −3)
README.md CHANGED

@@ -18,7 +18,7 @@ the performance in Chinese yet.
 
 We advise you to install transformers>=4.37.0.
 
-Because this is a validation experiment and our training resources are limited, we use QLoRA to train this model with the max length of 1024, it may limit the performance of this model.
+**Because this is a validation experiment and our training resources are limited, we use QLoRA to train this model with the max length of 1024, it may limit the performance of this model.**
 
 ## Performance
 We automatically evaluate models on [AlpacaEval 2.0](https://github.com/tatsu-lab/alpaca_eval) and [MT-Bench](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge) with **gpt-4o**.
@@ -36,7 +36,7 @@ The win rate is **52.17% : 47.83%**.
 | total | **420** | 385 |
 
 We also evaluate models on [MT-Bench](https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge). Though the overall performance of our model is not as good as [Qwen1.5-14B-Chat](https://huggingface.co/Qwen/Qwen1.5-14B-Chat),
-we find that our model outperforms [Qwen1.5-14B-Chat](https://huggingface.co/Qwen/Qwen1.5-14B-Chat) in almost all single-turn tasks. Our model is worse than [Qwen1.5-14B-Chat](https://huggingface.co/Qwen/Qwen1.5-14B-Chat) in almost all multi-turn tasks.
+**we find that our model outperforms [Qwen1.5-14B-Chat](https://huggingface.co/Qwen/Qwen1.5-14B-Chat) in almost all single-turn tasks**. Our model is worse than [Qwen1.5-14B-Chat](https://huggingface.co/Qwen/Qwen1.5-14B-Chat) in almost all multi-turn tasks.
 We conjecture that it may be caused by the training length, and we will dive into this phenomenon later.
 
 Overall Performances on MT-Bench:
@@ -54,7 +54,8 @@ Performances on MT-Bench' single-turn tasks:
 | writing | **9.1** | 8.9 |
 | roleplay | **8.5** | 8.3 |
 | extraction | **8.6** | 8.2 |
-| stem | **8.8** | 8.5 |
+| stem | **8.8**
+| 8.5 |
 | humanities | **9** | 8.8 |
 | reasoning | **6.8** | 5.3 |
 | math | **7.5** | 7.1 |
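The README context above advises transformers>=4.37.0 before using the model. A minimal sketch of checking that constraint, under the assumption that a plain numeric comparison of the dotted release version is sufficient (the helper name is my own, and pre-release suffixes are not handled):

```python
# Simplified version gate for the README's advised minimum, transformers>=4.37.0.
# This compares only the numeric major.minor.patch parts; it is a sketch,
# not a substitute for a real version parser.
def meets_minimum(installed: str, required: str = "4.37.0") -> bool:
    as_tuple = lambda v: tuple(int(part) for part in v.split(".")[:3])
    return as_tuple(installed) >= as_tuple(required)

print(meets_minimum("4.36.2"))  # False: upgrade before loading the model
print(meets_minimum("4.37.0"))  # True
```

In practice one would feed in `importlib.metadata.version("transformers")`, or use `packaging.version.Version`, which also compares pre-release tags correctly.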
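As a quick arithmetic check, the AlpacaEval win rate quoted in the hunk context above (**52.17% : 47.83%**) is consistent with the win totals in the table, 420 against 385. A small sketch (the function name is my own, not from the README):

```python
# Sanity-check the AlpacaEval head-to-head numbers from the README:
# 420 wins vs 385 wins should reproduce the quoted 52.17% : 47.83% split.
def win_rates(wins_a: int, wins_b: int) -> tuple[float, float]:
    """Return each side's share of decided comparisons, as percentages."""
    total = wins_a + wins_b
    return 100 * wins_a / total, 100 * wins_b / total

a, b = win_rates(420, 385)
print(f"{a:.2f}% : {b:.2f}%")  # 52.17% : 47.83%
```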