Update README.md
README.md (changed)
@@ -53,8 +53,20 @@ for output in outputs:
 
 We find that this is the best-performing model in the 7/8B class of LLMs on a multitude of Japanese language benchmarks.
 
+We calculate our Japanese evaluation scores using our [lightblue-tech/japanese_llm_eval](https://github.com/lightblue-tech/japanese_llm_eval) repo.
+
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/64b63f8ad57e02621dc93c8b/2obyDbrjiNV3PGfwom6EI.png)
 
+We also compare our Japanese model to our multilingual model using our [multilingual_mt_bench](https://github.com/Peter-Devine/multilingual_mt_bench/tree/main/fastchat/llm_judge) repo.
+
+| | **lightblue/suzume-llama-3-8B-japanese** | **lightblue/suzume-llama-3-8B-multilingual** | **Nexusflow/Starling-LM-7B-beta** | **gpt-3.5-turbo** |
+|-----------------|------------------------------------------|----------------------------------------------|-----------------------------------|-------------------|
+| **Japanese 🇯🇵** | 6.24 | 6.56 | 6.22 | 7.84 |
+
+Here, we find that our multilingual model outperforms our Japanese model on the Japanese MT-Bench benchmark, indicating that the multilingual model generalized better to Japanese MT-Bench by training on more data, even though that added data was not in Japanese.
+
+Note: the discrepancy between the MT-Bench scores of the first and second evaluations of `lightblue/suzume-llama-3-8B-japanese` is due to the difference in the system messages of the two evaluation harnesses: the former's system message is in Japanese while the latter's is in English.
+
 # Training data
 
 We train on three sources of data to create this model
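The note about the two harnesses can be illustrated with a minimal sketch: both send the model the same user turn, and only the system turn differs in language. The prompt strings and the `build_chat` helper below are illustrative assumptions, not the actual prompts or code from either evaluation repo.

```python
# Illustrative sketch of how two evaluation harnesses can differ only in the
# language of the system message. The prompt wordings are assumed examples.

JA_SYSTEM = "あなたは役立つアシスタントです。"  # assumed Japanese system prompt
EN_SYSTEM = "You are a helpful assistant."      # assumed English system prompt


def build_chat(system_message: str, user_message: str) -> list:
    """Build an OpenAI-style chat message list with the given system message."""
    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message},
    ]


question = "日本の首都はどこですか？"  # a sample MT-Bench-style question

ja_chat = build_chat(JA_SYSTEM, question)  # first harness: Japanese system message
en_chat = build_chat(EN_SYSTEM, question)  # second harness: English system message

# The user turn is identical across harnesses; only the system turn differs,
# which is enough to shift a chat model's scores on the same benchmark.
assert ja_chat[1] == en_chat[1]
assert ja_chat[0]["content"] != en_chat[0]["content"]
```

Because the model conditions on the whole conversation, including the system turn, even this one-line difference in the prompt can move MT-Bench scores.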