Jian-Gang committed on
Commit
7839fa8
1 Parent(s): 56ad8c1

Update README.md

Files changed (1)
  1. README.md +4 -2
README.md CHANGED
```diff
@@ -37,9 +37,9 @@ We performed continued pre-training in English and SEA languages on [Llama-3.1-7
 For tokenisation, the model employs the default tokenizer used in Llama 3.1 70B Instruct.
 
 ### Benchmark Performance
-We evaluated Llama3.1 70B CPT SEA-LIONv3 base model on general language capabilities.
+We evaluated Llama3.1 70B CPT SEA-LIONv3 base model on general language capabilities and constraint-following behaviour.
 
-#### General Language Capabilities
+#### General Language Capabilities and Constraint-following Behaviour
 For the evaluation of general language capabilities, we employed the [SEA-HELM (also known as BHASA) evaluation benchmark](https://arxiv.org/abs/2309.06085v2) across a variety of tasks.
 These tasks include Question Answering (QA), Sentiment Analysis (Sentiment), Toxicity Detection (Toxicity), Translation in both directions (Eng>Lang & Lang>Eng), Abstractive Summarisation (Abssum), Causal Reasoning (Causal) and Natural Language Inference (NLI).
 
@@ -51,6 +51,8 @@ Following the implementation of IFEval in OpenLLM leaderboard, we also implement
 
 **SEA-IFEval**
 
+Based on [IFEval](https://arxiv.org/abs/2311.07911), the linguists and native speakers in the team worked together to filter, localise and translate the datasets into the respective target languages to ensure that the examples remained reasonable, meaningful and natural.
+
 SEA-IFEval evaluates a model's ability to adhere to constraints provided in the prompt, for example beginning a response with a specific word/phrase or answering with a certain number of sections. Additionally, accuracy is normalised by the proportion of responses in the correct language (if the model performs the task correctly but responds in the wrong language, it is judged to have failed the task).
 
 For more details on Llama3.1 70B CPT SEA-LIONv3 base benchmark performance, please refer to the SEA-HELM leaderboard, https://leaderboard.sea-lion.ai/.
```
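
The SEA-IFEval paragraph above describes a concrete scoring rule: a response is credited only if it both satisfies the prompt's constraint and is written in the target language, so constraint accuracy is effectively normalised by language correctness. A minimal sketch of that rule, assuming per-response pass/fail judgements (the `Response` type and `sea_ifeval_score` function are illustrative, not SEA-HELM's actual API):

```python
# Minimal sketch (not SEA-HELM's actual code) of the SEA-IFEval scoring rule:
# a response counts as correct only when it follows the prompt's constraint
# AND is written in the target language.
from dataclasses import dataclass

@dataclass
class Response:                  # illustrative type, not part of SEA-HELM
    follows_constraint: bool     # e.g. begins with the required word/phrase
    in_target_language: bool     # detected language matches the prompt's

def sea_ifeval_score(responses: list[Response]) -> float:
    """Proportion of responses passing both the constraint and language checks."""
    if not responses:
        return 0.0
    passed = sum(r.follows_constraint and r.in_target_language for r in responses)
    return passed / len(responses)

# A response that completes the task correctly but in the wrong language
# is judged to have failed, exactly as the paragraph states.
print(sea_ifeval_score([
    Response(True, True),
    Response(True, False),   # right task, wrong language -> fail
    Response(False, True),
]))  # 1/3, approximately 0.333
```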