Ariel Lee, lilloukas committed on
Commit
017e1c3
1 Parent(s): dcccf02

Update README.md (#1)


- Update README.md (8f09274653fb666f195777f43a6fb569fb38c682)


Co-authored-by: lilloukas <[email protected]>

Files changed (1)
  1. README.md +9 -9
README.md CHANGED
@@ -17,11 +17,11 @@ SuperPlatty-30B is a merge of [lilloukas/Platypus-30B](https://huggingface.co/li
 
 | Metric                | Value |
 |-----------------------|-------|
-| MMLU (5-shot)         |       |
-| ARC (25-shot)         |       |
-| HellaSwag (10-shot)   |       |
-| TruthfulQA (0-shot)   |       |
-| Avg.                  |       |
+| MMLU (5-shot)         | 62.6  |
+| ARC (25-shot)         | 66.1  |
+| HellaSwag (10-shot)   | 83.9  |
+| TruthfulQA (0-shot)   | 54.0  |
+| Avg.                  | 66.6  |
 
 We use state-of-the-art EleutherAI [Language Model Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) to run the benchmark tests above.
 
@@ -51,22 +51,22 @@ Each task was evaluated on a single A100 80GB GPU.
 
 ARC:
 ```
-python main.py --model hf-causal-experimental --model_args pretrained=lilloukas/GPlatty-30B --tasks arc_challenge --batch_size 1 --no_cache --write_out --output_path results/Platypus-30B/arc_challenge_25shot.json --device cuda --num_fewshot 25
+python main.py --model hf-causal-experimental --model_args pretrained=ariellee/SuperPlatty-30B --tasks arc_challenge --batch_size 1 --no_cache --write_out --output_path results/SuperPlatty-30B/arc_challenge_25shot.json --device cuda --num_fewshot 25
 ```
 
 HellaSwag:
 ```
-python main.py --model hf-causal-experimental --model_args pretrained=lilloukas/GPlatty-30B --tasks hellaswag --batch_size 1 --no_cache --write_out --output_path results/Platypus-30B/hellaswag_10shot.json --device cuda --num_fewshot 10
+python main.py --model hf-causal-experimental --model_args pretrained=ariellee/SuperPlatty-30B --tasks hellaswag --batch_size 1 --no_cache --write_out --output_path results/SuperPlatty-30B/hellaswag_10shot.json --device cuda --num_fewshot 10
 ```
 
 MMLU:
 ```
-python main.py --model hf-causal-experimental --model_args pretrained=lilloukas/GPlatty-30B --tasks hendrycksTest-* --batch_size 1 --no_cache --write_out --output_path results/Platypus-30B/mmlu_5shot.json --device cuda --num_fewshot 5
+python main.py --model hf-causal-experimental --model_args pretrained=ariellee/SuperPlatty-30B --tasks hendrycksTest-* --batch_size 1 --no_cache --write_out --output_path results/SuperPlatty-30B/mmlu_5shot.json --device cuda --num_fewshot 5
 ```
 
 TruthfulQA:
 ```
-python main.py --model hf-causal-experimental --model_args pretrained=lilloukas/GPlatty-30B --tasks truthfulqa_mc --batch_size 1 --no_cache --write_out --output_path results/Platypus-30B/truthfulqa_0shot.json --device cuda
+python main.py --model hf-causal-experimental --model_args pretrained=ariellee/SuperPlatty-30B --tasks truthfulqa_mc --batch_size 1 --no_cache --write_out --output_path results/SuperPlatty-30B/truthfulqa_0shot.json --device cuda
 ```
 ## Limitations and bias
 
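For anyone reproducing the numbers in the updated table, a minimal setup sketch follows. It assumes a local checkout of the EleutherAI harness at a revision that still exposes `main.py` and the `hf-causal-experimental` model type, as the commands in this diff do; the clone/install steps are the standard ones for that repository and are not part of this commit.

```
# Sketch of the setup assumed by the commands above (not part of this commit).
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .

# Example: the ARC run from the updated README, executed from the repo root.
python main.py --model hf-causal-experimental \
  --model_args pretrained=ariellee/SuperPlatty-30B \
  --tasks arc_challenge --batch_size 1 --no_cache --write_out \
  --output_path results/SuperPlatty-30B/arc_challenge_25shot.json \
  --device cuda --num_fewshot 25
```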