appvoid/arco-2

metadata

license: apache-2.0

This is a passthrough merge of arco with an experimental model. It improves on ARC Challenge, falling only 1.2 points short of modern 3b baseline performance.
If you need answers to multilingual, general-knowledge, or trivially simple questions, choose qwen or llama. If you need to solve trivially simple English tasks with a model half the size, choose arco.
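For readers unfamiliar with the term: in model-merging tools such as mergekit, a "passthrough" merge copies layers from the source models unchanged and stacks them, instead of averaging their weights. The layer split used for arco 2 is not documented here, so this is only a toy illustration: the `passthrough_merge` helper, the 24-layer stand-in models, and the 12/12 split are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical slice spec: (source model name, start layer, end layer), end exclusive.
Slice = Tuple[str, int, int]


def passthrough_merge(models: Dict[str, List[str]], slices: List[Slice]) -> List[str]:
    """Stack layer slices from source models, copying each layer unchanged.

    No weights are combined: every layer in the result is taken verbatim
    from exactly one source model.
    """
    merged: List[str] = []
    for name, start, end in slices:
        layers = models[name]
        if not 0 <= start <= end <= len(layers):
            raise ValueError(f"slice {start}:{end} out of range for {name}")
        merged.extend(layers[start:end])
    return merged


# Toy stand-ins for two models; real layer counts and the real split are assumptions.
arco = [f"arco.layer{i}" for i in range(24)]
experimental = [f"exp.layer{i}" for i in range(24)]

merged = passthrough_merge(
    {"arco": arco, "experimental": experimental},
    [("arco", 0, 12), ("experimental", 12, 24)],
)
print(len(merged), merged[11], merged[12])
```

Because nothing is averaged, the merged model's depth depends only on how many layers the slices select; here 12 + 12 layers gives the same depth as either source.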

prompt
No prompt format is set; this is intentional.

benchmarks
zero-shot results from state-of-the-art small language models

| Parameters | Model | MMLU | ARC-C | HellaSwag | PIQA | Winogrande | Average |
|---|---|---|---|---|---|---|---|
| 0.5b | qwen 2 | 44.13 | 28.92 | 49.05 | 69.31 | 56.99 | 49.68 |
| 0.3b | smollm | 25.52 | 37.71 | 56.41 | 71.93 | 59.27 | 50.17 |
| 0.5b | danube 3 | 24.81 | 36.18 | 60.46 | 73.78 | 61.01 | 51.25 |
| 0.5b | qwen 2.5 | 47.29 | 31.83 | 52.17 | 70.29 | 57.06 | 51.72 |
| 0.5b | arco | 26.17 | 37.29 | 62.88 | 74.37 | 62.27 | 52.60 |
| 0.5b | arco 2 | 25.51 | 38.82 | 63.02 | 74.70 | 61.25 | 52.66 |
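The Average column appears to be the unweighted arithmetic mean of the five benchmark scores. This short Python check recomputes it from the table values (hand-transcribed into the `SCORES` dictionary) and also computes arco 2's ARC-C gain over arco. The recomputed means agree with the reported averages to within 0.01 points; the small gaps presumably come from rounding of the underlying scores.

```python
# Zero-shot scores transcribed from the benchmarks table:
# (MMLU, ARC-C, HellaSwag, PIQA, Winogrande), reported Average
SCORES = {
    "qwen 2": ((44.13, 28.92, 49.05, 69.31, 56.99), 49.68),
    "smollm": ((25.52, 37.71, 56.41, 71.93, 59.27), 50.17),
    "danube 3": ((24.81, 36.18, 60.46, 73.78, 61.01), 51.25),
    "qwen 2.5": ((47.29, 31.83, 52.17, 70.29, 57.06), 51.72),
    "arco": ((26.17, 37.29, 62.88, 74.37, 62.27), 52.60),
    "arco 2": ((25.51, 38.82, 63.02, 74.70, 61.25), 52.66),
}

ARC_C = 1  # index of the ARC-C column in each score tuple


def mean(values):
    """Unweighted arithmetic mean (assumed to be how Average was computed)."""
    return sum(values) / len(values)


for model, (scores, reported) in SCORES.items():
    recomputed = mean(scores)
    print(f"{model:<9} recomputed={recomputed:.3f} reported={reported:.2f}")

# ARC-C improvement of arco 2 over the original arco.
arc_gain = SCORES["arco 2"][0][ARC_C] - SCORES["arco"][0][ARC_C]
print(f"arco 2 ARC-C gain over arco: {arc_gain:.2f} points")
```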

trivia
arco also stands for "arc optimized", hence the focus on ARC, a cognition-oriented benchmark.