yzabc007 committed on
Commit 190ad0c
Parent: 37b3751

Update space

src/results/models_2024-10-08-03:10:26.811832.jsonl ADDED
@@ -0,0 +1,1770 @@
+ [
+ {
+ "config": {
+ "model_name": "ChatGPT-4o-latest (2024-09-03)",
+ "organization": "OpenAI",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2023/10"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.974329609,
+ "Standard Deviation": 0.005024959031
+ },
+ "Geometry": {
+ "Score": 0.976028578,
+ "Standard Deviation": 0.01507912373
+ },
+ "Algebra": {
+ "Score": 0.951199453,
+ "Standard Deviation": 0.08452452108
+ },
+ "Probability": {
+ "Score": 0.842116641,
+ "Standard Deviation": 0.006267759054
+ },
+ "Logical": {
+ "Score": 0.828490728,
+ "Standard Deviation": 0.009134213144
+ },
+ "Social": {
+ "Score": 0.815902987,
+ "Standard Deviation": 0.0196254222
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "gpt-4o-2024-08-06",
+ "organization": "OpenAI",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2023/10"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.846571548,
+ "Standard Deviation": 0.03394056554
+ },
+ "Geometry": {
+ "Score": 0.99773096,
+ "Standard Deviation": 0.002835555172
+ },
+ "Algebra": {
+ "Score": 1.0,
+ "Standard Deviation": 0.0
+ },
+ "Probability": {
+ "Score": 0.78855795,
+ "Standard Deviation": 0.008188675452
+ },
+ "Logical": {
+ "Score": 0.668635768,
+ "Standard Deviation": 0.03466314094
+ },
+ "Social": {
+ "Score": 0.680417314,
+ "Standard Deviation": 0.00656867063
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "gpt-4o-2024-05-13",
+ "organization": "OpenAI",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2023/10"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.846334477,
+ "Standard Deviation": 0.09377911572
+ },
+ "Geometry": {
+ "Score": 0.972472377,
+ "Standard Deviation": 0.01648274205
+ },
+ "Algebra": {
+ "Score": 0.995511298,
+ "Standard Deviation": 0.004097802515
+ },
+ "Probability": {
+ "Score": 0.812149974,
+ "Standard Deviation": 0.007669585485
+ },
+ "Logical": {
+ "Score": 0.755019692,
+ "Standard Deviation": 0.008149588572
+ },
+ "Social": {
+ "Score": 0.609875087,
+ "Standard Deviation": 0.038729239
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "gpt-4-turbo-2024-04-09",
+ "organization": "OpenAI",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2023/12"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.855357972,
+ "Standard Deviation": 0.1016986368
+ },
+ "Geometry": {
+ "Score": 0.95374588,
+ "Standard Deviation": 0.03109307166
+ },
+ "Algebra": {
+ "Score": 0.930945223,
+ "Standard Deviation": 0.06705136813
+ },
+ "Probability": {
+ "Score": 0.750705448,
+ "Standard Deviation": 0.05944483103
+ },
+ "Logical": {
+ "Score": 0.77906699,
+ "Standard Deviation": 0.007406734161
+ },
+ "Social": {
+ "Score": 0.715935163,
+ "Standard Deviation": 0.1209141409
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "gemini-1.5-pro-001",
+ "organization": "Google",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2024-01"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.797187842,
+ "Standard Deviation": 0.0272375249
+ },
+ "Geometry": {
+ "Score": 0.9947169,
+ "Standard Deviation": 0.009150597621
+ },
+ "Algebra": {
+ "Score": 0.857464301,
+ "Standard Deviation": 0.05014285338
+ },
+ "Probability": {
+ "Score": 0.651781767,
+ "Standard Deviation": 0.04156998547
+ },
+ "Logical": {
+ "Score": 0.739745471,
+ "Standard Deviation": 0.01631532019
+ },
+ "Social": {
+ "Score": 0.649601885,
+ "Standard Deviation": 0.104854889
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "qwen2-72b-instruct",
+ "organization": "Alibaba",
+ "license": "Qianwen LICENSE",
+ "knowledge_cutoff": "2024-02"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.737918558,
+ "Standard Deviation": 0.09069077339
+ },
+ "Geometry": {
+ "Score": 0.796870305,
+ "Standard Deviation": 0.0509025346
+ },
+ "Algebra": {
+ "Score": 0.836194231,
+ "Standard Deviation": 0.04517093028
+ },
+ "Probability": {
+ "Score": 0.788068004,
+ "Standard Deviation": 0.007288989044
+ },
+ "Logical": {
+ "Score": 0.619300904,
+ "Standard Deviation": 0.06377931612
+ },
+ "Social": {
+ "Score": 0.652578786,
+ "Standard Deviation": 0.04259293171
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "gpt-4o-mini-2024-07-18",
+ "organization": "OpenAI",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2024-07"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.847694133,
+ "Standard Deviation": 0.02164304402
+ },
+ "Geometry": {
+ "Score": 0.946650435,
+ "Standard Deviation": 0.01831236482
+ },
+ "Algebra": {
+ "Score": 0.796243022,
+ "Standard Deviation": 0.05537539202
+ },
+ "Probability": {
+ "Score": 0.798402685,
+ "Standard Deviation": 0.009404491967
+ },
+ "Logical": {
+ "Score": 0.727009735,
+ "Standard Deviation": 0.02628110141
+ },
+ "Social": {
+ "Score": 0.691949855,
+ "Standard Deviation": 0.02072934333
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "claude-3.5-sonnet",
+ "organization": "Anthropic",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2024-03"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.839004422,
+ "Standard Deviation": 0.1461079564
+ },
+ "Geometry": {
+ "Score": 0.95316419,
+ "Standard Deviation": 0.02081192856
+ },
+ "Algebra": {
+ "Score": 0.759789952,
+ "Standard Deviation": 0.02611765096
+ },
+ "Probability": {
+ "Score": 0.707730127,
+ "Standard Deviation": 0.0394436664
+ },
+ "Logical": {
+ "Score": 0.77342666,
+ "Standard Deviation": 0.002892426458
+ },
+ "Social": {
+ "Score": 0.790002247,
+ "Standard Deviation": 0.1007410022
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "o1-mini",
+ "organization": "OpenAI",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2024-01"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 1.0,
+ "Standard Deviation": 0.0
+ },
+ "Geometry": {
+ "Score": "N/A",
+ "Standard Deviation": "N/A"
+ },
+ "Algebra": {
+ "Score": "N/A",
+ "Standard Deviation": "N/A"
+ },
+ "Probability": {
+ "Score": 1.0,
+ "Standard Deviation": 0.0
+ },
+ "Logical": {
+ "Score": 1.0,
+ "Standard Deviation": 0.0
+ },
+ "Social": {
+ "Score": 0.993974241,
+ "Standard Deviation": 0.001996882328
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "o1-preview",
+ "organization": "OpenAI",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2024-01"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.945884589,
+ "Standard Deviation": 0.01059250762
+ },
+ "Geometry": {
+ "Score": "N/A",
+ "Standard Deviation": "N/A"
+ },
+ "Algebra": {
+ "Score": "N/A",
+ "Standard Deviation": "N/A"
+ },
+ "Probability": {
+ "Score": 0.964666392,
+ "Standard Deviation": 0.003139983398
+ },
+ "Logical": {
+ "Score": 0.987950057,
+ "Standard Deviation": 0.004881220327
+ },
+ "Social": {
+ "Score": 1.0,
+ "Standard Deviation": 0.0
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "gemini-1.5-flash-001",
+ "organization": "Google",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2024-02"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.726493401,
+ "Standard Deviation": 0.01113913725
+ },
+ "Geometry": {
+ "Score": 0.804144103,
+ "Standard Deviation": 0.1327142178
+ },
+ "Algebra": {
+ "Score": 0.731776765,
+ "Standard Deviation": 0.02594657111
+ },
+ "Probability": {
+ "Score": 0.614461891,
+ "Standard Deviation": 0.04690131826
+ },
+ "Logical": {
+ "Score": 0.630805991,
+ "Standard Deviation": 0.04871350612
+ },
+ "Social": {
+ "Score": 0.555933822,
+ "Standard Deviation": 0.1029934524
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "gpt4-1106",
+ "organization": "OpenAI",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2024-04"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.816347784,
+ "Standard Deviation": 0.1566815755
+ },
+ "Geometry": {
+ "Score": 0.71843088,
+ "Standard Deviation": 0.04778038294
+ },
+ "Algebra": {
+ "Score": 0.712910417,
+ "Standard Deviation": 0.02581828898
+ },
+ "Probability": {
+ "Score": 0.623947619,
+ "Standard Deviation": 0.03502982933
+ },
+ "Logical": {
+ "Score": 0.637482274,
+ "Standard Deviation": 0.04158809888
+ },
+ "Social": {
+ "Score": 0.450609816,
+ "Standard Deviation": 0.05208655446
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "gemma-2-27b-it",
+ "organization": "Google",
+ "license": "Gemma License",
+ "knowledge_cutoff": "2024-03"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.624169623,
+ "Standard Deviation": 0.1048365121
+ },
+ "Geometry": {
+ "Score": 0.60112744,
+ "Standard Deviation": 0.0469109952
+ },
+ "Algebra": {
+ "Score": 0.687955914,
+ "Standard Deviation": 0.01959958192
+ },
+ "Probability": {
+ "Score": 0.589524771,
+ "Standard Deviation": 0.03112689325
+ },
+ "Logical": {
+ "Score": 0.614978944,
+ "Standard Deviation": 0.05710657859
+ },
+ "Social": {
+ "Score": 0.487844257,
+ "Standard Deviation": 0.05857760809
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "claude-3-opus",
+ "organization": "Anthropic",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2024-01"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.650636271,
+ "Standard Deviation": 0.1197773541
+ },
+ "Geometry": {
+ "Score": 0.7215743,
+ "Standard Deviation": 0.04712598358
+ },
+ "Algebra": {
+ "Score": 0.68777327,
+ "Standard Deviation": 0.02382683713
+ },
+ "Probability": {
+ "Score": 0.626471421,
+ "Standard Deviation": 0.02911817976
+ },
+ "Logical": {
+ "Score": 0.692346381,
+ "Standard Deviation": 0.03617185198
+ },
+ "Social": {
+ "Score": 0.663410854,
+ "Standard Deviation": 0.09540220876
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "gemma-2-9b-it-simpo",
+ "organization": "Google",
+ "license": "Gemma License",
+ "knowledge_cutoff": "2024-02"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": "N/A",
+ "Standard Deviation": "N/A"
+ },
+ "Geometry": {
+ "Score": 0.582787508,
+ "Standard Deviation": 0.03965204074
+ },
+ "Algebra": {
+ "Score": 0.658648133,
+ "Standard Deviation": 0.02565919856
+ },
+ "Probability": {
+ "Score": 0.547861265,
+ "Standard Deviation": 0.02885209131
+ },
+ "Logical": {
+ "Score": 0.540720893,
+ "Standard Deviation": 0.01970134508
+ },
+ "Social": {
+ "Score": 0.635266187,
+ "Standard Deviation": 0.03620021751
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "qwen1.5-72b-chat",
+ "organization": "Alibaba",
+ "license": "Qianwen LICENSE",
+ "knowledge_cutoff": "2024-03"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.519549796,
+ "Standard Deviation": 0.00903634343
+ },
+ "Geometry": {
+ "Score": 0.543139301,
+ "Standard Deviation": 0.03425202326
+ },
+ "Algebra": {
+ "Score": 0.635228729,
+ "Standard Deviation": 0.01944043425
+ },
+ "Probability": {
+ "Score": 0.486948658,
+ "Standard Deviation": 0.06064655315
+ },
+ "Logical": {
+ "Score": 0.284069394,
+ "Standard Deviation": 0.02686608506
+ },
+ "Social": {
+ "Score": 0.415007627,
+ "Standard Deviation": 0.03920053159
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "qwen1.5-32b-chat",
+ "organization": "Alibaba",
+ "license": "Qianwen LICENSE",
+ "knowledge_cutoff": "2024-03"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.393789407,
+ "Standard Deviation": 0.05413770095
+ },
+ "Geometry": {
+ "Score": 0.51086835,
+ "Standard Deviation": 0.04052471998
+ },
+ "Algebra": {
+ "Score": 0.609003168,
+ "Standard Deviation": 0.04874143541
+ },
+ "Probability": {
+ "Score": 0.476300002,
+ "Standard Deviation": 0.05322403912
+ },
+ "Logical": {
+ "Score": 0.331781014,
+ "Standard Deviation": 0.004938997686
+ },
+ "Social": {
+ "Score": 0.380987334,
+ "Standard Deviation": 0.03762251776
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "google-gemma-2-9b-it",
+ "organization": "Google",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2024-01"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.489663449,
+ "Standard Deviation": 0.002595702019
+ },
+ "Geometry": {
+ "Score": 0.575371308,
+ "Standard Deviation": 0.03556220251
+ },
+ "Algebra": {
+ "Score": 0.597045661,
+ "Standard Deviation": 0.0313828123
+ },
+ "Probability": {
+ "Score": 0.589221807,
+ "Standard Deviation": 0.03110811656
+ },
+ "Logical": {
+ "Score": 0.587579897,
+ "Standard Deviation": 0.05512716783
+ },
+ "Social": {
+ "Score": 0.768337958,
+ "Standard Deviation": 0.04078610476
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "yi-1.5-34b-chat",
+ "organization": "01 AI",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2024-01"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.607812897,
+ "Standard Deviation": 0.1440881293
+ },
+ "Geometry": {
+ "Score": 0.566666724,
+ "Standard Deviation": 0.04001381658
+ },
+ "Algebra": {
+ "Score": 0.590997292,
+ "Standard Deviation": 0.03594087315
+ },
+ "Probability": {
+ "Score": 0.589524589,
+ "Standard Deviation": 0.03112618772
+ },
+ "Logical": {
+ "Score": 0.574105508,
+ "Standard Deviation": 0.03441737941
+ },
+ "Social": {
+ "Score": 0.516980832,
+ "Standard Deviation": 0.03369347985
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "meta-llama-3.1-8b-instruct",
+ "organization": "Meta",
+ "license": "Llama 3.1 Community",
+ "knowledge_cutoff": "2024-02"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.505936324,
+ "Standard Deviation": 0.05286756493
+ },
+ "Geometry": {
+ "Score": 0.522442162,
+ "Standard Deviation": 0.03908236317
+ },
+ "Algebra": {
+ "Score": 0.582702645,
+ "Standard Deviation": 0.05002277711
+ },
+ "Probability": {
+ "Score": 0.495001149,
+ "Standard Deviation": 0.05244587037
+ },
+ "Logical": {
+ "Score": 0.443030561,
+ "Standard Deviation": 0.01343820628
+ },
+ "Social": {
+ "Score": 0.329195941,
+ "Standard Deviation": 0.03925019528
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "gpt3.5-turbo-0125",
+ "organization": "OpenAI",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2023-12"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.313398088,
+ "Standard Deviation": 0.09322528606
+ },
+ "Geometry": {
+ "Score": 0.678714519,
+ "Standard Deviation": 0.05926546762
+ },
+ "Algebra": {
+ "Score": 0.569296173,
+ "Standard Deviation": 0.05277281097
+ },
+ "Probability": {
+ "Score": 0.448460767,
+ "Standard Deviation": 0.05768095196
+ },
+ "Logical": {
+ "Score": 0.148521348,
+ "Standard Deviation": 0.04033712907
+ },
+ "Social": {
+ "Score": 0.235071541,
+ "Standard Deviation": 0.02632892457
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "llama-3-70b-instruct",
+ "organization": "Meta",
+ "license": "Llama 3 Community",
+ "knowledge_cutoff": "2024-03"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.456689885,
+ "Standard Deviation": 0.01385989995
+ },
+ "Geometry": {
+ "Score": 0.516865529,
+ "Standard Deviation": 0.03858112564
+ },
+ "Algebra": {
+ "Score": 0.566756531,
+ "Standard Deviation": 0.03369826926
+ },
+ "Probability": {
+ "Score": 0.513857306,
+ "Standard Deviation": 0.05453699062
+ },
+ "Logical": {
+ "Score": 0.713796415,
+ "Standard Deviation": 0.02031215107
+ },
+ "Social": {
+ "Score": 0.45872939,
+ "Standard Deviation": 0.05347039576
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "claude-3-sonnet",
+ "organization": "Anthropic",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2024-02"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.520010833,
+ "Standard Deviation": 0.005030563799
+ },
+ "Geometry": {
+ "Score": 0.675613638,
+ "Standard Deviation": 0.05275594408
+ },
+ "Algebra": {
+ "Score": 0.552025728,
+ "Standard Deviation": 0.04122192409
+ },
+ "Probability": {
+ "Score": 0.516192848,
+ "Standard Deviation": 0.04152293217
+ },
+ "Logical": {
+ "Score": 0.588545747,
+ "Standard Deviation": 0.06068211943
+ },
+ "Social": {
+ "Score": 0.570437582,
+ "Standard Deviation": 0.08607040862
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "qwen1.5-14b-chat",
+ "organization": "Alibaba",
+ "license": "Qianwen LICENSE",
+ "knowledge_cutoff": "2024-01"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.415328996,
+ "Standard Deviation": 0.0743938717
+ },
+ "Geometry": {
+ "Score": 0.452504016,
+ "Standard Deviation": 0.04225594393
+ },
+ "Algebra": {
+ "Score": 0.538655725,
+ "Standard Deviation": 0.03721542594
+ },
+ "Probability": {
+ "Score": 0.397185975,
+ "Standard Deviation": 0.05607695946
+ },
+ "Logical": {
+ "Score": 0.264573129,
+ "Standard Deviation": 0.03936133174
+ },
+ "Social": {
+ "Score": 0.287370142,
+ "Standard Deviation": 0.04264085315
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "claude-3-haiku",
+ "organization": "Anthropic",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2024-01"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.453901163,
+ "Standard Deviation": 0.003604084261
+ },
+ "Geometry": {
+ "Score": 0.607993912,
+ "Standard Deviation": 0.05793460748
+ },
+ "Algebra": {
+ "Score": 0.520054055,
+ "Standard Deviation": 0.03333544511
+ },
+ "Probability": {
+ "Score": 0.474460688,
+ "Standard Deviation": 0.0446501933
+ },
+ "Logical": {
+ "Score": 0.512815976,
+ "Standard Deviation": 0.0163264281
+ },
+ "Social": {
+ "Score": 0.551083976,
+ "Standard Deviation": 0.05374722539
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "claude-2.1",
+ "organization": "Anthropic",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2023-12"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.35814708,
+ "Standard Deviation": 0.09168134168
+ },
+ "Geometry": {
+ "Score": 0.62752395,
+ "Standard Deviation": 0.07232659398
+ },
+ "Algebra": {
+ "Score": 0.508849609,
+ "Standard Deviation": 0.0346897465
+ },
+ "Probability": {
+ "Score": 0.41477086,
+ "Standard Deviation": 0.05964060239
+ },
+ "Logical": {
+ "Score": 0.482923674,
+ "Standard Deviation": 0.01989147048
+ },
+ "Social": {
+ "Score": 0.333804568,
+ "Standard Deviation": 0.03775548253
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "mistral-8x7b-instruct-v0.1",
+ "organization": "Mistral",
+ "license": "Apache 2.0",
+ "knowledge_cutoff": "2023-12"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.382659161,
+ "Standard Deviation": 0.07594496929
+ },
+ "Geometry": {
+ "Score": 0.432216097,
+ "Standard Deviation": 0.04747949254
+ },
+ "Algebra": {
+ "Score": 0.478314888,
+ "Standard Deviation": 0.01998797419
+ },
+ "Probability": {
+ "Score": 0.427144725,
+ "Standard Deviation": 0.0590923329
+ },
+ "Logical": {
+ "Score": 0.340041983,
+ "Standard Deviation": 0.008397574592
+ },
+ "Social": {
+ "Score": 0.251949622,
+ "Standard Deviation": 0.03346674405
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "claude-2.0",
+ "organization": "Anthropic",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2023-10"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.322718057,
+ "Standard Deviation": 0.08369883584
+ },
+ "Geometry": {
+ "Score": 0.604141967,
+ "Standard Deviation": 0.05116441826
+ },
+ "Algebra": {
+ "Score": 0.474350734,
+ "Standard Deviation": 0.01510393066
+ },
+ "Probability": {
+ "Score": 0.437950412,
+ "Standard Deviation": 0.05985594317
+ },
+ "Logical": {
+ "Score": 0.445620646,
+ "Standard Deviation": 0.01812614805
+ },
+ "Social": {
+ "Score": 0.469422836,
+ "Standard Deviation": 0.05999901796
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "starling-lm-7b-beta",
+ "organization": "Nexusflow",
+ "license": "Apache-2.0",
+ "knowledge_cutoff": "2024-01"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.479391856,
+ "Standard Deviation": 0.04199990887
+ },
+ "Geometry": {
+ "Score": 0.446654388,
+ "Standard Deviation": 0.05637864999
+ },
+ "Algebra": {
+ "Score": 0.473952749,
+ "Standard Deviation": 0.01584301288
+ },
+ "Probability": {
+ "Score": 0.395197837,
+ "Standard Deviation": 0.05814798892
+ },
+ "Logical": {
+ "Score": 0.39927199,
+ "Standard Deviation": 0.02125277518
+ },
+ "Social": {
+ "Score": 0.380021662,
+ "Standard Deviation": 0.04622452748
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "gemini-1.0-pro-001",
+ "organization": "Google",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2023-11"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.449040654,
+ "Standard Deviation": 0.0450610177
+ },
+ "Geometry": {
+ "Score": 0.578347959,
+ "Standard Deviation": 0.04242873607
+ },
+ "Algebra": {
+ "Score": 0.462417786,
+ "Standard Deviation": 0.01668313635
+ },
+ "Probability": {
+ "Score": 0.289836324,
+ "Standard Deviation": 0.05739831115
+ },
+ "Logical": {
+ "Score": 0.191140355,
+ "Standard Deviation": 0.03394652499
+ },
+ "Social": {
+ "Score": 0.130790863,
+ "Standard Deviation": 0.02800188173
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "openchat-3.5-0106",
+ "organization": "OpenChat",
+ "license": "Apache-2.0",
+ "knowledge_cutoff": "2024-01"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.363929888,
+ "Standard Deviation": 0.08602347145
+ },
+ "Geometry": {
+ "Score": 0.38715246,
+ "Standard Deviation": 0.03701851946
+ },
+ "Algebra": {
+ "Score": 0.441233712,
+ "Standard Deviation": 0.01135753754
+ },
+ "Probability": {
+ "Score": 0.38802618,
+ "Standard Deviation": 0.05663879714
+ },
+ "Logical": {
+ "Score": 0.336754383,
+ "Standard Deviation": 0.01608478079
+ },
+ "Social": {
+ "Score": 0.250891608,
+ "Standard Deviation": 0.03253769914
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "openchat-3.5",
+ "organization": "OpenChat",
+ "license": "Apache-2.0",
+ "knowledge_cutoff": "2023-12"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.361341296,
+ "Standard Deviation": 0.09034869493
+ },
+ "Geometry": {
+ "Score": 0.401699069,
+ "Standard Deviation": 0.03410726557
+ },
+ "Algebra": {
+ "Score": 0.414095336,
+ "Standard Deviation": 0.01881964261
+ },
+ "Probability": {
+ "Score": 0.349601002,
+ "Standard Deviation": 0.05077455539
+ },
+ "Logical": {
+ "Score": 0.331069242,
+ "Standard Deviation": 0.02180827173
+ },
+ "Social": {
+ "Score": 0.319991655,
+ "Standard Deviation": 0.04502478724
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "command-r-(08-2024)",
+ "organization": "Cohere",
+ "license": "CC-BY-NC-4.0",
+ "knowledge_cutoff": "2024-08"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.427605298,
+ "Standard Deviation": 0.01747449163
+ },
+ "Geometry": {
+ "Score": 0.448300727,
+ "Standard Deviation": 0.04996362328
+ },
+ "Algebra": {
+ "Score": 0.417519167,
+ "Standard Deviation": 0.01822196902
+ },
+ "Probability": {
+ "Score": 0.366336281,
+ "Standard Deviation": 0.04716826942
+ },
+ "Logical": {
+ "Score": 0.214657906,
+ "Standard Deviation": 0.03003579835
+ },
+ "Social": {
+ "Score": 0.276088379,
+ "Standard Deviation": 0.03295234688
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "gemma-1.1-7b-it",
+ "organization": "Google",
+ "license": "Gemma License",
+ "knowledge_cutoff": "2023-11"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.339506922,
+ "Standard Deviation": 0.1066279108
+ },
+ "Geometry": {
+ "Score": 0.324170977,
+ "Standard Deviation": 0.04668553765
+ },
+ "Algebra": {
+ "Score": 0.398684697,
+ "Standard Deviation": 0.01982398259
+ },
+ "Probability": {
+ "Score": 0.293253175,
+ "Standard Deviation": 0.05126192191
+ },
+ "Logical": {
+ "Score": 0.317750796,
+ "Standard Deviation": 0.01101933543
+ },
+ "Social": {
+ "Score": 0.179073276,
+ "Standard Deviation": 0.02009658805
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "llama3-8b-instruct",
+ "organization": "Meta",
+ "license": "Llama 3 Community",
+ "knowledge_cutoff": "2024-01"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.367722676,
+ "Standard Deviation": 0.1071368221
+ },
+ "Geometry": {
+ "Score": 0.367143758,
+ "Standard Deviation": 0.04363680358
+ },
+ "Algebra": {
+ "Score": 0.391480973,
+ "Standard Deviation": 0.02757445266
+ },
+ "Probability": {
+ "Score": 0.317616445,
+ "Standard Deviation": 0.04300430361
+ },
+ "Logical": {
+ "Score": 0.461607495,
+ "Standard Deviation": 0.02185028842
+ },
+ "Social": {
+ "Score": 0.336373622,
+ "Standard Deviation": 0.05762408512
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "gemma-2-2b-it",
+ "organization": "Google",
+ "license": "Gemma License",
+ "knowledge_cutoff": "2023-12"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.502167612,
+ "Standard Deviation": 0.04389786763
+ },
+ "Geometry": {
+ "Score": 0.395006676,
+ "Standard Deviation": 0.05882607713
+ },
+ "Algebra": {
+ "Score": 0.379391887,
+ "Standard Deviation": 0.01722410785
+ },
+ "Probability": {
+ "Score": 0.331231097,
+ "Standard Deviation": 0.05392499987
+ },
+ "Logical": {
+ "Score": 0.367687789,
+ "Standard Deviation": 0.02547968808
+ },
+ "Social": {
+ "Score": 0.393482094,
+ "Standard Deviation": 0.06450214024
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "starling-lm-7b-alpha",
+ "organization": "Nexusflow",
+ "license": "Apache-2.0",
+ "knowledge_cutoff": "2023-12"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.366628765,
+ "Standard Deviation": 0.08405492929
+ },
+ "Geometry": {
+ "Score": 0.336782578,
+ "Standard Deviation": 0.04069449132
+ },
+ "Algebra": {
+ "Score": 0.371551932,
+ "Standard Deviation": 0.03367241745
+ },
+ "Probability": {
+ "Score": 0.331472505,
+ "Standard Deviation": 0.04833324282
+ },
+ "Logical": {
+ "Score": 0.260869624,
+ "Standard Deviation": 0.03562735237
+ },
+ "Social": {
+ "Score": 0.271975534,
+ "Standard Deviation": 0.04266753408
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "qwen1.5-4b-chat",
+ "organization": "Alibaba",
+ "license": "Qianwen LICENSE",
+ "knowledge_cutoff": "2024-02"
+ },
+ "results": {
+ "OVERALL": {
+ "Score": 0.111876411,
+ "Standard Deviation": 0.04241022785
+ },
+ "Geometry": {
+ "Score": 0.215834522,
+ "Standard Deviation": 0.0363766363
+ },
+ "Algebra": {
+ "Score": 0.305589811,
+ "Standard Deviation": 0.02354198912
+ },
+ "Probability": {
+ "Score": 0.149365327,
+ "Standard Deviation": 0.03489672675
1283
+ },
1284
+ "Logical": {
1285
+ "Score": 0.116210168,
1286
+ "Standard Deviation": 0.005927966496
1287
+ },
1288
+ "Social": {
1289
+ "Score": 0.18195615,
1290
+ "Standard Deviation": 0.02269805277
1291
+ }
1292
+ }
1293
+ },
1294
+ {
1295
+ "config": {
1296
+ "model_name": "command-r-(04-2024)",
1297
+ "organization": "Cohere",
1298
+ "license": "CC-BY-NC-4.0",
1299
+ "knowledge_cutoff": "2024-04"
1300
+ },
1301
+ "results": {
1302
+ "OVERALL": {
1303
+ "Score": 0.388783887,
1304
+ "Standard Deviation": 0.07417186783
1305
+ },
1306
+ "Geometry": {
1307
+ "Score": 0.300416698,
1308
+ "Standard Deviation": 0.03485612736
1309
+ },
1310
+ "Algebra": {
1311
+ "Score": 0.293120231,
1312
+ "Standard Deviation": 0.032926484
1313
+ },
1314
+ "Probability": {
1315
+ "Score": 0.281271304,
1316
+ "Standard Deviation": 0.05697149867
1317
+ },
1318
+ "Logical": {
1319
+ "Score": 0.276189906,
1320
+ "Standard Deviation": 0.03562914754
1321
+ },
1322
+ "Social": {
1323
+ "Score": 0.283882949,
1324
+ "Standard Deviation": 0.03336901148
1325
+ }
1326
+ }
1327
+ },
1328
+ {
1329
+ "config": {
1330
+ "model_name": "vicuna-33b",
1331
+ "organization": "LMSYS",
1332
+ "license": "Non-commercial",
1333
+ "knowledge_cutoff": "2023-12"
1334
+ },
1335
+ "results": {
1336
+ "OVERALL": {
1337
+ "Score": 0.316543555,
1338
+ "Standard Deviation": 0.08922095647
1339
+ },
1340
+ "Geometry": {
1341
+ "Score": 0.208284679,
1342
+ "Standard Deviation": 0.03937771461
1343
+ },
1344
+ "Algebra": {
1345
+ "Score": 0.248994048,
1346
+ "Standard Deviation": 0.02668175054
1347
+ },
1348
+ "Probability": {
1349
+ "Score": 0.222313995,
1350
+ "Standard Deviation": 0.03978859759
1351
+ },
1352
+ "Logical": {
1353
+ "Score": 0.180291222,
1354
+ "Standard Deviation": 0.021886267
1355
+ },
1356
+ "Social": {
1357
+ "Score": 0.257623798,
1358
+ "Standard Deviation": 0.02653724437
1359
+ }
1360
+ }
1361
+ },
1362
+ {
1363
+ "config": {
1364
+ "model_name": "gemma-7b-it",
1365
+ "organization": "Google",
1366
+ "license": "Gemma License",
1367
+ "knowledge_cutoff": "2023-12"
1368
+ },
1369
+ "results": {
1370
+ "OVERALL": {
1371
+ "Score": 0.285077558,
1372
+ "Standard Deviation": 0.08871758453
1373
+ },
1374
+ "Geometry": {
1375
+ "Score": 0.244791417,
1376
+ "Standard Deviation": 0.0289612078
1377
+ },
1378
+ "Algebra": {
1379
+ "Score": 0.250614794,
1380
+ "Standard Deviation": 0.01991678295
1381
+ },
1382
+ "Probability": {
1383
+ "Score": 0.174313053,
1384
+ "Standard Deviation": 0.03765424728
1385
+ },
1386
+ "Logical": {
1387
+ "Score": 0.197505536,
1388
+ "Standard Deviation": 0.02050298885
1389
+ },
1390
+ "Social": {
1391
+ "Score": 0.202138025,
1392
+ "Standard Deviation": 0.02098346639
1393
+ }
1394
+ }
1395
+ },
1396
+ {
1397
+ "config": {
1398
+ "model_name": "mistral-7b-instruct-2",
1399
+ "organization": "Mistral",
1400
+ "license": "Apache 2.0",
1401
+ "knowledge_cutoff": "2023-12"
1402
+ },
1403
+ "results": {
1404
+ "OVERALL": {
1405
+ "Score": 0.427513868,
1406
+ "Standard Deviation": 0.05553921135
1407
+ },
1408
+ "Geometry": {
1409
+ "Score": 0.216402626,
1410
+ "Standard Deviation": 0.03338414918
1411
+ },
1412
+ "Algebra": {
1413
+ "Score": 0.233777838,
1414
+ "Standard Deviation": 0.0155226054
1415
+ },
1416
+ "Probability": {
1417
+ "Score": 0.25118175,
1418
+ "Standard Deviation": 0.04065514593
1419
+ },
1420
+ "Logical": {
1421
+ "Score": 0.224469136,
1422
+ "Standard Deviation": 0.03404706752
1423
+ },
1424
+ "Social": {
1425
+ "Score": 0.209386782,
1426
+ "Standard Deviation": 0.02738569921
1427
+ }
1428
+ }
1429
+ },
1430
+ {
1431
+ "config": {
1432
+ "model_name": "mistral-7b-instruct-1",
1433
+ "organization": "Mistral",
1434
+ "license": "Apache 2.0",
1435
+ "knowledge_cutoff": "2023-12"
1436
+ },
1437
+ "results": {
1438
+ "OVERALL": {
1439
+ "Score": 0.23016314,
1440
+ "Standard Deviation": 0.07137625271
1441
+ },
1442
+ "Geometry": {
1443
+ "Score": 0.161799938,
1444
+ "Standard Deviation": 0.03595278559
1445
+ },
1446
+ "Algebra": {
1447
+ "Score": 0.210341624,
1448
+ "Standard Deviation": 0.01736539119
1449
+ },
1450
+ "Probability": {
1451
+ "Score": 0.238417922,
1452
+ "Standard Deviation": 0.03744211933
1453
+ },
1454
+ "Logical": {
1455
+ "Score": 0.142636601,
1456
+ "Standard Deviation": 0.02080406365
1457
+ },
1458
+ "Social": {
1459
+ "Score": 0.117646827,
1460
+ "Standard Deviation": 0.009321202779
1461
+ }
1462
+ }
1463
+ },
1464
+ {
1465
+ "config": {
1466
+ "model_name": "vicuna-13b",
1467
+ "organization": "LMSYS",
1468
+ "license": "Non-commercial",
1469
+ "knowledge_cutoff": "2023-11"
1470
+ },
1471
+ "results": {
1472
+ "OVERALL": {
1473
+ "Score": 0.201892849,
1474
+ "Standard Deviation": 0.06021749802
1475
+ },
1476
+ "Geometry": {
1477
+ "Score": 0.200941928,
1478
+ "Standard Deviation": 0.03366817781
1479
+ },
1480
+ "Algebra": {
1481
+ "Score": 0.196123323,
1482
+ "Standard Deviation": 0.0135715643
1483
+ },
1484
+ "Probability": {
1485
+ "Score": 0.141214079,
1486
+ "Standard Deviation": 0.02721328211
1487
+ },
1488
+ "Logical": {
1489
+ "Score": 0.148598631,
1490
+ "Standard Deviation": 0.02241523892
1491
+ },
1492
+ "Social": {
1493
+ "Score": 0.124655135,
1494
+ "Standard Deviation": 0.01122382671
1495
+ }
1496
+ }
1497
+ },
1498
+ {
1499
+ "config": {
1500
+ "model_name": "zephyr-7b-beta",
1501
+ "organization": "HuggingFace",
1502
+ "license": "MIT",
1503
+ "knowledge_cutoff": "2023-10"
1504
+ },
1505
+ "results": {
1506
+ "OVERALL": {
1507
+ "Score": 0.102705119,
1508
+ "Standard Deviation": 0.03683757312
1509
+ },
1510
+ "Geometry": {
1511
+ "Score": 0.114005544,
1512
+ "Standard Deviation": 0.03144354365
1513
+ },
1514
+ "Algebra": {
1515
+ "Score": 0.141766633,
1516
+ "Standard Deviation": 0.03179520129
1517
+ },
1518
+ "Probability": {
1519
+ "Score": 0.089050714,
1520
+ "Standard Deviation": 0.002136754266
1521
+ },
1522
+ "Logical": {
1523
+ "Score": 0.069520789,
1524
+ "Standard Deviation": 0.004477840857
1525
+ },
1526
+ "Social": {
1527
+ "Score": 0.0,
1528
+ "Standard Deviation": 0.0
1529
+ }
1530
+ }
1531
+ },
1532
+ {
1533
+ "config": {
1534
+ "model_name": "gemma-1.1-2b-it",
1535
+ "organization": "Google",
1536
+ "license": "Gemma License",
1537
+ "knowledge_cutoff": "2023-12"
1538
+ },
1539
+ "results": {
1540
+ "OVERALL": {
1541
+ "Score": 0.257700845,
1542
+ "Standard Deviation": 0.07369021445
1543
+ },
1544
+ "Geometry": {
1545
+ "Score": 0.183974034,
1546
+ "Standard Deviation": 0.0215548886
1547
+ },
1548
+ "Algebra": {
1549
+ "Score": 0.13422252,
1550
+ "Standard Deviation": 0.01922819511
1551
+ },
1552
+ "Probability": {
1553
+ "Score": 0.095628657,
1554
+ "Standard Deviation": 0.007536076456
1555
+ },
1556
+ "Logical": {
1557
+ "Score": 0.094965074,
1558
+ "Standard Deviation": 0.005019175487
1559
+ },
1560
+ "Social": {
1561
+ "Score": 0.167796727,
1562
+ "Standard Deviation": 0.01666541942
1563
+ }
1564
+ }
1565
+ },
1566
+ {
1567
+ "config": {
1568
+ "model_name": "llama2-7b-chat",
1569
+ "organization": "Meta",
1570
+ "license": "Llama 2 Community",
1571
+ "knowledge_cutoff": "2023-10"
1572
+ },
1573
+ "results": {
1574
+ "OVERALL": {
1575
+ "Score": 0.260189428,
1576
+ "Standard Deviation": 0.08019299364
1577
+ },
1578
+ "Geometry": {
1579
+ "Score": 0.087067276,
1580
+ "Standard Deviation": 0.04274343402
1581
+ },
1582
+ "Algebra": {
1583
+ "Score": 0.12308805,
1584
+ "Standard Deviation": 0.01856053622
1585
+ },
1586
+ "Probability": {
1587
+ "Score": 0.087515438,
1588
+ "Standard Deviation": 0.006315053573
1589
+ },
1590
+ "Logical": {
1591
+ "Score": 0.17312827,
1592
+ "Standard Deviation": 0.01867044092
1593
+ },
1594
+ "Social": {
1595
+ "Score": 0.152905272,
1596
+ "Standard Deviation": 0.007166957097
1597
+ }
1598
+ }
1599
+ },
1600
+ {
1601
+ "config": {
1602
+ "model_name": "gemma-2b-it",
1603
+ "organization": "Google",
1604
+ "license": "Gemma License",
1605
+ "knowledge_cutoff": "2023-11"
1606
+ },
1607
+ "results": {
1608
+ "OVERALL": {
1609
+ "Score": 0.234172069,
1610
+ "Standard Deviation": 0.06522685718
1611
+ },
1612
+ "Geometry": {
1613
+ "Score": 0.198571153,
1614
+ "Standard Deviation": 0.01699161031
1615
+ },
1616
+ "Algebra": {
1617
+ "Score": 0.109883009,
1618
+ "Standard Deviation": 0.01520005833
1619
+ },
1620
+ "Probability": {
1621
+ "Score": 0.06467432,
1622
+ "Standard Deviation": 0.002117497231
1623
+ },
1624
+ "Logical": {
1625
+ "Score": 0.039624492,
1626
+ "Standard Deviation": 0.007606972686
1627
+ },
1628
+ "Social": {
1629
+ "Score": 0.087452913,
1630
+ "Standard Deviation": 0.008170146562
1631
+ }
1632
+ }
1633
+ },
1634
+ {
1635
+ "config": {
1636
+ "model_name": "llama2-13b-chat",
1637
+ "organization": "Meta",
1638
+ "license": "Llama 2 Community",
1639
+ "knowledge_cutoff": "2023-12"
1640
+ },
1641
+ "results": {
1642
+ "OVERALL": {
1643
+ "Score": 0.263305684,
1644
+ "Standard Deviation": 0.07283640689
1645
+ },
1646
+ "Geometry": {
1647
+ "Score": 0.072729954,
1648
+ "Standard Deviation": 0.02315988261
1649
+ },
1650
+ "Algebra": {
1651
+ "Score": 0.080371692,
1652
+ "Standard Deviation": 0.01277569453
1653
+ },
1654
+ "Probability": {
1655
+ "Score": 0.117757344,
1656
+ "Standard Deviation": 0.02418619619
1657
+ },
1658
+ "Logical": {
1659
+ "Score": 0.193149889,
1660
+ "Standard Deviation": 0.01776690764
1661
+ },
1662
+ "Social": {
1663
+ "Score": 0.149125922,
1664
+ "Standard Deviation": 0.01157416827
1665
+ }
1666
+ }
1667
+ },
1668
+ {
1669
+ "config": {
1670
+ "model_name": "vicuna-7b",
1671
+ "organization": "LMSYS",
1672
+ "license": "Non-commercial",
1673
+ "knowledge_cutoff": "2023-11"
1674
+ },
1675
+ "results": {
1676
+ "OVERALL": {
1677
+ "Score": 0.198839786,
1678
+ "Standard Deviation": 0.05725381576
1679
+ },
1680
+ "Geometry": {
1681
+ "Score": 0.083457058,
1682
+ "Standard Deviation": 0.02520989111
1683
+ },
1684
+ "Algebra": {
1685
+ "Score": 0.070883882,
1686
+ "Standard Deviation": 0.007315853253
1687
+ },
1688
+ "Probability": {
1689
+ "Score": 0.080987673,
1690
+ "Standard Deviation": 0.005474288861
1691
+ },
1692
+ "Logical": {
1693
+ "Score": 0.100065588,
1694
+ "Standard Deviation": 0.003561886452
1695
+ },
1696
+ "Social": {
1697
+ "Score": 0.111076414,
1698
+ "Standard Deviation": 0.004805626512
1699
+ }
1700
+ }
1701
+ },
1702
+ {
1703
+ "config": {
1704
+ "model_name": "koala-13b",
1705
+ "organization": "UC Berkeley",
1706
+ "license": "Non-commercial",
1707
+ "knowledge_cutoff": "2023-10"
1708
+ },
1709
+ "results": {
1710
+ "OVERALL": {
1711
+ "Score": 0.09387188,
1712
+ "Standard Deviation": 0.02642167489
1713
+ },
1714
+ "Geometry": {
1715
+ "Score": 0.017374001,
1716
+ "Standard Deviation": 0.01747053557
1717
+ },
1718
+ "Algebra": {
1719
+ "Score": 0.018129197,
1720
+ "Standard Deviation": 0.01054371383
1721
+ },
1722
+ "Probability": {
1723
+ "Score": 0.043654362,
1724
+ "Standard Deviation": 0.004288231886
1725
+ },
1726
+ "Logical": {
1727
+ "Score": 0.074694053,
1728
+ "Standard Deviation": 0.002674646998
1729
+ },
1730
+ "Social": {
1731
+ "Score": 0.096983835,
1732
+ "Standard Deviation": 0.007847059783
1733
+ }
1734
+ }
1735
+ },
1736
+ {
1737
+ "config": {
1738
+ "model_name": "openassistant-pythia-12b",
1739
+ "organization": "OpenAssistant",
1740
+ "license": "Non-commercial",
1741
+ "knowledge_cutoff": "2023-09"
1742
+ },
1743
+ "results": {
1744
+ "OVERALL": {
1745
+ "Score": 0.0,
1746
+ "Standard Deviation": 0.0
1747
+ },
1748
+ "Geometry": {
1749
+ "Score": 0.0,
1750
+ "Standard Deviation": 0.0
1751
+ },
1752
+ "Algebra": {
1753
+ "Score": 0.0,
1754
+ "Standard Deviation": 0.0
1755
+ },
1756
+ "Probability": {
1757
+ "Score": 0.0,
1758
+ "Standard Deviation": 0.0
1759
+ },
1760
+ "Logical": {
1761
+ "Score": 0.0,
1762
+ "Standard Deviation": 0.0
1763
+ },
1764
+ "Social": {
1765
+ "Score": 0.030792528,
1766
+ "Standard Deviation": 0.007518796391
1767
+ }
1768
+ }
1769
+ }
1770
+ ]
src/results/models_2024-10-08-03:25:44.801310.jsonl ADDED
@@ -0,0 +1,2082 @@
+ [
+ {
+ "config": {
+ "model_name": "ChatGPT-4o-latest (2024-09-03)",
+ "organization": "OpenAI",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2023/10"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.974329609,
+ "Standard Deviation": 0.005024959031,
+ "Rank": 2
+ },
+ "Geometry": {
+ "Average Score": 0.976028578,
+ "Standard Deviation": 0.01507912373,
+ "Rank": 3
+ },
+ "Algebra": {
+ "Average Score": 0.951199453,
+ "Standard Deviation": 0.08452452108,
+ "Rank": 3
+ },
+ "Probability": {
+ "Average Score": 0.842116641,
+ "Standard Deviation": 0.006267759054,
+ "Rank": 3
+ },
+ "Logical": {
+ "Average Score": 0.828490728,
+ "Standard Deviation": 0.009134213144,
+ "Rank": 3
+ },
+ "Social": {
+ "Average Score": 0.815902987,
+ "Standard Deviation": 0.0196254222,
+ "Rank": 4
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "gpt-4o-2024-08-06",
+ "organization": "OpenAI",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2023/10"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.846571548,
+ "Standard Deviation": 0.03394056554,
+ "Rank": 6
+ },
+ "Geometry": {
+ "Average Score": 0.99773096,
+ "Standard Deviation": 0.002835555172,
+ "Rank": 1
+ },
+ "Algebra": {
+ "Average Score": 1.0,
+ "Standard Deviation": 0.0,
+ "Rank": 1
+ },
+ "Probability": {
+ "Average Score": 0.78855795,
+ "Standard Deviation": 0.008188675452,
+ "Rank": 6
+ },
+ "Logical": {
+ "Average Score": 0.668635768,
+ "Standard Deviation": 0.03466314094,
+ "Rank": 11
+ },
+ "Social": {
+ "Average Score": 0.680417314,
+ "Standard Deviation": 0.00656867063,
+ "Rank": 9
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "gpt-4o-2024-05-13",
+ "organization": "OpenAI",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2023/10"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.846334477,
+ "Standard Deviation": 0.09377911572,
+ "Rank": 7
+ },
+ "Geometry": {
+ "Average Score": 0.972472377,
+ "Standard Deviation": 0.01648274205,
+ "Rank": 4
+ },
+ "Algebra": {
+ "Average Score": 0.995511298,
+ "Standard Deviation": 0.004097802515,
+ "Rank": 2
+ },
+ "Probability": {
+ "Average Score": 0.812149974,
+ "Standard Deviation": 0.007669585485,
+ "Rank": 4
+ },
+ "Logical": {
+ "Average Score": 0.755019692,
+ "Standard Deviation": 0.008149588572,
+ "Rank": 6
+ },
+ "Social": {
+ "Average Score": 0.609875087,
+ "Standard Deviation": 0.038729239,
+ "Rank": 14
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "gpt-4-turbo-2024-04-09",
+ "organization": "OpenAI",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2023/12"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.855357972,
+ "Standard Deviation": 0.1016986368,
+ "Rank": 4
+ },
+ "Geometry": {
+ "Average Score": 0.95374588,
+ "Standard Deviation": 0.03109307166,
+ "Rank": 5
+ },
+ "Algebra": {
+ "Average Score": 0.930945223,
+ "Standard Deviation": 0.06705136813,
+ "Rank": 4
+ },
+ "Probability": {
+ "Average Score": 0.750705448,
+ "Standard Deviation": 0.05944483103,
+ "Rank": 8
+ },
+ "Logical": {
+ "Average Score": 0.77906699,
+ "Standard Deviation": 0.007406734161,
+ "Rank": 4
+ },
+ "Social": {
+ "Average Score": 0.715935163,
+ "Standard Deviation": 0.1209141409,
+ "Rank": 7
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "gemini-1.5-pro-001",
+ "organization": "Google",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2024-01"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.797187842,
+ "Standard Deviation": 0.0272375249,
+ "Rank": 10
+ },
+ "Geometry": {
+ "Average Score": 0.9947169,
+ "Standard Deviation": 0.009150597621,
+ "Rank": 2
+ },
+ "Algebra": {
+ "Average Score": 0.857464301,
+ "Standard Deviation": 0.05014285338,
+ "Rank": 5
+ },
+ "Probability": {
+ "Average Score": 0.651781767,
+ "Standard Deviation": 0.04156998547,
+ "Rank": 12
+ },
+ "Logical": {
+ "Average Score": 0.739745471,
+ "Standard Deviation": 0.01631532019,
+ "Rank": 7
+ },
+ "Social": {
+ "Average Score": 0.649601885,
+ "Standard Deviation": 0.104854889,
+ "Rank": 12
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "qwen2-72b-instruct",
+ "organization": "Alibaba",
+ "license": "Qianwen LICENSE",
+ "knowledge_cutoff": "2024-02"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.737918558,
+ "Standard Deviation": 0.09069077339,
+ "Rank": 11
+ },
+ "Geometry": {
+ "Average Score": 0.796870305,
+ "Standard Deviation": 0.0509025346,
+ "Rank": 9
+ },
+ "Algebra": {
+ "Average Score": 0.836194231,
+ "Standard Deviation": 0.04517093028,
+ "Rank": 6
+ },
+ "Probability": {
+ "Average Score": 0.788068004,
+ "Standard Deviation": 0.007288989044,
+ "Rank": 7
+ },
+ "Logical": {
+ "Average Score": 0.619300904,
+ "Standard Deviation": 0.06377931612,
+ "Rank": 15
+ },
+ "Social": {
+ "Average Score": 0.652578786,
+ "Standard Deviation": 0.04259293171,
+ "Rank": 11
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "gpt-4o-mini-2024-07-18",
+ "organization": "OpenAI",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2024-07"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.847694133,
+ "Standard Deviation": 0.02164304402,
+ "Rank": 5
+ },
+ "Geometry": {
+ "Average Score": 0.946650435,
+ "Standard Deviation": 0.01831236482,
+ "Rank": 7
+ },
+ "Algebra": {
+ "Average Score": 0.796243022,
+ "Standard Deviation": 0.05537539202,
+ "Rank": 7
+ },
+ "Probability": {
+ "Average Score": 0.798402685,
+ "Standard Deviation": 0.009404491967,
+ "Rank": 5
+ },
+ "Logical": {
+ "Average Score": 0.727009735,
+ "Standard Deviation": 0.02628110141,
+ "Rank": 8
+ },
+ "Social": {
+ "Average Score": 0.691949855,
+ "Standard Deviation": 0.02072934333,
+ "Rank": 8
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "claude-3.5-sonnet",
+ "organization": "Anthropic",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2024-03"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.839004422,
+ "Standard Deviation": 0.1461079564,
+ "Rank": 8
+ },
+ "Geometry": {
+ "Average Score": 0.95316419,
+ "Standard Deviation": 0.02081192856,
+ "Rank": 6
+ },
+ "Algebra": {
+ "Average Score": 0.759789952,
+ "Standard Deviation": 0.02611765096,
+ "Rank": 8
+ },
+ "Probability": {
+ "Average Score": 0.707730127,
+ "Standard Deviation": 0.0394436664,
+ "Rank": 10
+ },
+ "Logical": {
+ "Average Score": 0.77342666,
+ "Standard Deviation": 0.002892426458,
+ "Rank": 5
+ },
+ "Social": {
+ "Average Score": 0.790002247,
+ "Standard Deviation": 0.1007410022,
+ "Rank": 5
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "o1-mini",
+ "organization": "OpenAI",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2024-01"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 1.0,
+ "Standard Deviation": 0.0,
+ "Rank": 1
+ },
+ "Geometry": {
+ "Average Score": "N/A",
+ "Standard Deviation": "N/A",
+ "Rank": "N/A"
+ },
+ "Algebra": {
+ "Average Score": "N/A",
+ "Standard Deviation": "N/A",
+ "Rank": "N/A"
+ },
+ "Probability": {
+ "Average Score": 1.0,
+ "Standard Deviation": 0.0,
+ "Rank": 1
+ },
+ "Logical": {
+ "Average Score": 1.0,
+ "Standard Deviation": 0.0,
+ "Rank": 1
+ },
+ "Social": {
+ "Average Score": 0.993974241,
+ "Standard Deviation": 0.001996882328,
+ "Rank": 2
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "o1-preview",
+ "organization": "OpenAI",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2024-01"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.945884589,
+ "Standard Deviation": 0.01059250762,
+ "Rank": 3
+ },
+ "Geometry": {
+ "Average Score": "N/A",
+ "Standard Deviation": "N/A",
+ "Rank": "N/A"
+ },
+ "Algebra": {
+ "Average Score": "N/A",
+ "Standard Deviation": "N/A",
+ "Rank": "N/A"
+ },
+ "Probability": {
+ "Average Score": 0.964666392,
+ "Standard Deviation": 0.003139983398,
+ "Rank": 2
+ },
+ "Logical": {
+ "Average Score": 0.987950057,
+ "Standard Deviation": 0.004881220327,
+ "Rank": 2
+ },
+ "Social": {
+ "Average Score": 1.0,
+ "Standard Deviation": 0.0,
+ "Rank": 1
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "gemini-1.5-flash-001",
+ "organization": "Google",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2024-02"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.726493401,
+ "Standard Deviation": 0.01113913725,
+ "Rank": 12
+ },
+ "Geometry": {
+ "Average Score": 0.804144103,
+ "Standard Deviation": 0.1327142178,
+ "Rank": 8
+ },
+ "Algebra": {
+ "Average Score": 0.731776765,
+ "Standard Deviation": 0.02594657111,
+ "Rank": 11
+ },
+ "Probability": {
+ "Average Score": 0.614461891,
+ "Standard Deviation": 0.04690131826,
+ "Rank": 15
+ },
+ "Logical": {
+ "Average Score": 0.630805991,
+ "Standard Deviation": 0.04871350612,
+ "Rank": 13
+ },
+ "Social": {
+ "Average Score": 0.555933822,
+ "Standard Deviation": 0.1029934524,
+ "Rank": 16
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "gpt4-1106",
+ "organization": "OpenAI",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2024-04"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.816347784,
+ "Standard Deviation": 0.1566815755,
+ "Rank": 9
+ },
+ "Geometry": {
+ "Average Score": 0.71843088,
+ "Standard Deviation": 0.04778038294,
+ "Rank": 13
+ },
+ "Algebra": {
+ "Average Score": 0.712910417,
+ "Standard Deviation": 0.02581828898,
+ "Rank": 12
+ },
+ "Probability": {
+ "Average Score": 0.623947619,
+ "Standard Deviation": 0.03502982933,
+ "Rank": 14
+ },
+ "Logical": {
+ "Average Score": 0.637482274,
+ "Standard Deviation": 0.04158809888,
+ "Rank": 12
+ },
+ "Social": {
+ "Average Score": 0.450609816,
+ "Standard Deviation": 0.05208655446,
+ "Rank": 23
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "gemma-2-27b-it",
+ "organization": "Google",
+ "license": "Gemma License",
+ "knowledge_cutoff": "2024-03"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.624169623,
+ "Standard Deviation": 0.1048365121,
+ "Rank": 15
+ },
+ "Geometry": {
+ "Average Score": 0.60112744,
+ "Standard Deviation": 0.0469109952,
+ "Rank": 19
+ },
+ "Algebra": {
+ "Average Score": 0.687955914,
+ "Standard Deviation": 0.01959958192,
+ "Rank": 13
+ },
+ "Probability": {
+ "Average Score": 0.589524771,
+ "Standard Deviation": 0.03112689325,
+ "Rank": 16
+ },
+ "Logical": {
+ "Average Score": 0.614978944,
+ "Standard Deviation": 0.05710657859,
+ "Rank": 16
+ },
+ "Social": {
+ "Average Score": 0.487844257,
+ "Standard Deviation": 0.05857760809,
+ "Rank": 20
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "claude-3-opus",
+ "organization": "Anthropic",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2024-01"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.650636271,
+ "Standard Deviation": 0.1197773541,
+ "Rank": 14
+ },
+ "Geometry": {
+ "Average Score": 0.7215743,
+ "Standard Deviation": 0.04712598358,
+ "Rank": 12
+ },
+ "Algebra": {
+ "Average Score": 0.68777327,
+ "Standard Deviation": 0.02382683713,
+ "Rank": 14
+ },
+ "Probability": {
+ "Average Score": 0.626471421,
+ "Standard Deviation": 0.02911817976,
+ "Rank": 13
+ },
+ "Logical": {
+ "Average Score": 0.692346381,
+ "Standard Deviation": 0.03617185198,
+ "Rank": 10
+ },
+ "Social": {
+ "Average Score": 0.663410854,
+ "Standard Deviation": 0.09540220876,
+ "Rank": 10
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "gemma-2-9b-it-simpo",
+ "organization": "Google",
+ "license": "Gemma License",
+ "knowledge_cutoff": "2024-02"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": "N/A",
+ "Standard Deviation": "N/A",
+ "Rank": "N/A"
+ },
+ "Geometry": {
+ "Average Score": 0.582787508,
+ "Standard Deviation": 0.03965204074,
+ "Rank": 20
+ },
+ "Algebra": {
+ "Average Score": 0.658648133,
+ "Standard Deviation": 0.02565919856,
+ "Rank": 15
+ },
+ "Probability": {
+ "Average Score": 0.547861265,
+ "Standard Deviation": 0.02885209131,
+ "Rank": 19
+ },
+ "Logical": {
+ "Average Score": 0.540720893,
+ "Standard Deviation": 0.01970134508,
+ "Rank": 20
+ },
+ "Social": {
+ "Average Score": 0.635266187,
+ "Standard Deviation": 0.03620021751,
+ "Rank": 13
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "qwen1.5-72b-chat",
+ "organization": "Alibaba",
+ "license": "Qianwen LICENSE",
+ "knowledge_cutoff": "2024-03"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.519549796,
+ "Standard Deviation": 0.00903634343,
+ "Rank": 18
+ },
+ "Geometry": {
+ "Average Score": 0.543139301,
+ "Standard Deviation": 0.03425202326,
+ "Rank": 24
+ },
+ "Algebra": {
+ "Average Score": 0.635228729,
+ "Standard Deviation": 0.01944043425,
+ "Rank": 16
+ },
+ "Probability": {
+ "Average Score": 0.486948658,
+ "Standard Deviation": 0.06064655315,
+ "Rank": 23
+ },
+ "Logical": {
+ "Average Score": 0.284069394,
+ "Standard Deviation": 0.02686608506,
+ "Rank": 33
+ },
+ "Social": {
+ "Average Score": 0.415007627,
+ "Standard Deviation": 0.03920053159,
+ "Rank": 24
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "qwen1.5-32b-chat",
+ "organization": "Alibaba",
+ "license": "Qianwen LICENSE",
+ "knowledge_cutoff": "2024-03"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.393789407,
+ "Standard Deviation": 0.05413770095,
+ "Rank": 29
+ },
+ "Geometry": {
+ "Average Score": 0.51086835,
+ "Standard Deviation": 0.04052471998,
+ "Rank": 27
+ },
+ "Algebra": {
+ "Average Score": 0.609003168,
+ "Standard Deviation": 0.04874143541,
+ "Rank": 17
+ },
+ "Probability": {
+ "Average Score": 0.476300002,
+ "Standard Deviation": 0.05322403912,
+ "Rank": 24
+ },
+ "Logical": {
+ "Average Score": 0.331781014,
+ "Standard Deviation": 0.004938997686,
+ "Rank": 30
+ },
+ "Social": {
676
+ "Average Score": 0.380987334,
677
+ "Standard Deviation": 0.03762251776,
678
+ "Rank": 26
679
+ }
680
+ }
681
+ },
682
+ {
683
+ "config": {
684
+ "model_name": "google-gemma-2-9b-it",
685
+ "organization": "Google",
686
+ "license": "Proprietary",
687
+ "knowledge_cutoff": "2024-01"
688
+ },
689
+ "results": {
690
+ "OVERALL": {
691
+ "Average Score": 0.489663449,
692
+ "Standard Deviation": 0.002595702019,
693
+ "Rank": 21
694
+ },
695
+ "Geometry": {
696
+ "Average Score": 0.575371308,
697
+ "Standard Deviation": 0.03556220251,
698
+ "Rank": 22
699
+ },
700
+ "Algebra": {
701
+ "Average Score": 0.597045661,
702
+ "Standard Deviation": 0.0313828123,
703
+ "Rank": 18
704
+ },
705
+ "Probability": {
706
+ "Average Score": 0.589221807,
707
+ "Standard Deviation": 0.03110811656,
708
+ "Rank": 18
709
+ },
710
+ "Logical": {
711
+ "Average Score": 0.587579897,
712
+ "Standard Deviation": 0.05512716783,
713
+ "Rank": 18
714
+ },
715
+ "Social": {
716
+ "Average Score": 0.768337958,
717
+ "Standard Deviation": 0.04078610476,
718
+ "Rank": 6
719
+ }
720
+ }
721
+ },
722
+ {
723
+ "config": {
724
+ "model_name": "yi-1.5-34b-chat",
725
+ "organization": "01 AI",
726
+ "license": "Proprietary",
727
+ "knowledge_cutoff": "2024-01"
728
+ },
729
+ "results": {
730
+ "OVERALL": {
731
+ "Average Score": 0.607812897,
732
+ "Standard Deviation": 0.1440881293,
733
+ "Rank": 16
734
+ },
735
+ "Geometry": {
736
+ "Average Score": 0.566666724,
737
+ "Standard Deviation": 0.04001381658,
738
+ "Rank": 23
739
+ },
740
+ "Algebra": {
741
+ "Average Score": 0.590997292,
742
+ "Standard Deviation": 0.03594087315,
743
+ "Rank": 19
744
+ },
745
+ "Probability": {
746
+ "Average Score": 0.589524589,
747
+ "Standard Deviation": 0.03112618772,
748
+ "Rank": 17
749
+ },
750
+ "Logical": {
751
+ "Average Score": 0.574105508,
752
+ "Standard Deviation": 0.03441737941,
753
+ "Rank": 19
754
+ },
755
+ "Social": {
756
+ "Average Score": 0.516980832,
757
+ "Standard Deviation": 0.03369347985,
758
+ "Rank": 19
759
+ }
760
+ }
761
+ },
762
+ {
763
+ "config": {
764
+ "model_name": "meta-llama-3.1-8b-instruct",
765
+ "organization": "Meta",
766
+ "license": "Llama 3.1 Community",
767
+ "knowledge_cutoff": "2024-02"
768
+ },
769
+ "results": {
770
+ "OVERALL": {
771
+ "Average Score": 0.505936324,
772
+ "Standard Deviation": 0.05286756493,
773
+ "Rank": 19
774
+ },
775
+ "Geometry": {
776
+ "Average Score": 0.522442162,
777
+ "Standard Deviation": 0.03908236317,
778
+ "Rank": 25
779
+ },
780
+ "Algebra": {
781
+ "Average Score": 0.582702645,
782
+ "Standard Deviation": 0.05002277711,
783
+ "Rank": 20
784
+ },
785
+ "Probability": {
786
+ "Average Score": 0.495001149,
787
+ "Standard Deviation": 0.05244587037,
788
+ "Rank": 22
789
+ },
790
+ "Logical": {
791
+ "Average Score": 0.443030561,
792
+ "Standard Deviation": 0.01343820628,
793
+ "Rank": 25
794
+ },
795
+ "Social": {
796
+ "Average Score": 0.329195941,
797
+ "Standard Deviation": 0.03925019528,
798
+ "Rank": 30
799
+ }
800
+ }
801
+ },
802
+ {
+ "config": {
+ "model_name": "gpt3.5-turbo-0125",
+ "organization": "OpenAI",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2023-12"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.313398088,
+ "Standard Deviation": 0.09322528606,
+ "Rank": 40
+ },
+ "Geometry": {
+ "Average Score": 0.678714519,
+ "Standard Deviation": 0.05926546762,
+ "Rank": 14
+ },
+ "Algebra": {
+ "Average Score": 0.569296173,
+ "Standard Deviation": 0.05277281097,
+ "Rank": 21
+ },
+ "Probability": {
+ "Average Score": 0.448460767,
+ "Standard Deviation": 0.05768095196,
+ "Rank": 26
+ },
+ "Logical": {
+ "Average Score": 0.148521348,
+ "Standard Deviation": 0.04033712907,
+ "Rank": 45
+ },
+ "Social": {
+ "Average Score": 0.235071541,
+ "Standard Deviation": 0.02632892457,
+ "Rank": 39
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "llama-3-70b-instruct",
+ "organization": "Meta",
+ "license": "Llama 3 Community",
+ "knowledge_cutoff": "2024-03"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.456689885,
+ "Standard Deviation": 0.01385989995,
+ "Rank": 23
+ },
+ "Geometry": {
+ "Average Score": 0.516865529,
+ "Standard Deviation": 0.03858112564,
+ "Rank": 26
+ },
+ "Algebra": {
+ "Average Score": 0.566756531,
+ "Standard Deviation": 0.03369826926,
+ "Rank": 22
+ },
+ "Probability": {
+ "Average Score": 0.513857306,
+ "Standard Deviation": 0.05453699062,
+ "Rank": 21
+ },
+ "Logical": {
+ "Average Score": 0.713796415,
+ "Standard Deviation": 0.02031215107,
+ "Rank": 9
+ },
+ "Social": {
+ "Average Score": 0.45872939,
+ "Standard Deviation": 0.05347039576,
+ "Rank": 22
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "claude-3-sonnet",
+ "organization": "Anthropic",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2024-02"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.520010833,
+ "Standard Deviation": 0.005030563799,
+ "Rank": 17
+ },
+ "Geometry": {
+ "Average Score": 0.675613638,
+ "Standard Deviation": 0.05275594408,
+ "Rank": 15
+ },
+ "Algebra": {
+ "Average Score": 0.552025728,
+ "Standard Deviation": 0.04122192409,
+ "Rank": 23
+ },
+ "Probability": {
+ "Average Score": 0.516192848,
+ "Standard Deviation": 0.04152293217,
+ "Rank": 20
+ },
+ "Logical": {
+ "Average Score": 0.588545747,
+ "Standard Deviation": 0.06068211943,
+ "Rank": 17
+ },
+ "Social": {
+ "Average Score": 0.570437582,
+ "Standard Deviation": 0.08607040862,
+ "Rank": 15
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "qwen1.5-14b-chat",
+ "organization": "Alibaba",
+ "license": "Qianwen LICENSE",
+ "knowledge_cutoff": "2024-01"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.415328996,
+ "Standard Deviation": 0.0743938717,
+ "Rank": 28
+ },
+ "Geometry": {
+ "Average Score": 0.452504016,
+ "Standard Deviation": 0.04225594393,
+ "Rank": 28
+ },
+ "Algebra": {
+ "Average Score": 0.538655725,
+ "Standard Deviation": 0.03721542594,
+ "Rank": 24
+ },
+ "Probability": {
+ "Average Score": 0.397185975,
+ "Standard Deviation": 0.05607695946,
+ "Rank": 30
+ },
+ "Logical": {
+ "Average Score": 0.264573129,
+ "Standard Deviation": 0.03936133174,
+ "Rank": 35
+ },
+ "Social": {
+ "Average Score": 0.287370142,
+ "Standard Deviation": 0.04264085315,
+ "Rank": 32
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "claude-3-haiku",
+ "organization": "Anthropic",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2024-01"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.453901163,
+ "Standard Deviation": 0.003604084261,
+ "Rank": 24
+ },
+ "Geometry": {
+ "Average Score": 0.607993912,
+ "Standard Deviation": 0.05793460748,
+ "Rank": 17
+ },
+ "Algebra": {
+ "Average Score": 0.520054055,
+ "Standard Deviation": 0.03333544511,
+ "Rank": 25
+ },
+ "Probability": {
+ "Average Score": 0.474460688,
+ "Standard Deviation": 0.0446501933,
+ "Rank": 25
+ },
+ "Logical": {
+ "Average Score": 0.512815976,
+ "Standard Deviation": 0.0163264281,
+ "Rank": 21
+ },
+ "Social": {
+ "Average Score": 0.551083976,
+ "Standard Deviation": 0.05374722539,
+ "Rank": 17
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "claude-2.1",
+ "organization": "Anthropic",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2023-12"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.35814708,
+ "Standard Deviation": 0.09168134168,
+ "Rank": 36
+ },
+ "Geometry": {
+ "Average Score": 0.62752395,
+ "Standard Deviation": 0.07232659398,
+ "Rank": 16
+ },
+ "Algebra": {
+ "Average Score": 0.508849609,
+ "Standard Deviation": 0.0346897465,
+ "Rank": 26
+ },
+ "Probability": {
+ "Average Score": 0.41477086,
+ "Standard Deviation": 0.05964060239,
+ "Rank": 29
+ },
+ "Logical": {
+ "Average Score": 0.482923674,
+ "Standard Deviation": 0.01989147048,
+ "Rank": 22
+ },
+ "Social": {
+ "Average Score": 0.333804568,
+ "Standard Deviation": 0.03775548253,
+ "Rank": 29
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "mistral-8x7b-instruct-v0.1",
+ "organization": "Mistral",
+ "license": "Apache 2.0",
+ "knowledge_cutoff": "2023-12"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.382659161,
+ "Standard Deviation": 0.07594496929,
+ "Rank": 31
+ },
+ "Geometry": {
+ "Average Score": 0.432216097,
+ "Standard Deviation": 0.04747949254,
+ "Rank": 31
+ },
+ "Algebra": {
+ "Average Score": 0.478314888,
+ "Standard Deviation": 0.01998797419,
+ "Rank": 27
+ },
+ "Probability": {
+ "Average Score": 0.427144725,
+ "Standard Deviation": 0.0590923329,
+ "Rank": 28
+ },
+ "Logical": {
+ "Average Score": 0.340041983,
+ "Standard Deviation": 0.008397574592,
+ "Rank": 28
+ },
+ "Social": {
+ "Average Score": 0.251949622,
+ "Standard Deviation": 0.03346674405,
+ "Rank": 37
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "claude-2.0",
+ "organization": "Anthropic",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2023-10"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.322718057,
+ "Standard Deviation": 0.08369883584,
+ "Rank": 38
+ },
+ "Geometry": {
+ "Average Score": 0.604141967,
+ "Standard Deviation": 0.05116441826,
+ "Rank": 18
+ },
+ "Algebra": {
+ "Average Score": 0.474350734,
+ "Standard Deviation": 0.01510393066,
+ "Rank": 28
+ },
+ "Probability": {
+ "Average Score": 0.437950412,
+ "Standard Deviation": 0.05985594317,
+ "Rank": 27
+ },
+ "Logical": {
+ "Average Score": 0.445620646,
+ "Standard Deviation": 0.01812614805,
+ "Rank": 24
+ },
+ "Social": {
+ "Average Score": 0.469422836,
+ "Standard Deviation": 0.05999901796,
+ "Rank": 21
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "starling-lm-7b-beta",
+ "organization": "Nexusflow",
+ "license": "Apache-2.0",
+ "knowledge_cutoff": "2024-01"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.479391856,
+ "Standard Deviation": 0.04199990887,
+ "Rank": 22
+ },
+ "Geometry": {
+ "Average Score": 0.446654388,
+ "Standard Deviation": 0.05637864999,
+ "Rank": 30
+ },
+ "Algebra": {
+ "Average Score": 0.473952749,
+ "Standard Deviation": 0.01584301288,
+ "Rank": 29
+ },
+ "Probability": {
+ "Average Score": 0.395197837,
+ "Standard Deviation": 0.05814798892,
+ "Rank": 31
+ },
+ "Logical": {
+ "Average Score": 0.39927199,
+ "Standard Deviation": 0.02125277518,
+ "Rank": 26
+ },
+ "Social": {
+ "Average Score": 0.380021662,
+ "Standard Deviation": 0.04622452748,
+ "Rank": 27
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "gemini-1.0-pro-001",
+ "organization": "Google",
+ "license": "Proprietary",
+ "knowledge_cutoff": "2023-11"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.449040654,
+ "Standard Deviation": 0.0450610177,
+ "Rank": 25
+ },
+ "Geometry": {
+ "Average Score": 0.578347959,
+ "Standard Deviation": 0.04242873607,
+ "Rank": 21
+ },
+ "Algebra": {
+ "Average Score": 0.462417786,
+ "Standard Deviation": 0.01668313635,
+ "Rank": 30
+ },
+ "Probability": {
+ "Average Score": 0.289836324,
+ "Standard Deviation": 0.05739831115,
+ "Rank": 39
+ },
+ "Logical": {
+ "Average Score": 0.191140355,
+ "Standard Deviation": 0.03394652499,
+ "Rank": 41
+ },
+ "Social": {
+ "Average Score": 0.130790863,
+ "Standard Deviation": 0.02800188173,
+ "Rank": 47
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "openchat-3.5-0106",
+ "organization": "OpenChat",
+ "license": "Apache-2.0",
+ "knowledge_cutoff": "2024-01"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.363929888,
+ "Standard Deviation": 0.08602347145,
+ "Rank": 34
+ },
+ "Geometry": {
+ "Average Score": 0.38715246,
+ "Standard Deviation": 0.03701851946,
+ "Rank": 34
+ },
+ "Algebra": {
+ "Average Score": 0.441233712,
+ "Standard Deviation": 0.01135753754,
+ "Rank": 31
+ },
+ "Probability": {
+ "Average Score": 0.38802618,
+ "Standard Deviation": 0.05663879714,
+ "Rank": 32
+ },
+ "Logical": {
+ "Average Score": 0.336754383,
+ "Standard Deviation": 0.01608478079,
+ "Rank": 29
+ },
+ "Social": {
+ "Average Score": 0.250891608,
+ "Standard Deviation": 0.03253769914,
+ "Rank": 38
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "openchat-3.5",
+ "organization": "OpenChat",
+ "license": "Apache-2.0",
+ "knowledge_cutoff": "2023-12"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.361341296,
+ "Standard Deviation": 0.09034869493,
+ "Rank": 35
+ },
+ "Geometry": {
+ "Average Score": 0.401699069,
+ "Standard Deviation": 0.03410726557,
+ "Rank": 32
+ },
+ "Algebra": {
+ "Average Score": 0.414095336,
+ "Standard Deviation": 0.01881964261,
+ "Rank": 33
+ },
+ "Probability": {
+ "Average Score": 0.349601002,
+ "Standard Deviation": 0.05077455539,
+ "Rank": 34
+ },
+ "Logical": {
+ "Average Score": 0.331069242,
+ "Standard Deviation": 0.02180827173,
+ "Rank": 31
+ },
+ "Social": {
+ "Average Score": 0.319991655,
+ "Standard Deviation": 0.04502478724,
+ "Rank": 31
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "command-r-(08-2024)",
+ "organization": "Cohere",
+ "license": "CC-BY-NC-4.0",
+ "knowledge_cutoff": "2024-08"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.427605298,
+ "Standard Deviation": 0.01747449163,
+ "Rank": 26
+ },
+ "Geometry": {
+ "Average Score": 0.448300727,
+ "Standard Deviation": 0.04996362328,
+ "Rank": 29
+ },
+ "Algebra": {
+ "Average Score": 0.417519167,
+ "Standard Deviation": 0.01822196902,
+ "Rank": 32
+ },
+ "Probability": {
+ "Average Score": 0.366336281,
+ "Standard Deviation": 0.04716826942,
+ "Rank": 33
+ },
+ "Logical": {
+ "Average Score": 0.214657906,
+ "Standard Deviation": 0.03003579835,
+ "Rank": 38
+ },
+ "Social": {
+ "Average Score": 0.276088379,
+ "Standard Deviation": 0.03295234688,
+ "Rank": 34
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "gemma-1.1-7b-it",
+ "organization": "Google",
+ "license": "Gemma License",
+ "knowledge_cutoff": "2023-11"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.339506922,
+ "Standard Deviation": 0.1066279108,
+ "Rank": 37
+ },
+ "Geometry": {
+ "Average Score": 0.324170977,
+ "Standard Deviation": 0.04668553765,
+ "Rank": 37
+ },
+ "Algebra": {
+ "Average Score": 0.398684697,
+ "Standard Deviation": 0.01982398259,
+ "Rank": 34
+ },
+ "Probability": {
+ "Average Score": 0.293253175,
+ "Standard Deviation": 0.05126192191,
+ "Rank": 38
+ },
+ "Logical": {
+ "Average Score": 0.317750796,
+ "Standard Deviation": 0.01101933543,
+ "Rank": 32
+ },
+ "Social": {
+ "Average Score": 0.179073276,
+ "Standard Deviation": 0.02009658805,
+ "Rank": 43
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "llama3-8b-instruct",
+ "organization": "Meta",
+ "license": "Llama 3 Community",
+ "knowledge_cutoff": "2024-01"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.367722676,
+ "Standard Deviation": 0.1071368221,
+ "Rank": 32
+ },
+ "Geometry": {
+ "Average Score": 0.367143758,
+ "Standard Deviation": 0.04363680358,
+ "Rank": 35
+ },
+ "Algebra": {
+ "Average Score": 0.391480973,
+ "Standard Deviation": 0.02757445266,
+ "Rank": 35
+ },
+ "Probability": {
+ "Average Score": 0.317616445,
+ "Standard Deviation": 0.04300430361,
+ "Rank": 37
+ },
+ "Logical": {
+ "Average Score": 0.461607495,
+ "Standard Deviation": 0.02185028842,
+ "Rank": 23
+ },
+ "Social": {
+ "Average Score": 0.336373622,
+ "Standard Deviation": 0.05762408512,
+ "Rank": 28
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "gemma-2-2b-it",
+ "organization": "Google",
+ "license": "Gemma License",
+ "knowledge_cutoff": "2023-12"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.502167612,
+ "Standard Deviation": 0.04389786763,
+ "Rank": 20
+ },
+ "Geometry": {
+ "Average Score": 0.395006676,
+ "Standard Deviation": 0.05882607713,
+ "Rank": 33
+ },
+ "Algebra": {
+ "Average Score": 0.379391887,
+ "Standard Deviation": 0.01722410785,
+ "Rank": 36
+ },
+ "Probability": {
+ "Average Score": 0.331231097,
+ "Standard Deviation": 0.05392499987,
+ "Rank": 36
+ },
+ "Logical": {
+ "Average Score": 0.367687789,
+ "Standard Deviation": 0.02547968808,
+ "Rank": 27
+ },
+ "Social": {
+ "Average Score": 0.393482094,
+ "Standard Deviation": 0.06450214024,
+ "Rank": 25
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "starling-lm-7b-alpha",
+ "organization": "Nexusflow",
+ "license": "Apache-2.0",
+ "knowledge_cutoff": "2023-12"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.366628765,
+ "Standard Deviation": 0.08405492929,
+ "Rank": 33
+ },
+ "Geometry": {
+ "Average Score": 0.336782578,
+ "Standard Deviation": 0.04069449132,
+ "Rank": 36
+ },
+ "Algebra": {
+ "Average Score": 0.371551932,
+ "Standard Deviation": 0.03367241745,
+ "Rank": 37
+ },
+ "Probability": {
+ "Average Score": 0.331472505,
+ "Standard Deviation": 0.04833324282,
+ "Rank": 35
+ },
+ "Logical": {
+ "Average Score": 0.260869624,
+ "Standard Deviation": 0.03562735237,
+ "Rank": 36
+ },
+ "Social": {
+ "Average Score": 0.271975534,
+ "Standard Deviation": 0.04266753408,
+ "Rank": 35
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "qwen1.5-4b-chat",
+ "organization": "Alibaba",
+ "license": "Qianwen LICENSE",
+ "knowledge_cutoff": "2024-02"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.111876411,
+ "Standard Deviation": 0.04241022785,
+ "Rank": 49
+ },
+ "Geometry": {
+ "Average Score": 0.215834522,
+ "Standard Deviation": 0.0363766363,
+ "Rank": 41
+ },
+ "Algebra": {
+ "Average Score": 0.305589811,
+ "Standard Deviation": 0.02354198912,
+ "Rank": 38
+ },
+ "Probability": {
+ "Average Score": 0.149365327,
+ "Standard Deviation": 0.03489672675,
+ "Rank": 45
+ },
+ "Logical": {
+ "Average Score": 0.116210168,
+ "Standard Deviation": 0.005927966496,
+ "Rank": 47
+ },
+ "Social": {
+ "Average Score": 0.18195615,
+ "Standard Deviation": 0.02269805277,
+ "Rank": 42
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "command-r-(04-2024)",
+ "organization": "Cohere",
+ "license": "CC-BY-NC-4.0",
+ "knowledge_cutoff": "2024-04"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.388783887,
+ "Standard Deviation": 0.07417186783,
+ "Rank": 30
+ },
+ "Geometry": {
+ "Average Score": 0.300416698,
+ "Standard Deviation": 0.03485612736,
+ "Rank": 38
+ },
+ "Algebra": {
+ "Average Score": 0.293120231,
+ "Standard Deviation": 0.032926484,
+ "Rank": 39
+ },
+ "Probability": {
+ "Average Score": 0.281271304,
+ "Standard Deviation": 0.05697149867,
+ "Rank": 40
+ },
+ "Logical": {
+ "Average Score": 0.276189906,
+ "Standard Deviation": 0.03562914754,
+ "Rank": 34
+ },
+ "Social": {
+ "Average Score": 0.283882949,
+ "Standard Deviation": 0.03336901148,
+ "Rank": 33
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "vicuna-33b",
+ "organization": "LMSYS",
+ "license": "Non-commercial",
+ "knowledge_cutoff": "2023-12"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.316543555,
+ "Standard Deviation": 0.08922095647,
+ "Rank": 39
+ },
+ "Geometry": {
+ "Average Score": 0.208284679,
+ "Standard Deviation": 0.03937771461,
+ "Rank": 42
+ },
+ "Algebra": {
+ "Average Score": 0.248994048,
+ "Standard Deviation": 0.02668175054,
+ "Rank": 41
+ },
+ "Probability": {
+ "Average Score": 0.222313995,
+ "Standard Deviation": 0.03978859759,
+ "Rank": 43
+ },
+ "Logical": {
+ "Average Score": 0.180291222,
+ "Standard Deviation": 0.021886267,
+ "Rank": 42
+ },
+ "Social": {
+ "Average Score": 0.257623798,
+ "Standard Deviation": 0.02653724437,
+ "Rank": 36
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "gemma-7b-it",
+ "organization": "Google",
+ "license": "Gemma License",
+ "knowledge_cutoff": "2023-12"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.285077558,
+ "Standard Deviation": 0.08871758453,
+ "Rank": 41
+ },
+ "Geometry": {
+ "Average Score": 0.244791417,
+ "Standard Deviation": 0.0289612078,
+ "Rank": 39
+ },
+ "Algebra": {
+ "Average Score": 0.250614794,
+ "Standard Deviation": 0.01991678295,
+ "Rank": 40
+ },
+ "Probability": {
+ "Average Score": 0.174313053,
+ "Standard Deviation": 0.03765424728,
+ "Rank": 44
+ },
+ "Logical": {
+ "Average Score": 0.197505536,
+ "Standard Deviation": 0.02050298885,
+ "Rank": 39
+ },
+ "Social": {
+ "Average Score": 0.202138025,
+ "Standard Deviation": 0.02098346639,
+ "Rank": 41
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "mistral-7b-instruct-2",
+ "organization": "Mistral",
+ "license": "Apache 2.0",
+ "knowledge_cutoff": "2023-12"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.427513868,
+ "Standard Deviation": 0.05553921135,
+ "Rank": 27
+ },
+ "Geometry": {
+ "Average Score": 0.216402626,
+ "Standard Deviation": 0.03338414918,
+ "Rank": 40
+ },
+ "Algebra": {
+ "Average Score": 0.233777838,
+ "Standard Deviation": 0.0155226054,
+ "Rank": 42
+ },
+ "Probability": {
+ "Average Score": 0.25118175,
+ "Standard Deviation": 0.04065514593,
+ "Rank": 41
+ },
+ "Logical": {
+ "Average Score": 0.224469136,
+ "Standard Deviation": 0.03404706752,
+ "Rank": 37
+ },
+ "Social": {
+ "Average Score": 0.209386782,
+ "Standard Deviation": 0.02738569921,
+ "Rank": 40
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "mistral-7b-instruct-1",
+ "organization": "Mistral",
+ "license": "Apache 2.0",
+ "knowledge_cutoff": "2023-12"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.23016314,
+ "Standard Deviation": 0.07137625271,
+ "Rank": 46
+ },
+ "Geometry": {
+ "Average Score": 0.161799938,
+ "Standard Deviation": 0.03595278559,
+ "Rank": 46
+ },
+ "Algebra": {
+ "Average Score": 0.210341624,
+ "Standard Deviation": 0.01736539119,
+ "Rank": 43
+ },
+ "Probability": {
+ "Average Score": 0.238417922,
+ "Standard Deviation": 0.03744211933,
+ "Rank": 42
+ },
+ "Logical": {
+ "Average Score": 0.142636601,
+ "Standard Deviation": 0.02080406365,
+ "Rank": 46
+ },
+ "Social": {
+ "Average Score": 0.117646827,
+ "Standard Deviation": 0.009321202779,
+ "Rank": 49
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "vicuna-13b",
+ "organization": "LMSYS",
+ "license": "Non-commercial",
+ "knowledge_cutoff": "2023-11"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.201892849,
+ "Standard Deviation": 0.06021749802,
+ "Rank": 47
+ },
+ "Geometry": {
+ "Average Score": 0.200941928,
+ "Standard Deviation": 0.03366817781,
+ "Rank": 43
+ },
+ "Algebra": {
+ "Average Score": 0.196123323,
+ "Standard Deviation": 0.0135715643,
+ "Rank": 44
+ },
+ "Probability": {
+ "Average Score": 0.141214079,
+ "Standard Deviation": 0.02721328211,
+ "Rank": 46
+ },
+ "Logical": {
+ "Average Score": 0.148598631,
+ "Standard Deviation": 0.02241523892,
+ "Rank": 44
+ },
+ "Social": {
+ "Average Score": 0.124655135,
+ "Standard Deviation": 0.01122382671,
+ "Rank": 48
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "zephyr-7b-beta",
+ "organization": "HuggingFace",
+ "license": "MIT",
+ "knowledge_cutoff": "2023-10"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.102705119,
+ "Standard Deviation": 0.03683757312,
+ "Rank": 50
+ },
+ "Geometry": {
+ "Average Score": 0.114005544,
+ "Standard Deviation": 0.03144354365,
+ "Rank": 47
+ },
+ "Algebra": {
+ "Average Score": 0.141766633,
+ "Standard Deviation": 0.03179520129,
+ "Rank": 45
+ },
+ "Probability": {
+ "Average Score": 0.089050714,
+ "Standard Deviation": 0.002136754266,
+ "Rank": 49
+ },
+ "Logical": {
+ "Average Score": 0.069520789,
+ "Standard Deviation": 0.004477840857,
+ "Rank": 51
+ },
+ "Social": {
+ "Average Score": 0.0,
+ "Standard Deviation": 0.0,
+ "Rank": 54
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "gemma-1.1-2b-it",
+ "organization": "Google",
+ "license": "Gemma License",
+ "knowledge_cutoff": "2023-12"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.257700845,
+ "Standard Deviation": 0.07369021445,
+ "Rank": 44
+ },
+ "Geometry": {
+ "Average Score": 0.183974034,
+ "Standard Deviation": 0.0215548886,
+ "Rank": 45
+ },
+ "Algebra": {
+ "Average Score": 0.13422252,
+ "Standard Deviation": 0.01922819511,
+ "Rank": 46
+ },
+ "Probability": {
+ "Average Score": 0.095628657,
+ "Standard Deviation": 0.007536076456,
+ "Rank": 48
+ },
+ "Logical": {
+ "Average Score": 0.094965074,
+ "Standard Deviation": 0.005019175487,
+ "Rank": 49
+ },
+ "Social": {
+ "Average Score": 0.167796727,
+ "Standard Deviation": 0.01666541942,
+ "Rank": 44
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "llama2-7b-chat",
+ "organization": "Meta",
+ "license": "Llama 2 Community",
+ "knowledge_cutoff": "2023-10"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.260189428,
+ "Standard Deviation": 0.08019299364,
+ "Rank": 43
+ },
+ "Geometry": {
+ "Average Score": 0.087067276,
+ "Standard Deviation": 0.04274343402,
+ "Rank": 48
+ },
+ "Algebra": {
+ "Average Score": 0.12308805,
+ "Standard Deviation": 0.01856053622,
+ "Rank": 47
+ },
+ "Probability": {
+ "Average Score": 0.087515438,
+ "Standard Deviation": 0.006315053573,
+ "Rank": 50
+ },
+ "Logical": {
+ "Average Score": 0.17312827,
+ "Standard Deviation": 0.01867044092,
+ "Rank": 43
+ },
+ "Social": {
+ "Average Score": 0.152905272,
+ "Standard Deviation": 0.007166957097,
+ "Rank": 45
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "gemma-2b-it",
+ "organization": "Google",
+ "license": "Gemma License",
+ "knowledge_cutoff": "2023-11"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.234172069,
+ "Standard Deviation": 0.06522685718,
+ "Rank": 45
+ },
+ "Geometry": {
+ "Average Score": 0.198571153,
+ "Standard Deviation": 0.01699161031,
+ "Rank": 44
+ },
+ "Algebra": {
+ "Average Score": 0.109883009,
+ "Standard Deviation": 0.01520005833,
+ "Rank": 48
+ },
+ "Probability": {
+ "Average Score": 0.06467432,
+ "Standard Deviation": 0.002117497231,
+ "Rank": 52
+ },
+ "Logical": {
+ "Average Score": 0.039624492,
+ "Standard Deviation": 0.007606972686,
+ "Rank": 52
+ },
+ "Social": {
+ "Average Score": 0.087452913,
+ "Standard Deviation": 0.008170146562,
+ "Rank": 52
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "llama2-13b-chat",
+ "organization": "Meta",
+ "license": "Llama 2 Community",
+ "knowledge_cutoff": "2023-12"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.263305684,
+ "Standard Deviation": 0.07283640689,
+ "Rank": 42
+ },
+ "Geometry": {
+ "Average Score": 0.072729954,
+ "Standard Deviation": 0.02315988261,
+ "Rank": 50
+ },
+ "Algebra": {
+ "Average Score": 0.080371692,
+ "Standard Deviation": 0.01277569453,
+ "Rank": 49
+ },
+ "Probability": {
+ "Average Score": 0.117757344,
+ "Standard Deviation": 0.02418619619,
+ "Rank": 47
+ },
+ "Logical": {
+ "Average Score": 0.193149889,
+ "Standard Deviation": 0.01776690764,
+ "Rank": 40
+ },
+ "Social": {
+ "Average Score": 0.149125922,
+ "Standard Deviation": 0.01157416827,
+ "Rank": 46
+ }
+ }
+ },
+ {
+ "config": {
+ "model_name": "vicuna-7b",
+ "organization": "LMSYS",
+ "license": "Non-commercial",
+ "knowledge_cutoff": "2023-11"
+ },
+ "results": {
+ "OVERALL": {
+ "Average Score": 0.198839786,
1972
+ "Standard Deviation": 0.05725381576,
1973
+ "Rank": 48
1974
+ },
1975
+ "Geometry": {
1976
+ "Average Score": 0.083457058,
1977
+ "Standard Deviation": 0.02520989111,
1978
+ "Rank": 49
1979
+ },
1980
+ "Algebra": {
1981
+ "Average Score": 0.070883882,
1982
+ "Standard Deviation": 0.007315853253,
1983
+ "Rank": 50
1984
+ },
1985
+ "Probability": {
1986
+ "Average Score": 0.080987673,
1987
+ "Standard Deviation": 0.005474288861,
1988
+ "Rank": 51
1989
+ },
1990
+ "Logical": {
1991
+ "Average Score": 0.100065588,
1992
+ "Standard Deviation": 0.003561886452,
1993
+ "Rank": 48
1994
+ },
1995
+ "Social": {
1996
+ "Average Score": 0.111076414,
1997
+ "Standard Deviation": 0.004805626512,
1998
+ "Rank": 50
1999
+ }
2000
+ }
2001
+ },
2002
+ {
2003
+ "config": {
2004
+ "model_name": "koala-13b",
2005
+ "organization": "UC Berkeley",
2006
+ "license": "Non-commercial",
2007
+ "knowledge_cutoff": "2023-10"
2008
+ },
2009
+ "results": {
2010
+ "OVERALL": {
2011
+ "Average Score": 0.09387188,
2012
+ "Standard Deviation": 0.02642167489,
2013
+ "Rank": 51
2014
+ },
2015
+ "Geometry": {
2016
+ "Average Score": 0.017374001,
2017
+ "Standard Deviation": 0.01747053557,
2018
+ "Rank": 51
2019
+ },
2020
+ "Algebra": {
2021
+ "Average Score": 0.018129197,
2022
+ "Standard Deviation": 0.01054371383,
2023
+ "Rank": 51
2024
+ },
2025
+ "Probability": {
2026
+ "Average Score": 0.043654362,
2027
+ "Standard Deviation": 0.004288231886,
2028
+ "Rank": 53
2029
+ },
2030
+ "Logical": {
2031
+ "Average Score": 0.074694053,
2032
+ "Standard Deviation": 0.002674646998,
2033
+ "Rank": 50
2034
+ },
2035
+ "Social": {
2036
+ "Average Score": 0.096983835,
2037
+ "Standard Deviation": 0.007847059783,
2038
+ "Rank": 51
2039
+ }
2040
+ }
2041
+ },
2042
+ {
2043
+ "config": {
2044
+ "model_name": "openassistant-pythia-12b",
2045
+ "organization": "OpenAssistant",
2046
+ "license": "Non-commercial",
2047
+ "knowledge_cutoff": "2023-09"
2048
+ },
2049
+ "results": {
2050
+ "OVERALL": {
2051
+ "Average Score": 0.0,
2052
+ "Standard Deviation": 0.0,
2053
+ "Rank": 52
2054
+ },
2055
+ "Geometry": {
2056
+ "Average Score": 0.0,
2057
+ "Standard Deviation": 0.0,
2058
+ "Rank": 52
2059
+ },
2060
+ "Algebra": {
2061
+ "Average Score": 0.0,
2062
+ "Standard Deviation": 0.0,
2063
+ "Rank": 52
2064
+ },
2065
+ "Probability": {
2066
+ "Average Score": 0.0,
2067
+ "Standard Deviation": 0.0,
2068
+ "Rank": 54
2069
+ },
2070
+ "Logical": {
2071
+ "Average Score": 0.0,
2072
+ "Standard Deviation": 0.0,
2073
+ "Rank": 53
2074
+ },
2075
+ "Social": {
2076
+ "Average Score": 0.030792528,
2077
+ "Standard Deviation": 0.007518796391,
2078
+ "Rank": 53
2079
+ }
2080
+ }
2081
+ }
2082
+ ]