Update README.md
README.md
CHANGED
@@ -20,7 +20,46 @@ Thanks so much for your attention, a report with all the technical details leadi
|
|
20 |
|
21 |
## Evaluation
|
22 |
First, we evaluate Hammer series on the Berkeley Function-Calling Leaderboard (BFCL):
|
23 |
-
24 |
|
25 |
In addition, we evaluated our Hammer2.0 series (0.5b, 1.5b, 3b, 7b) on other academic benchmarks to further show our model's generalization ability:
|
26 |
|
|
|
20 |
|
21 |
## Evaluation
|
22 |
First, we evaluate Hammer series on the Berkeley Function-Calling Leaderboard (BFCL):
|
23 |
+
| | | | | | | | | | | | | | | | Multi Turn | | | | | | Hallucination Measurement | | | | | | |
|
24 |
+
|:----:|:-----------:|:---------------------------------------------:|:--------------:|:------:|:--------:|:--------:|:-----------------:|:---------------:|:------:|:--------:|:--------:|:-----------------:|:-----------:|:------:|:----------:|:--------:|:-----------------:|:-----------:|:----:|:---------:|:-------------------------:|:------------:|:---------:|:---------:|-------------|---------------|------------------------|
|
25 |
+
| | | | Non-live (AST) | | | | | Non-live (Exec) | | | | | Live (AST) | | | | | Multi turn | | | | | | | | | |
|
26 |
+
| Rank | Overall Acc | Model | AST Summary | Simple | Multiple | Parallel | Multiple Parallel | Exec Summary | Simple | Multiple | Parallel | Multiple Parallel | Overall Acc | Simple | Multiple | Parallel | Multiple Parallel | Overall Acc | Base | Miss Func | Miss Param | Long Context | Composite | Relevance | Irrelevance | Organization | License |
|
27 |
+
| 1 | 59.49 | GPT-4-turbo-2024-04-09 (FC) | 82.65 | 60.58 | 91 | 90 | 89 | 83.8 | 88.71 | 88 | 86 | 72.5 | 73.39 | 67.83 | 74.45 | 75 | 62.5 | 21.62 | 33.5 | 3.5 | 20 | 29.5 | N/A | 70.73 | 79.79 | OpenAI | Proprietary |
|
28 |
+
| 2 | 59.29 | GPT-4o-2024-08-06 (FC) | 85.52 | 73.58 | 92.5 | 91.5 | 84.5 | 82.96 | 85.36 | 90 | 84 | 72.5 | 71.79 | 67.83 | 69.43 | 75 | 66.67 | 21.25 | 31 | 5 | 19.5 | 29.5 | N/A | 63.41 | 82.91 | OpenAI | Proprietary |
|
29 |
+
| 3 | 59.13 | xLAM-8x22b-r (FC) | 89.75 | 77 | 95.5 | 92.5 | 94 | 89.32 | 98.29 | 94 | 90 | 75 | 72.81 | 70.93 | 77.72 | 75 | 75 | 15.62 | 21.5 | 3.5 | 17 | 20.5 | N/A | 97.56 | 75.23 | Salesforce | cc-by-nc-4.0 |
|
30 |
+
| 4 | 58.45 | GPT-4o-mini-2024-07-18 (FC) | 82.83 | 67.83 | 90.5 | 89.5 | 83.5 | 81.8 | 83.21 | 92 | 82 | 70 | 67.53 | 67.83 | 69.82 | 81.25 | 70.83 | 25.75 | 36.5 | 9.5 | 24.5 | 32.5 | N/A | 82.93 | 71.83 | OpenAI | Proprietary |
|
31 |
+
| 5 | 57.94 | xLAM-8x7b-r (FC) | 88.44 | 77.25 | 95.5 | 92 | 89 | 85.89 | 91.57 | 94 | 88 | 70 | 71.97 | 68.99 | 76.18 | 50 | 75 | 15.75 | 18.5 | 8 | 15.5 | 21 | N/A | 92.68 | 72.35 | Salesforce | cc-by-nc-4.0 |
|
32 |
+
| 6 | 57.21 | GPT-4o-mini-2024-07-18 (Prompt) | 86.54 | 79.67 | 89.5 | 89 | 88 | 87.95 | 98.29 | 94 | 82 | 77.5 | 72.77 | 72.09 | 73.77 | 81.25 | 70.83 | 11.62 | 15 | 1.5 | 13 | 17 | N/A | 80.49 | 79.2 | OpenAI | Proprietary |
|
33 |
+
| | 56.96 | MadeAgents/Hammer2.0-7b (FC) | 90.33 | 79.83 | 95 | 94 | 92.5 | 82.2 | 83.29 | 92 | 86 | 67.5 | 68.99 | 67.83 | 76.28 | 75 | 70.83 | 16.5 | 21.5 | 7.5 | 19 | 18 | N/A | 92.68 | 68.88 | MadeAgents | cc-by-nc-4.0 |
|
34 |
+
| 7 | 55.82 | mistral-large-2407 (FC) | 84.12 | 57.5 | 94 | 93 | 92 | 83.09 | 76.86 | 92 | 86 | 77.5 | 67.17 | 79.07 | 78.88 | 87.5 | 75 | 20.5 | 29 | 13 | 19.5 | 20.5 | N/A | 78.05 | 48.93 | Mistral AI | Proprietary |
|
35 |
+
| 8 | 55.67 | GPT-4-turbo-2024-04-09 (Prompt) | 91.31 | 82.25 | 94.5 | 95 | 93.5 | 88.12 | 99 | 96 | 80 | 77.5 | 67.97 | 78.68 | 83.12 | 81.25 | 75 | 10.62 | 12.5 | 5.5 | 11 | 13.5 | N/A | 82.93 | 61.82 | OpenAI | Proprietary |
|
36 |
+
| 9 | 54.83 | Claude-3.5-Sonnet-20240620 (FC) | 70.35 | 75.42 | 93.5 | 62 | 50.5 | 66.34 | 95.36 | 86 | 44 | 40 | 71.39 | 72.48 | 70.68 | 68.75 | 75 | 23.5 | 30.5 | 8 | 27 | 28.5 | N/A | 63.41 | 75.91 | Anthropic | Proprietary |
|
37 |
+
| 10 | 53.66 | GPT-4o-2024-08-06 (Prompt) | 80.9 | 64.08 | 86.5 | 88 | 85 | 77.89 | 70.57 | 88 | 78 | 75 | 73.88 | 67.44 | 67.21 | 56.25 | 58.33 | 6.12 | 9 | 1 | 7.5 | 7 | N/A | 53.66 | 89.56 | OpenAI | Proprietary |
|
38 |
+
| 11 | 53.43 | o1-mini-2024-09-12 (Prompt) | 75.48 | 68.92 | 89 | 73.5 | 70.5 | 76.86 | 78.93 | 88 | 78 | 62.5 | 71.17 | 62.79 | 65.09 | 68.75 | 58.33 | 11 | 16 | 2 | 12.5 | 13.5 | N/A | 46.34 | 88.07 | OpenAI | Proprietary |
|
39 |
+
| 12 | 53.01 | Gemini-1.5-Flash-Preview-0514 (FC) | 77.1 | 65.42 | 94.5 | 71.5 | 77 | 71.23 | 57.93 | 84 | 78 | 65 | 71.17 | 62.79 | 72.61 | 56.25 | 54.17 | 13.12 | 17.5 | 4 | 15.5 | 15.5 | N/A | 60.98 | 76.15 | Google | Proprietary |
|
40 |
+
| 13 | 52.53 | Gemini-1.5-Pro-Preview-0514 (FC) | 75.54 | 50.17 | 89.5 | 83.5 | 79 | 77.46 | 71.86 | 86 | 82 | 70 | 69.26 | 60.08 | 66.35 | 75 | 54.17 | 10.87 | 15.5 | 1.5 | 11 | 15.5 | N/A | 60.98 | 80.56 | Google | Proprietary |
|
41 |
+
| | 51.94 | MadeAgents/Hammer2.0-1.5b (FC) | 84.31 | 75.25 | 92.5 | 87.5 | 82 | 81.8 | 83.71 | 90 | 86 | 67.5 | 63.17 | 64.73 | 67.31 | 50 | 66.67 | 11.38 | 14 | 7 | 12 | 12.5 | N/A | 92.68 | 61.83 | MadeAgents | cc-by-nc-4.0 |
|
42 |
+
| 14 | 51.93 | GPT-3.5-Turbo-0125 (FC) | 84.52 | 74.08 | 93 | 87.5 | 83.5 | 81.66 | 95.14 | 88 | 86 | 57.5 | 59 | 65.5 | 74.16 | 56.25 | 54.17 | 19.12 | 30 | 7.5 | 23 | 16 | N/A | 97.56 | 35.83 | OpenAI | Proprietary |
|
43 |
+
| 15 | 51.78 | FireFunction-v2 (FC) | 85.71 | 78.83 | 92 | 91 | 81 | 84.23 | 94.43 | 88 | 82 | 72.5 | 61.71 | 69.38 | 70.97 | 56.25 | 54.17 | 11.62 | 21.5 | 1.5 | 17.5 | 6 | N/A | 87.8 | 52.94 | Fireworks | Apache 2.0 |
|
44 |
+
| 16 | 51.78 | Open-Mistral-Nemo-2407 (FC) | 80.98 | 60.92 | 92 | 85.5 | 85.5 | 81.46 | 91.36 | 86 | 86 | 62.5 | 61.44 | 68.22 | 67.98 | 75 | 62.5 | 14.25 | 21 | 10 | 13.5 | 12.5 | N/A | 65.85 | 59.14 | Mistral AI | Proprietary |
|
45 |
+
| 17 | 51.45 | xLAM-7b-fc-r (FC) | 86.83 | 77.33 | 92.5 | 91.5 | 86 | 85.02 | 91.57 | 88 | 88 | 72.5 | 68.81 | 63.57 | 63.36 | 56.25 | 50 | 0 | 0 | 0 | 0 | 0 | N/A | 80.49 | 79.76 | Salesforce | cc-by-nc-4.0 |
|
46 |
+
| 18 | 51.01 | Gorilla-OpenFunctions-v2 (FC) | 87.29 | 77.67 | 95 | 89 | 87.5 | 84.96 | 95.86 | 96 | 78 | 70 | 68.59 | 63.95 | 63.93 | 62.5 | 45.83 | 0 | 0 | 0 | 0 | 0 | N/A | 85.37 | 73.13 | Gorilla LLM | Apache 2.0 |
|
47 |
+
| | 49.88 | MadeAgents/Hammer2.0-3b (FC) | 86.77 | 77.08 | 92.5 | 89.5 | 88 | 80.25 | 81.5 | 86 | 86 | 67.5 | 66.06 | 63.95 | 72.81 | 56.25 | 66.67 | 0.5 | 1 | 0 | 0.5 | 0.5 | N/A | 92.68 | 68.59 | MadeAgents | cc-by-nc-4.0 |
|
48 |
+
| 19 | 49.63 | Claude-3-Opus-20240229 (FC tools-2024-04-04) | 58.4 | 74.08 | 89.5 | 35 | 35 | 63.16 | 84.64 | 86 | 52 | 30 | 70.5 | 64.73 | 70.4 | 43.75 | 20.83 | 15.62 | 22 | 4 | 14.5 | 22 | N/A | 73.17 | 76.4 | Anthropic | Proprietary |
|
49 |
+
| 20 | 49.55 | Meta-Llama-3-70B-Instruct (Prompt) | 87.21 | 75.83 | 94.5 | 91.5 | 87 | 87.41 | 94.14 | 94 | 84 | 77.5 | 63.39 | 69.77 | 78.01 | 75 | 66.67 | 1.12 | 1.5 | 1.5 | 1 | 0.5 | N/A | 92.68 | 50.63 | Meta | Meta Llama 3 Community |
|
50 |
+
| 21 | 48.14 | Command-R-Plus (Prompt) (Original) | 75.54 | 71.17 | 85 | 80 | 66 | 77.57 | 91.29 | 86 | 78 | 55 | 67.88 | 65.12 | 71.26 | 75 | 58.33 | 0.25 | 0.5 | 0 | 0 | 0.5 | N/A | 75.61 | 69.31 | Cohere For AI | cc-by-nc-4.0 |
|
51 |
+
| 22 | 47.66 | Granite-20b-FunctionCalling (FC) | 82.67 | 73.17 | 92 | 84 | 81.5 | 82.96 | 85.36 | 90 | 84 | 72.5 | 55.89 | 57.36 | 54.1 | 37.5 | 54.17 | 3.63 | 4.5 | 1.5 | 3.5 | 5 | N/A | 95.12 | 72.43 | IBM | Apache-2.0 |
|
52 |
+
| 23 | 45.88 | Hermes-2-Pro-Llama-3-70B (FC) | 81.73 | 65.92 | 80.5 | 90.5 | 90 | 81.29 | 80.64 | 88 | 84 | 72.5 | 58.6 | 66.67 | 62.49 | 50 | 66.67 | 0.25 | 0.5 | 0 | 0 | 0.5 | N/A | 80.49 | 53.8 | NousResearch | apache-2.0 |
|
53 |
+
| 24 | 45.4 | xLAM-1b-fc-r (FC) | 79.17 | 73.17 | 89.5 | 77.5 | 76.5 | 80.5 | 78 | 88 | 86 | 70 | 57.57 | 56.59 | 56.12 | 50 | 58.33 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | N/A | 95.12 | 61.26 | Salesforce | cc-by-nc-4.0 |
|
54 |
+
| 25 | 45.22 | Command-R-Plus (FC) (Original) | 77.65 | 69.58 | 88 | 82.5 | 70.5 | 77.41 | 89.14 | 86 | 82 | 52.5 | 54.24 | 58.91 | 56.89 | 50 | 54.17 | 6.12 | 9.5 | 0 | 6.5 | 8.5 | N/A | 92.68 | 52.75 | Cohere For AI | cc-by-nc-4.0 |
|
55 |
+
| 26 | 44.28 | Hermes-2-Pro-Llama-3-8B (FC) | 77.17 | 64.17 | 91 | 79.5 | 74 | 74.05 | 68.71 | 90 | 80 | 57.5 | 57.8 | 60.47 | 58.92 | 43.75 | 41.67 | 1.88 | 2.5 | 0.5 | 2.5 | 2 | N/A | 53.66 | 55.16 | NousResearch | apache-2.0 |
|
56 |
+
| 27 | 44.23 | Hermes-2-Pro-Mistral-7B (FC) | 73.17 | 62.67 | 85.5 | 77 | 67.5 | 74.25 | 60.5 | 90 | 84 | 62.5 | 54.11 | 59.3 | 57.47 | 43.75 | 33.33 | 9.88 | 12 | 6.5 | 10 | 11 | N/A | 75.61 | 38.55 | NousResearch | apache-2.0 |
|
57 |
+
| 28 | 43.9 | Hermes-2-Theta-Llama-3-8B (FC) | 73.56 | 61.25 | 82.5 | 75.5 | 75 | 72.54 | 69.14 | 88 | 78 | 55 | 59.57 | 55.81 | 53.13 | 43.75 | 41.67 | 1 | 1.5 | 0 | 1 | 1.5 | N/A | 51.22 | 62.66 | NousResearch | apache-2.0 |
|
58 |
+
| 29 | 43 | Open-Mixtral-8x22b (FC) | 56.12 | 50.5 | 95 | 8.5 | 70.5 | 59.7 | 77.79 | 92 | 24 | 45 | 65.3 | 68.99 | 70.49 | 12.5 | 54.17 | 8.88 | 12.5 | 6.5 | 8 | 8.5 | N/A | 85.37 | 44.2 | Mistral AI | Proprietary |
|
59 |
+
| | 39.51 | MadeAgents/Hammer2.0-0.5b (FC) | 67 | 62 | 80 | 68 | 58 | 65.73 | 48.43 | 82 | 80 | 52.5 | 51.62 | 47.67 | 42.14 | 50 | 37.5 | 0 | 0 | 0 | 0 | 0 | N/A | 87.8 | 67 | MadeAgents | cc-by-nc-4.0 |
|
60 |
+
| 30 | 38.39 | Claude-3-Haiku-20240307 (Prompt) | 62.52 | 77.58 | 93 | 47.5 | 32 | 60.73 | 89.43 | 94 | 32 | 27.5 | 58.06 | 71.71 | 75.99 | 56.25 | 58.33 | 1.62 | 2.5 | 0.5 | 1 | 2.5 | N/A | 85.37 | 18.9 | Anthropic | Proprietary |
|
61 |
+
| 31 | 37.77 | Claude-3-Haiku-20240307 (FC tools-2024-04-04) | 42.42 | 74.17 | 93 | 2 | 0.5 | 47.16 | 90.64 | 92 | 6 | 0 | 51.98 | 71.32 | 64.9 | 0 | 4.17 | 18.5 | 25 | 6.5 | 24 | 18.5 | N/A | 97.56 | 29.08 | Anthropic | Proprietary |
|
62 |
+
| 32 | 16.66 | Hermes-2-Theta-Llama-3-70B (FC) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 38.87 | | | | | | | | | | | | | | |
|
63 |
|
64 |
In addition, we evaluated our Hammer2.0 series (0.5b, 1.5b, 3b, 7b) on other academic benchmarks to further show our model's generalization ability:
|
65 |
|