Update README.md
README.md
CHANGED
@@ -14,12 +14,12 @@ We're excited to release lightweight Hammer 2.0 models ([0.5B](https://huggingfa
Hammer 2.0 is fine-tuned from the [Qwen 2.5 series](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e) and [Qwen 2.5 coder series](https://huggingface.co/collections/Qwen/qwen25-coder-66eaa22e6f99801bf65b0c2f) using function masking techniques. It is trained on the [APIGen Function Calling Datasets](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k), which contain 60,000 samples, supplemented by the [xlam-irrelevance-7.5k](https://huggingface.co/datasets/MadeAgents/xlam-irrelevance-7.5k) dataset that we generated. Hammer 2.0 achieves exceptional performance across numerous function-calling benchmarks. For details on data construction, training methods, and evaluation strategies, please refer to our paper [Hammer: Robust Function-Calling for On-Device Language Models via Function Masking](https://arxiv.org/abs/2410.04587) and the [Hammer GitHub repository](https://github.com/MadeAgents/Hammer).
## Evaluation
-
The evaluation results of Hammer 2.0 models on the Berkeley Function-Calling Leaderboard (BFCL) are presented in the following table:
<div style="text-align: center;">
<img src="v2_figures/bfcl.PNG" alt="overview" width="1000" style="margin: auto;">
</div>
-
Our Hammer 2.0 series consistently achieves the best performance among models of comparable scale. The 7B model outperforms most function-calling enhanced models,
In addition, we evaluated the Hammer 2.0 models on other academic benchmarks to further demonstrate the generalization ability of our models.
Hammer 2.0 is fine-tuned from the [Qwen 2.5 series](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e) and [Qwen 2.5 coder series](https://huggingface.co/collections/Qwen/qwen25-coder-66eaa22e6f99801bf65b0c2f) using function masking techniques. It is trained on the [APIGen Function Calling Datasets](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k), which contain 60,000 samples, supplemented by the [xlam-irrelevance-7.5k](https://huggingface.co/datasets/MadeAgents/xlam-irrelevance-7.5k) dataset that we generated. Hammer 2.0 achieves exceptional performance across numerous function-calling benchmarks. For details on data construction, training methods, and evaluation strategies, please refer to our paper [Hammer: Robust Function-Calling for On-Device Language Models via Function Masking](https://arxiv.org/abs/2410.04587) and the [Hammer GitHub repository](https://github.com/MadeAgents/Hammer).
## Evaluation
+
The evaluation results of Hammer 2.0 models on the Berkeley Function-Calling Leaderboard (BFCL-v3) are presented in the following table:
<div style="text-align: center;">
<img src="v2_figures/bfcl.PNG" alt="overview" width="1000" style="margin: auto;">
</div>
+
Our Hammer 2.0 series consistently achieves the best performance among models of comparable scale. The 7B model outperforms most function-calling enhanced models, and the 1.5B model also performs unexpectedly well.
In addition, we evaluated the Hammer 2.0 models on other academic benchmarks to further demonstrate the generalization ability of our models.