---
license: cc-by-4.0
datasets:
  - Salesforce/xlam-function-calling-60k
base_model:
  - Qwen/Qwen2.5-Coder-7B-Instruct
---

# Hammer2.0-7b Function Calling Model

## Introduction

The Hammer2.0 series has been released in four sizes: 0.5b, 1.5b, 3b, and 7b. Compared to the Hammer1.0 series, Hammer2.0 delivers stronger function-calling performance.

## Model Details

Hammer2.0-7b is fine-tuned from Qwen2.5-Coder-7B-Instruct. It is trained on the APIGen Function Calling dataset of 60,000 samples, supplemented by 7,500 irrelevance-detection samples we generated. Using training techniques such as function masking, function shuffling, and prompt optimization, Hammer2.0-7b achieves strong results across numerous benchmarks, including the Berkeley Function Calling Leaderboard, API-Bank, Tool-Alpaca, Nexus Raven, and Seal-Tools.

## Tuning Details

Thank you for your interest. A report covering all the technical details behind our models will be published soon.

## Evaluation

First, we evaluate the Hammer series on the Berkeley Function-Calling Leaderboard (BFCL):

| Rank | Overall Acc | Model | Non-live AST Summary | Non-live AST Simple | Non-live AST Multiple | Non-live AST Parallel | Non-live AST Multiple Parallel | Non-live Exec Summary | Non-live Exec Simple | Non-live Exec Multiple | Non-live Exec Parallel | Non-live Exec Multiple Parallel | Live AST Overall Acc | Live AST Simple | Live AST Multiple | Live AST Parallel | Live AST Multiple Parallel | Multi Turn Overall Acc | Multi Turn Base | Multi Turn Miss Func | Multi Turn Miss Param | Multi Turn Long Context | Multi Turn Composite | Relevance | Irrelevance | Organization | License |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 59.49 | GPT-4-turbo-2024-04-09 (FC) | 82.65 | 60.58 | 91 | 90 | 89 | 83.8 | 88.71 | 88 | 86 | 72.5 | 73.39 | 67.83 | 74.45 | 75 | 62.5 | 21.62 | 33.5 | 3.5 | 20 | 29.5 | N/A | 70.73 | 79.79 | OpenAI | Proprietary |
| 2 | 59.29 | GPT-4o-2024-08-06 (FC) | 85.52 | 73.58 | 92.5 | 91.5 | 84.5 | 82.96 | 85.36 | 90 | 84 | 72.5 | 71.79 | 67.83 | 69.43 | 75 | 66.67 | 21.25 | 31 | 5 | 19.5 | 29.5 | N/A | 63.41 | 82.91 | OpenAI | Proprietary |
| 3 | 59.13 | xLAM-8x22b-r (FC) | 89.75 | 77 | 95.5 | 92.5 | 94 | 89.32 | 98.29 | 94 | 90 | 75 | 72.81 | 70.93 | 77.72 | 75 | 75 | 15.62 | 21.5 | 3.5 | 17 | 20.5 | N/A | 97.56 | 75.23 | Salesforce | cc-by-nc-4.0 |
| 4 | 58.45 | GPT-4o-mini-2024-07-18 (FC) | 82.83 | 67.83 | 90.5 | 89.5 | 83.5 | 81.8 | 83.21 | 92 | 82 | 70 | 67.53 | 67.83 | 69.82 | 81.25 | 70.83 | 25.75 | 36.5 | 9.5 | 24.5 | 32.5 | N/A | 82.93 | 71.83 | OpenAI | Proprietary |
| 5 | 57.94 | xLAM-8x7b-r (FC) | 88.44 | 77.25 | 95.5 | 92 | 89 | 85.89 | 91.57 | 94 | 88 | 70 | 71.97 | 68.99 | 76.18 | 50 | 75 | 15.75 | 18.5 | 8 | 15.5 | 21 | N/A | 92.68 | 72.35 | Salesforce | cc-by-nc-4.0 |
| 6 | 57.21 | GPT-4o-mini-2024-07-18 (Prompt) | 86.54 | 79.67 | 89.5 | 89 | 88 | 87.95 | 98.29 | 94 | 82 | 77.5 | 72.77 | 72.09 | 73.77 | 81.25 | 70.83 | 11.62 | 15 | 1.5 | 13 | 17 | N/A | 80.49 | 79.2 | OpenAI | Proprietary |
|  | 56.96 | MadeAgents/Hammer2.0-7b (FC) | 90.33 | 79.83 | 95 | 94 | 92.5 | 82.2 | 83.29 | 92 | 86 | 67.5 | 68.99 | 67.83 | 76.28 | 75 | 70.83 | 16.5 | 21.5 | 7.5 | 19 | 18 | N/A | 92.68 | 68.88 | MadeAgents | cc-by-nc-4.0 |
| 7 | 55.82 | mistral-large-2407 (FC) | 84.12 | 57.5 | 94 | 93 | 92 | 83.09 | 76.86 | 92 | 86 | 77.5 | 67.17 | 79.07 | 78.88 | 87.5 | 75 | 20.5 | 29 | 13 | 19.5 | 20.5 | N/A | 78.05 | 48.93 | Mistral AI | Proprietary |
| 8 | 55.67 | GPT-4-turbo-2024-04-09 (Prompt) | 91.31 | 82.25 | 94.5 | 95 | 93.5 | 88.12 | 99 | 96 | 80 | 77.5 | 67.97 | 78.68 | 83.12 | 81.25 | 75 | 10.62 | 12.5 | 5.5 | 11 | 13.5 | N/A | 82.93 | 61.82 | OpenAI | Proprietary |
| 9 | 54.83 | Claude-3.5-Sonnet-20240620 (FC) | 70.35 | 75.42 | 93.5 | 62 | 50.5 | 66.34 | 95.36 | 86 | 44 | 40 | 71.39 | 72.48 | 70.68 | 68.75 | 75 | 23.5 | 30.5 | 8 | 27 | 28.5 | N/A | 63.41 | 75.91 | Anthropic | Proprietary |
| 10 | 53.66 | GPT-4o-2024-08-06 (Prompt) | 80.9 | 64.08 | 86.5 | 88 | 85 | 77.89 | 70.57 | 88 | 78 | 75 | 73.88 | 67.44 | 67.21 | 56.25 | 58.33 | 6.12 | 9 | 1 | 7.5 | 7 | N/A | 53.66 | 89.56 | OpenAI | Proprietary |
| 11 | 53.43 | o1-mini-2024-09-12 (Prompt) | 75.48 | 68.92 | 89 | 73.5 | 70.5 | 76.86 | 78.93 | 88 | 78 | 62.5 | 71.17 | 62.79 | 65.09 | 68.75 | 58.33 | 11 | 16 | 2 | 12.5 | 13.5 | N/A | 46.34 | 88.07 | OpenAI | Proprietary |
| 12 | 53.01 | Gemini-1.5-Flash-Preview-0514 (FC) | 77.1 | 65.42 | 94.5 | 71.5 | 77 | 71.23 | 57.93 | 84 | 78 | 65 | 71.17 | 62.79 | 72.61 | 56.25 | 54.17 | 13.12 | 17.5 | 4 | 15.5 | 15.5 | N/A | 60.98 | 76.15 | Google | Proprietary |
| 13 | 52.53 | Gemini-1.5-Pro-Preview-0514 (FC) | 75.54 | 50.17 | 89.5 | 83.5 | 79 | 77.46 | 71.86 | 86 | 82 | 70 | 69.26 | 60.08 | 66.35 | 75 | 54.17 | 10.87 | 15.5 | 1.5 | 11 | 15.5 | N/A | 60.98 | 80.56 | Google | Proprietary |
|  | 51.94 | MadeAgents/Hammer2.0-1.5b (FC) | 84.31 | 75.25 | 92.5 | 87.5 | 82 | 81.8 | 83.71 | 90 | 86 | 67.5 | 63.17 | 64.73 | 67.31 | 50 | 66.67 | 11.38 | 14 | 7 | 12 | 12.5 | N/A | 92.68 | 61.83 | MadeAgents | cc-by-nc-4.0 |
| 14 | 51.93 | GPT-3.5-Turbo-0125 (FC) | 84.52 | 74.08 | 93 | 87.5 | 83.5 | 81.66 | 95.14 | 88 | 86 | 57.5 | 59 | 65.5 | 74.16 | 56.25 | 54.17 | 19.12 | 30 | 7.5 | 23 | 16 | N/A | 97.56 | 35.83 | OpenAI | Proprietary |
| 15 | 51.78 | FireFunction-v2 (FC) | 85.71 | 78.83 | 92 | 91 | 81 | 84.23 | 94.43 | 88 | 82 | 72.5 | 61.71 | 69.38 | 70.97 | 56.25 | 54.17 | 11.62 | 21.5 | 1.5 | 17.5 | 6 | N/A | 87.8 | 52.94 | Fireworks | Apache 2.0 |
| 16 | 51.78 | Open-Mistral-Nemo-2407 (FC) | 80.98 | 60.92 | 92 | 85.5 | 85.5 | 81.46 | 91.36 | 86 | 86 | 62.5 | 61.44 | 68.22 | 67.98 | 75 | 62.5 | 14.25 | 21 | 10 | 13.5 | 12.5 | N/A | 65.85 | 59.14 | Mistral AI | Proprietary |
| 17 | 51.45 | xLAM-7b-fc-r (FC) | 86.83 | 77.33 | 92.5 | 91.5 | 86 | 85.02 | 91.57 | 88 | 88 | 72.5 | 68.81 | 63.57 | 63.36 | 56.25 | 50 | 0 | 0 | 0 | 0 | 0 | N/A | 80.49 | 79.76 | Salesforce | cc-by-nc-4.0 |
| 18 | 51.01 | Gorilla-OpenFunctions-v2 (FC) | 87.29 | 77.67 | 95 | 89 | 87.5 | 84.96 | 95.86 | 96 | 78 | 70 | 68.59 | 63.95 | 63.93 | 62.5 | 45.83 | 0 | 0 | 0 | 0 | 0 | N/A | 85.37 | 73.13 | Gorilla LLM | Apache 2.0 |
|  | 49.88 | MadeAgents/Hammer2.0-3b (FC) | 86.77 | 77.08 | 92.5 | 89.5 | 88 | 80.25 | 81.5 | 86 | 86 | 67.5 | 66.06 | 63.95 | 72.81 | 56.25 | 66.67 | 0.5 | 1 | 0 | 0.5 | 0.5 | N/A | 92.68 | 68.59 | MadeAgents | cc-by-nc-4.0 |
| 19 | 49.63 | Claude-3-Opus-20240229 (FC tools-2024-04-04) | 58.4 | 74.08 | 89.5 | 35 | 35 | 63.16 | 84.64 | 86 | 52 | 30 | 70.5 | 64.73 | 70.4 | 43.75 | 20.83 | 15.62 | 22 | 4 | 14.5 | 22 | N/A | 73.17 | 76.4 | Anthropic | Proprietary |
| 20 | 49.55 | Meta-Llama-3-70B-Instruct (Prompt) | 87.21 | 75.83 | 94.5 | 91.5 | 87 | 87.41 | 94.14 | 94 | 84 | 77.5 | 63.39 | 69.77 | 78.01 | 75 | 66.67 | 1.12 | 1.5 | 1.5 | 1 | 0.5 | N/A | 92.68 | 50.63 | Meta | Meta Llama 3 Community |
| 21 | 48.14 | Command-R-Plus (Prompt) (Original) | 75.54 | 71.17 | 85 | 80 | 66 | 77.57 | 91.29 | 86 | 78 | 55 | 67.88 | 65.12 | 71.26 | 75 | 58.33 | 0.25 | 0.5 | 0 | 0 | 0.5 | N/A | 75.61 | 69.31 | Cohere For AI | cc-by-nc-4.0 |
| 22 | 47.66 | Granite-20b-FunctionCalling (FC) | 82.67 | 73.17 | 92 | 84 | 81.5 | 82.96 | 85.36 | 90 | 84 | 72.5 | 55.89 | 57.36 | 54.1 | 37.5 | 54.17 | 3.63 | 4.5 | 1.5 | 3.5 | 5 | N/A | 95.12 | 72.43 | IBM | Apache-2.0 |
| 23 | 45.88 | Hermes-2-Pro-Llama-3-70B (FC) | 81.73 | 65.92 | 80.5 | 90.5 | 90 | 81.29 | 80.64 | 88 | 84 | 72.5 | 58.6 | 66.67 | 62.49 | 50 | 66.67 | 0.25 | 0.5 | 0 | 0 | 0.5 | N/A | 80.49 | 53.8 | NousResearch | apache-2.0 |
| 24 | 45.4 | xLAM-1b-fc-r (FC) | 79.17 | 73.17 | 89.5 | 77.5 | 76.5 | 80.5 | 78 | 88 | 86 | 70 | 57.57 | 56.59 | 56.12 | 50 | 58.33 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | N/A | 95.12 | 61.26 | Salesforce | cc-by-nc-4.0 |
| 25 | 45.22 | Command-R-Plus (FC) (Original) | 77.65 | 69.58 | 88 | 82.5 | 70.5 | 77.41 | 89.14 | 86 | 82 | 52.5 | 54.24 | 58.91 | 56.89 | 50 | 54.17 | 6.12 | 9.5 | 0 | 6.5 | 8.5 | N/A | 92.68 | 52.75 | Cohere For AI | cc-by-nc-4.0 |
| 26 | 44.28 | Hermes-2-Pro-Llama-3-8B (FC) | 77.17 | 64.17 | 91 | 79.5 | 74 | 74.05 | 68.71 | 90 | 80 | 57.5 | 57.8 | 60.47 | 58.92 | 43.75 | 41.67 | 1.88 | 2.5 | 0.5 | 2.5 | 2 | N/A | 53.66 | 55.16 | NousResearch | apache-2.0 |
| 27 | 44.23 | Hermes-2-Pro-Mistral-7B (FC) | 73.17 | 62.67 | 85.5 | 77 | 67.5 | 74.25 | 60.5 | 90 | 84 | 62.5 | 54.11 | 59.3 | 57.47 | 43.75 | 33.33 | 9.88 | 12 | 6.5 | 10 | 11 | N/A | 75.61 | 38.55 | NousResearch | apache-2.0 |
| 28 | 43.9 | Hermes-2-Theta-Llama-3-8B (FC) | 73.56 | 61.25 | 82.5 | 75.5 | 75 | 72.54 | 69.14 | 88 | 78 | 55 | 59.57 | 55.81 | 53.13 | 43.75 | 41.67 | 1 | 1.5 | 0 | 1 | 1.5 | N/A | 51.22 | 62.66 | NousResearch | apache-2.0 |
| 29 | 43 | Open-Mixtral-8x22b (FC) | 56.12 | 50.5 | 95 | 8.5 | 70.5 | 59.7 | 77.79 | 92 | 24 | 45 | 65.3 | 68.99 | 70.49 | 12.5 | 54.17 | 8.88 | 12.5 | 6.5 | 8 | 8.5 | N/A | 85.37 | 44.2 | Mistral AI | Proprietary |
|  | 39.51 | MadeAgents/Hammer2.0-0.5b (FC) | 67 | 62 | 80 | 68 | 58 | 65.73 | 48.43 | 82 | 80 | 52.5 | 51.62 | 47.67 | 42.14 | 50 | 37.5 | 0 | 0 | 0 | 0 | 0 | N/A | 87.8 | 67 | MadeAgents | cc-by-nc-4.0 |
| 30 | 38.39 | Claude-3-Haiku-20240307 (Prompt) | 62.52 | 77.58 | 93 | 47.5 | 32 | 60.73 | 89.43 | 94 | 32 | 27.5 | 58.06 | 71.71 | 75.99 | 56.25 | 58.33 | 1.62 | 2.5 | 0.5 | 1 | 2.5 | N/A | 85.37 | 18.9 | Anthropic | Proprietary |
| 31 | 37.77 | Claude-3-Haiku-20240307 (FC tools-2024-04-04) | 42.42 | 74.17 | 93 | 2 | 0.5 | 47.16 | 90.64 | 92 | 6 | 0 | 51.98 | 71.32 | 64.9 | 0 | 4.17 | 18.5 | 25 | 6.5 | 24 | 18.5 | N/A | 97.56 | 29.08 | Anthropic | Proprietary |
| 32 | 16.66 | Hermes-2-Theta-Llama-3-70B (FC) | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 38.87 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |

In addition, we evaluated the Hammer2.0 series (0.5b, 1.5b, 3b, and 7b) on other academic benchmarks to further demonstrate our models' generalization ability:

Each cell reports F1 Func-Name / F1 Args.

| Model | Size | API-Bank L-1 | API-Bank L-2 | Tool-Alpaca | SealTool (Single-Tool) | Nexus Raven | F1 Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4o-mini (Prompt) | -- | 95.1% / 89.3% | 84.3% / 67.5% | 64.3% / 54.7% | 87.9% / 86.0% | 91.7% / 84.6% | 84.7% / 76.4% |
| qwen2-7b-instruct | 7B | 81.5% / 60.6% | 95.7% / 49.5% | 71.6% / 48.1% | 93.9% / 77.5% | 87.1% / 63.5% | 85.9% / 59.8% |
| qwen1.5-4b-Chat | 4B | 55.3% / 59.8% | 46.7% / 38.5% | 35.4% / 17.0% | 48.4% / 62.3% | 29.0% / 33.7% | 43.0% / 42.2% |
| qwen2-1.5b-instruct | 1.5B | 74.6% / 63.6% | 57.7% / 33.6% | 65.8% / 45.2% | 82.1% / 75.5% | 70.6% / 45.5% | 70.2% / 52.7% |
| Gorilla-openfunctions-v2 | 7B | 69.2% / 70.3% | 48.8% / 54.7% | 72.9% / 51.3% | 93.2% / 91.1% | 72.8% / 68.4% | 71.4% / 67.2% |
| GRANITE-20B-FUNCTIONCALLING | 20B | 90.4% / 77.8% | 78.9% / 59.2% | 77.3% / 58.0% | 94.9% / 92.7% | 94.5% / 75.1% | 87.2% / 72.6% |
| xlam-7b-fc-r | 7B | 90.0% / 80.7% | 72.5% / 64.2% | 67.3% / 59.0% | 79.0% / 76.9% | 54.1% / 57.5% | 72.6% / 67.7% |
| xlam-1b-fc-r | 1.3B | 94.9% / 83.7% | 91.8% / 64.3% | 64.9% / 50.6% | 90.7% / 80.4% | 64.4% / 54.8% | 81.3% / 66.8% |
| Hammer-7b | 7B | 93.5% / 85.8% | 82.9% / 66.4% | 82.3% / 59.9% | 97.4% / 91.7% | 92.5% / 77.4% | 89.7% / 76.2% |
| Hammer-4b | 4B | 91.6% / 81.5% | 77.6% / 61.0% | 85.1% / 57.0% | 96.4% / 92.4% | 81.7% / 64.9% | 86.5% / 71.4% |
| Hammer-1.5b | 1.5B | 82.1% / 72.3% | 79.8% / 59.7% | 80.9% / 53.5% | 95.6% / 88.6% | 79.9% / 56.9% | 83.7% / 66.2% |
| Hammer2.0-0.5B | 0.5B | 81.2% / 67.8% | 62.9% / 52.0% | 79.1% / 50.9% | 94.9% / 83.8% | 74.7% / 49.0% | 78.5% / 60.7% |
| Hammer2.0-1.5B | 1.5B | 90.2% / 80.4% | 82.9% / 63.8% | 86.2% / 59.5% | 97.5% / 92.5% | 86.4% / 65.5% | 88.6% / 72.4% |
| Hammer2.0-3B | 3B | 93.6% / 84.3% | 83.7% / 59.0% | 83.1% / 58.8% | 95.3% / 91.2% | 92.5% / 70.5% | 89.6% / 72.8% |
| Hammer2.0-7B | 7B | 91.0% / 82.1% | 82.5% / 65.1% | 85.2% / 59.6% | 96.8% / 92.7% | 93.0% / 80.5% | 89.7% / 76.0% |

## Requirements

The code for Hammer2.0-7b is included in the latest Hugging Face `transformers` library, and we advise you to install `transformers>=4.37.0`.
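
For reference, a minimal environment setup might look like the following (assuming a pip-based install; `accelerate` is included because the example below loads the model with `device_map="auto"`):

```shell
pip install "transformers>=4.37.0" torch accelerate
```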

## How to Use

This is a simple example of how to use our model.

````python
import json
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MadeAgents/Hammer2.0-7b"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name) 

# Please use our provided instruction prompt for best performance
TASK_INSTRUCTION = """You are a tool calling assistant. In order to complete the user's request, you need to select one or more appropriate tools from the following tools and fill in the correct values for the tool parameters. Your specific tasks are:
1. Make one or more function/tool calls to meet the request based on the question.
2. If none of the function can be used, point it out and refuse to answer.
3. If the given question lacks the parameters required by the function, also point it out.
"""

FORMAT_INSTRUCTION = """
The output MUST strictly adhere to the following JSON format, and NO other text MUST be included.
The example format is as follows. Please make sure the parameter type is correct. If no function call is needed, please directly output an empty list '[]'
```
[
    {"name": "func_name1", "arguments": {"argument1": "value1", "argument2": "value2"}},
    ... (more tool calls as required)
]
```
"""

# Define the input query and available tools
query = "Where can I find live giveaways for beta access and games? And what's the weather like in New York, US?" 

live_giveaways_by_type = {
    "name": "live_giveaways_by_type",
    "description": "Retrieve live giveaways from the GamerPower API based on the specified type.",
    "parameters": {
        "type": "object",
        "properties": {
            "type": {
                "type": "string",
                "description": "The type of giveaways to retrieve (e.g., game, loot, beta).",
                "default": "game"
            }
        },
        "required": ["type"]
    }
}
get_current_weather = {
    "name": "get_current_weather",
    "description": "Get the current weather",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "The city and state, e.g. San Francisco, CA"
            }
        },
        "required": ["location"]
    }
}
get_stock_price = {
    "name": "get_stock_price",
    "description": "Retrieves the current stock price for a given ticker symbol. The ticker symbol must be a valid symbol for a publicly traded company on a major US stock exchange like NYSE or NASDAQ. The tool will return the latest trade price in USD. It should be used when the user asks about the current or most recent price of a specific stock. It will not provide any other information about the stock or company.",
    "parameters": {
        "type": "object",
        "properties": {
            "ticker": {
                "type": "string",
                "description": "The stock ticker symbol, e.g. AAPL for Apple Inc."
            }
        },
        "required": ["ticker"]
    }
}

def convert_to_format_tool(tools):
    """Convert OpenAI-format tool schemas into the compact format expected by Hammer."""
    if isinstance(tools, dict):
        format_tools = {
            "name": tools["name"],
            "description": tools["description"],
            "parameters": tools["parameters"].get("properties", {}),
        }
        required = tools["parameters"].get("required", [])
        # Mark required parameters directly on each parameter entry
        for param in required:
            format_tools["parameters"][param]["required"] = True
        # Fold any default value into the parameter description
        for param in format_tools["parameters"].keys():
            if "default" in format_tools["parameters"][param]:
                default = format_tools["parameters"][param]["default"]
                format_tools["parameters"][param]["description"] += f" default is '{default}'"
        return format_tools
    elif isinstance(tools, list):
        return [convert_to_format_tool(tool) for tool in tools]
    else:
        return tools
# Helper function to build the input prompt for our model
def build_prompt(task_instruction: str, format_instruction: str, tools: list, query: str):
    prompt = f"[BEGIN OF TASK INSTRUCTION]\n{task_instruction}\n[END OF TASK INSTRUCTION]\n\n"
    prompt += f"[BEGIN OF AVAILABLE TOOLS]\n{json.dumps(tools)}\n[END OF AVAILABLE TOOLS]\n\n"
    prompt += f"[BEGIN OF FORMAT INSTRUCTION]\n{format_instruction}\n[END OF FORMAT INSTRUCTION]\n\n"
    prompt += f"[BEGIN OF QUERY]\n{query}\n[END OF QUERY]\n\n"
    return prompt
   
# Build the input and start the inference
openai_format_tools = [live_giveaways_by_type, get_current_weather, get_stock_price]
format_tools = convert_to_format_tool(openai_format_tools)
content = build_prompt(TASK_INSTRUCTION, FORMAT_INSTRUCTION, format_tools, query)

messages = [
    {"role": "user", "content": content}
]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

# Greedy decoding; generation stops at the tokenizer's EOS token
outputs = model.generate(inputs, max_new_tokens=512, do_sample=False, num_return_sequences=1, eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True))
````
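
If generation succeeds, the completion is a JSON list of tool calls in the format requested by `FORMAT_INSTRUCTION`. As a minimal sketch of the downstream step (the `dispatch_tool_calls` helper and the registry entries below are illustrative stand-ins, not part of the model release), the calls can be parsed and routed to local functions like this:

```python
import json

# Hypothetical decoded completion in the JSON format requested above
output_text = '[{"name": "get_current_weather", "arguments": {"location": "New York, US"}}]'

def dispatch_tool_calls(text, registry):
    """Parse the model's JSON tool-call list and invoke the matching local functions."""
    try:
        calls = json.loads(text)
    except json.JSONDecodeError:
        return []  # the model refused, or returned non-JSON text
    results = []
    for call in calls:
        func = registry.get(call.get("name"))
        if func is not None:
            results.append(func(**call.get("arguments", {})))
    return results

# Illustrative stand-ins for real tool implementations
registry = {
    "live_giveaways_by_type": lambda type="game": f"<giveaways of type {type}>",
    "get_current_weather": lambda location: f"<weather for {location}>",
}
print(dispatch_tool_calls(output_text, registry))  # ['<weather for New York, US>']
```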