QuantFactory/Meta-Llama-3-8B-Instruct-function-calling-json-mode-GGUF

This is quantized version of hiieu/Meta-Llama-3-8B-Instruct-function-calling-json-mode created using llama.cpp

Model Description

This model was fine-tuned on meta-llama/Meta-Llama-3-8B-Instruct for function calling and json mode.

Usage

JSON Mode

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "hiieu/Meta-Llama-3-8B-Instruct-function-calling-json-mode"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant, answer in JSON with key \"message\""},
    {"role": "user", "content": "Who are you?"},
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
# >> {"message": "I am a helpful assistant, with access to a vast amount of information. I can help you with tasks such as answering questions, providing definitions, translating text, and more. Feel free to ask me anything!"}

Function Calling

Function calling requires two step inferences, below is the example:

Step 1:

functions_metadata = [
    {
      "type": "function",
      "function": {
        "name": "get_temperature",
        "description": "get temperature of a city",
        "parameters": {
          "type": "object",
          "properties": {
            "city": {
              "type": "string",
              "description": "name"
            }
          },
          "required": [
            "city"
          ]
        }
      }
    }
]

messages = [
    { "role": "system", "content": f"""You are a helpful assistant with access to the following functions: \n {str(functions_metadata)}\n\nTo use these functions respond with:\n<functioncall> {{ "name": "function_name", "arguments": {{ "arg_1": "value_1", "arg_1": "value_1", ... }} }} </functioncall>\n\nEdge cases you must handle:\n - If there are no functions that match the user request, you will respond politely that you cannot help."""},
    { "role": "user", "content": "What is the temperature in Tokyo right now?"}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
# >> <functioncall> {"name": "get_temperature", "arguments": '{"city": "Tokyo"}'} </functioncall>"""}

Step 2:

messages = [
    { "role": "system", "content": f"""You are a helpful assistant with access to the following functions: \n {str(functions_metadata)}\n\nTo use these functions respond with:\n<functioncall> {{ "name": "function_name", "arguments": {{ "arg_1": "value_1", "arg_1": "value_1", ... }} }} </functioncall>\n\nEdge cases you must handle:\n - If there are no functions that match the user request, you will respond politely that you cannot help."""},
    { "role": "user", "content": "What is the temperature in Tokyo right now?"},
    # You will get the previous prediction, extract it will the tag <functioncall>
    # execute the function and append it to the messages like below:
    { "role": "assistant", "content": """<functioncall> {"name": "get_temperature", "arguments": '{"city": "Tokyo"}'} </functioncall>"""},    
    { "role": "user", "content": """<function_response> {"temperature":30 C} </function_response>"""}
]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
# >> The current temperature in Tokyo is 30 degrees Celsius.

Uploaded model

  • Developed by: hiieu

This model was trained 2x faster with Unsloth and Huggingface's TRL library.

Downloads last month
71
GGUF
Model size
8.03B params
Architecture
llama

2-bit

3-bit

4-bit

5-bit

6-bit

8-bit

Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Model tree for QuantFactory/Meta-Llama-3-8B-Instruct-function-calling-json-mode-GGUF