Expand Output after deploying it on SageMaker #3
by gdcoder
I get back a response from the model, but it is not complete. I managed to deploy it on ml.g5.12xlarge following the instructions.
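For context, the deployment followed the usual Hugging Face LLM (TGI) container pattern, roughly like the sketch below (the container version and GPU count are illustrative assumptions, not exact values from the instructions):

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Hugging Face LLM (TGI) container image; the version here is an assumption
image_uri = get_huggingface_llm_image_uri("huggingface", version="0.8.2")

llm_model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    env={
        "HF_MODEL_ID": "OpenAssistant/falcon-40b-sft-mix-1226",
        "SM_NUM_GPUS": "4",  # ml.g5.12xlarge has 4 GPUs
    },
)

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.12xlarge",
    container_startup_health_check_timeout=600,  # 40B model, slow startup
)
```

The code I then use to invoke the endpoint: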
```python
import json

import boto3  # this import was missing from the original snippet
from transformers import AutoTokenizer

model = "OpenAssistant/falcon-40b-sft-mix-1226"
tokenizer = AutoTokenizer.from_pretrained(model)

# name of the deployed SageMaker endpoint
ENDPOINT_NAME = "huggingface-pytorch-tgi-inference-2023-06-14-22-44-39-458"
runtime = boto3.client("runtime.sagemaker")

prompt = "<|prompter|>What is a meme, and what's the history behind this word?<|endoftext|><|assistant|>"

input_data = {
    "inputs": prompt,
    "parameters": {
        "do_sample": True,
        "temperature": 0.1,
        "include_prompt_in_result": False,
        "top_k": 10,
        "num_return_sequences": 10,
        "max_length": 10,
        # "eos_token_id": tokenizer.eos_token_id,
        "return_full_text": False,
    },
}

response = runtime.invoke_endpoint(
    EndpointName=ENDPOINT_NAME,
    ContentType="application/json",
    Body=json.dumps(input_data).encode("utf-8"),
)
response_json = json.loads(response["Body"].read().decode("utf-8"))
response_json
```
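For reference, a TGI-backed endpoint typically returns a JSON list of generations, so the text itself can be read like this (assuming at least one generation came back):

```python
# TGI responses look like [{"generated_text": "..."}]
print(response_json[0]["generated_text"])
```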
By "not complete" do you mean that it cuts off early? If so it's likely because of the "max_length": 10
parameter you pass. That limits the generation to 10 tokens, which is really not a lot. If you want a somewhat detailed answer you should set it to at least 300. Though keep in mind that it is a max length, not an enforced length, so the answer can be shorter than this length.
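One caveat: judging by the endpoint name, this is the TGI container, and TGI's length parameter is `max_new_tokens` rather than `max_length`; an unrecognized `max_length` may simply be ignored, leaving only a small default budget of new tokens. A sketch of an adjusted payload under that assumption (the `stop` sequence is optional):

```python
# Sketch of an adjusted payload for a TGI-backed endpoint.
input_data = {
    "inputs": prompt,
    "parameters": {
        "do_sample": True,
        "temperature": 0.1,
        "top_k": 10,
        "max_new_tokens": 300,      # upper bound on generated tokens, not an enforced length
        "return_full_text": False,
        "stop": ["<|endoftext|>"],  # stop at the OpenAssistant end-of-turn token (assumption)
    },
}
```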