Stop condition #45
opened by GemZq
The stopping condition is dictated by the model, which should learn when to output an end-of-sequence (EOS) token. However, the 8B-chinese-chat model's generation output keeps repeating. Is this a bug in my code?
from transformers import TextStreamer

# tokenizer and model are assumed to be loaded earlier in the script
streamer = TextStreamer(tokenizer)

inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
# Despite returning the usual output, the streamer also prints the generated text to stdout.
_ = model.generate(
    **inputs,
    streamer=streamer,
    max_new_tokens=8192,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
    cache_implementation="quantized",
    cache_config={"nbits": 4, "backend": "quanto"},
)
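One likely cause (an assumption, since the exact model isn't shown): with Llama-3-family chat models, the chat template ends each assistant turn with `<|eot_id|>` rather than the tokenizer's default `eos_token`, so `generate()` never sees the token it is watching for and runs until `max_new_tokens`, often degenerating into repetition. The sketch below shows the stop condition `generate()` effectively applies, plus (in comments) the commonly recommended fix of passing both terminator ids via `eos_token_id`; the token names are assumptions about this model's template.

```python
def truncate_at_stop(token_ids, eos_ids):
    """Sketch of generate()'s stop condition: decoding halts at the
    first token that is in the configured set of EOS ids."""
    for i, tok in enumerate(token_ids):
        if tok in eos_ids:
            return token_ids[:i]  # drop the stop token itself
    # No stop token was ever sampled: generation runs to max_new_tokens,
    # which is where repetition typically shows up.
    return token_ids

# If the model is a Llama-3-style chat model (assumption), pass both
# terminators so decoding can stop at the end of the assistant turn:
#
# terminators = [tokenizer.eos_token_id,
#                tokenizer.convert_tokens_to_ids("<|eot_id|>")]
# _ = model.generate(**inputs, streamer=streamer,
#                    eos_token_id=terminators, ...)
```

If the right terminators are already set, a `repetition_penalty` slightly above 1.0 in `generate()` is another common mitigation for looping output.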