Prompt style documentation

#4
by kobim - opened

Hi, I wanted to try out your model for a personal project, and I was wondering if you could provide more documentation on the prompt style it expects.

I've been using LlamaIndex, where the default chat message roles look like this:

from enum import Enum

class MessageRole(str, Enum):
    """Message role."""

    SYSTEM = "system"
    USER = "user"
    ASSISTANT = "assistant"
    FUNCTION = "function"
    TOOL = "tool"

I'd like to know what the equivalents are for this model, and how prompts should be built in general. Even when I follow the example in the README, I cannot get it to run properly: sometimes it keeps generating after the first message, and sometimes the generated response is missing the final vertical bar "|". See the attached screenshot (image.png).

Sorry, I didn't include a prompt format specification in the README. I have updated it: https://huggingface.co/galatolo/cerbero-7b#prompt-format

The prompt format is:

[|Umano|] First human message
[|Assistente|] First AI reply
[|Umano|] Second human message
[|Assistente|] Second AI reply

When crafting prompts, make sure to end with the [|Assistente|] tag, which signals the model to generate a response.
Use [|Umano|] as the stop word. For example:

[|Umano|] Come posso distinguere un AI da un umano?
[|Assistente|]
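For anyone coming from a role-based chat API, here is a minimal sketch of how the format above could be assembled from a list of messages. The `build_prompt` helper and the `ROLE_TAGS` mapping are hypothetical names used for illustration, and joining turns with a newline is an assumption based on how the example is displayed. Only the "user" and "assistant" roles are mapped, since no tags for system, function, or tool messages are documented here.

```python
# Hypothetical mapping from generic chat roles to the tags shown above.
ROLE_TAGS = {"user": "[|Umano|]", "assistant": "[|Assistente|]"}


def build_prompt(messages):
    """Build a prompt from (role, content) pairs, ending with the assistant tag."""
    lines = []
    for role, content in messages:
        if role not in ROLE_TAGS:
            # No tag is documented for other roles, so fail loudly.
            raise ValueError(f"no tag documented for role {role!r}")
        lines.append(f"{ROLE_TAGS[role]} {content}")
    # End with the bare assistant tag so the model generates the next reply.
    lines.append(ROLE_TAGS["assistant"])
    # Assumption: turns are separated by a single newline.
    return "\n".join(lines)


prompt = build_prompt([("user", "Come posso distinguere un AI da un umano?")])
print(prompt)
```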

Hello @galatolo, what exactly do you mean by "Use [|Umano|] as stop word"? Like @kobim, my cerbero also keeps generating after the first message.

Hi, I typically use vLLM for inference, which supports stop words out of the box.

With Hugging Face Transformers, you can halt generation at a multi-token stop word such as [|Umano|] by defining a custom stopping criterion, following the approach outlined in this discussion:

from transformers import StoppingCriteria

class MyStoppingCriteria(StoppingCriteria):
    def __init__(self, target_sequence, prompt):
        self.target_sequence = target_sequence
        self.prompt = prompt

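    def __call__(self, input_ids, scores, **kwargs):
        # Convert the generated token IDs to text and remove the initial prompt.
        # Note: `tokenizer` must already be loaded in the enclosing scope.
        generated_text = tokenizer.decode(input_ids[0]).replace(self.prompt, '')
        # Halt generation if the target sequence is found
        return self.target_sequence in generated_text

    # __len__ and __iter__ let this single criterion be passed where
    # generate expects a list of stopping criteria.
    def __len__(self):
        return 1

    def __iter__(self):
        yield self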

Then pass this criterion to the generate call:

model.generate(input_ids, max_new_tokens=128, stopping_criteria=MyStoppingCriteria("[|Umano|]", prompt))

While I haven't tested this myself, the logic appears to be sound.
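One caveat with the approach above: the prompt itself always contains [|Umano|], so if the string replacement fails to strip it (for example, if decoding adds special tokens or changes whitespace), the criterion could fire immediately. A possible alternative, sketched below under stated assumptions, is to look only at the tokens generated after the prompt. The `should_stop` function is a hypothetical helper, not part of Transformers. In a real criterion, `decode` would be `tokenizer.decode` and `prompt_len` would be `input_ids.shape[1]` measured before calling generate. The toy vocabulary is made up purely to exercise the logic.

```python
from typing import Callable, Sequence


def should_stop(
    sequence_ids: Sequence[int],
    prompt_len: int,
    decode: Callable[[Sequence[int]], str],
    stop_word: str = "[|Umano|]",
) -> bool:
    """Return True once the stop word appears in the newly generated tokens."""
    # Skip the prompt tokens so a stop word inside the prompt is never matched.
    new_ids = list(sequence_ids[prompt_len:])
    return stop_word in decode(new_ids)


# Toy vocabulary and decoder (assumptions for illustration only).
vocab = {0: "[|", 1: "Umano", 2: "|]", 3: " Ciao", 4: "[|Assistente|]"}
decode = lambda ids: "".join(vocab[i] for i in ids)

prompt_ids = [0, 1, 2, 3, 4]  # "[|Umano|] Ciao[|Assistente|]"
print(should_stop(prompt_ids + [3], len(prompt_ids), decode))
print(should_stop(prompt_ids + [3, 0, 1, 2], len(prompt_ids), decode))
```

Inside a StoppingCriteria subclass, `__call__` would return `should_stop(input_ids[0].tolist(), self.prompt_len, self.tokenizer.decode)`, with `prompt_len` stored in the constructor.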

It super worked! Thank you, you're amazing.

I'm posting the code below for completeness. I'm running the whole thing in an HF Inference Endpoint with this custom handler:

from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria
import torch
from typing import Dict

class MyStoppingCriteria(StoppingCriteria):
    """Necessary for multi-token EOS words."""

    def __init__(self, target_sequence, prompt, tokenizer):
        self.target_sequence = target_sequence
        self.prompt = prompt
        self.tokenizer = tokenizer

    def __call__(self, input_ids, scores, **kwargs):
        # Convert the generated token IDs to text and remove the initial prompt
        generated_text = self.tokenizer.decode(input_ids[0]).replace(self.prompt, '')
        # Halt generation if the target sequence is found
        return self.target_sequence in generated_text

    def __len__(self):
        return 1

    def __iter__(self):
        yield self

class EndpointHandler():
    def __init__(self, path=""):

        # Variables
        model_id = "galatolo/cerbero-7b-openchat"
        self.device = "cuda:0" if torch.cuda.is_available() else "cpu"
        torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

        # Model (GPU)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch_dtype,
            low_cpu_mem_usage=True,
        )
        self.model.to(self.device)

        # Tokenizer (CPU)
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)

    def __call__(self, data: Dict[str, bytes]) -> str:

        # Read the input
        prompt = data.pop("inputs", data)
        stopping_criteria = MyStoppingCriteria("[|Umano|]", prompt, self.tokenizer)

        # Encode
        input_ids = self.tokenizer(prompt, return_tensors='pt').input_ids
        input_ids = input_ids.to(self.device)

        # Generate
        with torch.no_grad():
            output_ids = self.model.generate(input_ids, max_new_tokens=2048, stopping_criteria=stopping_criteria)

        # Decode
        generated_text = self.tokenizer.decode(output_ids[0], skip_special_tokens=True)

        return generated_text
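One thing to note about the handler above: generate returns the prompt followed by the new tokens, and the criterion only fires once the stop word has already been produced, so the returned text would likely still contain both. A minimal post-processing sketch follows. The `clean_reply` helper is a hypothetical name, and it assumes you decode only the new tokens, for example `output_ids[0][input_ids.shape[1]:]`. It also trims a trailing partial stop word such as "[|Umano|", which can appear when generation is cut off by max_new_tokens.

```python
def clean_reply(generated_text: str, stop_word: str = "[|Umano|]") -> str:
    """Remove the stop word (or a trailing fragment of it) from a reply."""
    # Cut at the first full occurrence of the stop word, if any.
    idx = generated_text.find(stop_word)
    if idx != -1:
        return generated_text[:idx].strip()
    # Otherwise drop a trailing partial stop word, longest fragment first.
    # Fragments shorter than 2 characters are ignored so that a reply
    # ending in a lone "[" is left untouched.
    for k in range(len(stop_word) - 1, 1, -1):
        if generated_text.endswith(stop_word[:k]):
            return generated_text[:-k].strip()
    return generated_text.strip()


print(clean_reply(" Ciao!\n[|Umano|]"))
print(clean_reply(" Ciao!\n[|Umano|"))
```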

Thank you for providing the full code!

galatolo changed discussion status to closed
