BahasaGPT-1 Fine-Tuning Documentation Summary

Introduction

This document gives an overview of BahasaGPT-1, an instruction-following language model for Indonesian. The model is fine-tuned from Bloomz-7B-mt on a dataset of over 70,000 Indonesian instructions.

Model Details

Model Name: BahasaGPT-1

Model Source: Bloomz-7B-mt

Dataset for Fine-Tuning: An Indonesian instruction dataset of over 70,000 examples, generated using the Alpaca method from the following sources:

Stanford Alpaca
Translated instructions from OA (Anh/data at main · LAION-AI/Anh)

Fine-Tuning Process

BahasaGPT-1 was fine-tuned on a dataset of over 70,000 Indonesian instructions that combines two sources: instructions generated using the Stanford Alpaca method, and instructions translated from OA. Combining the two sources was intended to adapt the model more closely to Indonesian-language tasks.

Fine-tuning iteratively updated the model's weights and biases on this dataset to improve its responses to Indonesian instructions.
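
The script under "How to Run" wraps each instruction in a fixed template that uses the markers "### Instruction:", "### Response:" and "### End". As a minimal sketch of how one Alpaca-style record might be rendered into a single training string, the example below assumes that each record has `instruction` and `output` fields and that the response is terminated by the end marker. The helper name `format_training_example`, the exact layout of the training text, and the sample record are illustrative assumptions, not taken from the actual training code.

```python
# Constants mirror the ones in the "How to Run" script.
INTRO = "Dibawah ini adalah instruksi yang menjelaskan suatu tugas."
INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY = "### Response:"
END_KEY = "### End"


def format_training_example(record: dict) -> str:
    """Render one instruction/response pair as a training string (hypothetical helper)."""
    return (
        f"{INTRO}\n"
        f"{INSTRUCTION_KEY}\n"
        f"{record['instruction']}\n"
        f"{RESPONSE_KEY}\n"
        f"{record['output']}\n"
        f"{END_KEY}"
    )


# Made-up record, used only to show the resulting layout.
example = {
    "instruction": "Sebutkan tiga warna primer.",
    "output": "Merah, biru, dan kuning.",
}
print(format_training_example(example))
```

At inference time the same template is used up to and including the response marker, and the model is expected to produce the response followed by the end marker.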

Known Limitations

Despite the successful fine-tuning, the BahasaGPT-1 model still has some limitations:

Hallucination: The model sometimes generates outputs that may seem plausible but are not based on the input data. This may lead to incorrect or nonsensical responses in some cases.
Repeated Tokens: The model occasionally produces repeated tokens in the output, which may affect the overall coherence and readability of the generated text.
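
The script under "How to Run" passes `no_repeat_ngram_size=5` to `generate`, which prevents any sequence of five tokens from appearing twice in the output. Shorter repetitions can still occur. One way to flag them after generation is sketched below. It is an illustration only: the helper name `first_repeated_ngram` is hypothetical, and it works on whitespace-separated words rather than on the model's own tokens.

```python
from typing import Optional, Sequence, Tuple


def first_repeated_ngram(tokens: Sequence[str], n: int = 5) -> Optional[Tuple[str, ...]]:
    """Return the first n-gram that occurs more than once, or None if there is none."""
    seen = set()
    for i in range(len(tokens) - n + 1):
        ngram = tuple(tokens[i : i + n])
        if ngram in seen:
            return ngram
        seen.add(ngram)
    return None


# Made-up output text containing a repeated three-word phrase.
text = "saya suka makan nasi saya suka makan nasi goreng"
print(first_repeated_ngram(text.split(), n=3))
```

A response flagged this way could be regenerated, or generated again with a smaller `no_repeat_ngram_size`.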

Conclusion

BahasaGPT-1 is a language model for Indonesian-language tasks, fine-tuned from Bloomz-7B-mt on over 70,000 Indonesian instructions that were generated using the Alpaca method or translated from OA. Despite its limitations, namely occasional hallucination and repeated tokens, it is a useful tool for working with Indonesian-language tasks.

How to Run

import logging
from typing import Tuple
import torch
import numpy as np
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    PreTrainedModel,
    PreTrainedTokenizer,
)

END_KEY = "### End"
INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY_NL = "### Response:\n"
DEFAULT_SEED = 42

# The format of the instruction the model has been trained on.
PROMPT_FORMAT = """%s
%s
{instruction}
%s""" % (
    "Dibawah ini adalah instruksi yang menjelaskan suatu tugas.",
    INSTRUCTION_KEY,
    RESPONSE_KEY_NL,
)

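# NOTE: PROMPT_DICT is not defined in this script and nothing below calls this helper.
# Calling it as-is raises NameError; define PROMPT_DICT first if you need it.
def xglm_prompt(dic):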
    if dic.get("input") is None:
        text = PROMPT_DICT['prompt_no_input'].format_map(dic)
    else:
        text = PROMPT_DICT['prompt_input'].format_map(dic)
    return text

logger = logging.getLogger(__name__)


def load_model_tokenizer_for_generate(
    pretrained_model_name_or_path: str,
) -> Tuple[PreTrainedModel, PreTrainedTokenizer]:
    """Loads the model and tokenizer so that it can be used for generating responses.
    Args:
        pretrained_model_name_or_path (str): name or path for model
    Returns:
        Tuple[PreTrainedModel, PreTrainedTokenizer]: model and tokenizer
    """
    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path, padding_side="left")
    # load_in_8bit=True requires the bitsandbytes and accelerate packages to be installed.
    model = AutoModelForCausalLM.from_pretrained(
        pretrained_model_name_or_path,
        load_in_8bit=True,
        device_map="auto",
        trust_remote_code=True,
    )
    return model, tokenizer


def get_special_token_id(tokenizer: PreTrainedTokenizer, key: str) -> int:
    """Gets the token ID for a given string that has been added to the tokenizer as a special token.
    When training, we configure the tokenizer so that sequences like "### Instruction:" and "### End" are
    treated specially and converted to a single, new token.  This retrieves the token ID that each of these keys maps to.
    Args:
        tokenizer (PreTrainedTokenizer): the tokenizer
        key (str): the key to convert to a single token
    Raises:
        RuntimeError: if more than one ID was generated
    Returns:
        int: the token ID for the given key
    """
    token_ids = tokenizer.encode(key)
    if len(token_ids) > 1:
        raise RuntimeError(f"Expected only a single token for '{key}' but found {token_ids}")
    return token_ids[0]


def generate_response(
    instruction: str,
    *,
    model: PreTrainedModel,
    tokenizer: PreTrainedTokenizer,
    do_sample: bool = True,
    max_new_tokens: int = 256,
    top_p: float = 0.92,
    top_k: int = 40,
    **kwargs,
) -> str:
    """Given an instruction, uses the model and tokenizer to generate a response.  This formats the instruction in
    the instruction format that the model was fine-tuned on.
    Args:
        instruction (str): instruction to generate response for
        model (PreTrainedModel): model to use
        tokenizer (PreTrainedTokenizer): tokenizer to use
        do_sample (bool, optional): Whether or not to use sampling. Defaults to True.
        max_new_tokens (int, optional): Max new tokens after the prompt to generate. Defaults to 256.
        top_p (float, optional): If set to float < 1, only the smallest set of most probable tokens with probabilities
            that add up to top_p or higher are kept for generation. Defaults to 0.92.
        top_k (int, optional): The number of highest probability vocabulary tokens to keep for top-k-filtering.
            Defaults to 40.
    Returns:
        str: the generated response
    """
    prompt = PROMPT_FORMAT.format(instruction=instruction)
    print(prompt)
    # Move the inputs to whichever device the model was loaded on instead of assuming "cuda".
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

    response_key_token_id = get_special_token_id(tokenizer, RESPONSE_KEY_NL)
    end_key_token_id = get_special_token_id(tokenizer, END_KEY)
    gen_tokens = model.generate(
        input_ids,
        pad_token_id=tokenizer.pad_token_id,
        # Ensure generation stops once it generates "### End"
        eos_token_id=end_key_token_id,
        do_sample=do_sample,
        max_new_tokens=max_new_tokens,
        top_p=top_p,
        no_repeat_ngram_size=5,
        repetition_penalty=1.0,
        num_beams=4,
        top_k=top_k,
        **kwargs,
    )[0].cpu()

    # The response will be set to this variable if we can identify it.
    decoded = None

    # Find where "### Response:" is first found in the generated tokens.  Considering this is part of the prompt,
    # we should definitely find it.  We will return the tokens found after this token.
    response_pos = None
    response_positions = np.where(gen_tokens == response_key_token_id)[0]
    if len(response_positions) == 0:
        logger.warning(f"Could not find response key {response_key_token_id} in: {gen_tokens}")
    else:
        response_pos = response_positions[0]

    # Compare against None explicitly: position 0 is a valid match but is falsy.
    if response_pos is not None:
        # Next find where "### End" is located.  The model has been trained to end its responses with this sequence
        # (or actually, the token ID it maps to, since it is a special token).  We may not find this token, as the
        # response could be truncated.  If we don't find it then just return everything to the end.  Note that
        # even though we set eos_token_id, we still see this token at the end.
        end_pos = None
        end_positions = np.where(gen_tokens == end_key_token_id)[0]
        if len(end_positions) > 0:
            end_pos = end_positions[0]

        decoded = tokenizer.decode(gen_tokens[response_pos + 1 : end_pos]).strip()

    return decoded

model, tokenizer = load_model_tokenizer_for_generate(pretrained_model_name_or_path="Bahasalab/BahasaGPT-1")

def main():

    while True:
        instruction = input("Enter your instruction (type 'exit' to quit): ")

        if instruction.lower() == "exit":
            break

        response = generate_response(model=model, tokenizer=tokenizer, instruction=instruction)
        print(response)

if __name__ == "__main__":
    main()
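
The post-processing step in `generate_response` can be tried without loading the model. The sketch below reproduces the idea on a plain list of token IDs: keep whatever lies between the response marker and the end marker, and fall back to the end of the sequence if the output was truncated. The helper name `extract_response_ids` and the marker IDs 900 and 901 are made up for illustration. Unlike the script above, it searches for the end marker only after the response marker.

```python
from typing import List, Optional


def extract_response_ids(token_ids: List[int], response_id: int, end_id: int) -> Optional[List[int]]:
    """Return the IDs between the first response marker and the first end marker after it."""
    try:
        start = token_ids.index(response_id)
    except ValueError:
        return None  # Marker missing: no response can be identified.
    try:
        end = token_ids.index(end_id, start + 1)
    except ValueError:
        end = len(token_ids)  # Truncated output: keep everything to the end.
    return token_ids[start + 1 : end]


# Made-up IDs: 900 stands in for "### Response:\n", 901 for "### End".
print(extract_response_ids([900, 11, 12, 901], response_id=900, end_id=901))
print(extract_response_ids([5, 900, 11, 12], response_id=900, end_id=901))
```

A marker found at index 0 is a valid position, so the function distinguishes "not found" from "found at the start" by catching `ValueError` rather than by testing the index for truthiness.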