BahasaGPT-1 Fine-Tuning Documentation Summary
Introduction
This document provides an overview of the BahasaGPT-1 model, which is a fine-tuned model for a specific task in the Indonesian language. The model is based on the Bloomz-7B-mt architecture and is fine-tuned using a dataset of over 70,000 Indonesian instructions.
Model Details
Model Name: BahasaGPT-1
Model Source: Bloomz-7B-mt
Dataset for Fine-Tuning: Over 70k Indonesia Instruct Dataset generated using the Alpaca method from the following sources:
- Stanford Alpaca
- Translated instructions from OA (Anh/data at main · LAION-AI/Anh)
Fine-Tuning Process
The BahasaGPT-1 model was fine-tuned using a dataset of over 70,000 Indonesian instructions, which were generated using the Alpaca method from Stanford and translated instructions from OA. This combination of datasets allowed the model to be better adapted to the specific needs of Indonesian language tasks.
The fine-tuning process involved adjusting the model's weights and biases based on the input dataset. This was done iteratively to optimize the model's performance for the specific task in the Indonesian language.
Known Limitations
Despite the successful fine-tuning, the BahasaGPT-1 model still has some limitations:
Hallucination: The model sometimes generates outputs that may seem plausible but are not based on the input data. This may lead to incorrect or nonsensical responses in some cases.
Repeated Tokens: The model occasionally produces repeated tokens in the output, which may affect the overall coherence and readability of the generated text.
Conclusion
The BahasaGPT-1 model is a fine-tuned language model for Indonesian language tasks, based on the Bloomz-7B-mt architecture. The model was trained on a dataset of over 70,000 Indonesian instructions generated using the Alpaca method and translated instructions from OA. Despite some limitations, such as occasional hallucination and repeated tokens, the model provides a valuable tool for working with Indonesian language tasks.
How to Run
from typing import Tuple
import torch
import numpy as np
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
PreTrainedModel,
PreTrainedTokenizer,
)
END_KEY = "### End"
INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY_NL = f"### Response:\n"
DEFAULT_SEED = 42
# The format of the instruction the model has been trained on.
PROMPT_FORMAT = """%s
%s
{instruction}
%s""" % (
"Dibawah ini adalah instruksi yang menjelaskan suatu tugas.",
INSTRUCTION_KEY,
RESPONSE_KEY_NL,
)
def xglm_prompt(dic):
if dic.get("input") is None:
text = PROMPT_DICT['prompt_no_input'].format_map(dic)
else:
text = PROMPT_DICT['prompt_input'].format_map(dic)
return text
logger = logging.getLogger(__name__)
def load_model_tokenizer_for_generate(
pretrained_model_name_or_path: str,
) -> Tuple[PreTrainedModel, PreTrainedTokenizer]:
"""Loads the model and tokenizer so that it can be used for generating responses.
Args:
pretrained_model_name_or_path (str): name or path for model
Returns:
Tuple[PreTrainedModel, PreTrainedTokenizer]: model and tokenizer
"""
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path, padding_side="left")
model = AutoModelForCausalLM.from_pretrained(
pretrained_model_name_or_path,load_in_8bit=True, device_map="auto", trust_remote_code=True
)
return model, tokenizer
def get_special_token_id(tokenizer: PreTrainedTokenizer, key: str) -> int:
"""Gets the token ID for a given string that has been added to the tokenizer as a special token.
When training, we configure the tokenizer so that the sequences like "### Instruction:" and "### End" are
treated specially and converted to a single, new token. This retrieves the token ID each of these keys map to.
Args:
tokenizer (PreTrainedTokenizer): the tokenizer
key (str): the key to convert to a single token
Raises:
RuntimeError: if more than one ID was generated
Returns:
int: the token ID for the given key
"""
token_ids = tokenizer.encode(key)
if len(token_ids) > 1:
raise RuntimeError(f"Expected only a single token for '{key}' but found {token_ids}")
return token_ids[0]
def generate_response(
instruction: str,
*,
model: PreTrainedModel,
tokenizer: PreTrainedTokenizer,
do_sample: bool = True,
max_new_tokens: int = 256,
top_p: float = 0.92,
top_k: int = 40,
**kwargs,
) -> str:
"""Given an instruction, uses the model and tokenizer to generate a response. This formats the instruction in
the instruction format that the model was fine-tuned on.
Args:
instruction (str): instruction to generate response for
model (PreTrainedModel): model to use
tokenizer (PreTrainedTokenizer): tokenizer to use
do_sample (bool, optional): Whether or not to use sampling. Defaults to True.
max_new_tokens (int, optional): Max new tokens after the prompt to generate. Defaults to 128.
top_p (float, optional): If set to float < 1, only the smallest set of most probable tokens with probabilities
that add up to top_p or higher are kept for generation. Defaults to 0.92.
top_k (int, optional): The number of highest probability vocabulary tokens to keep for top-k-filtering.
Defaults to 0.
Returns:
str: the generated response
"""
print(PROMPT_FORMAT.format(instruction=instruction))
input_ids = tokenizer(PROMPT_FORMAT.format(instruction=instruction), return_tensors="pt").input_ids.to("cuda")
response_key_token_id = get_special_token_id(tokenizer, RESPONSE_KEY_NL)
end_key_token_id = get_special_token_id(tokenizer, END_KEY)
gen_tokens = model.generate(
input_ids,
pad_token_id=tokenizer.pad_token_id,
# Ensure generation stops once it generates "### End"
eos_token_id=end_key_token_id,
do_sample=do_sample,
max_new_tokens=max_new_tokens,
top_p=top_p,
no_repeat_ngram_size=5,
repetition_penalty=1.0,
num_beams=4,
top_k=top_k,
**kwargs,
)[0].cpu()
# The response will be set to this variable if we can identify it.
decoded = None
# Find where "### Response:" is first found in the generated tokens. Considering this is part of the prompt,
# we should definitely find it. We will return the tokens found after this token.
response_pos = None
response_positions = np.where(gen_tokens == response_key_token_id)[0]
if len(response_positions) == 0:
logger.warn(f"Could not find response key {response_key_token_id} in: {gen_tokens}")
else:
response_pos = response_positions[0]
if response_pos:
# Next find where "### End" is located. The model has been trained to end its responses with this sequence
# (or actually, the token ID it maps to, since it is a special token). We may not find this token, as the
# response could be truncated. If we don't find it then just return everything to the end. Note that
# even though we set eos_token_id, we still see the this token at the end.
end_pos = None
end_positions = np.where(gen_tokens == end_key_token_id)[0]
if len(end_positions) > 0:
end_pos = end_positions[0]
decoded = tokenizer.decode(gen_tokens[response_pos + 1 : end_pos]).strip()
return decoded
model ,tokenizer = load_model_tokenizer_for_generate(pretrained_model_name_or_path="Bahasalab/BahasaGPT-1")
def main():
while True:
instruction = input("Enter your instruction (type 'exit' to quit): ")
if instruction.lower() == "exit":
break
response = generate_response(model=model, tokenizer=tokenizer, instruction=instruction)
print(response)
if __name__ == "__main__":
main()```
- Downloads last month
- 4