---
license: bigscience-bloom-rail-1.0
---
# BahasaGPT-1 Fine-Tuning Documentation Summary

## Introduction

This document provides an overview of BahasaGPT-1, an instruction-following model fine-tuned for the Indonesian language. The model is based on the Bloomz-7B-mt architecture and is fine-tuned on a dataset of over 70,000 Indonesian instructions.

## Model Details

**Model Name:** BahasaGPT-1

**Model Source:** Bloomz-7B-mt

**Dataset for Fine-Tuning:** Over 70,000 Indonesian instructions generated using the Alpaca method from the following sources (an example record layout is sketched after this list):

- [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca)
- Translated instructions from OA ([Anh/data at main · LAION-AI/Anh](https://github.com/LAION-AI/Anh))
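
The combined corpus itself is not reproduced in this card, but Alpaca-style data is typically stored as instruction/input/output records. The snippet below is a hypothetical illustration of that layout; the field names and the Indonesian text are assumptions for illustration, not lines taken from the actual training files.

```python
# Hypothetical Alpaca-style instruction record (illustrative only; the real
# BahasaGPT-1 training files may use different fields or formatting).
example_record = {
    "instruction": "Jelaskan apa itu fotosintesis secara singkat.",
    "input": "",  # optional extra context; empty when the instruction stands alone
    "output": (
        "Fotosintesis adalah proses tumbuhan mengubah cahaya matahari, air, "
        "dan karbon dioksida menjadi energi dan oksigen."
    ),
}
```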

## Fine-Tuning Process

The BahasaGPT-1 model was fine-tuned using a dataset of over 70,000 Indonesian instructions, which were generated using the Alpaca method from Stanford and translated instructions from OA. This combination of datasets allowed the model to be better adapted to the specific needs of Indonesian language tasks.

The fine-tuning process iteratively updated the model's parameters on this instruction dataset to optimize its performance on Indonesian-language instruction following, using a fixed prompt layout as sketched below.
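
The exact training script is not included in this card, but the inference code in the How to Run section shows the prompt layout the model expects (`### Instruction:` followed by `### Response:` and an `### End` marker). A minimal sketch of turning one record into a training string under that assumption might look like the following; the helper name and record fields are illustrative.

```python
# Minimal sketch: format one instruction record into a training example.
# Assumes the same prompt layout used at inference time below; the actual
# training-time formatting of BahasaGPT-1 may differ.
def format_training_example(record: dict) -> str:
    return (
        "Dibawah ini adalah instruksi yang menjelaskan suatu tugas.\n"
        "### Instruction:\n"
        f"{record['instruction']}\n"
        "### Response:\n"
        f"{record['output']}\n"
        "### End"
    )
```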

## Known Limitations

Despite the successful fine-tuning, the BahasaGPT-1 model still has some limitations:

1. **Hallucination:** The model sometimes generates outputs that may seem plausible but are not based on the input data. This may lead to incorrect or nonsensical responses in some cases.

2. **Repeated Tokens:** The model occasionally produces repeated tokens in the output, which may affect the overall coherence and readability of the generated text (see the decoding-parameter sketch after this list).
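
Repetition can often be reduced at inference time with standard decoding controls. The call below is a sketch of tightening those controls when invoking the model directly; it assumes `model`, `tokenizer`, and `PROMPT_FORMAT` have been set up as in the How to Run section, and the parameter values are illustrative starting points rather than tuned recommendations.

```python
# Sketch: stricter repetition controls (assumes model, tokenizer, and
# PROMPT_FORMAT from the "How to Run" section; values are illustrative).
prompt = PROMPT_FORMAT.format(instruction="Tuliskan ringkasan singkat tentang Jakarta.")
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda")
output = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    top_p=0.92,
    no_repeat_ngram_size=4,   # forbid exact 4-gram repeats
    repetition_penalty=1.15,  # softly penalize already-generated tokens
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```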

## Conclusion

The BahasaGPT-1 model is a fine-tuned language model for Indonesian language tasks, based on the Bloomz-7B-mt architecture. The model was trained on a dataset of over 70,000 Indonesian instructions generated using the Alpaca method and translated instructions from OA. Despite some limitations, such as occasional hallucination and repeated tokens, the model provides a valuable tool for working with Indonesian language tasks.


## How to Run

```python
import logging
from typing import Tuple
import torch
import numpy as np
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    PreTrainedModel,
    PreTrainedTokenizer,
)

END_KEY = "### End"
INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY_NL = "### Response:\n"
DEFAULT_SEED = 42

# The format of the instruction the model has been trained on.
PROMPT_FORMAT = """%s
%s
{instruction}
%s""" % (
    "Dibawah ini adalah instruksi yang menjelaskan suatu tugas.",
    INSTRUCTION_KEY,
    RESPONSE_KEY_NL,
)

logger = logging.getLogger(__name__)


def load_model_tokenizer_for_generate(
    pretrained_model_name_or_path: str,
) -> Tuple[PreTrainedModel, PreTrainedTokenizer]:
    """Loads the model and tokenizer so that it can be used for generating responses.
    Args:
        pretrained_model_name_or_path (str): name or path for model
    Returns:
        Tuple[PreTrainedModel, PreTrainedTokenizer]: model and tokenizer
    """
    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path, padding_side="left")
    model = AutoModelForCausalLM.from_pretrained(
        pretrained_model_name_or_path,load_in_8bit=True, device_map="auto", trust_remote_code=True
    )
    return model, tokenizer


def get_special_token_id(tokenizer: PreTrainedTokenizer, key: str) -> int:
    """Gets the token ID for a given string that has been added to the tokenizer as a special token.
    When training, we configure the tokenizer so that the sequences like "### Instruction:" and "### End" are
    treated specially and converted to a single, new token.  This retrieves the token ID each of these keys map to.
    Args:
        tokenizer (PreTrainedTokenizer): the tokenizer
        key (str): the key to convert to a single token
    Raises:
        RuntimeError: if more than one ID was generated
    Returns:
        int: the token ID for the given key
    """
    token_ids = tokenizer.encode(key)
    if len(token_ids) > 1:
        raise RuntimeError(f"Expected only a single token for '{key}' but found {token_ids}")
    return token_ids[0]


def generate_response(
    instruction: str,
    *,
    model: PreTrainedModel,
    tokenizer: PreTrainedTokenizer,
    do_sample: bool = True,
    max_new_tokens: int = 256,
    top_p: float = 0.92,
    top_k: int = 40,
    **kwargs,
) -> str:
    """Given an instruction, uses the model and tokenizer to generate a response.  This formats the instruction in
    the instruction format that the model was fine-tuned on.
    Args:
        instruction (str): instruction to generate response for
        model (PreTrainedModel): model to use
        tokenizer (PreTrainedTokenizer): tokenizer to use
        do_sample (bool, optional): Whether or not to use sampling. Defaults to True.
        max_new_tokens (int, optional): Max new tokens after the prompt to generate. Defaults to 256.
        top_p (float, optional): If set to float < 1, only the smallest set of most probable tokens with probabilities
            that add up to top_p or higher are kept for generation. Defaults to 0.92.
        top_k (int, optional): The number of highest probability vocabulary tokens to keep for top-k-filtering.
            Defaults to 40.
    Returns:
        str: the generated response
    """
    print(PROMPT_FORMAT.format(instruction=instruction))
    input_ids = tokenizer(PROMPT_FORMAT.format(instruction=instruction), return_tensors="pt").input_ids.to("cuda")

    response_key_token_id = get_special_token_id(tokenizer, RESPONSE_KEY_NL)
    end_key_token_id = get_special_token_id(tokenizer, END_KEY)
    gen_tokens = model.generate(
        input_ids,
        pad_token_id=tokenizer.pad_token_id,
        # Ensure generation stops once it generates "### End"
        eos_token_id=end_key_token_id,
        do_sample=do_sample,
        max_new_tokens=max_new_tokens,
        top_p=top_p,
        no_repeat_ngram_size=5,
        repetition_penalty=1.0,
        num_beams=4,
        top_k=top_k,
        **kwargs,
    )[0].cpu()

    # The response will be set to this variable if we can identify it.
    decoded = None

    # Find where "### Response:" is first found in the generated tokens.  Considering this is part of the prompt,
    # we should definitely find it.  We will return the tokens found after this token.
    response_pos = None
    response_positions = np.where(gen_tokens == response_key_token_id)[0]
    if len(response_positions) == 0:
        logger.warning(f"Could not find response key {response_key_token_id} in: {gen_tokens}")
    else:
        response_pos = response_positions[0]

    if response_pos is not None:
        # Next find where "### End" is located.  The model has been trained to end its responses with this sequence
        # (or actually, the token ID it maps to, since it is a special token).  We may not find this token, as the
        # response could be truncated.  If we don't find it then just return everything to the end.  Note that
        # even though we set eos_token_id, we still see this token at the end.
        end_pos = None
        end_positions = np.where(gen_tokens == end_key_token_id)[0]
        if len(end_positions) > 0:
            end_pos = end_positions[0]

        decoded = tokenizer.decode(gen_tokens[response_pos + 1 : end_pos]).strip()

    return decoded

model, tokenizer = load_model_tokenizer_for_generate(pretrained_model_name_or_path="Bahasalab/BahasaGPT-1")

def main():

    while True:
        instruction = input("Enter your instruction (type 'exit' to quit): ")

        if instruction.lower() == "exit":
            break

        response = generate_response(model=model, tokenizer=tokenizer, instruction=instruction)
        print(response)

if __name__ == "__main__":
    main()
```
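
For non-interactive use, `generate_response` can also be called once directly after the model and tokenizer have been loaded as above; the instruction string below is chosen purely for illustration.

```python
# One-off generation using the helpers defined above (illustrative instruction).
response = generate_response(
    "Sebutkan tiga pulau terbesar di Indonesia.",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=128,
)
print(response)
```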