## Conclusion

The BahasaGPT-1 model is a fine-tuned language model for Indonesian-language tasks, based on the Bloomz-7B-mt architecture. It was trained on a dataset of more than 70,000 Indonesian instructions generated with the Alpaca method, together with translated instructions from OA. Despite some limitations, such as occasional hallucination and repeated tokens, the model is a useful tool for Indonesian-language tasks.

## How to Run

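The script below loads the model in 8-bit precision and runs an interactive prompt loop. Note that `load_in_8bit=True` requires the `bitsandbytes` package (and `accelerate` for `device_map="auto"`) alongside `transformers`; the script also assumes a CUDA-capable GPU, since the input tensors are moved to `"cuda"`.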

```python
import logging
from typing import Tuple

import numpy as np
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    PreTrainedModel,
    PreTrainedTokenizer,
)

END_KEY = "### End"
INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY_NL = "### Response:\n"
DEFAULT_SEED = 42

# The instruction format the model was fine-tuned on. The Indonesian preamble
# translates to "Below is an instruction that describes a task."
PROMPT_FORMAT = """%s
%s
{instruction}
%s""" % (
    "Dibawah ini adalah instruksi yang menjelaskan suatu tugas.",
    INSTRUCTION_KEY,
    RESPONSE_KEY_NL,
)

logger = logging.getLogger(__name__)


def load_model_tokenizer_for_generate(
    pretrained_model_name_or_path: str,
) -> Tuple[PreTrainedModel, PreTrainedTokenizer]:
    """Loads the model and tokenizer so that they can be used for generating responses.

    Args:
        pretrained_model_name_or_path (str): name or path for the model

    Returns:
        Tuple[PreTrainedModel, PreTrainedTokenizer]: model and tokenizer
    """
    tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path, padding_side="left")
    model = AutoModelForCausalLM.from_pretrained(
        pretrained_model_name_or_path, load_in_8bit=True, device_map="auto", trust_remote_code=True
    )
    return model, tokenizer


def get_special_token_id(tokenizer: PreTrainedTokenizer, key: str) -> int:
    """Gets the token ID for a string that was added to the tokenizer as a special token.

    During training, the tokenizer is configured so that sequences like "### Instruction:"
    and "### End" are treated specially and converted to a single, new token. This retrieves
    the token ID each of these keys maps to.

    Args:
        tokenizer (PreTrainedTokenizer): the tokenizer
        key (str): the key to convert to a single token

    Raises:
        RuntimeError: if more than one ID was generated

    Returns:
        int: the token ID for the given key
    """
    token_ids = tokenizer.encode(key)
    if len(token_ids) > 1:
        raise RuntimeError(f"Expected only a single token for '{key}' but found {token_ids}")
    return token_ids[0]


def generate_response(
    instruction: str,
    *,
    model: PreTrainedModel,
    tokenizer: PreTrainedTokenizer,
    do_sample: bool = True,
    max_new_tokens: int = 256,
    top_p: float = 0.92,
    top_k: int = 40,
    **kwargs,
) -> str:
    """Given an instruction, uses the model and tokenizer to generate a response.

    This formats the instruction in the instruction format that the model was fine-tuned on.

    Args:
        instruction (str): instruction to generate a response for
        model (PreTrainedModel): model to use
        tokenizer (PreTrainedTokenizer): tokenizer to use
        do_sample (bool, optional): whether or not to use sampling. Defaults to True.
        max_new_tokens (int, optional): max new tokens after the prompt to generate.
            Defaults to 256.
        top_p (float, optional): if set to a float < 1, only the smallest set of most probable
            tokens with probabilities that add up to top_p or higher are kept for generation.
            Defaults to 0.92.
        top_k (int, optional): the number of highest-probability vocabulary tokens to keep
            for top-k filtering. Defaults to 40.

    Returns:
        str: the generated response
    """
    print(PROMPT_FORMAT.format(instruction=instruction))
    input_ids = tokenizer(
        PROMPT_FORMAT.format(instruction=instruction), return_tensors="pt"
    ).input_ids.to("cuda")

    response_key_token_id = get_special_token_id(tokenizer, RESPONSE_KEY_NL)
    end_key_token_id = get_special_token_id(tokenizer, END_KEY)
    gen_tokens = model.generate(
        input_ids,
        pad_token_id=tokenizer.pad_token_id,
        # Ensure generation stops once the model emits "### End".
        eos_token_id=end_key_token_id,
        do_sample=do_sample,
        max_new_tokens=max_new_tokens,
        top_p=top_p,
        no_repeat_ngram_size=5,
        repetition_penalty=1.0,
        num_beams=4,
        top_k=top_k,
        **kwargs,
    )[0].cpu()

    # The response will be set to this variable if we can identify it.
    decoded = None

    # Find where "### Response:" first occurs in the generated tokens. Since it is part of
    # the prompt, we should always find it; the response is the tokens that follow it.
    response_pos = None
    response_positions = np.where(gen_tokens == response_key_token_id)[0]
    if len(response_positions) == 0:
        logger.warning(f"Could not find response key {response_key_token_id} in: {gen_tokens}")
    else:
        response_pos = response_positions[0]

    if response_pos is not None:
        # Next, find where "### End" is located. The model has been trained to end its
        # responses with this sequence (or rather, the token ID it maps to, since it is a
        # special token). It may be absent if the response was truncated; in that case,
        # return everything to the end. Note that even though eos_token_id is set, this
        # token still appears at the end of the output.
        end_pos = None
        end_positions = np.where(gen_tokens == end_key_token_id)[0]
        if len(end_positions) > 0:
            end_pos = end_positions[0]

        decoded = tokenizer.decode(gen_tokens[response_pos + 1 : end_pos]).strip()

    return decoded


model, tokenizer = load_model_tokenizer_for_generate(
    pretrained_model_name_or_path="Bahasalab/BahasaGPT-1"
)


def main():
    while True:
        instruction = input("Enter your instruction (type 'exit' to quit): ")
        if instruction.lower() == "exit":
            break
        response = generate_response(model=model, tokenizer=tokenizer, instruction=instruction)
        print(response)


if __name__ == "__main__":
    main()
```
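
A few notes on the generation settings: `num_beams=4` combined with `do_sample=True` mixes beam search with sampling, `no_repeat_ngram_size=5` guards against the repeated-token failure mode mentioned above, and generation stops at the special `### End` token. If the `### Response:` marker cannot be found in the output, `generate_response` returns `None`, so callers may want to handle that case.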