inference: false
license: mit
base_model: indobenchmark/indogpt
tags:
  - generated_from_trainer
model-index:
  - name: kancilgpt
    results: []

KancilGPT

(Note: KancilGPT is still under development.)

Once upon a time, in a digital data forest, there was a language model called KancilGPT.

Model Description

KancilGPT is a fine-tuned version of indobenchmark/indogpt. Its task is to generate Indonesian fable stories. The model is named after kancil (the mouse-deer), a famous, wise (but also a master of trolling), cute fable character. KancilGPT was trained on an unpublished dataset gathered from dongengceritarakyat.com.

Dataset and Prompt

The dataset consists of 388 Indonesian fable stories. These stories were gathered from dongengceritarakyat.com on January 8, 2024. Duplicated stories (repeated without any paraphrasing) were removed, based on the cosine similarity of TF-IDF vectors over word trigrams. Furthermore, the remaining stories were cleaned manually to remove non-fable stories, incomplete stories (e.g. synopses), misused punctuation, and typos. This cleaning is still ongoing. If a mistake is found, the dataset will be corrected as soon as possible.

The cleaned stories were split with an 80:10:10 ratio, giving

  • 310 stories for training,
  • 39 stories for evaluation, and
  • 39 stories for testing (currently unused).

The split is based on the cosine similarity of TF-IDF word trigrams, the same measure used for duplicate handling. Stories are chosen one by one, prioritizing the story whose maximum cosine similarity to the other stories is smallest. The first 39 stories chosen are used for testing, and the rest are divided randomly between training and evaluation. This method makes sure that no paraphrased duplicate story ends up in the test data.
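
One way to read this selection rule is sketched below in plain Python. It is not the author's code: the whitespace tokenization, the smoothed IDF formula, and the names `word_trigrams`, `tfidf_vectors`, and `split_by_similarity` are all assumptions made for illustration, and the "one by one" selection is simplified to a single sort by each story's maximum similarity to any other story.

```python
import math
import random
from collections import Counter


def word_trigrams(text):
    """Return the word trigrams of a whitespace-tokenized text."""
    words = text.split()
    return [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]


def tfidf_vectors(texts):
    """Build L2-normalized TF-IDF vectors (as dicts) over word trigrams."""
    counts = [Counter(word_trigrams(t)) for t in texts]
    doc_freq = Counter(term for c in counts for term in c)
    n = len(texts)
    vectors = []
    for c in counts:
        # Smoothed IDF; the exact weighting used for KancilGPT is not documented.
        vec = {t: tf * (math.log((1 + n) / (1 + doc_freq[t])) + 1)
               for t, tf in c.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({t: v / norm for t, v in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two already-normalized sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(t, 0.0) for t, v in a.items())


def split_by_similarity(texts, n_test, n_eval, seed=0):
    """Put the least-similar stories in the test set, shuffle the rest."""
    vecs = tfidf_vectors(texts)
    max_sim = [
        max((cosine(v, w) for j, w in enumerate(vecs) if j != i), default=0.0)
        for i, v in enumerate(vecs)
    ]
    order = sorted(range(len(texts)), key=lambda i: max_sim[i])
    test, rest = order[:n_test], order[n_test:]
    random.Random(seed).shuffle(rest)  # random train/eval assignment
    return {"test": test, "eval": rest[:n_eval], "train": rest[n_eval:]}


stories = ["a b c d e", "a b c d f", "x y z w v", "p q r s t"]
split = split_by_similarity(stories, n_test=2, n_eval=1)
print(split["test"])  # indices of the stories least similar to any other
```

With the dataset described above, the call would use `n_test=39` and `n_eval=39`.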

To teach KancilGPT how to generate a story, the prompts were built in the following formats:

  1. <s> awal cerita | judul: <title> | <entire-story-content> | tamat </s>
  2. <s> awal cerita | judul: <title> | <beginning-story-content> | bersambung </s>
  3. <s> pertengahan cerita | judul: <title> | <middle-story-content> | bersambung </s>
  4. <s> pertengahan cerita | judul: <title> | <end-story-content> | tamat </s>

All prompts are written in Indonesian. In general, a prompt has four parts:

  1. story part type—it can be the beginning of a story (awal cerita) or it can be the middle of a story (pertengahan cerita);
  2. story title (judul);
  3. story content; and
  4. story end status—it can be "to be continued" (bersambung) or "the end" (tamat).
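
The four parts can be assembled with a small helper. The sketch below is illustrative only; the function name `build_prompt` and its boolean parameters are assumptions, not part of KancilGPT.

```python
def build_prompt(title, content, is_beginning, is_end):
    """Assemble one training prompt from its four parts."""
    part = "awal cerita" if is_beginning else "pertengahan cerita"
    status = "tamat" if is_end else "bersambung"
    return f"<s> {part} | judul: {title} | {content} | {status} </s>"


# A short story that fits in a single prompt: beginning and end at once.
print(build_prompt("si kancil", "suatu hari , kancil berjalan .", True, True))
```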

A story content consists of the smallest number n of consecutive sentences that together contain at least 1750 characters. If the entire story contains no more than 1750 characters, format 1 is used. To move from one story content to the next (from format 2 to format 3, from format 3 to another format 3, or from format 3 to format 4), the first k sentences of the current story content are removed, where k ≥ 1 is the smallest value that leaves fewer than 1750 characters. The remaining sentences become the start of the next story content.
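
The windowing rule can be sketched as follows. This is a reading of the description, not the original preprocessing code: it assumes the story is already split into sentences, that characters are counted on the sentences joined by single spaces, and the names `trim_front` and `split_into_contents` are hypothetical.

```python
LIMIT = 1750  # character threshold from the description


def trim_front(sentences, limit=LIMIT):
    """Drop the first k sentences (smallest k >= 1) so fewer than `limit` chars remain."""
    k = 1
    while k < len(sentences) and len(" ".join(sentences[k:])) >= limit:
        k += 1
    return sentences[k:]


def split_into_contents(sentences, limit=LIMIT):
    """Return (content, is_beginning, is_end) tuples covering the whole story."""
    contents = []
    window = []
    i = 0
    while True:
        # Grow the window with as few sentences as possible to reach the limit.
        while i < len(sentences) and len(" ".join(window)) < limit:
            window.append(sentences[i])
            i += 1
        is_end = i == len(sentences)
        contents.append((" ".join(window), not contents, is_end))
        if is_end:
            return contents
        window = trim_front(window, limit)


# Tiny demonstration with a small limit so the windows are easy to follow.
demo = split_into_contents(["aaaaaaaaaa"] * 5, limit=25)
for content, is_beginning, is_end in demo:
    print(len(content), is_beginning, is_end)
```

Each tuple maps onto one of the four prompt formats through its two flags: the first window is the beginning of the story, and the window that reaches the last sentence ends with tamat.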

How to Use

Having learned how to generate Indonesian fable stories, KancilGPT can generate a random fable story with the procedure below. All steps use the generate arguments do_sample=True, max_new_tokens=512, and pad_token_id=<eos_token_id>. The Hugging Face pipeline cannot be used yet, because KancilGPT uses the IndoNLGTokenizer class from indobenchmark-toolkit.

Step 1: Begin the story

Use this prompt to generate the beginning of a story, including the generation of a title (judul):

<s> awal cerita | judul:

Below is an example output:

<s> awal cerita | judul: pemburu dan anak cheetah | suatu hari , pemburu itu melihat seekor cheetah yang sedang bersantai di tepi sungai . berburu cheetah di hutan itu menyenangkan , karena dia akan makan saat sedang asyik . cheetah itu gesit dan cerdik , dia bisa memburu cheetah yang sudah tua dan terlihat sangat lezat . pemburu itu berniat untuk menangkap nya , tapi sayang nya dia tidak membawa cheetah itu . oleh karena itu dia hanya mencari seekor kijang dan melihat tulang-tulang itu . setelah menemukan rusa itu , pemburu itu langsung mengejar nya hingga tubuh nya cukup besar . ketika selesai menangkap rusa itu , pemburu itu meminta cheetah untuk memasukkan kepala nya ke dalam cangkang . " ini adalah kepala ku , adik ku , " kata pemburu itu . " kau harus segera menggigit aku . " pemburu itu setuju , dan segera memukulkan kepala nya ke kepala anak cheetah tersebut . pemburu itu melempar sang kijang ke arah rusa . rusa segera menjerit kesakitan dan lari meninggalkan pemburu itu . pemburu yang melihat kejadian itu segera melaporkan kejadian itu kepada teman-teman nya yang lain . " pemburu itu adalah anak cheetah , mereka memang baru datang sekarang , tapi mereka selalu menjaga dan memberi semangat untuk berburu . " pemburu itu membawa seekor cheetah , lalu menunjukkan pada dua teman nya bahwa kepala nya tertembak oleh pemburu . " pemburu itu membawa rusa dan dua orang anak nya , " kata pemburu . " ayo kawan , kita lanjutkan saja perjalanan kita . aku akan mencari kepala rusa itu . bagaimana ? " teriak pemburu . rusa mencoba untuk melepaskan diri dari pemburu . tapi apa yang terjadi ? rusa muda tersebut malah melompat dari kepala pemburu . pemburu itu berhasil membebaskan nya . si cheetah segera berlari meninggalkan pemburu . | bersambung</s>

Note that the raw output is longer than shown here: the </s> token is followed by other random tokens. That's normal. In the generated output, check the end status of the story just before the </s> token. If it is tamat, the story has ended; go to step 3. If it is bersambung, the story should be continued. Remove the first k sentences so that the remaining sentences contain fewer than 1750 characters, where k ≥ 1 is as small as possible. Use the remaining sentences as the next content for the prompt in step 2. Below is the next content from the example output:

berburu cheetah di hutan itu menyenangkan , karena dia akan makan saat sedang asyik . cheetah itu gesit dan cerdik , dia bisa memburu cheetah yang sudah tua dan terlihat sangat lezat . pemburu itu berniat untuk menangkap nya , tapi sayang nya dia tidak membawa cheetah itu . oleh karena itu dia hanya mencari seekor kijang dan melihat tulang-tulang itu . setelah menemukan rusa itu , pemburu itu langsung mengejar nya hingga tubuh nya cukup besar . ketika selesai menangkap rusa itu , pemburu itu meminta cheetah untuk memasukkan kepala nya ke dalam cangkang . " ini adalah kepala ku , adik ku , " kata pemburu itu . " kau harus segera menggigit aku . " pemburu itu setuju , dan segera memukulkan kepala nya ke kepala anak cheetah tersebut . pemburu itu melempar sang kijang ke arah rusa . rusa segera menjerit kesakitan dan lari meninggalkan pemburu itu . pemburu yang melihat kejadian itu segera melaporkan kejadian itu kepada teman-teman nya yang lain . " pemburu itu adalah anak cheetah , mereka memang baru datang sekarang , tapi mereka selalu menjaga dan memberi semangat untuk berburu . " pemburu itu membawa seekor cheetah , lalu menunjukkan pada dua teman nya bahwa kepala nya tertembak oleh pemburu . " pemburu itu membawa rusa dan dua orang anak nya , " kata pemburu . " ayo kawan , kita lanjutkan saja perjalanan kita . aku akan mencari kepala rusa itu . bagaimana ? " teriak pemburu . rusa mencoba untuk melepaskan diri dari pemburu . tapi apa yang terjadi ? rusa muda tersebut malah melompat dari kepala pemburu . pemburu itu berhasil membebaskan nya . si cheetah segera berlari meninggalkan pemburu .
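
Reading the end status and cutting the next content can be sketched in a few lines. The helper names `parse_output`, `split_sentences`, and `next_content` are hypothetical, and the sentence splitter is a crude assumption (it splits on whitespace that follows `.`, `?`, or `!`); the splitter actually used for KancilGPT is not documented.

```python
import re

LIMIT = 1750  # character threshold from the procedure


def parse_output(output):
    """Split a generated text into part type, title, content, and end status."""
    # Keep only what lies between the first <s> and the first </s> after it.
    body = output.split("<s>", 1)[-1].split("</s>", 1)[0]
    part, title_field, rest = (x.strip() for x in body.split("|", 2))
    content, status = (x.strip() for x in rest.rsplit("|", 1))
    if title_field.startswith("judul:"):
        title_field = title_field[len("judul:"):].strip()
    return {"part": part, "title": title_field, "content": content, "status": status}


def split_sentences(content):
    """Naive splitter: break after '.', '?' or '!' followed by whitespace."""
    return re.split(r"(?<=[.?!])\s+", content.strip())


def next_content(content, limit=LIMIT):
    """Remove the first k sentences (smallest k >= 1) to get under `limit` chars."""
    sentences = split_sentences(content)
    k = 1
    while k < len(sentences) and len(" ".join(sentences[k:])) >= limit:
        k += 1
    return " ".join(sentences[k:])


raw = "<s> awal cerita | judul: si kancil | satu . dua . tiga . | bersambung</s> acak"
parsed = parse_output(raw)
print(parsed["status"], "->", next_content(parsed["content"]))
```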

Step 2: Continue the story

With the existing title and next content, use this prompt format to continue the story:

<s> pertengahan cerita | judul: <title> | <next-content>

Below is the example prompt, built from the next content of step 1:

<s> pertengahan cerita | judul: pemburu dan anak cheetah | berburu cheetah di hutan itu menyenangkan , karena dia akan makan saat sedang asyik . cheetah itu gesit dan cerdik , dia bisa memburu cheetah yang sudah tua dan terlihat sangat lezat . pemburu itu berniat untuk menangkap nya , tapi sayang nya dia tidak membawa cheetah itu . oleh karena itu dia hanya mencari seekor kijang dan melihat tulang-tulang itu . setelah menemukan rusa itu , pemburu itu langsung mengejar nya hingga tubuh nya cukup besar . ketika selesai menangkap rusa itu , pemburu itu meminta cheetah untuk memasukkan kepala nya ke dalam cangkang . " ini adalah kepala ku , adik ku , " kata pemburu itu . " kau harus segera menggigit aku . " pemburu itu setuju , dan segera memukulkan kepala nya ke kepala anak cheetah tersebut . pemburu itu melempar sang kijang ke arah rusa . rusa segera menjerit kesakitan dan lari meninggalkan pemburu itu . pemburu yang melihat kejadian itu segera melaporkan kejadian itu kepada teman-teman nya yang lain . " pemburu itu adalah anak cheetah , mereka memang baru datang sekarang , tapi mereka selalu menjaga dan memberi semangat untuk berburu . " pemburu itu membawa seekor cheetah , lalu menunjukkan pada dua teman nya bahwa kepala nya tertembak oleh pemburu . " pemburu itu membawa rusa dan dua orang anak nya , " kata pemburu . " ayo kawan , kita lanjutkan saja perjalanan kita . aku akan mencari kepala rusa itu . bagaimana ? " teriak pemburu . rusa mencoba untuk melepaskan diri dari pemburu . tapi apa yang terjadi ? rusa muda tersebut malah melompat dari kepala pemburu . pemburu itu berhasil membebaskan nya . si cheetah segera berlari meninggalkan pemburu .

Below is an example output:

<s> pertengahan cerita | judul: pemburu dan anak cheetah | berburu cheetah di hutan itu menyenangkan , karena dia akan makan saat sedang asyik . cheetah itu gesit dan cerdik , dia bisa memburu cheetah yang sudah tua dan terlihat sangat lezat . pemburu itu berniat untuk menangkap nya , tapi sayang nya dia tidak membawa cheetah itu . oleh karena itu dia hanya mencari seekor kijang dan melihat tulang-tulang itu . setelah menemukan rusa itu , pemburu itu langsung mengejar nya hingga tubuh nya cukup besar . ketika selesai menangkap rusa itu , pemburu itu meminta cheetah untuk memasukkan kepala nya ke dalam cangkang . " ini adalah kepala ku , adik ku , " kata pemburu itu . " kau harus segera menggigit aku . " pemburu itu setuju , dan segera memukulkan kepala nya ke kepala anak cheetah tersebut . pemburu itu melempar sang kijang ke arah rusa . rusa segera menjerit kesakitan dan lari meninggalkan pemburu itu . pemburu yang melihat kejadian itu segera melaporkan kejadian itu kepada teman-teman nya yang lain . " pemburu itu adalah anak cheetah , mereka memang baru datang sekarang , tapi mereka selalu menjaga dan memberi semangat untuk berburu . " pemburu itu membawa seekor cheetah , lalu menunjukkan pada dua teman nya bahwa kepala nya tertembak oleh pemburu . " pemburu itu membawa rusa dan dua orang anak nya , " kata pemburu . " ayo kawan , kita lanjutkan saja perjalanan kita . aku akan mencari kepala rusa itu . bagaimana ? " teriak pemburu . rusa mencoba untuk melepaskan diri dari pemburu . tapi apa yang terjadi ? rusa muda tersebut malah melompat dari kepala pemburu . pemburu itu berhasil membebaskan nya . si cheetah segera berlari meninggalkan pemburu . namun , pemburu itu tak melihat rusa itu kembali . dia menengok ke belakang dan melihat kepala rusa itu masih di belakang . memburu cheetah berarti sudah menyerah lebih dulu . pemburu itu menjatuhkan cheetah pada diri nya sendiri . | bersambung</s>

In the generated output, check the end status of the story just before the </s> token. If it is tamat, the story has ended; go to step 3. If it is bersambung, the story should be continued. As in step 1, remove the first k sentences so that the remaining sentences contain fewer than 1750 characters, where k ≥ 1 is as small as possible. Use the remaining sentences as the next content for the next prompt in step 2. Below is the next content from the example output:

pemburu itu berniat untuk menangkap nya , tapi sayang nya dia tidak membawa cheetah itu . oleh karena itu dia hanya mencari seekor kijang dan melihat tulang-tulang itu . setelah menemukan rusa itu , pemburu itu langsung mengejar nya hingga tubuh nya cukup besar . ketika selesai menangkap rusa itu , pemburu itu meminta cheetah untuk memasukkan kepala nya ke dalam cangkang . " ini adalah kepala ku , adik ku , " kata pemburu itu . " kau harus segera menggigit aku . " pemburu itu setuju , dan segera memukulkan kepala nya ke kepala anak cheetah tersebut . pemburu itu melempar sang kijang ke arah rusa . rusa segera menjerit kesakitan dan lari meninggalkan pemburu itu . pemburu yang melihat kejadian itu segera melaporkan kejadian itu kepada teman-teman nya yang lain . " pemburu itu adalah anak cheetah , mereka memang baru datang sekarang , tapi mereka selalu menjaga dan memberi semangat untuk berburu . " pemburu itu membawa seekor cheetah , lalu menunjukkan pada dua teman nya bahwa kepala nya tertembak oleh pemburu . " pemburu itu membawa rusa dan dua orang anak nya , " kata pemburu . " ayo kawan , kita lanjutkan saja perjalanan kita . aku akan mencari kepala rusa itu . bagaimana ? " teriak pemburu . rusa mencoba untuk melepaskan diri dari pemburu . tapi apa yang terjadi ? rusa muda tersebut malah melompat dari kepala pemburu . pemburu itu berhasil membebaskan nya . si cheetah segera berlari meninggalkan pemburu . namun , pemburu itu tak melihat rusa itu kembali . dia menengok ke belakang dan melihat kepala rusa itu masih di belakang . memburu cheetah berarti sudah menyerah lebih dulu . pemburu itu menjatuhkan cheetah pada diri nya sendiri .

Repeat step 2 until the end status is tamat.

Step 3: Finish the story

Take the story contents from all generated outputs and merge them. The story is finished!

Below is an example of a generated story:
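
The three steps can be tied together in a loop. The sketch below keeps the model call abstract: `generate_fn` is a hypothetical callable that takes a prompt string and returns the decoded output (in practice it would wrap the tokenizer and `generate` with the arguments listed earlier). The helper names, the naive sentence splitting, and the `max_steps` safety guard are assumptions as well.

```python
import re

LIMIT = 1750


def parse_output(output):
    """Return (title, content, status) from one generated text."""
    body = output.split("<s>", 1)[-1].split("</s>", 1)[0]
    _, title_field, rest = (x.strip() for x in body.split("|", 2))
    content, status = (x.strip() for x in rest.rsplit("|", 1))
    if title_field.startswith("judul:"):
        title_field = title_field[len("judul:"):].strip()
    return title_field, content, status


def next_content(content, limit=LIMIT):
    """Drop the first k sentences (smallest k >= 1) to get under `limit` chars."""
    sentences = re.split(r"(?<=[.?!])\s+", content.strip())
    k = 1
    while k < len(sentences) and len(" ".join(sentences[k:])) >= limit:
        k += 1
    return " ".join(sentences[k:])


def generate_story(generate_fn, limit=LIMIT, max_steps=50):
    """Run step 1, repeat step 2 until 'tamat', and merge the contents (step 3)."""
    title, content, status = parse_output(generate_fn("<s> awal cerita | judul:"))
    story = content
    steps = 0
    while status == "bersambung" and steps < max_steps:
        carried = next_content(content, limit)
        prompt = f"<s> pertengahan cerita | judul: {title} | {carried}"
        _, content, status = parse_output(generate_fn(prompt))
        # Only the text after the carried-over sentences is new.
        new_text = content[len(carried):] if content.startswith(carried) else content
        story = f"{story} {new_text.strip()}".strip()
        steps += 1
    return title, story


def fake_generate(prompt):
    """Stand-in for the real model, used only to exercise the loop."""
    if prompt.startswith("<s> awal"):
        return "<s> awal cerita | judul: kancil | satu . dua . | bersambung</s> acak"
    return prompt + " tiga . | tamat</s>"


print(generate_story(fake_generate))
```

Merging by stripping the carried-over prefix relies on the output echoing the prompt unchanged; if decoding alters whitespace, a more tolerant overlap check would be needed.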

pemburu dan anak cheetah

suatu hari , pemburu itu melihat seekor cheetah yang sedang bersantai di tepi sungai . berburu cheetah di hutan itu menyenangkan , karena dia akan makan saat sedang
asyik . cheetah itu gesit dan cerdik , dia bisa memburu cheetah yang sudah tua dan terlihat sangat lezat . pemburu itu berniat untuk menangkap nya , tapi sayang nya
dia tidak membawa cheetah itu . oleh karena itu dia hanya mencari seekor kijang dan melihat tulang-tulang itu . setelah menemukan rusa itu , pemburu itu langsung
mengejar nya hingga tubuh nya cukup besar . ketika selesai menangkap rusa itu , pemburu itu meminta cheetah untuk memasukkan kepala nya ke dalam cangkang . " ini adalah
kepala ku , adik ku , " kata pemburu itu . " kau harus segera menggigit aku . " pemburu itu setuju , dan segera memukulkan kepala nya ke kepala anak cheetah tersebut .
pemburu itu melempar sang kijang ke arah rusa . rusa segera menjerit kesakitan dan lari meninggalkan pemburu itu . pemburu yang melihat kejadian itu segera melaporkan
kejadian itu kepada teman-teman nya yang lain . " pemburu itu adalah anak cheetah , mereka memang baru datang sekarang , tapi mereka selalu menjaga dan memberi semangat
untuk berburu . " pemburu itu membawa seekor cheetah , lalu menunjukkan pada dua teman nya bahwa kepala nya tertembak oleh pemburu . " pemburu itu membawa rusa dan dua
orang anak nya , " kata pemburu . " ayo kawan , kita lanjutkan saja perjalanan kita . aku akan mencari kepala rusa itu . bagaimana ? " teriak pemburu . rusa mencoba untuk
melepaskan diri dari pemburu . tapi apa yang terjadi ? rusa muda tersebut malah melompat dari kepala pemburu . pemburu itu berhasil membebaskan nya . si cheetah segera
berlari meninggalkan pemburu . namun , pemburu itu tak melihat rusa itu kembali . dia menengok ke belakang dan melihat kepala rusa itu masih di belakang . memburu cheetah
berarti sudah menyerah lebih dulu . pemburu itu menjatuhkan cheetah pada diri nya sendiri . pemburu itu merasa sangat marah dan kesal . dia ingin segera menangkap rusa itu
kembali agar dia cepat mati . tetapi karena rusa itu sangat marah , dia pun mengibas-ngibaskan kepala nya sehingga kepala nya lepas . " sudah lah . aku tidak akan
melepaskan pemburu itu daripada mendapatkan kemampuan berlari nya . "

Below is the English translation, made with the help of Google Translate:

the hunter and the cheetah cub

one day, the hunter saw a cheetah relaxing on the river bank. hunting a cheetah in the forest is fun, because he will eat when he is fun . the cheetah is agile and clever,
it can hunt old cheetahs and looks very delicious. the hunter intended to catch him, but unfortunately he didn't bring the cheetah. therefore he only looked for a deer and
saw the bones. after finding the deer, the hunter immediately chase him until his body is big enough. when he finished catching the deer, the hunter asked the cheetah to
put its head in the shell. " this is my head , my little brother , " said the hunter . " you must bite me immediately . "the hunter agreed, and immediately hit his head on
the cheetah cub's head. the hunter threw the deer at the deer. the deer immediately screamed in pain and ran away from the hunter. hunters who saw the incident immediately
reported it the incident happened to his other friends. "the hunters are cheetah cubs, they have only just arrived now, but they always look after and encourage them to
hunt. "the hunter brought a cheetah, then showed his two friends that his head had been shot by the hunter." the hunter brought a deer and two the child," said the hunter.
"come on, friend, let's just continue our journey. i'll look for the deer's head. how ? " shouted the hunter. the deer tried to escape from hunters. but what happened ?
the young deer instead jumped off the hunter's head. the hunter managed to free him. the cheetah immediately ran away from the hunter. however, the hunter did not see the
deer again. he looked back and saw the deer's head still behind him. hunt cheetahs means you have given up already. the hunter dropped the cheetah on himself. the hunter
felt very angry and annoyed. he wanted to catch the deer immediately come back so he can die quickly. but because the deer was very angry, he shook his head so that his
head fell off. "never mind. i won't let go of the hunter rather than gain his running ability. "

Limitations

The reader was probably confused by the generated story above. This shows the limitations of KancilGPT. A generated story sometimes

  1. shows a low correlation between title and content (the cheetah cub is never a main character),
  2. introduces a new character out of nowhere (where did the deer come from?),
  3. introduces a new character with the same name as an existing one, which makes anaphora resolution confusing ("'This is my head, my little brother,' said the hunter. 'You must bite me immediately.' The hunter agreed, and [...]"), and
  4. contains an illogical sentence ("The hunters are cheetah cubs").

Furthermore, all stories used with KancilGPT are lowercased, because the pretrained model was trained on lowercase text. In the end, these limitations open up opportunities to improve KancilGPT over time. This is just the beginning. By exploring the digital forest deeper, KancilGPT will generate high-quality Indonesian fable stories in the future.

The end.


Behind The Story: Training Procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 2e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 10
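
For readers who want to reproduce a similar run, the values above map onto field names of `transformers.TrainingArguments`. The dictionary below is a sketch, not the original training script: `output_dir` is a hypothetical path, and mapping `train_batch_size` to a per-device batch size assumes a single device.

```python
# Keyword arguments that could be passed as TrainingArguments(**training_kwargs).
training_kwargs = {
    "output_dir": "kancilgpt-checkpoints",  # hypothetical path, not from the card
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 8,  # assumes training on a single device
    "per_device_eval_batch_size": 8,
    "seed": 42,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 10,
}

for name, value in training_kwargs.items():
    print(f"{name} = {value}")
```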

Training results

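| Training Loss | Epoch | Step | Validation Loss |
|---------------|-------|------|-----------------|
| 2.0208        | 1.0   | 432  | 2.6771          |
| 1.4309        | 2.0   | 864  | 2.7912          |
| 1.0811        | 3.0   | 1296 | 2.9315          |
| 0.8536        | 4.0   | 1728 | 3.0387          |
| 0.6999        | 5.0   | 2160 | 3.1300          |
| 0.5949        | 6.0   | 2592 | 3.2062          |
| 0.5232        | 7.0   | 3024 | 3.2750          |
| 0.474         | 8.0   | 3456 | 3.2936          |
| 0.4422        | 9.0   | 3888 | 3.3380          |
| 0.4246        | 10.0  | 4320 | 3.3414          |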

KancilGPT achieves loss=3.3414 on the evaluation set, which is the value reached at the final epoch. The lowest validation loss in the table, 2.6771, occurs at epoch 1.
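
The table can be checked with a few lines of Python. The snippet copies the reported numbers, finds the epoch with the lowest validation loss, and estimates how many training prompts one epoch covers; that estimate assumes a single device and no gradient accumulation, neither of which is stated in the card.

```python
# (epoch, step, training loss, validation loss), copied from the results table.
results = [
    (1, 432, 2.0208, 2.6771),
    (2, 864, 1.4309, 2.7912),
    (3, 1296, 1.0811, 2.9315),
    (4, 1728, 0.8536, 3.0387),
    (5, 2160, 0.6999, 3.1300),
    (6, 2592, 0.5949, 3.2062),
    (7, 3024, 0.5232, 3.2750),
    (8, 3456, 0.474, 3.2936),
    (9, 3888, 0.4422, 3.3380),
    (10, 4320, 0.4246, 3.3414),
]

best = min(results, key=lambda row: row[3])  # lowest validation loss
final = results[-1]                          # last epoch

# With batch size 8, the steps per epoch bound the number of training prompts.
steps_per_epoch = results[0][1]
batch_size = 8
max_prompts = steps_per_epoch * batch_size
min_prompts = (steps_per_epoch - 1) * batch_size + 1

print(f"lowest validation loss: {best[3]} at epoch {best[0]}")
print(f"final validation loss: {final[3]} at epoch {final[0]}")
print(f"training prompts per epoch: between {min_prompts} and {max_prompts}")
```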

Framework versions

  • Transformers 4.35.2
  • Pytorch 2.1.0+cu121
  • Datasets 2.16.1
  • Tokenizers 0.15.0