---
library_name: transformers
tags:
- unsloth
- llama3
- indonesia
license: llama3
datasets:
- catinthebag/Tumpeng-1-Indonesian
language:
- id
inference: false
---
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Document Title</title>
<style>
h1 {
font-size: 36px;
color: navy;
font-family: 'Tahoma';
text-align: center;
}
</style>
</head>
<body>
<h1>Introducing the Kancil family of open models</h1>
</body>
</html>
<center>
<img src="https://imgur.com/9nG5J1T.png" alt="Kancil" width="600" height="300">
<p><em>Kancil is a fine-tuned version of Llama 3 8B trained on a synthetic QA dataset generated with Llama 3 70B. Version zero of Kancil is the first generative Indonesian LLM to gain functional instruction performance using solely synthetic data.</em></p>
<p><strong><a href="https://colab.research.google.com/drive/1Gp-I9vMqfhU_i5xX77ZKlh7eQCx1_13I?usp=sharing" style="color: blue; font-family: Tahoma;">Go straight to the Colab demo</a></strong></p>
<p><em style="color: black; font-weight: bold;">Beta preview</em></p>
</center>
Selamat datang! (Welcome!)
I am ultra-overjoyed to introduce you to... Kancil! It's a version of Llama 3 8B fine-tuned on Tumpeng, an instruction dataset of 14.8 million words. Both the model and the dataset are openly available on Hugging Face.
The dataset was synthetically generated with Llama 3 70B. A big problem with existing Indonesian instruction datasets is that they are, in reality, not-very-good translations of English datasets. Llama 3 70B can generate fluent Indonesian! (with minor caveats)
This follows previous efforts to build a collection of open, fine-tuned Indonesian models, like Merak and Cendol. However, Kancil relies solely on synthetic data, used in a creative way, which makes it a unique contribution!
### Version 1.0
This is the second working prototype, Kancil V1.
✨ Training
- 2.2x dataset word count
- 2x LoRA parameters
- Rank-stabilized LoRA
- 2x fun
✨ New features
- Multi-turn conversation (beta; optimized for curhat/personal advice)
- Better text generation (full or outline writing; optimized for essays)
- QA from text (copy paste to prompt and ask a question about it)
- Making slogans
This model was fine-tuned with QLoRA using the amazing Unsloth framework! It was built on top of [unsloth/llama-3-8b-bnb-4bit](https://huggingface.co/unsloth/llama-3-8b-bnb-4bit) and subsequently merged with the adapter.
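To make the "rank-stabilized LoRA" item in the training notes concrete, here is a minimal sketch of how the adapter scaling factor differs between standard LoRA and rank-stabilized LoRA. Standard LoRA scales the low-rank update by `alpha / r`, while the rank-stabilized variant uses `alpha / sqrt(r)` (in PEFT this is toggled with `use_rslora=True` on `LoraConfig`). The `alpha` and rank values below are hypothetical, picked only for illustration; the actual hyperparameters used to train Kancil are not stated here.

```python
import math


def lora_scaling(alpha: float, rank: int, rank_stabilized: bool = False) -> float:
    """Return the factor that multiplies the low-rank update B @ A."""
    if rank <= 0:
        raise ValueError("rank must be positive")
    if rank_stabilized:
        # Rank-stabilized LoRA: alpha / sqrt(r)
        return alpha / math.sqrt(rank)
    # Standard LoRA: alpha / r
    return alpha / rank


# Hypothetical values, for illustration only (not Kancil's real settings).
alpha = 16
for rank in (8, 16, 32, 64):
    standard = lora_scaling(alpha, rank)
    stabilized = lora_scaling(alpha, rank, rank_stabilized=True)
    print(f"rank={rank:>2}  standard={standard:.3f}  rank-stabilized={stabilized:.3f}")
```

With standard scaling the update shrinks quickly as the rank grows, so doubling the LoRA parameters can end up contributing less per parameter. The rank-stabilized factor decays more slowly, which is the usual motivation for pairing it with a larger rank.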
### Uses
This model is developed for research purposes and is aimed at researchers and general AI hobbyists. However, it has one big application: you can have lots of fun with it!
### Out-of-Scope Use
This is a research preview model with minimal safety curation. Do not use this model for commercial or practical applications.
You are also not allowed to use this model without having fun.
### Getting started
As mentioned, this model was trained with Unsloth. Please use its code for a better experience.
```
%%capture
# Install dependencies. You need a GPU to run this (at least a T4).
# Note: %%capture is a cell magic, so it must be the first line of the notebook cell.
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers trl peft accelerate bitsandbytes
from unsloth import FastLanguageModel
import torch
# Available versions
KancilV1 = "catinthebag/Kancil-V1-llama3-4bit"
# Load the model
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = KancilV1,
max_seq_length = 4096,
dtype = None, # Auto detect
load_in_4bit = True,
)
```
```
# This model was trained on this specific prompt template. Changing it may degrade performance.
prompt_template = """<|user|>
{prompt}
<|assistant|>
{response}"""
# Start generating!
inputs = tokenizer(
[
prompt_template.format(
prompt="Bagaimana cara memberi tahu orang tua kalau saya ditolak universitas favorit saya?",
response="",)
], return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 600, temperature=.8, use_cache = True)
print(tokenizer.batch_decode(outputs)[0].replace('\\n', '\n'))
```
**Note:** There is an issue with the dataset where newline characters were stored as the literal two-character string `\n`, so the model emits them that way too. Very sorry about this! Please keep the `.replace()` call to turn them back into real newlines.
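If you call the model often, the template and the newline fix can be wrapped in small helpers. The sketch below is only a convenience layer around the snippet above and runs without a GPU. The function names (`build_prompt`, `build_qa_prompt`, `clean_output`) are hypothetical, the passage-then-question layout for QA from text is an assumption based on the "copy paste to prompt and ask a question about it" description, and stripping `<|begin_of_text|>` / `<|end_of_text|>` assumes the decoded output includes Llama 3's special tokens.

```python
PROMPT_TEMPLATE = """<|user|>
{prompt}
<|assistant|>
{response}"""


def build_prompt(prompt: str) -> str:
    """Fill the training template, leaving the response empty for generation."""
    return PROMPT_TEMPLATE.format(prompt=prompt, response="")


def build_qa_prompt(passage: str, question: str) -> str:
    """QA from text. Assumed layout: pasted passage, blank line, then the question."""
    return build_prompt(f"{passage}\n\n{question}")


def clean_output(decoded: str) -> str:
    """Extract the assistant's reply from a decoded generation and fix newlines."""
    # Keep only what follows the last assistant tag.
    response = decoded.split("<|assistant|>")[-1]
    # The model emits a literal backslash + 'n'; turn it into a real newline.
    response = response.replace("\\n", "\n")
    # Assumption: drop Llama 3 special tokens if the tokenizer left them in.
    for token in ("<|begin_of_text|>", "<|end_of_text|>"):
        response = response.replace(token, "")
    return response.strip()


# Example with a hand-written string standing in for tokenizer.batch_decode(outputs)[0]
fake_decoded = build_prompt("Halo!") + "Baris satu.\\nBaris dua.<|end_of_text|>"
print(clean_output(fake_decoded))
```

In the generation snippet, this would replace the manual `prompt_template.format(...)` call with `build_prompt(...)` and the final `print` with `print(clean_output(tokenizer.batch_decode(outputs)[0]))`.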
### Acknowledgments
- **Developed by:** Afrizal Hasbi Azizy
- **License:** Llama 3 Community License Agreement