metadata

license: gemma
datasets:
  - airesearch/WangchanThaiInstruct
  - airesearch/WangchanX-FLAN-v6.1
  - airesearch/wangchanx-seed-free-synthetic-instruct-thai-120k
language:
  - th
  - en
base_model:
  - aisingapore/gemma2-9b-cpt-sea-lionv3-base
pipeline_tag: text-generation
library_name: transformers

Gemma2 9B WangchanLIONv2 Instruct

WangchanLION is a joint effort between VISTEC and AI Singapore to develop a Thai-specific collection of Large Language Models (LLMs), pre-trained for Southeast Asian (SEA) languages, and instruct-tuned specifically for the Thai language.

Gemma2 9B WangchanLIONv2 Instruct is a multilingual model fine-tuned on around 3,760,000 Thai instruction-completion pairs drawn from human-annotated instructions, FLAN-style automatically constructed data, and synthetic samples.

Developed by: Products Pillar, AI Singapore, and VISTEC
Funded by: Singapore NRF, PTT Public Company Limited, SCB Public Company Limited, and SCBX Public Company Limited
Model type: Decoder
Languages: English, Thai
License: Gemma Community License

Model Details

Model Description

We performed Thai instruction tuning on Gemma2 9B CPT SEA-LIONv3, a continued pre-trained decoder model using the Gemma2 architecture, to create Gemma2 9B WangchanLIONv2 Instruct.

For tokenization, the model employs the default tokenizer used in Gemma-2-9B. The model has a context length of 8192.
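Because the prompt and the generated tokens share the 8192-token window, it can help to check the budget before generating. The following is a minimal sketch rather than part of the released code: the helper name `max_new_tokens_for` is hypothetical, and in practice the prompt token count would come from the model's tokenizer.

```python
CONTEXT_LENGTH = 8192  # context length stated in this model card


def max_new_tokens_for(prompt_token_count: int, requested: int,
                       context_length: int = CONTEXT_LENGTH) -> int:
    """Clamp a requested generation length so prompt + output fit the context.

    Hypothetical helper for illustration; it is not part of transformers.
    """
    if prompt_token_count < 0 or requested < 0:
        raise ValueError("token counts must be non-negative")
    # Tokens still available after the prompt (never negative).
    remaining = max(context_length - prompt_token_count, 0)
    return min(requested, remaining)
```

The returned value can be passed as `max_new_tokens` when generating; a result of 0 indicates that the prompt alone already fills the window and should be shortened.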

Benchmark Performance

We evaluated Gemma2 9B WangchanLIONv2 Instruct in Thai using the Thai LLM Benchmark. The benchmark consists of Thai multiple-choice exams, multi-turn chat, reading comprehension and language generation. The evaluation results are available on the leaderboard.

Usage

NOTE: This model has not been trained to use a system prompt or tool calling.

Gemma2 9B WangchanLIONv2 Instruct can be run using the 🤗 Transformers library:

# Please use transformers==4.45.2

import transformers
import torch

model_id = "aisingapore/Gemma2-9b-WangchanLIONv2-instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)
messages = [
    {"role": "user", "content": "แต่งกลอนให้หน่อย"},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])
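
Since the model has not been trained to use a system prompt, one possible workaround is to fold any system-style instructions into the first user turn before calling the pipeline. The sketch below is an assumption, not a method documented by the authors; `fold_system_prompt` is a hypothetical helper, and its effect on output quality is untested.

```python
from typing import Dict, List

Message = Dict[str, str]


def fold_system_prompt(messages: List[Message]) -> List[Message]:
    """Merge any system messages into the first user turn.

    Hypothetical helper: the model card only states that system prompts are
    not supported; prepending the text to the user turn is an assumed workaround.
    """
    system_text = "\n\n".join(
        m["content"] for m in messages if m["role"] == "system"
    )
    # Copy the remaining messages so the caller's list is not mutated.
    rest = [dict(m) for m in messages if m["role"] != "system"]
    if not system_text:
        return rest
    for m in rest:
        if m["role"] == "user":
            m["content"] = f"{system_text}\n\n{m['content']}"
            return rest
    # No user turn to attach to: send the instructions as a user message.
    return [{"role": "user", "content": system_text}] + rest
```

The returned list can be passed to `pipeline(...)` in place of `messages`.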

Caveats

Users should be aware of the model's limitations. Like many LLMs, it can hallucinate and occasionally generates irrelevant content, introducing fictional elements that are not grounded in the provided context. Because its reasoning can be inconsistent, users should validate the model's responses before relying on them.

Limitations

Safety

Current SEA-LION models, including this commercially permissive release, have not been aligned for safety. Developers and users should perform their own safety fine-tuning and related security measures. In no event shall the authors be held liable for any claim, damages, or other liability arising from the use of the released weights and codes.

Technical Specifications

Fine-Tuning Details

Gemma2 9B WangchanLIONv2 Instruct was built using parameter-efficient fine-tuning. Fine-tuning took approximately 3 days on 8x H100-80GB GPUs.

The following hyperparameters were used during training:

quantization_bit: 4
quantization_method: bitsandbytes
finetuning_type: lora
lora_target: all
cutoff_len: 2048
per_device_train_batch_size: 4
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
val_size: 0.01
per_device_eval_batch_size: 1
eval_strategy: steps
eval_steps: 4000

We used the LLaMA Factory framework for the fine-tuning process.
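
To make the listed batch settings concrete, the arithmetic below estimates the effective batch size and the approximate number of optimizer steps. It is a rough sketch under stated assumptions: the model card does not say how the 8 GPUs were used, so pure data parallelism across all 8 is assumed, and the "around 3,760,000" pair count is treated as exact. Actual step counts reported by the training framework may differ.

```python
import math

# Values taken from the hyperparameter list above.
per_device_train_batch_size = 4
gradient_accumulation_steps = 8
num_train_epochs = 3
val_size = 0.01

# Assumptions, not stated in the model card:
num_gpus = 8              # all 8 H100s act as data-parallel workers
num_examples = 3_760_000  # "around 3,760,000" pairs, treated as exact

# Sequences contributing to each optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)

# Hold out the validation split, then count optimizer steps.
train_examples = num_examples - round(num_examples * val_size)
steps_per_epoch = math.ceil(train_examples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs

print(effective_batch_size, steps_per_epoch, total_steps)
```

Under these assumptions each optimizer update sees 256 sequences, so with `eval_steps: 4000` evaluation would run a few times per epoch.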

Data

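Gemma2 9B WangchanLIONv2 Instruct was trained on the datasets listed in the metadata above: airesearch/WangchanThaiInstruct, airesearch/WangchanX-FLAN-v6.1, and airesearch/wangchanx-seed-free-synthetic-instruct-thai-120k.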

Call for Contributions

We encourage researchers, developers, and language enthusiasts to actively contribute to the enhancement and expansion of SEA-LION. Contributions can involve identifying and reporting bugs, sharing pre-training, instruction, and preference data, improving documentation usability, proposing and implementing new model evaluation tasks and metrics, or training versions of the model in additional Southeast Asian languages. Join us in shaping the future of SEA-LION by sharing your expertise and insights to make these models more accessible, accurate, and versatile. Please check out our GitHub for further information on the call for contributions.

The Team

AISG

Chan Adwin, Choa Esther, Cheng Nicholas, Huang Yuli, Lau Wayne, Lee Chwan Ren, Leong Wai Yi, Leong Wei Qi, Limkonchotiwat Peerat, Liu Bing Jie Darius, Montalan Jann Railey, Ng Boon Cheong Raymond, Ngui Jian Gang, Nguyen Thanh Ngan, Ong Brandon, Ong Tat-Wee David, Ong Zhi Hao, Rengarajan Hamsawardhini, Siow Bryan, Susanto Yosephine, Tai Ngee Chia, Tan Choon Meng, Teo Eng Sipp Leslie, Teo Wei Yi, Tjhi William, Teng Walter, Yeo Yeow Tong, Yong Xianbin

WangchanX

Can Udomcharoenchaikit, Chalermpun Mai-On, Chayapat Uthayopas, Ekapol Chuangsuwanich, Lalita Lowphansirikul, Nonthakit Chaiwong, Panuthep Tasawong, Patomporn Payoungkhamdee, Pume Tuchinda, Romrawin Chumpu, Sarana Nutanong, Wannaphong Phatthiyaphaibun

Acknowledgements

AI Singapore is a national programme supported by the National Research Foundation, Singapore and hosted by the National University of Singapore. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation or the National University of Singapore. This release is part of WangchanX, a Large Language Model (LLM) research and development project supported by PTT Public Company Limited, SCB Public Company Limited, and SCBX Public Company Limited. The project is a collaborative effort originated by PyThaiNLP and VISTEC-depa Thailand AI Research Institute, focusing on the development of Adaptation Toolsets, Instruction Tuning & Alignment Datasets, and Benchmarks.

Contact

Chalermpun Mai-On [email protected]
Patomporn Payoungkhamdee [email protected]
Peerat Limkonchotiwat [email protected]

Link to SEA-LION's GitHub repository

Link to WangchanX FLAN-like Dataset Creation GitHub repository

Disclaimer

This is the repository for the commercial instruction-tuned model. The model has not been aligned for safety. Developers and users should perform their own safety fine-tuning and related security measures. In no event shall the authors be held liable for any claims, damages, or other liabilities arising from the use of the released weights and codes.