---
license: gemma
datasets:
- airesearch/WangchanThaiInstruct
- airesearch/WangchanX-FLAN-v6.1
- airesearch/wangchanx-seed-free-synthetic-instruct-thai-120k
language:
- th
- en
base_model:
- aisingapore/gemma2-9b-cpt-sea-lionv3-base
pipeline_tag: text-generation
library_name: transformers
---

# Gemma2 9B WangchanLIONv2 Instruct

WangchanLION is a joint effort between VISTEC and AI Singapore to develop a Thai-specific collection of Large Language Models (LLMs), pre-trained for Southeast Asian (SEA) languages and instruct-tuned specifically for the Thai language.

Gemma2 9B WangchanLIONv2 Instruct is a multilingual model that has been fine-tuned on around **3,760,000 Thai instruction-completion pairs** drawn from human-annotated instructions, FLAN-style automatically constructed data, and synthetic samples.

- **Developed by:** Products Pillar, AI Singapore, and VISTEC
- **Funded by:** Singapore NRF, PTT Public Company Limited, SCB Public Company Limited, and SCBX Public Company Limited
- **Model type:** Decoder
- **Languages:** English, Thai
- **License:** [Gemma Community License](https://ai.google.dev/gemma/terms)

## Model Details

### Model Description

We performed instruction tuning in Thai on [continued pre-trained Gemma2 9B CPT SEA-LIONv3](https://huggingface.co/aisingapore/gemma2-9b-cpt-sea-lionv3-base), a decoder model using the Gemma2 architecture, to create Gemma2 9B CPT WangchanLIONv2 Instruct.

For tokenization, the model employs the default tokenizer used in Gemma-2-9B. The model has a context length of 8192.

### Benchmark Performance

We evaluated Gemma2 9B WangchanLIONv2 Instruct in Thai using the [Thai LLM Benchmark](https://blog.opentyphoon.ai/introducing-the-thaillm-leaderboard-thaillm-evaluation-ecosystem-508e789d06bf). The benchmark consists of Thai multiple-choice exams, multi-turn chat, reading comprehension, and language generation. The evaluation results are available on [the leaderboard](https://huggingface.co/spaces/ThaiLLM-Leaderboard/leaderboard).

### Usage

**NOTE** This model has not been trained to use a system prompt or to use tool calling.

Gemma2 9B WangchanLIONv2 Instruct can be run using the 🤗 Transformers library:

```python
# Please use transformers==4.45.2
import torch
import transformers

model_id = "aisingapore/WangchanLIONv2"

# Build a text-generation pipeline that loads the weights in bfloat16 and
# places them automatically across the available devices.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# Chat-style input with a single user turn (no system prompt; see the note above).
# The Thai prompt means "Please write a poem for me."
messages = [
    {"role": "user", "content": "แต่งกลอนให้หน่อย"},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)

# The last entry of "generated_text" is the assistant's reply message.
print(outputs[0]["generated_text"][-1])
```

### Caveats

Users should be aware of the model's limitations. Like many LLMs, it can hallucinate and occasionally generate irrelevant content, introducing fictional elements that are not grounded in the provided context. Its reasoning may also be inconsistent, so users should interpret and validate its responses with care.

## Limitations

### Safety

Current SEA-LION models, including this commercially permissive release, have not been aligned for safety. Developers and users should perform their own safety fine-tuning and related security measures. In no event shall the authors be held liable for any claim, damages, or other liability arising from the use of the released weights and code.

## Technical Specifications

### Fine-Tuning Details

Gemma2 9B WangchanLIONv2 Instruct was built using parameter-efficient fine-tuning. Fine-tuning took approximately 3 days on 8x H100-80GB GPUs.
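
For orientation, the effective batch size implied by this hardware and by the `per_device_train_batch_size`, `gradient_accumulation_steps`, and `cutoff_len` values listed in this section can be estimated with a little arithmetic. The sketch below is illustrative only: it assumes that all 8 GPUs act as data-parallel workers, which this card does not state, and the helper name `effective_batch_size` is hypothetical.

```python
# Values taken from the hyperparameters listed in this section.
per_device_train_batch_size = 4
gradient_accumulation_steps = 8
cutoff_len = 2048

# Assumption (not stated in this card): all 8 GPUs are data-parallel workers.
num_gpus = 8


def effective_batch_size(per_device: int, grad_accum: int, world_size: int) -> int:
    """Number of sequences contributing to one optimizer step."""
    return per_device * grad_accum * world_size


sequences_per_step = effective_batch_size(
    per_device_train_batch_size, gradient_accumulation_steps, num_gpus
)

# Upper bound on tokens per optimizer step, reached only if every sequence
# is as long as cutoff_len.
max_tokens_per_step = sequences_per_step * cutoff_len

print(sequences_per_step)   # 256
print(max_tokens_per_step)  # 524288
```

Under that assumption, each optimizer step would see 256 sequences and at most 524,288 tokens. A different parallelism layout would change both figures.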

The following hyperparameters were used during training:

- quantization_bit: 4
- quantization_method: bitsandbytes
- finetuning_type: lora
- lora_target: all
- cutoff_len: 2048
- per_device_train_batch_size: 4
- gradient_accumulation_steps: 8
- learning_rate: 1.0e-4
- num_train_epochs: 3
- lr_scheduler_type: cosine
- warmup_ratio: 0.1
- bf16: true
- val_size: 0.01
- per_device_eval_batch_size: 1
- eval_strategy: steps
- eval_steps: 4000

We used the [LLaMA Factory](https://github.com/hiyouga/LLaMA-Factory) framework for the fine-tuning process.

## Data

Gemma2 9B WangchanLIONv2 Instruct was trained on:

1. [Human-Annotated Thai Instruction Dataset](https://huggingface.co/datasets/airesearch/WangchanThaiInstruct)
2. [FLAN-like dataset in Thai](https://huggingface.co/datasets/airesearch/WangchanX-FLAN-v6.1)
3. [WangchanX Seed-Free Synthetic Instruct Thai 120k](https://huggingface.co/datasets/airesearch/wangchanx-seed-free-synthetic-instruct-thai-120k)

## Call for Contributions

We encourage researchers, developers, and language enthusiasts to contribute to the enhancement and expansion of SEA-LION. Contributions can involve identifying and reporting bugs; sharing pre-training, instruction, and preference data; improving documentation usability; proposing and implementing new model evaluation tasks and metrics; or training versions of the model in additional Southeast Asian languages. Join us in shaping the future of SEA-LION by sharing your expertise and insights to make these models more accessible, accurate, and versatile. Please check out our GitHub for further information on the call for contributions.

## The Team

### AISG

Chan Adwin, Choa Esther, Cheng Nicholas, Huang Yuli, Lau Wayne, Lee Chwan Ren, Leong Wai Yi, Leong Wei Qi, Limkonchotiwat Peerat, Liu Bing Jie Darius, Montalan Jann Railey, Ng Boon Cheong Raymond, Ngui Jian Gang, Nguyen Thanh Ngan, Ong Brandon, Ong Tat-Wee David, Ong Zhi Hao, Rengarajan Hamsawardhini, Siow Bryan, Susanto Yosephine, Tai Ngee Chia, Tan Choon Meng, Teo Eng Sipp Leslie, Teo Wei Yi, Tjhi William, Teng Walter, Yeo Yeow Tong, Yong Xianbin

### WangchanX

Can Udomcharoenchaikit, Chalermpun Mai-On, Chayapat Uthayopas, Ekapol Chuangsuwanich, Lalita Lowphansirikul, Nonthakit Chaiwong, Panuthep Tasawong, Patomporn Payoungkhamdee, Pume Tuchinda, Romrawin Chumpu, Sarana Nutanong, Wannaphong Phatthiyaphaibun

## Acknowledgements

[AI Singapore](https://aisingapore.org/) is a national programme supported by the National Research Foundation, Singapore and hosted by the National University of Singapore. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation or the National University of Singapore.

This release is part of WangchanX, a Large Language Model (LLM) research and development project supported by PTT Public Company Limited, SCB Public Company Limited, and SCBX Public Company Limited. The project is a collaborative effort originated by PyThaiNLP and VISTEC-depa Thailand AI Research Institute, focusing on the development of Adaptation Toolsets, Instruction Tuning & Alignment Datasets, and Benchmarks.

## Contact

- Chalermpun Mai-On chalermpunm_pro@vistec.ac.th
- Patomporn Payoungkhamdee patomporn.p_s21@vistec.ac.th
- Peerat Limkonchotiwat peerat@aisingapore.org

[Link to SEA-LION's GitHub repository](https://github.com/aisingapore/sealion)

[Link to WangchanX FLAN-like Dataset Creation GitHub repository](https://github.com/vistec-AI/WangchanX/tree/datasets)

## Disclaimer

This is the repository for the commercial instruction-tuned model. The model has _not_ been aligned for safety. Developers and users should perform their own safety fine-tuning and related security measures. In no event shall the authors be held liable for any claims, damages, or other liabilities arising from the use of the released weights and code.