Taiwanese Hokkien LLM
The collection of Taiwanese Hokkien (Taigi) large language models and related resources.
The Taigi-Llama-2 series is built on the Traditional Chinese version of the LLaMA-2 model. We continued pre-training it on around 78MB of web-scraped Taiwanese Hokkien text written in Hanzi, POJ, and Hanlo.
For more details, please refer to our GitHub repository and the paper: Enhancing Taiwanese Hokkien Dual Translation by Exploring and Standardizing of Four Writing Systems
Explore other models and datasets in the Taiwanese Hokkien LLM collection.
from transformers import AutoModelForCausalLM, AutoTokenizer, TextGenerationPipeline
import torch
import accelerate
def get_pipeline(path: str, tokenizer: AutoTokenizer, accelerator: accelerate.Accelerator) -> TextGenerationPipeline:
    # Load the model in half precision; device_map='auto' spreads it over the available devices.
    model = AutoModelForCausalLM.from_pretrained(
        path, torch_dtype=torch.float16, device_map='auto', trust_remote_code=True)
    # Stop generation at either the EOS token or the padding token.
    terminators = [tokenizer.eos_token_id, tokenizer.pad_token_id]
    pipeline = TextGenerationPipeline(
        model=model, tokenizer=tokenizer,
        num_workers=accelerator.state.num_processes * 4,
        pad_token_id=tokenizer.pad_token_id, eos_token_id=terminators)
    return pipeline

model_dir = "Bohanlu/Taigi-Llama-2-7B" # or Bohanlu/Taigi-Llama-2-13B for the 13B model
tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=False)
accelerator = accelerate.Accelerator()
pipe = get_pipeline(model_dir, tokenizer, accelerator)
# Few-shot example: question answering
qa_prompt = """Example 1:
問題:台北101有偌懸?
答案:台北101的高度是五百空八公尺。
Example 2:
問題:台灣上長的溪仔是佗一條?
答案:台灣上長的溪仔是濁水溪,規个長度有百八公里遐爾長。
Example 3:
問題:臺灣上懸的山是啥物?
答案:"""
print(pipe(qa_prompt, return_full_text=False))
# Output: [{'generated_text': '臺灣上懸的山是玉山,海拔三千九百五十二公尺。'}]
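
The few-shot prompt above follows a fixed pattern (`Example N:`, `問題:`, `答案:`), so it can also be assembled from a list of question/answer pairs. The sketch below is only an illustration of that pattern: `build_few_shot_prompt` and `first_line` are hypothetical helpers, not part of this model or of `transformers`, and the assumption that the answer ends at the first newline is ours.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], question: str) -> str:
    """Format (question, answer) pairs followed by one unanswered question.

    Hypothetical helper: it mirrors the layout of the hand-written prompt above.
    """
    lines = []
    for i, (q, a) in enumerate(examples, start=1):
        lines.append(f"Example {i}:")
        lines.append(f"問題:{q}")
        lines.append(f"答案:{a}")
    # The last example leaves the answer empty so the model completes it.
    lines.append(f"Example {len(examples) + 1}:")
    lines.append(f"問題:{question}")
    lines.append("答案:")
    return "\n".join(lines)


def first_line(generated_text: str) -> str:
    """Keep only the text before the first newline.

    Assumption: a base model may keep writing further examples after the
    answer, and the answer itself fits on one line.
    """
    return generated_text.strip().split("\n", 1)[0]


examples = [
    ("台北101有偌懸?", "台北101的高度是五百空八公尺。"),
    ("台灣上長的溪仔是佗一條?", "台灣上長的溪仔是濁水溪,規个長度有百八公里遐爾長。"),
]
prompt = build_few_shot_prompt(examples, "臺灣上懸的山是啥物?")
print(prompt)
```

The resulting `prompt` can be passed to `pipe(prompt, return_full_text=False)` in place of `qa_prompt`, and `first_line` can be applied to the returned `generated_text` if the model continues past the answer.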
If you find the resources in the Taiwanese Hokkien LLM collection useful in your work, please cite it using the following reference:
@misc{lu2024enhancing,
  title={Enhancing Taiwanese Hokkien Dual Translation by Exploring and Standardizing of Four Writing Systems},
  author={Bo-Han Lu and Yi-Hsuan Lin and En-Shiun Annie Lee and Richard Tzong-Han Tsai},
  year={2024},
  eprint={2403.12024},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}