<div align="center">
<img src="./images/通古logo.png" width="400"/>
</div>
# TongGu LLM
## Introduction
TongGu is a classical Chinese LLM developed by the Deep Learning and Visual Computing Laboratory (SCUT-DLVCLab) at South China University of Technology. It has strong capabilities in understanding and processing ancient texts. TongGu is trained with multi-stage instruction fine-tuning and introduces a novel Redundancy-Aware Tuning (RAT) method, which largely preserves the capabilities of the base model while improving performance on downstream tasks.
<div align="center">
<img src="./images/model_training.png">
</div>
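For illustration only: the core idea behind RAT is to identify layers that are redundant with respect to the base model's behavior and restrict fine-tuning to those, keeping the remaining layers frozen to preserve base capabilities. The toy sketch below is not the paper's actual procedure; the redundancy score used here (cosine similarity between a layer's input and output) and the `tune_ratio` parameter are hypothetical stand-ins for the method described in the paper.

```python
import torch
import torch.nn as nn

def redundancy_score(layer: nn.Module, x: torch.Tensor) -> float:
    """Cosine similarity between a layer's input and output:
    higher similarity -> the layer changes the hidden state less,
    so it is treated as more redundant (hypothetical criterion)."""
    with torch.no_grad():
        y = layer(x)
    return nn.functional.cosine_similarity(x.flatten(1), y.flatten(1)).mean().item()

def apply_rat(layers: nn.ModuleList, x: torch.Tensor, tune_ratio: float = 0.25):
    """Freeze every layer, then unfreeze only the most redundant fraction."""
    scores = []
    for layer in layers:
        scores.append(redundancy_score(layer, x))
        with torch.no_grad():
            x = layer(x)  # feed forward so each score uses real activations
    k = max(1, int(len(layers) * tune_ratio))
    tunable = sorted(range(len(layers)), key=lambda i: scores[i], reverse=True)[:k]
    for i, layer in enumerate(layers):
        for p in layer.parameters():
            p.requires_grad = i in tunable
    return tunable

# Toy stack of 8 linear "layers"; only the 2 most redundant become trainable.
layers = nn.ModuleList([nn.Linear(16, 16) for _ in range(8)])
tuned = apply_rat(layers, torch.randn(4, 16), tune_ratio=0.25)
print(f"tunable layers: {sorted(tuned)}")
```

In a real setup the selected subset would then be fine-tuned on downstream instruction data while the frozen layers retain the base model's knowledge; consult the paper for the actual selection criterion.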
## Evaluation
TongGu surpasses existing models on a wide range of classical Chinese understanding and processing tasks. A comparison with its base model, Baichuan2-7B-Chat, demonstrates the effectiveness of TongGu's training process and methods. We will continue to update TongGu and build on even more powerful base models in the future.
<div align="center">
<img src="./images/evaluation_table.png">
</div>
<div align="center">
<img src="./images/evaluation_table2.png" width="600">
</div>
# Open-source List
## Model
[**TongGu-7B-Instruct**](https://huggingface.co/SCUT-DLVCLab/TongGu-7B-Instruct): A 7B classical Chinese language model based on Baichuan2-7B-Base. It has undergone unsupervised incremental pre-training on a 2.41-billion-scale classical Chinese corpus and fine-tuning on 4 million classical Chinese dialogue examples, and supports tasks such as ancient text annotation, translation, and appreciation.
## Data
**ACCN-INS**: 4 million classical Chinese instruction examples, covering 24 tasks across three dimensions: ancient text understanding, generation, and knowledge.
The ACCN-INS dataset may only be used for non-commercial research purposes. Scholars or organizations wishing to use the ACCN-INS dataset should first fill out this [Application Form](https://github.com/SCUT-DLVCLab/TongGu-LLM/blob/main/application-form/Application-Form-for-Using-ACCN-INS.docx) and email it to us. When submitting the application form, please list or attach 1-2 of your publications from the recent 6 years to demonstrate that you (or your team) conduct research in fields related to classical Chinese.
We will provide the download link and the decompression password after your application has been received and approved.
All users must comply with all use conditions; otherwise, the authorization will be revoked.
# News
- 2024/9/21 The TongGu paper has been accepted to EMNLP 2024.
- 2024/9/26 The TongGu model and instruction data have been open-sourced.
# Examples
<details><summary><b>句读 (Sentence Punctuation)</b></summary>

![image](./images/标点.png)
</details>
<details><summary><b>成语解释 (Idiom Explanation)</b></summary>

![image](./images/成语解释.png)
</details>
<details><summary><b>文白翻译 (Classical-to-Vernacular Translation)</b></summary>

![image](./images/文白翻译.png)
</details>
<details><summary><b>白文翻译 (Vernacular-to-Classical Translation)</b></summary>

![image](./images/白文翻译.png)
</details>
<details><summary><b>诗词创作 (Poetry Composition)</b></summary>

![image](./images/词创作.png)
</details>
# Inference
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "SCUT-DLVCLab/TongGu-7B-Instruct"

# Load the model in bfloat16 and let transformers place it on the available GPU(s).
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# System prompt (kept in Chinese, as the model expects): "You are TongGu, a
# classical Chinese LLM trained by the DLVCLab at South China University of
# Technology. You have rich knowledge of ancient texts and provide helpful,
# accurate answers."
system_message = "你是通古,由华南理工大学DLVCLab训练而来的古文大模型。你具备丰富的古文知识,为用户提供有用、准确的回答。"
# User query: "Translate into vernacular Chinese: ..."
user_query = "翻译成白话文:大学之道,在明明德,在亲民,在止于至善。"
prompt = f"{system_message}\n<用户> {user_query}\n<通古> "

inputs = tokenizer(prompt, return_tensors="pt")
generate_ids = model.generate(
    inputs.input_ids.to(model.device),
    max_new_tokens=128,
)
# Decode and strip the prompt so only the newly generated reply remains.
generate_text = tokenizer.batch_decode(
    generate_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)[0][len(prompt):]
print(generate_text)
```
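The example above uses a simple `<用户>`/`<通古>` turn format. A small helper can assemble such prompts; note that only the single-turn format is documented in this README, so the multi-turn layout below (turns concatenated with newlines) is an assumption to verify against the model card before relying on it.

```python
# Hypothetical prompt-builder for TongGu. The single-turn output matches the
# inference example above; the multi-turn concatenation is an assumption.

SYSTEM = (
    "你是通古,由华南理工大学DLVCLab训练而来的古文大模型。"
    "你具备丰富的古文知识,为用户提供有用、准确的回答。"
)

def build_prompt(user_query: str, history=(), system: str = SYSTEM) -> str:
    """Build a prompt from prior (user, reply) pairs plus the new query."""
    parts = [system]
    for user, reply in history:
        parts.append(f"<用户> {user}\n<通古> {reply}")
    # The trailing "<通古> " cues the model to produce its reply.
    parts.append(f"<用户> {user_query}\n<通古> ")
    return "\n".join(parts)

prompt = build_prompt("翻译成白话文:大学之道,在明明德,在亲民,在止于至善。")
print(prompt)
```

With an empty history this reproduces the exact single-turn prompt used in the inference example.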
# Citation
```
@inproceedings{cao2024tonggu,
  title={TongGu: Mastering Classical Chinese Understanding with Knowledge-Grounded Large Language Models},
  author={Cao, Jiahuan and Peng, Dezhi and Zhang, Peirong and Shi, Yongxin and Liu, Yang and Ding, Kai and Jin, Lianwen},
  booktitle={EMNLP},
  year={2024}
}
```
# Statement
After extensive incremental pre-training and instruction fine-tuning, TongGu has strong capabilities in processing ancient texts, such as punctuation and translation. However, due to limitations in model size and the autoregressive generation paradigm, TongGu may still produce misleading replies containing factual errors, or harmful content involving bias or discrimination. Please use it with caution and critically assess its outputs. Do not spread harmful content generated by TongGu on the Internet; anyone who disseminates such content bears responsibility for any adverse consequences.