---
license: apache-2.0
datasets:
- tatsu-lab/alpaca
- news_commentary
language:
- tr
- en
metrics:
- bleu
- bleurt
- comet
pipeline_tag: text-generation
---
# Extrapolating Large Language Models to Non-English by Aligning Languages

This repository contains the code for a project that aims to strengthen pre-trained Large Language Models (LLMs) on non-English languages by building semantic alignment across languages. The project explores both cross-lingual and multilingual instruction-tuning. The implementation is based on [Stanford Alpaca](https://github.com/tatsu-lab/stanford_alpaca).

![](./xllama.jpg)

## Requirements and Installation
To install this repository, follow these steps:
```
git clone git@github.com:NJUNLP/x-LLM.git
cd x-LLM
pip install --editable ./
```

For detailed information about the conda environment, refer to the environment.yml file.

## Usage
### Download Pre-trained LLM
Start by downloading the pre-trained LLM into the ./model directory.

### Download Dataset
You can download all the datasets used in this project from this [link](https://drive.google.com/file/d/1bkejieKDJFDJ45UmQYiY4eeqpGBwj-r-/view?usp=drive_link). Once downloaded, place the datasets in the ./data directory. The datasets include:

* Training dataset
  * Alpaca
  * WikiMatrix
  * News Commentary
* Evaluation dataset
  * XQuAD
  * MLQA
  * Flores-101
  * MI-Eval

### Load Raw Data Along with Instruction 
You can load raw data together with its instruction using the provided scripts (./data/<dataset>/<dataset>.py). To use a new dataset, implement a corresponding script for it. The loaded data has the following structure:
``` python
datasets.Features(
    {
        "id": datasets.Value("string"),
        "instruction": datasets.Value("string"),
        "input": datasets.Value("string"),
        "output": datasets.Value("string")
    }
)
```
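To make the schema concrete, here is a minimal sketch of one record and a check that it matches the four string fields. It uses plain Python rather than the `datasets` library; the helper `validate_record` and the example values are illustrative assumptions and are not part of this repository.

```python
# Hypothetical helper (not part of the x-LLM codebase): checks that a record
# has exactly the four string fields listed in the schema.
EXPECTED_FIELDS = ("id", "instruction", "input", "output")


def validate_record(record: dict) -> bool:
    """Return True if the record has exactly the expected string fields."""
    if set(record) != set(EXPECTED_FIELDS):
        return False
    return all(isinstance(record[name], str) for name in EXPECTED_FIELDS)


# Illustrative translation-style record; the values are made up.
example = {
    "id": "0",
    "instruction": "Translate the following sentence from English to Chinese.",
    "input": "Good morning.",
    "output": "早上好。",
}

print(validate_record(example))
```

A check like this can be handy when writing a loading script for a new dataset, since every field, including `id`, is expected to be a string.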

## Instruction-tune Pre-trained LLM
To instruction-tune the pre-trained LLM, run the train.sh script. For example, you can instruction-tune LLaMA-7B to x-LLaMA-7B (Chinese) with the following command:
``` bash
bash script/train.sh llama-7b-hf alpaca_en+alpaca_zh+translation_ncwm_en-zh
```
In this command, the first argument is the pre-trained LLM to use, and the second argument is the training data. You can use + to concatenate multiple datasets; the combined training data is shuffled by the Hugging Face Trainer.

Once the training is complete, the finetuned LLM will be saved in ./model/llama-7b-hf.alpaca_en+alpaca_zh+translation_ncwm_en-zh.finetune. You can use aliases to define shorter names, and more details can be found in ./data/alias/alias.json.
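The naming convention above can be sketched in a few lines of Python. This is only an illustration of how the two arguments relate to the output path; the function names `split_datasets` and `finetuned_model_dir` are hypothetical, and the actual logic lives in script/train.sh, which may differ.

```python
# Illustrative sketch only: mirrors the naming pattern described in the text,
# not the actual implementation in script/train.sh.


def split_datasets(data_spec: str) -> list[str]:
    """Split a '+'-joined training-data argument into dataset names."""
    return [name for name in data_spec.split("+") if name]


def finetuned_model_dir(model_name: str, data_spec: str, root: str = "./model") -> str:
    """Build the output path following the <model>.<data>.finetune pattern."""
    return f"{root}/{model_name}.{data_spec}.finetune"


spec = "alpaca_en+alpaca_zh+translation_ncwm_en-zh"
print(split_datasets(spec))
print(finetuned_model_dir("llama-7b-hf", spec))
```

Splitting on + only is safe for the names shown here because the hyphens in `translation_ncwm_en-zh` and `llama-7b-hf` belong to the names themselves.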

## Test Finetuned LLM
To test the finetuned LLM, run the inference.sh script. For example, you can test the tuned LLM on the Flores dataset with the following command:
``` bash
bash script/inference.sh llama-7b-hf.alpaca_en+alpaca_zh+translation_ncwm_en-zh.finetune translation_flores_en-zh
```
The results will be saved in model/llama-7b-hf.alpaca_en+alpaca_zh+translation_ncwm_en-zh.finetune/test/translation_flores_en-zh.inference.jsonl. The prediction field of each entry holds the text generated by the LLM.
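To inspect the generated text, the output file can be read line by line. The sketch below assumes that each non-empty line is a JSON object containing a `prediction` key, as the .jsonl extension suggests; the function name `read_predictions` is hypothetical and any other fields in the file are ignored.

```python
import json
from pathlib import Path


def read_predictions(path):
    """Return the `prediction` value from each non-empty line of a JSONL file."""
    predictions = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            predictions.append(record["prediction"])
    return predictions
```

For the example above, you would call `read_predictions` with the path to translation_flores_en-zh.inference.jsonl and get back a list with one generated string per test example.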

## Interact with LLM Through Web UI
To interact with the LLM through a web UI, run app.py with the following command:
``` bash
python app.py model/llama-7b-hf.alpaca_en+alpaca_zh+translation_ncwm_en-zh.finetune
```

## Citation
If you find this repository helpful, please consider citing our paper:
```
@misc{zhu2023extrapolating,
      title={Extrapolating Large Language Models to Non-English by Aligning Languages}, 
      author={Wenhao Zhu and Yunzhe Lv and Qingxiu Dong and Fei Yuan and Jingjing Xu and Shujian Huang and Lingpeng Kong and Jiajun Chen and Lei Li},
      year={2023},
      eprint={2308.04948},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```