hqsiswiliam committed on
Commit
5210987
1 Parent(s): 45502d7

Update README.md

Files changed (1)
  1. README.md +226 -3
README.md CHANGED
@@ -1,3 +1,226 @@
1
- ---
2
- license: mit
3
- ---
1
+ # LAPDOG
2
+ Repository for Learning Retrieval Augmentation for Personalized Dialogue Generation (LAPDOG). Our paper is available at [Learning Retrieval Augmentation for Personalized Dialogue Generation](https://aclanthology.org/2023.emnlp-main.154/).
3
+ ## Introduction
4
+ Personalized dialogue generation, focusing on generating highly tailored responses by leveraging persona profiles and dialogue context, has gained significant attention in conversational AI applications. However, persona profiles, a prevalent setting in current personalized dialogue datasets, typically composed of merely four to five sentences, may not offer comprehensive descriptions of the persona about the agent, posing a challenge to generate truly personalized dialogues. To handle this problem, we propose **L**earning Retrieval **A**ugmentation for **P**ersonalized **D**ial**O**gue **G**eneration (**LAPDOG**), which studies the potential of leveraging external knowledge for persona dialogue generation. Specifically, the proposed LAPDOG model consists of a story retriever and a dialogue generator. The story retriever uses a given persona profile as queries to retrieve relevant information from the story document, which serves as a supplementary context to augment the persona profile. The dialogue generator utilizes both the dialogue history and the augmented persona profile to generate personalized responses. For optimization, we adopt a joint training framework that collaboratively learns the story retriever and dialogue generator, where the story retriever is optimized towards desired ultimate metrics (e.g., BLEU) to retrieve content for the dialogue generator to generate personalized responses. Experiments conducted on the CONVAI2 dataset with ROCStory as a supplementary data source show that the proposed LAPDOG method substantially outperforms the baselines, indicating the effectiveness of the proposed method.
5
+ The LAPDOG model code is publicly available for further exploration. [https://github.com/hqsiswiliam/LAPDOG](https://github.com/hqsiswiliam/LAPDOG)
6
+ ## Architecture
7
+ ![LAPDOG](framework.png)
8
+ ## Experimental Results
9
+ ![EXPERIMENT](experiment.png)
10
+
11
+ # Repo Details
12
+ ## Basic Project Structure
13
+ - `ckpt`
14
+   - This folder contains the generation results and the checkpoint file
15
+ - `data/convai2`
16
+   - This folder contains the CONVAI2 data, preprocessed into JSONL files
17
+ - `data/corpora`
18
+   - This folder contains the pre-processed stories from the ROCStory dataset
19
+ - `src`
20
+   - This folder contains all the model and data-processing code for training/evaluating LAPDOG
21
+
22
+ ## Checkpoint Download URL
23
+ - Since the originally trained checkpoint was lost on the server, we have re-trained an XL-size model, available at the [Hugging Face download path](https://huggingface.co/hqsiswiliam/LAPDOG-XL/resolve/main/model.pth.tar).
24
+ - After the file has been downloaded, place it at `ckpt/xl/ckpt/checkpoint/pth/model.pth.tar`.
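The download-and-place step above can be sketched in Python. This is a minimal sketch using only the standard library; the URL and target path come from this README, while the helper names `checkpoint_path` and `download_checkpoint` are hypothetical and not part of the repository.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URL and relative target path are copied from the README.
CHECKPOINT_URL = "https://huggingface.co/hqsiswiliam/LAPDOG-XL/resolve/main/model.pth.tar"
CHECKPOINT_RELATIVE_PATH = Path("ckpt/xl/ckpt/checkpoint/pth/model.pth.tar")


def checkpoint_path(repo_root: str = ".") -> Path:
    """Return the expected checkpoint location, creating its parent folders."""
    target = Path(repo_root) / CHECKPOINT_RELATIVE_PATH
    target.parent.mkdir(parents=True, exist_ok=True)
    return target


def download_checkpoint(repo_root: str = ".") -> Path:
    """Download the checkpoint unless a file is already in place (hypothetical helper)."""
    target = checkpoint_path(repo_root)
    if not target.exists():
        urlretrieve(CHECKPOINT_URL, str(target))
    return target
```

Calling `download_checkpoint(".")` from the repository root should leave the file where the evaluation script expects it. The download is large, so the helper skips it when the file already exists.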
25
+ ## Environment Initialization
26
+ - The LAPDOG environment can be built with Anaconda (recommended). We provide `env.yml` for one-step environment creation:
27
+ ```bash
28
+ conda env create -f env.yml
29
+ ```
30
+ ## Basic Definitions
31
+ - reader: generator used in LAPDOG
32
+ - retriever: retriever used in LAPDOG
33
+ ## Main Arguments
34
+ - `name`: experiment name
35
+ - `checkpoint_dir`: the path to save model
36
+ - `save_index_path`: the path to save retriever index
37
+ - `refresh_index`: how often to refresh the retriever index (every `refresh_index` steps); -1 means never refresh
38
+ - `per_gpu_batch_size`: batch size per GPU for training/inference
39
+ - `total_steps`: the total number of training steps
40
+ - `eval_freq`: the steps for model evaluation (every `eval_freq` steps)
41
+ - `save_freq`: the steps for model save (every `save_freq` steps)
42
+ - `per_gpu_embedder_batch_size`: the batch size for retriever index embedding
43
+ - `gold_score_mode`: the metric used to guide retriever updates
44
+ - `train_retriever`: whether to train the retriever
45
+ - `precision`: the precision of the model, fp16/bf16/fp32
46
+ - `shard_optim`: shards optimizer state over available GPUs
47
+ - `shard_grads`: shards gradient over available GPUs
48
+ - `target_maxlength`: Maximum length of target outputs
49
+ - `generation_max_length`: maximum generation length, similar to `target_maxlength`
50
+ - `reader_model_type`: the model for the generator
51
+ - `dropout`: the dropout value for the architecture
52
+ - `weight_decay`: the weight decay for the optimizer
53
+ - `lr`: learning rate for generator
54
+ - `lr_retriever`: learning rate for retriever
55
+ - `scheduler`: the learning rate scheduler to use
56
+ - `text_maxlength`: maximum number of tokens in input text segments (concatenated story+context)
57
+ - `retriever_from`: the source of the retriever query (e.g., `persona`)
58
+ - `use_gradient_checkpoint_reader`: use gradient checkpointing for the generator
59
+ - `use_gradient_checkpoint_retriever`: use gradient checkpointing for the retriever
60
+ - `train_data`: the path to the training data JSONL file
61
+ - `eval_data`: the path to the evaluation data JSONL file
62
+ - `n_context`: how many retrieved stories are fused into the generator
63
+ - `retriever_n_context`: the number of retrieved stories from story corpus
64
+ - `log_freq`: the log frequency in steps
65
+ - `warmup_steps`: the number of warm-up steps
66
+ - `write_results`: whether to write results after evaluation
67
+ - `task`: currently fixed to `qa`, because the `qa` pre-processor handles the CONVAI2 dataset
68
+ - `index_mode`: whether to use `flat` or `faiss` to retrieve the top-k neighbours
69
+ - `passages`: the path to the story corpus JSONL file
70
+ - `save_index_n_shards`: the number of shards used when saving the index to file
71
+ - `temperature_score`: temperature parameter used in the loss computation
72
+ - `temperature_gold`: temperature parameter used in the loss computation
73
+ - `model_path`: path to initialize a model
74
+ - `load_reader_weights_only`: load only the generator weights, not the retriever weights (after stage 1, the checkpoint contains only the generator weights)
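The "every N steps" arguments above (`refresh_index`, `eval_freq`, `save_freq`) can be illustrated with a small sketch. It assumes an action fires whenever the step count is a positive multiple of its frequency and that `refresh_index=-1` disables refreshing; the function `scheduled_actions` is hypothetical and does not reproduce the actual logic in `train.py`.

```python
def scheduled_actions(step: int, eval_freq: int, save_freq: int, refresh_index: int) -> list:
    """List the periodic actions that would fire at a given training step.

    Assumption: an action fires when `step` is a positive multiple of its
    frequency; a non-positive `refresh_index` (e.g. -1) never refreshes.
    """
    actions = []
    if refresh_index > 0 and step > 0 and step % refresh_index == 0:
        actions.append("refresh_index")
    if step > 0 and step % eval_freq == 0:
        actions.append("evaluate")
    if step > 0 and step % save_freq == 0:
        actions.append("save")
    return actions
```

With the stage 2 values used later in this README (`refresh_index`, `eval_freq`, and `save_freq` all 1200), all three actions would coincide at steps 1200, 2400, and so on under this assumption.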
75
+
76
+ ## Example Training Script
77
+ ### Stage 1 - Training a generator
78
+ - This example trains a generator on CUDA devices 0 and 1; change the device indices to match your machine.
79
+ - Change `NGPU` and `nproc_per_node` to the number of GPUs available.
80
+ ```bash
81
+ export LR=5e-4
82
+ export CUDA_VISIBLE_DEVICES=0,1
83
+ export NGPU=2
84
+ python -m torch.distributed.launch --master_port=29566 --nproc_per_node=2 train.py \
85
+ --closed_book \
86
+ --shuffle \
87
+ --per_gpu_batch_size=64 \
88
+ --total_steps=12000 \
89
+ --eval_freq=1000 \
90
+ --save_freq=1000 \
91
+ --name= \
92
+ --checkpoint_dir=ckpt/xp_exp \
93
+ --use_gradient_checkpoint_reader \
94
+ --precision=fp32 \
95
+ --shard_optim \
96
+ --shard_grads \
97
+ --target_maxlength=32 \
98
+ --generation_max_length=32 \
99
+ --reader_model_type=google/t5-xl-lm-adapt \
100
+ --dropout=0.1 \
101
+ --weight_decay=0.01 \
102
+ --lr=${LR} \
103
+ --scheduler=linear \
104
+ --text_maxlength=560 \
105
+ --retriever_from=persona \
106
+ --train_data="data/convai2/train.jsonl" \
107
+ --eval_data="data/convai2/valid.jsonl" \
108
+ --log_freq=1 \
109
+ --warmup_steps=5 \
110
+ --write_results
111
+ ```
112
+ ### Stage 2 - Tuning a retriever
113
+ - After the stage 1 model has been trained, train LAPDOG starting from the generator trained in stage 1.
114
+ ```bash
115
+ export STAGE1_GENERATOR=PATH_TO_THE_MODEL_TRAINED_IN_STAGE1
116
+ export LR=5e-4
117
+ export RET_LR=5e-4
118
+ export TEMP=0.8
119
+ export TEMPG=0.85
120
+ export NGPU=2
121
+ export CUDA_VISIBLE_DEVICES=0,1
122
+ python -m torch.distributed.launch --master_port=29899 --nproc_per_node=2 train.py \
123
+ --shuffle \
124
+ --refresh_index=1200 \
125
+ --per_gpu_batch_size=8 \
126
+ --total_steps=12000 \
127
+ --eval_freq=1200 \
128
+ --save_freq=1200 \
129
+ --per_gpu_embedder_batch_size=128 \
130
+ --name=LAPDOG_XL \
131
+ --checkpoint_dir=ckpt/xl_exp/ \
132
+ --save_index_path=ckpt/xl_exp/saved_index \
133
+ --gold_score_mode=f1rougebleudist \
134
+ --train_retriever \
135
+ --precision=fp32 \
136
+ --shard_optim \
137
+ --shard_grads \
138
+ --target_maxlength=32 \
139
+ --generation_max_length=32 \
140
+ --reader_model_type=google/t5-xl-lm-adapt \
141
+ --dropout=0.1 \
142
+ --weight_decay=0.01 \
143
+ --lr=${LR} \
144
+ --lr_retriever=${RET_LR} \
145
+ --scheduler=linear \
146
+ --text_maxlength=512 \
147
+ --retriever_from=persona \
148
+ --use_gradient_checkpoint_reader \
149
+ --use_gradient_checkpoint_retriever \
150
+ --train_data="data/convai2/train.jsonl" \
151
+ --eval_data="data/convai2/valid.jsonl" \
152
+ --n_context=6 \
153
+ --retriever_n_context=6 \
154
+ --log_freq=1 \
155
+ --warmup_steps=50 \
156
+ --write_results \
157
+ --task=qa \
158
+ --index_mode=flat \
159
+ --passages="data/corpora/story/story.jsonl" \
160
+ --save_index_n_shards=128 \
161
+ --temperature_score=${TEMP} \
162
+ --temperature_gold=${TEMPG} \
163
+ --model_path=${STAGE1_GENERATOR} \
164
+ --load_reader_weights_only
165
+ ```
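The README does not spell out how `temperature_score` and `temperature_gold` enter the loss. One common formulation for metric-guided retriever training, shown here purely as an assumption, is a KL divergence between a temperature-scaled softmax over the metric-based ("gold") scores and a temperature-scaled softmax over the retriever scores. The function names below are hypothetical, and the sketch uses plain Python rather than the repository's code.

```python
import math


def softmax(scores, temperature):
    """Temperature-scaled softmax, shifted by the maximum for numerical stability."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def retriever_kl_loss(retriever_scores, gold_scores, temperature_score, temperature_gold):
    """KL(target || predicted) over the retrieved stories (assumed objective).

    target    = softmax(gold_scores / temperature_gold), e.g. per-story metric gains
    predicted = softmax(retriever_scores / temperature_score)
    """
    target = softmax(gold_scores, temperature_gold)
    predicted = softmax(retriever_scores, temperature_score)
    return sum(
        t * (math.log(t) - math.log(p))
        for t, p in zip(target, predicted)
        if t > 0
    )
```

Under this reading, lower temperatures sharpen the corresponding distribution, so the stage 2 values `TEMP=0.8` and `TEMPG=0.85` would make both distributions slightly more peaked than an unscaled softmax.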
166
+ ## Evaluation
167
+ We provide a script (`evaluation_test.sh`) to run evaluation on a trained checkpoint. You can also check the JSONL file in the ckpt folder (`ckpt/xl/ckpt/valid-result.jsonl`) for reference.
168
+ The content of the script is:
169
+ ```shell
170
+ NGPU=1 CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --master_port=28888 --nproc_per_node=1 evaluate.py \
171
+ --reader_model_type=google/t5-xl-lm-adapt \
172
+ --text_maxlength=512 \
173
+ --checkpoint_dir=ckpt/xl/eval \
174
+ --model_path=ckpt/xl/ckpt/checkpoint/pth/ \
175
+ --per_gpu_batch_size=1 \
176
+ --eval_data="data/convai2/valid.jsonl" \
177
+ --n_context=6 \
178
+ --retriever_n_context=6 \
179
+ --index_mode="flat" \
180
+ --precision=fp32 \
181
+ --save_index_path=ckpt/xl/eval \
182
+ --write_results \
183
+ --passages="data/corpora/story/story.jsonl" \
184
+ --retriever_from=persona \
186
+ --generation_num_beams=1 \
187
+ --generation_length_penalty=1
188
+ ```
189
+
190
+ ## Computing Metrics for the Generation Results
191
+ To compute the metrics for the evaluation results, run:
192
+ ```shell
193
+ python compute_metrics.py
194
+ ```
195
+ To evaluate a different JSONL file, change the path on line 64 of `compute_metrics.py` to the desired one:
196
+ ```python
197
+ ...
198
+ file_name = 'ckpt/xl/ckpt/valid-result.jsonl'
199
+ ...
200
+ ```
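For a quick sanity check of a results file outside `compute_metrics.py`, a token-level F1 can be computed with the standard library. This is a minimal sketch and not the repository's metric code; the JSONL field names `generation` and `reference` are hypothetical placeholders, so pass the keys your result file actually uses.

```python
import json
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Whitespace-token F1 between a generated response and its reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def average_f1(jsonl_path, prediction_key="generation", reference_key="reference"):
    """Average token F1 over a JSONL file; the default keys are assumptions."""
    scores = []
    with open(jsonl_path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                record = json.loads(line)
                scores.append(token_f1(record[prediction_key], record[reference_key]))
    return sum(scores) / len(scores) if scores else 0.0
```

For example, `average_f1('ckpt/xl/ckpt/valid-result.jsonl', prediction_key=..., reference_key=...)` would score the provided results file once the two keys are set to the fields it actually contains.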
201
+ # Citing
202
+ To cite this work, please use the following BibTeX entry:
203
+ ```
204
+ @inproceedings{huang-etal-2023-learning,
205
+ title = "Learning Retrieval Augmentation for Personalized Dialogue Generation",
206
+ author = "Huang, Qiushi and
207
+ Fu, Shuai and
208
+ Liu, Xubo and
209
+ Wang, Wenwu and
210
+ Ko, Tom and
211
+ Zhang, Yu and
212
+ Tang, Lilian",
213
+ editor = "Bouamor, Houda and
214
+ Pino, Juan and
215
+ Bali, Kalika",
216
+ booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
217
+ month = dec,
218
+ year = "2023",
219
+ address = "Singapore",
220
+ publisher = "Association for Computational Linguistics",
221
+ url = "https://aclanthology.org/2023.emnlp-main.154",
222
+ doi = "10.18653/v1/2023.emnlp-main.154",
223
+ pages = "2523--2540",
224
+ abstract = "Personalized dialogue generation, focusing on generating highly tailored responses by leveraging persona profiles and dialogue context, has gained significant attention in conversational AI applications. However, persona profiles, a prevalent setting in current personalized dialogue datasets, typically composed of merely four to five sentences, may not offer comprehensive descriptions of the persona about the agent, posing a challenge to generate truly personalized dialogues. To handle this problem, we propose $\textbf{L}$earning Retrieval $\textbf{A}$ugmentation for $\textbf{P}$ersonalized $\textbf{D}$ial$\textbf{O}$gue $\textbf{G}$eneration ($\textbf{LAPDOG}$), which studies the potential of leveraging external knowledge for persona dialogue generation. Specifically, the proposed LAPDOG model consists of a story retriever and a dialogue generator. The story retriever uses a given persona profile as queries to retrieve relevant information from the story document, which serves as a supplementary context to augment the persona profile. The dialogue generator utilizes both the dialogue history and the augmented persona profile to generate personalized responses. For optimization, we adopt a joint training framework that collaboratively learns the story retriever and dialogue generator, where the story retriever is optimized towards desired ultimate metrics (e.g., BLEU) to retrieve content for the dialogue generator to generate personalized responses. Experiments conducted on the CONVAI2 dataset with ROCStory as a supplementary data source show that the proposed LAPDOG method substantially outperforms the baselines, indicating the effectiveness of the proposed method. The LAPDOG model code is publicly available for further exploration.",
225
+ }
226
+ ```