# Deep Transformers with Latent Depth (Li et al., 2020)

[https://arxiv.org/abs/2009.13102](https://arxiv.org/abs/2009.13102)

## Introduction

We present a probabilistic framework to automatically learn which layer(s) to use by learning the posterior distributions of layer selection. As an extension of this framework, we propose a novel method to train one shared Transformer network for multilingual machine translation with different layer selection posteriors for each language pair.
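
To make the layer-selection idea concrete, here is a minimal, hypothetical sketch (not the actual implementation in `examples/latent_depth/latent_depth_src`): each layer gets a learned selection logit, its sigmoid gates that layer's output, and a simple penalty pulls the expected number of selected layers toward a target depth, loosely the roles of the `--target-layers` and `--sparsity-weight` options used below.

```python
import torch
import torch.nn as nn


class LatentDepthStack(nn.Module):
    """Illustrative sketch: a stack of layers with learned selection probabilities."""

    def __init__(self, layers, target_layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)
        # One selection logit per layer; sigmoid(logit) is the probability
        # that the layer is applied rather than skipped.
        self.select_logits = nn.Parameter(torch.zeros(len(self.layers)))
        self.target_layers = target_layers

    def forward(self, x):
        probs = torch.sigmoid(self.select_logits)
        for layer, p in zip(self.layers, probs):
            # Soft selection: interpolate between skipping and applying the layer.
            x = (1 - p) * x + p * layer(x)
        return x

    def sparsity_loss(self):
        # One simple way to bias the expected number of selected layers toward
        # the target depth; the regularizer used in the paper may differ.
        expected_depth = torch.sigmoid(self.select_logits).sum()
        return (expected_depth - self.target_layers) ** 2


# Toy usage: 24 candidate layers, target depth of 12.
stack = LatentDepthStack([nn.Linear(16, 16) for _ in range(24)], target_layers=12)
out = stack(torch.randn(4, 16))
loss = out.pow(2).mean() + 0.1 * stack.sparsity_loss()
```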

## Training a multilingual model with latent depth

Below is an example of training with latent depth in the decoder for one-to-many (O2M) related languages. We use the same preprocessed (numberized and binarized) TED8 dataset as in [Balancing Training for Multilingual Neural Machine Translation (Wang et al., 2020)](https://github.com/cindyxinyiwang/multiDDS), which can be generated with [the script](https://github.com/cindyxinyiwang/multiDDS/blob/multiDDS/util_scripts/prepare_multilingual_data.sh) provided by the authors.
|
```bash
lang_pairs_str="eng-aze,eng-bel,eng-ces,eng-glg,eng-por,eng-rus,eng-slk,eng-tur"
databin_dir=<path to binarized data>

fairseq-train ${databin_dir} \
  --user-dir examples/latent_depth/latent_depth_src \
  --lang-pairs "${lang_pairs_str}" \
  --arch multilingual_transformer_iwslt_de_en \
  --task multilingual_translation_latent_depth \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --share-encoders \
  --share-decoders \
  --decoder-langtok \
  --share-decoder-input-output-embed \
  --dropout 0.3 --attention-dropout 0.3 \
  --optimizer adam --adam-eps 1e-06 --adam-betas '(0.9, 0.98)' \
  --lr-scheduler inverse_sqrt --stop-min-lr 1e-9 --warmup-init-lr 1e-7 --warmup-updates 8000 \
  --max-tokens 4096 --update-freq 1 \
  --lr 0.0015 \
  --clip-norm 1.0 \
  --seed 2 \
  --ddp-backend=legacy_ddp \
  --encoder-layers 12 \
  --decoder-layers 24 \
  --decoder-latent-layer \
  --sparsity-weight 0.1 \
  --anneal-updates 5000 \
  --soft-update 500 \
  --target-layers 12 \
  --share-weight 0.1
```
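
The last block of options configures the latent-depth machinery: `--decoder-latent-layer` enables latent layer selection in the shared decoder, `--target-layers` and `--sparsity-weight` set the depth regularization, and `--soft-update` and `--anneal-updates` set an annealing schedule for the selection. As a rough, assumed illustration of such a soft-to-hard schedule (the actual schedule in `examples/latent_depth/latent_depth_src` may differ), consider:

```python
def selection_hardness(num_updates: int, soft_update: int, anneal_updates: int) -> float:
    """Return a value in [0, 1]: 0.0 = fully soft layer selection, 1.0 = fully hard.

    Illustrative only: stay soft for `soft_update` steps, then anneal linearly
    toward hard (discrete) selection over the next `anneal_updates` steps.
    """
    if num_updates <= soft_update:
        return 0.0
    return min((num_updates - soft_update) / float(anneal_updates), 1.0)


# With --soft-update 500 --anneal-updates 5000, selection stays soft for the
# first 500 updates and is fully hard by roughly update 5500.
print(selection_hardness(3000, soft_update=500, anneal_updates=5000))  # 0.5
```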
|
## Inference command

```bash
lang_pairs_str="eng-aze,eng-bel,eng-ces,eng-glg,eng-por,eng-rus,eng-slk,eng-tur"
databin_dir=<path to binarized data>
model_path=<path to checkpoint>
src_lang=<source language to translate from>
tgt_lang=<target language to translate to>
gen_data=<name of data split, e.g. valid, test, etc>

fairseq-generate ${databin_dir} \
  --path ${model_path} \
  --task multilingual_translation_latent_depth \
  --decoder-latent-layer \
  --lang-pairs "${lang_pairs_str}" \
  -s ${src_lang} -t ${tgt_lang} \
  --gen-subset $gen_data \
  --scoring sacrebleu \
  --remove-bpe 'sentencepiece' \
  --lenpen 1.0 \
  --beam 5 \
  --decoder-langtok \
  --max-tokens 4096
```
|
## Citation
|
```bibtex
@article{li2020deep,
  title={Deep Transformers with Latent Depth},
  author={Li, Xian and Stickland, Asa Cooper and Tang, Yuqing and Kong, Xiang},
  journal={arXiv preprint arXiv:2009.13102},
  year={2020}
}
```
|