radinplaid committed on
Commit f9ebc67 · verified · 1 Parent(s): 8817d2f

Upload folder using huggingface_hub

README.md CHANGED
@@ -1,3 +1,100 @@
- ---
- license: cc-by-4.0
- ---

---
language:
- en
- fr
tags:
- translation
license: cc-by-4.0
datasets:
- quickmt/quickmt-train.en-fr
model-index:
- name: quickmt-en-fr
  results:
  - task:
      name: Translation eng-fra
      type: translation
      args: eng-fra
    dataset:
      name: flores101-devtest
      type: flores_101
      args: eng_Latn fra_Latn devtest
    metrics:
    - name: CHRF
      type: chrf
      value: 71.60
    - name: BLEU
      type: bleu
      value: 50.79
    - name: COMET
      type: comet
      value: 87.11
---

# `quickmt-en-fr` Neural Machine Translation Model

`quickmt-en-fr` is a reasonably fast and reasonably accurate neural machine translation model for translation from English (`en`) into French (`fr`).


## Model Information

* Trained using [`eole`](https://github.com/eole-nlp/eole)
* 185M-parameter 'big' transformer with 8 encoder layers and 2 decoder layers
* 50k joint SentencePiece vocabulary
* Exported to [CTranslate2](https://github.com/OpenNMT/CTranslate2) format for fast inference
* Training data: https://huggingface.co/datasets/quickmt/quickmt-train.en-fr/tree/main

See the `eole-config.yaml` configuration file in this repository for further details.


## Usage with `quickmt`

Install the NVIDIA CUDA toolkit first if you want to do GPU inference.

Next, install the `quickmt` Python library and download the model:

```bash
git clone https://github.com/quickmt/quickmt.git
pip install ./quickmt/

# List available models
quickmt-list

# Download a model
quickmt-model-download quickmt/quickmt-en-fr ./quickmt-en-fr
```

Finally, use the model in Python:

```python
from quickmt import Translator

# Auto-detects GPU; set device="cpu" to force CPU inference
t = Translator("./quickmt-en-fr/", device="auto")

# Translate - set beam_size to 5 for higher quality (but slower speed)
sample_text = "The Virgo interferometer is a large-scale scientific instrument near Pisa, Italy, for detecting gravitational waves."
t(sample_text, beam_size=1)

# Get alternative translations by sampling
# You can pass any CTranslate2 `translate_batch` arguments
t([sample_text], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)
```

The model is in CTranslate2 format and the tokenizers are SentencePiece models, so you can use `ctranslate2` directly instead of going through `quickmt`. It is also possible to use this model with, for example, [LibreTranslate](https://libretranslate.com/), which also uses `ctranslate2` and `sentencepiece`.
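
For illustration, here is a minimal sketch of direct CTranslate2 inference. The file layout (`joint.spm.model` alongside `model.bin`) is taken from this repository; the local download path and the explicit `</s>` handling (suggested by `add_source_eos: true` in `config.json`) are assumptions, so adjust them to your setup:

```python
import ctranslate2
import sentencepiece as spm

# Load the CTranslate2 model and the shared SentencePiece tokenizer from this repository
translator = ctranslate2.Translator("./quickmt-en-fr", device="auto")
sp = spm.SentencePieceProcessor(model_file="./quickmt-en-fr/joint.spm.model")

src = "The Virgo interferometer is a large-scale scientific instrument near Pisa, Italy."

# Tokenize into SentencePiece pieces; config.json sets add_source_eos,
# so append the end-of-sentence token to the source (assumption, see above)
tokens = sp.encode(src, out_type=str) + ["</s>"]

# Translate and detokenize the best hypothesis
result = translator.translate_batch([tokens], beam_size=5)
print(sp.decode(result[0].hypotheses[0]))
```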


## Metrics

`bleu` and `chrf2` are calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the [Flores200 `devtest` test set](https://huggingface.co/datasets/facebook/flores) ("eng_Latn" -> "fra_Latn"). `comet22` is calculated with the [`comet`](https://github.com/Unbabel/COMET) library and the [default model](https://huggingface.co/Unbabel/wmt22-comet-da). "Time (s)" is the time in seconds to translate the flores-devtest dataset (1012 sentences) using `ctranslate2` on an RTX 4070s GPU with batch size 32.
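
As a rough illustration (not necessarily the exact evaluation script used here), corpus-level BLEU and chrF2 can be computed with `sacrebleu` along these lines; the file names `hyp.fr` and `ref.fr` are hypothetical placeholders for the model output and the Flores references:

```python
import sacrebleu

# Hypothetical file names: one sentence per line
with open("hyp.fr", encoding="utf-8") as f:
    hyps = [line.rstrip("\n") for line in f]
with open("ref.fr", encoding="utf-8") as f:
    refs = [line.rstrip("\n") for line in f]

print(sacrebleu.corpus_bleu(hyps, [refs]).score)  # BLEU
print(sacrebleu.corpus_chrf(hyps, [refs]).score)  # chrF2 (beta=2 is the default)
```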

| Model                            | chrf2 | bleu  | comet22 | Time (s) |
| -------------------------------- | ----- | ----- | ------- | -------- |
| quickmt/quickmt-en-fr            | 71.60 | 50.79 | 87.11   | 1.28     |
| Helsinki-NLP/opus-mt-en-fr       | 69.98 | 47.97 | 86.29   | 4.13     |
| facebook/m2m100_418M             | 63.29 | 39.52 | 82.11   | 22.4     |
| facebook/m2m100_1.2B             | 68.31 | 45.39 | 86.50   | 44.0     |
| facebook/nllb-200-distilled-600M | 70.36 | 48.71 | 87.63   | 27.8     |
| facebook/nllb-200-distilled-1.3B | 71.95 | 51.10 | 88.50   | 47.8     |

`quickmt-en-fr` is the fastest model in the table and is higher quality than `opus-mt-en-fr`, `m2m100_418M`, and `m2m100_1.2B`.
config.json ADDED
@@ -0,0 +1,10 @@
{
  "add_source_bos": false,
  "add_source_eos": true,
  "bos_token": "<s>",
  "decoder_start_token": "<s>",
  "eos_token": "</s>",
  "layer_norm_epsilon": 1e-06,
  "multi_query_attention": false,
  "unk_token": "<unk>"
}
eole-config.yaml ADDED
@@ -0,0 +1,101 @@
## IO
save_data: enfr/data_spm
overwrite: True
seed: 1234
report_every: 100
valid_metrics: ["BLEU"]
tensorboard: true
tensorboard_log_dir: tensorboard

### Vocab
src_vocab: enfr/joint.eole.vocab
tgt_vocab: enfr/joint.eole.vocab
src_vocab_size: 50000
tgt_vocab_size: 50000
vocab_size_multiple: 8
share_vocab: True
n_sample: 0

data:
  corpus_1:
    path_src: hf://quickmt/quickmt-train.fr-en/en
    path_tgt: hf://quickmt/quickmt-train.fr-en/fr
    path_sco: hf://quickmt/quickmt-train.fr-en/sco
  valid:
    path_src: enfr/dev.en
    path_tgt: enfr/dev.fr

transforms: [sentencepiece, filtertoolong]
transforms_configs:
  sentencepiece:
    src_subword_model: "enfr/joint.spm.model"
    tgt_subword_model: "enfr/joint.spm.model"
  filtertoolong:
    src_seq_length: 256
    tgt_seq_length: 256

training:
  # Run configuration
  model_path: enfr/model
  train_from: enfr/model
  keep_checkpoint: 4
  save_checkpoint_steps: 2000
  train_steps: 100000
  valid_steps: 2000

  # Train on a single GPU
  world_size: 1
  gpu_ranks: [0]

  # Batching
  batch_type: "tokens"
  batch_size: 16384
  valid_batch_size: 16384
  batch_size_multiple: 8
  accum_count: [8]
  accum_steps: [0]

  # Optimizer & Compute
  compute_dtype: "bf16"
  optim: "pagedadamw8bit"
  #optim: "adamw"
  learning_rate: 2.0
  warmup_steps: 10000
  decay_method: "noam"
  adam_beta2: 0.998

  # Data loading
  bucket_size: 128000
  num_workers: 4
  prefetch_factor: 100

  # Hyperparams
  dropout_steps: [0]
  dropout: [0.1]
  attention_dropout: [0.1]
  max_grad_norm: 2
  label_smoothing: 0.1
  average_decay: 0.0001
  param_init_method: xavier_uniform
  normalization: "tokens"

model:
  architecture: "transformer"
  layer_norm: standard
  share_embeddings: true
  share_decoder_embeddings: true
  add_ffnbias: true
  mlp_activation_fn: gelu
  add_estimator: false
  add_qkvbias: false
  norm_eps: 1e-6
  hidden_size: 1024
  encoder:
    layers: 8
  decoder:
    layers: 2
  heads: 8
  transformer_ff: 4096
  embeddings:
    word_vec_size: 1024
    position_encoding_type: "SinusoidalInterleaved"
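
For context, a training run with this configuration would presumably be launched along these lines. This is a sketch only: the exact `eole` CLI invocation is an assumption, and the `enfr/` data, vocabulary, and SentencePiece paths referenced above belong to the original training environment:

```bash
# Assumes the eole CLI is installed and the enfr/ paths in eole-config.yaml exist locally
pip install eole
eole train -config eole-config.yaml
```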
eole-model/config.json ADDED
@@ -0,0 +1,150 @@
{
  "valid_metrics": [
    "BLEU"
  ],
  "tensorboard_log_dir": "tensorboard",
  "src_vocab_size": 50000,
  "tgt_vocab": "enfr/joint.eole.vocab",
  "seed": 1234,
  "save_data": "enfr/data_spm",
  "tgt_vocab_size": 50000,
  "overwrite": true,
  "share_vocab": true,
  "vocab_size_multiple": 8,
  "tensorboard": true,
  "transforms": [
    "sentencepiece",
    "filtertoolong"
  ],
  "report_every": 100,
  "tensorboard_log_dir_dated": "tensorboard/Feb-20_00-47-23",
  "n_sample": 0,
  "src_vocab": "enfr/joint.eole.vocab",
  "training": {
    "dropout_steps": [
      0
    ],
    "batch_size_multiple": 8,
    "bucket_size": 128000,
    "adam_beta2": 0.998,
    "label_smoothing": 0.1,
    "dropout": [
      0.1
    ],
    "compute_dtype": "torch.bfloat16",
    "valid_batch_size": 16384,
    "valid_steps": 2000,
    "accum_count": [
      8
    ],
    "world_size": 1,
    "gpu_ranks": [
      0
    ],
    "batch_size": 16384,
    "train_steps": 100000,
    "train_from": "enfr/model",
    "average_decay": 0.0001,
    "save_checkpoint_steps": 2000,
    "accum_steps": [
      0
    ],
    "max_grad_norm": 2.0,
    "prefetch_factor": 100,
    "batch_type": "tokens",
    "keep_checkpoint": 4,
    "attention_dropout": [
      0.1
    ],
    "learning_rate": 2.0,
    "optim": "pagedadamw8bit",
    "num_workers": 0,
    "model_path": "enfr/model",
    "normalization": "tokens",
    "decay_method": "noam",
    "warmup_steps": 10000,
    "param_init_method": "xavier_uniform"
  },
  "transforms_configs": {
    "sentencepiece": {
      "src_subword_model": "${MODEL_PATH}/joint.spm.model",
      "tgt_subword_model": "${MODEL_PATH}/joint.spm.model"
    },
    "filtertoolong": {
      "tgt_seq_length": 256,
      "src_seq_length": 256
    }
  },
  "model": {
    "layer_norm": "standard",
    "add_qkvbias": false,
    "architecture": "transformer",
    "add_estimator": false,
    "position_encoding_type": "SinusoidalInterleaved",
    "norm_eps": 1e-06,
    "heads": 8,
    "share_decoder_embeddings": true,
    "hidden_size": 1024,
    "mlp_activation_fn": "gelu",
    "share_embeddings": true,
    "transformer_ff": 4096,
    "add_ffnbias": true,
    "encoder": {
      "layer_norm": "standard",
      "encoder_type": "transformer",
      "add_qkvbias": false,
      "n_positions": null,
      "hidden_size": 1024,
      "mlp_activation_fn": "gelu",
      "position_encoding_type": "SinusoidalInterleaved",
      "norm_eps": 1e-06,
      "transformer_ff": 4096,
      "layers": 8,
      "heads": 8,
      "src_word_vec_size": 1024,
      "add_ffnbias": true
    },
    "decoder": {
      "layer_norm": "standard",
      "n_positions": null,
      "add_qkvbias": false,
      "hidden_size": 1024,
      "tgt_word_vec_size": 1024,
      "mlp_activation_fn": "gelu",
      "decoder_type": "transformer",
      "position_encoding_type": "SinusoidalInterleaved",
      "norm_eps": 1e-06,
      "transformer_ff": 4096,
      "layers": 2,
      "heads": 8,
      "add_ffnbias": true
    },
    "embeddings": {
      "tgt_word_vec_size": 1024,
      "src_word_vec_size": 1024,
      "position_encoding_type": "SinusoidalInterleaved",
      "word_vec_size": 1024
    }
  },
  "data": {
    "corpus_1": {
      "path_tgt": "hf://quickmt/quickmt-train.fr-en/fr",
      "transforms": [
        "sentencepiece",
        "filtertoolong"
      ],
      "path_align": null,
      "path_sco": "hf://quickmt/quickmt-train.fr-en/sco",
      "path_src": "hf://quickmt/quickmt-train.fr-en/en"
    },
    "valid": {
      "path_tgt": "enfr/dev.fr",
      "transforms": [
        "sentencepiece",
        "filtertoolong"
      ],
      "path_align": null,
      "path_src": "enfr/dev.en"
    }
  }
}
eole-model/joint.spm.model ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:19bab02bdbc41207bd3fabf86e20e691e978f78725d898c42de586b67cdaed02
size 1146015
eole-model/model.00.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2044ba22f9b306c427eed360f189b63ca8172bddc4d6fe559593beac1188f7cf
size 762769904
eole-model/vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
joint.eole.vocab ADDED
The diff for this file is too large to render. See raw diff
 
joint.spm.model ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:19bab02bdbc41207bd3fabf86e20e691e978f78725d898c42de586b67cdaed02
size 1146015
joint.spm.vocab ADDED
The diff for this file is too large to render. See raw diff
 
model.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8e9e6b469af7bbafdd6fbbf1cabfabcd278e74b5acb637d4c2ea3e41f004b023
size 381336824
shared_vocabulary.json ADDED
The diff for this file is too large to render. See raw diff