File size: 2,089 Bytes
35af852 1099c0c 35af852 ff585be 35af852 6c90540 35af852 043dffc 35af852 1099c0c 35af852 043dffc |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 |
---
language:
- en
- el
tags:
- translation
widget:
- text: "Η τύχη βοηθάει τους τολμηρούς."
license: apache-2.0
metrics:
- bleu
---
## Greek to English NMT (lower-case output)
## By the Hellenic Army Academy (SSE) and the Technical University of Crete (TUC)
* source languages: el
* target languages: en
* licence: apache-2.0
* dataset: Opus, CCmatrix
* model: transformer(fairseq)
* pre-processing: tokenization + BPE segmentation
* metrics: bleu, chrf
* output: lowercase only, for mixed-cased model use this: https://huggingface.co/lighteternal/SSE-TUC-mt-el-en-cased
### Model description
Trained using the Fairseq framework, transformer_iwslt_de_en architecture.\\
BPE segmentation (10k codes).\\
Lower-case model.
### How to use
```
from transformers import FSMTTokenizer, FSMTForConditionalGeneration
mname = " <your_downloaded_model_folderpath_here> "
tokenizer = FSMTTokenizer.from_pretrained(mname)
model = FSMTForConditionalGeneration.from_pretrained(mname)
text = "Η τύχη βοηθάει τους τολμηρούς."
encoded = tokenizer.encode(text, return_tensors='pt')
outputs = model.generate(encoded, num_beams=5, num_return_sequences=5, early_stopping=True)
for i, output in enumerate(outputs):
i += 1
print(f"{i}: {output.tolist()}")
decoded = tokenizer.decode(output, skip_special_tokens=True)
print(f"{i}: {decoded}")
```
## Training data
Consolidated corpus from Opus and CC-Matrix (~6.6GB in total)
## Eval results
Results on Tatoeba testset (EL-EN):
| BLEU | chrF |
| ------ | ------ |
| 79.3 | 0.795 |
Results on XNLI parallel (EL-EN):
| BLEU | chrF |
| ------ | ------ |
| 66.2 | 0.623 |
### BibTeX entry and citation info
Dimitris Papadopoulos, et al. "PENELOPIE: Enabling Open Information Extraction for the Greek Language through Machine Translation." (2021). Accepted at EACL 2021 SRW
### Acknowledgement
The research work was supported by the Hellenic Foundation for Research and Innovation (HFRI) under the HFRI PhD Fellowship grant (Fellowship Number:50, 2nd call)
|