Fairseq
German
Catalan
AudreyVM committed · Commit cc4c122 · verified · 1 Parent(s): 14b3918

Update README.md

Files changed (1)
  1. README.md +30 -19
README.md CHANGED
@@ -13,8 +13,10 @@ library_name: fairseq
  ## Model description

- This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-German datasets,
- which after filtering and cleaning comprised 6.258.272 sentence pairs. The model was evaluated on the Flores and NTREX evaluation datasets.

  ## Intended uses and limitations
@@ -54,28 +56,36 @@ However, we are well aware that our models may be biased. We intend to conduct r
  The model was trained on a combination of the following datasets:

- | Dataset | Sentences | Sentences after Cleaning |
- |-------------------|----------------|-------------------|
- | Multi CCAligned | 1.478.152 | 1.027.481 |
- | WikiMatrix | 180.322 | 125.811 |
- | GNOME | 12.333 | 1.241 |
- | KDE4 | 165.439 | 105.098 |
- | OpenSubtitles | 303.329 | 171.376 |
- | GlobalVoices | 4.636 | 3.578 |
- | Tatoeba | 732 | 655 |
- | Books | 4.445 | 2049 |
- | Europarl | 1.734.643 | 1.734.643 |
- | Tilde | 3.434.091 | 3.434.091 |

  All corpora except Europarl and Tilde were collected from [Opus](https://opus.nlpl.eu/).
  The Europarl and Tilde corpora are a synthetic parallel corpus created from the original Spanish-Catalan corpora by [SoftCatalà](https://github.com/Softcatala).


  ### Training procedure

  ### Data preparation

- All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
  This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
  The filtered datasets are then concatenated to form a final corpus of 6.258.272 and before training the punctuation is normalized using a
  modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)
@@ -127,10 +137,11 @@ Below are the evaluation results on the machine translation from German to Catal
  | Test set | SoftCatalà | Google Translate | aina-translator-de-ca |
  |----------------------|------------|------------------|---------------|
- | Flores 101 dev | 29,0 | **35,1** | 29,8 |
- | Flores 101 devtest | 29,3 | **35,4** | 30,1 |
- | NTREX | 25,8 | **31,3** | 24,5 |
- | Average | 28,0 | **33,9** | 28,1 |

  ## Additional information
 
  ## Model description

+ This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of datasets comprising both Catalan-German data
+ sourced from Opus, and additional datasets where synthetic Catalan was generated from the Spanish side of Spanish-German corpora using
+ [Projecte Aina’s Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca). This gave a total of approximately 100 million sentence pairs.
+ The model is evaluated on the Flores, NTEU and NTREX evaluation sets.

  ## Intended uses and limitations
 
  The model was trained on a combination of the following datasets:

+ | Dataset |
+ |-------------------|
+ | Multi CCAligned |
+ | WikiMatrix |
+ | GNOME |
+ | KDE4 |
+ | OpenSubtitles |
+ | GlobalVoices |
+ | Tatoeba |
+ | Books |
+ | Europarl |
+ | Tilde |
+ | Multi-ParaCrawl |
+ | DGT |
+ | EU Bookshop |
+ | NLLB |
+ | OpenSubtitles |
+

  All corpora except Europarl and Tilde were collected from [Opus](https://opus.nlpl.eu/).
  The Europarl and Tilde corpora are a synthetic parallel corpus created from the original Spanish-Catalan corpora by [SoftCatalà](https://github.com/Softcatala).
+ Where a Spanish-German corpus was used, synthetic Catalan was generated from the Spanish side using
+ [Projecte Aina’s Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca).


  ### Training procedure

  ### Data preparation

+ All datasets are deduplicated, filtered for language identification, and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
  This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
  The filtered datasets are then concatenated to form a final corpus of 6.258.272 and before training the punctuation is normalized using a
  modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)
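
The deduplication and similarity filtering described in this hunk can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the function names, the order of the two steps, and the toy two-dimensional vectors are all assumptions. In practice the vectors would be LaBSE sentence embeddings (for example, produced with the `sentence-transformers` library), and the language identification filter is omitted here.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[str, str]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_pairs(
    pairs: Sequence[Pair],
    src_embeddings: Sequence[Vector],
    tgt_embeddings: Sequence[Vector],
    threshold: float = 0.75,  # threshold stated in the README
) -> List[Pair]:
    """Drop exact duplicate pairs, then drop pairs scoring below the threshold."""
    seen = set()
    kept = []
    for pair, src_vec, tgt_vec in zip(pairs, src_embeddings, tgt_embeddings):
        if pair in seen:
            continue  # exact duplicate of an earlier pair
        seen.add(pair)
        if cosine_similarity(src_vec, tgt_vec) >= threshold:
            kept.append(pair)
    return kept


# Toy data: hypothetical sentence pairs with made-up 2-D "embeddings".
pairs = [("Guten Morgen", "Bon dia"), ("Guten Morgen", "Bon dia"), ("Danke", "Adéu")]
src_vecs = [(1.0, 0.0), (1.0, 0.0), (1.0, 0.0)]
tgt_vecs = [(0.9, 0.1), (0.9, 0.1), (0.2, 0.9)]
kept_pairs = filter_pairs(pairs, src_vecs, tgt_vecs)
```

Here the duplicate pair is dropped first, and the third pair is removed because its similarity falls below 0.75.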
 
  | Test set | SoftCatalà | Google Translate | aina-translator-de-ca |
  |----------------------|------------|------------------|---------------|
+ | Flores 101 dev | 28,9 | **35,1** | 33,1 |
+ | Flores 101 devtest | 29,2 | **35,9** | 33,2 |
+ | NTEU | 38,9 | 39,1 | **42,9** |
+ | NTREX | 25,7 | **31,2** | 29,1 |
+ | Average | 30,7 | **35,3** | 34,3 |
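
As a quick arithmetic check on the added Average row, the sketch below recomputes each column's mean from the four per-test-set scores shown in the table. The dictionary layout and the function name are illustrative only and are not part of the model repository. The scores are the rounded values from the table, written with decimal points instead of decimal commas.

```python
from typing import Dict, List

# Scores copied from the added table rows, in the order:
# Flores 101 dev, Flores 101 devtest, NTEU, NTREX.
SCORES: Dict[str, List[float]] = {
    "SoftCatalà": [28.9, 29.2, 38.9, 25.7],
    "Google Translate": [35.1, 35.9, 39.1, 31.2],
    "aina-translator-de-ca": [33.1, 33.2, 42.9, 29.1],
}


def column_means(scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Return the mean of each system's scores, rounded to one decimal."""
    return {
        system: round(sum(values) / len(values), 1)
        for system, values in scores.items()
    }


means = column_means(SCORES)
for system, mean in means.items():
    print(f"{system}: {mean}")
```

Recomputed this way, the SoftCatalà and Google Translate means match the row (30,7 and 35,3), while the aina-translator-de-ca mean comes out at 34,6 rather than the 34,3 listed. The listed figure may have been computed from different underlying scores.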

  ## Additional information