projecte-aina
/

aina-translator-de-ca

Fairseq

German

Catalan

AudreyVM committed on Nov 6, 2024

Commit

cc4c122

verified

1 Parent(s): 14b3918

Update README.md

README.md +30 -19

@@ -13,8 +13,10 @@ library_name: fairseq
 ## Model description
-This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of Catalan-German datasets,
-which after filtering and cleaning comprised 6.258.272 sentence pairs. The model was evaluated on the Flores and NTREX evaluation datasets.
 ## Intended uses and limitations
@@ -54,28 +56,36 @@ However, we are well aware that our models may be biased. We intend to conduct r
 The model was trained on a combination of the following datasets:
-| Dataset       	| Sentences  	| Sentences after Cleaning|
-|-------------------|----------------|-------------------|
-| Multi CCAligned | 1.478.152 | 1.027.481 |
-| WikiMatrix  	| 180.322 	| 125.811 	|
-| GNOME	| 12.333|	1.241|
-| KDE4    	| 165.439   	|  105.098 	|
-| OpenSubtitles	| 303.329	| 171.376	|
-| GlobalVoices| 4.636 	|	3.578|
-| Tatoeba | 732 | 655 |
-| Books | 4.445 | 2049 |
-| Europarl | 1.734.643 | 1.734.643 |
-| Tilde | 3.434.091 | 3.434.091 |
 All corpora except Europarl and Tilde were collected from [Opus](https://opus.nlpl.eu/).
 The Europarl and Tilde corpora are a synthetic parallel corpus created from the original Spanish-Catalan corpora by [SoftCatalà](https://github.com/Softcatala).
 ### Training procedure
 ### Data preparation
- All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
  This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
  The filtered datasets are then concatenated to form a final corpus of 6.258.272 and before training the punctuation is normalized using a
  modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)
@@ -127,10 +137,11 @@ Below are the evaluation results on the machine translation from German to Catal
 | Test set         	| SoftCatalà | Google Translate | aina-translator-de-ca |
 |----------------------|------------|------------------|---------------|
-| Flores 101 dev   	| 29,0     	| **35,1**     	| 29,8     	|
-| Flores 101 devtest   |29,3   	| **35,4**     	| 30,1     	|
-| NTREX | 25,8 | **31,3** | 24,5 |
-| Average          	| 28,0   	| **33,9**     	| 28,1      	|
 ## Additional information

 ## Model description
+This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of datasets comprising both Catalan-German data
+sourced from Opus, and additional datasets where synthetic Catalan was generated from the Spanish side of Spanish-German corpora using
+[Projecte Aina’s Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca). This gave a total of approximately 100 million sentence pairs.
+The model is evaluated on the Flores, NTEU and NTREX evaluation sets.
 ## Intended uses and limitations
 The model was trained on a combination of the following datasets:
+| Dataset       	|
+|-------------------|
+| Multi CCAligned |
+| WikiMatrix  	|
+| GNOME	|
+| KDE4    	|
+| OpenSubtitles	|
+| GlobalVoices|
+| Tatoeba |
+| Books |
+| Europarl |
+| Tilde |
+| MultiParaCrawl |
+| DGT |
+| EU Bookshop |
+| NLLB |
+| OpenSubtitles |
 All corpora except Europarl and Tilde were collected from [Opus](https://opus.nlpl.eu/).
 The Europarl and Tilde corpora are a synthetic parallel corpus created from the original Spanish-Catalan corpora by [SoftCatalà](https://github.com/Softcatala).
+Where a Spanish-German corpus was used, synthetic Catalan was generated from the Spanish side using
+[Projecte Aina’s Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca).
 ### Training procedure
 ### Data preparation
+ All datasets are deduplicated, filtered for language identification, and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
  This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
  The filtered datasets are then concatenated to form a final corpus of 6.258.272 and before training the punctuation is normalized using a
  modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py)
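
The deduplication and similarity filtering described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the embeddings are assumed to be precomputed (the README obtains them from LaBSE), the `filter_pairs` helper and its toy inputs are hypothetical, and the language-identification step is left out.

```python
import math

THRESHOLD = 0.75  # cosine-similarity cutoff stated in the README


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_pairs(pairs, src_vecs, tgt_vecs, threshold=THRESHOLD):
    """Deduplicate sentence pairs, then drop those scoring below `threshold`.

    `pairs` is a list of (source, target) strings; `src_vecs` and `tgt_vecs`
    hold one precomputed embedding per pair (hypothetical inputs).
    """
    seen = set()
    kept = []
    for pair, u, v in zip(pairs, src_vecs, tgt_vecs):
        if pair in seen:
            continue  # exact duplicate of an earlier pair
        seen.add(pair)
        if cosine(u, v) >= threshold:
            kept.append(pair)
    return kept


# Toy 2-dimensional vectors stand in for real sentence embeddings.
pairs = [("Guten Morgen", "Bon dia"), ("Guten Morgen", "Bon dia"), ("Hallo", "Adéu")]
src = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
tgt = [[0.9, 0.1], [0.9, 0.1], [0.0, 1.0]]
print(filter_pairs(pairs, src, tgt))
```

In this toy run the duplicate pair is dropped first, and the mismatched third pair is removed because its similarity falls below the 0.75 cutoff.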
 | Test set         	| SoftCatalà | Google Translate | aina-translator-de-ca |
 |----------------------|------------|------------------|---------------|
+| Flores 101 dev   	| 28,9     	| **35,1**     	| 33,1    	|
+| Flores 101 devtest   |29,2   	| **35,9**     	| 33,2     	|
+| NTEU | 38,9 | 39,1 | **42,9** |
+| NTREX | 25,7 | **31,2** | 29,1 |
+| Average          	| 30,7   	| **35,3**     	| 34,3      	|
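
As a quick arithmetic check, the Average row can be recomputed from the four test-set rows. The sketch below assumes the average is an unweighted mean of the rows rounded to one decimal place; that assumption is not stated in the README, and the helper names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

# BLEU scores copied from the updated table (decimal commas as in the README).
rows = {
    "Flores 101 dev": ("28,9", "35,1", "33,1"),
    "Flores 101 devtest": ("29,2", "35,9", "33,2"),
    "NTEU": ("38,9", "39,1", "42,9"),
    "NTREX": ("25,7", "31,2", "29,1"),
}
systems = ("SoftCatalà", "Google Translate", "aina-translator-de-ca")


def parse(score):
    """Convert a decimal-comma string such as '28,9' to a Decimal."""
    return Decimal(score.replace(",", "."))


def column_means(table):
    """Unweighted mean per column, rounded to one decimal place."""
    columns = zip(*table.values())
    return [
        (sum(map(parse, col)) / len(table)).quantize(
            Decimal("0.1"), rounding=ROUND_HALF_UP
        )
        for col in columns
    ]


means = dict(zip(systems, column_means(rows)))
for name, value in means.items():
    print(name, value)
```

Under this assumption the first two columns reproduce the listed averages (30,7 and 35,3), while the aina-translator-de-ca column comes out at 34,6 rather than the listed 34,3.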
 ## Additional information