Fairseq · French · Catalan

AudreyVM committed · Commit 2a67270 · verified · 1 Parent(s): 9521443

Update README.md

Files changed (1): README.md (+32 −20)

README.md CHANGED
@@ -13,8 +13,7 @@ library_name: fairseq
 
 ## Model description
 
- This model was trained from scratch using the Fairseq toolkit on a combination of Catalan-French datasets, which after filtering and
- cleaning comprised 18.634.844 sentence pairs. The model is evaluated on the Flores and NTREX evaluation sets.
 
 ## Intended uses and limitations
 
@@ -54,23 +53,33 @@ However, we are well aware that our models may be biased. We intend to conduct r
 
 The model was trained on a combination of the following datasets:
 
- | Dataset         | Sentences  | Sentences after Cleaning |
- |-----------------|------------|--------------------------|
- | CCMatrix        | 24.386.198 | 16.305.758               |
- | Multi CCAligned | 1.954.475  | 1.442.584                |
- | WikiMatrix      | 490.871    | 437.665                  |
- | GNOME           | 12.962     | 1.686                    |
- | KDE 4           | 163.143    | 111.750                  |
- | Open Subtitles  | 392.159    | 225.786                  |
 
 ### Training procedure
 
 ### Data preparation
 
- All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
- This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
- The filtered datasets are then concatenated to form a final corpus of 6.159.631 and before training the punctuation is normalized using a
- modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
 
 #### Tokenization
 
@@ -110,19 +119,22 @@ Weights were saved every 1000 updates and reported results are the average of th
 
 ### Variable and metrics
 
- We use the BLEU score for evaluation on the [Flores-101](https://github.com/facebookresearch/flores) and [NTREX](https://github.com/MicrosoftTranslator/NTREX) test sets.
 
 ### Evaluation results
 
 Below are the evaluation results on the machine translation from French to Catalan compared to [Softcatalà](https://www.softcatala.org/) and
 [Google Translate](https://translate.google.es/?hl=es):
 
- | Test set           | SoftCatalà | Google Translate | aina-translator-fr-ca |
 |----------------------|------------|------------------|-----------------------|
- | Flores 101 dev     | 30,9       | **37,0**         | 33,0                  |
- | Flores 101 devtest | 31,3       | **37,1**         | 34,4                  |
- | NTREX              | 24,5       | **30,5**         | 27,0                  |
- | Average            | 28,9       | **34,9**         | 31,5                  |
 
 ## Additional information
 
 
 
 ## Model description
 
+ This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of datasets comprising both Catalan-French data sourced from OPUS, and additional datasets in which synthetic Catalan was generated from the Spanish side of Spanish-French corpora using [Projecte Aina’s Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca). This gave a total of approximately 100 million sentence pairs. The model is evaluated on the Flores, NTEU and NTREX evaluation sets.
 
 ## Intended uses and limitations
 
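Training from scratch with Fairseq typically follows a binarize-then-train command-line flow. The following is a generic sketch: all paths are placeholders and the hyperparameters are common Transformer defaults, not the actual configuration used for this model.

```shell
# Binarize the tokenized parallel corpus (paths are placeholders)
fairseq-preprocess \
  --source-lang fr --target-lang ca \
  --trainpref corpus/train --validpref corpus/valid \
  --destdir data-bin/fr-ca --workers 8

# Train a Transformer from scratch; hyperparameters shown are common
# defaults, not necessarily those used for this model
fairseq-train data-bin/fr-ca \
  --arch transformer --share-decoder-input-output-embed \
  --optimizer adam --adam-betas '(0.9, 0.98)' \
  --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --max-tokens 4096 --save-interval-updates 1000 \
  --save-dir checkpoints/fr-ca
```

The `--save-interval-updates 1000` flag mirrors the card's note that weights were saved every 1000 updates; everything else here is illustrative.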
 
 
 The model was trained on a combination of the following datasets:
 
+ | Datasets        |
+ |-----------------|
+ | DGT             |
+ | EU Bookshop     |
+ | Europarl        |
+ | Global Voices   |
+ | GNOME           |
+ | KDE 4           |
+ | Multi CCAligned |
+ | Multi Paracrawl |
+ | Multi UN        |
+ | NLLB            |
+ | NTEU            |
+ | Open Subtitles  |
+ | UNPC            |
+ | WikiMatrix      |
+
+ All data was sourced from OPUS and ELRC. After all Catalan-French data had been collected, Spanish-French data was collected and the Spanish data translated to Catalan using Projecte Aina’s Spanish-Catalan model.
+
 
 ### Training procedure
 
 ### Data preparation
 
+ All datasets are deduplicated, filtered by language identification, and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
+ This is done using sentence embeddings calculated using LaBSE. The filtered datasets are then concatenated to form the final corpus.
+ Before training, the punctuation is normalized using a modified version of the join-single-file.py script from SoftCatalà.
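The similarity filtering step above can be sketched as follows. This is an illustrative sketch, not the project's actual pipeline: `filter_by_cosine` is a hypothetical helper, and in practice the embeddings would come from [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) rather than being passed in precomputed.

```python
import numpy as np

def filter_by_cosine(src_emb: np.ndarray, tgt_emb: np.ndarray,
                     pairs: list, threshold: float = 0.75) -> list:
    """Keep only sentence pairs whose embeddings have cosine similarity >= threshold.

    src_emb, tgt_emb: (n, d) arrays of sentence embeddings (e.g. from LaBSE),
    row i of each corresponding to the source/target side of pairs[i].
    """
    # L2-normalise each row so the row-wise dot product equals cosine similarity
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = np.sum(src * tgt, axis=1)
    return [p for p, s in zip(pairs, sims) if s >= threshold]
```

The 0.75 default matches the threshold stated in the card; aligned pairs score near 1, unrelated pairs near 0.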
 
 #### Tokenization
 
 
 ### Variable and metrics
 
+ We use the BLEU score for evaluation on the following test sets: [Flores-101](https://github.com/facebookresearch/flores), [NTREX](https://github.com/MicrosoftTranslator/NTREX) and NTEU (an unpublished evaluation corpus).
+
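For reference, corpus-level BLEU as reported below can be sketched in a self-contained form (uniform n-gram weights, a single reference per hypothesis, and the standard brevity penalty). The reported scores were presumably computed with a standard toolkit such as sacreBLEU; this sketch only illustrates the metric.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams of order n in a token list
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU in [0, 100], one whitespace-tokenised reference per hypothesis."""
    clipped = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # total hypothesis n-grams per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ngrams, r_ngrams = ngrams(h, n), ngrams(r, n)
            clipped[n - 1] += sum(min(c, r_ngrams[g]) for g, c in h_ngrams.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(clipped) == 0:
        return 0.0  # some n-gram order had no matches
    log_precision = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
    brevity = min(1.0, math.exp(1 - ref_len / hyp_len))  # penalise short output
    return 100.0 * brevity * math.exp(log_precision)
```

A perfect match scores 100; any divergence from the reference lowers the n-gram precisions and hence the score.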
 
 ### Evaluation results
 
 Below are the evaluation results on the machine translation from French to Catalan compared to [Softcatalà](https://www.softcatala.org/) and
 [Google Translate](https://translate.google.es/?hl=es):
 
+ | Test set           | SoftCatalà | Google Translate | aina-translator-fr-ca |
 |----------------------|------------|------------------|-----------------------|
+ | Flores 101 dev     | 32,5       | **37,2**         | 35,6                  |
+ | Flores 101 devtest | 33,6       | **37,4**         | 36,3                  |
+ | NTEU               | 39,7       | 43,5             | **47,4**              |
+ | NTREX              | 26,7       | **30,5**         | 29,3                  |
+ | Average            | 33,1       | **37,1**         | **37,1**              |
 
 ## Additional information