Fairseq · French · Catalan

AudreyVM committed · Commit 2a67270 · verified · 1 Parent(s): 9521443

Update README.md

Files changed (1): README.md (+32 −20)

README.md CHANGED
@@ -13,8 +13,7 @@ library_name: fairseq
 
 ## Model description
 
- This model was trained from scratch using the Fairseq toolkit on a combination of Catalan-French datasets, which after filtering and
- cleaning comprised 18.634.844 sentence pairs. The model is evaluated on the Flores and NTREX evaluation sets.
 
 ## Intended uses and limitations
 
@@ -54,23 +53,33 @@ However, we are well aware that our models may be biased. We intend to conduct r
 
 The model was trained on a combination of the following datasets:
 
- | Dataset         | Sentences  | Sentences after Cleaning |
- |-----------------|------------|--------------------------|
- | CCMatrix        | 24.386.198 | 16.305.758               |
- | Multi CCAligned | 1.954.475  | 1.442.584                |
- | WikiMatrix      | 490.871    | 437.665                  |
- | GNOME           | 12.962     | 1.686                    |
- | KDE 4           | 163.143    | 111.750                  |
- | Open Subtitles  | 392.159    | 225.786                  |
 
 ### Training procedure
 
 ### Data preparation
 
- All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
- This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
- The filtered datasets are then concatenated to form a final corpus of 6.159.631 and before training the punctuation is normalized using a
- modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
 
 #### Tokenization
 
@@ -110,19 +119,22 @@ Weights were saved every 1000 updates and reported results are the average of th
 
 ### Variable and metrics
 
- We use the BLEU score for evaluation on the [Flores-101](https://github.com/facebookresearch/flores) and [NTREX](https://github.com/MicrosoftTranslator/NTREX) test sets.
 
 ### Evaluation results
 
 Below are the evaluation results on the machine translation from French to Catalan compared to [Softcatalà](https://www.softcatala.org/) and
 [Google Translate](https://translate.google.es/?hl=es):
 
- | Test set           | SoftCatalà | Google Translate | aina-translator-fr-ca |
 |----------------------|------------|------------------|-----------------------|
- | Flores 101 dev     | 30,9       | **37,0**         | 33,0                  |
- | Flores 101 devtest | 31,3       | **37,1**         | 34,4                  |
- | NTREX              | 24,5       | **30,5**         | 27,0                  |
- | Average            | 28,9       | **34,9**         | 31,5                  |
 
 ## Additional information
 
 
 
 ## Model description
 
+ This model was trained from scratch using the [Fairseq toolkit](https://fairseq.readthedocs.io/en/latest/) on a combination of datasets comprising both Catalan-French data sourced from OPUS, and additional datasets in which synthetic Catalan was generated from the Spanish side of Spanish-French corpora using [Projecte Aina’s Spanish-Catalan model](https://huggingface.co/projecte-aina/aina-translator-es-ca). This gave a total of approximately 100 million sentence pairs. The model is evaluated on the Flores, NTEU and NTREX evaluation sets.
 
 ## Intended uses and limitations
 
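Training from scratch with Fairseq typically follows a binarize-then-train command-line flow. The following is a generic sketch: all paths are placeholders and the hyperparameters are common Transformer defaults, not the actual configuration used for this model.

```shell
# Binarize the tokenized parallel corpus (paths are placeholders)
fairseq-preprocess \
  --source-lang fr --target-lang ca \
  --trainpref corpus/train --validpref corpus/valid \
  --destdir data-bin/fr-ca --workers 8

# Train a Transformer from scratch; hyperparameters shown are common
# defaults, not necessarily those used for this model
fairseq-train data-bin/fr-ca \
  --arch transformer --share-decoder-input-output-embed \
  --optimizer adam --adam-betas '(0.9, 0.98)' \
  --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --max-tokens 4096 --save-interval-updates 1000 \
  --save-dir checkpoints/fr-ca
```

The `--save-interval-updates 1000` flag mirrors the card's note that weights were saved every 1000 updates; everything else here is illustrative.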
 
 
 The model was trained on a combination of the following datasets:
 
+ | Datasets        |
+ |-----------------|
+ | DGT             |
+ | EU Bookshop     |
+ | Europarl        |
+ | Global Voices   |
+ | GNOME           |
+ | KDE 4           |
+ | Multi CCAligned |
+ | Multi Paracrawl |
+ | Multi UN        |
+ | NLLB            |
+ | NTEU            |
+ | Open Subtitles  |
+ | UNPC            |
+ | WikiMatrix      |
+
+ All data was sourced from OPUS and ELRC. After all Catalan-French data had been collected, Spanish-French data was collected and the Spanish data translated to Catalan using Projecte Aina’s Spanish-Catalan model.
+
 
 ### Training procedure
 
 ### Data preparation
 
+ All datasets are deduplicated, filtered by language identification, and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
+ This is done using sentence embeddings calculated using LaBSE. The filtered datasets are then concatenated to form the final corpus.
+ Before training, the punctuation is normalized using a modified version of the join-single-file.py script from SoftCatalà.
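The similarity filtering step above can be sketched as follows. This is an illustrative sketch, not the project's actual pipeline: `filter_by_cosine` is a hypothetical helper, and in practice the embeddings would come from [LaBSE](https://huggingface.co/sentence-transformers/LaBSE) rather than being passed in precomputed.

```python
import numpy as np

def filter_by_cosine(src_emb: np.ndarray, tgt_emb: np.ndarray,
                     pairs: list, threshold: float = 0.75) -> list:
    """Keep only sentence pairs whose embeddings have cosine similarity >= threshold.

    src_emb, tgt_emb: (n, d) arrays of sentence embeddings (e.g. from LaBSE),
    row i of each corresponding to the source/target side of pairs[i].
    """
    # L2-normalise each row so the row-wise dot product equals cosine similarity
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = np.sum(src * tgt, axis=1)
    return [p for p, s in zip(pairs, sims) if s >= threshold]
```

The 0.75 default matches the threshold stated in the card; aligned pairs score near 1, unrelated pairs near 0.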
 
 #### Tokenization
 
 
 ### Variable and metrics
 
+ We use the BLEU score for evaluation on the following test sets: [Flores-101](https://github.com/facebookresearch/flores), [NTREX](https://github.com/MicrosoftTranslator/NTREX) and NTEU (an unpublished evaluation corpus).
+
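For reference, corpus-level BLEU as reported below can be sketched in a self-contained form (uniform n-gram weights, a single reference per hypothesis, and the standard brevity penalty). The reported scores were presumably computed with a standard toolkit such as sacreBLEU; this sketch only illustrates the metric.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    # Multiset of n-grams of order n in a token list
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU in [0, 100], one whitespace-tokenised reference per hypothesis."""
    clipped = [0] * max_n   # clipped n-gram matches per order
    totals = [0] * max_n    # total hypothesis n-grams per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ngrams, r_ngrams = ngrams(h, n), ngrams(r, n)
            clipped[n - 1] += sum(min(c, r_ngrams[g]) for g, c in h_ngrams.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if min(clipped) == 0:
        return 0.0  # some n-gram order had no matches
    log_precision = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
    brevity = min(1.0, math.exp(1 - ref_len / hyp_len))  # penalise short output
    return 100.0 * brevity * math.exp(log_precision)
```

A perfect match scores 100; any divergence from the reference lowers the n-gram precisions and hence the score.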
 
 ### Evaluation results
 
 Below are the evaluation results on the machine translation from French to Catalan compared to [Softcatalà](https://www.softcatala.org/) and
 [Google Translate](https://translate.google.es/?hl=es):
 
+ | Test set           | SoftCatalà | Google Translate | aina-translator-fr-ca |
 |----------------------|------------|------------------|-----------------------|
+ | Flores 101 dev     | 32,5       | **37,2**         | 35,6                  |
+ | Flores 101 devtest | 33,6       | **37,4**         | 36,3                  |
+ | NTEU               | 39,7       | 43,5             | **47,4**              |
+ | NTREX              | 26,7       | **30,5**         | 29,3                  |
+ | Average            | 33,1       | **37,1**         | **37,1**              |
 
 ## Additional information