joanllop committed
Commit 26e813c · verified · 1 Parent(s): 3258cc3

Update README.md

Files changed (1)
  1. README.md +5 -5
README.md CHANGED
@@ -95,7 +95,7 @@ The pre-training corpus contains text in 35 European languages and code.
 
 ### Hyperparameters
 
-The full list of hyperparameters for each model can be found [here](https://github.com/langtech-bsc/salamandra/tree/main/configs).
+The full list of hyperparameters for each model can be found [here](https://github.com/langtech-bsc/salamandra/blob/main/configs/bsc_7b.yaml).
 
 ### Architecture
 
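For readers who want to inspect the linked configuration programmatically, here is a minimal sketch (not part of the commit). It assumes the GitHub blob link above resolves to the usual raw.githubusercontent.com path and that the file is plain YAML; swap the filename for other model sizes.

```python
# Illustrative sketch: fetch and skim the linked hyperparameter config.
# The raw URL is derived from the blob link in the README and is an assumption.
import requests
import yaml  # pip install pyyaml requests

RAW_URL = (
    "https://raw.githubusercontent.com/langtech-bsc/salamandra/"
    "main/configs/bsc_7b.yaml"
)

resp = requests.get(RAW_URL, timeout=30)
resp.raise_for_status()
config = yaml.safe_load(resp.text)

# List the top-level sections so the hyperparameters can be browsed quickly.
for key, value in config.items():
    print(f"{key}: {type(value).__name__}")
```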
@@ -149,7 +149,7 @@ All models were trained on [MareNostrum 5](https://www.bsc.es/ca/marenostrum/mar
 operated by Barcelona Supercomputing Center.
 
 The accelerated partition is composed of 1,120 nodes with the following specifications:
-- 4x Nvidia Hopper GPUs with 64 HBM2 memory
+- 4x Nvidia Hopper GPUs with 64GB HBM2 memory
 - 2x Intel Sapphire Rapids 8460Y+ at 2.3Ghz and 32c each (64 cores)
 - 4x NDR200 (BW per node 800Gb/s)
 - 512 GB of Main memory (DDR5)
@@ -663,7 +663,7 @@ We only use tasks that are either human generated, human translated, or with a s
 
 During the implementation of the evaluation we observed a series of issues worth considering when replicating and interpreting the results presented. These issues include ≈1.5% variances in performance in some tasks depending on the version of the `transformers` library used, and depending on the use (or lack of use) of tensor parallelism when loading a model. When implementing existing tasks, we carry out a comprehensive quality evaluation of the dataset, the Harness task itself, and what kind of input models see during evaluation. Our implementation (see links above) addresses multiple existing problems such as errors in datasets and prompts, and lack of pre-processing. All this means that results will vary if using other Harness implementations, and may slightly vary depending on the replication setup.
 
-It should be noted that these results are subject to all the drawbacks of every current gold-standard evaluation, and that the figures do not fully represent the models capabilities and potential. We thus advise caution when reading and interpreting the results.
+It should be noted that these results are subject to all the drawbacks of every current gold-standard evaluation, and that the figures do not fully represent the model's capabilities and potential. We thus advise caution when reading and interpreting the results.
 
 A full list of results compared to other baselines, a discussion of the model's performance across tasks and its implications, and details regarding problem-solving with task implementation will soon be available in the technical report.
 
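The context paragraph above attributes ≈1.5% score variance to the `transformers` version and to the use of tensor parallelism. As a replication aid, here is a minimal, hedged sketch of a pinned single-GPU loading setup; the model id, version pins, and dtype are illustrative assumptions rather than the evaluation configuration used in the report.

```python
# Illustrative sketch: pin the environment and load the model on a single device
# (i.e., without tensor parallelism) so results stay comparable across runs.
#   pip install "transformers==4.44.0" "torch==2.3.1"   # assumed version pins
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "BSC-LT/salamandra-7b"  # assumed repo id; substitute the model under evaluation

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map={"": 0},  # keep the whole model on one GPU
)
model.eval()
```

Sharding the same checkpoint across several GPUs, or changing the `transformers` version, is the other configuration under which some task scores were observed to shift by roughly 1.5%.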
@@ -957,7 +957,7 @@ Score 1: The answer is mathematically correct, with accurate calculations and ap
 
 #### Multilingual results
 
-Here, we present results for seven categories of tasks in Spanish, Catalan, Basque, Galician, and English. Results are presented for each task, criterion and language. Criteria with a `(B)` after their name are binary criteria (i.e., numbers go from 0 to 1, where 1 is best). The rest of the criteria are measured using a 5-point Likert scale, where 5 is best. The first number of the pair of numbers separated by `/` shows the average score for the criterion (and language). The second number of each pair is the robustness score, where numbers closer to 0 mean that the model generates similar responses when comparing the three prompt varieties for a single instance.
+Here, we present results for seven categories of tasks in Spanish, Catalan, Basque, Galician, and English. Results are presented for each task, criterion and language. Criteria with a `(B)` after their name are binary criteria (i.e., numbers go from 0 to 1, where 1 is best). The rest of the criteria are measured using a 5-point Likert scale, where 5 is best. The first number of the pair of numbers separated by `/` shows the average score for the criterion (and language). The second number of each pair is the robustness score, where numbers closer to 0 means that the model generates similar responses when comparing the three prompt varieties for a single instance.
 
 Further details on all tasks and criteria, a full list of results compared to other baselines, a discussion of the model's performance across tasks and its implications, and details regarding problem-solving with task implementation will soon be available in the technical report.
 
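As a reading aid for the `average / robustness` pairs described in the changed line above, the short sketch below shows how such a pair could be formed for one Likert-scale criterion. The spread-based robustness used here is an illustrative assumption; the exact formulation will be given in the technical report.

```python
# Illustrative sketch only: turn the three prompt-variety scores of one instance
# into an "average / robustness" pair. The population standard deviation is an
# assumed stand-in for the robustness measure (closer to 0 = more stable).
import statistics

prompt_variety_scores = [4, 5, 4]  # hypothetical 5-point Likert scores

average = statistics.mean(prompt_variety_scores)       # first number of the pair
robustness = statistics.pstdev(prompt_variety_scores)  # second number of the pair

print(f"{average:.2f} / {robustness:.2f}")  # -> 4.33 / 0.47
```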
@@ -1113,7 +1113,7 @@ the model performs very poorly in ambiguous settings, which indicates the presen
 Our cognitive bias analysis focuses on positional effects in 0-shot settings, and majority class bias in few-shot settings.
 For positional effects, we leverage the ARC Multiple Choice Question dataset (Clark et al., 2018). We observe significant,
 but relatively weak primacy effects, whereby the model shows a preference for answers towards the beginning of the list of provided answers.
-We measure effects of majority class effects in few-shot settings using SST-2 (Socher et al., 2013). We again detect significant effects,
+We measure the effects of majority class effects in few-shot settings using SST-2 (Socher et al., 2013). We again detect significant effects,
 with a small effect size. This suggests that the model is relatively robust against the examined cognitive biases.
 
 We highlight that our analyses of these biases are by no means exhaustive and are limited by the relative scarcity of adequate resources
 