joanllop committed
Commit 26e813c · verified · 1 Parent(s): 3258cc3

Update README.md

Files changed (1)
  1. README.md +5 -5
README.md CHANGED
@@ -95,7 +95,7 @@ The pre-training corpus contains text in 35 European languages and code.
 
 ### Hyperparameters
 
-The full list of hyperparameters for each model can be found [here](https://github.com/langtech-bsc/salamandra/tree/main/configs).
+The full list of hyperparameters for each model can be found [here](https://github.com/langtech-bsc/salamandra/blob/main/configs/bsc_7b.yaml).
 
 ### Architecture
 
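For readers who want to inspect the linked configuration programmatically, here is a minimal sketch (not part of the commit). It assumes the GitHub blob link above resolves to the usual raw.githubusercontent.com path and that the file is plain YAML; swap the filename for other model sizes.

```python
# Illustrative sketch: fetch and skim the linked hyperparameter config.
# The raw URL is derived from the blob link in the README and is an assumption.
import requests
import yaml  # pip install pyyaml requests

RAW_URL = (
    "https://raw.githubusercontent.com/langtech-bsc/salamandra/"
    "main/configs/bsc_7b.yaml"
)

resp = requests.get(RAW_URL, timeout=30)
resp.raise_for_status()
config = yaml.safe_load(resp.text)

# List the top-level sections so the hyperparameters can be browsed quickly.
for key, value in config.items():
    print(f"{key}: {type(value).__name__}")
```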
@@ -149,7 +149,7 @@ All models were trained on [MareNostrum 5](https://www.bsc.es/ca/marenostrum/mar
 operated by Barcelona Supercomputing Center.
 
 The accelerated partition is composed of 1,120 nodes with the following specifications:
-- 4x Nvidia Hopper GPUs with 64 HBM2 memory
+- 4x Nvidia Hopper GPUs with 64GB HBM2 memory
 - 2x Intel Sapphire Rapids 8460Y+ at 2.3Ghz and 32c each (64 cores)
 - 4x NDR200 (BW per node 800Gb/s)
 - 512 GB of Main memory (DDR5)
@@ -663,7 +663,7 @@ We only use tasks that are either human generated, human translated, or with a s
 
 During the implementation of the evaluation we observed a series of issues worth considering when replicating and interpreting the results presented. These issues include ≈1.5% variances in performance in some tasks depending on the version of the `transformers` library used, and depending on the use (or lack of use) of tensor parallelism when loading a model. When implementing existing tasks, we carry out a comprehensive quality evaluation of the dataset, the Harness task itself, and what kind of input models see during evaluation. Our implementation (see links above) addresses multiple existing problems such as errors in datasets and prompts, and lack of pre-processing. All this means that results will vary if using other Harness implementations, and may slightly vary depending on the replication setup.
 
-It should be noted that these results are subject to all the drawbacks of every current gold-standard evaluation, and that the figures do not fully represent the models capabilities and potential. We thus advise caution when reading and interpreting the results.
+It should be noted that these results are subject to all the drawbacks of every current gold-standard evaluation, and that the figures do not fully represent the model's capabilities and potential. We thus advise caution when reading and interpreting the results.
 
 A full list of results compared to other baselines, a discussion of the model's performance across tasks and its implications, and details regarding problem-solving with task implementation will soon be available in the technical report.
 
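The context paragraph above attributes ≈1.5% score variance to the `transformers` version and to the use of tensor parallelism. As a replication aid, here is a minimal, hedged sketch of a pinned single-GPU loading setup; the model id, version pins, and dtype are illustrative assumptions rather than the evaluation configuration used in the report.

```python
# Illustrative sketch: pin the environment and load the model on a single device
# (i.e., without tensor parallelism) so results stay comparable across runs.
#   pip install "transformers==4.44.0" "torch==2.3.1"   # assumed version pins
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "BSC-LT/salamandra-7b"  # assumed repo id; substitute the model under evaluation

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map={"": 0},  # keep the whole model on one GPU
)
model.eval()
```

Sharding the same checkpoint across several GPUs, or changing the `transformers` version, is the other configuration under which some task scores were observed to shift by roughly 1.5%.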
@@ -957,7 +957,7 @@ Score 1: The answer is mathematically correct, with accurate calculations and ap
 
 #### Multilingual results
 
-Here, we present results for seven categories of tasks in Spanish, Catalan, Basque, Galician, and English. Results are presented for each task, criterion and language. Criteria with a `(B)` after their name are binary criteria (i.e., numbers go from 0 to 1, where 1 is best). The rest of the criteria are measured using a 5-point Likert scale, where 5 is best. The first number of the pair of numbers separated by `/` shows the average score for the criterion (and language). The second number of each pair is the robustness score, where numbers closer to 0 mean that the model generates similar responses when comparing the three prompt varieties for a single instance.
+Here, we present results for seven categories of tasks in Spanish, Catalan, Basque, Galician, and English. Results are presented for each task, criterion and language. Criteria with a `(B)` after their name are binary criteria (i.e., numbers go from 0 to 1, where 1 is best). The rest of the criteria are measured using a 5-point Likert scale, where 5 is best. The first number of the pair of numbers separated by `/` shows the average score for the criterion (and language). The second number of each pair is the robustness score, where numbers closer to 0 means that the model generates similar responses when comparing the three prompt varieties for a single instance.
 
 Further details on all tasks and criteria, a full list of results compared to other baselines, a discussion of the model's performance across tasks and its implications, and details regarding problem-solving with task implementation will soon be available in the technical report.
 
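As a reading aid for the `average / robustness` pairs described in the changed line above, the short sketch below shows how such a pair could be formed for one Likert-scale criterion. The spread-based robustness used here is an illustrative assumption; the exact formulation will be given in the technical report.

```python
# Illustrative sketch only: turn the three prompt-variety scores of one instance
# into an "average / robustness" pair. The population standard deviation is an
# assumed stand-in for the robustness measure (closer to 0 = more stable).
import statistics

prompt_variety_scores = [4, 5, 4]  # hypothetical 5-point Likert scores

average = statistics.mean(prompt_variety_scores)       # first number of the pair
robustness = statistics.pstdev(prompt_variety_scores)  # second number of the pair

print(f"{average:.2f} / {robustness:.2f}")  # -> 4.33 / 0.47
```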
@@ -1113,7 +1113,7 @@ the model performs very poorly in ambiguous settings, which indicates the presen
 Our cognitive bias analysis focuses on positional effects in 0-shot settings, and majority class bias in few-shot settings.
 For positional effects, we leverage the ARC Multiple Choice Question dataset (Clark et al., 2018). We observe significant,
 but relatively weak primacy effects, whereby the model shows a preference for answers towards the beginning of the list of provided answers.
-We measure effects of majority class effects in few-shot settings using SST-2 (Socher et al., 2013). We again detect significant effects,
+We measure the effects of majority class effects in few-shot settings using SST-2 (Socher et al., 2013). We again detect significant effects,
 with a small effect size. This suggests that the model is relatively robust against the examined cognitive biases.
 
 We highlight that our analyses of these biases are by no means exhaustive and are limited by the relative scarcity of adequate resources
 