joermd committed on
Commit 6f25beb · verified · 1 parent: d8bb47f

Clone model from ALLaM-AI/ALLaM-7B-Instruct-preview

This view is limited to 50 files because it contains too many changes. See raw diff
Files changed (50)
  1. README.md +211 -0
  2. config.json +28 -0
  3. evaluations/ar/AceGPT-v2-32B-Chat/acva_5_shot.json +125 -0
  4. evaluations/ar/AceGPT-v2-32B-Chat/ar_ifeval_0_shot.json +142 -0
  5. evaluations/ar/AceGPT-v2-32B-Chat/araMath_v3_5_shot.json +126 -0
  6. evaluations/ar/AceGPT-v2-32B-Chat/araPro_0_shot.json +130 -0
  7. evaluations/ar/AceGPT-v2-32B-Chat/arabicmmlu_0_shot.json +0 -0
  8. evaluations/ar/AceGPT-v2-32B-Chat/etec_v2_0_shot.json +126 -0
  9. evaluations/ar/AceGPT-v2-32B-Chat/exams_ar_5_shot.json +127 -0
  10. evaluations/ar/AceGPT-v2-32B-Chat/gat_0_shot.json +543 -0
  11. evaluations/ar/AceGPT-v2-32B-Chat/moe_ien_mcq_0_shot.json +127 -0
  12. evaluations/ar/AceGPT-v2-32B-Chat/moe_ien_tf_0_shot.json +129 -0
  13. evaluations/ar/AceGPT-v2-32B-Chat/openaimmlu_0_shot.json +0 -0
  14. evaluations/ar/AceGPT-v2-8B-Chat/acva_5_shot.json +123 -0
  15. evaluations/ar/AceGPT-v2-8B-Chat/ar_ifeval_0_shot.json +142 -0
  16. evaluations/ar/AceGPT-v2-8B-Chat/araMath_v3_5_shot.json +126 -0
  17. evaluations/ar/AceGPT-v2-8B-Chat/araPro_0_shot.json +130 -0
  18. evaluations/ar/AceGPT-v2-8B-Chat/arabicmmlu_0_shot.json +0 -0
  19. evaluations/ar/AceGPT-v2-8B-Chat/etec_v2_0_shot.json +126 -0
  20. evaluations/ar/AceGPT-v2-8B-Chat/exams_ar_5_shot.json +119 -0
  21. evaluations/ar/AceGPT-v2-8B-Chat/gat_0_shot.json +539 -0
  22. evaluations/ar/AceGPT-v2-8B-Chat/moe_ien_mcq_0_shot.json +127 -0
  23. evaluations/ar/AceGPT-v2-8B-Chat/moe_ien_tf_0_shot.json +129 -0
  24. evaluations/ar/AceGPT-v2-8B-Chat/openaimmlu_0_shot.json +0 -0
  25. evaluations/ar/Allam-7b-instruct-preview/acva_5_shot.json +119 -0
  26. evaluations/ar/Allam-7b-instruct-preview/ar_ifeval_0_shot.json +142 -0
  27. evaluations/ar/Allam-7b-instruct-preview/araMath_v3_5_shot.json +126 -0
  28. evaluations/ar/Allam-7b-instruct-preview/araPro_0_shot.json +130 -0
  29. evaluations/ar/Allam-7b-instruct-preview/arabicmmlu_0_shot.json +0 -0
  30. evaluations/ar/Allam-7b-instruct-preview/etec_v2_0_shot.json +126 -0
  31. evaluations/ar/Allam-7b-instruct-preview/exams_ar_5_shot.json +121 -0
  32. evaluations/ar/Allam-7b-instruct-preview/gat_0_shot.json +549 -0
  33. evaluations/ar/Allam-7b-instruct-preview/moe_ien_mcq_0_shot.json +127 -0
  34. evaluations/ar/Allam-7b-instruct-preview/moe_ien_tf_0_shot.json +129 -0
  35. evaluations/ar/Allam-7b-instruct-preview/openaimmlu_0_shot.json +0 -0
  36. evaluations/ar/Falcon3-7B-Instruct/acva_5_shot.json +123 -0
  37. evaluations/ar/Falcon3-7B-Instruct/ar_ifeval_0_shot.json +142 -0
  38. evaluations/ar/Falcon3-7B-Instruct/araMath_v3_5_shot.json +126 -0
  39. evaluations/ar/Falcon3-7B-Instruct/araPro_0_shot.json +130 -0
  40. evaluations/ar/Falcon3-7B-Instruct/arabicmmlu_0_shot.json +0 -0
  41. evaluations/ar/Falcon3-7B-Instruct/etec_v2_0_shot.json +126 -0
  42. evaluations/ar/Falcon3-7B-Instruct/exams_ar_5_shot.json +125 -0
  43. evaluations/ar/Falcon3-7B-Instruct/gat_0_shot.json +553 -0
  44. evaluations/ar/Falcon3-7B-Instruct/moe_ien_mcq_0_shot.json +127 -0
  45. evaluations/ar/Falcon3-7B-Instruct/moe_ien_tf_0_shot.json +129 -0
  46. evaluations/ar/Falcon3-7B-Instruct/openaimmlu_0_shot.json +0 -0
  47. evaluations/ar/Llama-3.3-70B-Instruct/acva_5_shot.json +125 -0
  48. evaluations/ar/Llama-3.3-70B-Instruct/ar_ifeval_0_shot.json +142 -0
  49. evaluations/ar/Llama-3.3-70B-Instruct/araMath_v3_5_shot.json +126 -0
  50. evaluations/ar/Llama-3.3-70B-Instruct/araPro_0_shot.json +130 -0
README.md ADDED
@@ -0,0 +1,211 @@
---
license: apache-2.0
language:
- ar
- en
pipeline_tag: text-generation
tags:
- pytorch
library_name: transformers
---
# ALLaM-7B-Instruct-preview

ALLaM is a series of powerful language models designed to advance Arabic Language Technology (ALT), developed by the National Center for Artificial Intelligence (NCAI) at the [Saudi Data and AI Authority (SDAIA)](https://sdaia.gov.sa/en/default.aspx). `ALLaM-AI/ALLaM-7B-Instruct-preview` is trained from scratch. Our from-scratch pretraining recipe consists of two steps: training on 4T English tokens, followed by training on 1.2T mixed Arabic/English tokens. This retains the model's English capabilities without catastrophic forgetting, effectively transferring knowledge from one language distribution to another.

## Intended Use

`ALLaM` is specifically designed to expedite the research and development of ALT through Large Language Models (LLMs). It serves as a foundational element for building product offerings as well as for facilitating experimental initiatives.

The ALLaM series models are designed to be components of larger AI systems, and it is important for developers to incorporate safety measures when building those systems. Such measures are crucial for balancing effectiveness and security, and for minimizing potential risks such as those arising from integrating the model with external tools.

## Model Details

ALLaM is a family of LLMs specially trained for Arabic. The two main paths followed for pretraining are:

- **ALLaM**: Pretraining models from scratch
- **ALLaM-Adapted / ALLaM-(\*\*) / (\*\*)-ALLaM**: Continued training from open-source/open-weight models

For this release, we are providing our instruction-tuned 7B-parameter generative model pretrained from scratch.

Some parameters of this model are provided in the following table:

| Size          | Context Length | Pretraining Tokens       | Instructions | Preference Pairs |
|---------------|----------------|--------------------------|--------------|------------------|
| 7B parameters | 4096 tokens    | 4T (en) + 1.2T (en + ar) | 7M           | 260K             |

## Model Description

- **Developed by:** National Center for Artificial Intelligence at [SDAIA](https://sdaia.gov.sa/en/default.aspx)
- **Model type:** Autoregressive Transformer
- **Language(s):** Arabic, English
- **License:** Please see the LICENSE file
- **Input:** Text
- **Output:** Text

## Training Details

ALLaM-7B-Instruct-preview was pretrained on a total of 5.2 trillion tokens in English and Arabic. Our training codebase is built on [NVIDIA/Megatron-LM](https://github.com/NVIDIA/Megatron-LM). The average MFU during training was ~42%, and the model was trained using bf16 mixed precision.
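
As a rough sanity check on scale, the standard ≈6·N·D approximation relates parameter count and token count to total training compute. This is an illustrative estimate derived from the figures above, not an official number:

```python
# Back-of-the-envelope training compute using the common 6*N*D approximation;
# illustrative only, not an official figure from the ALLaM team.
N = 7e9       # model parameters
D = 5.2e12    # total pretraining tokens (4T en + 1.2T en+ar)
total_flops = 6 * N * D
print(f"~{total_flops:.2e} training FLOPs")  # ~2.18e+23
```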

## Getting Started

### System Prompt
It is important to note that this model is optimized to function without a predefined system prompt. While ALLaM does not come with a default system prompt, it provides the flexibility to add a custom one. For instance, a well-crafted system prompt could be:

"You are ALLaM, a bilingual English and Arabic AI assistant."

System prompts can also be in Arabic:

"أنت علام، مساعد ذكاء اصطناعي مطور من الهيئة السعودية للبيانات والذكاء الاصطناعي، تجيب على الأسئلة بطريقة مفيدة مع مراعاة القيم الثقافية المحلية."

("You are ALLaM, an AI assistant developed by the Saudi Data and AI Authority; you answer questions helpfully while taking local cultural values into account.")

Alternatively, users can get creative with their prompts, such as:

"You are an AI assistant who responds to everything like a pirate."

The system prompt is integrated through the chat template stored in the tokenizer config (applied via the `apply_chat_template()` method).
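
A custom system prompt is simply passed as the first message of the conversation, and the chat template handles the formatting. A minimal sketch, assuming the standard `transformers` chat-template API:

```python
# Minimal sketch: injecting a custom system prompt via the chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ALLaM-AI/ALLaM-7B-Instruct-preview")
messages = [
    {"role": "system", "content": "You are ALLaM, a bilingual English and Arabic AI assistant."},
    {"role": "user", "content": "What is the capital of Saudi Arabia?"},
]
# add_generation_prompt=True appends the assistant turn header so the model answers next.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # inspect the fully formatted prompt string
```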

### Example Usages

The weights for ALLaM model checkpoints can be accessed via [Hugging Face transformers](https://github.com/huggingface/transformers) (tested with `transformers>=4.40.1`). The following snippet demonstrates how to load `ALLaM-AI/ALLaM-7B-Instruct-preview` and generate text:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

allam_model = AutoModelForCausalLM.from_pretrained("ALLaM-AI/ALLaM-7B-Instruct-preview")
tokenizer = AutoTokenizer.from_pretrained("ALLaM-AI/ALLaM-7B-Instruct-preview")

messages = [
    {"role": "user", "content": "كيف أجهز كوب شاهي؟"},  # "How do I prepare a cup of tea?"
]

# Render the chat template to a string, then tokenize it for generation.
inputs = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(inputs, return_tensors='pt', return_token_type_ids=False)
inputs = {k: v.to('cuda') for k, v in inputs.items()}
allam_model = allam_model.to('cuda')

response = allam_model.generate(**inputs, max_new_tokens=4096, do_sample=True, top_k=50, top_p=0.95, temperature=0.6)
print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])
```

## Ethical Considerations and Limitations

ALLaM is a generative model that comes with inherent uncertainties. Trials cannot encompass every possible use case, so ALLaM's responses cannot be predicted in every context, and outputs may occasionally be incorrect or biased. Developers must conduct thorough safety evaluations and make model-specific adjustments to ensure the model is suitable for the intended purposes.

*The output generated by this model is not considered a statement of NCAI, SDAIA, or any other organization.*

## Evaluation

### Automatic Benchmarks

#### Arabic Benchmarks

**Massive Multitask Language Understanding** (MMLU): A collection of multiple-choice evaluation questions sourced from various academic levels (elementary to college), typically covering the humanities, STEM, and the social sciences. It was originally an English dataset, but Arabic variants were later developed:

<!-- - [Original English MMLU (MMLU-en)](https://github.com/hendrycks/test): A collection of 14,079 original English questions spanning 57 domains. -->
- [Arabic MMLU](https://huggingface.co/datasets/MBZUAI/ArabicMMLU): A collection of 14,575 original Arabic questions spanning 40 domains, published by MBZUAI.
- [OpenAI MMLU-ar](https://huggingface.co/datasets/openai/MMMLU): A dataset of 14,042 questions translated from the original MMLU benchmark, published by OpenAI.

**Exams Arabic** ([Exams (Ar)](https://github.com/FreedomIntelligence/Arabic-eval/blob/main/LLM/benchmark_eval/benchmarks/EXAMS_Arabic/exam_test.jsonl)): A multiple-choice question dataset with 537 samples covering several domains, e.g., Islamic studies, science, humanities, and physics.

**Arabic Cultural Value Alignment** ([ACVA](https://huggingface.co/datasets/FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment)): A dataset of 8,710 true/false questions from 58 different areas, generated by `gpt-3.5-turbo`.

**Education and Training Evaluation Commission** (ETEC): A dataset of Arabic-language multiple-choice questions compiled by the ALLaM team in collaboration with [Saudi ETEC](https://acpd.etec.gov.sa/Home/index?csrt=5175167507218838843). It spans educational levels from elementary through post-college, with a total of 1,887 test samples.

**IEN**: A dataset curated from the Ministry of Education's (MOE) [IEN platform](https://www.ientv.edu.sa/ar) and organized by grade, topic, and difficulty level. It comprehensively covers the Saudi curriculum from 1st grade through high school, with 9,990 multiple-choice questions and 5,823 true/false questions.

**GAT**: The General Aptitude Test (GAT) dataset consists of approximately 16,000 Arabic multiple-choice questions representing the various sections of [the Qiyas General Aptitude Test](https://www.etec.gov.sa/en/service/Generalabilitytest/servicegoal): algebra, reading comprehension, analogies, arithmetic, associations, comparisons, completions, contextual understanding, and geometry.

**AraPro**: A curated collection of 5,001 multiple-choice questions (MCQs) authored by our domain experts. The dataset spans subjects including mathematics, science, and other relevant fields, providing a diverse question set for evaluation purposes.

**AraMath**: AraMath consists of 605 MCQs derived from [ArMATH](https://github.com/reem-codes/ArMATH), a collection of mathematical word problems that were transformed into MCQs internally.

**Ar-IFEval**: An Arabic instruction-following (IF) evaluation dataset designed to automatically assess language models' compliance with specified instructions through verifiable methods. The dataset consists of 535 instances, each containing two to four verifiable instructions that can be validated with deterministic programs, as in the sketch below.
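
To make "verifiable instructions" concrete, checks of this kind can be implemented as deterministic functions over the response text. A toy illustration in the spirit of IFEval-style verification (the function names and instructions below are hypothetical, not the actual Ar-IFEval code):

```python
# Toy IFEval-style verifiers; hypothetical examples, not the Ar-IFEval implementation.
def check_min_words(response: str, n: int) -> bool:
    """Pass if the response contains at least n words."""
    return len(response.split()) >= n

def check_contains_keyword(response: str, keyword: str) -> bool:
    """Pass if the response mentions a required keyword."""
    return keyword in response

instructions = [
    lambda r: check_min_words(r, 50),
    lambda r: check_contains_keyword(r, "الرياض"),  # e.g., "mention Riyadh"
]
response = "..."  # model output for one prompt
inst_level = [check(response) for check in instructions]  # per-instruction results
prompt_level = all(inst_level)  # strict prompt-level pass requires all checks to pass
```

This mirrors the two granularities reported below: instruction-strict accuracy averages over individual checks, while prompt-strict accuracy requires every instruction in a prompt to be satisfied.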

All models were evaluated using our proprietary evaluation pipeline and the [LM Evaluation Harness framework](https://github.com/EleutherAI/lm-evaluation-harness) to ensure fair comparisons. For API-based models, we used exact-match evaluation of the generated outputs.

The evaluation scores of ALLaM can be found in JSON format [here](https://huggingface.co/ALLaM-AI/ALLaM-7B-Instruct-preview/tree/main/evaluations).
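
For open-weight models, scores like those stored in this repository's `evaluations/` folder can be reproduced with the harness. A minimal sketch using the harness's Python API (the task name and arguments mirror the evaluation JSONs above; depending on your harness version, tasks such as `acva` or `araMath_v3` may require the custom task files referenced in those JSONs):

```python
# Minimal sketch of running one benchmark with the LM Evaluation Harness.
# Assumes lm-evaluation-harness >= 0.4; task availability depends on your setup.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=ALLaM-AI/ALLaM-7B-Instruct-preview,trust_remote_code=True",
    tasks=["acva"],
    num_fewshot=5,
    batch_size="auto",
)
print(results["results"]["acva"])  # e.g. {'acc,none': ..., 'acc_norm,none': ...}
```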

| Model | AVG | ETEC <br>0 shot | IEN-MCQ <br>0 shot | IEN-TF <br>0 shot | AraPro <br>0 shot | AraMath <br>5 shot | Ar-IFEval <br>(prompt strict) <br>0 shot | Ar-IFEval <br>(inst strict) <br>0 shot | ExamsAR <br>5 shot | ACVA <br>5 shot | Arabic MMLU <br>0 shot | OpenAI MMLU <br>0 shot | GAT <br>0 shot |
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
| ALLaM-7B-Instruct-preview | 64.42 | 66.67 | **91.77** | 82.95 | 69.71 | 66.78 | 31.34 | 67.65 | 51.58 | 76.33 | 67.78 | 55.91 | 44.53 |
| AceGPT-v2-8B-Chat | 52.67 | 56.81 | 77.01 | 75.91 | 63.51 | 41.49 | 10.26 | 39.25 | 51.96 | 72.69 | 57.02 | 49.99 | 36.15 |
| AceGPT-v2-32B-Chat | 62.23 | 64.81 | 81.6 | 80.35 | 67.19 | 64.46 | 25.75 | 63.41 | 55.31 | 71.57 | 68.3 | 60.8 | 43.21 |
| jais-family-6p7b-chat | 46.31 | 45.47 | 46.22 | 63.92 | 54.31 | 25.29 | 13.99 | 52.97 | 46.93 | 73.8 | 56.15 | 44.96 | 31.71 |
| jais-family-13b-chat | 49.14 | 48.65 | 62.95 | 68.68 | 57.53 | 26.61 | 17.16 | 54.27 | 45.07 | 71.18 | 58.14 | 47.73 | 31.72 |
| jais-family-30b-16k-chat | 52.54 | 53.31 | 74.88 | 68.76 | 62.79 | 41.49 | 16.6 | 54.95 | 49.72 | 60.08 | 62.04 | 50.98 | 34.85 |
| jais-family-30b-8k-chat | 53.19 | 53.52 | 72.76 | 70.65 | 61.27 | 33.39 | 16.79 | 54.68 | 50.28 | 74.47 | 63.11 | 50.9 | 36.44 |
| jais-adapted-7b-chat | 45.19 | 40.49 | 57.38 | 67.18 | 50.59 | 28.43 | 14.93 | 54.27 | 40.6 | 70.44 | 49.75 | 38.54 | 29.68 |
| jais-adapted-13b-chat | 51.86 | 48.12 | 69.65 | 71.85 | 59.07 | 37.02 | 23.32 | 60.61 | 48.23 | 67.78 | 56.42 | 46.83 | 33.4 |
| jais-adapted-70b-chat | 58.32 | 56.81 | 74.51 | 76.47 | 64.59 | 45.62 | 27.05 | 65.05 | 54.75 | 73.33 | 65.74 | 56.82 | 39.15 |
| Qwen2.5-7B-Instruct | 60.55 | 64.12 | 66.38 | 78.46 | 64.63 | 71.74 | 28.17 | 65.19 | 50.65 | 78.17 | 61.54 | 56.1 | 41.42 |
| Qwen2.5-14B-Instruct | 71.26 | 72.18 | 80.51 | 77.64 | 69.11 | 82.81 | 68.66 | 86.76 | 57.54 | 75.04 | 69.36 | 63.8 | 51.7 |
| Qwen2.5-72B-Instruct | **76.91** | **78.7** | 86.88 | **86.62** | **74.69** | **92.89** | 67.72 | 87.51 | 60.71 | **79.92** | **74.1** | **73.59** | **59.54** |
| Mistral-7B-Instruct-v0.3 | 43.05 | 35.67 | 53.59 | 63.4 | 43.85 | 27.11 | 30.41 | 64.03 | 34.08 | 60.25 | 45.27 | 32.3 | 26.65 |
| Mistral-Nemo-Instruct-2407 | 53.79 | 49.28 | 68.43 | 71.78 | 57.61 | 40.0 | 35.82 | 70.58 | 47.49 | 76.92 | 55.97 | 46.15 | 25.44 |
| Mistral-Small-Instruct-2409 | 51.11 | 40.96 | 60.64 | 63.66 | 47.73 | 44.46 | 51.12 | 78.16 | 38.73 | 68.93 | 50.43 | 39.63 | 28.82 |
| Falcon3-7B-Instruct | 41.3 | 37.52 | 52.65 | 57.63 | 41.47 | 56.53 | 8.58 | 47.92 | 31.84 | 58.98 | 42.08 | 32.36 | 27.99 |
| Meta-Llama-3.1-8B-Instruct | 54.08 | 45.68 | 59.23 | 71.7 | 52.51 | 34.38 | 51.87 | 79.11 | 52.51 | 69.93 | 56.43 | 44.67 | 30.9 |
| Llama-3.3-70B-Instruct | 71.43 | 68.84 | 79.6 | 78.81 | 70.49 | 70.91 | **70.9** | **88.6** | **65.74** | 76.93 | 72.01 | 70.25 | 44.12 |

Closed-model evaluations:

| Model | ETEC <br>0 shot | IEN-MCQ <br>0 shot | IEN-TF <br>0 shot | AraPro <br>0 shot | AraMath <br>5 shot | Ar-IFEval <br>(prompt strict) <br>0 shot | Ar-IFEval <br>(inst strict) <br>0 shot | ExamsAR <br>5 shot | ACVA <br>5 shot | Arabic MMLU <br>0 shot | OpenAI MMLU <br>0 shot | GAT <br>0 shot |
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
| AzureML GPT-4o (gpt-4o-900ptu) | 79.39 | **92.03** | 88.97 | 80.86 | 83.47 | 70.9 | 88.12 | 61.82 | 72.51 | 79.02 | **76.5** | 62.65 |
| Claude 3.5 Sonnet (claude-3-5-sonnet-20241022) | **85.9** | 86.17 | **89.42** | **81.46** | 79.83 | 53.73 | 80.14 | **62.38** | **80.42** | 69.5 | 66.4 | **68.89** |
| Gemini 1.5 Pro (gemini-1.5-pro) | 83.31 | 88.28 | 85.44 | 76.22 | **94.88** | **74.81** | **90.17** | 58.1 | 75.17 | **82.0** | 64.8 | 59.14 |

#### English Benchmarks

| Model | Avg | AGIEval 0 shot | ARC (challenge) 0 shot | GPQA (main) 0 shot | Hendrycks <br>Ethics 0 shot | Winogrande 0 shot | HellaSwag 0 shot | TriviaQA 5 shot | MMLU Pro <br>5 shot | Minerva Math <br>4 shot | MMLU 0 shot | TruthfulQA <br>(mc2) 0 shot | IFEval <br>(prompt strict) <br>0 shot | IFEval <br>(inst strict) <br>0 shot | GSM8K 5 shot |
|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
| ALLaM-7B-Instruct-preview | 46.85 | 41.99 | 51.28 | 22.77 | 73.17 | 70.48 | 76.26 | 16.07 | 30.4 | 17.3 | 59.6 | 46.67 | 38.08 | 50.0 | 61.79 |
| AceGPT-v2-8B-Chat | 49.51 | 37.17 | 53.5 | 25.67 | 68.14 | 73.72 | 79.21 | 67.65 | 37.38 | 17.58 | 64.62 | 55.2 | 23.48 | 32.97 | 56.86 |
| AceGPT-v2-32B-Chat | 57.14 | 56.01 | 53.92 | 32.81 | 66.23 | 79.16 | 83.29 | 69.45 | 45.89 | 32.8 | 74.03 | 59.18 | 27.54 | 40.89 | 78.7 |
| jais-family-6p7b-chat | 38.33 | 30.56 | 44.62 | 23.21 | 65.7 | 62.43 | 72.05 | 29.74 | 23.3 | 2.56 | 49.62 | 40.99 | 14.05 | 23.5 | 54.36 |
| jais-family-13b-chat | 42.62 | 30.31 | 47.87 | 25.89 | 65.91 | 65.04 | 75.0 | 35.82 | 24.4 | 19.1 | 51.91 | 40.57 | 19.41 | 30.82 | 64.59 |
| jais-family-30b-16k-chat | 45.15 | 31.85 | 48.46 | 23.88 | 69.44 | 68.19 | 76.21 | 43.99 | 29.11 | 22.3 | 58.5 | 44.78 | 18.3 | 29.14 | 67.93 |
| jais-family-30b-8k-chat | 47.59 | 36.65 | 48.38 | 21.88 | 69.28 | 70.32 | 78.55 | 46.67 | 28.7 | 26.44 | 57.46 | 49.49 | 22.92 | 37.05 | 72.48 |
| jais-adapted-7b-chat | 44.91 | 32.9 | 52.65 | 23.88 | 55.32 | 71.74 | 79.39 | 63.89 | 24.38 | 15.34 | 52.36 | 41.12 | 22.0 | 35.73 | 58.07 |
| jais-adapted-13b-chat | 47.7 | 36.49 | 54.18 | 26.34 | 65.73 | 69.77 | 80.86 | 58.48 | 26.29 | 21.34 | 55.66 | 42.27 | 24.95 | 36.57 | 68.84 |
| jais-adapted-70b-chat | 53.49 | 39.96 | 59.56 | 20.98 | 70.8 | 77.27 | 84.06 | 68.64 | 37.25 | 27.72 | 65.23 | 44.49 | 31.61 | 44.0 | 77.26 |
| Qwen2.5-7B-Instruct | 54.68 | 59.2 | 51.28 | 26.56 | 73.76 | 69.38 | 79.55 | 50.59 | 44.92 | 12.04 | 70.56 | 58.93 | 57.3 | 68.23 | 43.29 |
| Qwen2.5-14B-Instruct | 62.37 | 66.32 | 62.12 | 25.89 | 76.19 | 75.77 | 84.36 | 59.47 | 52.44 | 23.04 | 78.93 | 69.01 | 52.13 | 64.03 | 83.47 |
| Qwen2.5-72B-Instruct | **70.06** | **71.09** | **63.48** | 25.67 | 78.33 | 76.24 | **87.41** | 70.9 | **62.77** | **54.04** | **83.44** | **69.54** | 67.65 | 77.1 | **93.25** |
| Mistral-7B-Instruct-v0.3 | 51.98 | 36.45 | 58.87 | 23.21 | 72.58 | 73.95 | 82.93 | 67.97 | 33.18 | 13.44 | 59.74 | 59.69 | 42.51 | 54.8 | 48.37 |
| Mistral-Nemo-Instruct-2407 | 54.0 | 39.65 | 59.04 | 24.33 | 67.86 | 74.66 | 82.35 | 72.77 | 44.27 | 29.62 | 65.56 | 54.88 | 30.13 | 38.97 | 71.95 |
| Mistral-Small-Instruct-2409 | 61.65 | 40.76 | 60.49 | 25.89 | 72.27 | 78.53 | 85.35 | 79.11 | 47.47 | 39.42 | 69.42 | 56.35 | 58.23 | 68.35 | 81.43 |
| Falcon3-7B-Instruct | 58.04 | 43.84 | 59.47 | **33.71** | 70.39 | 70.09 | 78.43 | 51.98 | 46.73 | 30.76 | 68.14 | 55.53 | 56.01 | 68.59 | 78.92 |
| Meta-Llama-3.1-8B-Instruct | 56.5 | 42.39 | 55.12 | 27.23 | 66.69 | 73.95 | 79.28 | 70.05 | 40.64 | 34.26 | 67.96 | 54.05 | 44.36 | 58.51 | 76.5 |
| Llama-3.3-70B-Instruct | 67.7 | 55.44 | 63.4 | 25.89 | **81.05** | **79.24** | 84.39 | **81.7** | 60.51 | 46.42 | 81.99 | 60.91 | **63.22** | **72.78** | 90.83 |

### MT-bench

**Multi-Turn Bench** (MT-Bench): A challenging multi-turn benchmark that uses GPT-4o as a judge. MT-Bench comprises 80 questions from 8 domains. Each question is presented to the model, and the responses are submitted to GPT-4o, which assigns a score to each response; the judge returns separate scores for the first and second turns. The dataset was also automatically translated to Arabic, then manually verified and culturally aligned.

| Model | AR Average | AR Turn 1 | AR Turn 2 | EN Average | EN Turn 1 | EN Turn 2 |
|---------------------------|------------|-----------|-----------|------------|-----------|-----------|
| ALLaM-7B-Instruct-preview | 5.9 | **6.93** | 4.88 | 6.5 | 7.49 | 5.15 |
| AceGPT-v1.5-13B-Chat | 4.61 | 5.28 | 3.93 | 4.86 | 5.56 | 4.17 |
| AceGPT-v2-32B-Chat | 5.43 | 6.61 | 4.26 | **6.5** | **7.41** | **5.58** |
| jais-family-13b-chat | 4.89 | 5.37 | 4.41 | 4.77 | 5.57 | 3.97 |
| jais-family-30b-16k-chat | 4.87 | 5.50 | 4.25 | 5.13 | 5.86 | 4.4 |
| jais-adapted-70b-chat | 5.86 | 6.33 | **5.38** | 5.88 | 6.41 | 5.36 |

## Citation

If you found this work helpful or used any part of this work, please include the following citation:

```bibtex
@inproceedings{bari2025allam,
  title={{ALL}aM: Large Language Models for Arabic and English},
  author={M Saiful Bari and Yazeed Alnumay and Norah A. Alzahrani and Nouf M. Alotaibi and Hisham Abdullah Alyahya and Sultan AlRashed and Faisal Abdulrahman Mirza and Shaykhah Z. Alsubaie and Hassan A. Alahmed and Ghadah Alabduljabbar and Raghad Alkhathran and Yousef Almushayqih and Raneem Alnajim and Salman Alsubaihi and Maryam Al Mansour and Saad Amin Hassan and Dr. Majed Alrubaian and Ali Alammari and Zaki Alawami and Abdulmohsen Al-Thubaity and Ahmed Abdelali and Jeril Kuriakose and Abdalghani Abujabal and Nora Al-Twairesh and Areeb Alowisheq and Haidar Khan},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=MscdsFVZrN}
}
```
config.json ADDED
@@ -0,0 +1,28 @@
{
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.006,
  "intermediate_size": 11008,
  "max_position_embeddings": 4096,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 10000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.39.3",
  "use_cache": true,
  "vocab_size": 64000,
  "internal_version": "7b-alpha-v1.27.2.25"
}
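
These are standard Llama-architecture hyperparameters. A quick way to inspect them, sketched with the standard `transformers` API:

```python
# Load and inspect the model config shown above.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("ALLaM-AI/ALLaM-7B-Instruct-preview")
print(cfg.model_type, cfg.hidden_size, cfg.num_hidden_layers)  # llama 4096 32
```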
evaluations/ar/AceGPT-v2-32B-Chat/acva_5_shot.json ADDED
@@ -0,0 +1,125 @@
1
+ {
2
+ "results": {
3
+ "acva": {
4
+ "alias": "acva",
5
+ "acc,none": 0.7274397244546499,
6
+ "acc_stderr,none": 0.004771397968508457,
7
+ "acc_norm,none": 0.7157290470723306,
8
+ "acc_norm_stderr,none": 0.004833440968499389
9
+ }
10
+ },
11
+ "group_subtasks": {
12
+ "acva": []
13
+ },
14
+ "configs": {
15
+ "acva": {
16
+ "task": "acva",
17
+ "tag": [
18
+ "multiple_choice"
19
+ ],
20
+ "dataset_path": "FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment",
21
+ "dataset_kwargs": {
22
+ "trust_remote_code": true
23
+ },
24
+ "validation_split": "validation",
25
+ "test_split": "test",
26
+ "fewshot_split": "validation",
27
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _format_subject(subject):\n \n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n \n def _generate_subject(doc):\n subject = _format_subject(doc[\"id\"].split(\"-\")[0])\n\n return subject\n \n def _process_docs(doc):\n keys = [\"\u0635\u062d\",\n \"\u062e\u0637\u0623\"]\n subject = _generate_subject(doc)\n gold = keys.index(doc['answer'])\n out_doc = {\n \"id\": doc[\"id\"],\n \"query\": \"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" + doc[\"question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\",\n \"choices\": keys,\n \"gold\": gold,\n \"subject\": subject,\n }\n \n return out_doc\n\n return dataset.map(_process_docs)\n",
28
+ "doc_to_text": "query",
29
+ "doc_to_target": "gold",
30
+ "doc_to_choice": "choices",
31
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d' \u0623\u0648 '\u062e\u0637\u0623' \u062f\u0648\u0646 \u0634\u0631\u062d",
32
+ "target_delimiter": " ",
33
+ "fewshot_delimiter": "\n\n",
34
+ "num_fewshot": 5,
35
+ "metric_list": [
36
+ {
37
+ "metric": "acc",
38
+ "aggregation": "mean",
39
+ "higher_is_better": true
40
+ },
41
+ {
42
+ "metric": "acc_norm",
43
+ "aggregation": "mean",
44
+ "higher_is_better": true
45
+ }
46
+ ],
47
+ "output_type": "multiple_choice",
48
+ "repeats": 1,
49
+ "should_decontaminate": false,
50
+ "metadata": {
51
+ "version": 1.0
52
+ }
53
+ }
54
+ },
55
+ "versions": {
56
+ "acva": 1.0
57
+ },
58
+ "n-shot": {
59
+ "acva": 5
60
+ },
61
+ "higher_is_better": {
62
+ "acva": {
63
+ "acc": true,
64
+ "acc_norm": true
65
+ }
66
+ },
67
+ "n-samples": {
68
+ "acva": {
69
+ "original": 8710,
70
+ "effective": 8710
71
+ }
72
+ },
73
+ "config": {
74
+ "model": "hf",
75
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
76
+ "model_num_parameters": 32512545792,
77
+ "model_dtype": "torch.float16",
78
+ "model_revision": "main",
79
+ "model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
80
+ "batch_size": "auto",
81
+ "batch_sizes": [
82
+ 64
83
+ ],
84
+ "device": null,
85
+ "use_cache": null,
86
+ "limit": null,
87
+ "bootstrap_iters": 100000,
88
+ "gen_kwargs": null,
89
+ "random_seed": 0,
90
+ "numpy_seed": 1234,
91
+ "torch_seed": 1234,
92
+ "fewshot_seed": 1234
93
+ },
94
+ "git_hash": "788a3672",
95
+ "date": 1737779797.3395095,
96
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
97
+ "transformers_version": "4.48.1",
98
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
99
+ "tokenizer_pad_token": [
100
+ "<|endoftext|>",
101
+ "151643"
102
+ ],
103
+ "tokenizer_eos_token": [
104
+ "<|endoftext|>",
105
+ "151643"
106
+ ],
107
+ "tokenizer_bos_token": [
108
+ null,
109
+ "None"
110
+ ],
111
+ "eot_token_id": 151643,
112
+ "max_length": 32768,
113
+ "task_hashes": {},
114
+ "model_source": "hf",
115
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
116
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
117
+ "system_instruction": null,
118
+ "system_instruction_sha": null,
119
+ "fewshot_as_multiturn": false,
120
+ "chat_template": null,
121
+ "chat_template_sha": null,
122
+ "start_time": 26647.534977248,
123
+ "end_time": 27360.084961217,
124
+ "total_evaluation_time_seconds": "712.5499839689983"
125
+ }
evaluations/ar/AceGPT-v2-32B-Chat/ar_ifeval_0_shot.json ADDED
@@ -0,0 +1,142 @@
1
+ {
2
+ "results": {
3
+ "ar_ifeval": {
4
+ "alias": "ar_ifeval",
5
+ "prompt_level_strict_acc,none": 0.2574626865671642,
6
+ "prompt_level_strict_acc_stderr,none": 0.018903377119672635,
7
+ "inst_level_strict_acc,none": 0.6341296928327645,
8
+ "inst_level_strict_acc_stderr,none": "N/A",
9
+ "prompt_level_loose_acc,none": 0.31529850746268656,
10
+ "prompt_level_loose_acc_stderr,none": 0.020087907677710036,
11
+ "inst_level_loose_acc,none": 0.6764505119453925,
12
+ "inst_level_loose_acc_stderr,none": "N/A"
13
+ }
14
+ },
15
+ "group_subtasks": {
16
+ "ar_ifeval": []
17
+ },
18
+ "configs": {
19
+ "ar_ifeval": {
20
+ "task": "ar_ifeval",
21
+ "dataset_path": "lm_eval/tasks/ar_ifeval/ar_ifeval.py",
22
+ "dataset_name": "ar_ifeval",
23
+ "dataset_kwargs": {
24
+ "trust_remote_code": true
25
+ },
26
+ "test_split": "test",
27
+ "doc_to_text": "prompt",
28
+ "doc_to_target": 0,
29
+ "process_results": "def process_results(doc, results):\n\n response = results[0]\n out_strict = process_sample(doc, response, 'strict')\n out_loose = process_sample(doc, response, 'loose')\n\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
30
+ "description": "",
31
+ "target_delimiter": " ",
32
+ "fewshot_delimiter": "\n\n",
33
+ "num_fewshot": 0,
34
+ "metric_list": [
35
+ {
36
+ "metric": "prompt_level_strict_acc",
37
+ "aggregation": "mean",
38
+ "higher_is_better": true
39
+ },
40
+ {
41
+ "metric": "inst_level_strict_acc",
42
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
43
+ "higher_is_better": true
44
+ },
45
+ {
46
+ "metric": "prompt_level_loose_acc",
47
+ "aggregation": "mean",
48
+ "higher_is_better": true
49
+ },
50
+ {
51
+ "metric": "inst_level_loose_acc",
52
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
53
+ "higher_is_better": true
54
+ }
55
+ ],
56
+ "output_type": "generate_until",
57
+ "generation_kwargs": {
58
+ "until": [],
59
+ "do_sample": false,
60
+ "temperature": 0.0,
61
+ "max_gen_toks": 1280
62
+ },
63
+ "repeats": 1,
64
+ "should_decontaminate": false,
65
+ "metadata": {
66
+ "version": 4.0
67
+ }
68
+ }
69
+ },
70
+ "versions": {
71
+ "ar_ifeval": 4.0
72
+ },
73
+ "n-shot": {
74
+ "ar_ifeval": 0
75
+ },
76
+ "higher_is_better": {
77
+ "ar_ifeval": {
78
+ "prompt_level_strict_acc": true,
79
+ "inst_level_strict_acc": true,
80
+ "prompt_level_loose_acc": true,
81
+ "inst_level_loose_acc": true
82
+ }
83
+ },
84
+ "n-samples": {
85
+ "ar_ifeval": {
86
+ "original": 536,
87
+ "effective": 536
88
+ }
89
+ },
90
+ "config": {
91
+ "model": "hf",
92
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
93
+ "model_num_parameters": 32512545792,
94
+ "model_dtype": "torch.float16",
95
+ "model_revision": "main",
96
+ "model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
97
+ "batch_size": 1,
98
+ "batch_sizes": [],
99
+ "device": null,
100
+ "use_cache": null,
101
+ "limit": null,
102
+ "bootstrap_iters": 100000,
103
+ "gen_kwargs": null,
104
+ "random_seed": 0,
105
+ "numpy_seed": 1234,
106
+ "torch_seed": 1234,
107
+ "fewshot_seed": 1234
108
+ },
109
+ "git_hash": "788a3672",
110
+ "date": 1738794647.2071357,
111
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
112
+ "transformers_version": "4.48.2",
113
+ "upper_git_hash": null,
114
+ "tokenizer_pad_token": [
115
+ "<|endoftext|>",
116
+ "151643"
117
+ ],
118
+ "tokenizer_eos_token": [
119
+ "<|endoftext|>",
120
+ "151643"
121
+ ],
122
+ "tokenizer_bos_token": [
123
+ null,
124
+ "None"
125
+ ],
126
+ "eot_token_id": 151643,
127
+ "max_length": 32768,
128
+ "task_hashes": {
129
+ "ar_ifeval": "d0b91e989c8b697090db63bf498d8e2d8dd80815a595e5f22845a8425bff22fa"
130
+ },
131
+ "model_source": "hf",
132
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
133
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
134
+ "system_instruction": null,
135
+ "system_instruction_sha": null,
136
+ "fewshot_as_multiturn": false,
137
+ "chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
138
+ "chat_template_sha": "af9c0233881b083b52ff773580215222b5440ac3d0beeeca99b76329b048f8db",
139
+ "start_time": 1753623.131321269,
140
+ "end_time": 1761093.682009075,
141
+ "total_evaluation_time_seconds": "7470.550687805982"
142
+ }
evaluations/ar/AceGPT-v2-32B-Chat/araMath_v3_5_shot.json ADDED
@@ -0,0 +1,126 @@
1
+ {
2
+ "results": {
3
+ "araMath_v3": {
4
+ "alias": "araMath_v3",
5
+ "acc,none": 0.6446280991735537,
6
+ "acc_stderr,none": 0.019475010007284948,
7
+ "acc_norm,none": 0.6446280991735537,
8
+ "acc_norm_stderr,none": 0.019475010007284948
9
+ }
10
+ },
11
+ "group_subtasks": {
12
+ "araMath_v3": []
13
+ },
14
+ "configs": {
15
+ "araMath_v3": {
16
+ "task": "araMath_v3",
17
+ "tag": [
18
+ "multiple_choice"
19
+ ],
20
+ "dataset_path": "lm_eval/tasks/araMath_v3/araMath_v3.py",
21
+ "dataset_name": "araMath_v3",
22
+ "dataset_kwargs": {
23
+ "trust_remote_code": true
24
+ },
25
+ "validation_split": "validation",
26
+ "test_split": "test",
27
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def remove_prefix(choice):\n prefixes = [\"(A)\", \"(B)\", \"(C)\", \"(D)\"]\n for prefix in prefixes:\n if choice.startswith(prefix + \" \"):\n return choice[len(prefix) + 1:] \n return choice \n\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"options\"])]\n )\n\n prompt = f\"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": keys_en.index(doc[\"label\"]),\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
28
+ "doc_to_text": "query",
29
+ "doc_to_target": "gold",
30
+ "doc_to_choice": "{{choices}}",
31
+ "description": "\u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u0645\u0646 \u0628\u064a\u0646 'A\u060c B\u060c C\u060c D' \u062f\u0648\u0646 \u0634\u0631\u062d",
32
+ "target_delimiter": " ",
33
+ "fewshot_delimiter": "\n\n",
34
+ "num_fewshot": 5,
35
+ "metric_list": [
36
+ {
37
+ "metric": "acc",
38
+ "aggregation": "mean",
39
+ "higher_is_better": true
40
+ },
41
+ {
42
+ "metric": "acc_norm",
43
+ "aggregation": "mean",
44
+ "higher_is_better": true
45
+ }
46
+ ],
47
+ "output_type": "multiple_choice",
48
+ "repeats": 1,
49
+ "should_decontaminate": true,
50
+ "doc_to_decontamination_query": "query",
51
+ "metadata": {
52
+ "version": 0.0
53
+ }
54
+ }
55
+ },
56
+ "versions": {
57
+ "araMath_v3": 0.0
58
+ },
59
+ "n-shot": {
60
+ "araMath_v3": 5
61
+ },
62
+ "higher_is_better": {
63
+ "araMath_v3": {
64
+ "acc": true,
65
+ "acc_norm": true
66
+ }
67
+ },
68
+ "n-samples": {
69
+ "araMath_v3": {
70
+ "original": 605,
71
+ "effective": 605
72
+ }
73
+ },
74
+ "config": {
75
+ "model": "hf",
76
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
77
+ "model_num_parameters": 32512545792,
78
+ "model_dtype": "torch.float16",
79
+ "model_revision": "main",
80
+ "model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
81
+ "batch_size": 1,
82
+ "batch_sizes": [],
83
+ "device": null,
84
+ "use_cache": null,
85
+ "limit": null,
86
+ "bootstrap_iters": 100000,
87
+ "gen_kwargs": null,
88
+ "random_seed": 0,
89
+ "numpy_seed": 1234,
90
+ "torch_seed": 1234,
91
+ "fewshot_seed": 1234
92
+ },
93
+ "git_hash": "788a3672",
94
+ "date": 1738805225.8162587,
95
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
96
+ "transformers_version": "4.48.2",
97
+ "upper_git_hash": null,
98
+ "tokenizer_pad_token": [
99
+ "<|endoftext|>",
100
+ "151643"
101
+ ],
102
+ "tokenizer_eos_token": [
103
+ "<|endoftext|>",
104
+ "151643"
105
+ ],
106
+ "tokenizer_bos_token": [
107
+ null,
108
+ "None"
109
+ ],
110
+ "eot_token_id": 151643,
111
+ "max_length": 32768,
112
+ "task_hashes": {
113
+ "araMath_v3": "17b2596f46d709ea107ed20bef044ca126de23a8e9bbc8ba0a9beef94fbc032d"
114
+ },
115
+ "model_source": "hf",
116
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
117
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
118
+ "system_instruction": null,
119
+ "system_instruction_sha": null,
120
+ "fewshot_as_multiturn": false,
121
+ "chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
122
+ "chat_template_sha": "af9c0233881b083b52ff773580215222b5440ac3d0beeeca99b76329b048f8db",
123
+ "start_time": 1764201.606664753,
124
+ "end_time": 1764270.091855178,
125
+ "total_evaluation_time_seconds": "68.48519042483531"
126
+ }
evaluations/ar/AceGPT-v2-32B-Chat/araPro_0_shot.json ADDED
@@ -0,0 +1,130 @@
1
+ {
2
+ "results": {
3
+ "araPro": {
4
+ "alias": "araPro",
5
+ "acc,none": 0.671865626874625,
6
+ "acc_stderr,none": 0.006640213946839424,
7
+ "acc_norm,none": 0.671865626874625,
8
+ "acc_norm_stderr,none": 0.006640213946839424
9
+ }
10
+ },
11
+ "group_subtasks": {
12
+ "araPro": []
13
+ },
14
+ "configs": {
15
+ "araPro": {
16
+ "task": "araPro",
17
+ "tag": [
18
+ "multiple_choice"
19
+ ],
20
+ "dataset_path": "lm_eval/tasks/araPro/araPro.py",
21
+ "dataset_name": "araPro",
22
+ "dataset_kwargs": {
23
+ "trust_remote_code": true
24
+ },
25
+ "validation_split": "validation",
26
+ "test_split": "test",
27
+ "fewshot_split": "validation",
28
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.replace('.', '') if '.' in choice[:2] else choice\n \n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choice_num = ['choice1', 'choice2', 'choice3', 'choice4']\n choices = \"\".join(\n [f\"{key}. {remove_prefix(doc[choice_num[index]])}\\n\" for index, key in enumerate(keys)]\n )\n\n prompt = f\"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n #keys = [\"1\", \"2\", \"3\", \"4\"]\n keys = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys), \n \"choices\": keys,\n \"gold\": doc[\"answer\"]-1,\n } \n\n return out_doc\n \n return dataset.map(_process_docs)\n",
29
+ "doc_to_text": "query",
30
+ "doc_to_target": "gold",
31
+ "doc_to_choice": "{{choices}}",
32
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d",
33
+ "target_delimiter": " ",
34
+ "fewshot_delimiter": "\n\n",
35
+ "fewshot_config": {
36
+ "sampler": "balanced_cat"
37
+ },
38
+ "num_fewshot": 0,
39
+ "metric_list": [
40
+ {
41
+ "metric": "acc",
42
+ "aggregation": "mean",
43
+ "higher_is_better": true
44
+ },
45
+ {
46
+ "metric": "acc_norm",
47
+ "aggregation": "mean",
48
+ "higher_is_better": true
49
+ }
50
+ ],
51
+ "output_type": "multiple_choice",
52
+ "repeats": 1,
53
+ "should_decontaminate": true,
54
+ "doc_to_decontamination_query": "Question",
55
+ "metadata": {
56
+ "version": 2.0
57
+ }
58
+ }
59
+ },
60
+ "versions": {
61
+ "araPro": 2.0
62
+ },
63
+ "n-shot": {
64
+ "araPro": 0
65
+ },
66
+ "higher_is_better": {
67
+ "araPro": {
68
+ "acc": true,
69
+ "acc_norm": true
70
+ }
71
+ },
72
+ "n-samples": {
73
+ "araPro": {
74
+ "original": 5001,
75
+ "effective": 5001
76
+ }
77
+ },
78
+ "config": {
79
+ "model": "hf",
80
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
81
+ "model_num_parameters": 32512545792,
82
+ "model_dtype": "torch.float16",
83
+ "model_revision": "main",
84
+ "model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
85
+ "batch_size": 1,
86
+ "batch_sizes": [],
87
+ "device": null,
88
+ "use_cache": null,
89
+ "limit": null,
90
+ "bootstrap_iters": 100000,
91
+ "gen_kwargs": null,
92
+ "random_seed": 0,
93
+ "numpy_seed": 1234,
94
+ "torch_seed": 1234,
95
+ "fewshot_seed": 1234
96
+ },
97
+ "git_hash": "788a3672",
98
+ "date": 1738802810.5474553,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151643,
+ "max_length": 32768,
+ "task_hashes": {
+ "araPro": "2f706897ad0129e016cc8d6907f8bb4359c32403fc2d1b0a4e78717f424793da"
+ },
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
+ "chat_template_sha": "af9c0233881b083b52ff773580215222b5440ac3d0beeeca99b76329b048f8db",
+ "start_time": 1761786.552693387,
+ "end_time": 1761894.218775138,
+ "total_evaluation_time_seconds": "107.66608175099827"
+ }
evaluations/ar/AceGPT-v2-32B-Chat/arabicmmlu_0_shot.json ADDED
The diff for this file is too large to render. See raw diff
 
evaluations/ar/AceGPT-v2-32B-Chat/etec_v2_0_shot.json ADDED
@@ -0,0 +1,126 @@
+ {
+ "results": {
+ "etec_v2": {
+ "alias": "etec_v2",
+ "acc,none": 0.6481187069422364,
+ "acc_stderr,none": 0.010996501146375258,
+ "acc_norm,none": 0.6481187069422364,
+ "acc_norm_stderr,none": 0.010996501146375258
+ }
+ },
+ "group_subtasks": {
+ "etec_v2": []
+ },
+ "configs": {
+ "etec_v2": {
+ "task": "etec_v2",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/etec_v2/etec.py",
+ "dataset_name": "etec_v2",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n print(doc[\"label\"])\n keys_ar = [\"\u0623\", \"\u0628\", \"\u062c\", \"\u062f\"]\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": int(doc[\"label\"])-1,\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d\n ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "etec_v2": 0.0
+ },
+ "n-shot": {
+ "etec_v2": 0
+ },
+ "higher_is_better": {
+ "etec_v2": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "etec_v2": {
+ "original": 1887,
+ "effective": 1887
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 32512545792,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738805984.3189015,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
96
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151643,
+ "max_length": 32768,
+ "task_hashes": {
+ "etec_v2": "697b8bfc7d6b0f85165e5cca6953182b09b7a2b0d79fa31e74cc3897f432de41"
+ },
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
+ "chat_template_sha": "af9c0233881b083b52ff773580215222b5440ac3d0beeeca99b76329b048f8db",
+ "start_time": 1764960.166542801,
+ "end_time": 1765035.801506021,
+ "total_evaluation_time_seconds": "75.63496321998537"
+ }
evaluations/ar/AceGPT-v2-32B-Chat/exams_ar_5_shot.json ADDED
@@ -0,0 +1,127 @@
+ {
+ "results": {
+ "exams_ar": {
+ "alias": "exams_ar",
+ "acc,none": 0.553072625698324,
+ "acc_stderr,none": 0.021474702941383872,
+ "acc_norm,none": 0.553072625698324,
+ "acc_norm_stderr,none": 0.021474702941383872
+ }
+ },
+ "group_subtasks": {
+ "exams_ar": []
+ },
+ "configs": {
+ "exams_ar": {
+ "task": "exams_ar",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/exams_ar",
+ "dataset_name": "exams_ar",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n\n def _process_docs(doc):\n def format_example(doc, keys):\n \"\"\"\n <prompt>\n \u0633\u0624\u0627\u0644:\n A. <choice1>\n B. <choice2>\n C. <choice3>\n D. <choice4>\n \u0627\u062c\u0627\u0628\u0629:\n \"\"\"\n \n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n def _format_subject(subject):\n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n\n keys = [\"A\", \"B\", \"C\", \"D\"]\n \n subject = doc['id'].split(\"-\")[0]\n description = f\"\ufed2\ufef4\ufee3\ufe8d \ufef2\ufee0\ufef3 \ufe84\ufeb4\ufe8c\ufedf\ufe93 \ufe8d\ufefc\ufea8\ufe98\ufef3\ufe8d\ufead \ufee2\ufee7 \ufee2\ufe98\ufecb\ufea9\ufea9 (\ufee2\ufecb \ufe8d\ufefa\ufe9f\ufe8e\ufe91\ufe8e\ufe97) \ufea1\ufeee\ufedf {_format_subject(subject)} \\n\" #\ufee2\ufee7 \ufed2\ufec0\ufee0\ufedb \ufe8e\ufea8\ufe97\ufead \ufe88\ufe9f\ufe8e\ufe91\ufe93 \ufeed\ufe8e\ufea3\ufea9\ufe93 \ufee2\ufee7 \ufe90\ufef4\ufee7 'A\u060c B\u060c C\u060c D' \ufea9\ufeee\ufee7 \ufeb5\ufeae\ufea3\\n\"\n\n out_doc = {\n \"idx\": doc[\"idx\"],\n \"id\": doc[\"id\"],\n 'dsecription': description,\n \"query\": format_example(doc, keys), # \"Question: \" + doc[\"question\"]['stem'] + \"\\nAnswer:\",\n \"choices\": keys,\n \"gold\": [\"A\", \"B\", \"C\", \"D\"].index(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "description",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 1.0
+ }
+ }
+ },
+ "versions": {
+ "exams_ar": 1.0
+ },
+ "n-shot": {
+ "exams_ar": 5
+ },
+ "higher_is_better": {
+ "exams_ar": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "exams_ar": {
+ "original": 537,
+ "effective": 537
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 32512545792,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 16
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1737780545.20475,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.1",
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151643,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 27395.295045238,
+ "end_time": 27506.949709817,
+ "total_evaluation_time_seconds": "111.65466457900038"
+ }
evaluations/ar/AceGPT-v2-32B-Chat/gat_0_shot.json ADDED
@@ -0,0 +1,543 @@
+ {
+ "results": {
+ "gat": {
+ "acc,none": 0.4321459927254484,
+ "acc_stderr,none": 0.0038347299693873033,
+ "alias": "gat"
+ },
+ "gat_algebra": {
+ "alias": " - gat_algebra",
+ "acc,none": 0.3992578849721707,
+ "acc_stderr,none": 0.009435653731651068
+ },
+ "gat_analogy": {
+ "alias": " - gat_analogy",
+ "acc,none": 0.2867030965391621,
+ "acc_stderr,none": 0.00863295163043938
+ },
+ "gat_arithmetic": {
+ "alias": " - gat_arithmetic",
+ "acc,none": 0.3894000736105999,
+ "acc_stderr,none": 0.009356458715331561
+ },
+ "gat_association": {
+ "alias": " - gat_association",
+ "acc,none": 0.4143540669856459,
+ "acc_stderr,none": 0.01524590184737997
+ },
+ "gat_comparisons": {
+ "alias": " - gat_comparisons",
+ "acc,none": 0.34672131147540985,
+ "acc_stderr,none": 0.013631312083187472
+ },
+ "gat_completion": {
+ "alias": " - gat_completion",
+ "acc,none": 0.5793388429752067,
+ "acc_stderr,none": 0.014197745251253151
+ },
+ "gat_contextual": {
+ "alias": " - gat_contextual",
+ "acc,none": 0.522239263803681,
+ "acc_stderr,none": 0.013837823280527494
+ },
+ "gat_geometry": {
+ "alias": " - gat_geometry",
+ "acc,none": 0.5013698630136987,
+ "acc_stderr,none": 0.026207022561245137
+ },
+ "gat_reading": {
+ "alias": " - gat_reading",
+ "acc,none": 0.585633270321361,
+ "acc_stderr,none": 0.009580200187530542
+ }
+ },
+ "groups": {
+ "gat": {
+ "acc,none": 0.4321459927254484,
+ "acc_stderr,none": 0.0038347299693873033,
+ "alias": "gat"
+ }
+ },
+ "group_subtasks": {
+ "gat": [
+ "gat_analogy",
+ "gat_association",
+ "gat_completion",
+ "gat_reading",
+ "gat_algebra",
+ "gat_arithmetic",
+ "gat_comparisons",
+ "gat_contextual",
+ "gat_geometry"
+ ]
+ },
+ "configs": {
+ "gat_algebra": {
+ "task": "gat_algebra",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_analogy": {
+ "task": "gat_analogy",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "analogy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_arithmetic": {
+ "task": "gat_arithmetic",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "arithmetic",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_association": {
+ "task": "gat_association",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "association",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_comparisons": {
+ "task": "gat_comparisons",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "comparisons",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_completion": {
+ "task": "gat_completion",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "completion",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_contextual": {
+ "task": "gat_contextual",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "contextual",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "\u0627\u0648\u062c\u062f \u0627\u0644\u062e\u0637\u0623 \u0627\u0644\u0633\u064a\u0627\u0642\u064a \u0641\u064a \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0627\u0644\u062a\u0627\u0644\u064a\u0629 \u0645\u0646 \u0628\u064a\u0646 \u0627\u0644\u062e\u064a\u0627\u0631\u0627\u062a:",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_geometry": {
+ "task": "gat_geometry",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "geometry",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_reading": {
+ "task": "gat_reading",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "reading",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "gat": 0,
+ "gat_algebra": 0.0,
+ "gat_analogy": 0.0,
+ "gat_arithmetic": 0.0,
+ "gat_association": 0.0,
+ "gat_comparisons": 0.0,
+ "gat_completion": 0.0,
+ "gat_contextual": 0.0,
+ "gat_geometry": 0.0,
+ "gat_reading": 0.0
+ },
+ "n-shot": {
+ "gat_algebra": 0,
+ "gat_analogy": 0,
+ "gat_arithmetic": 0,
+ "gat_association": 0,
+ "gat_comparisons": 0,
+ "gat_completion": 0,
+ "gat_contextual": 0,
+ "gat_geometry": 0,
+ "gat_reading": 0
+ },
+ "higher_is_better": {
+ "gat": {
+ "acc": true
+ },
+ "gat_algebra": {
+ "acc": true
+ },
+ "gat_analogy": {
+ "acc": true
+ },
+ "gat_arithmetic": {
+ "acc": true
+ },
+ "gat_association": {
+ "acc": true
+ },
+ "gat_comparisons": {
+ "acc": true
+ },
+ "gat_completion": {
+ "acc": true
+ },
+ "gat_contextual": {
+ "acc": true
+ },
+ "gat_geometry": {
+ "acc": true
+ },
+ "gat_reading": {
+ "acc": true
+ }
+ },
+ "n-samples": {
+ "gat_analogy": {
+ "original": 2745,
+ "effective": 2745
+ },
+ "gat_association": {
+ "original": 1045,
+ "effective": 1045
+ },
+ "gat_completion": {
+ "original": 1210,
+ "effective": 1210
+ },
+ "gat_reading": {
+ "original": 2645,
+ "effective": 2645
+ },
+ "gat_algebra": {
+ "original": 2695,
+ "effective": 2695
+ },
+ "gat_arithmetic": {
+ "original": 2717,
+ "effective": 2717
+ },
+ "gat_comparisons": {
+ "original": 1220,
+ "effective": 1220
+ },
+ "gat_contextual": {
+ "original": 1304,
+ "effective": 1304
+ },
+ "gat_geometry": {
+ "original": 365,
+ "effective": 365
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 32512545792,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "ef4b2026",
+ "date": 1733932681.9722512,
+ "pretty_env_info": "PyTorch version: 2.1.0a0+29c30b1\nIs debug build: False\nCUDA used to build PyTorch: 12.2\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.22.2\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.1.0a0+29c30b1\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.16.0a0\n[pip3] triton==2.0.0.dev20221202\n[conda] Could not collect",
+ "transformers_version": "4.47.0",
+ "upper_git_hash": "27ba526c4b16ee30604687f8bfd4c19680101dd1",
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151643,
+ "max_length": 32768,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 2367.995520754,
+ "end_time": 5482.980996963,
+ "total_evaluation_time_seconds": "3114.9854762089994"
+ }
evaluations/ar/AceGPT-v2-32B-Chat/moe_ien_mcq_0_shot.json ADDED
@@ -0,0 +1,127 @@
+ {
+ "results": {
+ "moe_ien_mcq": {
+ "alias": "moe_ien_mcq",
+ "acc,none": 0.816016016016016,
+ "acc_stderr,none": 0.0038768441643790346,
+ "acc_norm,none": 0.816016016016016,
+ "acc_norm_stderr,none": 0.0038768441643790346
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_mcq": []
+ },
+ "configs": {
+ "moe_ien_mcq": {
+ "task": "moe_ien_mcq",
+ "dataset_path": "lm_eval/tasks/moe_ien_mcq/ien_moe_mcq.py",
+ "dataset_name": "moe_ien_mcq",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.split(\". \", 1)[1] if \". \" in choice else choice\n\n def format_example(doc, keys):\n question = doc[\"Question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"Choices\"])]\n \n )\n prompt = f\"\\n\\n\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys = [\"A\", \"B\", \"C\", \"D\", \"E\", \"F\"][0:len(doc[\"Choices\"])]\n out_doc = {\n \"Query\": format_example(doc, keys), \n \"Choices\": keys,\n \"gold\": int(doc[\"Answer\"])-1, ## \n } \n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "Query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{Choices}}",
+ "description": "\u0641\u064a\u0645\u0627\u202f\u064a\u0644\u064a\u202f\u0623\u0633\u0626\u0644\u0629\u202f\u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631\u202f\u0645\u0646\u202f\u0645\u062a\u0639\u062f\u062f\u202f(\u0645\u0639\u202f\u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a)\u202f\u0641\u064a\u202f{{Subject}}",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_mcq": 0.0
+ },
+ "n-shot": {
+ "moe_ien_mcq": 0
+ },
+ "higher_is_better": {
+ "moe_ien_mcq": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_mcq": {
+ "original": 9990,
+ "effective": 9990
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 32512545792,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738807582.4110897,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
97
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151643,
+ "max_length": 32768,
+ "task_hashes": {
+ "moe_ien_mcq": "e5422ff2f277b9bfffeb1b5ad185b714804b5a3d276dfff99a29eb88d9a41683"
+ },
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
+ "chat_template_sha": "af9c0233881b083b52ff773580215222b5440ac3d0beeeca99b76329b048f8db",
+ "start_time": 1766558.431540363,
+ "end_time": 1766704.504224634,
+ "total_evaluation_time_seconds": "146.07268427102827"
+ }
evaluations/ar/AceGPT-v2-32B-Chat/moe_ien_tf_0_shot.json ADDED
@@ -0,0 +1,129 @@
+ {
+ "results": {
+ "moe_ien_tf": {
+ "alias": "moe_ien_tf",
+ "acc,none": 0.8035376953460416,
+ "acc_stderr,none": 0.005207228603848848,
+ "acc_norm,none": 0.8035376953460416,
+ "acc_norm_stderr,none": 0.005207228603848848
+ }
+ },
+ "group_subtasks": {
+ "moe_ien_tf": []
+ },
+ "configs": {
+ "moe_ien_tf": {
+ "task": "moe_ien_tf",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/moe_ien_tf/moe_ien_tf.py",
+ "dataset_name": "moe_ien_tf",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n keys=[\"\u0635\u062d\u064a\u062d\u0629\",\n \"\u062e\u0627\u0637\u0626\u0629\"\n ]\n #keys =[\"\u0635\u0648\u0627\u0628\",\n # \"\u062e\u0637\u0623\"]\n target_key = int(doc[\"Answer\"])-1\n\n out_doc = {\n \"query\": \"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" +doc[\"Question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\", \n \"choices\": keys,\n \"gold\": target_key,\n }\n return out_doc\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{Subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d\u064a\u062d\u0629' \u0623\u0648 '\u062e\u0627\u0637\u0626\u0629' \u062f\u0648\u0646 \u0634\u0631\u062d ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "moe_ien_tf": 2.0
+ },
+ "n-shot": {
+ "moe_ien_tf": 0
+ },
+ "higher_is_better": {
+ "moe_ien_tf": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "moe_ien_tf": {
+ "original": 5823,
+ "effective": 5823
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
+ "model_num_parameters": 32512545792,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "788a3672",
+ "date": 1738809377.2163908,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
99
+ "transformers_version": "4.48.2",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_eos_token": [
+ "<|endoftext|>",
+ "151643"
+ ],
+ "tokenizer_bos_token": [
+ null,
+ "None"
+ ],
+ "eot_token_id": 151643,
+ "max_length": 32768,
+ "task_hashes": {
+ "moe_ien_tf": "116cb28cd11c72b01c3d52d75d3918c312d0a4f569bfdb8b2219398ec576a3f4"
+ },
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
+ "chat_template_sha": "af9c0233881b083b52ff773580215222b5440ac3d0beeeca99b76329b048f8db",
+ "start_time": 1768353.06839988,
+ "end_time": 1768502.097875321,
+ "total_evaluation_time_seconds": "149.0294754409697"
+ }
evaluations/ar/AceGPT-v2-32B-Chat/openaimmlu_0_shot.json ADDED
The diff for this file is too large to render. See raw diff
 
evaluations/ar/AceGPT-v2-8B-Chat/acva_5_shot.json ADDED
@@ -0,0 +1,123 @@
+ {
+ "results": {
+ "acva": {
+ "alias": "acva",
+ "acc,none": 0.7415614236509759,
+ "acc_stderr,none": 0.004691028694524559,
+ "acc_norm,none": 0.7268656716417911,
+ "acc_norm_stderr,none": 0.004774534958083965
+ }
+ },
+ "group_subtasks": {
+ "acva": []
+ },
+ "configs": {
+ "acva": {
+ "task": "acva",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _format_subject(subject):\n \n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n \n def _generate_subject(doc):\n subject = _format_subject(doc[\"id\"].split(\"-\")[0])\n\n return subject\n \n def _process_docs(doc):\n keys = [\"\u0635\u062d\",\n \"\u062e\u0637\u0623\"]\n subject = _generate_subject(doc)\n gold = keys.index(doc['answer'])\n out_doc = {\n \"id\": doc[\"id\"],\n \"query\": \"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" + doc[\"question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\",\n \"choices\": keys,\n \"gold\": gold,\n \"subject\": subject,\n }\n \n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d' \u0623\u0648 '\u062e\u0637\u0623' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "acva": 0.0
+ },
+ "n-shot": {
+ "acva": 5
+ },
+ "higher_is_better": {
+ "acva": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "acva": {
+ "original": 8710,
+ "effective": 8710
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 8030261248,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "562d0998c03c02d315e346f81650a43955711901",
+ "batch_size": "auto",
+ "batch_sizes": [
+ 64
+ ],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "5e10e017",
+ "date": 1736966813.484974,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.0",
+ "upper_git_hash": "2e5cd5395faf76fea1afc96dd0f7161a9d3aa145",
+ "tokenizer_pad_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_eos_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128001,
+ "max_length": 8192,
+ "task_hashes": {},
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 2430.929540314,
+ "end_time": 3025.204908665,
+ "total_evaluation_time_seconds": "594.275368351"
+ }
evaluations/ar/AceGPT-v2-8B-Chat/ar_ifeval_0_shot.json ADDED
@@ -0,0 +1,142 @@
+ {
+ "results": {
+ "ar_ifeval": {
+ "alias": "ar_ifeval",
+ "prompt_level_strict_acc,none": 0.10261194029850747,
+ "prompt_level_strict_acc_stderr,none": 0.01311934649092474,
+ "inst_level_strict_acc,none": 0.3924914675767918,
+ "inst_level_strict_acc_stderr,none": "N/A",
+ "prompt_level_loose_acc,none": 0.12126865671641791,
+ "prompt_level_loose_acc_stderr,none": 0.01411319854290401,
+ "inst_level_loose_acc,none": 0.42389078498293514,
+ "inst_level_loose_acc_stderr,none": "N/A"
+ }
+ },
+ "group_subtasks": {
+ "ar_ifeval": []
+ },
+ "configs": {
+ "ar_ifeval": {
+ "task": "ar_ifeval",
+ "dataset_path": "lm_eval/tasks/ar_ifeval/ar_ifeval.py",
+ "dataset_name": "ar_ifeval",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "doc_to_text": "prompt",
+ "doc_to_target": 0,
+ "process_results": "def process_results(doc, results):\n\n response = results[0]\n out_strict = process_sample(doc, response, 'strict')\n out_loose = process_sample(doc, response, 'loose')\n\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "prompt_level_strict_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_strict_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ },
+ {
+ "metric": "prompt_level_loose_acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "inst_level_loose_acc",
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "generate_until",
+ "generation_kwargs": {
+ "until": [],
+ "do_sample": false,
+ "temperature": 0.0,
+ "max_gen_toks": 1280
+ },
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 4.0
+ }
+ }
+ },
+ "versions": {
+ "ar_ifeval": 4.0
+ },
+ "n-shot": {
+ "ar_ifeval": 0
+ },
+ "higher_is_better": {
+ "ar_ifeval": {
+ "prompt_level_strict_acc": true,
+ "inst_level_strict_acc": true,
+ "prompt_level_loose_acc": true,
+ "inst_level_loose_acc": true
+ }
+ },
+ "n-samples": {
+ "ar_ifeval": {
+ "original": 536,
+ "effective": 536
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 8030261248,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "562d0998c03c02d315e346f81650a43955711901",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739784109.8369951,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_eos_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128001,
+ "max_length": 8192,
+ "task_hashes": {
+ "ar_ifeval": "9ce88f26b4b78e684512ecd933af67fe512192f41e27d2bedc62f288943db360"
+ },
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 62023.729831301,
+ "end_time": 66967.714743853,
+ "total_evaluation_time_seconds": "4943.98491255199"
+ }
evaluations/ar/AceGPT-v2-8B-Chat/araMath_v3_5_shot.json ADDED
@@ -0,0 +1,126 @@
+ {
+ "results": {
+ "araMath_v3": {
+ "alias": "araMath_v3",
+ "acc,none": 0.41487603305785126,
+ "acc_stderr,none": 0.02004770429343817,
+ "acc_norm,none": 0.41487603305785126,
+ "acc_norm_stderr,none": 0.02004770429343817
+ }
+ },
+ "group_subtasks": {
+ "araMath_v3": []
+ },
+ "configs": {
+ "araMath_v3": {
+ "task": "araMath_v3",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araMath_v3/araMath_v3.py",
+ "dataset_name": "araMath_v3",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def remove_prefix(choice):\n prefixes = [\"(A)\", \"(B)\", \"(C)\", \"(D)\"]\n for prefix in prefixes:\n if choice.startswith(prefix + \" \"):\n return choice[len(prefix) + 1:] \n return choice \n\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"options\"])]\n )\n\n prompt = f\"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": keys_en.index(doc[\"label\"]),\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u0645\u0646 \u0628\u064a\u0646 'A\u060c B\u060c C\u060c D' \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "araMath_v3": 0.0
+ },
+ "n-shot": {
+ "araMath_v3": 5
+ },
+ "higher_is_better": {
+ "araMath_v3": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araMath_v3": {
+ "original": 605,
+ "effective": 605
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 8030261248,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "562d0998c03c02d315e346f81650a43955711901",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739784015.8084505,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_eos_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128001,
+ "max_length": 8192,
+ "task_hashes": {
+ "araMath_v3": "4eebd1da6e6937fc09bb9f1871adb53192dbce96733f0f8ee76d406c2fc8cad5"
+ },
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 61929.69246185,
+ "end_time": 61980.464828513,
+ "total_evaluation_time_seconds": "50.772366663004505"
+ }
evaluations/ar/AceGPT-v2-8B-Chat/araPro_0_shot.json ADDED
@@ -0,0 +1,130 @@
+ {
+ "results": {
+ "araPro": {
+ "alias": "araPro",
+ "acc,none": 0.6350729854029195,
+ "acc_stderr,none": 0.006808161111700288,
+ "acc_norm,none": 0.6350729854029195,
+ "acc_norm_stderr,none": 0.006808161111700288
+ }
+ },
+ "group_subtasks": {
+ "araPro": []
+ },
+ "configs": {
+ "araPro": {
+ "task": "araPro",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/araPro/araPro.py",
+ "dataset_name": "araPro",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.replace('.', '') if '.' in choice[:2] else choice\n \n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choice_num = ['choice1', 'choice2', 'choice3', 'choice4']\n choices = \"\".join(\n [f\"{key}. {remove_prefix(doc[choice_num[index]])}\\n\" for index, key in enumerate(keys)]\n )\n\n prompt = f\"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n #keys = [\"1\", \"2\", \"3\", \"4\"]\n keys = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys), \n \"choices\": keys,\n \"gold\": doc[\"answer\"]-1,\n } \n\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "{{choices}}",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "fewshot_config": {
+ "sampler": "balanced_cat"
+ },
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "Question",
+ "metadata": {
+ "version": 2.0
+ }
+ }
+ },
+ "versions": {
+ "araPro": 2.0
+ },
+ "n-shot": {
+ "araPro": 0
+ },
+ "higher_is_better": {
+ "araPro": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "araPro": {
+ "original": 5001,
+ "effective": 5001
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 8030261248,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "562d0998c03c02d315e346f81650a43955711901",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739782427.4652286,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_eos_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128001,
+ "max_length": 8192,
+ "task_hashes": {
+ "araPro": "655c2f6626c4b10533bba45ff63f9d4501694dea7f65d0bb251390819154f901"
+ },
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 60341.23142254,
+ "end_time": 60939.383586887,
+ "total_evaluation_time_seconds": "598.1521643470041"
+ }
evaluations/ar/AceGPT-v2-8B-Chat/arabicmmlu_0_shot.json ADDED
The diff for this file is too large to render. See raw diff
 
evaluations/ar/AceGPT-v2-8B-Chat/etec_v2_0_shot.json ADDED
@@ -0,0 +1,126 @@
+ {
+ "results": {
+ "etec_v2": {
+ "alias": "etec_v2",
+ "acc,none": 0.5680975092739798,
+ "acc_stderr,none": 0.011406002243769559,
+ "acc_norm,none": 0.5680975092739798,
+ "acc_norm_stderr,none": 0.011406002243769559
+ }
+ },
+ "group_subtasks": {
+ "etec_v2": []
+ },
+ "configs": {
+ "etec_v2": {
+ "task": "etec_v2",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/etec_v2/etec.py",
+ "dataset_name": "etec_v2",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "validation_split": "validation",
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n print(doc[\"label\"])\n keys_ar = [\"\u0623\", \"\u0628\", \"\u062c\", \"\u062f\"]\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": int(doc[\"label\"])-1,\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d\n ",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "etec_v2": 0.0
+ },
+ "n-shot": {
+ "etec_v2": 0
+ },
+ "higher_is_better": {
+ "etec_v2": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "etec_v2": {
+ "original": 1887,
+ "effective": 1887
+ }
+ },
+ "config": {
+ "model": "hf",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
+ "model_num_parameters": 8030261248,
+ "model_dtype": "torch.float16",
+ "model_revision": "main",
+ "model_sha": "562d0998c03c02d315e346f81650a43955711901",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "b955b2950",
+ "date": 1739783073.791851,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.48.3",
+ "upper_git_hash": null,
+ "tokenizer_pad_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_eos_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128001,
+ "max_length": 8192,
+ "task_hashes": {
+ "etec_v2": "d371135bd6f3e91b2eb292576c3b2fae24dc4c0d7cd2a5f6eacf1fe6bc062e76"
+ },
+ "model_source": "hf",
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 60987.772646854,
+ "end_time": 61072.230445773,
+ "total_evaluation_time_seconds": "84.4577989190002"
+ }
evaluations/ar/AceGPT-v2-8B-Chat/exams_ar_5_shot.json ADDED
@@ -0,0 +1,119 @@
+ {
+ "results": {
+ "exams_ar": {
+ "alias": "exams_ar",
+ "acc,none": 0.5195530726256983,
+ "acc_stderr,none": 0.02158019049784565,
+ "acc_norm,none": 0.5195530726256983,
+ "acc_norm_stderr,none": 0.02158019049784565
+ }
+ },
+ "group_subtasks": {
+ "exams_ar": []
+ },
+ "configs": {
+ "exams_ar": {
+ "task": "exams_ar",
+ "tag": [
+ "multiple_choice"
+ ],
+ "dataset_path": "lm_eval/tasks/exams_ar",
+ "dataset_name": "exams_ar",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n\n def _process_docs(doc):\n def format_example(doc, keys):\n \"\"\"\n <prompt>\n \u0633\u0624\u0627\u0644:\n A. <choice1>\n B. <choice2>\n C. <choice3>\n D. <choice4>\n \u0627\u062c\u0627\u0628\u0629:\n \"\"\"\n \n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n def _format_subject(subject):\n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n\n keys = [\"A\", \"B\", \"C\", \"D\"]\n \n subject = doc['id'].split(\"-\")[0]\n description = f\"\ufed2\ufef4\ufee3\ufe8d \ufef2\ufee0\ufef3 \ufe84\ufeb4\ufe8c\ufedf\ufe93 \ufe8d\ufefc\ufea8\ufe98\ufef3\ufe8d\ufead \ufee2\ufee7 \ufee2\ufe98\ufecb\ufea9\ufea9 (\ufee2\ufecb \ufe8d\ufefa\ufe9f\ufe8e\ufe91\ufe8e\ufe97) \ufea1\ufeee\ufedf {_format_subject(subject)} \\n\" #\ufee2\ufee7 \ufed2\ufec0\ufee0\ufedb \ufe8e\ufea8\ufe97\ufead \ufe88\ufe9f\ufe8e\ufe91\ufe93 \ufeed\ufe8e\ufea3\ufea9\ufe93 \ufee2\ufee7 \ufe90\ufef4\ufee7 'A\u060c B\u060c C\u060c D' \ufea9\ufeee\ufee7 \ufeb5\ufeae\ufea3\\n\"\n\n out_doc = {\n \"idx\": doc[\"idx\"],\n \"id\": doc[\"id\"],\n 'dsecription': description,\n \"query\": format_example(doc, keys), # \"Question: \" + doc[\"question\"]['stem'] + \"\\nAnswer:\",\n \"choices\": keys,\n \"gold\": [\"A\", \"B\", \"C\", \"D\"].index(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_docs)\n",
+ "doc_to_text": "query",
+ "doc_to_target": "gold",
+ "doc_to_choice": "choices",
+ "description": "description",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 5,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ },
+ {
+ "metric": "acc_norm",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "query",
+ "metadata": {
+ "version": 0.0
+ }
+ }
+ },
+ "versions": {
+ "exams_ar": 0.0
+ },
+ "n-shot": {
+ "exams_ar": 5
+ },
+ "higher_is_better": {
+ "exams_ar": {
+ "acc": true,
+ "acc_norm": true
+ }
+ },
+ "n-samples": {
+ "exams_ar": {
+ "original": 537,
+ "effective": 537
+ }
+ },
+ "config": {
+ "model": "vllm",
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.4,download_dir=/tmp",
+ "batch_size": 1,
+ "batch_sizes": [],
+ "device": null,
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null,
+ "random_seed": 0,
+ "numpy_seed": 1234,
+ "torch_seed": 1234,
+ "fewshot_seed": 1234
+ },
+ "git_hash": "8e1bd48d",
+ "date": 1735747770.5687191,
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
+ "transformers_version": "4.47.1",
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
+ "tokenizer_pad_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_eos_token": [
+ "<|end_of_text|>",
+ "128001"
+ ],
+ "tokenizer_bos_token": [
+ "<|begin_of_text|>",
+ "128000"
+ ],
+ "eot_token_id": 128001,
+ "max_length": 8192,
+ "task_hashes": {},
+ "model_source": "vllm",
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
+ "system_instruction": null,
+ "system_instruction_sha": null,
+ "fewshot_as_multiturn": false,
+ "chat_template": null,
+ "chat_template_sha": null,
+ "start_time": 8055.848670643,
+ "end_time": 8272.25518881,
+ "total_evaluation_time_seconds": "216.40651816700029"
+ }
evaluations/ar/AceGPT-v2-8B-Chat/gat_0_shot.json ADDED
@@ -0,0 +1,539 @@
+ {
+ "results": {
+ "gat": {
+ "acc,none": 0.3615326727706008,
+ "acc_stderr,none": 0.003748588350676633,
+ "alias": "gat"
+ },
+ "gat_algebra": {
+ "alias": " - gat_algebra",
+ "acc,none": 0.30241187384044527,
+ "acc_stderr,none": 0.008849121616191958
+ },
+ "gat_analogy": {
+ "alias": " - gat_analogy",
+ "acc,none": 0.3227686703096539,
+ "acc_stderr,none": 0.008925286248200312
+ },
+ "gat_arithmetic": {
+ "alias": " - gat_arithmetic",
+ "acc,none": 0.3213102686786897,
+ "acc_stderr,none": 0.008960516811645579
+ },
+ "gat_association": {
+ "alias": " - gat_association",
+ "acc,none": 0.39425837320574164,
+ "acc_stderr,none": 0.01512460088966808
+ },
+ "gat_comparisons": {
+ "alias": " - gat_comparisons",
+ "acc,none": 0.28114754098360656,
+ "acc_stderr,none": 0.012876124676937594
+ },
+ "gat_completion": {
+ "alias": " - gat_completion",
+ "acc,none": 0.46115702479338844,
+ "acc_stderr,none": 0.014336474830596175
+ },
+ "gat_contextual": {
+ "alias": " - gat_contextual",
+ "acc,none": 0.2983128834355828,
+ "acc_stderr,none": 0.012674637536976358
+ },
+ "gat_geometry": {
+ "alias": " - gat_geometry",
+ "acc,none": 0.3232876712328767,
+ "acc_stderr,none": 0.024515791774351408
+ },
+ "gat_reading": {
+ "alias": " - gat_reading",
+ "acc,none": 0.5183364839319471,
+ "acc_stderr,none": 0.009717331969425425
+ }
+ },
+ "groups": {
+ "gat": {
+ "acc,none": 0.3615326727706008,
+ "acc_stderr,none": 0.003748588350676633,
+ "alias": "gat"
+ }
+ },
+ "group_subtasks": {
+ "gat": [
+ "gat_analogy",
+ "gat_association",
+ "gat_completion",
+ "gat_reading",
+ "gat_algebra",
+ "gat_arithmetic",
+ "gat_comparisons",
+ "gat_contextual",
+ "gat_geometry"
+ ]
+ },
+ "configs": {
+ "gat_algebra": {
+ "task": "gat_algebra",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "algebra",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
+ "doc_to_target": "{{label}}",
+ "doc_to_choice": [
+ "\u0623",
+ "\u0628",
+ "\u062c",
+ "\u062f"
+ ],
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "aggregation": "mean",
+ "higher_is_better": true
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": false,
+ "metadata": {
+ "version": 0.0
+ }
+ },
+ "gat_analogy": {
+ "task": "gat_analogy",
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
+ "dataset_name": "analogy",
+ "dataset_kwargs": {
+ "trust_remote_code": true
+ },
+ "test_split": "test",
+ "fewshot_split": "validation",
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
121
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
122
+ "doc_to_target": "{{label}}",
123
+ "doc_to_choice": [
124
+ "\u0623",
125
+ "\u0628",
126
+ "\u062c",
127
+ "\u062f"
128
+ ],
129
+ "description": "",
130
+ "target_delimiter": " ",
131
+ "fewshot_delimiter": "\n\n",
132
+ "num_fewshot": 0,
133
+ "metric_list": [
134
+ {
135
+ "metric": "acc",
136
+ "aggregation": "mean",
137
+ "higher_is_better": true
138
+ }
139
+ ],
140
+ "output_type": "multiple_choice",
141
+ "repeats": 1,
142
+ "should_decontaminate": false,
143
+ "metadata": {
144
+ "version": 0.0
145
+ }
146
+ },
147
+ "gat_arithmetic": {
148
+ "task": "gat_arithmetic",
149
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
150
+ "dataset_name": "arithmetic",
151
+ "dataset_kwargs": {
152
+ "trust_remote_code": true
153
+ },
154
+ "test_split": "test",
155
+ "fewshot_split": "validation",
156
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
157
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
158
+ "doc_to_target": "{{label}}",
159
+ "doc_to_choice": [
160
+ "\u0623",
161
+ "\u0628",
162
+ "\u062c",
163
+ "\u062f"
164
+ ],
165
+ "description": "",
166
+ "target_delimiter": " ",
167
+ "fewshot_delimiter": "\n\n",
168
+ "num_fewshot": 0,
169
+ "metric_list": [
170
+ {
171
+ "metric": "acc",
172
+ "aggregation": "mean",
173
+ "higher_is_better": true
174
+ }
175
+ ],
176
+ "output_type": "multiple_choice",
177
+ "repeats": 1,
178
+ "should_decontaminate": false,
179
+ "metadata": {
180
+ "version": 0.0
181
+ }
182
+ },
183
+ "gat_association": {
184
+ "task": "gat_association",
185
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
186
+ "dataset_name": "association",
187
+ "dataset_kwargs": {
188
+ "trust_remote_code": true
189
+ },
190
+ "test_split": "test",
191
+ "fewshot_split": "validation",
192
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
193
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
194
+ "doc_to_target": "{{label}}",
195
+ "doc_to_choice": [
196
+ "\u0623",
197
+ "\u0628",
198
+ "\u062c",
199
+ "\u062f"
200
+ ],
201
+ "description": "",
202
+ "target_delimiter": " ",
203
+ "fewshot_delimiter": "\n\n",
204
+ "num_fewshot": 0,
205
+ "metric_list": [
206
+ {
207
+ "metric": "acc",
208
+ "aggregation": "mean",
209
+ "higher_is_better": true
210
+ }
211
+ ],
212
+ "output_type": "multiple_choice",
213
+ "repeats": 1,
214
+ "should_decontaminate": false,
215
+ "metadata": {
216
+ "version": 0.0
217
+ }
218
+ },
219
+ "gat_comparisons": {
220
+ "task": "gat_comparisons",
221
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
222
+ "dataset_name": "comparisons",
223
+ "dataset_kwargs": {
224
+ "trust_remote_code": true
225
+ },
226
+ "test_split": "test",
227
+ "fewshot_split": "validation",
228
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
229
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
230
+ "doc_to_target": "{{label}}",
231
+ "doc_to_choice": [
232
+ "\u0623",
233
+ "\u0628",
234
+ "\u062c",
235
+ "\u062f"
236
+ ],
237
+ "description": "",
238
+ "target_delimiter": " ",
239
+ "fewshot_delimiter": "\n\n",
240
+ "num_fewshot": 0,
241
+ "metric_list": [
242
+ {
243
+ "metric": "acc",
244
+ "aggregation": "mean",
245
+ "higher_is_better": true
246
+ }
247
+ ],
248
+ "output_type": "multiple_choice",
249
+ "repeats": 1,
250
+ "should_decontaminate": false,
251
+ "metadata": {
252
+ "version": 0.0
253
+ }
254
+ },
255
+ "gat_completion": {
256
+ "task": "gat_completion",
257
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
258
+ "dataset_name": "completion",
259
+ "dataset_kwargs": {
260
+ "trust_remote_code": true
261
+ },
262
+ "test_split": "test",
263
+ "fewshot_split": "validation",
264
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
265
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
266
+ "doc_to_target": "{{label}}",
267
+ "doc_to_choice": [
268
+ "\u0623",
269
+ "\u0628",
270
+ "\u062c",
271
+ "\u062f"
272
+ ],
273
+ "description": "",
274
+ "target_delimiter": " ",
275
+ "fewshot_delimiter": "\n\n",
276
+ "num_fewshot": 0,
277
+ "metric_list": [
278
+ {
279
+ "metric": "acc",
280
+ "aggregation": "mean",
281
+ "higher_is_better": true
282
+ }
283
+ ],
284
+ "output_type": "multiple_choice",
285
+ "repeats": 1,
286
+ "should_decontaminate": false,
287
+ "metadata": {
288
+ "version": 0.0
289
+ }
290
+ },
291
+ "gat_contextual": {
292
+ "task": "gat_contextual",
293
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
294
+ "dataset_name": "contextual",
295
+ "dataset_kwargs": {
296
+ "trust_remote_code": true
297
+ },
298
+ "test_split": "test",
299
+ "fewshot_split": "validation",
300
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
301
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
302
+ "doc_to_target": "{{label}}",
303
+ "doc_to_choice": [
304
+ "\u0623",
305
+ "\u0628",
306
+ "\u062c",
307
+ "\u062f"
308
+ ],
309
+ "description": "\u0627\u0648\u062c\u062f \u0627\u0644\u062e\u0637\u0623 \u0627\u0644\u0633\u064a\u0627\u0642\u064a \u0641\u064a \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0627\u0644\u062a\u0627\u0644\u064a\u0629 \u0645\u0646 \u0628\u064a\u0646 \u0627\u0644\u062e\u064a\u0627\u0631\u0627\u062a:",
310
+ "target_delimiter": " ",
311
+ "fewshot_delimiter": "\n\n",
312
+ "num_fewshot": 0,
313
+ "metric_list": [
314
+ {
315
+ "metric": "acc",
316
+ "aggregation": "mean",
317
+ "higher_is_better": true
318
+ }
319
+ ],
320
+ "output_type": "multiple_choice",
321
+ "repeats": 1,
322
+ "should_decontaminate": false,
323
+ "metadata": {
324
+ "version": 0.0
325
+ }
326
+ },
327
+ "gat_geometry": {
328
+ "task": "gat_geometry",
329
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
330
+ "dataset_name": "geometry",
331
+ "dataset_kwargs": {
332
+ "trust_remote_code": true
333
+ },
334
+ "test_split": "test",
335
+ "fewshot_split": "validation",
336
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
337
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
338
+ "doc_to_target": "{{label}}",
339
+ "doc_to_choice": [
340
+ "\u0623",
341
+ "\u0628",
342
+ "\u062c",
343
+ "\u062f"
344
+ ],
345
+ "description": "",
346
+ "target_delimiter": " ",
347
+ "fewshot_delimiter": "\n\n",
348
+ "num_fewshot": 0,
349
+ "metric_list": [
350
+ {
351
+ "metric": "acc",
352
+ "aggregation": "mean",
353
+ "higher_is_better": true
354
+ }
355
+ ],
356
+ "output_type": "multiple_choice",
357
+ "repeats": 1,
358
+ "should_decontaminate": false,
359
+ "metadata": {
360
+ "version": 0.0
361
+ }
362
+ },
363
+ "gat_reading": {
364
+ "task": "gat_reading",
365
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
366
+ "dataset_name": "reading",
367
+ "dataset_kwargs": {
368
+ "trust_remote_code": true
369
+ },
370
+ "test_split": "test",
371
+ "fewshot_split": "validation",
372
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
373
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
374
+ "doc_to_target": "{{label}}",
375
+ "doc_to_choice": [
376
+ "\u0623",
377
+ "\u0628",
378
+ "\u062c",
379
+ "\u062f"
380
+ ],
381
+ "description": "",
382
+ "target_delimiter": " ",
383
+ "fewshot_delimiter": "\n\n",
384
+ "num_fewshot": 0,
385
+ "metric_list": [
386
+ {
387
+ "metric": "acc",
388
+ "aggregation": "mean",
389
+ "higher_is_better": true
390
+ }
391
+ ],
392
+ "output_type": "multiple_choice",
393
+ "repeats": 1,
394
+ "should_decontaminate": false,
395
+ "metadata": {
396
+ "version": 0.0
397
+ }
398
+ }
399
+ },
400
+ "versions": {
401
+ "gat": 0,
402
+ "gat_algebra": 0.0,
403
+ "gat_analogy": 0.0,
404
+ "gat_arithmetic": 0.0,
405
+ "gat_association": 0.0,
406
+ "gat_comparisons": 0.0,
407
+ "gat_completion": 0.0,
408
+ "gat_contextual": 0.0,
409
+ "gat_geometry": 0.0,
410
+ "gat_reading": 0.0
411
+ },
412
+ "n-shot": {
413
+ "gat_algebra": 0,
414
+ "gat_analogy": 0,
415
+ "gat_arithmetic": 0,
416
+ "gat_association": 0,
417
+ "gat_comparisons": 0,
418
+ "gat_completion": 0,
419
+ "gat_contextual": 0,
420
+ "gat_geometry": 0,
421
+ "gat_reading": 0
422
+ },
423
+ "higher_is_better": {
424
+ "gat": {
425
+ "acc": true
426
+ },
427
+ "gat_algebra": {
428
+ "acc": true
429
+ },
430
+ "gat_analogy": {
431
+ "acc": true
432
+ },
433
+ "gat_arithmetic": {
434
+ "acc": true
435
+ },
436
+ "gat_association": {
437
+ "acc": true
438
+ },
439
+ "gat_comparisons": {
440
+ "acc": true
441
+ },
442
+ "gat_completion": {
443
+ "acc": true
444
+ },
445
+ "gat_contextual": {
446
+ "acc": true
447
+ },
448
+ "gat_geometry": {
449
+ "acc": true
450
+ },
451
+ "gat_reading": {
452
+ "acc": true
453
+ }
454
+ },
455
+ "n-samples": {
456
+ "gat_analogy": {
457
+ "original": 2745,
458
+ "effective": 2745
459
+ },
460
+ "gat_association": {
461
+ "original": 1045,
462
+ "effective": 1045
463
+ },
464
+ "gat_completion": {
465
+ "original": 1210,
466
+ "effective": 1210
467
+ },
468
+ "gat_reading": {
469
+ "original": 2645,
470
+ "effective": 2645
471
+ },
472
+ "gat_algebra": {
473
+ "original": 2695,
474
+ "effective": 2695
475
+ },
476
+ "gat_arithmetic": {
477
+ "original": 2717,
478
+ "effective": 2717
479
+ },
480
+ "gat_comparisons": {
481
+ "original": 1220,
482
+ "effective": 1220
483
+ },
484
+ "gat_contextual": {
485
+ "original": 1304,
486
+ "effective": 1304
487
+ },
488
+ "gat_geometry": {
489
+ "original": 365,
490
+ "effective": 365
491
+ }
492
+ },
493
+ "config": {
494
+ "model": "vllm",
495
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.4,download_dir=/tmp",
496
+ "batch_size": 1,
497
+ "batch_sizes": [],
498
+ "device": null,
499
+ "use_cache": null,
500
+ "limit": null,
501
+ "bootstrap_iters": 100000,
502
+ "gen_kwargs": null,
503
+ "random_seed": 0,
504
+ "numpy_seed": 1234,
505
+ "torch_seed": 1234,
506
+ "fewshot_seed": 1234
507
+ },
508
+ "git_hash": "8e1bd48d",
509
+ "date": 1735749781.6371627,
510
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
511
+ "transformers_version": "4.47.1",
512
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
513
+ "tokenizer_pad_token": [
514
+ "<|end_of_text|>",
515
+ "128001"
516
+ ],
517
+ "tokenizer_eos_token": [
518
+ "<|end_of_text|>",
519
+ "128001"
520
+ ],
521
+ "tokenizer_bos_token": [
522
+ "<|begin_of_text|>",
523
+ "128000"
524
+ ],
525
+ "eot_token_id": 128001,
526
+ "max_length": 8192,
527
+ "task_hashes": {},
528
+ "model_source": "vllm",
529
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
530
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
531
+ "system_instruction": null,
532
+ "system_instruction_sha": null,
533
+ "fewshot_as_multiturn": false,
534
+ "chat_template": null,
535
+ "chat_template_sha": null,
536
+ "start_time": 10066.91226392,
537
+ "end_time": 10586.891967311,
538
+ "total_evaluation_time_seconds": "519.9797033909999"
539
+ }
evaluations/ar/AceGPT-v2-8B-Chat/moe_ien_mcq_0_shot.json ADDED
@@ -0,0 +1,127 @@
1
+ {
2
+ "results": {
3
+ "moe_ien_mcq": {
4
+ "alias": "moe_ien_mcq",
5
+ "acc,none": 0.7700700700700701,
6
+ "acc_stderr,none": 0.0042101916833611345,
7
+ "acc_norm,none": 0.7700700700700701,
8
+ "acc_norm_stderr,none": 0.0042101916833611345
9
+ }
10
+ },
11
+ "group_subtasks": {
12
+ "moe_ien_mcq": []
13
+ },
14
+ "configs": {
15
+ "moe_ien_mcq": {
16
+ "task": "moe_ien_mcq",
17
+ "dataset_path": "lm_eval/tasks/moe_ien_mcq/ien_moe_mcq.py",
18
+ "dataset_name": "moe_ien_mcq",
19
+ "dataset_kwargs": {
20
+ "trust_remote_code": true
21
+ },
22
+ "validation_split": "validation",
23
+ "test_split": "test",
24
+ "fewshot_split": "validation",
25
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.split(\". \", 1)[1] if \". \" in choice else choice\n\n def format_example(doc, keys):\n question = doc[\"Question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"Choices\"])]\n \n )\n prompt = f\"\\n\\n\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys = [\"A\", \"B\", \"C\", \"D\", \"E\", \"F\"][0:len(doc[\"Choices\"])]\n out_doc = {\n \"Query\": format_example(doc, keys), \n \"Choices\": keys,\n \"gold\": int(doc[\"Answer\"])-1, ## \n } \n return out_doc\n \n return dataset.map(_process_docs)\n",
26
+ "doc_to_text": "Query",
27
+ "doc_to_target": "gold",
28
+ "doc_to_choice": "{{Choices}}",
29
+ "description": "\u0641\u064a\u0645\u0627\u202f\u064a\u0644\u064a\u202f\u0623\u0633\u0626\u0644\u0629\u202f\u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631\u202f\u0645\u0646\u202f\u0645\u062a\u0639\u062f\u062f\u202f(\u0645\u0639\u202f\u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a)\u202f\u0641\u064a\u202f{{Subject}}",
30
+ "target_delimiter": " ",
31
+ "fewshot_delimiter": "\n\n",
32
+ "fewshot_config": {
33
+ "sampler": "balanced_cat"
34
+ },
35
+ "num_fewshot": 0,
36
+ "metric_list": [
37
+ {
38
+ "metric": "acc",
39
+ "aggregation": "mean",
40
+ "higher_is_better": true
41
+ },
42
+ {
43
+ "metric": "acc_norm",
44
+ "aggregation": "mean",
45
+ "higher_is_better": true
46
+ }
47
+ ],
48
+ "output_type": "multiple_choice",
49
+ "repeats": 1,
50
+ "should_decontaminate": true,
51
+ "doc_to_decontamination_query": "Query",
52
+ "metadata": {
53
+ "version": 0.0
54
+ }
55
+ }
56
+ },
57
+ "versions": {
58
+ "moe_ien_mcq": 0.0
59
+ },
60
+ "n-shot": {
61
+ "moe_ien_mcq": 0
62
+ },
63
+ "higher_is_better": {
64
+ "moe_ien_mcq": {
65
+ "acc": true,
66
+ "acc_norm": true
67
+ }
68
+ },
69
+ "n-samples": {
70
+ "moe_ien_mcq": {
71
+ "original": 9990,
72
+ "effective": 9990
73
+ }
74
+ },
75
+ "config": {
76
+ "model": "hf",
77
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
78
+ "model_num_parameters": 8030261248,
79
+ "model_dtype": "torch.float16",
80
+ "model_revision": "main",
81
+ "model_sha": "562d0998c03c02d315e346f81650a43955711901",
82
+ "batch_size": 1,
83
+ "batch_sizes": [],
84
+ "device": null,
85
+ "use_cache": null,
86
+ "limit": null,
87
+ "bootstrap_iters": 100000,
88
+ "gen_kwargs": null,
89
+ "random_seed": 0,
90
+ "numpy_seed": 1234,
91
+ "torch_seed": 1234,
92
+ "fewshot_seed": 1234
93
+ },
94
+ "git_hash": "b955b2950",
95
+ "date": 1739783202.062394,
96
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
97
+ "transformers_version": "4.48.3",
98
+ "upper_git_hash": null,
99
+ "tokenizer_pad_token": [
100
+ "<|end_of_text|>",
101
+ "128001"
102
+ ],
103
+ "tokenizer_eos_token": [
104
+ "<|end_of_text|>",
105
+ "128001"
106
+ ],
107
+ "tokenizer_bos_token": [
108
+ "<|begin_of_text|>",
109
+ "128000"
110
+ ],
111
+ "eot_token_id": 128001,
112
+ "max_length": 8192,
113
+ "task_hashes": {
114
+ "moe_ien_mcq": "99731f9d1bb76d010da5a439ea1b0bb7695451459d680f708f7222f02ba8e831"
115
+ },
116
+ "model_source": "hf",
117
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
118
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
119
+ "system_instruction": null,
120
+ "system_instruction_sha": null,
121
+ "fewshot_as_multiturn": false,
122
+ "chat_template": null,
123
+ "chat_template_sha": null,
124
+ "start_time": 61116.014324615,
125
+ "end_time": 61463.567260828,
126
+ "total_evaluation_time_seconds": "347.5529362130037"
127
+ }
evaluations/ar/AceGPT-v2-8B-Chat/moe_ien_tf_0_shot.json ADDED
@@ -0,0 +1,129 @@
1
+ {
2
+ "results": {
3
+ "moe_ien_tf": {
4
+ "alias": "moe_ien_tf",
5
+ "acc,none": 0.7590589043448395,
6
+ "acc_stderr,none": 0.00560476076159517,
7
+ "acc_norm,none": 0.7590589043448395,
8
+ "acc_norm_stderr,none": 0.00560476076159517
9
+ }
10
+ },
11
+ "group_subtasks": {
12
+ "moe_ien_tf": []
13
+ },
14
+ "configs": {
15
+ "moe_ien_tf": {
16
+ "task": "moe_ien_tf",
17
+ "tag": [
18
+ "multiple_choice"
19
+ ],
20
+ "dataset_path": "lm_eval/tasks/moe_ien_tf/moe_ien_tf.py",
21
+ "dataset_name": "moe_ien_tf",
22
+ "dataset_kwargs": {
23
+ "trust_remote_code": true
24
+ },
25
+ "validation_split": "validation",
26
+ "test_split": "test",
27
+ "fewshot_split": "validation",
28
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n keys=[\"\u0635\u062d\u064a\u062d\u0629\",\n \"\u062e\u0627\u0637\u0626\u0629\"\n ]\n #keys =[\"\u0635\u0648\u0627\u0628\",\n # \"\u062e\u0637\u0623\"]\n target_key = int(doc[\"Answer\"])-1\n\n out_doc = {\n \"query\": \"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" +doc[\"Question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\", \n \"choices\": keys,\n \"gold\": target_key,\n }\n return out_doc\n return dataset.map(_process_docs)\n",
29
+ "doc_to_text": "query",
30
+ "doc_to_target": "gold",
31
+ "doc_to_choice": "choices",
32
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{Subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d\u064a\u062d\u0629' \u0623\u0648 '\u062e\u0627\u0637\u0626\u0629' \u062f\u0648\u0646 \u0634\u0631\u062d ",
33
+ "target_delimiter": " ",
34
+ "fewshot_delimiter": "\n\n",
35
+ "fewshot_config": {
36
+ "sampler": "balanced_cat"
37
+ },
38
+ "num_fewshot": 0,
39
+ "metric_list": [
40
+ {
41
+ "metric": "acc",
42
+ "aggregation": "mean",
43
+ "higher_is_better": true
44
+ },
45
+ {
46
+ "metric": "acc_norm",
47
+ "aggregation": "mean",
48
+ "higher_is_better": true
49
+ }
50
+ ],
51
+ "output_type": "multiple_choice",
52
+ "repeats": 1,
53
+ "should_decontaminate": false,
54
+ "metadata": {
55
+ "version": 2.0
56
+ }
57
+ }
58
+ },
59
+ "versions": {
60
+ "moe_ien_tf": 2.0
61
+ },
62
+ "n-shot": {
63
+ "moe_ien_tf": 0
64
+ },
65
+ "higher_is_better": {
66
+ "moe_ien_tf": {
67
+ "acc": true,
68
+ "acc_norm": true
69
+ }
70
+ },
71
+ "n-samples": {
72
+ "moe_ien_tf": {
73
+ "original": 5823,
74
+ "effective": 5823
75
+ }
76
+ },
77
+ "config": {
78
+ "model": "hf",
79
+ "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
80
+ "model_num_parameters": 8030261248,
81
+ "model_dtype": "torch.float16",
82
+ "model_revision": "main",
83
+ "model_sha": "562d0998c03c02d315e346f81650a43955711901",
84
+ "batch_size": 1,
85
+ "batch_sizes": [],
86
+ "device": null,
87
+ "use_cache": null,
88
+ "limit": null,
89
+ "bootstrap_iters": 100000,
90
+ "gen_kwargs": null,
91
+ "random_seed": 0,
92
+ "numpy_seed": 1234,
93
+ "torch_seed": 1234,
94
+ "fewshot_seed": 1234
95
+ },
96
+ "git_hash": "b955b2950",
97
+ "date": 1739783594.7150183,
98
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
99
+ "transformers_version": "4.48.3",
100
+ "upper_git_hash": null,
101
+ "tokenizer_pad_token": [
102
+ "<|end_of_text|>",
103
+ "128001"
104
+ ],
105
+ "tokenizer_eos_token": [
106
+ "<|end_of_text|>",
107
+ "128001"
108
+ ],
109
+ "tokenizer_bos_token": [
110
+ "<|begin_of_text|>",
111
+ "128000"
112
+ ],
113
+ "eot_token_id": 128001,
114
+ "max_length": 8192,
115
+ "task_hashes": {
116
+ "moe_ien_tf": "a8315c59ec304a82f04395ff5e7728d6586b1b0b5f569486840b7d29d76a8dd8"
117
+ },
118
+ "model_source": "hf",
119
+ "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
120
+ "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
121
+ "system_instruction": null,
122
+ "system_instruction_sha": null,
123
+ "fewshot_as_multiturn": false,
124
+ "chat_template": null,
125
+ "chat_template_sha": null,
126
+ "start_time": 61508.598662402,
127
+ "end_time": 61883.458017876,
128
+ "total_evaluation_time_seconds": "374.85935547400004"
129
+ }
evaluations/ar/AceGPT-v2-8B-Chat/openaimmlu_0_shot.json ADDED
The diff for this file is too large to render. See raw diff
evaluations/ar/Allam-7b-instruct-preview/acva_5_shot.json ADDED
@@ -0,0 +1,119 @@
1
+ {
2
+ "results": {
3
+ "acva": {
4
+ "alias": "acva",
5
+ "acc,none": 0.7746268656716417,
6
+ "acc_stderr,none": 0.004477269169728854,
7
+ "acc_norm,none": 0.7632606199770379,
8
+ "acc_norm_stderr,none": 0.004554991129754026
9
+ }
10
+ },
11
+ "group_subtasks": {
12
+ "acva": []
13
+ },
14
+ "configs": {
15
+ "acva": {
16
+ "task": "acva",
17
+ "tag": [
18
+ "multiple_choice"
19
+ ],
20
+ "dataset_path": "FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment",
21
+ "dataset_kwargs": {
22
+ "trust_remote_code": true
23
+ },
24
+ "test_split": "test",
25
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _format_subject(subject):\n \n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n \n def _generate_subject(doc):\n subject = _format_subject(doc[\"id\"].split(\"-\")[0])\n\n return subject\n \n def _process_docs(doc):\n keys = [\"\u0635\u062d\",\n \"\u062e\u0637\u0623\"]\n subject = _generate_subject(doc)\n gold = keys.index(doc['answer'])\n out_doc = {\n \"id\": doc[\"id\"],\n \"query\": \"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" + doc[\"question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\",\n \"choices\": keys,\n \"gold\": gold,\n \"subject\": subject,\n }\n \n return out_doc\n\n return dataset.map(_process_docs)\n",
26
+ "doc_to_text": "query",
27
+ "doc_to_target": "gold",
28
+ "doc_to_choice": "choices",
29
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d' \u0623\u0648 '\u062e\u0637\u0623' \u062f\u0648\u0646 \u0634\u0631\u062d",
30
+ "target_delimiter": " ",
31
+ "fewshot_delimiter": "\n\n",
32
+ "num_fewshot": 5,
33
+ "metric_list": [
34
+ {
35
+ "metric": "acc",
36
+ "aggregation": "mean",
37
+ "higher_is_better": true
38
+ },
39
+ {
40
+ "metric": "acc_norm",
41
+ "aggregation": "mean",
42
+ "higher_is_better": true
43
+ }
44
+ ],
45
+ "output_type": "multiple_choice",
46
+ "repeats": 1,
47
+ "should_decontaminate": false,
48
+ "metadata": {
49
+ "version": 0.0
50
+ }
51
+ }
52
+ },
53
+ "versions": {
54
+ "acva": 0.0
55
+ },
56
+ "n-shot": {
57
+ "acva": 5
58
+ },
59
+ "higher_is_better": {
60
+ "acva": {
61
+ "acc": true,
62
+ "acc_norm": true
63
+ }
64
+ },
65
+ "n-samples": {
66
+ "acva": {
67
+ "original": 8710,
68
+ "effective": 8710
69
+ }
70
+ },
71
+ "config": {
72
+ "model": "vllm",
73
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.8",
74
+ "batch_size": 1,
75
+ "batch_sizes": [],
76
+ "device": null,
77
+ "use_cache": null,
78
+ "limit": null,
79
+ "bootstrap_iters": 100000,
80
+ "gen_kwargs": null,
81
+ "random_seed": 0,
82
+ "numpy_seed": 1234,
83
+ "torch_seed": 1234,
84
+ "fewshot_seed": 1234
85
+ },
86
+ "git_hash": "8e1bd48d",
87
+ "date": 1735662713.7617116,
88
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
89
+ "transformers_version": "4.47.1",
90
+ "upper_git_hash": null,
91
+ "tokenizer_pad_token": [
92
+ "<unk>",
93
+ "0"
94
+ ],
95
+ "tokenizer_eos_token": [
96
+ "</s>",
97
+ "2"
98
+ ],
99
+ "tokenizer_bos_token": [
100
+ "<s>",
101
+ "1"
102
+ ],
103
+ "eot_token_id": 2,
104
+ "max_length": 4096,
105
+ "task_hashes": {
106
+ "acva": "d007c508f0accdd697f549d7cbe7f960f1470c8f86f1a0969355a6ef33108edb"
107
+ },
108
+ "model_source": "vllm",
109
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
110
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
111
+ "system_instruction": null,
112
+ "system_instruction_sha": null,
113
+ "fewshot_as_multiturn": false,
114
+ "chat_template": null,
115
+ "chat_template_sha": null,
116
+ "start_time": 3374.021232778,
117
+ "end_time": 3578.563943596,
118
+ "total_evaluation_time_seconds": "204.54271081800016"
119
+ }
evaluations/ar/Allam-7b-instruct-preview/ar_ifeval_0_shot.json ADDED
@@ -0,0 +1,142 @@
1
+ {
2
+ "results": {
3
+ "ar_ifeval": {
4
+ "alias": "ar_ifeval",
5
+ "prompt_level_strict_acc,none": 0.31343283582089554,
6
+ "prompt_level_strict_acc_stderr,none": 0.020055655889994813,
7
+ "inst_level_strict_acc,none": 0.6764505119453925,
8
+ "inst_level_strict_acc_stderr,none": "N/A",
9
+ "prompt_level_loose_acc,none": 0.3656716417910448,
10
+ "prompt_level_loose_acc_stderr,none": 0.020822161638297296,
11
+ "inst_level_loose_acc,none": 0.7051194539249147,
12
+ "inst_level_loose_acc_stderr,none": "N/A"
13
+ }
14
+ },
15
+ "group_subtasks": {
16
+ "ar_ifeval": []
17
+ },
18
+ "configs": {
19
+ "ar_ifeval": {
20
+ "task": "ar_ifeval",
21
+ "dataset_path": "lm_eval/tasks/ar_ifeval/ar_ifeval.py",
22
+ "dataset_name": "ar_ifeval",
23
+ "dataset_kwargs": {
24
+ "trust_remote_code": true
25
+ },
26
+ "test_split": "test",
27
+ "doc_to_text": "prompt",
28
+ "doc_to_target": 0,
29
+ "process_results": "def process_results(doc, results):\n\n response = results[0]\n out_strict = process_sample(doc, response, 'strict')\n out_loose = process_sample(doc, response, 'loose')\n\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
30
+ "description": "",
31
+ "target_delimiter": " ",
32
+ "fewshot_delimiter": "\n\n",
33
+ "num_fewshot": 0,
34
+ "metric_list": [
35
+ {
36
+ "metric": "prompt_level_strict_acc",
37
+ "aggregation": "mean",
38
+ "higher_is_better": true
39
+ },
40
+ {
41
+ "metric": "inst_level_strict_acc",
42
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
43
+ "higher_is_better": true
44
+ },
45
+ {
46
+ "metric": "prompt_level_loose_acc",
47
+ "aggregation": "mean",
48
+ "higher_is_better": true
49
+ },
50
+ {
51
+ "metric": "inst_level_loose_acc",
52
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
53
+ "higher_is_better": true
54
+ }
55
+ ],
56
+ "output_type": "generate_until",
57
+ "generation_kwargs": {
58
+ "until": [],
59
+ "do_sample": false,
60
+ "temperature": 0.0,
61
+ "max_gen_toks": 1280
62
+ },
63
+ "repeats": 1,
64
+ "should_decontaminate": false,
65
+ "metadata": {
66
+ "version": 4.0
67
+ }
68
+ }
69
+ },
70
+ "versions": {
71
+ "ar_ifeval": 4.0
72
+ },
73
+ "n-shot": {
74
+ "ar_ifeval": 0
75
+ },
76
+ "higher_is_better": {
77
+ "ar_ifeval": {
78
+ "prompt_level_strict_acc": true,
79
+ "inst_level_strict_acc": true,
80
+ "prompt_level_loose_acc": true,
81
+ "inst_level_loose_acc": true
82
+ }
83
+ },
84
+ "n-samples": {
85
+ "ar_ifeval": {
86
+ "original": 536,
87
+ "effective": 536
88
+ }
89
+ },
90
+ "config": {
91
+ "model": "hf",
92
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
93
+ "model_num_parameters": 7000559616,
94
+ "model_dtype": "torch.bfloat16",
95
+ "model_revision": "main",
96
+ "model_sha": "",
97
+ "batch_size": 1,
98
+ "batch_sizes": [],
99
+ "device": null,
100
+ "use_cache": null,
101
+ "limit": null,
102
+ "bootstrap_iters": 100000,
103
+ "gen_kwargs": null,
104
+ "random_seed": 0,
105
+ "numpy_seed": 1234,
106
+ "torch_seed": 1234,
107
+ "fewshot_seed": 1234
108
+ },
109
+ "git_hash": "b955b2950",
110
+ "date": 1739618378.981141,
111
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
112
+ "transformers_version": "4.48.3",
113
+ "upper_git_hash": null,
114
+ "tokenizer_pad_token": [
115
+ "<unk>",
116
+ "0"
117
+ ],
118
+ "tokenizer_eos_token": [
119
+ "</s>",
120
+ "2"
121
+ ],
122
+ "tokenizer_bos_token": [
123
+ "<s>",
124
+ "1"
125
+ ],
126
+ "eot_token_id": 2,
127
+ "max_length": 4096,
128
+ "task_hashes": {
129
+ "ar_ifeval": "d0db7903ef270d7dc54efe4e7713be0de9864fc3a36c901c6e5777a6a5f69aa9"
130
+ },
131
+ "model_source": "hf",
132
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
133
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
134
+ "system_instruction": null,
135
+ "system_instruction_sha": null,
136
+ "fewshot_as_multiturn": false,
137
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ bos_token + ' [INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + content.strip() + ' ' + eos_token }}{% endif %}{% endfor %}",
138
+ "chat_template_sha": "f1dff938141b507da4a409b6bb3431382088a97a963acd246a41f2f344ae831f",
139
+ "start_time": 1393068.333905473,
140
+ "end_time": 1397143.169266589,
141
+ "total_evaluation_time_seconds": "4074.8353611161"
142
+ }
evaluations/ar/Allam-7b-instruct-preview/araMath_v3_5_shot.json ADDED
@@ -0,0 +1,126 @@
1
+ {
2
+ "results": {
3
+ "araMath_v3": {
4
+ "alias": "araMath_v3",
5
+ "acc,none": 0.6677685950413224,
6
+ "acc_stderr,none": 0.019165266705090528,
7
+ "acc_norm,none": 0.6677685950413224,
8
+ "acc_norm_stderr,none": 0.019165266705090528
9
+ }
10
+ },
11
+ "group_subtasks": {
12
+ "araMath_v3": []
13
+ },
14
+ "configs": {
15
+ "araMath_v3": {
16
+ "task": "araMath_v3",
17
+ "tag": [
18
+ "multiple_choice"
19
+ ],
20
+ "dataset_path": "lm_eval/tasks/araMath_v3/araMath_v3.py",
21
+ "dataset_name": "araMath_v3",
22
+ "dataset_kwargs": {
23
+ "trust_remote_code": true
24
+ },
25
+ "validation_split": "validation",
26
+ "test_split": "test",
27
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def remove_prefix(choice):\n prefixes = [\"(A)\", \"(B)\", \"(C)\", \"(D)\"]\n for prefix in prefixes:\n if choice.startswith(prefix + \" \"):\n return choice[len(prefix) + 1:] \n return choice \n\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"options\"])]\n )\n\n prompt = f\"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": keys_en.index(doc[\"label\"]),\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
28
+ "doc_to_text": "query",
29
+ "doc_to_target": "gold",
30
+ "doc_to_choice": "{{choices}}",
31
+ "description": "\u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u0645\u0646 \u0628\u064a\u0646 'A\u060c B\u060c C\u060c D' \u062f\u0648\u0646 \u0634\u0631\u062d",
32
+ "target_delimiter": " ",
33
+ "fewshot_delimiter": "\n\n",
34
+ "num_fewshot": 5,
35
+ "metric_list": [
36
+ {
37
+ "metric": "acc",
38
+ "aggregation": "mean",
39
+ "higher_is_better": true
40
+ },
41
+ {
42
+ "metric": "acc_norm",
43
+ "aggregation": "mean",
44
+ "higher_is_better": true
45
+ }
46
+ ],
47
+ "output_type": "multiple_choice",
48
+ "repeats": 1,
49
+ "should_decontaminate": true,
50
+ "doc_to_decontamination_query": "query",
51
+ "metadata": {
52
+ "version": 0.0
53
+ }
54
+ }
55
+ },
56
+ "versions": {
57
+ "araMath_v3": 0.0
58
+ },
59
+ "n-shot": {
60
+ "araMath_v3": 5
61
+ },
62
+ "higher_is_better": {
63
+ "araMath_v3": {
64
+ "acc": true,
65
+ "acc_norm": true
66
+ }
67
+ },
68
+ "n-samples": {
69
+ "araMath_v3": {
70
+ "original": 605,
71
+ "effective": 605
72
+ }
73
+ },
74
+ "config": {
75
+ "model": "hf",
76
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
77
+ "model_num_parameters": 7000559616,
78
+ "model_dtype": "torch.bfloat16",
79
+ "model_revision": "main",
80
+ "model_sha": "",
81
+ "batch_size": 1,
82
+ "batch_sizes": [],
83
+ "device": null,
84
+ "use_cache": null,
85
+ "limit": null,
86
+ "bootstrap_iters": 100000,
87
+ "gen_kwargs": null,
88
+ "random_seed": 0,
89
+ "numpy_seed": 1234,
90
+ "torch_seed": 1234,
91
+ "fewshot_seed": 1234
92
+ },
93
+ "git_hash": "b955b2950",
94
+ "date": 1739618269.6292942,
95
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
96
+ "transformers_version": "4.48.3",
97
+ "upper_git_hash": null,
98
+ "tokenizer_pad_token": [
99
+ "<unk>",
100
+ "0"
101
+ ],
102
+ "tokenizer_eos_token": [
103
+ "</s>",
104
+ "2"
105
+ ],
106
+ "tokenizer_bos_token": [
107
+ "<s>",
108
+ "1"
109
+ ],
110
+ "eot_token_id": 2,
111
+ "max_length": 4096,
112
+ "task_hashes": {
113
+ "araMath_v3": "e7f60b63c44ee90c76a61f37207fa1f812622b6662200911fcfd7dabe78ada66"
114
+ },
115
+ "model_source": "hf",
116
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
117
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
118
+ "system_instruction": null,
119
+ "system_instruction_sha": null,
120
+ "fewshot_as_multiturn": false,
121
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ bos_token + ' [INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + content.strip() + ' ' + eos_token }}{% endif %}{% endfor %}",
122
+ "chat_template_sha": "f1dff938141b507da4a409b6bb3431382088a97a963acd246a41f2f344ae831f",
123
+ "start_time": 1392959.193182268,
124
+ "end_time": 1393012.133225703,
125
+ "total_evaluation_time_seconds": "52.940043434966356"
126
+ }
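
Every report added in this commit shares the same top-level schema (`results`, `group_subtasks`, `configs`, `config`, plus environment metadata), so headline numbers can be pulled out generically. A minimal standard-library sketch; the path is one of the files added here, and the `acc,none` / `acc_stderr,none` keys are exactly as recorded in the reports:

```python
# Minimal sketch: print the headline metrics from one of these result files.
import json

path = "evaluations/ar/Allam-7b-instruct-preview/araPro_0_shot.json"
with open(path, encoding="utf-8") as f:
    report = json.load(f)

for task, metrics in report["results"].items():
    print(f'{task}: acc={metrics["acc,none"]:.4f} '
          f'(stderr {metrics["acc_stderr,none"]:.4f})')
```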
evaluations/ar/Allam-7b-instruct-preview/araPro_0_shot.json ADDED
@@ -0,0 +1,130 @@
1
+ {
2
+ "results": {
3
+ "araPro": {
4
+ "alias": "araPro",
5
+ "acc,none": 0.6970605878824235,
6
+ "acc_stderr,none": 0.006498724870364006,
7
+ "acc_norm,none": 0.6970605878824235,
8
+ "acc_norm_stderr,none": 0.006498724870364006
9
+ }
10
+ },
11
+ "group_subtasks": {
12
+ "araPro": []
13
+ },
14
+ "configs": {
15
+ "araPro": {
16
+ "task": "araPro",
17
+ "tag": [
18
+ "multiple_choice"
19
+ ],
20
+ "dataset_path": "lm_eval/tasks/araPro/araPro.py",
21
+ "dataset_name": "araPro",
22
+ "dataset_kwargs": {
23
+ "trust_remote_code": true
24
+ },
25
+ "validation_split": "validation",
26
+ "test_split": "test",
27
+ "fewshot_split": "validation",
28
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.replace('.', '') if '.' in choice[:2] else choice\n \n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choice_num = ['choice1', 'choice2', 'choice3', 'choice4']\n choices = \"\".join(\n [f\"{key}. {remove_prefix(doc[choice_num[index]])}\\n\" for index, key in enumerate(keys)]\n )\n\n prompt = f\"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n #keys = [\"1\", \"2\", \"3\", \"4\"]\n keys = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys), \n \"choices\": keys,\n \"gold\": doc[\"answer\"]-1,\n } \n\n return out_doc\n \n return dataset.map(_process_docs)\n",
29
+ "doc_to_text": "query",
30
+ "doc_to_target": "gold",
31
+ "doc_to_choice": "{{choices}}",
32
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d",
33
+ "target_delimiter": " ",
34
+ "fewshot_delimiter": "\n\n",
35
+ "fewshot_config": {
36
+ "sampler": "balanced_cat"
37
+ },
38
+ "num_fewshot": 0,
39
+ "metric_list": [
40
+ {
41
+ "metric": "acc",
42
+ "aggregation": "mean",
43
+ "higher_is_better": true
44
+ },
45
+ {
46
+ "metric": "acc_norm",
47
+ "aggregation": "mean",
48
+ "higher_is_better": true
49
+ }
50
+ ],
51
+ "output_type": "multiple_choice",
52
+ "repeats": 1,
53
+ "should_decontaminate": true,
54
+ "doc_to_decontamination_query": "Question",
55
+ "metadata": {
56
+ "version": 2.0
57
+ }
58
+ }
59
+ },
60
+ "versions": {
61
+ "araPro": 2.0
62
+ },
63
+ "n-shot": {
64
+ "araPro": 0
65
+ },
66
+ "higher_is_better": {
67
+ "araPro": {
68
+ "acc": true,
69
+ "acc_norm": true
70
+ }
71
+ },
72
+ "n-samples": {
73
+ "araPro": {
74
+ "original": 5001,
75
+ "effective": 5001
76
+ }
77
+ },
78
+ "config": {
79
+ "model": "hf",
80
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
81
+ "model_num_parameters": 7000559616,
82
+ "model_dtype": "torch.bfloat16",
83
+ "model_revision": "main",
84
+ "model_sha": "",
85
+ "batch_size": 1,
86
+ "batch_sizes": [],
87
+ "device": null,
88
+ "use_cache": null,
89
+ "limit": null,
90
+ "bootstrap_iters": 100000,
91
+ "gen_kwargs": null,
92
+ "random_seed": 0,
93
+ "numpy_seed": 1234,
94
+ "torch_seed": 1234,
95
+ "fewshot_seed": 1234
96
+ },
97
+ "git_hash": "b955b2950",
98
+ "date": 1739617164.0204737,
99
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
100
+ "transformers_version": "4.48.3",
101
+ "upper_git_hash": null,
102
+ "tokenizer_pad_token": [
103
+ "<unk>",
104
+ "0"
105
+ ],
106
+ "tokenizer_eos_token": [
107
+ "</s>",
108
+ "2"
109
+ ],
110
+ "tokenizer_bos_token": [
111
+ "<s>",
112
+ "1"
113
+ ],
114
+ "eot_token_id": 2,
115
+ "max_length": 4096,
116
+ "task_hashes": {
117
+ "araPro": "01340c360a1565c46298c4c24dd3fdfe1ea614c6eef6e4d4f021f1da83da2584"
118
+ },
119
+ "model_source": "hf",
120
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
121
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
122
+ "system_instruction": null,
123
+ "system_instruction_sha": null,
124
+ "fewshot_as_multiturn": false,
125
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ bos_token + ' [INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + content.strip() + ' ' + eos_token }}{% endif %}{% endfor %}",
126
+ "chat_template_sha": "f1dff938141b507da4a409b6bb3431382088a97a963acd246a41f2f344ae831f",
127
+ "start_time": 1391853.516943726,
128
+ "end_time": 1392050.054185297,
129
+ "total_evaluation_time_seconds": "196.5372415711172"
130
+ }
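
The `process_docs` string recorded in the araPro config above fixes the exact prompt surface. Below is a standalone sketch of the same formatting logic; the sample document is hypothetical (real ones come from the araPro dataset loader):

```python
# Standalone mirror of the araPro prompt construction recorded above.
# The sample doc is hypothetical; "answer" is 1-based in the dataset.
def format_example(doc, keys=("A", "B", "C", "D")):
    def remove_prefix(choice):
        # as recorded: drop '.' characters when the choice carries a "1."-style prefix
        return choice.replace(".", "") if "." in choice[:2] else choice

    question = doc["question"].strip()
    cols = ["choice1", "choice2", "choice3", "choice4"]
    choices = "".join(f"{k}. {remove_prefix(doc[c])}\n" for k, c in zip(keys, cols))
    return f"\n\n\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\n{choices} \n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:"

doc = {"question": "2 + 2 = ?", "choice1": "3", "choice2": "4",
       "choice3": "5", "choice4": "6", "answer": 2}
print(format_example(doc))      # prompt surface shown to the model
print(int(doc["answer"]) - 1)   # zero-based gold index, as in "gold": answer-1
```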
evaluations/ar/Allam-7b-instruct-preview/arabicmmlu_0_shot.json ADDED
The diff for this file is too large to render. See raw diff
 
evaluations/ar/Allam-7b-instruct-preview/etec_v2_0_shot.json ADDED
@@ -0,0 +1,126 @@
1
+ {
2
+ "results": {
3
+ "etec_v2": {
4
+ "alias": "etec_v2",
5
+ "acc,none": 0.6666666666666666,
6
+ "acc_stderr,none": 0.010854826817097195,
7
+ "acc_norm,none": 0.6666666666666666,
8
+ "acc_norm_stderr,none": 0.010854826817097195
9
+ }
10
+ },
11
+ "group_subtasks": {
12
+ "etec_v2": []
13
+ },
14
+ "configs": {
15
+ "etec_v2": {
16
+ "task": "etec_v2",
17
+ "tag": [
18
+ "multiple_choice"
19
+ ],
20
+ "dataset_path": "lm_eval/tasks/etec_v2/etec.py",
21
+ "dataset_name": "etec_v2",
22
+ "dataset_kwargs": {
23
+ "trust_remote_code": true
24
+ },
25
+ "validation_split": "validation",
26
+ "test_split": "test",
27
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n print(doc[\"label\"])\n keys_ar = [\"\u0623\", \"\u0628\", \"\u062c\", \"\u062f\"]\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": int(doc[\"label\"])-1,\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
28
+ "doc_to_text": "query",
29
+ "doc_to_target": "gold",
30
+ "doc_to_choice": "choices",
31
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d\n ",
32
+ "target_delimiter": " ",
33
+ "fewshot_delimiter": "\n\n",
34
+ "num_fewshot": 0,
35
+ "metric_list": [
36
+ {
37
+ "metric": "acc",
38
+ "aggregation": "mean",
39
+ "higher_is_better": true
40
+ },
41
+ {
42
+ "metric": "acc_norm",
43
+ "aggregation": "mean",
44
+ "higher_is_better": true
45
+ }
46
+ ],
47
+ "output_type": "multiple_choice",
48
+ "repeats": 1,
49
+ "should_decontaminate": true,
50
+ "doc_to_decontamination_query": "query",
51
+ "metadata": {
52
+ "version": 0.0
53
+ }
54
+ }
55
+ },
56
+ "versions": {
57
+ "etec_v2": 0.0
58
+ },
59
+ "n-shot": {
60
+ "etec_v2": 0
61
+ },
62
+ "higher_is_better": {
63
+ "etec_v2": {
64
+ "acc": true,
65
+ "acc_norm": true
66
+ }
67
+ },
68
+ "n-samples": {
69
+ "etec_v2": {
70
+ "original": 1887,
71
+ "effective": 1887
72
+ }
73
+ },
74
+ "config": {
75
+ "model": "hf",
76
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
77
+ "model_num_parameters": 7000559616,
78
+ "model_dtype": "torch.bfloat16",
79
+ "model_revision": "main",
80
+ "model_sha": "",
81
+ "batch_size": 1,
82
+ "batch_sizes": [],
83
+ "device": null,
84
+ "use_cache": null,
85
+ "limit": null,
86
+ "bootstrap_iters": 100000,
87
+ "gen_kwargs": null,
88
+ "random_seed": 0,
89
+ "numpy_seed": 1234,
90
+ "torch_seed": 1234,
91
+ "fewshot_seed": 1234
92
+ },
93
+ "git_hash": "b955b2950",
94
+ "date": 1739617421.4265695,
95
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
96
+ "transformers_version": "4.48.3",
97
+ "upper_git_hash": null,
98
+ "tokenizer_pad_token": [
99
+ "<unk>",
100
+ "0"
101
+ ],
102
+ "tokenizer_eos_token": [
103
+ "</s>",
104
+ "2"
105
+ ],
106
+ "tokenizer_bos_token": [
107
+ "<s>",
108
+ "1"
109
+ ],
110
+ "eot_token_id": 2,
111
+ "max_length": 4096,
112
+ "task_hashes": {
113
+ "etec_v2": "a0d87bf7eb82815b66ea544cb632aafb803526dee24b399f30fdc751be442b60"
114
+ },
115
+ "model_source": "hf",
116
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
117
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
118
+ "system_instruction": null,
119
+ "system_instruction_sha": null,
120
+ "fewshot_as_multiturn": false,
121
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ bos_token + ' [INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + content.strip() + ' ' + eos_token }}{% endif %}{% endfor %}",
122
+ "chat_template_sha": "f1dff938141b507da4a409b6bb3431382088a97a963acd246a41f2f344ae831f",
123
+ "start_time": 1392110.980523203,
124
+ "end_time": 1392198.883363127,
125
+ "total_evaluation_time_seconds": "87.90283992397599"
126
+ }
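
The `config` block above records everything needed to reproduce this run. A hedged sketch of the equivalent programmatic call, assuming a v0.4-style lm-evaluation-harness API and that this fork's custom task definitions (`lm_eval/tasks/etec_v2`) are importable; the model path is the one recorded in `model_args`:

```python
# Sketch only: re-run etec_v2 as recorded in the config block above.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=/tmp/7b-alpha-v1.27.2.25,trust_remote_code=True",
    tasks=["etec_v2"],
    num_fewshot=0,
    batch_size=1,
    apply_chat_template=True,  # the report records a rendered chat_template
)
print(results["results"]["etec_v2"])
```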
evaluations/ar/Allam-7b-instruct-preview/exams_ar_5_shot.json ADDED
@@ -0,0 +1,121 @@
1
+ {
2
+ "results": {
3
+ "exams_ar": {
4
+ "alias": "exams_ar",
5
+ "acc,none": 0.515828677839851,
6
+ "acc_stderr,none": 0.021585885942816244,
7
+ "acc_norm,none": 0.515828677839851,
8
+ "acc_norm_stderr,none": 0.021585885942816244
9
+ }
10
+ },
11
+ "group_subtasks": {
12
+ "exams_ar": []
13
+ },
14
+ "configs": {
15
+ "exams_ar": {
16
+ "task": "exams_ar",
17
+ "tag": [
18
+ "multiple_choice"
19
+ ],
20
+ "dataset_path": "lm_eval/tasks/exams_ar",
21
+ "dataset_name": "exams_ar",
22
+ "dataset_kwargs": {
23
+ "trust_remote_code": true
24
+ },
25
+ "test_split": "test",
26
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n\n def _process_docs(doc):\n def format_example(doc, keys):\n \"\"\"\n <prompt>\n \u0633\u0624\u0627\u0644:\n A. <choice1>\n B. <choice2>\n C. <choice3>\n D. <choice4>\n \u0627\u062c\u0627\u0628\u0629:\n \"\"\"\n \n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n def _format_subject(subject):\n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n\n keys = [\"A\", \"B\", \"C\", \"D\"]\n \n subject = doc['id'].split(\"-\")[0]\n description = f\"\ufed2\ufef4\ufee3\ufe8d \ufef2\ufee0\ufef3 \ufe84\ufeb4\ufe8c\ufedf\ufe93 \ufe8d\ufefc\ufea8\ufe98\ufef3\ufe8d\ufead \ufee2\ufee7 \ufee2\ufe98\ufecb\ufea9\ufea9 (\ufee2\ufecb \ufe8d\ufefa\ufe9f\ufe8e\ufe91\ufe8e\ufe97) \ufea1\ufeee\ufedf {_format_subject(subject)} \\n\" #\ufee2\ufee7 \ufed2\ufec0\ufee0\ufedb \ufe8e\ufea8\ufe97\ufead \ufe88\ufe9f\ufe8e\ufe91\ufe93 \ufeed\ufe8e\ufea3\ufea9\ufe93 \ufee2\ufee7 \ufe90\ufef4\ufee7 'A\u060c B\u060c C\u060c D' \ufea9\ufeee\ufee7 \ufeb5\ufeae\ufea3\\n\"\n\n out_doc = {\n \"idx\": doc[\"idx\"],\n \"id\": doc[\"id\"],\n 'dsecription': description,\n \"query\": format_example(doc, keys), # \"Question: \" + doc[\"question\"]['stem'] + \"\\nAnswer:\",\n \"choices\": keys,\n \"gold\": [\"A\", \"B\", \"C\", \"D\"].index(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_docs)\n",
27
+ "doc_to_text": "query",
28
+ "doc_to_target": "gold",
29
+ "doc_to_choice": "choices",
30
+ "description": "description",
31
+ "target_delimiter": " ",
32
+ "fewshot_delimiter": "\n\n",
33
+ "num_fewshot": 5,
34
+ "metric_list": [
35
+ {
36
+ "metric": "acc",
37
+ "aggregation": "mean",
38
+ "higher_is_better": true
39
+ },
40
+ {
41
+ "metric": "acc_norm",
42
+ "aggregation": "mean",
43
+ "higher_is_better": true
44
+ }
45
+ ],
46
+ "output_type": "multiple_choice",
47
+ "repeats": 1,
48
+ "should_decontaminate": true,
49
+ "doc_to_decontamination_query": "query",
50
+ "metadata": {
51
+ "version": 0.0
52
+ }
53
+ }
54
+ },
55
+ "versions": {
56
+ "exams_ar": 0.0
57
+ },
58
+ "n-shot": {
59
+ "exams_ar": 5
60
+ },
61
+ "higher_is_better": {
62
+ "exams_ar": {
63
+ "acc": true,
64
+ "acc_norm": true
65
+ }
66
+ },
67
+ "n-samples": {
68
+ "exams_ar": {
69
+ "original": 537,
70
+ "effective": 537
71
+ }
72
+ },
73
+ "config": {
74
+ "model": "vllm",
75
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.8",
76
+ "batch_size": 1,
77
+ "batch_sizes": [],
78
+ "device": null,
79
+ "use_cache": null,
80
+ "limit": null,
81
+ "bootstrap_iters": 100000,
82
+ "gen_kwargs": null,
83
+ "random_seed": 0,
84
+ "numpy_seed": 1234,
85
+ "torch_seed": 1234,
86
+ "fewshot_seed": 1234
87
+ },
88
+ "git_hash": "8e1bd48d",
89
+ "date": 1735662207.0830526,
90
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
91
+ "transformers_version": "4.47.1",
92
+ "upper_git_hash": null,
93
+ "tokenizer_pad_token": [
94
+ "<unk>",
95
+ "0"
96
+ ],
97
+ "tokenizer_eos_token": [
98
+ "</s>",
99
+ "2"
100
+ ],
101
+ "tokenizer_bos_token": [
102
+ "<s>",
103
+ "1"
104
+ ],
105
+ "eot_token_id": 2,
106
+ "max_length": 4096,
107
+ "task_hashes": {
108
+ "exams_ar": "b1561abd56354d570ac16bf64163b0ee8dc6c507234b05f678576b09c26c644a"
109
+ },
110
+ "model_source": "vllm",
111
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
112
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
113
+ "system_instruction": null,
114
+ "system_instruction_sha": null,
115
+ "fewshot_as_multiturn": false,
116
+ "chat_template": null,
117
+ "chat_template_sha": null,
118
+ "start_time": 2867.397536365,
119
+ "end_time": 2948.510496752,
120
+ "total_evaluation_time_seconds": "81.11296038699993"
121
+ }
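
As a sanity check on the reported uncertainty: the `acc_stderr,none` value above is consistent with the sample standard error of a proportion, sqrt(p(1-p)/(n-1)), using the 537 effective samples recorded in `n-samples`:

```python
# Verifying acc_stderr,none for exams_ar from acc,none and n-samples above.
import math

p, n = 0.515828677839851, 537
print(math.sqrt(p * (1 - p) / (n - 1)))  # ~0.0215859, matching the report
```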
evaluations/ar/Allam-7b-instruct-preview/gat_0_shot.json ADDED
@@ -0,0 +1,549 @@
1
+ {
2
+ "results": {
3
+ "gat": {
4
+ "acc,none": 0.4452527279568544,
5
+ "acc_stderr,none": 0.0038711388833064567,
6
+ "alias": "gat"
7
+ },
8
+ "gat_algebra": {
9
+ "alias": " - gat_algebra",
10
+ "acc,none": 0.40667903525046384,
11
+ "acc_stderr,none": 0.009463939247454995
12
+ },
13
+ "gat_analogy": {
14
+ "alias": " - gat_analogy",
15
+ "acc,none": 0.35919854280510016,
16
+ "acc_stderr,none": 0.009158766245747282
17
+ },
18
+ "gat_arithmetic": {
19
+ "alias": " - gat_arithmetic",
20
+ "acc,none": 0.40154582259845417,
21
+ "acc_stderr,none": 0.009406284814832203
22
+ },
23
+ "gat_association": {
24
+ "alias": " - gat_association",
25
+ "acc,none": 0.5464114832535886,
26
+ "acc_stderr,none": 0.015407801869520031
27
+ },
28
+ "gat_comparisons": {
29
+ "alias": " - gat_comparisons",
30
+ "acc,none": 0.34508196721311474,
31
+ "acc_stderr,none": 0.013616100682624904
32
+ },
33
+ "gat_completion": {
34
+ "alias": " - gat_completion",
35
+ "acc,none": 0.6057851239669422,
36
+ "acc_stderr,none": 0.014054411207805699
37
+ },
38
+ "gat_contextual": {
39
+ "alias": " - gat_contextual",
40
+ "acc,none": 0.3941717791411043,
41
+ "acc_stderr,none": 0.013537713096332765
42
+ },
43
+ "gat_geometry": {
44
+ "alias": " - gat_geometry",
45
+ "acc,none": 0.473972602739726,
46
+ "acc_stderr,none": 0.026171590093068537
47
+ },
48
+ "gat_reading": {
49
+ "alias": " - gat_reading",
50
+ "acc,none": 0.5727788279773157,
51
+ "acc_stderr,none": 0.009620311542503682
52
+ }
53
+ },
54
+ "groups": {
55
+ "gat": {
56
+ "acc,none": 0.4452527279568544,
57
+ "acc_stderr,none": 0.0038711388833064567,
58
+ "alias": "gat"
59
+ }
60
+ },
61
+ "group_subtasks": {
62
+ "gat": [
63
+ "gat_analogy",
64
+ "gat_association",
65
+ "gat_completion",
66
+ "gat_reading",
67
+ "gat_algebra",
68
+ "gat_arithmetic",
69
+ "gat_comparisons",
70
+ "gat_contextual",
71
+ "gat_geometry"
72
+ ]
73
+ },
74
+ "configs": {
75
+ "gat_algebra": {
76
+ "task": "gat_algebra",
77
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
78
+ "dataset_name": "algebra",
79
+ "dataset_kwargs": {
80
+ "trust_remote_code": true
81
+ },
82
+ "test_split": "test",
83
+ "fewshot_split": "validation",
84
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
85
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
86
+ "doc_to_target": "{{label}}",
87
+ "doc_to_choice": [
88
+ "\u0623",
89
+ "\u0628",
90
+ "\u062c",
91
+ "\u062f"
92
+ ],
93
+ "description": "",
94
+ "target_delimiter": " ",
95
+ "fewshot_delimiter": "\n\n",
96
+ "num_fewshot": 0,
97
+ "metric_list": [
98
+ {
99
+ "metric": "acc",
100
+ "aggregation": "mean",
101
+ "higher_is_better": true
102
+ }
103
+ ],
104
+ "output_type": "multiple_choice",
105
+ "repeats": 1,
106
+ "should_decontaminate": false,
107
+ "metadata": {
108
+ "version": 0.0
109
+ }
110
+ },
111
+ "gat_analogy": {
112
+ "task": "gat_analogy",
113
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
114
+ "dataset_name": "analogy",
115
+ "dataset_kwargs": {
116
+ "trust_remote_code": true
117
+ },
118
+ "test_split": "test",
119
+ "fewshot_split": "validation",
120
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
121
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
122
+ "doc_to_target": "{{label}}",
123
+ "doc_to_choice": [
124
+ "\u0623",
125
+ "\u0628",
126
+ "\u062c",
127
+ "\u062f"
128
+ ],
129
+ "description": "",
130
+ "target_delimiter": " ",
131
+ "fewshot_delimiter": "\n\n",
132
+ "num_fewshot": 0,
133
+ "metric_list": [
134
+ {
135
+ "metric": "acc",
136
+ "aggregation": "mean",
137
+ "higher_is_better": true
138
+ }
139
+ ],
140
+ "output_type": "multiple_choice",
141
+ "repeats": 1,
142
+ "should_decontaminate": false,
143
+ "metadata": {
144
+ "version": 0.0
145
+ }
146
+ },
147
+ "gat_arithmetic": {
148
+ "task": "gat_arithmetic",
149
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
150
+ "dataset_name": "arithmetic",
151
+ "dataset_kwargs": {
152
+ "trust_remote_code": true
153
+ },
154
+ "test_split": "test",
155
+ "fewshot_split": "validation",
156
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
157
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
158
+ "doc_to_target": "{{label}}",
159
+ "doc_to_choice": [
160
+ "\u0623",
161
+ "\u0628",
162
+ "\u062c",
163
+ "\u062f"
164
+ ],
165
+ "description": "",
166
+ "target_delimiter": " ",
167
+ "fewshot_delimiter": "\n\n",
168
+ "num_fewshot": 0,
169
+ "metric_list": [
170
+ {
171
+ "metric": "acc",
172
+ "aggregation": "mean",
173
+ "higher_is_better": true
174
+ }
175
+ ],
176
+ "output_type": "multiple_choice",
177
+ "repeats": 1,
178
+ "should_decontaminate": false,
179
+ "metadata": {
180
+ "version": 0.0
181
+ }
182
+ },
183
+ "gat_association": {
184
+ "task": "gat_association",
185
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
186
+ "dataset_name": "association",
187
+ "dataset_kwargs": {
188
+ "trust_remote_code": true
189
+ },
190
+ "test_split": "test",
191
+ "fewshot_split": "validation",
192
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
193
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
194
+ "doc_to_target": "{{label}}",
195
+ "doc_to_choice": [
196
+ "\u0623",
197
+ "\u0628",
198
+ "\u062c",
199
+ "\u062f"
200
+ ],
201
+ "description": "",
202
+ "target_delimiter": " ",
203
+ "fewshot_delimiter": "\n\n",
204
+ "num_fewshot": 0,
205
+ "metric_list": [
206
+ {
207
+ "metric": "acc",
208
+ "aggregation": "mean",
209
+ "higher_is_better": true
210
+ }
211
+ ],
212
+ "output_type": "multiple_choice",
213
+ "repeats": 1,
214
+ "should_decontaminate": false,
215
+ "metadata": {
216
+ "version": 0.0
217
+ }
218
+ },
219
+ "gat_comparisons": {
220
+ "task": "gat_comparisons",
221
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
222
+ "dataset_name": "comparisons",
223
+ "dataset_kwargs": {
224
+ "trust_remote_code": true
225
+ },
226
+ "test_split": "test",
227
+ "fewshot_split": "validation",
228
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
229
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
230
+ "doc_to_target": "{{label}}",
231
+ "doc_to_choice": [
232
+ "\u0623",
233
+ "\u0628",
234
+ "\u062c",
235
+ "\u062f"
236
+ ],
237
+ "description": "",
238
+ "target_delimiter": " ",
239
+ "fewshot_delimiter": "\n\n",
240
+ "num_fewshot": 0,
241
+ "metric_list": [
242
+ {
243
+ "metric": "acc",
244
+ "aggregation": "mean",
245
+ "higher_is_better": true
246
+ }
247
+ ],
248
+ "output_type": "multiple_choice",
249
+ "repeats": 1,
250
+ "should_decontaminate": false,
251
+ "metadata": {
252
+ "version": 0.0
253
+ }
254
+ },
255
+ "gat_completion": {
256
+ "task": "gat_completion",
257
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
258
+ "dataset_name": "completion",
259
+ "dataset_kwargs": {
260
+ "trust_remote_code": true
261
+ },
262
+ "test_split": "test",
263
+ "fewshot_split": "validation",
264
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
265
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
266
+ "doc_to_target": "{{label}}",
267
+ "doc_to_choice": [
268
+ "\u0623",
269
+ "\u0628",
270
+ "\u062c",
271
+ "\u062f"
272
+ ],
273
+ "description": "",
274
+ "target_delimiter": " ",
275
+ "fewshot_delimiter": "\n\n",
276
+ "num_fewshot": 0,
277
+ "metric_list": [
278
+ {
279
+ "metric": "acc",
280
+ "aggregation": "mean",
281
+ "higher_is_better": true
282
+ }
283
+ ],
284
+ "output_type": "multiple_choice",
285
+ "repeats": 1,
286
+ "should_decontaminate": false,
287
+ "metadata": {
288
+ "version": 0.0
289
+ }
290
+ },
291
+ "gat_contextual": {
292
+ "task": "gat_contextual",
293
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
294
+ "dataset_name": "contextual",
295
+ "dataset_kwargs": {
296
+ "trust_remote_code": true
297
+ },
298
+ "test_split": "test",
299
+ "fewshot_split": "validation",
300
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
301
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
302
+ "doc_to_target": "{{label}}",
303
+ "doc_to_choice": [
304
+ "\u0623",
305
+ "\u0628",
306
+ "\u062c",
307
+ "\u062f"
308
+ ],
309
+ "description": "\u0627\u0648\u062c\u062f \u0627\u0644\u062e\u0637\u0623 \u0627\u0644\u0633\u064a\u0627\u0642\u064a \u0641\u064a \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0627\u0644\u062a\u0627\u0644\u064a\u0629 \u0645\u0646 \u0628\u064a\u0646 \u0627\u0644\u062e\u064a\u0627\u0631\u0627\u062a:",
310
+ "target_delimiter": " ",
311
+ "fewshot_delimiter": "\n\n",
312
+ "num_fewshot": 0,
313
+ "metric_list": [
314
+ {
315
+ "metric": "acc",
316
+ "aggregation": "mean",
317
+ "higher_is_better": true
318
+ }
319
+ ],
320
+ "output_type": "multiple_choice",
321
+ "repeats": 1,
322
+ "should_decontaminate": false,
323
+ "metadata": {
324
+ "version": 0.0
325
+ }
326
+ },
327
+ "gat_geometry": {
328
+ "task": "gat_geometry",
329
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
330
+ "dataset_name": "geometry",
331
+ "dataset_kwargs": {
332
+ "trust_remote_code": true
333
+ },
334
+ "test_split": "test",
335
+ "fewshot_split": "validation",
336
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
337
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
338
+ "doc_to_target": "{{label}}",
339
+ "doc_to_choice": [
340
+ "\u0623",
341
+ "\u0628",
342
+ "\u062c",
343
+ "\u062f"
344
+ ],
345
+ "description": "",
346
+ "target_delimiter": " ",
347
+ "fewshot_delimiter": "\n\n",
348
+ "num_fewshot": 0,
349
+ "metric_list": [
350
+ {
351
+ "metric": "acc",
352
+ "aggregation": "mean",
353
+ "higher_is_better": true
354
+ }
355
+ ],
356
+ "output_type": "multiple_choice",
357
+ "repeats": 1,
358
+ "should_decontaminate": false,
359
+ "metadata": {
360
+ "version": 0.0
361
+ }
362
+ },
363
+ "gat_reading": {
364
+ "task": "gat_reading",
365
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
366
+ "dataset_name": "reading",
367
+ "dataset_kwargs": {
368
+ "trust_remote_code": true
369
+ },
370
+ "test_split": "test",
371
+ "fewshot_split": "validation",
372
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
373
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
374
+ "doc_to_target": "{{label}}",
375
+ "doc_to_choice": [
376
+ "\u0623",
377
+ "\u0628",
378
+ "\u062c",
379
+ "\u062f"
380
+ ],
381
+ "description": "",
382
+ "target_delimiter": " ",
383
+ "fewshot_delimiter": "\n\n",
384
+ "num_fewshot": 0,
385
+ "metric_list": [
386
+ {
387
+ "metric": "acc",
388
+ "aggregation": "mean",
389
+ "higher_is_better": true
390
+ }
391
+ ],
392
+ "output_type": "multiple_choice",
393
+ "repeats": 1,
394
+ "should_decontaminate": false,
395
+ "metadata": {
396
+ "version": 0.0
397
+ }
398
+ }
399
+ },
400
+ "versions": {
401
+ "gat": 0,
402
+ "gat_algebra": 0.0,
403
+ "gat_analogy": 0.0,
404
+ "gat_arithmetic": 0.0,
405
+ "gat_association": 0.0,
406
+ "gat_comparisons": 0.0,
407
+ "gat_completion": 0.0,
408
+ "gat_contextual": 0.0,
409
+ "gat_geometry": 0.0,
410
+ "gat_reading": 0.0
411
+ },
412
+ "n-shot": {
413
+ "gat_algebra": 0,
414
+ "gat_analogy": 0,
415
+ "gat_arithmetic": 0,
416
+ "gat_association": 0,
417
+ "gat_comparisons": 0,
418
+ "gat_completion": 0,
419
+ "gat_contextual": 0,
420
+ "gat_geometry": 0,
421
+ "gat_reading": 0
422
+ },
423
+ "higher_is_better": {
424
+ "gat": {
425
+ "acc": true
426
+ },
427
+ "gat_algebra": {
428
+ "acc": true
429
+ },
430
+ "gat_analogy": {
431
+ "acc": true
432
+ },
433
+ "gat_arithmetic": {
434
+ "acc": true
435
+ },
436
+ "gat_association": {
437
+ "acc": true
438
+ },
439
+ "gat_comparisons": {
440
+ "acc": true
441
+ },
442
+ "gat_completion": {
443
+ "acc": true
444
+ },
445
+ "gat_contextual": {
446
+ "acc": true
447
+ },
448
+ "gat_geometry": {
449
+ "acc": true
450
+ },
451
+ "gat_reading": {
452
+ "acc": true
453
+ }
454
+ },
455
+ "n-samples": {
456
+ "gat_analogy": {
457
+ "original": 2745,
458
+ "effective": 2745
459
+ },
460
+ "gat_association": {
461
+ "original": 1045,
462
+ "effective": 1045
463
+ },
464
+ "gat_completion": {
465
+ "original": 1210,
466
+ "effective": 1210
467
+ },
468
+ "gat_reading": {
469
+ "original": 2645,
470
+ "effective": 2645
471
+ },
472
+ "gat_algebra": {
473
+ "original": 2695,
474
+ "effective": 2695
475
+ },
476
+ "gat_arithmetic": {
477
+ "original": 2717,
478
+ "effective": 2717
479
+ },
480
+ "gat_comparisons": {
481
+ "original": 1220,
482
+ "effective": 1220
483
+ },
484
+ "gat_contextual": {
485
+ "original": 1304,
486
+ "effective": 1304
487
+ },
488
+ "gat_geometry": {
489
+ "original": 365,
490
+ "effective": 365
491
+ }
492
+ },
493
+ "config": {
494
+ "model": "vllm",
495
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.8",
496
+ "batch_size": 1,
497
+ "batch_sizes": [],
498
+ "device": null,
499
+ "use_cache": null,
500
+ "limit": null,
501
+ "bootstrap_iters": 100000,
502
+ "gen_kwargs": null,
503
+ "random_seed": 0,
504
+ "numpy_seed": 1234,
505
+ "torch_seed": 1234,
506
+ "fewshot_seed": 1234
507
+ },
508
+ "git_hash": "8e1bd48d",
509
+ "date": 1735664096.2650902,
510
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
511
+ "transformers_version": "4.47.1",
512
+ "upper_git_hash": null,
513
+ "tokenizer_pad_token": [
514
+ "<unk>",
515
+ "0"
516
+ ],
517
+ "tokenizer_eos_token": [
518
+ "</s>",
519
+ "2"
520
+ ],
521
+ "tokenizer_bos_token": [
522
+ "<s>",
523
+ "1"
524
+ ],
525
+ "eot_token_id": 2,
526
+ "max_length": 4096,
527
+ "task_hashes": {
528
+ "gat_analogy": "ede28dec097bfebe8a85a19fa27d001696858276df66254bdb70fc63231f1a83",
529
+ "gat_association": "5d82550d46c4f3cabf370185a8a23cc2eb5b08f1f0c5e210a8a712562a44bd08",
530
+ "gat_completion": "fc3c19dd7f1896696fec1bffc21182804c9b2f1fb8d8c882428a6bb4bb61e370",
531
+ "gat_reading": "93053b187a750d2e87f5488f2d0fda944f3da9195bb04d1c4dee9c4b56fa626a",
532
+ "gat_algebra": "77832c595eaaf156775c3dbb27da0915ef600ebf46a7113ae32a202b0359e8a6",
533
+ "gat_arithmetic": "6a498f75f5cc0ffd1b30f7a6293ba80d08f2a8876d5558d8e934bf57355ff0cc",
534
+ "gat_comparisons": "acb80c0ed8dd07e916a471189aef3a546efc289824b2cc50a32c11dc4c97c9c1",
535
+ "gat_contextual": "de063ed3b94011d74ee24a6532122c9d344fc15e42800db44f0849995a0bc37a",
536
+ "gat_geometry": "3e482885559a4404ee9e97556edc6e49959770a499f4ae2c58f18ad85b91a363"
537
+ },
538
+ "model_source": "vllm",
539
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
540
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
541
+ "system_instruction": null,
542
+ "system_instruction_sha": null,
543
+ "fewshot_as_multiturn": false,
544
+ "chat_template": null,
545
+ "chat_template_sha": null,
546
+ "start_time": 4756.376698655,
547
+ "end_time": 5124.76942052,
548
+ "total_evaluation_time_seconds": "368.39272186499966"
549
+ }
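
The group-level `gat` score above appears to be the sample-weighted (micro) average of the nine subtasks. Recomputing it from the per-task `acc,none` values and `n-samples` recorded in this report:

```python
# Micro-average check for the "gat" group score, using numbers from this file.
subtasks = {
    "gat_analogy":     (0.35919854280510016, 2745),
    "gat_association": (0.5464114832535886, 1045),
    "gat_completion":  (0.6057851239669422, 1210),
    "gat_reading":     (0.5727788279773157, 2645),
    "gat_algebra":     (0.40667903525046384, 2695),
    "gat_arithmetic":  (0.40154582259845417, 2717),
    "gat_comparisons": (0.34508196721311474, 1220),
    "gat_contextual":  (0.3941717791411043, 1304),
    "gat_geometry":    (0.473972602739726, 365),
}
total = sum(n for _, n in subtasks.values())
micro = sum(acc * n for acc, n in subtasks.values()) / total
print(micro)  # ~0.4452527, matching the reported group "acc,none"
```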
evaluations/ar/Allam-7b-instruct-preview/moe_ien_mcq_0_shot.json ADDED
@@ -0,0 +1,127 @@
1
+ {
2
+ "results": {
3
+ "moe_ien_mcq": {
4
+ "alias": "moe_ien_mcq",
5
+ "acc,none": 0.9177177177177177,
6
+ "acc_stderr,none": 0.002749455634736978,
7
+ "acc_norm,none": 0.9177177177177177,
8
+ "acc_norm_stderr,none": 0.002749455634736978
9
+ }
10
+ },
11
+ "group_subtasks": {
12
+ "moe_ien_mcq": []
13
+ },
14
+ "configs": {
15
+ "moe_ien_mcq": {
16
+ "task": "moe_ien_mcq",
17
+ "dataset_path": "lm_eval/tasks/moe_ien_mcq/ien_moe_mcq.py",
18
+ "dataset_name": "moe_ien_mcq",
19
+ "dataset_kwargs": {
20
+ "trust_remote_code": true
21
+ },
22
+ "validation_split": "validation",
23
+ "test_split": "test",
24
+ "fewshot_split": "validation",
25
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.split(\". \", 1)[1] if \". \" in choice else choice\n\n def format_example(doc, keys):\n question = doc[\"Question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"Choices\"])]\n \n )\n prompt = f\"\\n\\n\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys = [\"A\", \"B\", \"C\", \"D\", \"E\", \"F\"][0:len(doc[\"Choices\"])]\n out_doc = {\n \"Query\": format_example(doc, keys), \n \"Choices\": keys,\n \"gold\": int(doc[\"Answer\"])-1, ## \n } \n return out_doc\n \n return dataset.map(_process_docs)\n",
26
+ "doc_to_text": "Query",
27
+ "doc_to_target": "gold",
28
+ "doc_to_choice": "{{Choices}}",
29
+ "description": "\u0641\u064a\u0645\u0627\u202f\u064a\u0644\u064a\u202f\u0623\u0633\u0626\u0644\u0629\u202f\u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631\u202f\u0645\u0646\u202f\u0645\u062a\u0639\u062f\u062f\u202f(\u0645\u0639\u202f\u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a)\u202f\u0641\u064a\u202f{{Subject}}",
30
+ "target_delimiter": " ",
31
+ "fewshot_delimiter": "\n\n",
32
+ "fewshot_config": {
33
+ "sampler": "balanced_cat"
34
+ },
35
+ "num_fewshot": 0,
36
+ "metric_list": [
37
+ {
38
+ "metric": "acc",
39
+ "aggregation": "mean",
40
+ "higher_is_better": true
41
+ },
42
+ {
43
+ "metric": "acc_norm",
44
+ "aggregation": "mean",
45
+ "higher_is_better": true
46
+ }
47
+ ],
48
+ "output_type": "multiple_choice",
49
+ "repeats": 1,
50
+ "should_decontaminate": true,
51
+ "doc_to_decontamination_query": "Query",
52
+ "metadata": {
53
+ "version": 0.0
54
+ }
55
+ }
56
+ },
57
+ "versions": {
58
+ "moe_ien_mcq": 0.0
59
+ },
60
+ "n-shot": {
61
+ "moe_ien_mcq": 0
62
+ },
63
+ "higher_is_better": {
64
+ "moe_ien_mcq": {
65
+ "acc": true,
66
+ "acc_norm": true
67
+ }
68
+ },
69
+ "n-samples": {
70
+ "moe_ien_mcq": {
71
+ "original": 9990,
72
+ "effective": 9990
73
+ }
74
+ },
75
+ "config": {
76
+ "model": "hf",
77
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
78
+ "model_num_parameters": 7000559616,
79
+ "model_dtype": "torch.bfloat16",
80
+ "model_revision": "main",
81
+ "model_sha": "",
82
+ "batch_size": 1,
83
+ "batch_sizes": [],
84
+ "device": null,
85
+ "use_cache": null,
86
+ "limit": null,
87
+ "bootstrap_iters": 100000,
88
+ "gen_kwargs": null,
89
+ "random_seed": 0,
90
+ "numpy_seed": 1234,
91
+ "torch_seed": 1234,
92
+ "fewshot_seed": 1234
93
+ },
94
+ "git_hash": "b955b2950",
95
+ "date": 1739617571.8184838,
96
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
97
+ "transformers_version": "4.48.3",
98
+ "upper_git_hash": null,
99
+ "tokenizer_pad_token": [
100
+ "<unk>",
101
+ "0"
102
+ ],
103
+ "tokenizer_eos_token": [
104
+ "</s>",
105
+ "2"
106
+ ],
107
+ "tokenizer_bos_token": [
108
+ "<s>",
109
+ "1"
110
+ ],
111
+ "eot_token_id": 2,
112
+ "max_length": 4096,
113
+ "task_hashes": {
114
+ "moe_ien_mcq": "504533b140426f12c89d975ef421328fc89d69af8719c420a1bf897ed4724191"
115
+ },
116
+ "model_source": "hf",
117
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
118
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
119
+ "system_instruction": null,
120
+ "system_instruction_sha": null,
121
+ "fewshot_as_multiturn": false,
122
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ bos_token + ' [INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + content.strip() + ' ' + eos_token }}{% endif %}{% endfor %}",
123
+ "chat_template_sha": "f1dff938141b507da4a409b6bb3431382088a97a963acd246a41f2f344ae831f",
124
+ "start_time": 1392261.292633723,
125
+ "end_time": 1392626.942167409,
126
+ "total_evaluation_time_seconds": "365.64953368599527"
127
+ }
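The chat_template recorded above is the stock Llama-2 instruct format, and everything needed to replay it (the template string plus the bos/eos tokens) sits in the same JSON file. A minimal sketch with plain jinja2, assuming a local checkout of this repo; the moe_ien_mcq path is inferred from the neighbouring diff headers:

    # Render the recorded chat template for a one-turn conversation.
    # Plain jinja2 suffices here: the template's raise_exception branch is
    # only reached when roles fail to alternate, which this input satisfies.
    import json
    from jinja2 import Template

    path = "evaluations/ar/Allam-7b-instruct-preview/moe_ien_mcq_0_shot.json"
    with open(path) as f:
        meta = json.load(f)

    prompt = Template(meta["chat_template"]).render(
        messages=[{"role": "user", "content": "Hello"}],
        bos_token=meta["tokenizer_bos_token"][0],  # "<s>"
        eos_token=meta["tokenizer_eos_token"][0],  # "</s>"
    )
    print(prompt)  # <s> [INST] Hello [/INST]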
evaluations/ar/Allam-7b-instruct-preview/moe_ien_tf_0_shot.json ADDED
@@ -0,0 +1,129 @@
1
+ {
2
+ "results": {
3
+ "moe_ien_tf": {
4
+ "alias": "moe_ien_tf",
5
+ "acc,none": 0.8294693456980937,
6
+ "acc_stderr,none": 0.004929073554117403,
7
+ "acc_norm,none": 0.8294693456980937,
8
+ "acc_norm_stderr,none": 0.004929073554117403
9
+ }
10
+ },
11
+ "group_subtasks": {
12
+ "moe_ien_tf": []
13
+ },
14
+ "configs": {
15
+ "moe_ien_tf": {
16
+ "task": "moe_ien_tf",
17
+ "tag": [
18
+ "multiple_choice"
19
+ ],
20
+ "dataset_path": "lm_eval/tasks/moe_ien_tf/moe_ien_tf.py",
21
+ "dataset_name": "moe_ien_tf",
22
+ "dataset_kwargs": {
23
+ "trust_remote_code": true
24
+ },
25
+ "validation_split": "validation",
26
+ "test_split": "test",
27
+ "fewshot_split": "validation",
28
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n keys=[\"\u0635\u062d\u064a\u062d\u0629\",\n \"\u062e\u0627\u0637\u0626\u0629\"\n ]\n #keys =[\"\u0635\u0648\u0627\u0628\",\n # \"\u062e\u0637\u0623\"]\n target_key = int(doc[\"Answer\"])-1\n\n out_doc = {\n \"query\": \"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" +doc[\"Question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\", \n \"choices\": keys,\n \"gold\": target_key,\n }\n return out_doc\n return dataset.map(_process_docs)\n",
29
+ "doc_to_text": "query",
30
+ "doc_to_target": "gold",
31
+ "doc_to_choice": "choices",
32
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{Subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d\u064a\u062d\u0629' \u0623\u0648 '\u062e\u0627\u0637\u0626\u0629' \u062f\u0648\u0646 \u0634\u0631\u062d ",
33
+ "target_delimiter": " ",
34
+ "fewshot_delimiter": "\n\n",
35
+ "fewshot_config": {
36
+ "sampler": "balanced_cat"
37
+ },
38
+ "num_fewshot": 0,
39
+ "metric_list": [
40
+ {
41
+ "metric": "acc",
42
+ "aggregation": "mean",
43
+ "higher_is_better": true
44
+ },
45
+ {
46
+ "metric": "acc_norm",
47
+ "aggregation": "mean",
48
+ "higher_is_better": true
49
+ }
50
+ ],
51
+ "output_type": "multiple_choice",
52
+ "repeats": 1,
53
+ "should_decontaminate": false,
54
+ "metadata": {
55
+ "version": 2.0
56
+ }
57
+ }
58
+ },
59
+ "versions": {
60
+ "moe_ien_tf": 2.0
61
+ },
62
+ "n-shot": {
63
+ "moe_ien_tf": 0
64
+ },
65
+ "higher_is_better": {
66
+ "moe_ien_tf": {
67
+ "acc": true,
68
+ "acc_norm": true
69
+ }
70
+ },
71
+ "n-samples": {
72
+ "moe_ien_tf": {
73
+ "original": 5823,
74
+ "effective": 5823
75
+ }
76
+ },
77
+ "config": {
78
+ "model": "hf",
79
+ "model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
80
+ "model_num_parameters": 7000559616,
81
+ "model_dtype": "torch.bfloat16",
82
+ "model_revision": "main",
83
+ "model_sha": "",
84
+ "batch_size": 1,
85
+ "batch_sizes": [],
86
+ "device": null,
87
+ "use_cache": null,
88
+ "limit": null,
89
+ "bootstrap_iters": 100000,
90
+ "gen_kwargs": null,
91
+ "random_seed": 0,
92
+ "numpy_seed": 1234,
93
+ "torch_seed": 1234,
94
+ "fewshot_seed": 1234
95
+ },
96
+ "git_hash": "b955b2950",
97
+ "date": 1739617995.3462336,
98
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
99
+ "transformers_version": "4.48.3",
100
+ "upper_git_hash": null,
101
+ "tokenizer_pad_token": [
102
+ "<unk>",
103
+ "0"
104
+ ],
105
+ "tokenizer_eos_token": [
106
+ "</s>",
107
+ "2"
108
+ ],
109
+ "tokenizer_bos_token": [
110
+ "<s>",
111
+ "1"
112
+ ],
113
+ "eot_token_id": 2,
114
+ "max_length": 4096,
115
+ "task_hashes": {
116
+ "moe_ien_tf": "8701a646f6ea8b9bb96c028f817fbeabfb9031580f5054368b43d14d4a5a1270"
117
+ },
118
+ "model_source": "hf",
119
+ "model_name": "/tmp/7b-alpha-v1.27.2.25",
120
+ "model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
121
+ "system_instruction": null,
122
+ "system_instruction_sha": null,
123
+ "fewshot_as_multiturn": false,
124
+ "chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ bos_token + ' [INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + content.strip() + ' ' + eos_token }}{% endif %}{% endfor %}",
125
+ "chat_template_sha": "f1dff938141b507da4a409b6bb3431382088a97a963acd246a41f2f344ae831f",
126
+ "start_time": 1392684.818305694,
127
+ "end_time": 1392900.218863064,
128
+ "total_evaluation_time_seconds": "215.40055736992508"
129
+ }
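A quick arithmetic check on the block above: with 5823 samples and accuracy aggregated as a plain mean over 0/1 scores (the "mean" aggregation named in the config), the reported acc_stderr is just the sample standard error of that mean:

    # Sanity-check acc_stderr,none for moe_ien_tf: standard error of a mean
    # of n Bernoulli scores, using the unbiased (n - 1) variance estimate.
    import math

    p, n = 0.8294693456980937, 5823
    print(math.sqrt(p * (1 - p) / (n - 1)))  # ~0.0049291, matching the file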
evaluations/ar/Allam-7b-instruct-preview/openaimmlu_0_shot.json ADDED
The diff for this file is too large to render. See raw diff
 
evaluations/ar/Falcon3-7B-Instruct/acva_5_shot.json ADDED
@@ -0,0 +1,123 @@
1
+ {
2
+ "results": {
3
+ "acva": {
4
+ "alias": "acva",
5
+ "acc,none": 0.6045924225028703,
6
+ "acc_stderr,none": 0.00523925695392083,
7
+ "acc_norm,none": 0.5897818599311137,
8
+ "acc_norm_stderr,none": 0.005270708411925859
9
+ }
10
+ },
11
+ "group_subtasks": {
12
+ "acva": []
13
+ },
14
+ "configs": {
15
+ "acva": {
16
+ "task": "acva",
17
+ "tag": [
18
+ "multiple_choice"
19
+ ],
20
+ "dataset_path": "FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment",
21
+ "dataset_kwargs": {
22
+ "trust_remote_code": true
23
+ },
24
+ "test_split": "test",
25
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _format_subject(subject):\n \n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n \n def _generate_subject(doc):\n subject = _format_subject(doc[\"id\"].split(\"-\")[0])\n\n return subject\n \n def _process_docs(doc):\n keys = [\"\u0635\u062d\",\n \"\u062e\u0637\u0623\"]\n subject = _generate_subject(doc)\n gold = keys.index(doc['answer'])\n out_doc = {\n \"id\": doc[\"id\"],\n \"query\": \"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" + doc[\"question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\",\n \"choices\": keys,\n \"gold\": gold,\n \"subject\": subject,\n }\n \n return out_doc\n\n return dataset.map(_process_docs)\n",
26
+ "doc_to_text": "query",
27
+ "doc_to_target": "gold",
28
+ "doc_to_choice": "choices",
29
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d' \u0623\u0648 '\u062e\u0637\u0623' \u062f\u0648\u0646 \u0634\u0631\u062d",
30
+ "target_delimiter": " ",
31
+ "fewshot_delimiter": "\n\n",
32
+ "num_fewshot": 5,
33
+ "metric_list": [
34
+ {
35
+ "metric": "acc",
36
+ "aggregation": "mean",
37
+ "higher_is_better": true
38
+ },
39
+ {
40
+ "metric": "acc_norm",
41
+ "aggregation": "mean",
42
+ "higher_is_better": true
43
+ }
44
+ ],
45
+ "output_type": "multiple_choice",
46
+ "repeats": 1,
47
+ "should_decontaminate": false,
48
+ "metadata": {
49
+ "version": 0.0
50
+ }
51
+ }
52
+ },
53
+ "versions": {
54
+ "acva": 0.0
55
+ },
56
+ "n-shot": {
57
+ "acva": 5
58
+ },
59
+ "higher_is_better": {
60
+ "acva": {
61
+ "acc": true,
62
+ "acc_norm": true
63
+ }
64
+ },
65
+ "n-samples": {
66
+ "acva": {
67
+ "original": 8710,
68
+ "effective": 8710
69
+ }
70
+ },
71
+ "config": {
72
+ "model": "hf",
73
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
74
+ "model_num_parameters": 7455550464,
75
+ "model_dtype": "torch.bfloat16",
76
+ "model_revision": "main",
77
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
78
+ "batch_size": 1,
79
+ "batch_sizes": [],
80
+ "device": null,
81
+ "use_cache": null,
82
+ "limit": null,
83
+ "bootstrap_iters": 100000,
84
+ "gen_kwargs": null,
85
+ "random_seed": 0,
86
+ "numpy_seed": 1234,
87
+ "torch_seed": 1234,
88
+ "fewshot_seed": 1234
89
+ },
90
+ "git_hash": "5e10e017",
91
+ "date": 1736889821.9957027,
92
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
93
+ "transformers_version": "4.48.0",
94
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
95
+ "tokenizer_pad_token": [
96
+ "<|pad|>",
97
+ "2023"
98
+ ],
99
+ "tokenizer_eos_token": [
100
+ "<|endoftext|>",
101
+ "11"
102
+ ],
103
+ "tokenizer_bos_token": [
104
+ null,
105
+ "None"
106
+ ],
107
+ "eot_token_id": 11,
108
+ "max_length": 32768,
109
+ "task_hashes": {
110
+ "acva": "f573ae5740e68711d257f2dc4a23db7c6b1c04895364f1af4b4eb64bfab793a4"
111
+ },
112
+ "model_source": "hf",
113
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
114
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
115
+ "system_instruction": null,
116
+ "system_instruction_sha": null,
117
+ "fewshot_as_multiturn": false,
118
+ "chat_template": null,
119
+ "chat_template_sha": null,
120
+ "start_time": 600072.370318618,
121
+ "end_time": 600217.222010416,
122
+ "total_evaluation_time_seconds": "144.85169179795776"
123
+ }
evaluations/ar/Falcon3-7B-Instruct/ar_ifeval_0_shot.json ADDED
@@ -0,0 +1,142 @@
1
+ {
2
+ "results": {
3
+ "ar_ifeval": {
4
+ "alias": "ar_ifeval",
5
+ "prompt_level_strict_acc,none": 0.08582089552238806,
6
+ "prompt_level_strict_acc_stderr,none": 0.012109752724743699,
7
+ "inst_level_strict_acc,none": 0.47918088737201364,
8
+ "inst_level_strict_acc_stderr,none": "N/A",
9
+ "prompt_level_loose_acc,none": 0.13805970149253732,
10
+ "prompt_level_loose_acc_stderr,none": 0.014914035308708435,
11
+ "inst_level_loose_acc,none": 0.5276450511945392,
12
+ "inst_level_loose_acc_stderr,none": "N/A"
13
+ }
14
+ },
15
+ "group_subtasks": {
16
+ "ar_ifeval": []
17
+ },
18
+ "configs": {
19
+ "ar_ifeval": {
20
+ "task": "ar_ifeval",
21
+ "dataset_path": "lm_eval/tasks/ar_ifeval/ar_ifeval.py",
22
+ "dataset_name": "ar_ifeval",
23
+ "dataset_kwargs": {
24
+ "trust_remote_code": true
25
+ },
26
+ "test_split": "test",
27
+ "doc_to_text": "prompt",
28
+ "doc_to_target": 0,
29
+ "process_results": "def process_results(doc, results):\n\n response = results[0]\n out_strict = process_sample(doc, response, 'strict')\n out_loose = process_sample(doc, response, 'loose')\n\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
30
+ "description": "",
31
+ "target_delimiter": " ",
32
+ "fewshot_delimiter": "\n\n",
33
+ "num_fewshot": 0,
34
+ "metric_list": [
35
+ {
36
+ "metric": "prompt_level_strict_acc",
37
+ "aggregation": "mean",
38
+ "higher_is_better": true
39
+ },
40
+ {
41
+ "metric": "inst_level_strict_acc",
42
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
43
+ "higher_is_better": true
44
+ },
45
+ {
46
+ "metric": "prompt_level_loose_acc",
47
+ "aggregation": "mean",
48
+ "higher_is_better": true
49
+ },
50
+ {
51
+ "metric": "inst_level_loose_acc",
52
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
53
+ "higher_is_better": true
54
+ }
55
+ ],
56
+ "output_type": "generate_until",
57
+ "generation_kwargs": {
58
+ "until": [],
59
+ "do_sample": false,
60
+ "temperature": 0.0,
61
+ "max_gen_toks": 1280
62
+ },
63
+ "repeats": 1,
64
+ "should_decontaminate": false,
65
+ "metadata": {
66
+ "version": 4.0
67
+ }
68
+ }
69
+ },
70
+ "versions": {
71
+ "ar_ifeval": 4.0
72
+ },
73
+ "n-shot": {
74
+ "ar_ifeval": 0
75
+ },
76
+ "higher_is_better": {
77
+ "ar_ifeval": {
78
+ "prompt_level_strict_acc": true,
79
+ "inst_level_strict_acc": true,
80
+ "prompt_level_loose_acc": true,
81
+ "inst_level_loose_acc": true
82
+ }
83
+ },
84
+ "n-samples": {
85
+ "ar_ifeval": {
86
+ "original": 536,
87
+ "effective": 536
88
+ }
89
+ },
90
+ "config": {
91
+ "model": "hf",
92
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
93
+ "model_num_parameters": 7455550464,
94
+ "model_dtype": "torch.bfloat16",
95
+ "model_revision": "main",
96
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
97
+ "batch_size": 1,
98
+ "batch_sizes": [],
99
+ "device": null,
100
+ "use_cache": null,
101
+ "limit": null,
102
+ "bootstrap_iters": 100000,
103
+ "gen_kwargs": null,
104
+ "random_seed": 0,
105
+ "numpy_seed": 1234,
106
+ "torch_seed": 1234,
107
+ "fewshot_seed": 1234
108
+ },
109
+ "git_hash": "b955b2950",
110
+ "date": 1739621196.897086,
111
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
112
+ "transformers_version": "4.48.3",
113
+ "upper_git_hash": null,
114
+ "tokenizer_pad_token": [
115
+ "<|pad|>",
116
+ "2023"
117
+ ],
118
+ "tokenizer_eos_token": [
119
+ "<|endoftext|>",
120
+ "11"
121
+ ],
122
+ "tokenizer_bos_token": [
123
+ null,
124
+ "None"
125
+ ],
126
+ "eot_token_id": 11,
127
+ "max_length": 32768,
128
+ "task_hashes": {
129
+ "ar_ifeval": "ca837eed1e9f468712643d1fab81b7b48c88a8799239851476bdc889990e6b41"
130
+ },
131
+ "model_source": "hf",
132
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
133
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
134
+ "system_instruction": null,
135
+ "system_instruction_sha": null,
136
+ "fewshot_as_multiturn": false,
137
+ "chat_template": "{%- if tools %}\n{{- '<|system|>\\n' }}\n{%- if messages[0]['role'] == 'system' %}\n{{- messages[0]['content'] }}\n{%- set remaining_messages = messages[1:] %}\n{%- else %}\n{%- set remaining_messages = messages %}\n{%- endif %}\n{{- 'You are a Falcon assistant skilled in function calling. You are helpful, respectful, and concise.\\n\\n# Tools\\n\\nYou have access to the following functions. You MUST use them to answer questions when needed. For each function call, you MUST return a JSON object inside <tool_call></tool_call> tags.\\n\\n<tools>' + tools|tojson(indent=2) + '</tools>\\n\\n# Output Format\\n\\nYour response MUST follow this format when making function calls:\\n<tool_call>\\n[\\n {\"name\": \"function_name\", \"arguments\": {\"arg1\": \"value1\", \"arg2\": \"value2\"}},\\n {\"name\": \"another_function\", \"arguments\": {\"arg\": \"value\"}}\\n]\\n</tool_call>\\nIf no function calls are needed, respond normally without the tool_call tags.\\n' }}\n{%- for message in remaining_messages %}\n{%- if message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if message.content %}\n{{- '<|assistant|>\\n' + message['content'] }}\n{%- endif %}\n{%- if message.tool_calls %}\n{{- '\\n<tool_call>\\n' }}\n{{- message.tool_calls|tojson(indent=2) }}\n{{- '\\n</tool_call>' }}\n{%- endif %}\n{{- eos_token + '\\n' }}\n{%- elif message['role'] == 'tool' %}\n{{- '<|assistant|>\\n<tool_response>\\n' + message['content'] + '\\n</tool_response>\\n' }}\n{%- endif %}\n{%- endfor %}\n{{- '<|assistant|>\\n' if add_generation_prompt }}\n{%- else %}\n{%- for message in messages %}\n{%- if message['role'] == 'system' %}\n{{- '<|system|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if not loop.last %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token + '\\n' }}\n{%- else %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token }}\n{%- endif %}\n{%- endif %}\n{%- if loop.last and add_generation_prompt %}\n{{- '<|assistant|>\\n' }}\n{%- endif %}\n{%- endfor %}\n{%- endif %}",
138
+ "chat_template_sha": "914ccd80356f5822d1a50d97546e37f60c04ed831fe431aa40346574ec266901",
139
+ "start_time": 1395880.012817552,
140
+ "end_time": 1401371.318791154,
141
+ "total_evaluation_time_seconds": "5491.305973601993"
142
+ }
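The "N/A" stderr entries above are expected rather than an error: the inst_level metrics use the custom agg_inst_level_acc aggregation shown in the config, which flattens the per-prompt lists of instruction pass/fail flags before averaging, which is presumably why no stderr is recorded for them. Extracted and run standalone:

    # The aggregation function from the ar_ifeval config above, verbatim.
    def agg_inst_level_acc(items):
        flat_items = [item for sublist in items for item in sublist]
        return sum(flat_items) / len(flat_items)

    # Two prompts carrying 2 and 3 instructions respectively:
    print(agg_inst_level_acc([[True, False], [True, True, False]]))  # 0.6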
evaluations/ar/Falcon3-7B-Instruct/araMath_v3_5_shot.json ADDED
@@ -0,0 +1,126 @@
1
+ {
2
+ "results": {
3
+ "araMath_v3": {
4
+ "alias": "araMath_v3",
5
+ "acc,none": 0.5652892561983471,
6
+ "acc_stderr,none": 0.020170519477736983,
7
+ "acc_norm,none": 0.5652892561983471,
8
+ "acc_norm_stderr,none": 0.020170519477736983
9
+ }
10
+ },
11
+ "group_subtasks": {
12
+ "araMath_v3": []
13
+ },
14
+ "configs": {
15
+ "araMath_v3": {
16
+ "task": "araMath_v3",
17
+ "tag": [
18
+ "multiple_choice"
19
+ ],
20
+ "dataset_path": "lm_eval/tasks/araMath_v3/araMath_v3.py",
21
+ "dataset_name": "araMath_v3",
22
+ "dataset_kwargs": {
23
+ "trust_remote_code": true
24
+ },
25
+ "validation_split": "validation",
26
+ "test_split": "test",
27
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def remove_prefix(choice):\n prefixes = [\"(A)\", \"(B)\", \"(C)\", \"(D)\"]\n for prefix in prefixes:\n if choice.startswith(prefix + \" \"):\n return choice[len(prefix) + 1:] \n return choice \n\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"options\"])]\n )\n\n prompt = f\"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": keys_en.index(doc[\"label\"]),\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
28
+ "doc_to_text": "query",
29
+ "doc_to_target": "gold",
30
+ "doc_to_choice": "{{choices}}",
31
+ "description": "\u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u0645\u0646 \u0628\u064a\u0646 'A\u060c B\u060c C\u060c D' \u062f\u0648\u0646 \u0634\u0631\u062d",
32
+ "target_delimiter": " ",
33
+ "fewshot_delimiter": "\n\n",
34
+ "num_fewshot": 5,
35
+ "metric_list": [
36
+ {
37
+ "metric": "acc",
38
+ "aggregation": "mean",
39
+ "higher_is_better": true
40
+ },
41
+ {
42
+ "metric": "acc_norm",
43
+ "aggregation": "mean",
44
+ "higher_is_better": true
45
+ }
46
+ ],
47
+ "output_type": "multiple_choice",
48
+ "repeats": 1,
49
+ "should_decontaminate": true,
50
+ "doc_to_decontamination_query": "query",
51
+ "metadata": {
52
+ "version": 0.0
53
+ }
54
+ }
55
+ },
56
+ "versions": {
57
+ "araMath_v3": 0.0
58
+ },
59
+ "n-shot": {
60
+ "araMath_v3": 5
61
+ },
62
+ "higher_is_better": {
63
+ "araMath_v3": {
64
+ "acc": true,
65
+ "acc_norm": true
66
+ }
67
+ },
68
+ "n-samples": {
69
+ "araMath_v3": {
70
+ "original": 605,
71
+ "effective": 605
72
+ }
73
+ },
74
+ "config": {
75
+ "model": "hf",
76
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
77
+ "model_num_parameters": 7455550464,
78
+ "model_dtype": "torch.bfloat16",
79
+ "model_revision": "main",
80
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
81
+ "batch_size": 1,
82
+ "batch_sizes": [],
83
+ "device": null,
84
+ "use_cache": null,
85
+ "limit": null,
86
+ "bootstrap_iters": 100000,
87
+ "gen_kwargs": null,
88
+ "random_seed": 0,
89
+ "numpy_seed": 1234,
90
+ "torch_seed": 1234,
91
+ "fewshot_seed": 1234
92
+ },
93
+ "git_hash": "b955b2950",
94
+ "date": 1739621084.921236,
95
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
96
+ "transformers_version": "4.48.3",
97
+ "upper_git_hash": null,
98
+ "tokenizer_pad_token": [
99
+ "<|pad|>",
100
+ "2023"
101
+ ],
102
+ "tokenizer_eos_token": [
103
+ "<|endoftext|>",
104
+ "11"
105
+ ],
106
+ "tokenizer_bos_token": [
107
+ null,
108
+ "None"
109
+ ],
110
+ "eot_token_id": 11,
111
+ "max_length": 32768,
112
+ "task_hashes": {
113
+ "araMath_v3": "b7e29b20c532c7420cc659c6586d56642070560abff0925ed01ad8f200d8e72b"
114
+ },
115
+ "model_source": "hf",
116
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
117
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
118
+ "system_instruction": null,
119
+ "system_instruction_sha": null,
120
+ "fewshot_as_multiturn": false,
121
+ "chat_template": "{%- if tools %}\n{{- '<|system|>\\n' }}\n{%- if messages[0]['role'] == 'system' %}\n{{- messages[0]['content'] }}\n{%- set remaining_messages = messages[1:] %}\n{%- else %}\n{%- set remaining_messages = messages %}\n{%- endif %}\n{{- 'You are a Falcon assistant skilled in function calling. You are helpful, respectful, and concise.\\n\\n# Tools\\n\\nYou have access to the following functions. You MUST use them to answer questions when needed. For each function call, you MUST return a JSON object inside <tool_call></tool_call> tags.\\n\\n<tools>' + tools|tojson(indent=2) + '</tools>\\n\\n# Output Format\\n\\nYour response MUST follow this format when making function calls:\\n<tool_call>\\n[\\n {\"name\": \"function_name\", \"arguments\": {\"arg1\": \"value1\", \"arg2\": \"value2\"}},\\n {\"name\": \"another_function\", \"arguments\": {\"arg\": \"value\"}}\\n]\\n</tool_call>\\nIf no function calls are needed, respond normally without the tool_call tags.\\n' }}\n{%- for message in remaining_messages %}\n{%- if message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if message.content %}\n{{- '<|assistant|>\\n' + message['content'] }}\n{%- endif %}\n{%- if message.tool_calls %}\n{{- '\\n<tool_call>\\n' }}\n{{- message.tool_calls|tojson(indent=2) }}\n{{- '\\n</tool_call>' }}\n{%- endif %}\n{{- eos_token + '\\n' }}\n{%- elif message['role'] == 'tool' %}\n{{- '<|assistant|>\\n<tool_response>\\n' + message['content'] + '\\n</tool_response>\\n' }}\n{%- endif %}\n{%- endfor %}\n{{- '<|assistant|>\\n' if add_generation_prompt }}\n{%- else %}\n{%- for message in messages %}\n{%- if message['role'] == 'system' %}\n{{- '<|system|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if not loop.last %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token + '\\n' }}\n{%- else %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token }}\n{%- endif %}\n{%- endif %}\n{%- if loop.last and add_generation_prompt %}\n{{- '<|assistant|>\\n' }}\n{%- endif %}\n{%- endfor %}\n{%- endif %}",
122
+ "chat_template_sha": "914ccd80356f5822d1a50d97546e37f60c04ed831fe431aa40346574ec266901",
123
+ "start_time": 1395768.116667791,
124
+ "end_time": 1395816.745740765,
125
+ "total_evaluation_time_seconds": "48.629072973970324"
126
+ }
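The process_docs above first strips any leading "(A) "-style marker from the raw options before reformatting them. The helper, extracted verbatim for illustration:

    # remove_prefix as embedded in the araMath_v3 config above.
    def remove_prefix(choice):
        prefixes = ["(A)", "(B)", "(C)", "(D)"]
        for prefix in prefixes:
            if choice.startswith(prefix + " "):
                return choice[len(prefix) + 1:]
        return choice

    print(remove_prefix("(B) 42"))  # "42"
    print(remove_prefix("42"))      # unchanged when no marker is present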
evaluations/ar/Falcon3-7B-Instruct/araPro_0_shot.json ADDED
@@ -0,0 +1,130 @@
1
+ {
2
+ "results": {
3
+ "araPro": {
4
+ "alias": "araPro",
5
+ "acc,none": 0.41471705658868224,
6
+ "acc_stderr,none": 0.006967450316480296,
7
+ "acc_norm,none": 0.41471705658868224,
8
+ "acc_norm_stderr,none": 0.006967450316480296
9
+ }
10
+ },
11
+ "group_subtasks": {
12
+ "araPro": []
13
+ },
14
+ "configs": {
15
+ "araPro": {
16
+ "task": "araPro",
17
+ "tag": [
18
+ "multiple_choice"
19
+ ],
20
+ "dataset_path": "lm_eval/tasks/araPro/araPro.py",
21
+ "dataset_name": "araPro",
22
+ "dataset_kwargs": {
23
+ "trust_remote_code": true
24
+ },
25
+ "validation_split": "validation",
26
+ "test_split": "test",
27
+ "fewshot_split": "validation",
28
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.replace('.', '') if '.' in choice[:2] else choice\n \n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choice_num = ['choice1', 'choice2', 'choice3', 'choice4']\n choices = \"\".join(\n [f\"{key}. {remove_prefix(doc[choice_num[index]])}\\n\" for index, key in enumerate(keys)]\n )\n\n prompt = f\"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n #keys = [\"1\", \"2\", \"3\", \"4\"]\n keys = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys), \n \"choices\": keys,\n \"gold\": doc[\"answer\"]-1,\n } \n\n return out_doc\n \n return dataset.map(_process_docs)\n",
29
+ "doc_to_text": "query",
30
+ "doc_to_target": "gold",
31
+ "doc_to_choice": "{{choices}}",
32
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d",
33
+ "target_delimiter": " ",
34
+ "fewshot_delimiter": "\n\n",
35
+ "fewshot_config": {
36
+ "sampler": "balanced_cat"
37
+ },
38
+ "num_fewshot": 0,
39
+ "metric_list": [
40
+ {
41
+ "metric": "acc",
42
+ "aggregation": "mean",
43
+ "higher_is_better": true
44
+ },
45
+ {
46
+ "metric": "acc_norm",
47
+ "aggregation": "mean",
48
+ "higher_is_better": true
49
+ }
50
+ ],
51
+ "output_type": "multiple_choice",
52
+ "repeats": 1,
53
+ "should_decontaminate": true,
54
+ "doc_to_decontamination_query": "Question",
55
+ "metadata": {
56
+ "version": 2.0
57
+ }
58
+ }
59
+ },
60
+ "versions": {
61
+ "araPro": 2.0
62
+ },
63
+ "n-shot": {
64
+ "araPro": 0
65
+ },
66
+ "higher_is_better": {
67
+ "araPro": {
68
+ "acc": true,
69
+ "acc_norm": true
70
+ }
71
+ },
72
+ "n-samples": {
73
+ "araPro": {
74
+ "original": 5001,
75
+ "effective": 5001
76
+ }
77
+ },
78
+ "config": {
79
+ "model": "hf",
80
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
81
+ "model_num_parameters": 7455550464,
82
+ "model_dtype": "torch.bfloat16",
83
+ "model_revision": "main",
84
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
85
+ "batch_size": 1,
86
+ "batch_sizes": [],
87
+ "device": null,
88
+ "use_cache": null,
89
+ "limit": null,
90
+ "bootstrap_iters": 100000,
91
+ "gen_kwargs": null,
92
+ "random_seed": 0,
93
+ "numpy_seed": 1234,
94
+ "torch_seed": 1234,
95
+ "fewshot_seed": 1234
96
+ },
97
+ "git_hash": "b955b2950",
98
+ "date": 1739617143.3614087,
99
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
100
+ "transformers_version": "4.48.3",
101
+ "upper_git_hash": null,
102
+ "tokenizer_pad_token": [
103
+ "<|pad|>",
104
+ "2023"
105
+ ],
106
+ "tokenizer_eos_token": [
107
+ "<|endoftext|>",
108
+ "11"
109
+ ],
110
+ "tokenizer_bos_token": [
111
+ null,
112
+ "None"
113
+ ],
114
+ "eot_token_id": 11,
115
+ "max_length": 32768,
116
+ "task_hashes": {
117
+ "araPro": "063166ad2e52146b6a051c978bf54b1397281e222da633e81fa50357d2409ee9"
118
+ },
119
+ "model_source": "hf",
120
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
121
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
122
+ "system_instruction": null,
123
+ "system_instruction_sha": null,
124
+ "fewshot_as_multiturn": false,
125
+ "chat_template": "{%- if tools %}\n{{- '<|system|>\\n' }}\n{%- if messages[0]['role'] == 'system' %}\n{{- messages[0]['content'] }}\n{%- set remaining_messages = messages[1:] %}\n{%- else %}\n{%- set remaining_messages = messages %}\n{%- endif %}\n{{- 'You are a Falcon assistant skilled in function calling. You are helpful, respectful, and concise.\\n\\n# Tools\\n\\nYou have access to the following functions. You MUST use them to answer questions when needed. For each function call, you MUST return a JSON object inside <tool_call></tool_call> tags.\\n\\n<tools>' + tools|tojson(indent=2) + '</tools>\\n\\n# Output Format\\n\\nYour response MUST follow this format when making function calls:\\n<tool_call>\\n[\\n {\"name\": \"function_name\", \"arguments\": {\"arg1\": \"value1\", \"arg2\": \"value2\"}},\\n {\"name\": \"another_function\", \"arguments\": {\"arg\": \"value\"}}\\n]\\n</tool_call>\\nIf no function calls are needed, respond normally without the tool_call tags.\\n' }}\n{%- for message in remaining_messages %}\n{%- if message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if message.content %}\n{{- '<|assistant|>\\n' + message['content'] }}\n{%- endif %}\n{%- if message.tool_calls %}\n{{- '\\n<tool_call>\\n' }}\n{{- message.tool_calls|tojson(indent=2) }}\n{{- '\\n</tool_call>' }}\n{%- endif %}\n{{- eos_token + '\\n' }}\n{%- elif message['role'] == 'tool' %}\n{{- '<|assistant|>\\n<tool_response>\\n' + message['content'] + '\\n</tool_response>\\n' }}\n{%- endif %}\n{%- endfor %}\n{{- '<|assistant|>\\n' if add_generation_prompt }}\n{%- else %}\n{%- for message in messages %}\n{%- if message['role'] == 'system' %}\n{{- '<|system|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if not loop.last %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token + '\\n' }}\n{%- else %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token }}\n{%- endif %}\n{%- endif %}\n{%- if loop.last and add_generation_prompt %}\n{{- '<|assistant|>\\n' }}\n{%- endif %}\n{%- endfor %}\n{%- endif %}",
126
+ "chat_template_sha": "914ccd80356f5822d1a50d97546e37f60c04ed831fe431aa40346574ec266901",
127
+ "start_time": 1391826.416201954,
128
+ "end_time": 1394850.089034202,
129
+ "total_evaluation_time_seconds": "3023.672832248034"
130
+ }
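All of these result files share the same top-level layout, so the headline numbers can be read out uniformly. A minimal sketch against the araPro file added above, assuming a local checkout of the repo:

    # Pull the headline metric from one result file in this commit.
    import json

    with open("evaluations/ar/Falcon3-7B-Instruct/araPro_0_shot.json") as f:
        data = json.load(f)

    res = data["results"]["araPro"]
    print(res["acc,none"], res["acc_stderr,none"])
    # 0.41471705658868224 0.006967450316480296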
evaluations/ar/Falcon3-7B-Instruct/arabicmmlu_0_shot.json ADDED
The diff for this file is too large to render. See raw diff
 
evaluations/ar/Falcon3-7B-Instruct/etec_v2_0_shot.json ADDED
@@ -0,0 +1,126 @@
1
+ {
2
+ "results": {
3
+ "etec_v2": {
4
+ "alias": "etec_v2",
5
+ "acc,none": 0.3751987281399046,
6
+ "acc_stderr,none": 0.01114886834610489,
7
+ "acc_norm,none": 0.3751987281399046,
8
+ "acc_norm_stderr,none": 0.01114886834610489
9
+ }
10
+ },
11
+ "group_subtasks": {
12
+ "etec_v2": []
13
+ },
14
+ "configs": {
15
+ "etec_v2": {
16
+ "task": "etec_v2",
17
+ "tag": [
18
+ "multiple_choice"
19
+ ],
20
+ "dataset_path": "lm_eval/tasks/etec_v2/etec.py",
21
+ "dataset_name": "etec_v2",
22
+ "dataset_kwargs": {
23
+ "trust_remote_code": true
24
+ },
25
+ "validation_split": "validation",
26
+ "test_split": "test",
27
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n print(doc[\"label\"])\n keys_ar = [\"\u0623\", \"\u0628\", \"\u062c\", \"\u062f\"]\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": int(doc[\"label\"])-1,\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
28
+ "doc_to_text": "query",
29
+ "doc_to_target": "gold",
30
+ "doc_to_choice": "choices",
31
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d\n ",
32
+ "target_delimiter": " ",
33
+ "fewshot_delimiter": "\n\n",
34
+ "num_fewshot": 0,
35
+ "metric_list": [
36
+ {
37
+ "metric": "acc",
38
+ "aggregation": "mean",
39
+ "higher_is_better": true
40
+ },
41
+ {
42
+ "metric": "acc_norm",
43
+ "aggregation": "mean",
44
+ "higher_is_better": true
45
+ }
46
+ ],
47
+ "output_type": "multiple_choice",
48
+ "repeats": 1,
49
+ "should_decontaminate": true,
50
+ "doc_to_decontamination_query": "query",
51
+ "metadata": {
52
+ "version": 0.0
53
+ }
54
+ }
55
+ },
56
+ "versions": {
57
+ "etec_v2": 0.0
58
+ },
59
+ "n-shot": {
60
+ "etec_v2": 0
61
+ },
62
+ "higher_is_better": {
63
+ "etec_v2": {
64
+ "acc": true,
65
+ "acc_norm": true
66
+ }
67
+ },
68
+ "n-samples": {
69
+ "etec_v2": {
70
+ "original": 1887,
71
+ "effective": 1887
72
+ }
73
+ },
74
+ "config": {
75
+ "model": "hf",
76
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
77
+ "model_num_parameters": 7455550464,
78
+ "model_dtype": "torch.bfloat16",
79
+ "model_revision": "main",
80
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
81
+ "batch_size": 1,
82
+ "batch_sizes": [],
83
+ "device": null,
84
+ "use_cache": null,
85
+ "limit": null,
86
+ "bootstrap_iters": 100000,
87
+ "gen_kwargs": null,
88
+ "random_seed": 0,
89
+ "numpy_seed": 1234,
90
+ "torch_seed": 1234,
91
+ "fewshot_seed": 1234
92
+ },
93
+ "git_hash": "b955b2950",
94
+ "date": 1739620236.678696,
95
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
96
+ "transformers_version": "4.48.3",
97
+ "upper_git_hash": null,
98
+ "tokenizer_pad_token": [
99
+ "<|pad|>",
100
+ "2023"
101
+ ],
102
+ "tokenizer_eos_token": [
103
+ "<|endoftext|>",
104
+ "11"
105
+ ],
106
+ "tokenizer_bos_token": [
107
+ null,
108
+ "None"
109
+ ],
110
+ "eot_token_id": 11,
111
+ "max_length": 32768,
112
+ "task_hashes": {
113
+ "etec_v2": "3a8dc6484af6c9538f122c1bbe5c6866dbe14df841fdf04ab7ff2b6437e8aeae"
114
+ },
115
+ "model_source": "hf",
116
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
117
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
118
+ "system_instruction": null,
119
+ "system_instruction_sha": null,
120
+ "fewshot_as_multiturn": false,
121
+ "chat_template": "{%- if tools %}\n{{- '<|system|>\\n' }}\n{%- if messages[0]['role'] == 'system' %}\n{{- messages[0]['content'] }}\n{%- set remaining_messages = messages[1:] %}\n{%- else %}\n{%- set remaining_messages = messages %}\n{%- endif %}\n{{- 'You are a Falcon assistant skilled in function calling. You are helpful, respectful, and concise.\\n\\n# Tools\\n\\nYou have access to the following functions. You MUST use them to answer questions when needed. For each function call, you MUST return a JSON object inside <tool_call></tool_call> tags.\\n\\n<tools>' + tools|tojson(indent=2) + '</tools>\\n\\n# Output Format\\n\\nYour response MUST follow this format when making function calls:\\n<tool_call>\\n[\\n {\"name\": \"function_name\", \"arguments\": {\"arg1\": \"value1\", \"arg2\": \"value2\"}},\\n {\"name\": \"another_function\", \"arguments\": {\"arg\": \"value\"}}\\n]\\n</tool_call>\\nIf no function calls are needed, respond normally without the tool_call tags.\\n' }}\n{%- for message in remaining_messages %}\n{%- if message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if message.content %}\n{{- '<|assistant|>\\n' + message['content'] }}\n{%- endif %}\n{%- if message.tool_calls %}\n{{- '\\n<tool_call>\\n' }}\n{{- message.tool_calls|tojson(indent=2) }}\n{{- '\\n</tool_call>' }}\n{%- endif %}\n{{- eos_token + '\\n' }}\n{%- elif message['role'] == 'tool' %}\n{{- '<|assistant|>\\n<tool_response>\\n' + message['content'] + '\\n</tool_response>\\n' }}\n{%- endif %}\n{%- endfor %}\n{{- '<|assistant|>\\n' if add_generation_prompt }}\n{%- else %}\n{%- for message in messages %}\n{%- if message['role'] == 'system' %}\n{{- '<|system|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if not loop.last %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token + '\\n' }}\n{%- else %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token }}\n{%- endif %}\n{%- endif %}\n{%- if loop.last and add_generation_prompt %}\n{{- '<|assistant|>\\n' }}\n{%- endif %}\n{%- endfor %}\n{%- endif %}",
122
+ "chat_template_sha": "914ccd80356f5822d1a50d97546e37f60c04ed831fe431aa40346574ec266901",
123
+ "start_time": 1394919.684315533,
124
+ "end_time": 1394995.42617788,
125
+ "total_evaluation_time_seconds": "75.7418623471167"
126
+ }
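Every results file in this commit records its full harness invocation under "config" (model_args, seeds, batch size), so any of these runs can be reconstructed from the dump alone. Below is a minimal re-run sketch against the lm-evaluation-harness Python API; it assumes a harness version that exposes simple_evaluate with an apply_chat_template flag, and that the custom etec_v2 task definition is present in the local lm_eval/tasks checkout (it is not a Hub dataset name).

# Hedged sketch: re-creating the recorded etec_v2 run from its "config" block.
# model_args is copied verbatim from these dumps; seeds live in the same block.
from lm_eval.evaluator import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
    tasks=["etec_v2"],
    num_fewshot=0,              # 0-shot, per the file name
    apply_chat_template=True,   # this dump records a non-null chat_template + sha
)
print(results["results"]["etec_v2"])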
evaluations/ar/Falcon3-7B-Instruct/exams_ar_5_shot.json ADDED
@@ -0,0 +1,125 @@
1
+ {
2
+ "results": {
3
+ "exams_ar": {
4
+ "alias": "exams_ar",
5
+ "acc,none": 0.31843575418994413,
6
+ "acc_stderr,none": 0.020122499132803468,
7
+ "acc_norm,none": 0.31843575418994413,
8
+ "acc_norm_stderr,none": 0.020122499132803468
9
+ }
10
+ },
11
+ "group_subtasks": {
12
+ "exams_ar": []
13
+ },
14
+ "configs": {
15
+ "exams_ar": {
16
+ "task": "exams_ar",
17
+ "tag": [
18
+ "multiple_choice"
19
+ ],
20
+ "dataset_path": "lm_eval/tasks/exams_ar",
21
+ "dataset_name": "exams_ar",
22
+ "dataset_kwargs": {
23
+ "trust_remote_code": true
24
+ },
25
+ "test_split": "test",
26
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n\n def _process_docs(doc):\n def format_example(doc, keys):\n \"\"\"\n <prompt>\n \u0633\u0624\u0627\u0644:\n A. <choice1>\n B. <choice2>\n C. <choice3>\n D. <choice4>\n \u0627\u062c\u0627\u0628\u0629:\n \"\"\"\n \n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n def _format_subject(subject):\n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n\n keys = [\"A\", \"B\", \"C\", \"D\"]\n \n subject = doc['id'].split(\"-\")[0]\n description = f\"\ufed2\ufef4\ufee3\ufe8d \ufef2\ufee0\ufef3 \ufe84\ufeb4\ufe8c\ufedf\ufe93 \ufe8d\ufefc\ufea8\ufe98\ufef3\ufe8d\ufead \ufee2\ufee7 \ufee2\ufe98\ufecb\ufea9\ufea9 (\ufee2\ufecb \ufe8d\ufefa\ufe9f\ufe8e\ufe91\ufe8e\ufe97) \ufea1\ufeee\ufedf {_format_subject(subject)} \\n\" #\ufee2\ufee7 \ufed2\ufec0\ufee0\ufedb \ufe8e\ufea8\ufe97\ufead \ufe88\ufe9f\ufe8e\ufe91\ufe93 \ufeed\ufe8e\ufea3\ufea9\ufe93 \ufee2\ufee7 \ufe90\ufef4\ufee7 'A\u060c B\u060c C\u060c D' \ufea9\ufeee\ufee7 \ufeb5\ufeae\ufea3\\n\"\n\n out_doc = {\n \"idx\": doc[\"idx\"],\n \"id\": doc[\"id\"],\n 'dsecription': description,\n \"query\": format_example(doc, keys), # \"Question: \" + doc[\"question\"]['stem'] + \"\\nAnswer:\",\n \"choices\": keys,\n \"gold\": [\"A\", \"B\", \"C\", \"D\"].index(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_docs)\n",
27
+ "doc_to_text": "query",
28
+ "doc_to_target": "gold",
29
+ "doc_to_choice": "choices",
30
+ "description": "description",
31
+ "target_delimiter": " ",
32
+ "fewshot_delimiter": "\n\n",
33
+ "num_fewshot": 5,
34
+ "metric_list": [
35
+ {
36
+ "metric": "acc",
37
+ "aggregation": "mean",
38
+ "higher_is_better": true
39
+ },
40
+ {
41
+ "metric": "acc_norm",
42
+ "aggregation": "mean",
43
+ "higher_is_better": true
44
+ }
45
+ ],
46
+ "output_type": "multiple_choice",
47
+ "repeats": 1,
48
+ "should_decontaminate": true,
49
+ "doc_to_decontamination_query": "query",
50
+ "metadata": {
51
+ "version": 0.0
52
+ }
53
+ }
54
+ },
55
+ "versions": {
56
+ "exams_ar": 0.0
57
+ },
58
+ "n-shot": {
59
+ "exams_ar": 5
60
+ },
61
+ "higher_is_better": {
62
+ "exams_ar": {
63
+ "acc": true,
64
+ "acc_norm": true
65
+ }
66
+ },
67
+ "n-samples": {
68
+ "exams_ar": {
69
+ "original": 537,
70
+ "effective": 537
71
+ }
72
+ },
73
+ "config": {
74
+ "model": "hf",
75
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
76
+ "model_num_parameters": 7455550464,
77
+ "model_dtype": "torch.bfloat16",
78
+ "model_revision": "main",
79
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
80
+ "batch_size": 1,
81
+ "batch_sizes": [],
82
+ "device": null,
83
+ "use_cache": null,
84
+ "limit": null,
85
+ "bootstrap_iters": 100000,
86
+ "gen_kwargs": null,
87
+ "random_seed": 0,
88
+ "numpy_seed": 1234,
89
+ "torch_seed": 1234,
90
+ "fewshot_seed": 1234
91
+ },
92
+ "git_hash": "5e10e017",
93
+ "date": 1736889028.6416683,
94
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
95
+ "transformers_version": "4.48.0",
96
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
97
+ "tokenizer_pad_token": [
98
+ "<|pad|>",
99
+ "2023"
100
+ ],
101
+ "tokenizer_eos_token": [
102
+ "<|endoftext|>",
103
+ "11"
104
+ ],
105
+ "tokenizer_bos_token": [
106
+ null,
107
+ "None"
108
+ ],
109
+ "eot_token_id": 11,
110
+ "max_length": 32768,
111
+ "task_hashes": {
112
+ "exams_ar": "f52ab3f14b240558420910fdb453ccb45c945cec187c0e60ea51cf6eff08973a"
113
+ },
114
+ "model_source": "hf",
115
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
116
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
117
+ "system_instruction": null,
118
+ "system_instruction_sha": null,
119
+ "fewshot_as_multiturn": false,
120
+ "chat_template": null,
121
+ "chat_template_sha": null,
122
+ "start_time": 599279.04705073,
123
+ "end_time": 599692.233103212,
124
+ "total_evaluation_time_seconds": "413.1860524819931"
125
+ }
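The process_docs source embedded in the config above fully determines the exams_ar prompt, so the text the model scored can be reproduced offline. A standalone mirror of its format_example step follows; the sample doc is illustrative only, not drawn from the dataset.

# Mirrors the format_example logic recorded in the exams_ar config above.
def format_example(doc, keys=("A", "B", "C", "D")):
    question = doc["question"].strip()
    choices = "".join(f"{key}. {choice}\n" for key, choice in zip(keys, doc["choices"]))
    return f"السؤال: {question}\n{choices} \nالاجابة:"  # "Question: ... Answer:"

sample = {"question": "ما عاصمة فرنسا؟", "choices": ["باريس", "لندن", "روما", "برلين"]}
print(format_example(sample))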
evaluations/ar/Falcon3-7B-Instruct/gat_0_shot.json ADDED
@@ -0,0 +1,553 @@
1
+ {
2
+ "results": {
3
+ "gat": {
4
+ "acc,none": 0.27994481374639407,
5
+ "acc_stderr,none": 0.003542796359675536,
6
+ "alias": "gat"
7
+ },
8
+ "gat_algebra": {
9
+ "alias": " - gat_algebra",
10
+ "acc,none": 0.2571428571428571,
11
+ "acc_stderr,none": 0.008420562208967575
12
+ },
13
+ "gat_analogy": {
14
+ "alias": " - gat_analogy",
15
+ "acc,none": 0.24553734061930782,
16
+ "acc_stderr,none": 0.008216476082874105
17
+ },
18
+ "gat_arithmetic": {
19
+ "alias": " - gat_arithmetic",
20
+ "acc,none": 0.26573426573426573,
21
+ "acc_stderr,none": 0.008475894211016492
22
+ },
23
+ "gat_association": {
24
+ "alias": " - gat_association",
25
+ "acc,none": 0.24019138755980862,
26
+ "acc_stderr,none": 0.013221495215360054
27
+ },
28
+ "gat_comparisons": {
29
+ "alias": " - gat_comparisons",
30
+ "acc,none": 0.319672131147541,
31
+ "acc_stderr,none": 0.013357022766710734
32
+ },
33
+ "gat_completion": {
34
+ "alias": " - gat_completion",
35
+ "acc,none": 0.27520661157024795,
36
+ "acc_stderr,none": 0.012844683062506254
37
+ },
38
+ "gat_contextual": {
39
+ "alias": " - gat_contextual",
40
+ "acc,none": 0.26993865030674846,
41
+ "acc_stderr,none": 0.01229815625441917
42
+ },
43
+ "gat_geometry": {
44
+ "alias": " - gat_geometry",
45
+ "acc,none": 0.2876712328767123,
46
+ "acc_stderr,none": 0.023726723391354485
47
+ },
48
+ "gat_reading": {
49
+ "alias": " - gat_reading",
50
+ "acc,none": 0.3568998109640832,
51
+ "acc_stderr,none": 0.009317121354774414
52
+ }
53
+ },
54
+ "groups": {
55
+ "gat": {
56
+ "acc,none": 0.27994481374639407,
57
+ "acc_stderr,none": 0.003542796359675536,
58
+ "alias": "gat"
59
+ }
60
+ },
61
+ "group_subtasks": {
62
+ "gat": [
63
+ "gat_analogy",
64
+ "gat_association",
65
+ "gat_completion",
66
+ "gat_reading",
67
+ "gat_algebra",
68
+ "gat_arithmetic",
69
+ "gat_comparisons",
70
+ "gat_contextual",
71
+ "gat_geometry"
72
+ ]
73
+ },
74
+ "configs": {
75
+ "gat_algebra": {
76
+ "task": "gat_algebra",
77
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
78
+ "dataset_name": "algebra",
79
+ "dataset_kwargs": {
80
+ "trust_remote_code": true
81
+ },
82
+ "test_split": "test",
83
+ "fewshot_split": "validation",
84
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
85
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
86
+ "doc_to_target": "{{label}}",
87
+ "doc_to_choice": [
88
+ "\u0623",
89
+ "\u0628",
90
+ "\u062c",
91
+ "\u062f"
92
+ ],
93
+ "description": "",
94
+ "target_delimiter": " ",
95
+ "fewshot_delimiter": "\n\n",
96
+ "num_fewshot": 0,
97
+ "metric_list": [
98
+ {
99
+ "metric": "acc",
100
+ "aggregation": "mean",
101
+ "higher_is_better": true
102
+ }
103
+ ],
104
+ "output_type": "multiple_choice",
105
+ "repeats": 1,
106
+ "should_decontaminate": false,
107
+ "metadata": {
108
+ "version": 0.0
109
+ }
110
+ },
111
+ "gat_analogy": {
112
+ "task": "gat_analogy",
113
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
114
+ "dataset_name": "analogy",
115
+ "dataset_kwargs": {
116
+ "trust_remote_code": true
117
+ },
118
+ "test_split": "test",
119
+ "fewshot_split": "validation",
120
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
121
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
122
+ "doc_to_target": "{{label}}",
123
+ "doc_to_choice": [
124
+ "\u0623",
125
+ "\u0628",
126
+ "\u062c",
127
+ "\u062f"
128
+ ],
129
+ "description": "",
130
+ "target_delimiter": " ",
131
+ "fewshot_delimiter": "\n\n",
132
+ "num_fewshot": 0,
133
+ "metric_list": [
134
+ {
135
+ "metric": "acc",
136
+ "aggregation": "mean",
137
+ "higher_is_better": true
138
+ }
139
+ ],
140
+ "output_type": "multiple_choice",
141
+ "repeats": 1,
142
+ "should_decontaminate": false,
143
+ "metadata": {
144
+ "version": 0.0
145
+ }
146
+ },
147
+ "gat_arithmetic": {
148
+ "task": "gat_arithmetic",
149
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
150
+ "dataset_name": "arithmetic",
151
+ "dataset_kwargs": {
152
+ "trust_remote_code": true
153
+ },
154
+ "test_split": "test",
155
+ "fewshot_split": "validation",
156
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
157
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
158
+ "doc_to_target": "{{label}}",
159
+ "doc_to_choice": [
160
+ "\u0623",
161
+ "\u0628",
162
+ "\u062c",
163
+ "\u062f"
164
+ ],
165
+ "description": "",
166
+ "target_delimiter": " ",
167
+ "fewshot_delimiter": "\n\n",
168
+ "num_fewshot": 0,
169
+ "metric_list": [
170
+ {
171
+ "metric": "acc",
172
+ "aggregation": "mean",
173
+ "higher_is_better": true
174
+ }
175
+ ],
176
+ "output_type": "multiple_choice",
177
+ "repeats": 1,
178
+ "should_decontaminate": false,
179
+ "metadata": {
180
+ "version": 0.0
181
+ }
182
+ },
183
+ "gat_association": {
184
+ "task": "gat_association",
185
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
186
+ "dataset_name": "association",
187
+ "dataset_kwargs": {
188
+ "trust_remote_code": true
189
+ },
190
+ "test_split": "test",
191
+ "fewshot_split": "validation",
192
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
193
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
194
+ "doc_to_target": "{{label}}",
195
+ "doc_to_choice": [
196
+ "\u0623",
197
+ "\u0628",
198
+ "\u062c",
199
+ "\u062f"
200
+ ],
201
+ "description": "",
202
+ "target_delimiter": " ",
203
+ "fewshot_delimiter": "\n\n",
204
+ "num_fewshot": 0,
205
+ "metric_list": [
206
+ {
207
+ "metric": "acc",
208
+ "aggregation": "mean",
209
+ "higher_is_better": true
210
+ }
211
+ ],
212
+ "output_type": "multiple_choice",
213
+ "repeats": 1,
214
+ "should_decontaminate": false,
215
+ "metadata": {
216
+ "version": 0.0
217
+ }
218
+ },
219
+ "gat_comparisons": {
220
+ "task": "gat_comparisons",
221
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
222
+ "dataset_name": "comparisons",
223
+ "dataset_kwargs": {
224
+ "trust_remote_code": true
225
+ },
226
+ "test_split": "test",
227
+ "fewshot_split": "validation",
228
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
229
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
230
+ "doc_to_target": "{{label}}",
231
+ "doc_to_choice": [
232
+ "\u0623",
233
+ "\u0628",
234
+ "\u062c",
235
+ "\u062f"
236
+ ],
237
+ "description": "",
238
+ "target_delimiter": " ",
239
+ "fewshot_delimiter": "\n\n",
240
+ "num_fewshot": 0,
241
+ "metric_list": [
242
+ {
243
+ "metric": "acc",
244
+ "aggregation": "mean",
245
+ "higher_is_better": true
246
+ }
247
+ ],
248
+ "output_type": "multiple_choice",
249
+ "repeats": 1,
250
+ "should_decontaminate": false,
251
+ "metadata": {
252
+ "version": 0.0
253
+ }
254
+ },
255
+ "gat_completion": {
256
+ "task": "gat_completion",
257
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
258
+ "dataset_name": "completion",
259
+ "dataset_kwargs": {
260
+ "trust_remote_code": true
261
+ },
262
+ "test_split": "test",
263
+ "fewshot_split": "validation",
264
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
265
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
266
+ "doc_to_target": "{{label}}",
267
+ "doc_to_choice": [
268
+ "\u0623",
269
+ "\u0628",
270
+ "\u062c",
271
+ "\u062f"
272
+ ],
273
+ "description": "",
274
+ "target_delimiter": " ",
275
+ "fewshot_delimiter": "\n\n",
276
+ "num_fewshot": 0,
277
+ "metric_list": [
278
+ {
279
+ "metric": "acc",
280
+ "aggregation": "mean",
281
+ "higher_is_better": true
282
+ }
283
+ ],
284
+ "output_type": "multiple_choice",
285
+ "repeats": 1,
286
+ "should_decontaminate": false,
287
+ "metadata": {
288
+ "version": 0.0
289
+ }
290
+ },
291
+ "gat_contextual": {
292
+ "task": "gat_contextual",
293
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
294
+ "dataset_name": "contextual",
295
+ "dataset_kwargs": {
296
+ "trust_remote_code": true
297
+ },
298
+ "test_split": "test",
299
+ "fewshot_split": "validation",
300
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
301
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
302
+ "doc_to_target": "{{label}}",
303
+ "doc_to_choice": [
304
+ "\u0623",
305
+ "\u0628",
306
+ "\u062c",
307
+ "\u062f"
308
+ ],
309
+ "description": "\u0627\u0648\u062c\u062f \u0627\u0644\u062e\u0637\u0623 \u0627\u0644\u0633\u064a\u0627\u0642\u064a \u0641\u064a \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0627\u0644\u062a\u0627\u0644\u064a\u0629 \u0645\u0646 \u0628\u064a\u0646 \u0627\u0644\u062e\u064a\u0627\u0631\u0627\u062a:",
310
+ "target_delimiter": " ",
311
+ "fewshot_delimiter": "\n\n",
312
+ "num_fewshot": 0,
313
+ "metric_list": [
314
+ {
315
+ "metric": "acc",
316
+ "aggregation": "mean",
317
+ "higher_is_better": true
318
+ }
319
+ ],
320
+ "output_type": "multiple_choice",
321
+ "repeats": 1,
322
+ "should_decontaminate": false,
323
+ "metadata": {
324
+ "version": 0.0
325
+ }
326
+ },
327
+ "gat_geometry": {
328
+ "task": "gat_geometry",
329
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
330
+ "dataset_name": "geometry",
331
+ "dataset_kwargs": {
332
+ "trust_remote_code": true
333
+ },
334
+ "test_split": "test",
335
+ "fewshot_split": "validation",
336
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
337
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
338
+ "doc_to_target": "{{label}}",
339
+ "doc_to_choice": [
340
+ "\u0623",
341
+ "\u0628",
342
+ "\u062c",
343
+ "\u062f"
344
+ ],
345
+ "description": "",
346
+ "target_delimiter": " ",
347
+ "fewshot_delimiter": "\n\n",
348
+ "num_fewshot": 0,
349
+ "metric_list": [
350
+ {
351
+ "metric": "acc",
352
+ "aggregation": "mean",
353
+ "higher_is_better": true
354
+ }
355
+ ],
356
+ "output_type": "multiple_choice",
357
+ "repeats": 1,
358
+ "should_decontaminate": false,
359
+ "metadata": {
360
+ "version": 0.0
361
+ }
362
+ },
363
+ "gat_reading": {
364
+ "task": "gat_reading",
365
+ "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
366
+ "dataset_name": "reading",
367
+ "dataset_kwargs": {
368
+ "trust_remote_code": true
369
+ },
370
+ "test_split": "test",
371
+ "fewshot_split": "validation",
372
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
373
+ "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
374
+ "doc_to_target": "{{label}}",
375
+ "doc_to_choice": [
376
+ "\u0623",
377
+ "\u0628",
378
+ "\u062c",
379
+ "\u062f"
380
+ ],
381
+ "description": "",
382
+ "target_delimiter": " ",
383
+ "fewshot_delimiter": "\n\n",
384
+ "num_fewshot": 0,
385
+ "metric_list": [
386
+ {
387
+ "metric": "acc",
388
+ "aggregation": "mean",
389
+ "higher_is_better": true
390
+ }
391
+ ],
392
+ "output_type": "multiple_choice",
393
+ "repeats": 1,
394
+ "should_decontaminate": false,
395
+ "metadata": {
396
+ "version": 0.0
397
+ }
398
+ }
399
+ },
400
+ "versions": {
401
+ "gat": 0,
402
+ "gat_algebra": 0.0,
403
+ "gat_analogy": 0.0,
404
+ "gat_arithmetic": 0.0,
405
+ "gat_association": 0.0,
406
+ "gat_comparisons": 0.0,
407
+ "gat_completion": 0.0,
408
+ "gat_contextual": 0.0,
409
+ "gat_geometry": 0.0,
410
+ "gat_reading": 0.0
411
+ },
412
+ "n-shot": {
413
+ "gat_algebra": 0,
414
+ "gat_analogy": 0,
415
+ "gat_arithmetic": 0,
416
+ "gat_association": 0,
417
+ "gat_comparisons": 0,
418
+ "gat_completion": 0,
419
+ "gat_contextual": 0,
420
+ "gat_geometry": 0,
421
+ "gat_reading": 0
422
+ },
423
+ "higher_is_better": {
424
+ "gat": {
425
+ "acc": true
426
+ },
427
+ "gat_algebra": {
428
+ "acc": true
429
+ },
430
+ "gat_analogy": {
431
+ "acc": true
432
+ },
433
+ "gat_arithmetic": {
434
+ "acc": true
435
+ },
436
+ "gat_association": {
437
+ "acc": true
438
+ },
439
+ "gat_comparisons": {
440
+ "acc": true
441
+ },
442
+ "gat_completion": {
443
+ "acc": true
444
+ },
445
+ "gat_contextual": {
446
+ "acc": true
447
+ },
448
+ "gat_geometry": {
449
+ "acc": true
450
+ },
451
+ "gat_reading": {
452
+ "acc": true
453
+ }
454
+ },
455
+ "n-samples": {
456
+ "gat_analogy": {
457
+ "original": 2745,
458
+ "effective": 2745
459
+ },
460
+ "gat_association": {
461
+ "original": 1045,
462
+ "effective": 1045
463
+ },
464
+ "gat_completion": {
465
+ "original": 1210,
466
+ "effective": 1210
467
+ },
468
+ "gat_reading": {
469
+ "original": 2645,
470
+ "effective": 2645
471
+ },
472
+ "gat_algebra": {
473
+ "original": 2695,
474
+ "effective": 2695
475
+ },
476
+ "gat_arithmetic": {
477
+ "original": 2717,
478
+ "effective": 2717
479
+ },
480
+ "gat_comparisons": {
481
+ "original": 1220,
482
+ "effective": 1220
483
+ },
484
+ "gat_contextual": {
485
+ "original": 1304,
486
+ "effective": 1304
487
+ },
488
+ "gat_geometry": {
489
+ "original": 365,
490
+ "effective": 365
491
+ }
492
+ },
493
+ "config": {
494
+ "model": "hf",
495
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
496
+ "model_num_parameters": 7455550464,
497
+ "model_dtype": "torch.bfloat16",
498
+ "model_revision": "main",
499
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
500
+ "batch_size": 1,
501
+ "batch_sizes": [],
502
+ "device": null,
503
+ "use_cache": null,
504
+ "limit": null,
505
+ "bootstrap_iters": 100000,
506
+ "gen_kwargs": null,
507
+ "random_seed": 0,
508
+ "numpy_seed": 1234,
509
+ "torch_seed": 1234,
510
+ "fewshot_seed": 1234
511
+ },
512
+ "git_hash": "5e10e017",
513
+ "date": 1736891004.0192773,
514
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
515
+ "transformers_version": "4.48.0",
516
+ "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
517
+ "tokenizer_pad_token": [
518
+ "<|pad|>",
519
+ "2023"
520
+ ],
521
+ "tokenizer_eos_token": [
522
+ "<|endoftext|>",
523
+ "11"
524
+ ],
525
+ "tokenizer_bos_token": [
526
+ null,
527
+ "None"
528
+ ],
529
+ "eot_token_id": 11,
530
+ "max_length": 32768,
531
+ "task_hashes": {
532
+ "gat_analogy": "04ac010c48ed039457058b512b7ac0586c7c76a628da7caaf9aeb8f3e99ae5e3",
533
+ "gat_association": "2cbd868d220125bfcc54ae738592ad902191e4b7f804ce1772ae29e2d3bb3bf6",
534
+ "gat_completion": "74cf159ef4a3455a6a0e984fed8e9e9a12f0dc21fde95c2058216c5a711a4d31",
535
+ "gat_reading": "6f21934e536e7dca65361d01e5cafc27f8070c4f0dccf5a88c1fe071194b78a4",
536
+ "gat_algebra": "20750c926608570eaf87d29981e5ab49b2b097bd52d7f749c44ab4e175d9fdd2",
537
+ "gat_arithmetic": "c4b0c73c269d9eb3e8482fbda42e69191c28b95e75e1517d5f9142c6ef410204",
538
+ "gat_comparisons": "88bc22db186a50cab28938ec1fc332366fa0bc886bc98edf810cc9ae938405db",
539
+ "gat_contextual": "b8e88ff29b62b54eb834dca696304ca0fe1ce55d5cf7d0a9f0204456e3955be6",
540
+ "gat_geometry": "229545188469d0512a3297737f4ec7afe88d8a30e7e04f87b4982548e83b1e56"
541
+ },
542
+ "model_source": "hf",
543
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
544
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
545
+ "system_instruction": null,
546
+ "system_instruction_sha": null,
547
+ "fewshot_as_multiturn": false,
548
+ "chat_template": null,
549
+ "chat_template_sha": null,
550
+ "start_time": 601254.206185867,
551
+ "end_time": 601373.470204397,
552
+ "total_evaluation_time_seconds": "119.26401853002608"
553
+ }
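For a single subtask, acc_stderr,none is just the sample standard error of a Bernoulli mean, sqrt(p * (1 - p) / (n - 1)); the top-level gat stderr is an aggregate the harness computes across subtasks. A quick check against the recorded gat_geometry numbers (acc 0.2877 over 365 items) reproduces the stored value:

# Sanity-check a recorded stderr from (acc, n) alone.
import math

p, n = 0.2876712328767123, 365            # gat_geometry "acc,none" and sample count
stderr = math.sqrt(p * (1 - p) / (n - 1))
print(round(stderr, 9))                   # 0.023726723 -> matches "acc_stderr,none"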
evaluations/ar/Falcon3-7B-Instruct/moe_ien_mcq_0_shot.json ADDED
@@ -0,0 +1,127 @@
1
+ {
2
+ "results": {
3
+ "moe_ien_mcq": {
4
+ "alias": "moe_ien_mcq",
5
+ "acc,none": 0.5265265265265265,
6
+ "acc_stderr,none": 0.004995706870392996,
7
+ "acc_norm,none": 0.5265265265265265,
8
+ "acc_norm_stderr,none": 0.004995706870392996
9
+ }
10
+ },
11
+ "group_subtasks": {
12
+ "moe_ien_mcq": []
13
+ },
14
+ "configs": {
15
+ "moe_ien_mcq": {
16
+ "task": "moe_ien_mcq",
17
+ "dataset_path": "lm_eval/tasks/moe_ien_mcq/ien_moe_mcq.py",
18
+ "dataset_name": "moe_ien_mcq",
19
+ "dataset_kwargs": {
20
+ "trust_remote_code": true
21
+ },
22
+ "validation_split": "validation",
23
+ "test_split": "test",
24
+ "fewshot_split": "validation",
25
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.split(\". \", 1)[1] if \". \" in choice else choice\n\n def format_example(doc, keys):\n question = doc[\"Question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"Choices\"])]\n \n )\n prompt = f\"\\n\\n\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys = [\"A\", \"B\", \"C\", \"D\", \"E\", \"F\"][0:len(doc[\"Choices\"])]\n out_doc = {\n \"Query\": format_example(doc, keys), \n \"Choices\": keys,\n \"gold\": int(doc[\"Answer\"])-1, ## \n } \n return out_doc\n \n return dataset.map(_process_docs)\n",
26
+ "doc_to_text": "Query",
27
+ "doc_to_target": "gold",
28
+ "doc_to_choice": "{{Choices}}",
29
+ "description": "\u0641\u064a\u0645\u0627\u202f\u064a\u0644\u064a\u202f\u0623\u0633\u0626\u0644\u0629\u202f\u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631\u202f\u0645\u0646\u202f\u0645\u062a\u0639\u062f\u062f\u202f(\u0645\u0639\u202f\u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a)\u202f\u0641\u064a\u202f{{Subject}}",
30
+ "target_delimiter": " ",
31
+ "fewshot_delimiter": "\n\n",
32
+ "fewshot_config": {
33
+ "sampler": "balanced_cat"
34
+ },
35
+ "num_fewshot": 0,
36
+ "metric_list": [
37
+ {
38
+ "metric": "acc",
39
+ "aggregation": "mean",
40
+ "higher_is_better": true
41
+ },
42
+ {
43
+ "metric": "acc_norm",
44
+ "aggregation": "mean",
45
+ "higher_is_better": true
46
+ }
47
+ ],
48
+ "output_type": "multiple_choice",
49
+ "repeats": 1,
50
+ "should_decontaminate": true,
51
+ "doc_to_decontamination_query": "Query",
52
+ "metadata": {
53
+ "version": 0.0
54
+ }
55
+ }
56
+ },
57
+ "versions": {
58
+ "moe_ien_mcq": 0.0
59
+ },
60
+ "n-shot": {
61
+ "moe_ien_mcq": 0
62
+ },
63
+ "higher_is_better": {
64
+ "moe_ien_mcq": {
65
+ "acc": true,
66
+ "acc_norm": true
67
+ }
68
+ },
69
+ "n-samples": {
70
+ "moe_ien_mcq": {
71
+ "original": 9990,
72
+ "effective": 9990
73
+ }
74
+ },
75
+ "config": {
76
+ "model": "hf",
77
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
78
+ "model_num_parameters": 7455550464,
79
+ "model_dtype": "torch.bfloat16",
80
+ "model_revision": "main",
81
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
82
+ "batch_size": 1,
83
+ "batch_sizes": [],
84
+ "device": null,
85
+ "use_cache": null,
86
+ "limit": null,
87
+ "bootstrap_iters": 100000,
88
+ "gen_kwargs": null,
89
+ "random_seed": 0,
90
+ "numpy_seed": 1234,
91
+ "torch_seed": 1234,
92
+ "fewshot_seed": 1234
93
+ },
94
+ "git_hash": "b955b2950",
95
+ "date": 1739620378.768502,
96
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
97
+ "transformers_version": "4.48.3",
98
+ "upper_git_hash": null,
99
+ "tokenizer_pad_token": [
100
+ "<|pad|>",
101
+ "2023"
102
+ ],
103
+ "tokenizer_eos_token": [
104
+ "<|endoftext|>",
105
+ "11"
106
+ ],
107
+ "tokenizer_bos_token": [
108
+ null,
109
+ "None"
110
+ ],
111
+ "eot_token_id": 11,
112
+ "max_length": 32768,
113
+ "task_hashes": {
114
+ "moe_ien_mcq": "1ae93edb904d572143b5f36dd5dfcc4b901240916d4735ea328083598c912446"
115
+ },
116
+ "model_source": "hf",
117
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
118
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
119
+ "system_instruction": null,
120
+ "system_instruction_sha": null,
121
+ "fewshot_as_multiturn": false,
122
+ "chat_template": "{%- if tools %}\n{{- '<|system|>\\n' }}\n{%- if messages[0]['role'] == 'system' %}\n{{- messages[0]['content'] }}\n{%- set remaining_messages = messages[1:] %}\n{%- else %}\n{%- set remaining_messages = messages %}\n{%- endif %}\n{{- 'You are a Falcon assistant skilled in function calling. You are helpful, respectful, and concise.\\n\\n# Tools\\n\\nYou have access to the following functions. You MUST use them to answer questions when needed. For each function call, you MUST return a JSON object inside <tool_call></tool_call> tags.\\n\\n<tools>' + tools|tojson(indent=2) + '</tools>\\n\\n# Output Format\\n\\nYour response MUST follow this format when making function calls:\\n<tool_call>\\n[\\n {\"name\": \"function_name\", \"arguments\": {\"arg1\": \"value1\", \"arg2\": \"value2\"}},\\n {\"name\": \"another_function\", \"arguments\": {\"arg\": \"value\"}}\\n]\\n</tool_call>\\nIf no function calls are needed, respond normally without the tool_call tags.\\n' }}\n{%- for message in remaining_messages %}\n{%- if message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if message.content %}\n{{- '<|assistant|>\\n' + message['content'] }}\n{%- endif %}\n{%- if message.tool_calls %}\n{{- '\\n<tool_call>\\n' }}\n{{- message.tool_calls|tojson(indent=2) }}\n{{- '\\n</tool_call>' }}\n{%- endif %}\n{{- eos_token + '\\n' }}\n{%- elif message['role'] == 'tool' %}\n{{- '<|assistant|>\\n<tool_response>\\n' + message['content'] + '\\n</tool_response>\\n' }}\n{%- endif %}\n{%- endfor %}\n{{- '<|assistant|>\\n' if add_generation_prompt }}\n{%- else %}\n{%- for message in messages %}\n{%- if message['role'] == 'system' %}\n{{- '<|system|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if not loop.last %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token + '\\n' }}\n{%- else %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token }}\n{%- endif %}\n{%- endif %}\n{%- if loop.last and add_generation_prompt %}\n{{- '<|assistant|>\\n' }}\n{%- endif %}\n{%- endfor %}\n{%- endif %}",
123
+ "chat_template_sha": "914ccd80356f5822d1a50d97546e37f60c04ed831fe431aa40346574ec266901",
124
+ "start_time": 1395061.894176973,
125
+ "end_time": 1395336.684131379,
126
+ "total_evaluation_time_seconds": "274.78995440597646"
127
+ }
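Like the etec_v2 dump above (and unlike the exams_ar and gat runs, where chat_template is null), this file records a non-null chat_template plus its sha, meaning each query was wrapped in Falcon3's chat format before scoring. A sketch of rendering one query the same way with transformers; the message content is a placeholder:

# Render a query through the model's own chat template, as a chat-templated
# harness run would. Needs network or cache access to fetch the tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("tiiuae/Falcon3-7B-Instruct")
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "سؤال: ...\nA. ...\nB. ...\nاجابة:"}],
    tokenize=False,
    add_generation_prompt=True,  # appends the trailing '<|assistant|>\n'
)
print(prompt)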
evaluations/ar/Falcon3-7B-Instruct/moe_ien_tf_0_shot.json ADDED
@@ -0,0 +1,129 @@
1
+ {
2
+ "results": {
3
+ "moe_ien_tf": {
4
+ "alias": "moe_ien_tf",
5
+ "acc,none": 0.576335222393955,
6
+ "acc_stderr,none": 0.006476086786980228,
7
+ "acc_norm,none": 0.576335222393955,
8
+ "acc_norm_stderr,none": 0.006476086786980228
9
+ }
10
+ },
11
+ "group_subtasks": {
12
+ "moe_ien_tf": []
13
+ },
14
+ "configs": {
15
+ "moe_ien_tf": {
16
+ "task": "moe_ien_tf",
17
+ "tag": [
18
+ "multiple_choice"
19
+ ],
20
+ "dataset_path": "lm_eval/tasks/moe_ien_tf/moe_ien_tf.py",
21
+ "dataset_name": "moe_ien_tf",
22
+ "dataset_kwargs": {
23
+ "trust_remote_code": true
24
+ },
25
+ "validation_split": "validation",
26
+ "test_split": "test",
27
+ "fewshot_split": "validation",
28
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n keys=[\"\u0635\u062d\u064a\u062d\u0629\",\n \"\u062e\u0627\u0637\u0626\u0629\"\n ]\n #keys =[\"\u0635\u0648\u0627\u0628\",\n # \"\u062e\u0637\u0623\"]\n target_key = int(doc[\"Answer\"])-1\n\n out_doc = {\n \"query\": \"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" +doc[\"Question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\", \n \"choices\": keys,\n \"gold\": target_key,\n }\n return out_doc\n return dataset.map(_process_docs)\n",
29
+ "doc_to_text": "query",
30
+ "doc_to_target": "gold",
31
+ "doc_to_choice": "choices",
32
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{Subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d\u064a\u062d\u0629' \u0623\u0648 '\u062e\u0627\u0637\u0626\u0629' \u062f\u0648\u0646 \u0634\u0631\u062d ",
33
+ "target_delimiter": " ",
34
+ "fewshot_delimiter": "\n\n",
35
+ "fewshot_config": {
36
+ "sampler": "balanced_cat"
37
+ },
38
+ "num_fewshot": 0,
39
+ "metric_list": [
40
+ {
41
+ "metric": "acc",
42
+ "aggregation": "mean",
43
+ "higher_is_better": true
44
+ },
45
+ {
46
+ "metric": "acc_norm",
47
+ "aggregation": "mean",
48
+ "higher_is_better": true
49
+ }
50
+ ],
51
+ "output_type": "multiple_choice",
52
+ "repeats": 1,
53
+ "should_decontaminate": false,
54
+ "metadata": {
55
+ "version": 2.0
56
+ }
57
+ }
58
+ },
59
+ "versions": {
60
+ "moe_ien_tf": 2.0
61
+ },
62
+ "n-shot": {
63
+ "moe_ien_tf": 0
64
+ },
65
+ "higher_is_better": {
66
+ "moe_ien_tf": {
67
+ "acc": true,
68
+ "acc_norm": true
69
+ }
70
+ },
71
+ "n-samples": {
72
+ "moe_ien_tf": {
73
+ "original": 5823,
74
+ "effective": 5823
75
+ }
76
+ },
77
+ "config": {
78
+ "model": "hf",
79
+ "model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
80
+ "model_num_parameters": 7455550464,
81
+ "model_dtype": "torch.bfloat16",
82
+ "model_revision": "main",
83
+ "model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
84
+ "batch_size": 1,
85
+ "batch_sizes": [],
86
+ "device": null,
87
+ "use_cache": null,
88
+ "limit": null,
89
+ "bootstrap_iters": 100000,
90
+ "gen_kwargs": null,
91
+ "random_seed": 0,
92
+ "numpy_seed": 1234,
93
+ "torch_seed": 1234,
94
+ "fewshot_seed": 1234
95
+ },
96
+ "git_hash": "b955b2950",
97
+ "date": 1739620722.9521024,
98
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
99
+ "transformers_version": "4.48.3",
100
+ "upper_git_hash": null,
101
+ "tokenizer_pad_token": [
102
+ "<|pad|>",
103
+ "2023"
104
+ ],
105
+ "tokenizer_eos_token": [
106
+ "<|endoftext|>",
107
+ "11"
108
+ ],
109
+ "tokenizer_bos_token": [
110
+ null,
111
+ "None"
112
+ ],
113
+ "eot_token_id": 11,
114
+ "max_length": 32768,
115
+ "task_hashes": {
116
+ "moe_ien_tf": "ed81617ccb178d095c9a81fef15f5ba8b655782b26d36117f53c38b0a84e62e5"
117
+ },
118
+ "model_source": "hf",
119
+ "model_name": "tiiuae/Falcon3-7B-Instruct",
120
+ "model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
121
+ "system_instruction": null,
122
+ "system_instruction_sha": null,
123
+ "fewshot_as_multiturn": false,
124
+ "chat_template": "{%- if tools %}\n{{- '<|system|>\\n' }}\n{%- if messages[0]['role'] == 'system' %}\n{{- messages[0]['content'] }}\n{%- set remaining_messages = messages[1:] %}\n{%- else %}\n{%- set remaining_messages = messages %}\n{%- endif %}\n{{- 'You are a Falcon assistant skilled in function calling. You are helpful, respectful, and concise.\\n\\n# Tools\\n\\nYou have access to the following functions. You MUST use them to answer questions when needed. For each function call, you MUST return a JSON object inside <tool_call></tool_call> tags.\\n\\n<tools>' + tools|tojson(indent=2) + '</tools>\\n\\n# Output Format\\n\\nYour response MUST follow this format when making function calls:\\n<tool_call>\\n[\\n {\"name\": \"function_name\", \"arguments\": {\"arg1\": \"value1\", \"arg2\": \"value2\"}},\\n {\"name\": \"another_function\", \"arguments\": {\"arg\": \"value\"}}\\n]\\n</tool_call>\\nIf no function calls are needed, respond normally without the tool_call tags.\\n' }}\n{%- for message in remaining_messages %}\n{%- if message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if message.content %}\n{{- '<|assistant|>\\n' + message['content'] }}\n{%- endif %}\n{%- if message.tool_calls %}\n{{- '\\n<tool_call>\\n' }}\n{{- message.tool_calls|tojson(indent=2) }}\n{{- '\\n</tool_call>' }}\n{%- endif %}\n{{- eos_token + '\\n' }}\n{%- elif message['role'] == 'tool' %}\n{{- '<|assistant|>\\n<tool_response>\\n' + message['content'] + '\\n</tool_response>\\n' }}\n{%- endif %}\n{%- endfor %}\n{{- '<|assistant|>\\n' if add_generation_prompt }}\n{%- else %}\n{%- for message in messages %}\n{%- if message['role'] == 'system' %}\n{{- '<|system|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if not loop.last %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token + '\\n' }}\n{%- else %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token }}\n{%- endif %}\n{%- endif %}\n{%- if loop.last and add_generation_prompt %}\n{{- '<|assistant|>\\n' }}\n{%- endif %}\n{%- endfor %}\n{%- endif %}",
125
+ "chat_template_sha": "914ccd80356f5822d1a50d97546e37f60c04ed831fe431aa40346574ec266901",
126
+ "start_time": 1395406.00589162,
127
+ "end_time": 1395704.54657667,
128
+ "total_evaluation_time_seconds": "298.54068504995666"
129
+ }
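moe_ien_tf is scored as a two-way multiple choice: the process_docs recorded above maps Answer "1"/"2" onto the labels صحيحة / خاطئة (correct / incorrect) and stores the zero-based index as gold. A standalone mirror of that mapping, with an illustrative doc:

# Mirrors the gold-label mapping recorded in the moe_ien_tf config above.
KEYS = ["صحيحة", "خاطئة"]  # "correct", "incorrect"

def to_doc(doc):
    return {
        "query": "\n\nالسؤال:" + doc["Question"] + "\nإجابة:'",  # trailing quote as in the recorded code
        "choices": KEYS,
        "gold": int(doc["Answer"]) - 1,  # "1" -> index 0, "2" -> index 1
    }

print(to_doc({"Question": "مثال توضيحي", "Answer": "2"})["gold"])  # prints 1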
evaluations/ar/Falcon3-7B-Instruct/openaimmlu_0_shot.json ADDED
The diff for this file is too large to render. See raw diff
 
evaluations/ar/Llama-3.3-70B-Instruct/acva_5_shot.json ADDED
@@ -0,0 +1,125 @@
1
+ {
2
+ "results": {
3
+ "acva": {
4
+ "alias": "acva",
5
+ "acc,none": 0.7847301951779564,
6
+ "acc_stderr,none": 0.004404205705558861,
7
+ "acc_norm,none": 0.769345579793341,
8
+ "acc_norm_stderr,none": 0.004513957617295361
9
+ }
10
+ },
11
+ "group_subtasks": {
12
+ "acva": []
13
+ },
14
+ "configs": {
15
+ "acva": {
16
+ "task": "acva",
17
+ "tag": [
18
+ "multiple_choice"
19
+ ],
20
+ "dataset_path": "FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment",
21
+ "dataset_kwargs": {
22
+ "trust_remote_code": true
23
+ },
24
+ "validation_split": "validation",
25
+ "test_split": "test",
26
+ "fewshot_split": "validation",
27
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _format_subject(subject):\n \n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n \n def _generate_subject(doc):\n subject = _format_subject(doc[\"id\"].split(\"-\")[0])\n\n return subject\n \n def _process_docs(doc):\n keys = [\"\u0635\u062d\",\n \"\u062e\u0637\u0623\"]\n subject = _generate_subject(doc)\n gold = keys.index(doc['answer'])\n out_doc = {\n \"id\": doc[\"id\"],\n \"query\": \"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" + doc[\"question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\",\n \"choices\": keys,\n \"gold\": gold,\n \"subject\": subject,\n }\n \n return out_doc\n\n return dataset.map(_process_docs)\n",
28
+ "doc_to_text": "query",
29
+ "doc_to_target": "gold",
30
+ "doc_to_choice": "choices",
31
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d' \u0623\u0648 '\u062e\u0637\u0623' \u062f\u0648\u0646 \u0634\u0631\u062d",
32
+ "target_delimiter": " ",
33
+ "fewshot_delimiter": "\n\n",
34
+ "num_fewshot": 5,
35
+ "metric_list": [
36
+ {
37
+ "metric": "acc",
38
+ "aggregation": "mean",
39
+ "higher_is_better": true
40
+ },
41
+ {
42
+ "metric": "acc_norm",
43
+ "aggregation": "mean",
44
+ "higher_is_better": true
45
+ }
46
+ ],
47
+ "output_type": "multiple_choice",
48
+ "repeats": 1,
49
+ "should_decontaminate": false,
50
+ "metadata": {
51
+ "version": 1.0
52
+ }
53
+ }
54
+ },
55
+ "versions": {
56
+ "acva": 1.0
57
+ },
58
+ "n-shot": {
59
+ "acva": 5
60
+ },
61
+ "higher_is_better": {
62
+ "acva": {
63
+ "acc": true,
64
+ "acc_norm": true
65
+ }
66
+ },
67
+ "n-samples": {
68
+ "acva": {
69
+ "original": 8710,
70
+ "effective": 8710
71
+ }
72
+ },
73
+ "config": {
74
+ "model": "hf",
75
+ "model_args": "pretrained=meta-llama/Llama-3.3-70B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
76
+ "model_num_parameters": 70553706496,
77
+ "model_dtype": "torch.bfloat16",
78
+ "model_revision": "main",
79
+ "model_sha": "6f6073b423013f6a7d4d9f39144961bfbfbc386b",
80
+ "batch_size": "auto",
81
+ "batch_sizes": [
82
+ 64
83
+ ],
84
+ "device": null,
85
+ "use_cache": null,
86
+ "limit": null,
87
+ "bootstrap_iters": 100000,
88
+ "gen_kwargs": null,
89
+ "random_seed": 0,
90
+ "numpy_seed": 1234,
91
+ "torch_seed": 1234,
92
+ "fewshot_seed": 1234
93
+ },
94
+ "git_hash": "788a3672",
95
+ "date": 1737861513.0031924,
96
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
97
+ "transformers_version": "4.48.1",
98
+ "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
99
+ "tokenizer_pad_token": [
100
+ "<|finetune_right_pad_id|>",
101
+ "128004"
102
+ ],
103
+ "tokenizer_eos_token": [
104
+ "<|eot_id|>",
105
+ "128009"
106
+ ],
107
+ "tokenizer_bos_token": [
108
+ "<|begin_of_text|>",
109
+ "128000"
110
+ ],
111
+ "eot_token_id": 128009,
112
+ "max_length": 131072,
113
+ "task_hashes": {},
114
+ "model_source": "hf",
115
+ "model_name": "meta-llama/Llama-3.3-70B-Instruct",
116
+ "model_name_sanitized": "meta-llama__Llama-3.3-70B-Instruct",
117
+ "system_instruction": null,
118
+ "system_instruction_sha": null,
119
+ "fewshot_as_multiturn": false,
120
+ "chat_template": null,
121
+ "chat_template_sha": null,
122
+ "start_time": 822799.725415956,
123
+ "end_time": 824041.525682158,
124
+ "total_evaluation_time_seconds": "1241.8002662019571"
125
+ }
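
The config block above captures the full harness invocation for this 5-shot ACVA run (hf backend, parallelize=True, batch_size "auto" resolving to 64). A minimal sketch of reproducing it through the lm-evaluation-harness Python API, assuming the custom acva task definition shipped in this repository is on the harness task path:

from lm_eval import simple_evaluate

# Mirrors the logged config: HF backend, 5-shot acva, automatic batch sizing.
results = simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.3-70B-Instruct,trust_remote_code=True,parallelize=True",
    tasks=["acva"],
    num_fewshot=5,
    batch_size="auto",
)
print(results["results"]["acva"]["acc,none"])  # 0.7847 in the run logged above
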
evaluations/ar/Llama-3.3-70B-Instruct/ar_ifeval_0_shot.json ADDED
@@ -0,0 +1,142 @@
1
+ {
2
+ "results": {
3
+ "ar_ifeval": {
4
+ "alias": "ar_ifeval",
5
+ "prompt_level_strict_acc,none": 0.7089552238805971,
6
+ "prompt_level_strict_acc_stderr,none": 0.019638685568678992,
7
+ "inst_level_strict_acc,none": 0.8860068259385665,
8
+ "inst_level_strict_acc_stderr,none": "N/A",
9
+ "prompt_level_loose_acc,none": 0.7947761194029851,
10
+ "prompt_level_loose_acc_stderr,none": 0.017460611985170207,
11
+ "inst_level_loose_acc,none": 0.9208191126279863,
12
+ "inst_level_loose_acc_stderr,none": "N/A"
13
+ }
14
+ },
15
+ "group_subtasks": {
16
+ "ar_ifeval": []
17
+ },
18
+ "configs": {
19
+ "ar_ifeval": {
20
+ "task": "ar_ifeval",
21
+ "dataset_path": "lm_eval/tasks/ar_ifeval/ar_ifeval.py",
22
+ "dataset_name": "ar_ifeval",
23
+ "dataset_kwargs": {
24
+ "trust_remote_code": true
25
+ },
26
+ "test_split": "test",
27
+ "doc_to_text": "prompt",
28
+ "doc_to_target": 0,
29
+ "process_results": "def process_results(doc, results):\n\n response = results[0]\n out_strict = process_sample(doc, response, 'strict')\n out_loose = process_sample(doc, response, 'loose')\n\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
30
+ "description": "",
31
+ "target_delimiter": " ",
32
+ "fewshot_delimiter": "\n\n",
33
+ "num_fewshot": 0,
34
+ "metric_list": [
35
+ {
36
+ "metric": "prompt_level_strict_acc",
37
+ "aggregation": "mean",
38
+ "higher_is_better": true
39
+ },
40
+ {
41
+ "metric": "inst_level_strict_acc",
42
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
43
+ "higher_is_better": true
44
+ },
45
+ {
46
+ "metric": "prompt_level_loose_acc",
47
+ "aggregation": "mean",
48
+ "higher_is_better": true
49
+ },
50
+ {
51
+ "metric": "inst_level_loose_acc",
52
+ "aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
53
+ "higher_is_better": true
54
+ }
55
+ ],
56
+ "output_type": "generate_until",
57
+ "generation_kwargs": {
58
+ "until": [],
59
+ "do_sample": false,
60
+ "temperature": 0.0,
61
+ "max_gen_toks": 1280
62
+ },
63
+ "repeats": 1,
64
+ "should_decontaminate": false,
65
+ "metadata": {
66
+ "version": 4.0
67
+ }
68
+ }
69
+ },
70
+ "versions": {
71
+ "ar_ifeval": 4.0
72
+ },
73
+ "n-shot": {
74
+ "ar_ifeval": 0
75
+ },
76
+ "higher_is_better": {
77
+ "ar_ifeval": {
78
+ "prompt_level_strict_acc": true,
79
+ "inst_level_strict_acc": true,
80
+ "prompt_level_loose_acc": true,
81
+ "inst_level_loose_acc": true
82
+ }
83
+ },
84
+ "n-samples": {
85
+ "ar_ifeval": {
86
+ "original": 536,
87
+ "effective": 536
88
+ }
89
+ },
90
+ "config": {
91
+ "model": "hf",
92
+ "model_args": "pretrained=meta-llama/Llama-3.3-70B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
93
+ "model_num_parameters": 70553706496,
94
+ "model_dtype": "torch.bfloat16",
95
+ "model_revision": "main",
96
+ "model_sha": "6f6073b423013f6a7d4d9f39144961bfbfbc386b",
97
+ "batch_size": 1,
98
+ "batch_sizes": [],
99
+ "device": null,
100
+ "use_cache": null,
101
+ "limit": null,
102
+ "bootstrap_iters": 100000,
103
+ "gen_kwargs": null,
104
+ "random_seed": 0,
105
+ "numpy_seed": 1234,
106
+ "torch_seed": 1234,
107
+ "fewshot_seed": 1234
108
+ },
109
+ "git_hash": "788a3672",
110
+ "date": 1738755018.193393,
111
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
112
+ "transformers_version": "4.48.2",
113
+ "upper_git_hash": null,
114
+ "tokenizer_pad_token": [
115
+ "<|finetune_right_pad_id|>",
116
+ "128004"
117
+ ],
118
+ "tokenizer_eos_token": [
119
+ "<|eot_id|>",
120
+ "128009"
121
+ ],
122
+ "tokenizer_bos_token": [
123
+ "<|begin_of_text|>",
124
+ "128000"
125
+ ],
126
+ "eot_token_id": 128009,
127
+ "max_length": 131072,
128
+ "task_hashes": {
129
+ "ar_ifeval": "6bd5bfb26ee4f5909e16d66ee0e564fb2a5826815f16755272465c9e03f98a20"
130
+ },
131
+ "model_source": "hf",
132
+ "model_name": "meta-llama/Llama-3.3-70B-Instruct",
133
+ "model_name_sanitized": "meta-llama__Llama-3.3-70B-Instruct",
134
+ "system_instruction": null,
135
+ "system_instruction_sha": null,
136
+ "fewshot_as_multiturn": false,
137
+ "chat_template": "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- set date_string = \"26 Jul 2024\" %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message + builtin tools #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if builtin_tools is defined or tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{%- if builtin_tools is defined %}\n {{- \"Tools: \" + builtin_tools | reject('equalto', 'code_interpreter') | join(\", \") + \"\\n\\n\"}}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {%- if builtin_tools is defined and tool_call.name in builtin_tools %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- \"<|python_tag|>\" + tool_call.name + \".call(\" }}\n {%- for arg_name, arg_val in tool_call.arguments | items %}\n {{- arg_name + '=\"' + arg_val + '\"' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \")\" }}\n {%- else %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {%- endif %}\n {%- if builtin_tools is defined %}\n {#- This means we're in ipython mode #}\n {{- \"<|eom_id|>\" }}\n {%- else %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n",
138
+ "chat_template_sha": "e10ca381b1ccc5cf9db52e371f3b6651576caee0a630b452e2816b2d404d4b65",
139
+ "start_time": 744977.123888747,
140
+ "end_time": 758450.608805326,
141
+ "total_evaluation_time_seconds": "13473.484916579095"
142
+ }
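
The two inst_level metrics above report their stderr as "N/A" because they are aggregated by the agg_inst_level_acc function embedded in the metric_list: each prompt contributes a list of per-instruction pass/fail booleans (see process_results), and all lists are flattened and averaged. A standalone restatement of that aggregation with an illustrative input:

def agg_inst_level_acc(items):
    # items: one list of per-instruction booleans per prompt, as produced
    # by process_results in the config above.
    flat_items = [item for sublist in items for item in sublist]
    return sum(flat_items) / len(flat_items)

# Two prompts with 3 and 2 instructions; 4 of 5 instructions followed -> 0.8
print(agg_inst_level_acc([[True, True, False], [True, True]]))
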
evaluations/ar/Llama-3.3-70B-Instruct/araMath_v3_5_shot.json ADDED
@@ -0,0 +1,126 @@
1
+ {
2
+ "results": {
3
+ "araMath_v3": {
4
+ "alias": "araMath_v3",
5
+ "acc,none": 0.7090909090909091,
6
+ "acc_stderr,none": 0.01848039016780232,
7
+ "acc_norm,none": 0.7090909090909091,
8
+ "acc_norm_stderr,none": 0.01848039016780232
9
+ }
10
+ },
11
+ "group_subtasks": {
12
+ "araMath_v3": []
13
+ },
14
+ "configs": {
15
+ "araMath_v3": {
16
+ "task": "araMath_v3",
17
+ "tag": [
18
+ "multiple_choice"
19
+ ],
20
+ "dataset_path": "lm_eval/tasks/araMath_v3/araMath_v3.py",
21
+ "dataset_name": "araMath_v3",
22
+ "dataset_kwargs": {
23
+ "trust_remote_code": true
24
+ },
25
+ "validation_split": "validation",
26
+ "test_split": "test",
27
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def remove_prefix(choice):\n prefixes = [\"(A)\", \"(B)\", \"(C)\", \"(D)\"]\n for prefix in prefixes:\n if choice.startswith(prefix + \" \"):\n return choice[len(prefix) + 1:] \n return choice \n\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"options\"])]\n )\n\n prompt = f\"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": keys_en.index(doc[\"label\"]),\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
28
+ "doc_to_text": "query",
29
+ "doc_to_target": "gold",
30
+ "doc_to_choice": "{{choices}}",
31
+ "description": "\u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u0645\u0646 \u0628\u064a\u0646 'A\u060c B\u060c C\u060c D' \u062f\u0648\u0646 \u0634\u0631\u062d",
32
+ "target_delimiter": " ",
33
+ "fewshot_delimiter": "\n\n",
34
+ "num_fewshot": 5,
35
+ "metric_list": [
36
+ {
37
+ "metric": "acc",
38
+ "aggregation": "mean",
39
+ "higher_is_better": true
40
+ },
41
+ {
42
+ "metric": "acc_norm",
43
+ "aggregation": "mean",
44
+ "higher_is_better": true
45
+ }
46
+ ],
47
+ "output_type": "multiple_choice",
48
+ "repeats": 1,
49
+ "should_decontaminate": true,
50
+ "doc_to_decontamination_query": "query",
51
+ "metadata": {
52
+ "version": 0.0
53
+ }
54
+ }
55
+ },
56
+ "versions": {
57
+ "araMath_v3": 0.0
58
+ },
59
+ "n-shot": {
60
+ "araMath_v3": 5
61
+ },
62
+ "higher_is_better": {
63
+ "araMath_v3": {
64
+ "acc": true,
65
+ "acc_norm": true
66
+ }
67
+ },
68
+ "n-samples": {
69
+ "araMath_v3": {
70
+ "original": 605,
71
+ "effective": 605
72
+ }
73
+ },
74
+ "config": {
75
+ "model": "hf",
76
+ "model_args": "pretrained=meta-llama/Llama-3.3-70B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
77
+ "model_num_parameters": 70553706496,
78
+ "model_dtype": "torch.bfloat16",
79
+ "model_revision": "main",
80
+ "model_sha": "6f6073b423013f6a7d4d9f39144961bfbfbc386b",
81
+ "batch_size": 1,
82
+ "batch_sizes": [],
83
+ "device": null,
84
+ "use_cache": null,
85
+ "limit": null,
86
+ "bootstrap_iters": 100000,
87
+ "gen_kwargs": null,
88
+ "random_seed": 0,
89
+ "numpy_seed": 1234,
90
+ "torch_seed": 1234,
91
+ "fewshot_seed": 1234
92
+ },
93
+ "git_hash": "788a3672",
94
+ "date": 1738750317.5038416,
95
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
96
+ "transformers_version": "4.48.2",
97
+ "upper_git_hash": null,
98
+ "tokenizer_pad_token": [
99
+ "<|finetune_right_pad_id|>",
100
+ "128004"
101
+ ],
102
+ "tokenizer_eos_token": [
103
+ "<|eot_id|>",
104
+ "128009"
105
+ ],
106
+ "tokenizer_bos_token": [
107
+ "<|begin_of_text|>",
108
+ "128000"
109
+ ],
110
+ "eot_token_id": 128009,
111
+ "max_length": 131072,
112
+ "task_hashes": {
113
+ "araMath_v3": "154ea94d6776e7d3980c98343cec49115ef3dc4dab8897fb4668f68494d55c76"
114
+ },
115
+ "model_source": "hf",
116
+ "model_name": "meta-llama/Llama-3.3-70B-Instruct",
117
+ "model_name_sanitized": "meta-llama__Llama-3.3-70B-Instruct",
118
+ "system_instruction": null,
119
+ "system_instruction_sha": null,
120
+ "fewshot_as_multiturn": false,
121
+ "chat_template": "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- set date_string = \"26 Jul 2024\" %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message + builtin tools #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if builtin_tools is defined or tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{%- if builtin_tools is defined %}\n {{- \"Tools: \" + builtin_tools | reject('equalto', 'code_interpreter') | join(\", \") + \"\\n\\n\"}}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {%- if builtin_tools is defined and tool_call.name in builtin_tools %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- \"<|python_tag|>\" + tool_call.name + \".call(\" }}\n {%- for arg_name, arg_val in tool_call.arguments | items %}\n {{- arg_name + '=\"' + arg_val + '\"' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \")\" }}\n {%- else %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {%- endif %}\n {%- if builtin_tools is defined %}\n {#- This means we're in ipython mode #}\n {{- \"<|eom_id|>\" }}\n {%- else %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n",
122
+ "chat_template_sha": "e10ca381b1ccc5cf9db52e371f3b6651576caee0a630b452e2816b2d404d4b65",
123
+ "start_time": 740276.643313964,
124
+ "end_time": 740434.169818474,
125
+ "total_evaluation_time_seconds": "157.5265045099659"
126
+ }
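
The process_docs hook above turns each araMath item into an Arabic multiple-choice prompt: "(A)"-style prefixes are stripped from the options, the choices are re-lettered A-D, and the gold index is looked up from the label. A standalone restatement of the formatting step; the question and options are illustrative stand-ins:

def format_example(question, options, keys=("A", "B", "C", "D")):
    # Re-letter the options and append the Arabic answer cue, as in process_docs.
    choices = "".join(f"{key}. {opt}\n" for key, opt in zip(keys, options))
    return f"\n\nالسؤال: {question}\n{choices}\nالاجابة:"

print(format_example("2 + 2 = ?", ["2", "3", "4", "5"]))  # illustrative item
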
evaluations/ar/Llama-3.3-70B-Instruct/araPro_0_shot.json ADDED
@@ -0,0 +1,130 @@
1
+ {
2
+ "results": {
3
+ "araPro": {
4
+ "alias": "araPro",
5
+ "acc,none": 0.7048590281943611,
6
+ "acc_stderr,none": 0.006450314388729491,
7
+ "acc_norm,none": 0.7048590281943611,
8
+ "acc_norm_stderr,none": 0.006450314388729491
9
+ }
10
+ },
11
+ "group_subtasks": {
12
+ "araPro": []
13
+ },
14
+ "configs": {
15
+ "araPro": {
16
+ "task": "araPro",
17
+ "tag": [
18
+ "multiple_choice"
19
+ ],
20
+ "dataset_path": "lm_eval/tasks/araPro/araPro.py",
21
+ "dataset_name": "araPro",
22
+ "dataset_kwargs": {
23
+ "trust_remote_code": true
24
+ },
25
+ "validation_split": "validation",
26
+ "test_split": "test",
27
+ "fewshot_split": "validation",
28
+ "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.replace('.', '') if '.' in choice[:2] else choice\n \n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choice_num = ['choice1', 'choice2', 'choice3', 'choice4']\n choices = \"\".join(\n [f\"{key}. {remove_prefix(doc[choice_num[index]])}\\n\" for index, key in enumerate(keys)]\n )\n\n prompt = f\"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n #keys = [\"1\", \"2\", \"3\", \"4\"]\n keys = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys), \n \"choices\": keys,\n \"gold\": doc[\"answer\"]-1,\n } \n\n return out_doc\n \n return dataset.map(_process_docs)\n",
29
+ "doc_to_text": "query",
30
+ "doc_to_target": "gold",
31
+ "doc_to_choice": "{{choices}}",
32
+ "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d",
33
+ "target_delimiter": " ",
34
+ "fewshot_delimiter": "\n\n",
35
+ "fewshot_config": {
36
+ "sampler": "balanced_cat"
37
+ },
38
+ "num_fewshot": 0,
39
+ "metric_list": [
40
+ {
41
+ "metric": "acc",
42
+ "aggregation": "mean",
43
+ "higher_is_better": true
44
+ },
45
+ {
46
+ "metric": "acc_norm",
47
+ "aggregation": "mean",
48
+ "higher_is_better": true
49
+ }
50
+ ],
51
+ "output_type": "multiple_choice",
52
+ "repeats": 1,
53
+ "should_decontaminate": true,
54
+ "doc_to_decontamination_query": "Question",
55
+ "metadata": {
56
+ "version": 2.0
57
+ }
58
+ }
59
+ },
60
+ "versions": {
61
+ "araPro": 2.0
62
+ },
63
+ "n-shot": {
64
+ "araPro": 0
65
+ },
66
+ "higher_is_better": {
67
+ "araPro": {
68
+ "acc": true,
69
+ "acc_norm": true
70
+ }
71
+ },
72
+ "n-samples": {
73
+ "araPro": {
74
+ "original": 5001,
75
+ "effective": 5001
76
+ }
77
+ },
78
+ "config": {
79
+ "model": "hf",
80
+ "model_args": "pretrained=meta-llama/Llama-3.3-70B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
81
+ "model_num_parameters": 70553706496,
82
+ "model_dtype": "torch.bfloat16",
83
+ "model_revision": "main",
84
+ "model_sha": "6f6073b423013f6a7d4d9f39144961bfbfbc386b",
85
+ "batch_size": 1,
86
+ "batch_sizes": [],
87
+ "device": null,
88
+ "use_cache": null,
89
+ "limit": null,
90
+ "bootstrap_iters": 100000,
91
+ "gen_kwargs": null,
92
+ "random_seed": 0,
93
+ "numpy_seed": 1234,
94
+ "torch_seed": 1234,
95
+ "fewshot_seed": 1234
96
+ },
97
+ "git_hash": "788a3672",
98
+ "date": 1738742514.712935,
99
+ "pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] 
pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
100
+ "transformers_version": "4.48.2",
101
+ "upper_git_hash": null,
102
+ "tokenizer_pad_token": [
103
+ "<|finetune_right_pad_id|>",
104
+ "128004"
105
+ ],
106
+ "tokenizer_eos_token": [
107
+ "<|eot_id|>",
108
+ "128009"
109
+ ],
110
+ "tokenizer_bos_token": [
111
+ "<|begin_of_text|>",
112
+ "128000"
113
+ ],
114
+ "eot_token_id": 128009,
115
+ "max_length": 131072,
116
+ "task_hashes": {
117
+ "araPro": "ab4849e5668de72a27844a2a354787cbce92af5027f46a32300417b41913c5db"
118
+ },
119
+ "model_source": "hf",
120
+ "model_name": "meta-llama/Llama-3.3-70B-Instruct",
121
+ "model_name_sanitized": "meta-llama__Llama-3.3-70B-Instruct",
122
+ "system_instruction": null,
123
+ "system_instruction_sha": null,
124
+ "fewshot_as_multiturn": false,
125
+ "chat_template": "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- set date_string = \"26 Jul 2024\" %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message + builtin tools #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if builtin_tools is defined or tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{%- if builtin_tools is defined %}\n {{- \"Tools: \" + builtin_tools | reject('equalto', 'code_interpreter') | join(\", \") + \"\\n\\n\"}}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' }}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {%- if builtin_tools is defined and tool_call.name in builtin_tools %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- \"<|python_tag|>\" + tool_call.name + \".call(\" }}\n {%- for arg_name, arg_val in tool_call.arguments | items %}\n {{- arg_name + '=\"' + arg_val + '\"' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \")\" }}\n {%- else %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {%- endif %}\n {%- if builtin_tools is defined %}\n {#- This means we're in ipython mode #}\n {{- \"<|eom_id|>\" }}\n {%- else %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n",
126
+ "chat_template_sha": "e10ca381b1ccc5cf9db52e371f3b6651576caee0a630b452e2816b2d404d4b65",
127
+ "start_time": 732473.787962617,
128
+ "end_time": 736407.61692168,
129
+ "total_evaluation_time_seconds": "3933.8289590630447"
130
+ }
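
araPro's process_docs, by contrast, records answers 1-based (answer in 1..4) and converts them to the 0-based gold index the harness scores against via doc["answer"] - 1. A toy restatement with an illustrative document:

keys = ["A", "B", "C", "D"]

def gold_index(doc):
    # araPro stores 1-based answers; multiple-choice scoring is 0-based.
    return doc["answer"] - 1

doc = {"answer": 3}             # illustrative araPro-style record
print(keys[gold_index(doc)])    # -> C
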