Clone model from ALLaM-AI/ALLaM-7B-Instruct-preview
This view is limited to 50 files because it contains too many changes.
- README.md +211 -0
- config.json +28 -0
- evaluations/ar/AceGPT-v2-32B-Chat/acva_5_shot.json +125 -0
- evaluations/ar/AceGPT-v2-32B-Chat/ar_ifeval_0_shot.json +142 -0
- evaluations/ar/AceGPT-v2-32B-Chat/araMath_v3_5_shot.json +126 -0
- evaluations/ar/AceGPT-v2-32B-Chat/araPro_0_shot.json +130 -0
- evaluations/ar/AceGPT-v2-32B-Chat/arabicmmlu_0_shot.json +0 -0
- evaluations/ar/AceGPT-v2-32B-Chat/etec_v2_0_shot.json +126 -0
- evaluations/ar/AceGPT-v2-32B-Chat/exams_ar_5_shot.json +127 -0
- evaluations/ar/AceGPT-v2-32B-Chat/gat_0_shot.json +543 -0
- evaluations/ar/AceGPT-v2-32B-Chat/moe_ien_mcq_0_shot.json +127 -0
- evaluations/ar/AceGPT-v2-32B-Chat/moe_ien_tf_0_shot.json +129 -0
- evaluations/ar/AceGPT-v2-32B-Chat/openaimmlu_0_shot.json +0 -0
- evaluations/ar/AceGPT-v2-8B-Chat/acva_5_shot.json +123 -0
- evaluations/ar/AceGPT-v2-8B-Chat/ar_ifeval_0_shot.json +142 -0
- evaluations/ar/AceGPT-v2-8B-Chat/araMath_v3_5_shot.json +126 -0
- evaluations/ar/AceGPT-v2-8B-Chat/araPro_0_shot.json +130 -0
- evaluations/ar/AceGPT-v2-8B-Chat/arabicmmlu_0_shot.json +0 -0
- evaluations/ar/AceGPT-v2-8B-Chat/etec_v2_0_shot.json +126 -0
- evaluations/ar/AceGPT-v2-8B-Chat/exams_ar_5_shot.json +119 -0
- evaluations/ar/AceGPT-v2-8B-Chat/gat_0_shot.json +539 -0
- evaluations/ar/AceGPT-v2-8B-Chat/moe_ien_mcq_0_shot.json +127 -0
- evaluations/ar/AceGPT-v2-8B-Chat/moe_ien_tf_0_shot.json +129 -0
- evaluations/ar/AceGPT-v2-8B-Chat/openaimmlu_0_shot.json +0 -0
- evaluations/ar/Allam-7b-instruct-preview/acva_5_shot.json +119 -0
- evaluations/ar/Allam-7b-instruct-preview/ar_ifeval_0_shot.json +142 -0
- evaluations/ar/Allam-7b-instruct-preview/araMath_v3_5_shot.json +126 -0
- evaluations/ar/Allam-7b-instruct-preview/araPro_0_shot.json +130 -0
- evaluations/ar/Allam-7b-instruct-preview/arabicmmlu_0_shot.json +0 -0
- evaluations/ar/Allam-7b-instruct-preview/etec_v2_0_shot.json +126 -0
- evaluations/ar/Allam-7b-instruct-preview/exams_ar_5_shot.json +121 -0
- evaluations/ar/Allam-7b-instruct-preview/gat_0_shot.json +549 -0
- evaluations/ar/Allam-7b-instruct-preview/moe_ien_mcq_0_shot.json +127 -0
- evaluations/ar/Allam-7b-instruct-preview/moe_ien_tf_0_shot.json +129 -0
- evaluations/ar/Allam-7b-instruct-preview/openaimmlu_0_shot.json +0 -0
- evaluations/ar/Falcon3-7B-Instruct/acva_5_shot.json +123 -0
- evaluations/ar/Falcon3-7B-Instruct/ar_ifeval_0_shot.json +142 -0
- evaluations/ar/Falcon3-7B-Instruct/araMath_v3_5_shot.json +126 -0
- evaluations/ar/Falcon3-7B-Instruct/araPro_0_shot.json +130 -0
- evaluations/ar/Falcon3-7B-Instruct/arabicmmlu_0_shot.json +0 -0
- evaluations/ar/Falcon3-7B-Instruct/etec_v2_0_shot.json +126 -0
- evaluations/ar/Falcon3-7B-Instruct/exams_ar_5_shot.json +125 -0
- evaluations/ar/Falcon3-7B-Instruct/gat_0_shot.json +553 -0
- evaluations/ar/Falcon3-7B-Instruct/moe_ien_mcq_0_shot.json +127 -0
- evaluations/ar/Falcon3-7B-Instruct/moe_ien_tf_0_shot.json +129 -0
- evaluations/ar/Falcon3-7B-Instruct/openaimmlu_0_shot.json +0 -0
- evaluations/ar/Llama-3.3-70B-Instruct/acva_5_shot.json +125 -0
- evaluations/ar/Llama-3.3-70B-Instruct/ar_ifeval_0_shot.json +142 -0
- evaluations/ar/Llama-3.3-70B-Instruct/araMath_v3_5_shot.json +126 -0
- evaluations/ar/Llama-3.3-70B-Instruct/araPro_0_shot.json +130 -0
README.md
ADDED
@@ -0,0 +1,211 @@
1 |
+
---
|
2 |
+
license: apache-2.0
|
3 |
+
language:
|
4 |
+
- ar
|
5 |
+
- en
|
6 |
+
pipeline_tag: text-generation
|
7 |
+
tags:
|
8 |
+
- pytorch
|
9 |
+
library_name: transformers
|
10 |
+
---
|
11 |
+
# ALLaM-7B-Instruct-preview
|
12 |
+
|
13 |
+
ALLaM is a series of powerful language models designed to advance Arabic Language Technology (ALT) developed by the National Center for Artificial Intelligence (NCAI) at the [Saudi Data and AI Authority (SDAIA)](https://sdaia.gov.sa/en/default.aspx). `ALLaM-AI/ALLaM-7B-Instruct-preview` is trained from scratch. Our pretraining from scratch recipe consists of two steps: training on 4T English tokens followed by training on 1.2T mixed Arabic/English tokens. This retains the English capabilities of the model without catastrophic forgetting, effectively transferring knowledge from one language distribution to another.
|
14 |
+
|
15 |
+
## Intended Use
|
16 |
+
|
17 |
+
`ALLaM` is specifically designed to expedite the research and development of ALT through Large Language Models (LLM). It serves as one of the foundational elements for building product offerings as well as facilitating experimental initiatives.
|
18 |
+
|
19 |
+
The ALLaM series models are designed to be a component of a larger AI system, and it is important for developers to incorporate safety measures when creating these systems. These safety measures are crucial for ensuring a balance between effectiveness and security, as well as minimizing potential risks, such as those resulting from the integration of the model with external tools.
|
20 |
+
|
21 |
+
## Model Details
|
22 |
+
|
23 |
+
ALLaM is a family of LLMs specially trained for Arabic. The main two paths followed for pretraining are:
|
24 |
+
|
25 |
+
- **ALLaM**: Pretraining models from scratch
|
26 |
+
- **ALLaM-Adapted/ALLaM-(\*\*)/(\*\*)-ALLaM**/: Continued training from open source/weight models
|
27 |
+
|
28 |
+
For this release, we are providing our instruction-tuned 7B parameter generative model pretrained from scratch.
|
29 |
+
|
30 |
+
Some parameters for this model are provided in the following table:
|
31 |
+
|
32 |
+
| Size | Context Length | Pretraining Tokens | Instructions | Preference Pairs |
|
33 |
+
|----------------|-----------------|--------------------|--------------|------------------|
|
34 |
+
| 7B parameters | 4096 tokens |4T(en) + 1.2T(en+ar)| 7M | 260K |
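The 7B figure in the table can be cross-checked against the architecture values in this repository's `config.json`. The sketch below is a rough estimate that assumes the standard `LlamaForCausalLM` layout (bias-free attention projections, a gated MLP, two RMSNorm weights per layer, untied input and output embeddings); `estimate_llama_params` is a hypothetical helper written for illustration, not part of the ALLaM release.

```python
# Values copied from this repository's config.json.
config = {
    "hidden_size": 4096,
    "intermediate_size": 11008,
    "num_hidden_layers": 32,
    "vocab_size": 64000,
    "tie_word_embeddings": False,
}


def estimate_llama_params(cfg: dict) -> int:
    """Estimate the parameter count of a Llama-style decoder (assumed layout)."""
    h = cfg["hidden_size"]
    ffn = cfg["intermediate_size"]
    attention = 4 * h * h   # q, k, v, o projections; no bias, full multi-head attention
    mlp = 3 * h * ffn       # gate, up, and down projections
    norms = 2 * h           # two RMSNorm weight vectors per layer
    per_layer = attention + mlp + norms
    embeddings = cfg["vocab_size"] * h
    lm_head = 0 if cfg["tie_word_embeddings"] else cfg["vocab_size"] * h
    final_norm = h
    return cfg["num_hidden_layers"] * per_layer + embeddings + lm_head + final_norm


total = estimate_llama_params(config)
print(f"{total:,} parameters (~{total / 1e9:.2f}B)")
```

Under these assumptions the estimate comes to roughly 7.0B parameters, consistent with the size listed above.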
|
35 |
+
|
36 |
+
|
37 |
+
## Model Description
|
38 |
+
|
39 |
+
- **Developed by:** National Center for Artificial Intelligence at [SDAIA](https://sdaia.gov.sa/en/default.aspx)
|
40 |
+
- **Model type:** Autoregressive Transformer
|
41 |
+
- **Language(s):** Arabic, English
|
42 |
+
- **License:** Please see the LICENSE file
|
43 |
+
- **Input:** Text
|
44 |
+
- **Output:** Text
|
45 |
+
|
46 |
+
|
47 |
+
## Training Details
|
48 |
+
|
49 |
+
ALLaM-7B-Instruct-preview is pretrained on a total of 5.2 trillion tokens in English and Arabic. Our training codebase is built on [NVIDIA/Megatron-LM](https://github.com/NVIDIA/Megatron-LM). Average Model FLOPs Utilization (MFU) during training was ~42%. We trained our model using bf16-mixed precision.
|
50 |
+
|
51 |
+
|
52 |
+
## Getting started
|
53 |
+
|
54 |
+
|
55 |
+
### System Prompt
|
56 |
+
|
57 |
+
It is important to note that this model is optimized to function without a predefined system prompt.
|
58 |
+
While ALLaM does not come with a default system prompt, it does provide the flexibility to add a custom system prompt.
|
59 |
+
For instance, a well-crafted system prompt could be:
|
60 |
+
|
61 |
+
“You are ALLaM, a bilingual English and Arabic AI assistant.”
|
62 |
+
System prompts can also be in Arabic:
|
63 |
+
|
64 |
+
"أنت علام، مساعد ذكاء اصطناعي مطور من الهيئة السعودية للبيانات والذكاء الاصطناعي، تجيب على الأسئلة بطريقة مفيدة مع مراعاة القيم الثقافية المحلية."
|
65 |
+
Alternatively, users can get creative with their prompts, such as:
|
66 |
+
|
67 |
+
“You are an AI assistant who responds to everything like a pirate.”
|
68 |
+
|
69 |
+
System prompt handling is built into the tokenizer's chat template, which is applied via the tokenizer's `apply_chat_template()` method.
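A minimal sketch of how a custom system prompt could be supplied, assuming the chat template accepts a leading `system` message as described above. `build_messages` is a hypothetical helper, not part of the ALLaM release or of `transformers`; the commented line shows where a loaded tokenizer would render the prompt.

```python
from typing import Dict, List, Optional


def build_messages(user_text: str, system_prompt: Optional[str] = None) -> List[Dict[str, str]]:
    """Build a chat message list, optionally led by a system prompt."""
    messages = []
    if system_prompt:
        # The model is described as working without a system prompt, so this is optional.
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_text})
    return messages


messages = build_messages(
    "كيف أجهز كوب شاهي؟",
    system_prompt="You are ALLaM, a bilingual English and Arabic AI assistant.",
)
# With a loaded tokenizer, the prompt string would then be rendered with:
# prompt = tokenizer.apply_chat_template(messages, tokenize=False)
print(messages)
```

Calling `build_messages` without `system_prompt` yields a user-only conversation, which matches the default behaviour the model is optimized for.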
|
70 |
+
|
71 |
+
|
72 |
+
### Example Usages
|
73 |
+
|
74 |
+
The weights for ALLaM model checkpoints can be accessed via [HuggingFace transformers](https://github.com/huggingface/transformers) (tested with `transformers>=4.40.1`). The following code snippet demonstrates how to load the model and generate text using the `ALLaM-AI/ALLaM-7B-Instruct-preview` model.
|
75 |
+
|
76 |
+
```python
|
77 |
+
from transformers import AutoModelForCausalLM, AutoTokenizer
|
78 |
+
allam_model = AutoModelForCausalLM.from_pretrained("ALLaM-AI/ALLaM-7B-Instruct-preview")
|
79 |
+
tokenizer = AutoTokenizer.from_pretrained("ALLaM-AI/ALLaM-7B-Instruct-preview")
|
80 |
+
messages=[
|
81 |
+
{"role": "user", "content": "كيف أجهز كوب شاهي؟"},
|
82 |
+
]
|
83 |
+
inputs = tokenizer.apply_chat_template(messages, tokenize=False)
|
84 |
+
inputs = tokenizer(inputs, return_tensors='pt', return_token_type_ids=False)
|
85 |
+
inputs = {k: v.to('cuda') for k,v in inputs.items()}
|
86 |
+
allam_model = allam_model.to('cuda')
|
87 |
+
response = allam_model.generate(**inputs, max_new_tokens=4096, do_sample=True, top_k=50, top_p=0.95, temperature=.6)
|
88 |
+
print(tokenizer.batch_decode(response, skip_special_tokens=True)[0])
|
89 |
+
```
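The example above requests `max_new_tokens=4096`, which equals the 4096-token context length listed in the Model Details table. The following is a hedged sketch of one way to keep prompt plus generation inside that window; `generation_budget` is a hypothetical helper, and the limit is taken from `max_position_embeddings` in `config.json`.

```python
CONTEXT_LENGTH = 4096  # max_position_embeddings in config.json


def generation_budget(prompt_tokens: int, context_length: int = CONTEXT_LENGTH) -> int:
    """Return how many new tokens still fit after a prompt of the given length."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    return max(context_length - prompt_tokens, 0)


# With the tokenized `inputs` from the example above, this could be used as:
# budget = generation_budget(inputs["input_ids"].shape[1])
# response = allam_model.generate(**inputs, max_new_tokens=budget, do_sample=True)
print(generation_budget(120))
```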
|
90 |
+
|
91 |
+
|
92 |
+
## Ethical Considerations and Limitations
|
93 |
+
|
94 |
+
ALLaM is a generative model that comes with inherent uncertainties. Trials cannot encompass every possible use case. Hence, predicting ALLaM's responses in every context is not possible, leading on occasion to incorrect or biased outputs. Developers must conduct thorough safety evaluations and make specific adjustments to ensure the model is suitable for the intended purposes.
|
95 |
+
|
96 |
+
*The output generated by this model is not considered a statement of NCAI, SDAIA, or any other organization.*
|
97 |
+
|
98 |
+
## Evaluation
|
99 |
+
|
100 |
+
### Automatic Benchmarks
|
101 |
+
|
102 |
+
#### Arabic Benchmarks
|
103 |
+
**Massive Multitask Language Understanding** (MMLU) is a collection of many multiple-choice evaluation questions sourced from various academic levels (elementary to college level). These questions are typically related to humanities, STEM, or social sciences. It was originally an English dataset, but other variants were developed for Arabic:
|
104 |
+
|
105 |
+
<!-- - [Original English MMLU (MMLU-en)](https://github.com/hendrycks/test): A collection of 14,079 original English questions spanning 57 domains. -->
|
106 |
+
- [Arabic MMLU](https://huggingface.co/datasets/MBZUAI/ArabicMMLU): A collection of 14,575 original Arabic questions spanning 40 domains published by MBZUAI.
|
107 |
+
- [OpenAI MMLU-ar](https://huggingface.co/datasets/openai/MMMLU): A dataset comprising 14,042 questions, translated from the original MMLU benchmark published by OpenAI.
|
108 |
+
|
109 |
+
**Exams Arabic** ([Exams (Ar)](https://github.com/FreedomIntelligence/Arabic-eval/blob/main/LLM/benchmark_eval/benchmarks/EXAMS_Arabic/exam_test.jsonl)): A multiple-choice question dataset with 537 samples, covering several domains, e.g., Islamic studies, science, humanities, and physics.
|
110 |
+
|
111 |
+
**Arabic Cultural Alignment** ([ACVA](https://huggingface.co/datasets/FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment)): This dataset was generated by `gpt-3.5-turbo` and contains 8,710 True and False questions from 58 different areas.
|
112 |
+
|
113 |
+
**Education and Training Evaluation Commission** (ETEC): This dataset consists of Arabic-language multiple-choice questions, compiled by the ALLaM team in collaboration with [Saudi ETEC](https://acpd.etec.gov.sa/Home/index?csrt=5175167507218838843). It spans various educational levels, from elementary through post-college, with a total of 1,887 test samples.
|
114 |
+
|
115 |
+
**IEN**: This dataset was curated from the Ministry of Education's (MOE) [IEN platform](https://www.ientv.edu.sa/ar) and is organized by grade, topic, and difficulty level. It comprehensively covers the entire Saudi curriculum from 1st grade through high school, with 9,990 multiple-choice questions and 5,823 true/false questions.
|
116 |
+
|
117 |
+
**GAT**: The General Aptitude Test (GAT) dataset consists of approximately 16,000 Arabic multiple-choice questions, representing various sections of [the Qiyas General Aptitude Test](https://www.etec.gov.sa/en/service/Generalabilitytest/servicegoal). The sections include algebra, reading comprehension, analogies, arithmetic, associations, comparisons, completions, contextual understanding, and geometry.
|
118 |
+
|
119 |
+
**AraPro**: A curated collection of 5,001 multiple-choice questions (MCQs) authored by our domain experts. The dataset spans various subjects, including mathematics, science, and other relevant fields, providing a diverse set of questions for evaluation purposes.
|
120 |
+
|
121 |
+
**AraMath**: AraMath consists of 605 MCQs derived from [ArMath](https://github.com/reem-codes/ArMATH), a collection of mathematical word problems that we converted to MCQs internally.
|
122 |
+
|
123 |
+
**Ar-IFEval**: An Arabic instruction-following (IF) evaluation dataset designed to automatically assess language models' compliance with specified instructions through verifiable methods. The dataset consists of 535 instances, each containing two to four verifiable instructions that can be validated using deterministic programming approaches.
|
124 |
+
|
125 |
+
All models were evaluated using our proprietary evaluation pipeline and [LM Evaluation Harness framework](https://github.com/EleutherAI/lm-evaluation-harness) to ensure fair comparisons. For API-based models, we used exact match evaluations of the generated outputs.
|
126 |
+
|
127 |
+
The evaluation scores of ALLaM can be found in JSON format [here](https://huggingface.co/ALLaM-AI/ALLaM-7B-Instruct-preview/tree/main/evaluation).
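As a rough illustration of how these JSON files can be read, the sketch below parses an excerpt of `evaluations/ar/AceGPT-v2-32B-Chat/acva_5_shot.json` from this repository, inlined so it runs offline. `score_percent` is a hypothetical helper. Which metric the authors report in the table below is not stated; choosing `acc_norm,none` is an inference from the fact that it reproduces that model's ACVA entry.

```python
import json

# Excerpt of the "results" section of acva_5_shot.json for AceGPT-v2-32B-Chat.
raw = """
{
  "results": {
    "acva": {
      "alias": "acva",
      "acc,none": 0.7274397244546499,
      "acc_norm,none": 0.7157290470723306
    }
  }
}
"""


def score_percent(report: dict, task: str, metric: str = "acc_norm,none") -> float:
    """Convert an lm-evaluation-harness metric in [0, 1] to a percentage with two decimals."""
    return round(100 * report["results"][task][metric], 2)


report = json.loads(raw)
print(score_percent(report, "acva"))              # normalized accuracy
print(score_percent(report, "acva", "acc,none"))  # plain accuracy
```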
|
128 |
+
|
129 |
+
|
130 |
+
| Model |AVG | ETEC <br>0 shot | IEN-MCQ <br>0 shot | IEN-TF <br>0 shot | AraPro <br>0 shot | AraMath <br>5 shot | Ar-IFEval <br>(prompt strict) <br>0 shot | Ar-IFEval <br>(inst strict) <br>0 shot | ExamsAR <br>5 shot | ACVA <br> 5 shot | Arabic MMLU <br>0 Shot | Openai MMLU <br>0 shot | GAT <br>0 shot |
|
131 |
+
|:----------------------------|:----------|:---------|:-----------------|:----------------|:----------------|:-----------------|:-----------------------------------|:---------------------------------|:------------------|:--------------|:--------------------|:--------------------|:-----------------------------|
|
132 |
+
| ALLaM-7B-Instruct-preview | 64.42 | 66.67 | **91.77** | 82.95 | 69.71 | 66.78 | 31.34 | 67.65 | 51.58 | 76.33 | 67.78 | 55.91 | 44.53 |
|
133 |
+
| AceGPT-v2-8B-Chat | 52.67 | 56.81 | 77.01 | 75.91 | 63.51 | 41.49 | 10.26 | 39.25 | 51.96 | 72.69 | 57.02 | 49.99 | 36.15 |
|
134 |
+
| AceGPT-v2-32B-Chat | 62.23 | 64.81 | 81.6 | 80.35 | 67.19 | 64.46 | 25.75 | 63.41 | 55.31 | 71.57 | 68.3 | 60.8 | 43.21 |
|
135 |
+
| jais-family-6p7b-chat | 46.31 | 45.47 | 46.22 | 63.92 | 54.31 | 25.29 | 13.99 | 52.97 | 46.93 | 73.8 | 56.15 | 44.96 | 31.71 |
|
136 |
+
| jais-family-13b-chat | 49.14 | 48.65 | 62.95 | 68.68 | 57.53 | 26.61 | 17.16 | 54.27 | 45.07 | 71.18 | 58.14 | 47.73 | 31.72 |
|
137 |
+
| jais-family-30b-16k-chat | 52.54 | 53.31 | 74.88 | 68.76 | 62.79 | 41.49 | 16.6 | 54.95 | 49.72 | 60.08 | 62.04 | 50.98 | 34.85 |
|
138 |
+
| jais-family-30b-8k-chat | 53.19 | 53.52 | 72.76 | 70.65 | 61.27 | 33.39 | 16.79 | 54.68 | 50.28 | 74.47 | 63.11 | 50.9 | 36.44 |
|
139 |
+
| jais-adapted-7b-chat | 45.19 | 40.49 | 57.38 | 67.18 | 50.59 | 28.43 | 14.93 | 54.27 | 40.6 | 70.44 | 49.75 | 38.54 | 29.68 |
|
140 |
+
| jais-adapted-13b-chat | 51.86 | 48.12 | 69.65 | 71.85 | 59.07 | 37.02 | 23.32 | 60.61 | 48.23 | 67.78 | 56.42 | 46.83 | 33.4 |
|
141 |
+
| jais-adapted-70b-chat | 58.32 | 56.81 | 74.51 | 76.47 | 64.59 | 45.62 | 27.05 | 65.05 | 54.75 | 73.33 | 65.74 | 56.82 | 39.15 |
|
142 |
+
| Qwen2.5-7B-Instruct | 60.55 | 64.12 | 66.38 | 78.46 | 64.63 | 71.74 | 28.17 | 65.19 | 50.65 | 78.17 | 61.54 | 56.1 | 41.42 |
|
143 |
+
| Qwen2.5-14B-Instruct | 71.26 | 72.18 | 80.51 | 77.64 | 69.11 | 82.81 | 68.66 | 86.76 | 57.54 | 75.04 | 69.36 | 63.8 | 51.7 |
|
144 |
+
| Qwen2.5-72B-Instruct | **76.91** | **78.7** | 86.88 | **86.62** | **74.69** | **92.89** | 67.72 | 87.51 | 60.71 | **79.92** | **74.1** | **73.59** | **59.54** |
|
145 |
+
| Mistral-7B-Instruct-v0.3 | 43.05 | 35.67 | 53.59 | 63.4 | 43.85 | 27.11 | 30.41 | 64.03 | 34.08 | 60.25 | 45.27 | 32.3 | 26.65 |
|
146 |
+
| Mistral-Nemo-Instruct-2407 | 53.79 | 49.28 | 68.43 | 71.78 | 57.61 | 40.0 | 35.82 | 70.58 | 47.49 | 76.92 | 55.97 | 46.15 | 25.44 |
|
147 |
+
| Mistral-Small-Instruct-2409 | 51.11 | 40.96 | 60.64 | 63.66 | 47.73 | 44.46 | 51.12 | 78.16 | 38.73 | 68.93 | 50.43 | 39.63 | 28.82 |
|
148 |
+
| Falcon3-7B-Instruct | 41.3 | 37.52 | 52.65 | 57.63 | 41.47 | 56.53 | 8.58 | 47.92 | 31.84 | 58.98 | 42.08 | 32.36 | 27.99 |
|
149 |
+
| Meta-Llama-3.1-8B-Instruct | 54.08 | 45.68 | 59.23 | 71.7 | 52.51 | 34.38 | 51.87 | 79.11 | 52.51 | 69.93 | 56.43 | 44.67 | 30.9 |
|
150 |
+
| Llama-3.3-70B-Instruct | 71.43 | 68.84 | 79.6 | 78.81 | 70.49 | 70.91 | **70.9** | **88.6** | **65.74** | 76.93 | 72.01 | 70.25 | 44.12 |
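The AVG column is not defined in the text. A quick check, assuming it is the unweighted mean of the twelve benchmark columns, reproduces the reported value for the ALLaM row; the scores below are copied from the table above.

```python
# Per-benchmark scores for ALLaM-7B-Instruct-preview, copied from the table above.
allam_scores = {
    "ETEC": 66.67,
    "IEN-MCQ": 91.77,
    "IEN-TF": 82.95,
    "AraPro": 69.71,
    "AraMath": 66.78,
    "Ar-IFEval (prompt strict)": 31.34,
    "Ar-IFEval (inst strict)": 67.65,
    "ExamsAR": 51.58,
    "ACVA": 76.33,
    "Arabic MMLU": 67.78,
    "OpenAI MMLU": 55.91,
    "GAT": 44.53,
}

# Assumption: AVG is a plain arithmetic mean over all twelve columns.
average = round(sum(allam_scores.values()) / len(allam_scores), 2)
print(average)
```

Because the mean is unweighted, small benchmarks such as AraMath (605 questions) count as much as large ones such as GAT (about 16,000 questions).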
|
151 |
+
|
152 |
+
Closed-model evaluations:
|
153 |
+
|
154 |
+
| Model | ETEC <br>0 shot | IEN-MCQ <br>0 shot | IEN-TF <br>0 shot | AraPro <br>0 shot | AraMath <br>5 shot | Ar-IFEval <br>(prompt strict) <br>0 shot | Ar-IFEval <br>(inst strict) <br>0 shot | ExamsAR <br>5 shot | ACVA <br> 5 shot | Arabic MMLU <br>0 shot | OpenAI MMLU <br>0 shot | GAT <br>0 shot |
|
155 |
+
|:---------------------------------------|:--------------|:-----------------|:----------------|:----------------|:-----------------|:----------------------------------|:--------------------------------|:-----------------|:-----------------------|:--------------------|:---------------------|:----------------------|
|
156 |
+
| Azureml GPT4o (gpt-4o-900ptu) | 79.39 | **92.03** | 88.97 | 80.86 | 83.47 | 70.9 | 88.12 | 61.82 | 72.51 | 79.02 | **76.5** | 62.65 |
|
157 |
+
| Claude Sonnet 3.5 (claude-3-5-sonnet-20241022) | **85.9** | 86.17 | **89.42** | **81.46** | 79.83 | 53.73 | 80.14 | **62.38** | **80.42** | 69.5 | 66.4 | **68.89** |
|
158 |
+
| gemini pro 1.5 (gemini-1.5-pro) | 83.31 | 88.28 | 85.44 | 76.22 | **94.88** | **74.81** | **90.17** | 58.1 | 75.17 | **82.0** | 64.8 | 59.14 |
|
159 |
+
|
160 |
+
#### English Benchmarks
|
161 |
+
|
162 |
+
| model |Avg | AGIEval 0 Shot | Arc (challenge) 0 Shot | GPQA (main) 0 Shot | Hendrycks <br>ethics 0 Shot | Winogrande 0 Shot | HellaSwag 0 Shot | TriviaQa 5 Shot | MMLU Pro<br>5 Shot | Minerva Math <br>4 Shot | MMLU 0 Shot | TruthfulQA <br>(mc2) 0 Shot | IFEval <br>(prompt strict)<br>0 Shot | IFEval <br>(inst strict)<br>0 Shot | GSM8k 5 Shot |
|
163 |
+
|:----------------------------------|:----------|:-----------------|:-----------------------|:--------------------------|:--------------------------|:--------------------|:-------------------|:------------------|:------------------|:----------------------|:--------------|:------------------------|:---------------------------------|:-------------------------------|:---------------|
|
164 |
+
| ALLaM-7B-Instruct-preview | 46.85 | 41.99 | 51.28 | 22.77 | 73.17 | 70.48 | 76.26 | 16.07 | 30.4 | 17.3 | 59.6 | 46.67 | 38.08 | 50.0 | 61.79 |
|
165 |
+
| AceGPT-v2-8B-Chat | 49.51 | 37.17 | 53.5 | 25.67 | 68.14 | 73.72 | 79.21 | 67.65 | 37.38 | 17.58 | 64.62 | 55.2 | 23.48 | 32.97 | 56.86 |
|
166 |
+
| AceGPT-v2-32B-Chat | 57.14 | 56.01 | 53.92 | 32.81 | 66.23 | 79.16 | 83.29 | 69.45 | 45.89 | 32.8 | 74.03 | 59.18 | 27.54 | 40.89 | 78.7 |
|
167 |
+
| jais-family-6p7b-chat | 38.33 | 30.56 | 44.62 | 23.21 | 65.7 | 62.43 | 72.05 | 29.74 | 23.3 | 2.56 | 49.62 | 40.99 | 14.05 | 23.5 | 54.36 |
|
168 |
+
| jais-family-13b-chat | 42.62 | 30.31 | 47.87 | 25.89 | 65.91 | 65.04 | 75.0 | 35.82 | 24.4 | 19.1 | 51.91 | 40.57 | 19.41 | 30.82 | 64.59 |
|
169 |
+
| jais-family-30b-16k-chat | 45.15 | 31.85 | 48.46 | 23.88 | 69.44 | 68.19 | 76.21 | 43.99 | 29.11 | 22.3 | 58.5 | 44.78 | 18.3 | 29.14 | 67.93 |
|
170 |
+
| jais-family-30b-8k-chat | 47.59 | 36.65 | 48.38 | 21.88 | 69.28 | 70.32 | 78.55 | 46.67 | 28.7 | 26.44 | 57.46 | 49.49 | 22.92 | 37.05 | 72.48 |
|
171 |
+
| jais-adapted-7b-chat | 44.91 | 32.9 | 52.65 | 23.88 | 55.32 | 71.74 | 79.39 | 63.89 | 24.38 | 15.34 | 52.36 | 41.12 | 22.0 | 35.73 | 58.07 |
|
172 |
+
| jais-adapted-13b-chat | 47.7 | 36.49 | 54.18 | 26.34 | 65.73 | 69.77 | 80.86 | 58.48 | 26.29 | 21.34 | 55.66 | 42.27 | 24.95 | 36.57 | 68.84 |
|
173 |
+
| jais-adapted-70b-chat | 53.49 | 39.96 | 59.56 | 20.98 | 70.8 | 77.27 | 84.06 | 68.64 | 37.25 | 27.72 | 65.23 | 44.49 | 31.61 | 44.0 | 77.26 |
|
174 |
+
| Qwen2.5-7B-Instruct | 54.68 | 59.2 | 51.28 | 26.56 | 73.76 | 69.38 | 79.55 | 50.59 | 44.92 | 12.04 | 70.56 | 58.93 | 57.3 | 68.23 | 43.29 |
|
175 |
+
| Qwen2.5-14B-Instruct | 62.37 | 66.32 | 62.12 | 25.89 | 76.19 | 75.77 | 84.36 | 59.47 | 52.44 | 23.04 | 78.93 | 69.01 | 52.13 | 64.03 | 83.47 |
|
176 |
+
| Qwen2.5-72B-Instruct | **70.06** | **71.09** | **63.48** | 25.67 | 78.33 | 76.24 | **87.41** | 70.9 | **62.77** | **54.04** | **83.44** | **69.54** | 67.65 | 77.1 | **93.25** |
|
177 |
+
| Mistral-7B-Instruct-v0.3 | 51.98 | 36.45 | 58.87 | 23.21 | 72.58 | 73.95 | 82.93 | 67.97 | 33.18 | 13.44 | 59.74 | 59.69 | 42.51 | 54.8 | 48.37 |
|
178 |
+
| Mistral-Nemo-Instruct-2407 | 54.0 | 39.65 | 59.04 | 24.33 | 67.86 | 74.66 | 82.35 | 72.77 | 44.27 | 29.62 | 65.56 | 54.88 | 30.13 | 38.97 | 71.95 |
|
179 |
+
| Mistral-Small-Instruct-2409 | 61.65 | 40.76 | 60.49 | 25.89 | 72.27 | 78.53 | 85.35 | 79.11 | 47.47 | 39.42 | 69.42 | 56.35 | 58.23 | 68.35 | 81.43 |
|
180 |
+
| Falcon3-7B-Instruct | 58.04 | 43.84 | 59.47 | **33.71** | 70.39 | 70.09 | 78.43 | 51.98 | 46.73 | 30.76 | 68.14 | 55.53 | 56.01 | 68.59 | 78.92 |
|
181 |
+
| Meta-Llama-3.1-8B-Instruct | 56.5 | 42.39 | 55.12 | 27.23 | 66.69 | 73.95 | 79.28 | 70.05 | 40.64 | 34.26 | 67.96 | 54.05 | 44.36 | 58.51 | 76.5 |
|
182 |
+
| Llama-3.3-70B-Instruct | 67.7 | 55.44 | 63.4 | 25.89 | **81.05** | **79.24** | 84.39 | **81.7** | 60.51 | 46.42 | 81.99 | 60.91 | **63.22** | **72.78** | 90.83 |
|
183 |
+
|
184 |
+
### MT-bench
|
185 |
+
|
186 |
+
**Multi-Turn Bench** (MT-Bench): A challenging multi-turn benchmark that uses GPT-4o as a judge. MT-bench comprises 80 questions from 8 domains. Each question is presented to the model and the responses are submitted to GPT-4o to assign scores to each response. The judge returns a score for the first and second turn separately.
|
187 |
+
The dataset was also machine-translated into Arabic, then manually verified and culturally aligned.
|
188 |
+
|
189 |
+
| Model | AR Average | AR Turn 1 | AR Turn 2 | EN Average | EN Turn 1 | EN Turn 2 |
|
190 |
+
|---------------------|------------|-----------|-----------|------------|-----------|-----------|
|
191 |
+
| ALLaM-7B-Instruct-preview | 5.9 | **6.93**| 4.88 | 6.5 | 7.49 | 5.15 |
|
192 |
+
| AceGPT-v1.5-13B-Chat | 4.61 | 5.28 | 3.93 | 4.86 | 5.56 | 4.17 |
|
193 |
+
| AceGPT-v2-32B-Chat | 5.43 | 6.61 | 4.26 | **6.5** | **7.41** | **5.58** |
|
194 |
+
| jais-family-13b-chat | 4.89 | 5.37 | 4.41 | 4.77 | 5.57 | 3.97 |
|
195 |
+
| jais-family-30b-16k-chat | 4.87 | 5.50 | 4.25 | 5.13 | 5.86 | 4.4 |
|
196 |
+
| jais-adapted-70b-chat | 5.86 | 6.33 | **5.38** | 5.88 | 6.41 | 5.36 |
|
197 |
+
|
198 |
+
## Citation
|
199 |
+
|
200 |
+
If you found this work helpful or used any part of this work, please include the following citation:
|
201 |
+
|
202 |
+
```
|
203 |
+
@inproceedings{
|
204 |
+
bari2025allam,
|
205 |
+
title={{ALL}aM: Large Language Models for Arabic and English},
|
206 |
+
author={M Saiful Bari and Yazeed Alnumay and Norah A. Alzahrani and Nouf M. Alotaibi and Hisham Abdullah Alyahya and Sultan AlRashed and Faisal Abdulrahman Mirza and Shaykhah Z. Alsubaie and Hassan A. Alahmed and Ghadah Alabduljabbar and Raghad Alkhathran and Yousef Almushayqih and Raneem Alnajim and Salman Alsubaihi and Maryam Al Mansour and Saad Amin Hassan and Dr. Majed Alrubaian and Ali Alammari and Zaki Alawami and Abdulmohsen Al-Thubaity and Ahmed Abdelali and Jeril Kuriakose and Abdalghani Abujabal and Nora Al-Twairesh and Areeb Alowisheq and Haidar Khan},
|
207 |
+
booktitle={The Thirteenth International Conference on Learning Representations},
|
208 |
+
year={2025},
|
209 |
+
url={https://openreview.net/forum?id=MscdsFVZrN}
|
210 |
+
}
|
211 |
+
```
|
config.json
ADDED
@@ -0,0 +1,28 @@
1 |
+
{
|
2 |
+
"architectures": [
|
3 |
+
"LlamaForCausalLM"
|
4 |
+
],
|
5 |
+
"attention_bias": false,
|
6 |
+
"attention_dropout": 0.0,
|
7 |
+
"bos_token_id": 1,
|
8 |
+
"eos_token_id": 2,
|
9 |
+
"hidden_act": "silu",
|
10 |
+
"hidden_size": 4096,
|
11 |
+
"initializer_range": 0.006,
|
12 |
+
"intermediate_size": 11008,
|
13 |
+
"max_position_embeddings": 4096,
|
14 |
+
"model_type": "llama",
|
15 |
+
"num_attention_heads": 32,
|
16 |
+
"num_hidden_layers": 32,
|
17 |
+
"num_key_value_heads": 32,
|
18 |
+
"pretraining_tp": 1,
|
19 |
+
"rms_norm_eps": 1e-05,
|
20 |
+
"rope_scaling": null,
|
21 |
+
"rope_theta": 10000.0,
|
22 |
+
"tie_word_embeddings": false,
|
23 |
+
"torch_dtype": "bfloat16",
|
24 |
+
"transformers_version": "4.39.3",
|
25 |
+
"use_cache": true,
|
26 |
+
"vocab_size": 64000,
|
27 |
+
"internal_version": "7b-alpha-v1.27.2.25"
|
28 |
+
}
|
evaluations/ar/AceGPT-v2-32B-Chat/acva_5_shot.json
ADDED
@@ -0,0 +1,125 @@
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"acva": {
|
4 |
+
"alias": "acva",
|
5 |
+
"acc,none": 0.7274397244546499,
|
6 |
+
"acc_stderr,none": 0.004771397968508457,
|
7 |
+
"acc_norm,none": 0.7157290470723306,
|
8 |
+
"acc_norm_stderr,none": 0.004833440968499389
|
9 |
+
}
|
10 |
+
},
|
11 |
+
"group_subtasks": {
|
12 |
+
"acva": []
|
13 |
+
},
|
14 |
+
"configs": {
|
15 |
+
"acva": {
|
16 |
+
"task": "acva",
|
17 |
+
"tag": [
|
18 |
+
"multiple_choice"
|
19 |
+
],
|
20 |
+
"dataset_path": "FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment",
|
21 |
+
"dataset_kwargs": {
|
22 |
+
"trust_remote_code": true
|
23 |
+
},
|
24 |
+
"validation_split": "validation",
|
25 |
+
"test_split": "test",
|
26 |
+
"fewshot_split": "validation",
|
27 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _format_subject(subject):\n \n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n \n def _generate_subject(doc):\n subject = _format_subject(doc[\"id\"].split(\"-\")[0])\n\n return subject\n \n def _process_docs(doc):\n keys = [\"\u0635\u062d\",\n \"\u062e\u0637\u0623\"]\n subject = _generate_subject(doc)\n gold = keys.index(doc['answer'])\n out_doc = {\n \"id\": doc[\"id\"],\n \"query\": \"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" + doc[\"question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\",\n \"choices\": keys,\n \"gold\": gold,\n \"subject\": subject,\n }\n \n return out_doc\n\n return dataset.map(_process_docs)\n",
|
28 |
+
"doc_to_text": "query",
|
29 |
+
"doc_to_target": "gold",
|
30 |
+
"doc_to_choice": "choices",
|
31 |
+
"description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d' \u0623\u0648 '\u062e\u0637\u0623' \u062f\u0648\u0646 \u0634\u0631\u062d",
|
32 |
+
"target_delimiter": " ",
|
33 |
+
"fewshot_delimiter": "\n\n",
|
34 |
+
"num_fewshot": 5,
|
35 |
+
"metric_list": [
|
36 |
+
{
|
37 |
+
"metric": "acc",
|
38 |
+
"aggregation": "mean",
|
39 |
+
"higher_is_better": true
|
40 |
+
},
|
41 |
+
{
|
42 |
+
"metric": "acc_norm",
|
43 |
+
"aggregation": "mean",
|
44 |
+
"higher_is_better": true
|
45 |
+
}
|
46 |
+
],
|
47 |
+
"output_type": "multiple_choice",
|
48 |
+
"repeats": 1,
|
49 |
+
"should_decontaminate": false,
|
50 |
+
"metadata": {
|
51 |
+
"version": 1.0
|
52 |
+
}
|
53 |
+
}
|
54 |
+
},
|
55 |
+
"versions": {
|
56 |
+
"acva": 1.0
|
57 |
+
},
|
58 |
+
"n-shot": {
|
59 |
+
"acva": 5
|
60 |
+
},
|
61 |
+
"higher_is_better": {
|
62 |
+
"acva": {
|
63 |
+
"acc": true,
|
64 |
+
"acc_norm": true
|
65 |
+
}
|
66 |
+
},
|
67 |
+
"n-samples": {
|
68 |
+
"acva": {
|
69 |
+
"original": 8710,
|
70 |
+
"effective": 8710
|
71 |
+
}
|
72 |
+
},
|
73 |
+
"config": {
|
74 |
+
"model": "hf",
|
75 |
+
"model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
|
76 |
+
"model_num_parameters": 32512545792,
|
77 |
+
"model_dtype": "torch.float16",
|
78 |
+
"model_revision": "main",
|
79 |
+
"model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
|
80 |
+
"batch_size": "auto",
|
81 |
+
"batch_sizes": [
|
82 |
+
64
|
83 |
+
],
|
84 |
+
"device": null,
|
85 |
+
"use_cache": null,
|
86 |
+
"limit": null,
|
87 |
+
"bootstrap_iters": 100000,
|
88 |
+
"gen_kwargs": null,
|
89 |
+
"random_seed": 0,
|
90 |
+
"numpy_seed": 1234,
|
91 |
+
"torch_seed": 1234,
|
92 |
+
"fewshot_seed": 1234
|
93 |
+
},
|
94 |
+
"git_hash": "788a3672",
|
95 |
+
"date": 1737779797.3395095,
|
96 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single 
vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
97 |
+
"transformers_version": "4.48.1",
|
98 |
+
"upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
|
99 |
+
"tokenizer_pad_token": [
|
100 |
+
"<|endoftext|>",
|
101 |
+
"151643"
|
102 |
+
],
|
103 |
+
"tokenizer_eos_token": [
|
104 |
+
"<|endoftext|>",
|
105 |
+
"151643"
|
106 |
+
],
|
107 |
+
"tokenizer_bos_token": [
|
108 |
+
null,
|
109 |
+
"None"
|
110 |
+
],
|
111 |
+
"eot_token_id": 151643,
|
112 |
+
"max_length": 32768,
|
113 |
+
"task_hashes": {},
|
114 |
+
"model_source": "hf",
|
115 |
+
"model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
|
116 |
+
"model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
|
117 |
+
"system_instruction": null,
|
118 |
+
"system_instruction_sha": null,
|
119 |
+
"fewshot_as_multiturn": false,
|
120 |
+
"chat_template": null,
|
121 |
+
"chat_template_sha": null,
|
122 |
+
"start_time": 26647.534977248,
|
123 |
+
"end_time": 27360.084961217,
|
124 |
+
"total_evaluation_time_seconds": "712.5499839689983"
|
125 |
+
}
|
evaluations/ar/AceGPT-v2-32B-Chat/ar_ifeval_0_shot.json
ADDED
@@ -0,0 +1,142 @@
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"ar_ifeval": {
|
4 |
+
"alias": "ar_ifeval",
|
5 |
+
"prompt_level_strict_acc,none": 0.2574626865671642,
|
6 |
+
"prompt_level_strict_acc_stderr,none": 0.018903377119672635,
|
7 |
+
"inst_level_strict_acc,none": 0.6341296928327645,
|
8 |
+
"inst_level_strict_acc_stderr,none": "N/A",
|
9 |
+
"prompt_level_loose_acc,none": 0.31529850746268656,
|
10 |
+
"prompt_level_loose_acc_stderr,none": 0.020087907677710036,
|
11 |
+
"inst_level_loose_acc,none": 0.6764505119453925,
|
12 |
+
"inst_level_loose_acc_stderr,none": "N/A"
|
13 |
+
}
|
14 |
+
},
|
15 |
+
"group_subtasks": {
|
16 |
+
"ar_ifeval": []
|
17 |
+
},
|
18 |
+
"configs": {
|
19 |
+
"ar_ifeval": {
|
20 |
+
"task": "ar_ifeval",
|
21 |
+
"dataset_path": "lm_eval/tasks/ar_ifeval/ar_ifeval.py",
|
22 |
+
"dataset_name": "ar_ifeval",
|
23 |
+
"dataset_kwargs": {
|
24 |
+
"trust_remote_code": true
|
25 |
+
},
|
26 |
+
"test_split": "test",
|
27 |
+
"doc_to_text": "prompt",
|
28 |
+
"doc_to_target": 0,
|
29 |
+
"process_results": "def process_results(doc, results):\n\n response = results[0]\n out_strict = process_sample(doc, response, 'strict')\n out_loose = process_sample(doc, response, 'loose')\n\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
|
30 |
+
"description": "",
|
31 |
+
"target_delimiter": " ",
|
32 |
+
"fewshot_delimiter": "\n\n",
|
33 |
+
"num_fewshot": 0,
|
34 |
+
"metric_list": [
|
35 |
+
{
|
36 |
+
"metric": "prompt_level_strict_acc",
|
37 |
+
"aggregation": "mean",
|
38 |
+
"higher_is_better": true
|
39 |
+
},
|
40 |
+
{
|
41 |
+
"metric": "inst_level_strict_acc",
|
42 |
+
"aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
|
43 |
+
"higher_is_better": true
|
44 |
+
},
|
45 |
+
{
|
46 |
+
"metric": "prompt_level_loose_acc",
|
47 |
+
"aggregation": "mean",
|
48 |
+
"higher_is_better": true
|
49 |
+
},
|
50 |
+
{
|
51 |
+
"metric": "inst_level_loose_acc",
|
52 |
+
"aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
|
53 |
+
"higher_is_better": true
|
54 |
+
}
|
55 |
+
],
|
56 |
+
"output_type": "generate_until",
|
57 |
+
"generation_kwargs": {
|
58 |
+
"until": [],
|
59 |
+
"do_sample": false,
|
60 |
+
"temperature": 0.0,
|
61 |
+
"max_gen_toks": 1280
|
62 |
+
},
|
63 |
+
"repeats": 1,
|
64 |
+
"should_decontaminate": false,
|
65 |
+
"metadata": {
|
66 |
+
"version": 4.0
|
67 |
+
}
|
68 |
+
}
|
69 |
+
},
|
70 |
+
"versions": {
|
71 |
+
"ar_ifeval": 4.0
|
72 |
+
},
|
73 |
+
"n-shot": {
|
74 |
+
"ar_ifeval": 0
|
75 |
+
},
|
76 |
+
"higher_is_better": {
|
77 |
+
"ar_ifeval": {
|
78 |
+
"prompt_level_strict_acc": true,
|
79 |
+
"inst_level_strict_acc": true,
|
80 |
+
"prompt_level_loose_acc": true,
|
81 |
+
"inst_level_loose_acc": true
|
82 |
+
}
|
83 |
+
},
|
84 |
+
"n-samples": {
|
85 |
+
"ar_ifeval": {
|
86 |
+
"original": 536,
|
87 |
+
"effective": 536
|
88 |
+
}
|
89 |
+
},
|
90 |
+
"config": {
|
91 |
+
"model": "hf",
|
92 |
+
"model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
|
93 |
+
"model_num_parameters": 32512545792,
|
94 |
+
"model_dtype": "torch.float16",
|
95 |
+
"model_revision": "main",
|
96 |
+
"model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
|
97 |
+
"batch_size": 1,
|
98 |
+
"batch_sizes": [],
|
99 |
+
"device": null,
|
100 |
+
"use_cache": null,
|
101 |
+
"limit": null,
|
102 |
+
"bootstrap_iters": 100000,
|
103 |
+
"gen_kwargs": null,
|
104 |
+
"random_seed": 0,
|
105 |
+
"numpy_seed": 1234,
|
106 |
+
"torch_seed": 1234,
|
107 |
+
"fewshot_seed": 1234
|
108 |
+
},
|
109 |
+
"git_hash": "788a3672",
|
110 |
+
"date": 1738794647.2071357,
|
111 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma 
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
112 |
+
"transformers_version": "4.48.2",
|
113 |
+
"upper_git_hash": null,
|
114 |
+
"tokenizer_pad_token": [
|
115 |
+
"<|endoftext|>",
|
116 |
+
"151643"
|
117 |
+
],
|
118 |
+
"tokenizer_eos_token": [
|
119 |
+
"<|endoftext|>",
|
120 |
+
"151643"
|
121 |
+
],
|
122 |
+
"tokenizer_bos_token": [
|
123 |
+
null,
|
124 |
+
"None"
|
125 |
+
],
|
126 |
+
"eot_token_id": 151643,
|
127 |
+
"max_length": 32768,
|
128 |
+
"task_hashes": {
|
129 |
+
"ar_ifeval": "d0b91e989c8b697090db63bf498d8e2d8dd80815a595e5f22845a8425bff22fa"
|
130 |
+
},
|
131 |
+
"model_source": "hf",
|
132 |
+
"model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
|
133 |
+
"model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
|
134 |
+
"system_instruction": null,
|
135 |
+
"system_instruction_sha": null,
|
136 |
+
"fewshot_as_multiturn": false,
|
137 |
+
"chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
|
138 |
+
"chat_template_sha": "af9c0233881b083b52ff773580215222b5440ac3d0beeeca99b76329b048f8db",
|
139 |
+
"start_time": 1753623.131321269,
|
140 |
+
"end_time": 1761093.682009075,
|
141 |
+
"total_evaluation_time_seconds": "7470.550687805982"
|
142 |
+
}
|
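The `ar_ifeval` config above reports prompt-level metrics with a plain `mean` and instruction-level metrics with the custom `agg_inst_level_acc` function. A minimal sketch of how the two differ is shown here. The per-prompt lists are hypothetical illustration data, not taken from the evaluation, and the prompt-level rule (every instruction in a prompt must be followed) is an assumption inferred from the name `follow_all_instructions`.

```python
def agg_inst_level_acc(items):
    # Same logic as the aggregation string recorded in the config:
    # flatten the per-prompt lists, then average over all instructions.
    flat_items = [item for sublist in items for item in sublist]
    return sum(flat_items) / len(flat_items)


# Hypothetical data: one list per prompt, one boolean per instruction.
follow_instruction_list = [
    [True, True],         # prompt 1: both instructions followed
    [True, False, True],  # prompt 2: one instruction missed
    [False],              # prompt 3: the single instruction missed
]

# Prompt-level accuracy (assumed rule): a prompt counts only if all of
# its instructions were followed, then the per-prompt flags are averaged.
prompt_level = sum(all(flags) for flags in follow_instruction_list) / len(
    follow_instruction_list
)

# Instruction-level accuracy: every instruction counts equally.
inst_level = agg_inst_level_acc(follow_instruction_list)

print(f"prompt-level: {prompt_level:.4f}")
print(f"inst-level:   {inst_level:.4f}")
```

Because a single missed instruction fails the whole prompt, the prompt-level score can sit well below the instruction-level score, which is the pattern in the reported results (0.257 strict prompt-level against 0.634 strict instruction-level).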
evaluations/ar/AceGPT-v2-32B-Chat/araMath_v3_5_shot.json
ADDED
@@ -0,0 +1,126 @@
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"araMath_v3": {
|
4 |
+
"alias": "araMath_v3",
|
5 |
+
"acc,none": 0.6446280991735537,
|
6 |
+
"acc_stderr,none": 0.019475010007284948,
|
7 |
+
"acc_norm,none": 0.6446280991735537,
|
8 |
+
"acc_norm_stderr,none": 0.019475010007284948
|
9 |
+
}
|
10 |
+
},
|
11 |
+
"group_subtasks": {
|
12 |
+
"araMath_v3": []
|
13 |
+
},
|
14 |
+
"configs": {
|
15 |
+
"araMath_v3": {
|
16 |
+
"task": "araMath_v3",
|
17 |
+
"tag": [
|
18 |
+
"multiple_choice"
|
19 |
+
],
|
20 |
+
"dataset_path": "lm_eval/tasks/araMath_v3/araMath_v3.py",
|
21 |
+
"dataset_name": "araMath_v3",
|
22 |
+
"dataset_kwargs": {
|
23 |
+
"trust_remote_code": true
|
24 |
+
},
|
25 |
+
"validation_split": "validation",
|
26 |
+
"test_split": "test",
|
27 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def remove_prefix(choice):\n prefixes = [\"(A)\", \"(B)\", \"(C)\", \"(D)\"]\n for prefix in prefixes:\n if choice.startswith(prefix + \" \"):\n return choice[len(prefix) + 1:] \n return choice \n\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"options\"])]\n )\n\n prompt = f\"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": keys_en.index(doc[\"label\"]),\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
|
28 |
+
"doc_to_text": "query",
|
29 |
+
"doc_to_target": "gold",
|
30 |
+
"doc_to_choice": "{{choices}}",
|
31 |
+
"description": "\u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u0645\u0646 \u0628\u064a\u0646 'A\u060c B\u060c C\u060c D' \u062f\u0648\u0646 \u0634\u0631\u062d",
|
32 |
+
"target_delimiter": " ",
|
33 |
+
"fewshot_delimiter": "\n\n",
|
34 |
+
"num_fewshot": 5,
|
35 |
+
"metric_list": [
|
36 |
+
{
|
37 |
+
"metric": "acc",
|
38 |
+
"aggregation": "mean",
|
39 |
+
"higher_is_better": true
|
40 |
+
},
|
41 |
+
{
|
42 |
+
"metric": "acc_norm",
|
43 |
+
"aggregation": "mean",
|
44 |
+
"higher_is_better": true
|
45 |
+
}
|
46 |
+
],
|
47 |
+
"output_type": "multiple_choice",
|
48 |
+
"repeats": 1,
|
49 |
+
"should_decontaminate": true,
|
50 |
+
"doc_to_decontamination_query": "query",
|
51 |
+
"metadata": {
|
52 |
+
"version": 0.0
|
53 |
+
}
|
54 |
+
}
|
55 |
+
},
|
56 |
+
"versions": {
|
57 |
+
"araMath_v3": 0.0
|
58 |
+
},
|
59 |
+
"n-shot": {
|
60 |
+
"araMath_v3": 5
|
61 |
+
},
|
62 |
+
"higher_is_better": {
|
63 |
+
"araMath_v3": {
|
64 |
+
"acc": true,
|
65 |
+
"acc_norm": true
|
66 |
+
}
|
67 |
+
},
|
68 |
+
"n-samples": {
|
69 |
+
"araMath_v3": {
|
70 |
+
"original": 605,
|
71 |
+
"effective": 605
|
72 |
+
}
|
73 |
+
},
|
74 |
+
"config": {
|
75 |
+
"model": "hf",
|
76 |
+
"model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
|
77 |
+
"model_num_parameters": 32512545792,
|
78 |
+
"model_dtype": "torch.float16",
|
79 |
+
"model_revision": "main",
|
80 |
+
"model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
|
81 |
+
"batch_size": 1,
|
82 |
+
"batch_sizes": [],
|
83 |
+
"device": null,
|
84 |
+
"use_cache": null,
|
85 |
+
"limit": null,
|
86 |
+
"bootstrap_iters": 100000,
|
87 |
+
"gen_kwargs": null,
|
88 |
+
"random_seed": 0,
|
89 |
+
"numpy_seed": 1234,
|
90 |
+
"torch_seed": 1234,
|
91 |
+
"fewshot_seed": 1234
|
92 |
+
},
|
93 |
+
"git_hash": "788a3672",
|
94 |
+
"date": 1738805225.8162587,
|
95 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma 
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
96 |
+
"transformers_version": "4.48.2",
|
97 |
+
"upper_git_hash": null,
|
98 |
+
"tokenizer_pad_token": [
|
99 |
+
"<|endoftext|>",
|
100 |
+
"151643"
|
101 |
+
],
|
102 |
+
"tokenizer_eos_token": [
|
103 |
+
"<|endoftext|>",
|
104 |
+
"151643"
|
105 |
+
],
|
106 |
+
"tokenizer_bos_token": [
|
107 |
+
null,
|
108 |
+
"None"
|
109 |
+
],
|
110 |
+
"eot_token_id": 151643,
|
111 |
+
"max_length": 32768,
|
112 |
+
"task_hashes": {
|
113 |
+
"araMath_v3": "17b2596f46d709ea107ed20bef044ca126de23a8e9bbc8ba0a9beef94fbc032d"
|
114 |
+
},
|
115 |
+
"model_source": "hf",
|
116 |
+
"model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
|
117 |
+
"model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
|
118 |
+
"system_instruction": null,
|
119 |
+
"system_instruction_sha": null,
|
120 |
+
"fewshot_as_multiturn": false,
|
121 |
+
"chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
|
122 |
+
"chat_template_sha": "af9c0233881b083b52ff773580215222b5440ac3d0beeeca99b76329b048f8db",
|
123 |
+
"start_time": 1764201.606664753,
|
124 |
+
"end_time": 1764270.091855178,
|
125 |
+
"total_evaluation_time_seconds": "68.48519042483531"
|
126 |
+
}
|
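The `process_docs` string recorded for `araMath_v3` builds each prompt from a question and four options. The sketch below restates that logic as a standalone function so the resulting prompt can be inspected without the `datasets` library. The sample document is hypothetical, and the field names (`question`, `options`, `label`) are taken from the recorded code rather than from the dataset itself.

```python
def remove_prefix(choice):
    # Strip a leading "(A) " ... "(D) " marker if the option carries one.
    for prefix in ["(A)", "(B)", "(C)", "(D)"]:
        if choice.startswith(prefix + " "):
            return choice[len(prefix) + 1:]
    return choice


def format_example(doc, keys):
    question = doc["question"].strip()
    choices = "".join(
        f"{key}. {remove_prefix(choice)}\n"
        for key, choice in zip(keys, doc["options"])
    )
    # Arabic labels as in the recorded config ("question" and "answer").
    return (
        f"\n\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\n"
        f"{choices}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:"
    )


def process_doc(doc):
    keys_en = ["A", "B", "C", "D"]
    return {
        "query": format_example(doc, keys_en),
        "choices": keys_en,
        "gold": keys_en.index(doc["label"]),
    }


# Hypothetical document, for illustration only.
sample = {
    "question": " 2 + 3 = ? ",
    "options": ["(A) 4", "(B) 5", "(C) 6", "(D) 7"],
    "label": "B",
}
out = process_doc(sample)
print(out["query"])
print(out["gold"])
```

The gold answer is stored as an index into `choices`, which is what `doc_to_target: "gold"` and `doc_to_choice: "{{choices}}"` in the config refer to.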
evaluations/ar/AceGPT-v2-32B-Chat/araPro_0_shot.json
ADDED
@@ -0,0 +1,130 @@
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"araPro": {
|
4 |
+
"alias": "araPro",
|
5 |
+
"acc,none": 0.671865626874625,
|
6 |
+
"acc_stderr,none": 0.006640213946839424,
|
7 |
+
"acc_norm,none": 0.671865626874625,
|
8 |
+
"acc_norm_stderr,none": 0.006640213946839424
|
9 |
+
}
|
10 |
+
},
|
11 |
+
"group_subtasks": {
|
12 |
+
"araPro": []
|
13 |
+
},
|
14 |
+
"configs": {
|
15 |
+
"araPro": {
|
16 |
+
"task": "araPro",
|
17 |
+
"tag": [
|
18 |
+
"multiple_choice"
|
19 |
+
],
|
20 |
+
"dataset_path": "lm_eval/tasks/araPro/araPro.py",
|
21 |
+
"dataset_name": "araPro",
|
22 |
+
"dataset_kwargs": {
|
23 |
+
"trust_remote_code": true
|
24 |
+
},
|
25 |
+
"validation_split": "validation",
|
26 |
+
"test_split": "test",
|
27 |
+
"fewshot_split": "validation",
|
28 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.replace('.', '') if '.' in choice[:2] else choice\n \n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choice_num = ['choice1', 'choice2', 'choice3', 'choice4']\n choices = \"\".join(\n [f\"{key}. {remove_prefix(doc[choice_num[index]])}\\n\" for index, key in enumerate(keys)]\n )\n\n prompt = f\"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n #keys = [\"1\", \"2\", \"3\", \"4\"]\n keys = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys), \n \"choices\": keys,\n \"gold\": doc[\"answer\"]-1,\n } \n\n return out_doc\n \n return dataset.map(_process_docs)\n",
|
29 |
+
"doc_to_text": "query",
|
30 |
+
"doc_to_target": "gold",
|
31 |
+
"doc_to_choice": "{{choices}}",
|
32 |
+
"description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d",
|
33 |
+
"target_delimiter": " ",
|
34 |
+
"fewshot_delimiter": "\n\n",
|
35 |
+
"fewshot_config": {
|
36 |
+
"sampler": "balanced_cat"
|
37 |
+
},
|
38 |
+
"num_fewshot": 0,
|
39 |
+
"metric_list": [
|
40 |
+
{
|
41 |
+
"metric": "acc",
|
42 |
+
"aggregation": "mean",
|
43 |
+
"higher_is_better": true
|
44 |
+
},
|
45 |
+
{
|
46 |
+
"metric": "acc_norm",
|
47 |
+
"aggregation": "mean",
|
48 |
+
"higher_is_better": true
|
49 |
+
}
|
50 |
+
],
|
51 |
+
"output_type": "multiple_choice",
|
52 |
+
"repeats": 1,
|
53 |
+
"should_decontaminate": true,
|
54 |
+
"doc_to_decontamination_query": "Question",
|
55 |
+
"metadata": {
|
56 |
+
"version": 2.0
|
57 |
+
}
|
58 |
+
}
|
59 |
+
},
|
60 |
+
"versions": {
|
61 |
+
"araPro": 2.0
|
62 |
+
},
|
63 |
+
"n-shot": {
|
64 |
+
"araPro": 0
|
65 |
+
},
|
66 |
+
"higher_is_better": {
|
67 |
+
"araPro": {
|
68 |
+
"acc": true,
|
69 |
+
"acc_norm": true
|
70 |
+
}
|
71 |
+
},
|
72 |
+
"n-samples": {
|
73 |
+
"araPro": {
|
74 |
+
"original": 5001,
|
75 |
+
"effective": 5001
|
76 |
+
}
|
77 |
+
},
|
78 |
+
"config": {
|
79 |
+
"model": "hf",
|
80 |
+
"model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
|
81 |
+
"model_num_parameters": 32512545792,
|
82 |
+
"model_dtype": "torch.float16",
|
83 |
+
"model_revision": "main",
|
84 |
+
"model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
|
85 |
+
"batch_size": 1,
|
86 |
+
"batch_sizes": [],
|
87 |
+
"device": null,
|
88 |
+
"use_cache": null,
|
89 |
+
"limit": null,
|
90 |
+
"bootstrap_iters": 100000,
|
91 |
+
"gen_kwargs": null,
|
92 |
+
"random_seed": 0,
|
93 |
+
"numpy_seed": 1234,
|
94 |
+
"torch_seed": 1234,
|
95 |
+
"fewshot_seed": 1234
|
96 |
+
},
|
97 |
+
"git_hash": "788a3672",
|
98 |
+
"date": 1738802810.5474553,
|
99 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma 
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
100 |
+
"transformers_version": "4.48.2",
|
101 |
+
"upper_git_hash": null,
|
102 |
+
"tokenizer_pad_token": [
|
103 |
+
"<|endoftext|>",
|
104 |
+
"151643"
|
105 |
+
],
|
106 |
+
"tokenizer_eos_token": [
|
107 |
+
"<|endoftext|>",
|
108 |
+
"151643"
|
109 |
+
],
|
110 |
+
"tokenizer_bos_token": [
|
111 |
+
null,
|
112 |
+
"None"
|
113 |
+
],
|
114 |
+
"eot_token_id": 151643,
|
115 |
+
"max_length": 32768,
|
116 |
+
"task_hashes": {
|
117 |
+
"araPro": "2f706897ad0129e016cc8d6907f8bb4359c32403fc2d1b0a4e78717f424793da"
|
118 |
+
},
|
119 |
+
"model_source": "hf",
|
120 |
+
"model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
|
121 |
+
"model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
|
122 |
+
"system_instruction": null,
|
123 |
+
"system_instruction_sha": null,
|
124 |
+
"fewshot_as_multiturn": false,
|
125 |
+
"chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
|
126 |
+
"chat_template_sha": "af9c0233881b083b52ff773580215222b5440ac3d0beeeca99b76329b048f8db",
|
127 |
+
"start_time": 1761786.552693387,
|
128 |
+
"end_time": 1761894.218775138,
|
129 |
+
"total_evaluation_time_seconds": "107.66608175099827"
|
130 |
+
}
|
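The reported `acc_stderr` values can be sanity-checked from the accuracy and the sample count alone. The sketch below assumes the harness computes the standard error of a mean of binary outcomes using the sample (n - 1) variance; that formula is an assumption here, not something the result file states. The inputs are the `araPro` figures from the file above.

```python
import math


def mean_stderr(p, n):
    # Assumed formula: sample standard deviation of n binary outcomes
    # divided by sqrt(n), which simplifies to sqrt(p * (1 - p) / (n - 1)).
    return math.sqrt(p * (1.0 - p) / (n - 1))


acc = 0.671865626874625            # "acc,none" for araPro
n = 5001                           # "effective" sample count for araPro
reported = 0.006640213946839424    # "acc_stderr,none" for araPro

correct = round(acc * n)           # implied number of correct answers
stderr = mean_stderr(acc, n)

print(f"implied correct answers: {correct}")
print(f"recomputed stderr: {stderr:.12f}")
print(f"reported stderr:   {reported:.12f}")
```

Under that assumption the recomputed value agrees with the reported one to many decimal places, and the same check can be repeated with the `araMath_v3` (n = 605) and `ar_ifeval` prompt-level (n = 536) figures.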
evaluations/ar/AceGPT-v2-32B-Chat/arabicmmlu_0_shot.json
ADDED
The diff for this file is too large to render.
See raw diff
|
|
evaluations/ar/AceGPT-v2-32B-Chat/etec_v2_0_shot.json
ADDED
@@ -0,0 +1,126 @@
{
  "results": {
    "etec_v2": {
      "alias": "etec_v2",
      "acc,none": 0.6481187069422364,
      "acc_stderr,none": 0.010996501146375258,
      "acc_norm,none": 0.6481187069422364,
      "acc_norm_stderr,none": 0.010996501146375258
    }
  },
  "group_subtasks": {
    "etec_v2": []
  },
  "configs": {
    "etec_v2": {
      "task": "etec_v2",
      "tag": [
        "multiple_choice"
      ],
      "dataset_path": "lm_eval/tasks/etec_v2/etec.py",
      "dataset_name": "etec_v2",
      "dataset_kwargs": {
        "trust_remote_code": true
      },
      "validation_split": "validation",
      "test_split": "test",
      "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n print(doc[\"label\"])\n keys_ar = [\"\u0623\", \"\u0628\", \"\u062c\", \"\u062f\"]\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": int(doc[\"label\"])-1,\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
      "doc_to_text": "query",
      "doc_to_target": "gold",
      "doc_to_choice": "choices",
      "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d\n ",
      "target_delimiter": " ",
      "fewshot_delimiter": "\n\n",
      "num_fewshot": 0,
      "metric_list": [
        {
          "metric": "acc",
          "aggregation": "mean",
          "higher_is_better": true
        },
        {
          "metric": "acc_norm",
          "aggregation": "mean",
          "higher_is_better": true
        }
      ],
      "output_type": "multiple_choice",
      "repeats": 1,
      "should_decontaminate": true,
      "doc_to_decontamination_query": "query",
      "metadata": {
        "version": 0.0
      }
    }
  },
  "versions": {
    "etec_v2": 0.0
  },
  "n-shot": {
    "etec_v2": 0
  },
  "higher_is_better": {
    "etec_v2": {
      "acc": true,
      "acc_norm": true
    }
  },
  "n-samples": {
    "etec_v2": {
      "original": 1887,
      "effective": 1887
    }
  },
  "config": {
    "model": "hf",
    "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
    "model_num_parameters": 32512545792,
    "model_dtype": "torch.float16",
    "model_revision": "main",
    "model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
    "batch_size": 1,
    "batch_sizes": [],
    "device": null,
    "use_cache": null,
    "limit": null,
    "bootstrap_iters": 100000,
    "gen_kwargs": null,
    "random_seed": 0,
    "numpy_seed": 1234,
    "torch_seed": 1234,
    "fewshot_seed": 1234
  },
  "git_hash": "788a3672",
  "date": 1738805984.3189015,
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma 
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
  "transformers_version": "4.48.2",
  "upper_git_hash": null,
  "tokenizer_pad_token": [
    "<|endoftext|>",
    "151643"
  ],
  "tokenizer_eos_token": [
    "<|endoftext|>",
    "151643"
  ],
  "tokenizer_bos_token": [
    null,
    "None"
  ],
  "eot_token_id": 151643,
  "max_length": 32768,
  "task_hashes": {
    "etec_v2": "697b8bfc7d6b0f85165e5cca6953182b09b7a2b0d79fa31e74cc3897f432de41"
  },
  "model_source": "hf",
  "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
  "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
  "system_instruction": null,
  "system_instruction_sha": null,
  "fewshot_as_multiturn": false,
  "chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
  "chat_template_sha": "af9c0233881b083b52ff773580215222b5440ac3d0beeeca99b76329b048f8db",
  "start_time": 1764960.166542801,
  "end_time": 1765035.801506021,
  "total_evaluation_time_seconds": "75.63496321998537"
}
evaluations/ar/AceGPT-v2-32B-Chat/exams_ar_5_shot.json
ADDED
@@ -0,0 +1,127 @@
{
  "results": {
    "exams_ar": {
      "alias": "exams_ar",
      "acc,none": 0.553072625698324,
      "acc_stderr,none": 0.021474702941383872,
      "acc_norm,none": 0.553072625698324,
      "acc_norm_stderr,none": 0.021474702941383872
    }
  },
  "group_subtasks": {
    "exams_ar": []
  },
  "configs": {
    "exams_ar": {
      "task": "exams_ar",
      "tag": [
        "multiple_choice"
      ],
      "dataset_path": "lm_eval/tasks/exams_ar",
      "dataset_name": "exams_ar",
      "dataset_kwargs": {
        "trust_remote_code": true
      },
      "validation_split": "validation",
      "test_split": "test",
      "fewshot_split": "validation",
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n\n def _process_docs(doc):\n def format_example(doc, keys):\n \"\"\"\n <prompt>\n \u0633\u0624\u0627\u0644:\n A. <choice1>\n B. <choice2>\n C. <choice3>\n D. <choice4>\n \u0627\u062c\u0627\u0628\u0629:\n \"\"\"\n \n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n def _format_subject(subject):\n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n\n keys = [\"A\", \"B\", \"C\", \"D\"]\n \n subject = doc['id'].split(\"-\")[0]\n description = f\"\ufed2\ufef4\ufee3\ufe8d \ufef2\ufee0\ufef3 \ufe84\ufeb4\ufe8c\ufedf\ufe93 \ufe8d\ufefc\ufea8\ufe98\ufef3\ufe8d\ufead \ufee2\ufee7 \ufee2\ufe98\ufecb\ufea9\ufea9 (\ufee2\ufecb \ufe8d\ufefa\ufe9f\ufe8e\ufe91\ufe8e\ufe97) \ufea1\ufeee\ufedf {_format_subject(subject)} \\n\" #\ufee2\ufee7 \ufed2\ufec0\ufee0\ufedb \ufe8e\ufea8\ufe97\ufead \ufe88\ufe9f\ufe8e\ufe91\ufe93 \ufeed\ufe8e\ufea3\ufea9\ufe93 \ufee2\ufee7 \ufe90\ufef4\ufee7 'A\u060c B\u060c C\u060c D' \ufea9\ufeee\ufee7 \ufeb5\ufeae\ufea3\\n\"\n\n out_doc = {\n \"idx\": doc[\"idx\"],\n \"id\": doc[\"id\"],\n 'dsecription': description,\n \"query\": format_example(doc, keys), # \"Question: \" + doc[\"question\"]['stem'] + \"\\nAnswer:\",\n \"choices\": keys,\n \"gold\": [\"A\", \"B\", \"C\", \"D\"].index(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_docs)\n",
      "doc_to_text": "query",
      "doc_to_target": "gold",
      "doc_to_choice": "choices",
      "description": "description",
      "target_delimiter": " ",
      "fewshot_delimiter": "\n\n",
      "num_fewshot": 5,
      "metric_list": [
        {
          "metric": "acc",
          "aggregation": "mean",
          "higher_is_better": true
        },
        {
          "metric": "acc_norm",
          "aggregation": "mean",
          "higher_is_better": true
        }
      ],
      "output_type": "multiple_choice",
      "repeats": 1,
      "should_decontaminate": true,
      "doc_to_decontamination_query": "query",
      "metadata": {
        "version": 1.0
      }
    }
  },
  "versions": {
    "exams_ar": 1.0
  },
  "n-shot": {
    "exams_ar": 5
  },
  "higher_is_better": {
    "exams_ar": {
      "acc": true,
      "acc_norm": true
    }
  },
  "n-samples": {
    "exams_ar": {
      "original": 537,
      "effective": 537
    }
  },
  "config": {
    "model": "hf",
    "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
    "model_num_parameters": 32512545792,
    "model_dtype": "torch.float16",
    "model_revision": "main",
    "model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
    "batch_size": "auto",
    "batch_sizes": [
      16
    ],
    "device": null,
    "use_cache": null,
    "limit": null,
    "bootstrap_iters": 100000,
    "gen_kwargs": null,
    "random_seed": 0,
    "numpy_seed": 1234,
    "torch_seed": 1234,
    "fewshot_seed": 1234
  },
  "git_hash": "788a3672",
  "date": 1737780545.20475,
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single 
vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
  "transformers_version": "4.48.1",
  "upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
  "tokenizer_pad_token": [
    "<|endoftext|>",
    "151643"
  ],
  "tokenizer_eos_token": [
    "<|endoftext|>",
    "151643"
  ],
  "tokenizer_bos_token": [
    null,
    "None"
  ],
  "eot_token_id": 151643,
  "max_length": 32768,
  "task_hashes": {},
  "model_source": "hf",
  "model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
  "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
  "system_instruction": null,
  "system_instruction_sha": null,
  "fewshot_as_multiturn": false,
  "chat_template": null,
  "chat_template_sha": null,
  "start_time": 27395.295045238,
  "end_time": 27506.949709817,
  "total_evaluation_time_seconds": "111.65466457900038"
}
evaluations/ar/AceGPT-v2-32B-Chat/gat_0_shot.json
ADDED
@@ -0,0 +1,543 @@
{
  "results": {
    "gat": {
      "acc,none": 0.4321459927254484,
      "acc_stderr,none": 0.0038347299693873033,
      "alias": "gat"
    },
    "gat_algebra": {
      "alias": " - gat_algebra",
      "acc,none": 0.3992578849721707,
      "acc_stderr,none": 0.009435653731651068
    },
    "gat_analogy": {
      "alias": " - gat_analogy",
      "acc,none": 0.2867030965391621,
      "acc_stderr,none": 0.00863295163043938
    },
    "gat_arithmetic": {
      "alias": " - gat_arithmetic",
      "acc,none": 0.3894000736105999,
      "acc_stderr,none": 0.009356458715331561
    },
    "gat_association": {
      "alias": " - gat_association",
      "acc,none": 0.4143540669856459,
      "acc_stderr,none": 0.01524590184737997
    },
    "gat_comparisons": {
      "alias": " - gat_comparisons",
      "acc,none": 0.34672131147540985,
      "acc_stderr,none": 0.013631312083187472
    },
    "gat_completion": {
      "alias": " - gat_completion",
      "acc,none": 0.5793388429752067,
      "acc_stderr,none": 0.014197745251253151
    },
    "gat_contextual": {
      "alias": " - gat_contextual",
      "acc,none": 0.522239263803681,
      "acc_stderr,none": 0.013837823280527494
    },
    "gat_geometry": {
      "alias": " - gat_geometry",
      "acc,none": 0.5013698630136987,
      "acc_stderr,none": 0.026207022561245137
    },
    "gat_reading": {
      "alias": " - gat_reading",
      "acc,none": 0.585633270321361,
      "acc_stderr,none": 0.009580200187530542
    }
  },
  "groups": {
    "gat": {
      "acc,none": 0.4321459927254484,
      "acc_stderr,none": 0.0038347299693873033,
      "alias": "gat"
    }
  },
  "group_subtasks": {
    "gat": [
      "gat_analogy",
      "gat_association",
      "gat_completion",
      "gat_reading",
      "gat_algebra",
      "gat_arithmetic",
      "gat_comparisons",
      "gat_contextual",
      "gat_geometry"
    ]
  },
  "configs": {
    "gat_algebra": {
      "task": "gat_algebra",
      "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
      "dataset_name": "algebra",
      "dataset_kwargs": {
        "trust_remote_code": true
      },
      "test_split": "test",
      "fewshot_split": "validation",
      "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
      "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
      "doc_to_target": "{{label}}",
      "doc_to_choice": [
        "\u0623",
        "\u0628",
        "\u062c",
        "\u062f"
      ],
      "description": "",
      "target_delimiter": " ",
      "fewshot_delimiter": "\n\n",
      "num_fewshot": 0,
      "metric_list": [
        {
          "metric": "acc",
          "aggregation": "mean",
          "higher_is_better": true
        }
      ],
      "output_type": "multiple_choice",
      "repeats": 1,
      "should_decontaminate": false,
      "metadata": {
        "version": 0.0
      }
    },
    "gat_analogy": {
      "task": "gat_analogy",
      "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
      "dataset_name": "analogy",
      "dataset_kwargs": {
        "trust_remote_code": true
      },
      "test_split": "test",
      "fewshot_split": "validation",
      "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
      "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
      "doc_to_target": "{{label}}",
      "doc_to_choice": [
        "\u0623",
        "\u0628",
        "\u062c",
        "\u062f"
      ],
      "description": "",
      "target_delimiter": " ",
      "fewshot_delimiter": "\n\n",
      "num_fewshot": 0,
      "metric_list": [
        {
          "metric": "acc",
          "aggregation": "mean",
          "higher_is_better": true
        }
      ],
      "output_type": "multiple_choice",
      "repeats": 1,
      "should_decontaminate": false,
      "metadata": {
        "version": 0.0
      }
    },
    "gat_arithmetic": {
      "task": "gat_arithmetic",
      "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
      "dataset_name": "arithmetic",
      "dataset_kwargs": {
        "trust_remote_code": true
      },
      "test_split": "test",
      "fewshot_split": "validation",
      "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
      "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
      "doc_to_target": "{{label}}",
      "doc_to_choice": [
        "\u0623",
        "\u0628",
        "\u062c",
        "\u062f"
      ],
      "description": "",
      "target_delimiter": " ",
      "fewshot_delimiter": "\n\n",
      "num_fewshot": 0,
      "metric_list": [
        {
          "metric": "acc",
          "aggregation": "mean",
          "higher_is_better": true
        }
      ],
      "output_type": "multiple_choice",
      "repeats": 1,
      "should_decontaminate": false,
      "metadata": {
        "version": 0.0
      }
    },
    "gat_association": {
      "task": "gat_association",
      "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
      "dataset_name": "association",
      "dataset_kwargs": {
        "trust_remote_code": true
      },
      "test_split": "test",
      "fewshot_split": "validation",
      "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
      "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
      "doc_to_target": "{{label}}",
      "doc_to_choice": [
        "\u0623",
        "\u0628",
        "\u062c",
        "\u062f"
      ],
      "description": "",
      "target_delimiter": " ",
      "fewshot_delimiter": "\n\n",
      "num_fewshot": 0,
      "metric_list": [
        {
          "metric": "acc",
          "aggregation": "mean",
          "higher_is_better": true
        }
      ],
      "output_type": "multiple_choice",
      "repeats": 1,
      "should_decontaminate": false,
      "metadata": {
        "version": 0.0
      }
    },
    "gat_comparisons": {
      "task": "gat_comparisons",
      "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
      "dataset_name": "comparisons",
      "dataset_kwargs": {
        "trust_remote_code": true
      },
      "test_split": "test",
      "fewshot_split": "validation",
      "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
      "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
      "doc_to_target": "{{label}}",
      "doc_to_choice": [
        "\u0623",
        "\u0628",
        "\u062c",
        "\u062f"
      ],
      "description": "",
      "target_delimiter": " ",
      "fewshot_delimiter": "\n\n",
      "num_fewshot": 0,
      "metric_list": [
        {
          "metric": "acc",
          "aggregation": "mean",
          "higher_is_better": true
        }
      ],
      "output_type": "multiple_choice",
      "repeats": 1,
      "should_decontaminate": false,
      "metadata": {
        "version": 0.0
      }
    },
    "gat_completion": {
      "task": "gat_completion",
      "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
      "dataset_name": "completion",
      "dataset_kwargs": {
        "trust_remote_code": true
      },
      "test_split": "test",
      "fewshot_split": "validation",
      "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
      "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
      "doc_to_target": "{{label}}",
      "doc_to_choice": [
        "\u0623",
        "\u0628",
        "\u062c",
        "\u062f"
      ],
      "description": "",
      "target_delimiter": " ",
      "fewshot_delimiter": "\n\n",
      "num_fewshot": 0,
      "metric_list": [
        {
          "metric": "acc",
          "aggregation": "mean",
          "higher_is_better": true
        }
      ],
      "output_type": "multiple_choice",
      "repeats": 1,
      "should_decontaminate": false,
      "metadata": {
        "version": 0.0
      }
    },
291 |
+
"gat_contextual": {
|
292 |
+
"task": "gat_contextual",
|
293 |
+
"dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
|
294 |
+
"dataset_name": "contextual",
|
295 |
+
"dataset_kwargs": {
|
296 |
+
"trust_remote_code": true
|
297 |
+
},
|
298 |
+
"test_split": "test",
|
299 |
+
"fewshot_split": "validation",
|
300 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
|
301 |
+
"doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
|
302 |
+
"doc_to_target": "{{label}}",
|
303 |
+
"doc_to_choice": [
|
304 |
+
"\u0623",
|
305 |
+
"\u0628",
|
306 |
+
"\u062c",
|
307 |
+
"\u062f"
|
308 |
+
],
|
309 |
+
"description": "\u0627\u0648\u062c\u062f \u0627\u0644\u062e\u0637\u0623 \u0627\u0644\u0633\u064a\u0627\u0642\u064a \u0641\u064a \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0627\u0644\u062a\u0627\u0644\u064a\u0629 \u0645\u0646 \u0628\u064a\u0646 \u0627\u0644\u062e\u064a\u0627\u0631\u0627\u062a:",
|
310 |
+
"target_delimiter": " ",
|
311 |
+
"fewshot_delimiter": "\n\n",
|
312 |
+
"num_fewshot": 0,
|
313 |
+
"metric_list": [
|
314 |
+
{
|
315 |
+
"metric": "acc",
|
316 |
+
"aggregation": "mean",
|
317 |
+
"higher_is_better": true
|
318 |
+
}
|
319 |
+
],
|
320 |
+
"output_type": "multiple_choice",
|
321 |
+
"repeats": 1,
|
322 |
+
"should_decontaminate": false,
|
323 |
+
"metadata": {
|
324 |
+
"version": 0.0
|
325 |
+
}
|
326 |
+
},
|
327 |
+
"gat_geometry": {
|
328 |
+
"task": "gat_geometry",
|
329 |
+
"dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
|
330 |
+
"dataset_name": "geometry",
|
331 |
+
"dataset_kwargs": {
|
332 |
+
"trust_remote_code": true
|
333 |
+
},
|
334 |
+
"test_split": "test",
|
335 |
+
"fewshot_split": "validation",
|
336 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
|
337 |
+
"doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
|
338 |
+
"doc_to_target": "{{label}}",
|
339 |
+
"doc_to_choice": [
|
340 |
+
"\u0623",
|
341 |
+
"\u0628",
|
342 |
+
"\u062c",
|
343 |
+
"\u062f"
|
344 |
+
],
|
345 |
+
"description": "",
|
346 |
+
"target_delimiter": " ",
|
347 |
+
"fewshot_delimiter": "\n\n",
|
348 |
+
"num_fewshot": 0,
|
349 |
+
"metric_list": [
|
350 |
+
{
|
351 |
+
"metric": "acc",
|
352 |
+
"aggregation": "mean",
|
353 |
+
"higher_is_better": true
|
354 |
+
}
|
355 |
+
],
|
356 |
+
"output_type": "multiple_choice",
|
357 |
+
"repeats": 1,
|
358 |
+
"should_decontaminate": false,
|
359 |
+
"metadata": {
|
360 |
+
"version": 0.0
|
361 |
+
}
|
362 |
+
},
|
363 |
+
"gat_reading": {
|
364 |
+
"task": "gat_reading",
|
365 |
+
"dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
|
366 |
+
"dataset_name": "reading",
|
367 |
+
"dataset_kwargs": {
|
368 |
+
"trust_remote_code": true
|
369 |
+
},
|
370 |
+
"test_split": "test",
|
371 |
+
"fewshot_split": "validation",
|
372 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
|
373 |
+
"doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
|
374 |
+
"doc_to_target": "{{label}}",
|
375 |
+
"doc_to_choice": [
|
376 |
+
"\u0623",
|
377 |
+
"\u0628",
|
378 |
+
"\u062c",
|
379 |
+
"\u062f"
|
380 |
+
],
|
381 |
+
"description": "",
|
382 |
+
"target_delimiter": " ",
|
383 |
+
"fewshot_delimiter": "\n\n",
|
384 |
+
"num_fewshot": 0,
|
385 |
+
"metric_list": [
|
386 |
+
{
|
387 |
+
"metric": "acc",
|
388 |
+
"aggregation": "mean",
|
389 |
+
"higher_is_better": true
|
390 |
+
}
|
391 |
+
],
|
392 |
+
"output_type": "multiple_choice",
|
393 |
+
"repeats": 1,
|
394 |
+
"should_decontaminate": false,
|
395 |
+
"metadata": {
|
396 |
+
"version": 0.0
|
397 |
+
}
|
398 |
+
}
|
399 |
+
},
|
400 |
+
"versions": {
|
401 |
+
"gat": 0,
|
402 |
+
"gat_algebra": 0.0,
|
403 |
+
"gat_analogy": 0.0,
|
404 |
+
"gat_arithmetic": 0.0,
|
405 |
+
"gat_association": 0.0,
|
406 |
+
"gat_comparisons": 0.0,
|
407 |
+
"gat_completion": 0.0,
|
408 |
+
"gat_contextual": 0.0,
|
409 |
+
"gat_geometry": 0.0,
|
410 |
+
"gat_reading": 0.0
|
411 |
+
},
|
412 |
+
"n-shot": {
|
413 |
+
"gat_algebra": 0,
|
414 |
+
"gat_analogy": 0,
|
415 |
+
"gat_arithmetic": 0,
|
416 |
+
"gat_association": 0,
|
417 |
+
"gat_comparisons": 0,
|
418 |
+
"gat_completion": 0,
|
419 |
+
"gat_contextual": 0,
|
420 |
+
"gat_geometry": 0,
|
421 |
+
"gat_reading": 0
|
422 |
+
},
|
423 |
+
"higher_is_better": {
|
424 |
+
"gat": {
|
425 |
+
"acc": true
|
426 |
+
},
|
427 |
+
"gat_algebra": {
|
428 |
+
"acc": true
|
429 |
+
},
|
430 |
+
"gat_analogy": {
|
431 |
+
"acc": true
|
432 |
+
},
|
433 |
+
"gat_arithmetic": {
|
434 |
+
"acc": true
|
435 |
+
},
|
436 |
+
"gat_association": {
|
437 |
+
"acc": true
|
438 |
+
},
|
439 |
+
"gat_comparisons": {
|
440 |
+
"acc": true
|
441 |
+
},
|
442 |
+
"gat_completion": {
|
443 |
+
"acc": true
|
444 |
+
},
|
445 |
+
"gat_contextual": {
|
446 |
+
"acc": true
|
447 |
+
},
|
448 |
+
"gat_geometry": {
|
449 |
+
"acc": true
|
450 |
+
},
|
451 |
+
"gat_reading": {
|
452 |
+
"acc": true
|
453 |
+
}
|
454 |
+
},
|
455 |
+
"n-samples": {
|
456 |
+
"gat_analogy": {
|
457 |
+
"original": 2745,
|
458 |
+
"effective": 2745
|
459 |
+
},
|
460 |
+
"gat_association": {
|
461 |
+
"original": 1045,
|
462 |
+
"effective": 1045
|
463 |
+
},
|
464 |
+
"gat_completion": {
|
465 |
+
"original": 1210,
|
466 |
+
"effective": 1210
|
467 |
+
},
|
468 |
+
"gat_reading": {
|
469 |
+
"original": 2645,
|
470 |
+
"effective": 2645
|
471 |
+
},
|
472 |
+
"gat_algebra": {
|
473 |
+
"original": 2695,
|
474 |
+
"effective": 2695
|
475 |
+
},
|
476 |
+
"gat_arithmetic": {
|
477 |
+
"original": 2717,
|
478 |
+
"effective": 2717
|
479 |
+
},
|
480 |
+
"gat_comparisons": {
|
481 |
+
"original": 1220,
|
482 |
+
"effective": 1220
|
483 |
+
},
|
484 |
+
"gat_contextual": {
|
485 |
+
"original": 1304,
|
486 |
+
"effective": 1304
|
487 |
+
},
|
488 |
+
"gat_geometry": {
|
489 |
+
"original": 365,
|
490 |
+
"effective": 365
|
491 |
+
}
|
492 |
+
},
|
493 |
+
"config": {
|
494 |
+
"model": "hf",
|
495 |
+
"model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
|
496 |
+
"model_num_parameters": 32512545792,
|
497 |
+
"model_dtype": "torch.float16",
|
498 |
+
"model_revision": "main",
|
499 |
+
"model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
|
500 |
+
"batch_size": 1,
|
501 |
+
"batch_sizes": [],
|
502 |
+
"device": null,
|
503 |
+
"use_cache": null,
|
504 |
+
"limit": null,
|
505 |
+
"bootstrap_iters": 100000,
|
506 |
+
"gen_kwargs": null,
|
507 |
+
"random_seed": 0,
|
508 |
+
"numpy_seed": 1234,
|
509 |
+
"torch_seed": 1234,
|
510 |
+
"fewshot_seed": 1234
|
511 |
+
},
|
512 |
+
"git_hash": "ef4b2026",
|
513 |
+
"date": 1733932681.9722512,
|
514 |
+
"pretty_env_info": "PyTorch version: 2.1.0a0+29c30b1\nIs debug build: False\nCUDA used to build PyTorch: 12.2\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core 
invpcid_single vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.22.2\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.1.0a0+29c30b1\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.16.0a0\n[pip3] triton==2.0.0.dev20221202\n[conda] Could not collect",
|
515 |
+
"transformers_version": "4.47.0",
|
516 |
+
"upper_git_hash": "27ba526c4b16ee30604687f8bfd4c19680101dd1",
|
517 |
+
"tokenizer_pad_token": [
|
518 |
+
"<|endoftext|>",
|
519 |
+
"151643"
|
520 |
+
],
|
521 |
+
"tokenizer_eos_token": [
|
522 |
+
"<|endoftext|>",
|
523 |
+
"151643"
|
524 |
+
],
|
525 |
+
"tokenizer_bos_token": [
|
526 |
+
null,
|
527 |
+
"None"
|
528 |
+
],
|
529 |
+
"eot_token_id": 151643,
|
530 |
+
"max_length": 32768,
|
531 |
+
"task_hashes": {},
|
532 |
+
"model_source": "hf",
|
533 |
+
"model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
|
534 |
+
"model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
|
535 |
+
"system_instruction": null,
|
536 |
+
"system_instruction_sha": null,
|
537 |
+
"fewshot_as_multiturn": false,
|
538 |
+
"chat_template": null,
|
539 |
+
"chat_template_sha": null,
|
540 |
+
"start_time": 2367.995520754,
|
541 |
+
"end_time": 5482.980996963,
|
542 |
+
"total_evaluation_time_seconds": "3114.9854762089994"
|
543 |
+
}
|
evaluations/ar/AceGPT-v2-32B-Chat/moe_ien_mcq_0_shot.json
ADDED
@@ -0,0 +1,127 @@
|
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"moe_ien_mcq": {
|
4 |
+
"alias": "moe_ien_mcq",
|
5 |
+
"acc,none": 0.816016016016016,
|
6 |
+
"acc_stderr,none": 0.0038768441643790346,
|
7 |
+
"acc_norm,none": 0.816016016016016,
|
8 |
+
"acc_norm_stderr,none": 0.0038768441643790346
|
9 |
+
}
|
10 |
+
},
|
11 |
+
"group_subtasks": {
|
12 |
+
"moe_ien_mcq": []
|
13 |
+
},
|
14 |
+
"configs": {
|
15 |
+
"moe_ien_mcq": {
|
16 |
+
"task": "moe_ien_mcq",
|
17 |
+
"dataset_path": "lm_eval/tasks/moe_ien_mcq/ien_moe_mcq.py",
|
18 |
+
"dataset_name": "moe_ien_mcq",
|
19 |
+
"dataset_kwargs": {
|
20 |
+
"trust_remote_code": true
|
21 |
+
},
|
22 |
+
"validation_split": "validation",
|
23 |
+
"test_split": "test",
|
24 |
+
"fewshot_split": "validation",
|
25 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.split(\". \", 1)[1] if \". \" in choice else choice\n\n def format_example(doc, keys):\n question = doc[\"Question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"Choices\"])]\n \n )\n prompt = f\"\\n\\n\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys = [\"A\", \"B\", \"C\", \"D\", \"E\", \"F\"][0:len(doc[\"Choices\"])]\n out_doc = {\n \"Query\": format_example(doc, keys), \n \"Choices\": keys,\n \"gold\": int(doc[\"Answer\"])-1, ## \n } \n return out_doc\n \n return dataset.map(_process_docs)\n",
|
26 |
+
"doc_to_text": "Query",
|
27 |
+
"doc_to_target": "gold",
|
28 |
+
"doc_to_choice": "{{Choices}}",
|
29 |
+
"description": "\u0641\u064a\u0645\u0627\u202f\u064a\u0644\u064a\u202f\u0623\u0633\u0626\u0644\u0629\u202f\u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631\u202f\u0645\u0646\u202f\u0645\u062a\u0639\u062f\u062f\u202f(\u0645\u0639\u202f\u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a)\u202f\u0641\u064a\u202f{{Subject}}",
|
30 |
+
"target_delimiter": " ",
|
31 |
+
"fewshot_delimiter": "\n\n",
|
32 |
+
"fewshot_config": {
|
33 |
+
"sampler": "balanced_cat"
|
34 |
+
},
|
35 |
+
"num_fewshot": 0,
|
36 |
+
"metric_list": [
|
37 |
+
{
|
38 |
+
"metric": "acc",
|
39 |
+
"aggregation": "mean",
|
40 |
+
"higher_is_better": true
|
41 |
+
},
|
42 |
+
{
|
43 |
+
"metric": "acc_norm",
|
44 |
+
"aggregation": "mean",
|
45 |
+
"higher_is_better": true
|
46 |
+
}
|
47 |
+
],
|
48 |
+
"output_type": "multiple_choice",
|
49 |
+
"repeats": 1,
|
50 |
+
"should_decontaminate": true,
|
51 |
+
"doc_to_decontamination_query": "Query",
|
52 |
+
"metadata": {
|
53 |
+
"version": 0.0
|
54 |
+
}
|
55 |
+
}
|
56 |
+
},
|
57 |
+
"versions": {
|
58 |
+
"moe_ien_mcq": 0.0
|
59 |
+
},
|
60 |
+
"n-shot": {
|
61 |
+
"moe_ien_mcq": 0
|
62 |
+
},
|
63 |
+
"higher_is_better": {
|
64 |
+
"moe_ien_mcq": {
|
65 |
+
"acc": true,
|
66 |
+
"acc_norm": true
|
67 |
+
}
|
68 |
+
},
|
69 |
+
"n-samples": {
|
70 |
+
"moe_ien_mcq": {
|
71 |
+
"original": 9990,
|
72 |
+
"effective": 9990
|
73 |
+
}
|
74 |
+
},
|
75 |
+
"config": {
|
76 |
+
"model": "hf",
|
77 |
+
"model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
|
78 |
+
"model_num_parameters": 32512545792,
|
79 |
+
"model_dtype": "torch.float16",
|
80 |
+
"model_revision": "main",
|
81 |
+
"model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
|
82 |
+
"batch_size": 1,
|
83 |
+
"batch_sizes": [],
|
84 |
+
"device": null,
|
85 |
+
"use_cache": null,
|
86 |
+
"limit": null,
|
87 |
+
"bootstrap_iters": 100000,
|
88 |
+
"gen_kwargs": null,
|
89 |
+
"random_seed": 0,
|
90 |
+
"numpy_seed": 1234,
|
91 |
+
"torch_seed": 1234,
|
92 |
+
"fewshot_seed": 1234
|
93 |
+
},
|
94 |
+
"git_hash": "788a3672",
|
95 |
+
"date": 1738807582.4110897,
|
96 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma 
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
97 |
+
"transformers_version": "4.48.2",
|
98 |
+
"upper_git_hash": null,
|
99 |
+
"tokenizer_pad_token": [
|
100 |
+
"<|endoftext|>",
|
101 |
+
"151643"
|
102 |
+
],
|
103 |
+
"tokenizer_eos_token": [
|
104 |
+
"<|endoftext|>",
|
105 |
+
"151643"
|
106 |
+
],
|
107 |
+
"tokenizer_bos_token": [
|
108 |
+
null,
|
109 |
+
"None"
|
110 |
+
],
|
111 |
+
"eot_token_id": 151643,
|
112 |
+
"max_length": 32768,
|
113 |
+
"task_hashes": {
|
114 |
+
"moe_ien_mcq": "e5422ff2f277b9bfffeb1b5ad185b714804b5a3d276dfff99a29eb88d9a41683"
|
115 |
+
},
|
116 |
+
"model_source": "hf",
|
117 |
+
"model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
|
118 |
+
"model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
|
119 |
+
"system_instruction": null,
|
120 |
+
"system_instruction_sha": null,
|
121 |
+
"fewshot_as_multiturn": false,
|
122 |
+
"chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
|
123 |
+
"chat_template_sha": "af9c0233881b083b52ff773580215222b5440ac3d0beeeca99b76329b048f8db",
|
124 |
+
"start_time": 1766558.431540363,
|
125 |
+
"end_time": 1766704.504224634,
|
126 |
+
"total_evaluation_time_seconds": "146.07268427102827"
|
127 |
+
}
|
evaluations/ar/AceGPT-v2-32B-Chat/moe_ien_tf_0_shot.json
ADDED
@@ -0,0 +1,129 @@
|
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"moe_ien_tf": {
|
4 |
+
"alias": "moe_ien_tf",
|
5 |
+
"acc,none": 0.8035376953460416,
|
6 |
+
"acc_stderr,none": 0.005207228603848848,
|
7 |
+
"acc_norm,none": 0.8035376953460416,
|
8 |
+
"acc_norm_stderr,none": 0.005207228603848848
|
9 |
+
}
|
10 |
+
},
|
11 |
+
"group_subtasks": {
|
12 |
+
"moe_ien_tf": []
|
13 |
+
},
|
14 |
+
"configs": {
|
15 |
+
"moe_ien_tf": {
|
16 |
+
"task": "moe_ien_tf",
|
17 |
+
"tag": [
|
18 |
+
"multiple_choice"
|
19 |
+
],
|
20 |
+
"dataset_path": "lm_eval/tasks/moe_ien_tf/moe_ien_tf.py",
|
21 |
+
"dataset_name": "moe_ien_tf",
|
22 |
+
"dataset_kwargs": {
|
23 |
+
"trust_remote_code": true
|
24 |
+
},
|
25 |
+
"validation_split": "validation",
|
26 |
+
"test_split": "test",
|
27 |
+
"fewshot_split": "validation",
|
28 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n keys=[\"\u0635\u062d\u064a\u062d\u0629\",\n \"\u062e\u0627\u0637\u0626\u0629\"\n ]\n #keys =[\"\u0635\u0648\u0627\u0628\",\n # \"\u062e\u0637\u0623\"]\n target_key = int(doc[\"Answer\"])-1\n\n out_doc = {\n \"query\": \"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" +doc[\"Question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\", \n \"choices\": keys,\n \"gold\": target_key,\n }\n return out_doc\n return dataset.map(_process_docs)\n",
|
29 |
+
"doc_to_text": "query",
|
30 |
+
"doc_to_target": "gold",
|
31 |
+
"doc_to_choice": "choices",
|
32 |
+
"description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{Subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d\u064a\u062d\u0629' \u0623\u0648 '\u062e\u0627\u0637\u0626\u0629' \u062f\u0648\u0646 \u0634\u0631\u062d ",
|
33 |
+
"target_delimiter": " ",
|
34 |
+
"fewshot_delimiter": "\n\n",
|
35 |
+
"fewshot_config": {
|
36 |
+
"sampler": "balanced_cat"
|
37 |
+
},
|
38 |
+
"num_fewshot": 0,
|
39 |
+
"metric_list": [
|
40 |
+
{
|
41 |
+
"metric": "acc",
|
42 |
+
"aggregation": "mean",
|
43 |
+
"higher_is_better": true
|
44 |
+
},
|
45 |
+
{
|
46 |
+
"metric": "acc_norm",
|
47 |
+
"aggregation": "mean",
|
48 |
+
"higher_is_better": true
|
49 |
+
}
|
50 |
+
],
|
51 |
+
"output_type": "multiple_choice",
|
52 |
+
"repeats": 1,
|
53 |
+
"should_decontaminate": false,
|
54 |
+
"metadata": {
|
55 |
+
"version": 2.0
|
56 |
+
}
|
57 |
+
}
|
58 |
+
},
|
59 |
+
"versions": {
|
60 |
+
"moe_ien_tf": 2.0
|
61 |
+
},
|
62 |
+
"n-shot": {
|
63 |
+
"moe_ien_tf": 0
|
64 |
+
},
|
65 |
+
"higher_is_better": {
|
66 |
+
"moe_ien_tf": {
|
67 |
+
"acc": true,
|
68 |
+
"acc_norm": true
|
69 |
+
}
|
70 |
+
},
|
71 |
+
"n-samples": {
|
72 |
+
"moe_ien_tf": {
|
73 |
+
"original": 5823,
|
74 |
+
"effective": 5823
|
75 |
+
}
|
76 |
+
},
|
77 |
+
"config": {
|
78 |
+
"model": "hf",
|
79 |
+
"model_args": "pretrained=FreedomIntelligence/AceGPT-v2-32B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
|
80 |
+
"model_num_parameters": 32512545792,
|
81 |
+
"model_dtype": "torch.float16",
|
82 |
+
"model_revision": "main",
|
83 |
+
"model_sha": "1c0ca4fb3fa4c292ac3d1f64f330f210c9f184d4",
|
84 |
+
"batch_size": 1,
|
85 |
+
"batch_sizes": [],
|
86 |
+
"device": null,
|
87 |
+
"use_cache": null,
|
88 |
+
"limit": null,
|
89 |
+
"bootstrap_iters": 100000,
|
90 |
+
"gen_kwargs": null,
|
91 |
+
"random_seed": 0,
|
92 |
+
"numpy_seed": 1234,
|
93 |
+
"torch_seed": 1234,
|
94 |
+
"fewshot_seed": 1234
|
95 |
+
},
|
96 |
+
"git_hash": "788a3672",
|
97 |
+
"date": 1738809377.2163908,
|
98 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma 
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
99 |
+
"transformers_version": "4.48.2",
|
100 |
+
"upper_git_hash": null,
|
101 |
+
"tokenizer_pad_token": [
|
102 |
+
"<|endoftext|>",
|
103 |
+
"151643"
|
104 |
+
],
|
105 |
+
"tokenizer_eos_token": [
|
106 |
+
"<|endoftext|>",
|
107 |
+
"151643"
|
108 |
+
],
|
109 |
+
"tokenizer_bos_token": [
|
110 |
+
null,
|
111 |
+
"None"
|
112 |
+
],
|
113 |
+
"eot_token_id": 151643,
|
114 |
+
"max_length": 32768,
|
115 |
+
"task_hashes": {
|
116 |
+
"moe_ien_tf": "116cb28cd11c72b01c3d52d75d3918c312d0a4f569bfdb8b2219398ec576a3f4"
|
117 |
+
},
|
118 |
+
"model_source": "hf",
|
119 |
+
"model_name": "FreedomIntelligence/AceGPT-v2-32B-Chat",
|
120 |
+
"model_name_sanitized": "FreedomIntelligence__AceGPT-v2-32B-Chat",
|
121 |
+
"system_instruction": null,
|
122 |
+
"system_instruction_sha": null,
|
123 |
+
"fewshot_as_multiturn": false,
|
124 |
+
"chat_template": "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
|
125 |
+
"chat_template_sha": "af9c0233881b083b52ff773580215222b5440ac3d0beeeca99b76329b048f8db",
|
126 |
+
"start_time": 1768353.06839988,
|
127 |
+
"end_time": 1768502.097875321,
|
128 |
+
"total_evaluation_time_seconds": "149.0294754409697"
|
129 |
+
}
|
evaluations/ar/AceGPT-v2-32B-Chat/openaimmlu_0_shot.json
ADDED
The diff for this file is too large to render.
See raw diff
|
evaluations/ar/AceGPT-v2-8B-Chat/acva_5_shot.json
ADDED
@@ -0,0 +1,123 @@
|
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"acva": {
|
4 |
+
"alias": "acva",
|
5 |
+
"acc,none": 0.7415614236509759,
|
6 |
+
"acc_stderr,none": 0.004691028694524559,
|
7 |
+
"acc_norm,none": 0.7268656716417911,
|
8 |
+
"acc_norm_stderr,none": 0.004774534958083965
|
9 |
+
}
|
10 |
+
},
|
11 |
+
"group_subtasks": {
|
12 |
+
"acva": []
|
13 |
+
},
|
14 |
+
"configs": {
|
15 |
+
"acva": {
|
16 |
+
"task": "acva",
|
17 |
+
"tag": [
|
18 |
+
"multiple_choice"
|
19 |
+
],
|
20 |
+
"dataset_path": "FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment",
|
21 |
+
"dataset_kwargs": {
|
22 |
+
"trust_remote_code": true
|
23 |
+
},
|
24 |
+
"test_split": "test",
|
25 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _format_subject(subject):\n \n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n \n def _generate_subject(doc):\n subject = _format_subject(doc[\"id\"].split(\"-\")[0])\n\n return subject\n \n def _process_docs(doc):\n keys = [\"\u0635\u062d\",\n \"\u062e\u0637\u0623\"]\n subject = _generate_subject(doc)\n gold = keys.index(doc['answer'])\n out_doc = {\n \"id\": doc[\"id\"],\n \"query\": \"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" + doc[\"question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\",\n \"choices\": keys,\n \"gold\": gold,\n \"subject\": subject,\n }\n \n return out_doc\n\n return dataset.map(_process_docs)\n",
|
26 |
+
"doc_to_text": "query",
|
27 |
+
"doc_to_target": "gold",
|
28 |
+
"doc_to_choice": "choices",
|
29 |
+
"description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d' \u0623\u0648 '\u062e\u0637\u0623' \u062f\u0648\u0646 \u0634\u0631\u062d",
|
30 |
+
"target_delimiter": " ",
|
31 |
+
"fewshot_delimiter": "\n\n",
|
32 |
+
"num_fewshot": 5,
|
33 |
+
"metric_list": [
|
34 |
+
{
|
35 |
+
"metric": "acc",
|
36 |
+
"aggregation": "mean",
|
37 |
+
"higher_is_better": true
|
38 |
+
},
|
39 |
+
{
|
40 |
+
"metric": "acc_norm",
|
41 |
+
"aggregation": "mean",
|
42 |
+
"higher_is_better": true
|
43 |
+
}
|
44 |
+
],
|
45 |
+
"output_type": "multiple_choice",
|
46 |
+
"repeats": 1,
|
47 |
+
"should_decontaminate": false,
|
48 |
+
"metadata": {
|
49 |
+
"version": 0.0
|
50 |
+
}
|
51 |
+
}
|
52 |
+
},
|
53 |
+
"versions": {
|
54 |
+
"acva": 0.0
|
55 |
+
},
|
56 |
+
"n-shot": {
|
57 |
+
"acva": 5
|
58 |
+
},
|
59 |
+
"higher_is_better": {
|
60 |
+
"acva": {
|
61 |
+
"acc": true,
|
62 |
+
"acc_norm": true
|
63 |
+
}
|
64 |
+
},
|
65 |
+
"n-samples": {
|
66 |
+
"acva": {
|
67 |
+
"original": 8710,
|
68 |
+
"effective": 8710
|
69 |
+
}
|
70 |
+
},
|
71 |
+
"config": {
|
72 |
+
"model": "hf",
|
73 |
+
"model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
|
74 |
+
"model_num_parameters": 8030261248,
|
75 |
+
"model_dtype": "torch.float16",
|
76 |
+
"model_revision": "main",
|
77 |
+
"model_sha": "562d0998c03c02d315e346f81650a43955711901",
|
78 |
+
"batch_size": "auto",
|
79 |
+
"batch_sizes": [
|
80 |
+
64
|
81 |
+
],
|
82 |
+
"device": null,
|
83 |
+
"use_cache": null,
|
84 |
+
"limit": null,
|
85 |
+
"bootstrap_iters": 100000,
|
86 |
+
"gen_kwargs": null,
|
87 |
+
"random_seed": 0,
|
88 |
+
"numpy_seed": 1234,
|
89 |
+
"torch_seed": 1234,
|
90 |
+
"fewshot_seed": 1234
|
91 |
+
},
|
92 |
+
"git_hash": "5e10e017",
|
93 |
+
"date": 1736966813.484974,
|
94 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single 
vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
95 |
+
"transformers_version": "4.48.0",
|
96 |
+
"upper_git_hash": "2e5cd5395faf76fea1afc96dd0f7161a9d3aa145",
|
97 |
+
"tokenizer_pad_token": [
|
98 |
+
"<|end_of_text|>",
|
99 |
+
"128001"
|
100 |
+
],
|
101 |
+
"tokenizer_eos_token": [
|
102 |
+
"<|end_of_text|>",
|
103 |
+
"128001"
|
104 |
+
],
|
105 |
+
"tokenizer_bos_token": [
|
106 |
+
"<|begin_of_text|>",
|
107 |
+
"128000"
|
108 |
+
],
|
109 |
+
"eot_token_id": 128001,
|
110 |
+
"max_length": 8192,
|
111 |
+
"task_hashes": {},
|
112 |
+
"model_source": "hf",
|
113 |
+
"model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
|
114 |
+
"model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
|
115 |
+
"system_instruction": null,
|
116 |
+
"system_instruction_sha": null,
|
117 |
+
"fewshot_as_multiturn": false,
|
118 |
+
"chat_template": null,
|
119 |
+
"chat_template_sha": null,
|
120 |
+
"start_time": 2430.929540314,
|
121 |
+
"end_time": 3025.204908665,
|
122 |
+
"total_evaluation_time_seconds": "594.275368351"
|
123 |
+
}
|
evaluations/ar/AceGPT-v2-8B-Chat/ar_ifeval_0_shot.json
ADDED
@@ -0,0 +1,142 @@
|
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"ar_ifeval": {
|
4 |
+
"alias": "ar_ifeval",
|
5 |
+
"prompt_level_strict_acc,none": 0.10261194029850747,
|
6 |
+
"prompt_level_strict_acc_stderr,none": 0.01311934649092474,
|
7 |
+
"inst_level_strict_acc,none": 0.3924914675767918,
|
8 |
+
"inst_level_strict_acc_stderr,none": "N/A",
|
9 |
+
"prompt_level_loose_acc,none": 0.12126865671641791,
|
10 |
+
"prompt_level_loose_acc_stderr,none": 0.01411319854290401,
|
11 |
+
"inst_level_loose_acc,none": 0.42389078498293514,
|
12 |
+
"inst_level_loose_acc_stderr,none": "N/A"
|
13 |
+
}
|
14 |
+
},
|
15 |
+
"group_subtasks": {
|
16 |
+
"ar_ifeval": []
|
17 |
+
},
|
18 |
+
"configs": {
|
19 |
+
"ar_ifeval": {
|
20 |
+
"task": "ar_ifeval",
|
21 |
+
"dataset_path": "lm_eval/tasks/ar_ifeval/ar_ifeval.py",
|
22 |
+
"dataset_name": "ar_ifeval",
|
23 |
+
"dataset_kwargs": {
|
24 |
+
"trust_remote_code": true
|
25 |
+
},
|
26 |
+
"test_split": "test",
|
27 |
+
"doc_to_text": "prompt",
|
28 |
+
"doc_to_target": 0,
|
29 |
+
"process_results": "def process_results(doc, results):\n\n response = results[0]\n out_strict = process_sample(doc, response, 'strict')\n out_loose = process_sample(doc, response, 'loose')\n\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
|
30 |
+
"description": "",
|
31 |
+
"target_delimiter": " ",
|
32 |
+
"fewshot_delimiter": "\n\n",
|
33 |
+
"num_fewshot": 0,
|
34 |
+
"metric_list": [
|
35 |
+
{
|
36 |
+
"metric": "prompt_level_strict_acc",
|
37 |
+
"aggregation": "mean",
|
38 |
+
"higher_is_better": true
|
39 |
+
},
|
40 |
+
{
|
41 |
+
"metric": "inst_level_strict_acc",
|
42 |
+
"aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
|
43 |
+
"higher_is_better": true
|
44 |
+
},
|
45 |
+
{
|
46 |
+
"metric": "prompt_level_loose_acc",
|
47 |
+
"aggregation": "mean",
|
48 |
+
"higher_is_better": true
|
49 |
+
},
|
50 |
+
{
|
51 |
+
"metric": "inst_level_loose_acc",
|
52 |
+
"aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
|
53 |
+
"higher_is_better": true
|
54 |
+
}
|
55 |
+
],
|
56 |
+
"output_type": "generate_until",
|
57 |
+
"generation_kwargs": {
|
58 |
+
"until": [],
|
59 |
+
"do_sample": false,
|
60 |
+
"temperature": 0.0,
|
61 |
+
"max_gen_toks": 1280
|
62 |
+
},
|
63 |
+
"repeats": 1,
|
64 |
+
"should_decontaminate": false,
|
65 |
+
"metadata": {
|
66 |
+
"version": 4.0
|
67 |
+
}
|
68 |
+
}
|
69 |
+
},
|
70 |
+
"versions": {
|
71 |
+
"ar_ifeval": 4.0
|
72 |
+
},
|
73 |
+
"n-shot": {
|
74 |
+
"ar_ifeval": 0
|
75 |
+
},
|
76 |
+
"higher_is_better": {
|
77 |
+
"ar_ifeval": {
|
78 |
+
"prompt_level_strict_acc": true,
|
79 |
+
"inst_level_strict_acc": true,
|
80 |
+
"prompt_level_loose_acc": true,
|
81 |
+
"inst_level_loose_acc": true
|
82 |
+
}
|
83 |
+
},
|
84 |
+
"n-samples": {
|
85 |
+
"ar_ifeval": {
|
86 |
+
"original": 536,
|
87 |
+
"effective": 536
|
88 |
+
}
|
89 |
+
},
|
90 |
+
"config": {
|
91 |
+
"model": "hf",
|
92 |
+
"model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
|
93 |
+
"model_num_parameters": 8030261248,
|
94 |
+
"model_dtype": "torch.float16",
|
95 |
+
"model_revision": "main",
|
96 |
+
"model_sha": "562d0998c03c02d315e346f81650a43955711901",
|
97 |
+
"batch_size": 1,
|
98 |
+
"batch_sizes": [],
|
99 |
+
"device": null,
|
100 |
+
"use_cache": null,
|
101 |
+
"limit": null,
|
102 |
+
"bootstrap_iters": 100000,
|
103 |
+
"gen_kwargs": null,
|
104 |
+
"random_seed": 0,
|
105 |
+
"numpy_seed": 1234,
|
106 |
+
"torch_seed": 1234,
|
107 |
+
"fewshot_seed": 1234
|
108 |
+
},
|
109 |
+
"git_hash": "b955b2950",
|
110 |
+
"date": 1739784109.8369951,
|
111 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma 
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
112 |
+
"transformers_version": "4.48.3",
|
113 |
+
"upper_git_hash": null,
|
114 |
+
"tokenizer_pad_token": [
|
115 |
+
"<|end_of_text|>",
|
116 |
+
"128001"
|
117 |
+
],
|
118 |
+
"tokenizer_eos_token": [
|
119 |
+
"<|end_of_text|>",
|
120 |
+
"128001"
|
121 |
+
],
|
122 |
+
"tokenizer_bos_token": [
|
123 |
+
"<|begin_of_text|>",
|
124 |
+
"128000"
|
125 |
+
],
|
126 |
+
"eot_token_id": 128001,
|
127 |
+
"max_length": 8192,
|
128 |
+
"task_hashes": {
|
129 |
+
"ar_ifeval": "9ce88f26b4b78e684512ecd933af67fe512192f41e27d2bedc62f288943db360"
|
130 |
+
},
|
131 |
+
"model_source": "hf",
|
132 |
+
"model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
|
133 |
+
"model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
|
134 |
+
"system_instruction": null,
|
135 |
+
"system_instruction_sha": null,
|
136 |
+
"fewshot_as_multiturn": false,
|
137 |
+
"chat_template": null,
|
138 |
+
"chat_template_sha": null,
|
139 |
+
"start_time": 62023.729831301,
|
140 |
+
"end_time": 66967.714743853,
|
141 |
+
"total_evaluation_time_seconds": "4943.98491255199"
|
142 |
+
}
|
evaluations/ar/AceGPT-v2-8B-Chat/araMath_v3_5_shot.json
ADDED
@@ -0,0 +1,126 @@
|
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"araMath_v3": {
|
4 |
+
"alias": "araMath_v3",
|
5 |
+
"acc,none": 0.41487603305785126,
|
6 |
+
"acc_stderr,none": 0.02004770429343817,
|
7 |
+
"acc_norm,none": 0.41487603305785126,
|
8 |
+
"acc_norm_stderr,none": 0.02004770429343817
|
9 |
+
}
|
10 |
+
},
|
11 |
+
"group_subtasks": {
|
12 |
+
"araMath_v3": []
|
13 |
+
},
|
14 |
+
"configs": {
|
15 |
+
"araMath_v3": {
|
16 |
+
"task": "araMath_v3",
|
17 |
+
"tag": [
|
18 |
+
"multiple_choice"
|
19 |
+
],
|
20 |
+
"dataset_path": "lm_eval/tasks/araMath_v3/araMath_v3.py",
|
21 |
+
"dataset_name": "araMath_v3",
|
22 |
+
"dataset_kwargs": {
|
23 |
+
"trust_remote_code": true
|
24 |
+
},
|
25 |
+
"validation_split": "validation",
|
26 |
+
"test_split": "test",
|
27 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def remove_prefix(choice):\n prefixes = [\"(A)\", \"(B)\", \"(C)\", \"(D)\"]\n for prefix in prefixes:\n if choice.startswith(prefix + \" \"):\n return choice[len(prefix) + 1:] \n return choice \n\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"options\"])]\n )\n\n prompt = f\"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": keys_en.index(doc[\"label\"]),\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
|
28 |
+
"doc_to_text": "query",
|
29 |
+
"doc_to_target": "gold",
|
30 |
+
"doc_to_choice": "{{choices}}",
|
31 |
+
"description": "\u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u0645\u0646 \u0628\u064a\u0646 'A\u060c B\u060c C\u060c D' \u062f\u0648\u0646 \u0634\u0631\u062d",
|
32 |
+
"target_delimiter": " ",
|
33 |
+
"fewshot_delimiter": "\n\n",
|
34 |
+
"num_fewshot": 5,
|
35 |
+
"metric_list": [
|
36 |
+
{
|
37 |
+
"metric": "acc",
|
38 |
+
"aggregation": "mean",
|
39 |
+
"higher_is_better": true
|
40 |
+
},
|
41 |
+
{
|
42 |
+
"metric": "acc_norm",
|
43 |
+
"aggregation": "mean",
|
44 |
+
"higher_is_better": true
|
45 |
+
}
|
46 |
+
],
|
47 |
+
"output_type": "multiple_choice",
|
48 |
+
"repeats": 1,
|
49 |
+
"should_decontaminate": true,
|
50 |
+
"doc_to_decontamination_query": "query",
|
51 |
+
"metadata": {
|
52 |
+
"version": 0.0
|
53 |
+
}
|
54 |
+
}
|
55 |
+
},
|
56 |
+
"versions": {
|
57 |
+
"araMath_v3": 0.0
|
58 |
+
},
|
59 |
+
"n-shot": {
|
60 |
+
"araMath_v3": 5
|
61 |
+
},
|
62 |
+
"higher_is_better": {
|
63 |
+
"araMath_v3": {
|
64 |
+
"acc": true,
|
65 |
+
"acc_norm": true
|
66 |
+
}
|
67 |
+
},
|
68 |
+
"n-samples": {
|
69 |
+
"araMath_v3": {
|
70 |
+
"original": 605,
|
71 |
+
"effective": 605
|
72 |
+
}
|
73 |
+
},
|
74 |
+
"config": {
|
75 |
+
"model": "hf",
|
76 |
+
"model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
|
77 |
+
"model_num_parameters": 8030261248,
|
78 |
+
"model_dtype": "torch.float16",
|
79 |
+
"model_revision": "main",
|
80 |
+
"model_sha": "562d0998c03c02d315e346f81650a43955711901",
|
81 |
+
"batch_size": 1,
|
82 |
+
"batch_sizes": [],
|
83 |
+
"device": null,
|
84 |
+
"use_cache": null,
|
85 |
+
"limit": null,
|
86 |
+
"bootstrap_iters": 100000,
|
87 |
+
"gen_kwargs": null,
|
88 |
+
"random_seed": 0,
|
89 |
+
"numpy_seed": 1234,
|
90 |
+
"torch_seed": 1234,
|
91 |
+
"fewshot_seed": 1234
|
92 |
+
},
|
93 |
+
"git_hash": "b955b2950",
|
94 |
+
"date": 1739784015.8084505,
|
95 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma 
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
96 |
+
"transformers_version": "4.48.3",
|
97 |
+
"upper_git_hash": null,
|
98 |
+
"tokenizer_pad_token": [
|
99 |
+
"<|end_of_text|>",
|
100 |
+
"128001"
|
101 |
+
],
|
102 |
+
"tokenizer_eos_token": [
|
103 |
+
"<|end_of_text|>",
|
104 |
+
"128001"
|
105 |
+
],
|
106 |
+
"tokenizer_bos_token": [
|
107 |
+
"<|begin_of_text|>",
|
108 |
+
"128000"
|
109 |
+
],
|
110 |
+
"eot_token_id": 128001,
|
111 |
+
"max_length": 8192,
|
112 |
+
"task_hashes": {
|
113 |
+
"araMath_v3": "4eebd1da6e6937fc09bb9f1871adb53192dbce96733f0f8ee76d406c2fc8cad5"
|
114 |
+
},
|
115 |
+
"model_source": "hf",
|
116 |
+
"model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
|
117 |
+
"model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
|
118 |
+
"system_instruction": null,
|
119 |
+
"system_instruction_sha": null,
|
120 |
+
"fewshot_as_multiturn": false,
|
121 |
+
"chat_template": null,
|
122 |
+
"chat_template_sha": null,
|
123 |
+
"start_time": 61929.69246185,
|
124 |
+
"end_time": 61980.464828513,
|
125 |
+
"total_evaluation_time_seconds": "50.772366663004505"
|
126 |
+
}
|
evaluations/ar/AceGPT-v2-8B-Chat/araPro_0_shot.json
ADDED
@@ -0,0 +1,130 @@
|
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"araPro": {
|
4 |
+
"alias": "araPro",
|
5 |
+
"acc,none": 0.6350729854029195,
|
6 |
+
"acc_stderr,none": 0.006808161111700288,
|
7 |
+
"acc_norm,none": 0.6350729854029195,
|
8 |
+
"acc_norm_stderr,none": 0.006808161111700288
|
9 |
+
}
|
10 |
+
},
|
11 |
+
"group_subtasks": {
|
12 |
+
"araPro": []
|
13 |
+
},
|
14 |
+
"configs": {
|
15 |
+
"araPro": {
|
16 |
+
"task": "araPro",
|
17 |
+
"tag": [
|
18 |
+
"multiple_choice"
|
19 |
+
],
|
20 |
+
"dataset_path": "lm_eval/tasks/araPro/araPro.py",
|
21 |
+
"dataset_name": "araPro",
|
22 |
+
"dataset_kwargs": {
|
23 |
+
"trust_remote_code": true
|
24 |
+
},
|
25 |
+
"validation_split": "validation",
|
26 |
+
"test_split": "test",
|
27 |
+
"fewshot_split": "validation",
|
28 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.replace('.', '') if '.' in choice[:2] else choice\n \n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choice_num = ['choice1', 'choice2', 'choice3', 'choice4']\n choices = \"\".join(\n [f\"{key}. {remove_prefix(doc[choice_num[index]])}\\n\" for index, key in enumerate(keys)]\n )\n\n prompt = f\"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n #keys = [\"1\", \"2\", \"3\", \"4\"]\n keys = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys), \n \"choices\": keys,\n \"gold\": doc[\"answer\"]-1,\n } \n\n return out_doc\n \n return dataset.map(_process_docs)\n",
|
29 |
+
"doc_to_text": "query",
|
30 |
+
"doc_to_target": "gold",
|
31 |
+
"doc_to_choice": "{{choices}}",
|
32 |
+
"description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d",
|
33 |
+
"target_delimiter": " ",
|
34 |
+
"fewshot_delimiter": "\n\n",
|
35 |
+
"fewshot_config": {
|
36 |
+
"sampler": "balanced_cat"
|
37 |
+
},
|
38 |
+
"num_fewshot": 0,
|
39 |
+
"metric_list": [
|
40 |
+
{
|
41 |
+
"metric": "acc",
|
42 |
+
"aggregation": "mean",
|
43 |
+
"higher_is_better": true
|
44 |
+
},
|
45 |
+
{
|
46 |
+
"metric": "acc_norm",
|
47 |
+
"aggregation": "mean",
|
48 |
+
"higher_is_better": true
|
49 |
+
}
|
50 |
+
],
|
51 |
+
"output_type": "multiple_choice",
|
52 |
+
"repeats": 1,
|
53 |
+
"should_decontaminate": true,
|
54 |
+
"doc_to_decontamination_query": "Question",
|
55 |
+
"metadata": {
|
56 |
+
"version": 2.0
|
57 |
+
}
|
58 |
+
}
|
59 |
+
},
|
60 |
+
"versions": {
|
61 |
+
"araPro": 2.0
|
62 |
+
},
|
63 |
+
"n-shot": {
|
64 |
+
"araPro": 0
|
65 |
+
},
|
66 |
+
"higher_is_better": {
|
67 |
+
"araPro": {
|
68 |
+
"acc": true,
|
69 |
+
"acc_norm": true
|
70 |
+
}
|
71 |
+
},
|
72 |
+
"n-samples": {
|
73 |
+
"araPro": {
|
74 |
+
"original": 5001,
|
75 |
+
"effective": 5001
|
76 |
+
}
|
77 |
+
},
|
78 |
+
"config": {
|
79 |
+
"model": "hf",
|
80 |
+
"model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
|
81 |
+
"model_num_parameters": 8030261248,
|
82 |
+
"model_dtype": "torch.float16",
|
83 |
+
"model_revision": "main",
|
84 |
+
"model_sha": "562d0998c03c02d315e346f81650a43955711901",
|
85 |
+
"batch_size": 1,
|
86 |
+
"batch_sizes": [],
|
87 |
+
"device": null,
|
88 |
+
"use_cache": null,
|
89 |
+
"limit": null,
|
90 |
+
"bootstrap_iters": 100000,
|
91 |
+
"gen_kwargs": null,
|
92 |
+
"random_seed": 0,
|
93 |
+
"numpy_seed": 1234,
|
94 |
+
"torch_seed": 1234,
|
95 |
+
"fewshot_seed": 1234
|
96 |
+
},
|
97 |
+
"git_hash": "b955b2950",
|
98 |
+
"date": 1739782427.4652286,
|
99 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma 
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
  "transformers_version": "4.48.3",
  "upper_git_hash": null,
  "tokenizer_pad_token": [
    "<|end_of_text|>",
    "128001"
  ],
  "tokenizer_eos_token": [
    "<|end_of_text|>",
    "128001"
  ],
  "tokenizer_bos_token": [
    "<|begin_of_text|>",
    "128000"
  ],
  "eot_token_id": 128001,
  "max_length": 8192,
  "task_hashes": {
    "araPro": "655c2f6626c4b10533bba45ff63f9d4501694dea7f65d0bb251390819154f901"
  },
  "model_source": "hf",
  "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
  "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
  "system_instruction": null,
  "system_instruction_sha": null,
  "fewshot_as_multiturn": false,
  "chat_template": null,
  "chat_template_sha": null,
  "start_time": 60341.23142254,
  "end_time": 60939.383586887,
  "total_evaluation_time_seconds": "598.1521643470041"
}
evaluations/ar/AceGPT-v2-8B-Chat/arabicmmlu_0_shot.json
ADDED
The diff for this file is too large to render.
See raw diff
evaluations/ar/AceGPT-v2-8B-Chat/etec_v2_0_shot.json
ADDED
@@ -0,0 +1,126 @@
{
  "results": {
    "etec_v2": {
      "alias": "etec_v2",
      "acc,none": 0.5680975092739798,
      "acc_stderr,none": 0.011406002243769559,
      "acc_norm,none": 0.5680975092739798,
      "acc_norm_stderr,none": 0.011406002243769559
    }
  },
  "group_subtasks": {
    "etec_v2": []
  },
  "configs": {
    "etec_v2": {
      "task": "etec_v2",
      "tag": [
        "multiple_choice"
      ],
      "dataset_path": "lm_eval/tasks/etec_v2/etec.py",
      "dataset_name": "etec_v2",
      "dataset_kwargs": {
        "trust_remote_code": true
      },
      "validation_split": "validation",
      "test_split": "test",
      "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n print(doc[\"label\"])\n keys_ar = [\"\u0623\", \"\u0628\", \"\u062c\", \"\u062f\"]\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": int(doc[\"label\"])-1,\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
      "doc_to_text": "query",
      "doc_to_target": "gold",
      "doc_to_choice": "choices",
      "description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d\n ",
      "target_delimiter": " ",
      "fewshot_delimiter": "\n\n",
      "num_fewshot": 0,
      "metric_list": [
        {
          "metric": "acc",
          "aggregation": "mean",
          "higher_is_better": true
        },
        {
          "metric": "acc_norm",
          "aggregation": "mean",
          "higher_is_better": true
        }
      ],
      "output_type": "multiple_choice",
      "repeats": 1,
      "should_decontaminate": true,
      "doc_to_decontamination_query": "query",
      "metadata": {
        "version": 0.0
      }
    }
  },
  "versions": {
    "etec_v2": 0.0
  },
  "n-shot": {
    "etec_v2": 0
  },
  "higher_is_better": {
    "etec_v2": {
      "acc": true,
      "acc_norm": true
    }
  },
  "n-samples": {
    "etec_v2": {
      "original": 1887,
      "effective": 1887
    }
  },
  "config": {
    "model": "hf",
    "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
    "model_num_parameters": 8030261248,
    "model_dtype": "torch.float16",
    "model_revision": "main",
    "model_sha": "562d0998c03c02d315e346f81650a43955711901",
    "batch_size": 1,
    "batch_sizes": [],
    "device": null,
    "use_cache": null,
    "limit": null,
    "bootstrap_iters": 100000,
    "gen_kwargs": null,
    "random_seed": 0,
    "numpy_seed": 1234,
    "torch_seed": 1234,
    "fewshot_seed": 1234
  },
  "git_hash": "b955b2950",
  "date": 1739783073.791851,
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma 
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
  "transformers_version": "4.48.3",
  "upper_git_hash": null,
  "tokenizer_pad_token": [
    "<|end_of_text|>",
    "128001"
  ],
  "tokenizer_eos_token": [
    "<|end_of_text|>",
    "128001"
  ],
  "tokenizer_bos_token": [
    "<|begin_of_text|>",
    "128000"
  ],
  "eot_token_id": 128001,
  "max_length": 8192,
  "task_hashes": {
    "etec_v2": "d371135bd6f3e91b2eb292576c3b2fae24dc4c0d7cd2a5f6eacf1fe6bc062e76"
  },
  "model_source": "hf",
  "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
  "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
  "system_instruction": null,
  "system_instruction_sha": null,
  "fewshot_as_multiturn": false,
  "chat_template": null,
  "chat_template_sha": null,
  "start_time": 60987.772646854,
  "end_time": 61072.230445773,
  "total_evaluation_time_seconds": "84.4577989190002"
}
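The `acc_stderr,none` value reported for `etec_v2` appears consistent with the sample standard error of a mean of 0/1 outcomes over the 1887 effective samples. The following is a minimal sketch under that assumption; the helper name `binomial_stderr` is hypothetical and is not taken from the evaluation harness.

```python
import math


def binomial_stderr(acc: float, n: int) -> float:
    """Standard error of the mean of n 0/1 outcomes.

    Assumption: the sample standard deviation uses an (n - 1) denominator,
    which gives sqrt(acc * (1 - acc) / (n - 1)).
    """
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


# Values copied from the etec_v2 results: accuracy and effective sample count.
etec_acc = 0.5680975092739798
etec_n = 1887
etec_stderr = binomial_stderr(etec_acc, etec_n)

# Value reported in the JSON, for comparison.
reported_stderr = 0.011406002243769559
print(etec_stderr, reported_stderr)
```

Because `acc` and `acc_norm` are identical in this file, the same computation would apply to `acc_norm_stderr,none`.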
evaluations/ar/AceGPT-v2-8B-Chat/exams_ar_5_shot.json
ADDED
@@ -0,0 +1,119 @@
{
  "results": {
    "exams_ar": {
      "alias": "exams_ar",
      "acc,none": 0.5195530726256983,
      "acc_stderr,none": 0.02158019049784565,
      "acc_norm,none": 0.5195530726256983,
      "acc_norm_stderr,none": 0.02158019049784565
    }
  },
  "group_subtasks": {
    "exams_ar": []
  },
  "configs": {
    "exams_ar": {
      "task": "exams_ar",
      "tag": [
        "multiple_choice"
      ],
      "dataset_path": "lm_eval/tasks/exams_ar",
      "dataset_name": "exams_ar",
      "dataset_kwargs": {
        "trust_remote_code": true
      },
      "test_split": "test",
      "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n\n def _process_docs(doc):\n def format_example(doc, keys):\n \"\"\"\n <prompt>\n \u0633\u0624\u0627\u0644:\n A. <choice1>\n B. <choice2>\n C. <choice3>\n D. <choice4>\n \u0627\u062c\u0627\u0628\u0629:\n \"\"\"\n \n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n def _format_subject(subject):\n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n\n keys = [\"A\", \"B\", \"C\", \"D\"]\n \n subject = doc['id'].split(\"-\")[0]\n description = f\"\ufed2\ufef4\ufee3\ufe8d \ufef2\ufee0\ufef3 \ufe84\ufeb4\ufe8c\ufedf\ufe93 \ufe8d\ufefc\ufea8\ufe98\ufef3\ufe8d\ufead \ufee2\ufee7 \ufee2\ufe98\ufecb\ufea9\ufea9 (\ufee2\ufecb \ufe8d\ufefa\ufe9f\ufe8e\ufe91\ufe8e\ufe97) \ufea1\ufeee\ufedf {_format_subject(subject)} \\n\" #\ufee2\ufee7 \ufed2\ufec0\ufee0\ufedb \ufe8e\ufea8\ufe97\ufead \ufe88\ufe9f\ufe8e\ufe91\ufe93 \ufeed\ufe8e\ufea3\ufea9\ufe93 \ufee2\ufee7 \ufe90\ufef4\ufee7 'A\u060c B\u060c C\u060c D' \ufea9\ufeee\ufee7 \ufeb5\ufeae\ufea3\\n\"\n\n out_doc = {\n \"idx\": doc[\"idx\"],\n \"id\": doc[\"id\"],\n 'dsecription': description,\n \"query\": format_example(doc, keys), # \"Question: \" + doc[\"question\"]['stem'] + \"\\nAnswer:\",\n \"choices\": keys,\n \"gold\": [\"A\", \"B\", \"C\", \"D\"].index(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_docs)\n",
      "doc_to_text": "query",
      "doc_to_target": "gold",
      "doc_to_choice": "choices",
      "description": "description",
      "target_delimiter": " ",
      "fewshot_delimiter": "\n\n",
      "num_fewshot": 5,
      "metric_list": [
        {
          "metric": "acc",
          "aggregation": "mean",
          "higher_is_better": true
        },
        {
          "metric": "acc_norm",
          "aggregation": "mean",
          "higher_is_better": true
        }
      ],
      "output_type": "multiple_choice",
      "repeats": 1,
      "should_decontaminate": true,
      "doc_to_decontamination_query": "query",
      "metadata": {
        "version": 0.0
      }
    }
  },
  "versions": {
    "exams_ar": 0.0
  },
  "n-shot": {
    "exams_ar": 5
  },
  "higher_is_better": {
    "exams_ar": {
      "acc": true,
      "acc_norm": true
    }
  },
  "n-samples": {
    "exams_ar": {
      "original": 537,
      "effective": 537
    }
  },
  "config": {
    "model": "vllm",
    "model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.4,download_dir=/tmp",
    "batch_size": 1,
    "batch_sizes": [],
    "device": null,
    "use_cache": null,
    "limit": null,
    "bootstrap_iters": 100000,
    "gen_kwargs": null,
    "random_seed": 0,
    "numpy_seed": 1234,
    "torch_seed": 1234,
    "fewshot_seed": 1234
  },
  "git_hash": "8e1bd48d",
  "date": 1735747770.5687191,
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single 
vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
  "transformers_version": "4.47.1",
  "upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
  "tokenizer_pad_token": [
    "<|end_of_text|>",
    "128001"
  ],
  "tokenizer_eos_token": [
    "<|end_of_text|>",
    "128001"
  ],
  "tokenizer_bos_token": [
    "<|begin_of_text|>",
    "128000"
  ],
  "eot_token_id": 128001,
  "max_length": 8192,
  "task_hashes": {},
  "model_source": "vllm",
  "model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
  "model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
  "system_instruction": null,
  "system_instruction_sha": null,
  "fewshot_as_multiturn": false,
  "chat_template": null,
  "chat_template_sha": null,
  "start_time": 8055.848670643,
  "end_time": 8272.25518881,
  "total_evaluation_time_seconds": "216.40651816700029"
}
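The `process_docs` string recorded for `exams_ar` turns each record into a `query` prompt, a list of answer keys, and a zero-based `gold` index. The sketch below reproduces only that prompt-building step in plain Python so the resulting string is easier to see. It leaves out the `datasets.Dataset.map` call and the subject description, and the `sample_doc` record and the `process_doc` name are hypothetical, made up for illustration.

```python
KEYS = ["A", "B", "C", "D"]


def format_example(doc: dict, keys: list) -> str:
    """Build the prompt the same way the recorded format_example does."""
    question = doc["question"].strip()
    choices = "".join(f"{key}. {choice}\n" for key, choice in zip(keys, doc["choices"]))
    # "\u0627\u0644\u0633\u0624\u0627\u0644" is the question label and
    # "\u0627\u0644\u0627\u062c\u0627\u0628\u0629" is the answer label, as in the config.
    return f"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\n{choices} \n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:"


def process_doc(doc: dict) -> dict:
    """Simplified stand-in for _process_docs: query, choices and gold index only."""
    return {
        "query": format_example(doc, KEYS),
        "choices": KEYS,
        "gold": KEYS.index(doc["label"]),
    }


# Hypothetical record; real records come from the exams_ar test split.
sample_doc = {"question": " 2 + 2 = ? ", "choices": ["3", "4", "5", "6"], "label": "B"}
out = process_doc(sample_doc)
print(out["query"])
print(out["gold"])
```

With `output_type` set to `multiple_choice`, the harness scores each entry of `choices` as a continuation of `query` and compares the best-scoring one against `gold`.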
evaluations/ar/AceGPT-v2-8B-Chat/gat_0_shot.json
ADDED
@@ -0,0 +1,539 @@
{
  "results": {
    "gat": {
      "acc,none": 0.3615326727706008,
      "acc_stderr,none": 0.003748588350676633,
      "alias": "gat"
    },
    "gat_algebra": {
      "alias": " - gat_algebra",
      "acc,none": 0.30241187384044527,
      "acc_stderr,none": 0.008849121616191958
    },
    "gat_analogy": {
      "alias": " - gat_analogy",
      "acc,none": 0.3227686703096539,
      "acc_stderr,none": 0.008925286248200312
    },
    "gat_arithmetic": {
      "alias": " - gat_arithmetic",
      "acc,none": 0.3213102686786897,
      "acc_stderr,none": 0.008960516811645579
    },
    "gat_association": {
      "alias": " - gat_association",
      "acc,none": 0.39425837320574164,
      "acc_stderr,none": 0.01512460088966808
    },
    "gat_comparisons": {
      "alias": " - gat_comparisons",
      "acc,none": 0.28114754098360656,
      "acc_stderr,none": 0.012876124676937594
    },
    "gat_completion": {
      "alias": " - gat_completion",
      "acc,none": 0.46115702479338844,
      "acc_stderr,none": 0.014336474830596175
    },
    "gat_contextual": {
      "alias": " - gat_contextual",
      "acc,none": 0.2983128834355828,
      "acc_stderr,none": 0.012674637536976358
    },
    "gat_geometry": {
      "alias": " - gat_geometry",
      "acc,none": 0.3232876712328767,
      "acc_stderr,none": 0.024515791774351408
    },
    "gat_reading": {
      "alias": " - gat_reading",
      "acc,none": 0.5183364839319471,
      "acc_stderr,none": 0.009717331969425425
    }
  },
  "groups": {
    "gat": {
      "acc,none": 0.3615326727706008,
      "acc_stderr,none": 0.003748588350676633,
      "alias": "gat"
    }
  },
  "group_subtasks": {
    "gat": [
      "gat_analogy",
      "gat_association",
      "gat_completion",
      "gat_reading",
      "gat_algebra",
      "gat_arithmetic",
      "gat_comparisons",
      "gat_contextual",
      "gat_geometry"
    ]
  },
  "configs": {
    "gat_algebra": {
      "task": "gat_algebra",
      "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
      "dataset_name": "algebra",
      "dataset_kwargs": {
        "trust_remote_code": true
      },
      "test_split": "test",
      "fewshot_split": "validation",
      "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
      "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
      "doc_to_target": "{{label}}",
      "doc_to_choice": [
        "\u0623",
        "\u0628",
        "\u062c",
        "\u062f"
      ],
      "description": "",
      "target_delimiter": " ",
      "fewshot_delimiter": "\n\n",
      "num_fewshot": 0,
      "metric_list": [
        {
          "metric": "acc",
          "aggregation": "mean",
          "higher_is_better": true
        }
      ],
      "output_type": "multiple_choice",
      "repeats": 1,
      "should_decontaminate": false,
      "metadata": {
        "version": 0.0
      }
    },
    "gat_analogy": {
      "task": "gat_analogy",
      "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
      "dataset_name": "analogy",
      "dataset_kwargs": {
        "trust_remote_code": true
      },
      "test_split": "test",
      "fewshot_split": "validation",
      "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
      "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
      "doc_to_target": "{{label}}",
      "doc_to_choice": [
        "\u0623",
        "\u0628",
        "\u062c",
        "\u062f"
      ],
      "description": "",
      "target_delimiter": " ",
      "fewshot_delimiter": "\n\n",
      "num_fewshot": 0,
      "metric_list": [
        {
          "metric": "acc",
          "aggregation": "mean",
          "higher_is_better": true
        }
      ],
      "output_type": "multiple_choice",
      "repeats": 1,
      "should_decontaminate": false,
      "metadata": {
        "version": 0.0
      }
    },
    "gat_arithmetic": {
      "task": "gat_arithmetic",
      "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
      "dataset_name": "arithmetic",
      "dataset_kwargs": {
        "trust_remote_code": true
      },
      "test_split": "test",
      "fewshot_split": "validation",
      "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
      "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
      "doc_to_target": "{{label}}",
      "doc_to_choice": [
        "\u0623",
        "\u0628",
        "\u062c",
        "\u062f"
      ],
      "description": "",
      "target_delimiter": " ",
      "fewshot_delimiter": "\n\n",
      "num_fewshot": 0,
      "metric_list": [
        {
          "metric": "acc",
          "aggregation": "mean",
          "higher_is_better": true
        }
      ],
      "output_type": "multiple_choice",
      "repeats": 1,
      "should_decontaminate": false,
      "metadata": {
        "version": 0.0
      }
    },
    "gat_association": {
      "task": "gat_association",
      "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
      "dataset_name": "association",
      "dataset_kwargs": {
        "trust_remote_code": true
      },
      "test_split": "test",
      "fewshot_split": "validation",
      "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
      "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
      "doc_to_target": "{{label}}",
      "doc_to_choice": [
        "\u0623",
        "\u0628",
        "\u062c",
        "\u062f"
      ],
      "description": "",
      "target_delimiter": " ",
      "fewshot_delimiter": "\n\n",
      "num_fewshot": 0,
      "metric_list": [
        {
          "metric": "acc",
          "aggregation": "mean",
          "higher_is_better": true
        }
      ],
      "output_type": "multiple_choice",
      "repeats": 1,
      "should_decontaminate": false,
      "metadata": {
        "version": 0.0
      }
    },
    "gat_comparisons": {
      "task": "gat_comparisons",
      "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
      "dataset_name": "comparisons",
      "dataset_kwargs": {
        "trust_remote_code": true
      },
      "test_split": "test",
      "fewshot_split": "validation",
      "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
      "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
      "doc_to_target": "{{label}}",
      "doc_to_choice": [
        "\u0623",
        "\u0628",
        "\u062c",
        "\u062f"
      ],
      "description": "",
      "target_delimiter": " ",
      "fewshot_delimiter": "\n\n",
      "num_fewshot": 0,
      "metric_list": [
        {
          "metric": "acc",
          "aggregation": "mean",
          "higher_is_better": true
        }
      ],
      "output_type": "multiple_choice",
      "repeats": 1,
      "should_decontaminate": false,
      "metadata": {
        "version": 0.0
      }
    },
    "gat_completion": {
      "task": "gat_completion",
      "dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
      "dataset_name": "completion",
      "dataset_kwargs": {
        "trust_remote_code": true
      },
      "test_split": "test",
      "fewshot_split": "validation",
      "process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
      "doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
      "doc_to_target": "{{label}}",
      "doc_to_choice": [
        "\u0623",
        "\u0628",
        "\u062c",
        "\u062f"
      ],
      "description": "",
      "target_delimiter": " ",
      "fewshot_delimiter": "\n\n",
      "num_fewshot": 0,
      "metric_list": [
        {
|
279 |
+
"metric": "acc",
|
280 |
+
"aggregation": "mean",
|
281 |
+
"higher_is_better": true
|
282 |
+
}
|
283 |
+
],
|
284 |
+
"output_type": "multiple_choice",
|
285 |
+
"repeats": 1,
|
286 |
+
"should_decontaminate": false,
|
287 |
+
"metadata": {
|
288 |
+
"version": 0.0
|
289 |
+
}
|
290 |
+
},
|
291 |
+
"gat_contextual": {
|
292 |
+
"task": "gat_contextual",
|
293 |
+
"dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
|
294 |
+
"dataset_name": "contextual",
|
295 |
+
"dataset_kwargs": {
|
296 |
+
"trust_remote_code": true
|
297 |
+
},
|
298 |
+
"test_split": "test",
|
299 |
+
"fewshot_split": "validation",
|
300 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
|
301 |
+
"doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
|
302 |
+
"doc_to_target": "{{label}}",
|
303 |
+
"doc_to_choice": [
|
304 |
+
"\u0623",
|
305 |
+
"\u0628",
|
306 |
+
"\u062c",
|
307 |
+
"\u062f"
|
308 |
+
],
|
309 |
+
"description": "\u0627\u0648\u062c\u062f \u0627\u0644\u062e\u0637\u0623 \u0627\u0644\u0633\u064a\u0627\u0642\u064a \u0641\u064a \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0627\u0644\u062a\u0627\u0644\u064a\u0629 \u0645\u0646 \u0628\u064a\u0646 \u0627\u0644\u062e\u064a\u0627\u0631\u0627\u062a:",
|
310 |
+
"target_delimiter": " ",
|
311 |
+
"fewshot_delimiter": "\n\n",
|
312 |
+
"num_fewshot": 0,
|
313 |
+
"metric_list": [
|
314 |
+
{
|
315 |
+
"metric": "acc",
|
316 |
+
"aggregation": "mean",
|
317 |
+
"higher_is_better": true
|
318 |
+
}
|
319 |
+
],
|
320 |
+
"output_type": "multiple_choice",
|
321 |
+
"repeats": 1,
|
322 |
+
"should_decontaminate": false,
|
323 |
+
"metadata": {
|
324 |
+
"version": 0.0
|
325 |
+
}
|
326 |
+
},
|
327 |
+
"gat_geometry": {
|
328 |
+
"task": "gat_geometry",
|
329 |
+
"dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
|
330 |
+
"dataset_name": "geometry",
|
331 |
+
"dataset_kwargs": {
|
332 |
+
"trust_remote_code": true
|
333 |
+
},
|
334 |
+
"test_split": "test",
|
335 |
+
"fewshot_split": "validation",
|
336 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
|
337 |
+
"doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
|
338 |
+
"doc_to_target": "{{label}}",
|
339 |
+
"doc_to_choice": [
|
340 |
+
"\u0623",
|
341 |
+
"\u0628",
|
342 |
+
"\u062c",
|
343 |
+
"\u062f"
|
344 |
+
],
|
345 |
+
"description": "",
|
346 |
+
"target_delimiter": " ",
|
347 |
+
"fewshot_delimiter": "\n\n",
|
348 |
+
"num_fewshot": 0,
|
349 |
+
"metric_list": [
|
350 |
+
{
|
351 |
+
"metric": "acc",
|
352 |
+
"aggregation": "mean",
|
353 |
+
"higher_is_better": true
|
354 |
+
}
|
355 |
+
],
|
356 |
+
"output_type": "multiple_choice",
|
357 |
+
"repeats": 1,
|
358 |
+
"should_decontaminate": false,
|
359 |
+
"metadata": {
|
360 |
+
"version": 0.0
|
361 |
+
}
|
362 |
+
},
|
363 |
+
"gat_reading": {
|
364 |
+
"task": "gat_reading",
|
365 |
+
"dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
|
366 |
+
"dataset_name": "reading",
|
367 |
+
"dataset_kwargs": {
|
368 |
+
"trust_remote_code": true
|
369 |
+
},
|
370 |
+
"test_split": "test",
|
371 |
+
"fewshot_split": "validation",
|
372 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
|
373 |
+
"doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
|
374 |
+
"doc_to_target": "{{label}}",
|
375 |
+
"doc_to_choice": [
|
376 |
+
"\u0623",
|
377 |
+
"\u0628",
|
378 |
+
"\u062c",
|
379 |
+
"\u062f"
|
380 |
+
],
|
381 |
+
"description": "",
|
382 |
+
"target_delimiter": " ",
|
383 |
+
"fewshot_delimiter": "\n\n",
|
384 |
+
"num_fewshot": 0,
|
385 |
+
"metric_list": [
|
386 |
+
{
|
387 |
+
"metric": "acc",
|
388 |
+
"aggregation": "mean",
|
389 |
+
"higher_is_better": true
|
390 |
+
}
|
391 |
+
],
|
392 |
+
"output_type": "multiple_choice",
|
393 |
+
"repeats": 1,
|
394 |
+
"should_decontaminate": false,
|
395 |
+
"metadata": {
|
396 |
+
"version": 0.0
|
397 |
+
}
|
398 |
+
}
|
399 |
+
},
|
400 |
+
"versions": {
|
401 |
+
"gat": 0,
|
402 |
+
"gat_algebra": 0.0,
|
403 |
+
"gat_analogy": 0.0,
|
404 |
+
"gat_arithmetic": 0.0,
|
405 |
+
"gat_association": 0.0,
|
406 |
+
"gat_comparisons": 0.0,
|
407 |
+
"gat_completion": 0.0,
|
408 |
+
"gat_contextual": 0.0,
|
409 |
+
"gat_geometry": 0.0,
|
410 |
+
"gat_reading": 0.0
|
411 |
+
},
|
412 |
+
"n-shot": {
|
413 |
+
"gat_algebra": 0,
|
414 |
+
"gat_analogy": 0,
|
415 |
+
"gat_arithmetic": 0,
|
416 |
+
"gat_association": 0,
|
417 |
+
"gat_comparisons": 0,
|
418 |
+
"gat_completion": 0,
|
419 |
+
"gat_contextual": 0,
|
420 |
+
"gat_geometry": 0,
|
421 |
+
"gat_reading": 0
|
422 |
+
},
|
423 |
+
"higher_is_better": {
|
424 |
+
"gat": {
|
425 |
+
"acc": true
|
426 |
+
},
|
427 |
+
"gat_algebra": {
|
428 |
+
"acc": true
|
429 |
+
},
|
430 |
+
"gat_analogy": {
|
431 |
+
"acc": true
|
432 |
+
},
|
433 |
+
"gat_arithmetic": {
|
434 |
+
"acc": true
|
435 |
+
},
|
436 |
+
"gat_association": {
|
437 |
+
"acc": true
|
438 |
+
},
|
439 |
+
"gat_comparisons": {
|
440 |
+
"acc": true
|
441 |
+
},
|
442 |
+
"gat_completion": {
|
443 |
+
"acc": true
|
444 |
+
},
|
445 |
+
"gat_contextual": {
|
446 |
+
"acc": true
|
447 |
+
},
|
448 |
+
"gat_geometry": {
|
449 |
+
"acc": true
|
450 |
+
},
|
451 |
+
"gat_reading": {
|
452 |
+
"acc": true
|
453 |
+
}
|
454 |
+
},
|
455 |
+
"n-samples": {
|
456 |
+
"gat_analogy": {
|
457 |
+
"original": 2745,
|
458 |
+
"effective": 2745
|
459 |
+
},
|
460 |
+
"gat_association": {
|
461 |
+
"original": 1045,
|
462 |
+
"effective": 1045
|
463 |
+
},
|
464 |
+
"gat_completion": {
|
465 |
+
"original": 1210,
|
466 |
+
"effective": 1210
|
467 |
+
},
|
468 |
+
"gat_reading": {
|
469 |
+
"original": 2645,
|
470 |
+
"effective": 2645
|
471 |
+
},
|
472 |
+
"gat_algebra": {
|
473 |
+
"original": 2695,
|
474 |
+
"effective": 2695
|
475 |
+
},
|
476 |
+
"gat_arithmetic": {
|
477 |
+
"original": 2717,
|
478 |
+
"effective": 2717
|
479 |
+
},
|
480 |
+
"gat_comparisons": {
|
481 |
+
"original": 1220,
|
482 |
+
"effective": 1220
|
483 |
+
},
|
484 |
+
"gat_contextual": {
|
485 |
+
"original": 1304,
|
486 |
+
"effective": 1304
|
487 |
+
},
|
488 |
+
"gat_geometry": {
|
489 |
+
"original": 365,
|
490 |
+
"effective": 365
|
491 |
+
}
|
492 |
+
},
|
493 |
+
"config": {
|
494 |
+
"model": "vllm",
|
495 |
+
"model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.4,download_dir=/tmp",
|
496 |
+
"batch_size": 1,
|
497 |
+
"batch_sizes": [],
|
498 |
+
"device": null,
|
499 |
+
"use_cache": null,
|
500 |
+
"limit": null,
|
501 |
+
"bootstrap_iters": 100000,
|
502 |
+
"gen_kwargs": null,
|
503 |
+
"random_seed": 0,
|
504 |
+
"numpy_seed": 1234,
|
505 |
+
"torch_seed": 1234,
|
506 |
+
"fewshot_seed": 1234
|
507 |
+
},
|
508 |
+
"git_hash": "8e1bd48d",
|
509 |
+
"date": 1735749781.6371627,
|
510 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single 
vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
511 |
+
"transformers_version": "4.47.1",
|
512 |
+
"upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
|
513 |
+
"tokenizer_pad_token": [
|
514 |
+
"<|end_of_text|>",
|
515 |
+
"128001"
|
516 |
+
],
|
517 |
+
"tokenizer_eos_token": [
|
518 |
+
"<|end_of_text|>",
|
519 |
+
"128001"
|
520 |
+
],
|
521 |
+
"tokenizer_bos_token": [
|
522 |
+
"<|begin_of_text|>",
|
523 |
+
"128000"
|
524 |
+
],
|
525 |
+
"eot_token_id": 128001,
|
526 |
+
"max_length": 8192,
|
527 |
+
"task_hashes": {},
|
528 |
+
"model_source": "vllm",
|
529 |
+
"model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
|
530 |
+
"model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
|
531 |
+
"system_instruction": null,
|
532 |
+
"system_instruction_sha": null,
|
533 |
+
"fewshot_as_multiturn": false,
|
534 |
+
"chat_template": null,
|
535 |
+
"chat_template_sha": null,
|
536 |
+
"start_time": 10066.91226392,
|
537 |
+
"end_time": 10586.891967311,
|
538 |
+
"total_evaluation_time_seconds": "519.9797033909999"
|
539 |
+
}
|
evaluations/ar/AceGPT-v2-8B-Chat/moe_ien_mcq_0_shot.json
ADDED
@@ -0,0 +1,127 @@
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"moe_ien_mcq": {
|
4 |
+
"alias": "moe_ien_mcq",
|
5 |
+
"acc,none": 0.7700700700700701,
|
6 |
+
"acc_stderr,none": 0.0042101916833611345,
|
7 |
+
"acc_norm,none": 0.7700700700700701,
|
8 |
+
"acc_norm_stderr,none": 0.0042101916833611345
|
9 |
+
}
|
10 |
+
},
|
11 |
+
"group_subtasks": {
|
12 |
+
"moe_ien_mcq": []
|
13 |
+
},
|
14 |
+
"configs": {
|
15 |
+
"moe_ien_mcq": {
|
16 |
+
"task": "moe_ien_mcq",
|
17 |
+
"dataset_path": "lm_eval/tasks/moe_ien_mcq/ien_moe_mcq.py",
|
18 |
+
"dataset_name": "moe_ien_mcq",
|
19 |
+
"dataset_kwargs": {
|
20 |
+
"trust_remote_code": true
|
21 |
+
},
|
22 |
+
"validation_split": "validation",
|
23 |
+
"test_split": "test",
|
24 |
+
"fewshot_split": "validation",
|
25 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.split(\". \", 1)[1] if \". \" in choice else choice\n\n def format_example(doc, keys):\n question = doc[\"Question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"Choices\"])]\n \n )\n prompt = f\"\\n\\n\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys = [\"A\", \"B\", \"C\", \"D\", \"E\", \"F\"][0:len(doc[\"Choices\"])]\n out_doc = {\n \"Query\": format_example(doc, keys), \n \"Choices\": keys,\n \"gold\": int(doc[\"Answer\"])-1, ## \n } \n return out_doc\n \n return dataset.map(_process_docs)\n",
|
26 |
+
"doc_to_text": "Query",
|
27 |
+
"doc_to_target": "gold",
|
28 |
+
"doc_to_choice": "{{Choices}}",
|
29 |
+
"description": "\u0641\u064a\u0645\u0627\u202f\u064a\u0644\u064a\u202f\u0623\u0633\u0626\u0644\u0629\u202f\u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631\u202f\u0645\u0646\u202f\u0645\u062a\u0639\u062f\u062f\u202f(\u0645\u0639\u202f\u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a)\u202f\u0641\u064a\u202f{{Subject}}",
|
30 |
+
"target_delimiter": " ",
|
31 |
+
"fewshot_delimiter": "\n\n",
|
32 |
+
"fewshot_config": {
|
33 |
+
"sampler": "balanced_cat"
|
34 |
+
},
|
35 |
+
"num_fewshot": 0,
|
36 |
+
"metric_list": [
|
37 |
+
{
|
38 |
+
"metric": "acc",
|
39 |
+
"aggregation": "mean",
|
40 |
+
"higher_is_better": true
|
41 |
+
},
|
42 |
+
{
|
43 |
+
"metric": "acc_norm",
|
44 |
+
"aggregation": "mean",
|
45 |
+
"higher_is_better": true
|
46 |
+
}
|
47 |
+
],
|
48 |
+
"output_type": "multiple_choice",
|
49 |
+
"repeats": 1,
|
50 |
+
"should_decontaminate": true,
|
51 |
+
"doc_to_decontamination_query": "Query",
|
52 |
+
"metadata": {
|
53 |
+
"version": 0.0
|
54 |
+
}
|
55 |
+
}
|
56 |
+
},
|
57 |
+
"versions": {
|
58 |
+
"moe_ien_mcq": 0.0
|
59 |
+
},
|
60 |
+
"n-shot": {
|
61 |
+
"moe_ien_mcq": 0
|
62 |
+
},
|
63 |
+
"higher_is_better": {
|
64 |
+
"moe_ien_mcq": {
|
65 |
+
"acc": true,
|
66 |
+
"acc_norm": true
|
67 |
+
}
|
68 |
+
},
|
69 |
+
"n-samples": {
|
70 |
+
"moe_ien_mcq": {
|
71 |
+
"original": 9990,
|
72 |
+
"effective": 9990
|
73 |
+
}
|
74 |
+
},
|
75 |
+
"config": {
|
76 |
+
"model": "hf",
|
77 |
+
"model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
|
78 |
+
"model_num_parameters": 8030261248,
|
79 |
+
"model_dtype": "torch.float16",
|
80 |
+
"model_revision": "main",
|
81 |
+
"model_sha": "562d0998c03c02d315e346f81650a43955711901",
|
82 |
+
"batch_size": 1,
|
83 |
+
"batch_sizes": [],
|
84 |
+
"device": null,
|
85 |
+
"use_cache": null,
|
86 |
+
"limit": null,
|
87 |
+
"bootstrap_iters": 100000,
|
88 |
+
"gen_kwargs": null,
|
89 |
+
"random_seed": 0,
|
90 |
+
"numpy_seed": 1234,
|
91 |
+
"torch_seed": 1234,
|
92 |
+
"fewshot_seed": 1234
|
93 |
+
},
|
94 |
+
"git_hash": "b955b2950",
|
95 |
+
"date": 1739783202.062394,
|
96 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma 
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
97 |
+
"transformers_version": "4.48.3",
|
98 |
+
"upper_git_hash": null,
|
99 |
+
"tokenizer_pad_token": [
|
100 |
+
"<|end_of_text|>",
|
101 |
+
"128001"
|
102 |
+
],
|
103 |
+
"tokenizer_eos_token": [
|
104 |
+
"<|end_of_text|>",
|
105 |
+
"128001"
|
106 |
+
],
|
107 |
+
"tokenizer_bos_token": [
|
108 |
+
"<|begin_of_text|>",
|
109 |
+
"128000"
|
110 |
+
],
|
111 |
+
"eot_token_id": 128001,
|
112 |
+
"max_length": 8192,
|
113 |
+
"task_hashes": {
|
114 |
+
"moe_ien_mcq": "99731f9d1bb76d010da5a439ea1b0bb7695451459d680f708f7222f02ba8e831"
|
115 |
+
},
|
116 |
+
"model_source": "hf",
|
117 |
+
"model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
|
118 |
+
"model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
|
119 |
+
"system_instruction": null,
|
120 |
+
"system_instruction_sha": null,
|
121 |
+
"fewshot_as_multiturn": false,
|
122 |
+
"chat_template": null,
|
123 |
+
"chat_template_sha": null,
|
124 |
+
"start_time": 61116.014324615,
|
125 |
+
"end_time": 61463.567260828,
|
126 |
+
"total_evaluation_time_seconds": "347.5529362130037"
|
127 |
+
}
|
evaluations/ar/AceGPT-v2-8B-Chat/moe_ien_tf_0_shot.json
ADDED
@@ -0,0 +1,129 @@
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"moe_ien_tf": {
|
4 |
+
"alias": "moe_ien_tf",
|
5 |
+
"acc,none": 0.7590589043448395,
|
6 |
+
"acc_stderr,none": 0.00560476076159517,
|
7 |
+
"acc_norm,none": 0.7590589043448395,
|
8 |
+
"acc_norm_stderr,none": 0.00560476076159517
|
9 |
+
}
|
10 |
+
},
|
11 |
+
"group_subtasks": {
|
12 |
+
"moe_ien_tf": []
|
13 |
+
},
|
14 |
+
"configs": {
|
15 |
+
"moe_ien_tf": {
|
16 |
+
"task": "moe_ien_tf",
|
17 |
+
"tag": [
|
18 |
+
"multiple_choice"
|
19 |
+
],
|
20 |
+
"dataset_path": "lm_eval/tasks/moe_ien_tf/moe_ien_tf.py",
|
21 |
+
"dataset_name": "moe_ien_tf",
|
22 |
+
"dataset_kwargs": {
|
23 |
+
"trust_remote_code": true
|
24 |
+
},
|
25 |
+
"validation_split": "validation",
|
26 |
+
"test_split": "test",
|
27 |
+
"fewshot_split": "validation",
|
28 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n keys=[\"\u0635\u062d\u064a\u062d\u0629\",\n \"\u062e\u0627\u0637\u0626\u0629\"\n ]\n #keys =[\"\u0635\u0648\u0627\u0628\",\n # \"\u062e\u0637\u0623\"]\n target_key = int(doc[\"Answer\"])-1\n\n out_doc = {\n \"query\": \"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" +doc[\"Question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\", \n \"choices\": keys,\n \"gold\": target_key,\n }\n return out_doc\n return dataset.map(_process_docs)\n",
|
29 |
+
"doc_to_text": "query",
|
30 |
+
"doc_to_target": "gold",
|
31 |
+
"doc_to_choice": "choices",
|
32 |
+
"description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{Subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d\u064a\u062d\u0629' \u0623\u0648 '\u062e\u0627\u0637\u0626\u0629' \u062f\u0648\u0646 \u0634\u0631\u062d ",
|
33 |
+
"target_delimiter": " ",
|
34 |
+
"fewshot_delimiter": "\n\n",
|
35 |
+
"fewshot_config": {
|
36 |
+
"sampler": "balanced_cat"
|
37 |
+
},
|
38 |
+
"num_fewshot": 0,
|
39 |
+
"metric_list": [
|
40 |
+
{
|
41 |
+
"metric": "acc",
|
42 |
+
"aggregation": "mean",
|
43 |
+
"higher_is_better": true
|
44 |
+
},
|
45 |
+
{
|
46 |
+
"metric": "acc_norm",
|
47 |
+
"aggregation": "mean",
|
48 |
+
"higher_is_better": true
|
49 |
+
}
|
50 |
+
],
|
51 |
+
"output_type": "multiple_choice",
|
52 |
+
"repeats": 1,
|
53 |
+
"should_decontaminate": false,
|
54 |
+
"metadata": {
|
55 |
+
"version": 2.0
|
56 |
+
}
|
57 |
+
}
|
58 |
+
},
|
59 |
+
"versions": {
|
60 |
+
"moe_ien_tf": 2.0
|
61 |
+
},
|
62 |
+
"n-shot": {
|
63 |
+
"moe_ien_tf": 0
|
64 |
+
},
|
65 |
+
"higher_is_better": {
|
66 |
+
"moe_ien_tf": {
|
67 |
+
"acc": true,
|
68 |
+
"acc_norm": true
|
69 |
+
}
|
70 |
+
},
|
71 |
+
"n-samples": {
|
72 |
+
"moe_ien_tf": {
|
73 |
+
"original": 5823,
|
74 |
+
"effective": 5823
|
75 |
+
}
|
76 |
+
},
|
77 |
+
"config": {
|
78 |
+
"model": "hf",
|
79 |
+
"model_args": "pretrained=FreedomIntelligence/AceGPT-v2-8B-Chat,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
|
80 |
+
"model_num_parameters": 8030261248,
|
81 |
+
"model_dtype": "torch.float16",
|
82 |
+
"model_revision": "main",
|
83 |
+
"model_sha": "562d0998c03c02d315e346f81650a43955711901",
|
84 |
+
"batch_size": 1,
|
85 |
+
"batch_sizes": [],
|
86 |
+
"device": null,
|
87 |
+
"use_cache": null,
|
88 |
+
"limit": null,
|
89 |
+
"bootstrap_iters": 100000,
|
90 |
+
"gen_kwargs": null,
|
91 |
+
"random_seed": 0,
|
92 |
+
"numpy_seed": 1234,
|
93 |
+
"torch_seed": 1234,
|
94 |
+
"fewshot_seed": 1234
|
95 |
+
},
|
96 |
+
"git_hash": "b955b2950",
|
97 |
+
"date": 1739783594.7150183,
|
98 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma 
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
99 |
+
"transformers_version": "4.48.3",
|
100 |
+
"upper_git_hash": null,
|
101 |
+
"tokenizer_pad_token": [
|
102 |
+
"<|end_of_text|>",
|
103 |
+
"128001"
|
104 |
+
],
|
105 |
+
"tokenizer_eos_token": [
|
106 |
+
"<|end_of_text|>",
|
107 |
+
"128001"
|
108 |
+
],
|
109 |
+
"tokenizer_bos_token": [
|
110 |
+
"<|begin_of_text|>",
|
111 |
+
"128000"
|
112 |
+
],
|
113 |
+
"eot_token_id": 128001,
|
114 |
+
"max_length": 8192,
|
115 |
+
"task_hashes": {
|
116 |
+
"moe_ien_tf": "a8315c59ec304a82f04395ff5e7728d6586b1b0b5f569486840b7d29d76a8dd8"
|
117 |
+
},
|
118 |
+
"model_source": "hf",
|
119 |
+
"model_name": "FreedomIntelligence/AceGPT-v2-8B-Chat",
|
120 |
+
"model_name_sanitized": "FreedomIntelligence__AceGPT-v2-8B-Chat",
|
121 |
+
"system_instruction": null,
|
122 |
+
"system_instruction_sha": null,
|
123 |
+
"fewshot_as_multiturn": false,
|
124 |
+
"chat_template": null,
|
125 |
+
"chat_template_sha": null,
|
126 |
+
"start_time": 61508.598662402,
|
127 |
+
"end_time": 61883.458017876,
|
128 |
+
"total_evaluation_time_seconds": "374.85935547400004"
|
129 |
+
}
|
evaluations/ar/AceGPT-v2-8B-Chat/openaimmlu_0_shot.json
ADDED
The diff for this file is too large to render.
See raw diff
|
|
evaluations/ar/Allam-7b-instruct-preview/acva_5_shot.json
ADDED
@@ -0,0 +1,119 @@
|
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"acva": {
|
4 |
+
"alias": "acva",
|
5 |
+
"acc,none": 0.7746268656716417,
|
6 |
+
"acc_stderr,none": 0.004477269169728854,
|
7 |
+
"acc_norm,none": 0.7632606199770379,
|
8 |
+
"acc_norm_stderr,none": 0.004554991129754026
|
9 |
+
}
|
10 |
+
},
|
11 |
+
"group_subtasks": {
|
12 |
+
"acva": []
|
13 |
+
},
|
14 |
+
"configs": {
|
15 |
+
"acva": {
|
16 |
+
"task": "acva",
|
17 |
+
"tag": [
|
18 |
+
"multiple_choice"
|
19 |
+
],
|
20 |
+
"dataset_path": "FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment",
|
21 |
+
"dataset_kwargs": {
|
22 |
+
"trust_remote_code": true
|
23 |
+
},
|
24 |
+
"test_split": "test",
|
25 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _format_subject(subject):\n \n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n \n def _generate_subject(doc):\n subject = _format_subject(doc[\"id\"].split(\"-\")[0])\n\n return subject\n \n def _process_docs(doc):\n keys = [\"\u0635\u062d\",\n \"\u062e\u0637\u0623\"]\n subject = _generate_subject(doc)\n gold = keys.index(doc['answer'])\n out_doc = {\n \"id\": doc[\"id\"],\n \"query\": \"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" + doc[\"question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\",\n \"choices\": keys,\n \"gold\": gold,\n \"subject\": subject,\n }\n \n return out_doc\n\n return dataset.map(_process_docs)\n",
|
26 |
+
"doc_to_text": "query",
|
27 |
+
"doc_to_target": "gold",
|
28 |
+
"doc_to_choice": "choices",
|
29 |
+
"description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d' \u0623\u0648 '\u062e\u0637\u0623' \u062f\u0648\u0646 \u0634\u0631\u062d",
|
30 |
+
"target_delimiter": " ",
|
31 |
+
"fewshot_delimiter": "\n\n",
|
32 |
+
"num_fewshot": 5,
|
33 |
+
"metric_list": [
|
34 |
+
{
|
35 |
+
"metric": "acc",
|
36 |
+
"aggregation": "mean",
|
37 |
+
"higher_is_better": true
|
38 |
+
},
|
39 |
+
{
|
40 |
+
"metric": "acc_norm",
|
41 |
+
"aggregation": "mean",
|
42 |
+
"higher_is_better": true
|
43 |
+
}
|
44 |
+
],
|
45 |
+
"output_type": "multiple_choice",
|
46 |
+
"repeats": 1,
|
47 |
+
"should_decontaminate": false,
|
48 |
+
"metadata": {
|
49 |
+
"version": 0.0
|
50 |
+
}
|
51 |
+
}
|
52 |
+
},
|
53 |
+
"versions": {
|
54 |
+
"acva": 0.0
|
55 |
+
},
|
56 |
+
"n-shot": {
|
57 |
+
"acva": 5
|
58 |
+
},
|
59 |
+
"higher_is_better": {
|
60 |
+
"acva": {
|
61 |
+
"acc": true,
|
62 |
+
"acc_norm": true
|
63 |
+
}
|
64 |
+
},
|
65 |
+
"n-samples": {
|
66 |
+
"acva": {
|
67 |
+
"original": 8710,
|
68 |
+
"effective": 8710
|
69 |
+
}
|
70 |
+
},
|
71 |
+
"config": {
|
72 |
+
"model": "vllm",
|
73 |
+
"model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.8",
|
74 |
+
"batch_size": 1,
|
75 |
+
"batch_sizes": [],
|
76 |
+
"device": null,
|
77 |
+
"use_cache": null,
|
78 |
+
"limit": null,
|
79 |
+
"bootstrap_iters": 100000,
|
80 |
+
"gen_kwargs": null,
|
81 |
+
"random_seed": 0,
|
82 |
+
"numpy_seed": 1234,
|
83 |
+
"torch_seed": 1234,
|
84 |
+
"fewshot_seed": 1234
|
85 |
+
},
|
86 |
+
"git_hash": "8e1bd48d",
|
87 |
+
"date": 1735662713.7617116,
|
88 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single 
vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
89 |
+
"transformers_version": "4.47.1",
|
90 |
+
"upper_git_hash": null,
|
91 |
+
"tokenizer_pad_token": [
|
92 |
+
"<unk>",
|
93 |
+
"0"
|
94 |
+
],
|
95 |
+
"tokenizer_eos_token": [
|
96 |
+
"</s>",
|
97 |
+
"2"
|
98 |
+
],
|
99 |
+
"tokenizer_bos_token": [
|
100 |
+
"<s>",
|
101 |
+
"1"
|
102 |
+
],
|
103 |
+
"eot_token_id": 2,
|
104 |
+
"max_length": 4096,
|
105 |
+
"task_hashes": {
|
106 |
+
"acva": "d007c508f0accdd697f549d7cbe7f960f1470c8f86f1a0969355a6ef33108edb"
|
107 |
+
},
|
108 |
+
"model_source": "vllm",
|
109 |
+
"model_name": "/tmp/7b-alpha-v1.27.2.25",
|
110 |
+
"model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
|
111 |
+
"system_instruction": null,
|
112 |
+
"system_instruction_sha": null,
|
113 |
+
"fewshot_as_multiturn": false,
|
114 |
+
"chat_template": null,
|
115 |
+
"chat_template_sha": null,
|
116 |
+
"start_time": 3374.021232778,
|
117 |
+
"end_time": 3578.563943596,
|
118 |
+
"total_evaluation_time_seconds": "204.54271081800016"
|
119 |
+
}
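Reviewer note: the `acc_stderr,none` value in the `acva` results above is consistent with the standard error of a mean over 0/1 scores, computed with the sample (n - 1) variance. The sketch below is only a sanity check under that assumption, not the harness's own code. The correct-answer count of 6747 is inferred from `acc` times the 8710 `effective` samples and does not appear in the file, and `binary_mean_stderr` is a hypothetical helper name.

```python
import math


def binary_mean_stderr(num_correct: int, num_samples: int) -> tuple[float, float]:
    """Return (accuracy, standard error) for a set of 0/1 scores.

    Assumes stderr = sample standard deviation / sqrt(n), which for
    binary scores simplifies to sqrt(p * (1 - p) / (n - 1)).
    """
    p = num_correct / num_samples
    stderr = math.sqrt(p * (1.0 - p) / (num_samples - 1))
    return p, stderr


# 8710 comes from "n-samples" in the file; 6747 is inferred, not recorded.
acc, stderr = binary_mean_stderr(6747, 8710)
print(f"acc={acc:.4f} +/- {stderr:.4f}")
```

The same check should apply to the other `multiple_choice` files in this commit, provided their metrics also use the plain `mean` aggregation.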
|
evaluations/ar/Allam-7b-instruct-preview/ar_ifeval_0_shot.json
ADDED
@@ -0,0 +1,142 @@
|
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"ar_ifeval": {
|
4 |
+
"alias": "ar_ifeval",
|
5 |
+
"prompt_level_strict_acc,none": 0.31343283582089554,
|
6 |
+
"prompt_level_strict_acc_stderr,none": 0.020055655889994813,
|
7 |
+
"inst_level_strict_acc,none": 0.6764505119453925,
|
8 |
+
"inst_level_strict_acc_stderr,none": "N/A",
|
9 |
+
"prompt_level_loose_acc,none": 0.3656716417910448,
|
10 |
+
"prompt_level_loose_acc_stderr,none": 0.020822161638297296,
|
11 |
+
"inst_level_loose_acc,none": 0.7051194539249147,
|
12 |
+
"inst_level_loose_acc_stderr,none": "N/A"
|
13 |
+
}
|
14 |
+
},
|
15 |
+
"group_subtasks": {
|
16 |
+
"ar_ifeval": []
|
17 |
+
},
|
18 |
+
"configs": {
|
19 |
+
"ar_ifeval": {
|
20 |
+
"task": "ar_ifeval",
|
21 |
+
"dataset_path": "lm_eval/tasks/ar_ifeval/ar_ifeval.py",
|
22 |
+
"dataset_name": "ar_ifeval",
|
23 |
+
"dataset_kwargs": {
|
24 |
+
"trust_remote_code": true
|
25 |
+
},
|
26 |
+
"test_split": "test",
|
27 |
+
"doc_to_text": "prompt",
|
28 |
+
"doc_to_target": 0,
|
29 |
+
"process_results": "def process_results(doc, results):\n\n response = results[0]\n out_strict = process_sample(doc, response, 'strict')\n out_loose = process_sample(doc, response, 'loose')\n\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
|
30 |
+
"description": "",
|
31 |
+
"target_delimiter": " ",
|
32 |
+
"fewshot_delimiter": "\n\n",
|
33 |
+
"num_fewshot": 0,
|
34 |
+
"metric_list": [
|
35 |
+
{
|
36 |
+
"metric": "prompt_level_strict_acc",
|
37 |
+
"aggregation": "mean",
|
38 |
+
"higher_is_better": true
|
39 |
+
},
|
40 |
+
{
|
41 |
+
"metric": "inst_level_strict_acc",
|
42 |
+
"aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
|
43 |
+
"higher_is_better": true
|
44 |
+
},
|
45 |
+
{
|
46 |
+
"metric": "prompt_level_loose_acc",
|
47 |
+
"aggregation": "mean",
|
48 |
+
"higher_is_better": true
|
49 |
+
},
|
50 |
+
{
|
51 |
+
"metric": "inst_level_loose_acc",
|
52 |
+
"aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
|
53 |
+
"higher_is_better": true
|
54 |
+
}
|
55 |
+
],
|
56 |
+
"output_type": "generate_until",
|
57 |
+
"generation_kwargs": {
|
58 |
+
"until": [],
|
59 |
+
"do_sample": false,
|
60 |
+
"temperature": 0.0,
|
61 |
+
"max_gen_toks": 1280
|
62 |
+
},
|
63 |
+
"repeats": 1,
|
64 |
+
"should_decontaminate": false,
|
65 |
+
"metadata": {
|
66 |
+
"version": 4.0
|
67 |
+
}
|
68 |
+
}
|
69 |
+
},
|
70 |
+
"versions": {
|
71 |
+
"ar_ifeval": 4.0
|
72 |
+
},
|
73 |
+
"n-shot": {
|
74 |
+
"ar_ifeval": 0
|
75 |
+
},
|
76 |
+
"higher_is_better": {
|
77 |
+
"ar_ifeval": {
|
78 |
+
"prompt_level_strict_acc": true,
|
79 |
+
"inst_level_strict_acc": true,
|
80 |
+
"prompt_level_loose_acc": true,
|
81 |
+
"inst_level_loose_acc": true
|
82 |
+
}
|
83 |
+
},
|
84 |
+
"n-samples": {
|
85 |
+
"ar_ifeval": {
|
86 |
+
"original": 536,
|
87 |
+
"effective": 536
|
88 |
+
}
|
89 |
+
},
|
90 |
+
"config": {
|
91 |
+
"model": "hf",
|
92 |
+
"model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
|
93 |
+
"model_num_parameters": 7000559616,
|
94 |
+
"model_dtype": "torch.bfloat16",
|
95 |
+
"model_revision": "main",
|
96 |
+
"model_sha": "",
|
97 |
+
"batch_size": 1,
|
98 |
+
"batch_sizes": [],
|
99 |
+
"device": null,
|
100 |
+
"use_cache": null,
|
101 |
+
"limit": null,
|
102 |
+
"bootstrap_iters": 100000,
|
103 |
+
"gen_kwargs": null,
|
104 |
+
"random_seed": 0,
|
105 |
+
"numpy_seed": 1234,
|
106 |
+
"torch_seed": 1234,
|
107 |
+
"fewshot_seed": 1234
|
108 |
+
},
|
109 |
+
"git_hash": "b955b2950",
|
110 |
+
"date": 1739618378.981141,
|
111 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma 
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
112 |
+
"transformers_version": "4.48.3",
|
113 |
+
"upper_git_hash": null,
|
114 |
+
"tokenizer_pad_token": [
|
115 |
+
"<unk>",
|
116 |
+
"0"
|
117 |
+
],
|
118 |
+
"tokenizer_eos_token": [
|
119 |
+
"</s>",
|
120 |
+
"2"
|
121 |
+
],
|
122 |
+
"tokenizer_bos_token": [
|
123 |
+
"<s>",
|
124 |
+
"1"
|
125 |
+
],
|
126 |
+
"eot_token_id": 2,
|
127 |
+
"max_length": 4096,
|
128 |
+
"task_hashes": {
|
129 |
+
"ar_ifeval": "d0db7903ef270d7dc54efe4e7713be0de9864fc3a36c901c6e5777a6a5f69aa9"
|
130 |
+
},
|
131 |
+
"model_source": "hf",
|
132 |
+
"model_name": "/tmp/7b-alpha-v1.27.2.25",
|
133 |
+
"model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
|
134 |
+
"system_instruction": null,
|
135 |
+
"system_instruction_sha": null,
|
136 |
+
"fewshot_as_multiturn": false,
|
137 |
+
"chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ bos_token + ' [INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + content.strip() + ' ' + eos_token }}{% endif %}{% endfor %}",
|
138 |
+
"chat_template_sha": "f1dff938141b507da4a409b6bb3431382088a97a963acd246a41f2f344ae831f",
|
139 |
+
"start_time": 1393068.333905473,
|
140 |
+
"end_time": 1397143.169266589,
|
141 |
+
"total_evaluation_time_seconds": "4074.8353611161"
|
142 |
+
}
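Reviewer note: in the `ar_ifeval` file above, the prompt-level scores (about 0.31 strict) and instruction-level scores (about 0.68 strict) differ mainly because of how they are aggregated. The sketch below reuses the `agg_inst_level_acc` function recorded in the config and contrasts it with a prompt-level mean. The sample data is made up, `agg_prompt_level_acc` is a hypothetical name, and it assumes that `follow_all_instructions` is true only when every entry of `follow_instruction_list` is true.

```python
def agg_inst_level_acc(items):
    # As recorded in the config: flatten the per-prompt lists of booleans
    # and average over every individual instruction.
    flat_items = [item for sublist in items for item in sublist]
    return sum(flat_items) / len(flat_items)


def agg_prompt_level_acc(items):
    # Assumption: a prompt counts as passed only if all of its
    # instructions were followed; the metric is the mean over prompts.
    return sum(all(sublist) for sublist in items) / len(items)


# Made-up example: three prompts with 2, 1, and 3 instructions each.
sample = [[True, False], [True], [True, True, True]]

inst_acc = agg_inst_level_acc(sample)      # 5 of 6 instructions followed
prompt_acc = agg_prompt_level_acc(sample)  # 2 of 3 prompts fully followed
print(inst_acc, prompt_acc)
```

A single missed instruction fails the whole prompt at prompt level but costs only one item at instruction level, so the prompt-level score cannot exceed the instruction-level score when every prompt carries the same number of instructions.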
|
evaluations/ar/Allam-7b-instruct-preview/araMath_v3_5_shot.json
ADDED
@@ -0,0 +1,126 @@
|
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"araMath_v3": {
|
4 |
+
"alias": "araMath_v3",
|
5 |
+
"acc,none": 0.6677685950413224,
|
6 |
+
"acc_stderr,none": 0.019165266705090528,
|
7 |
+
"acc_norm,none": 0.6677685950413224,
|
8 |
+
"acc_norm_stderr,none": 0.019165266705090528
|
9 |
+
}
|
10 |
+
},
|
11 |
+
"group_subtasks": {
|
12 |
+
"araMath_v3": []
|
13 |
+
},
|
14 |
+
"configs": {
|
15 |
+
"araMath_v3": {
|
16 |
+
"task": "araMath_v3",
|
17 |
+
"tag": [
|
18 |
+
"multiple_choice"
|
19 |
+
],
|
20 |
+
"dataset_path": "lm_eval/tasks/araMath_v3/araMath_v3.py",
|
21 |
+
"dataset_name": "araMath_v3",
|
22 |
+
"dataset_kwargs": {
|
23 |
+
"trust_remote_code": true
|
24 |
+
},
|
25 |
+
"validation_split": "validation",
|
26 |
+
"test_split": "test",
|
27 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def remove_prefix(choice):\n prefixes = [\"(A)\", \"(B)\", \"(C)\", \"(D)\"]\n for prefix in prefixes:\n if choice.startswith(prefix + \" \"):\n return choice[len(prefix) + 1:] \n return choice \n\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"options\"])]\n )\n\n prompt = f\"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": keys_en.index(doc[\"label\"]),\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
|
28 |
+
"doc_to_text": "query",
|
29 |
+
"doc_to_target": "gold",
|
30 |
+
"doc_to_choice": "{{choices}}",
|
31 |
+
"description": "\u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u0645\u0646 \u0628\u064a\u0646 'A\u060c B\u060c C\u060c D' \u062f\u0648\u0646 \u0634\u0631\u062d",
|
32 |
+
"target_delimiter": " ",
|
33 |
+
"fewshot_delimiter": "\n\n",
|
34 |
+
"num_fewshot": 5,
|
35 |
+
"metric_list": [
|
36 |
+
{
|
37 |
+
"metric": "acc",
|
38 |
+
"aggregation": "mean",
|
39 |
+
"higher_is_better": true
|
40 |
+
},
|
41 |
+
{
|
42 |
+
"metric": "acc_norm",
|
43 |
+
"aggregation": "mean",
|
44 |
+
"higher_is_better": true
|
45 |
+
}
|
46 |
+
],
|
47 |
+
"output_type": "multiple_choice",
|
48 |
+
"repeats": 1,
|
49 |
+
"should_decontaminate": true,
|
50 |
+
"doc_to_decontamination_query": "query",
|
51 |
+
"metadata": {
|
52 |
+
"version": 0.0
|
53 |
+
}
|
54 |
+
}
|
55 |
+
},
|
56 |
+
"versions": {
|
57 |
+
"araMath_v3": 0.0
|
58 |
+
},
|
59 |
+
"n-shot": {
|
60 |
+
"araMath_v3": 5
|
61 |
+
},
|
62 |
+
"higher_is_better": {
|
63 |
+
"araMath_v3": {
|
64 |
+
"acc": true,
|
65 |
+
"acc_norm": true
|
66 |
+
}
|
67 |
+
},
|
68 |
+
"n-samples": {
|
69 |
+
"araMath_v3": {
|
70 |
+
"original": 605,
|
71 |
+
"effective": 605
|
72 |
+
}
|
73 |
+
},
|
74 |
+
"config": {
|
75 |
+
"model": "hf",
|
76 |
+
"model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
|
77 |
+
"model_num_parameters": 7000559616,
|
78 |
+
"model_dtype": "torch.bfloat16",
|
79 |
+
"model_revision": "main",
|
80 |
+
"model_sha": "",
|
81 |
+
"batch_size": 1,
|
82 |
+
"batch_sizes": [],
|
83 |
+
"device": null,
|
84 |
+
"use_cache": null,
|
85 |
+
"limit": null,
|
86 |
+
"bootstrap_iters": 100000,
|
87 |
+
"gen_kwargs": null,
|
88 |
+
"random_seed": 0,
|
89 |
+
"numpy_seed": 1234,
|
90 |
+
"torch_seed": 1234,
|
91 |
+
"fewshot_seed": 1234
|
92 |
+
},
|
93 |
+
"git_hash": "b955b2950",
|
94 |
+
"date": 1739618269.6292942,
|
95 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma 
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
96 |
+
"transformers_version": "4.48.3",
|
97 |
+
"upper_git_hash": null,
|
98 |
+
"tokenizer_pad_token": [
|
99 |
+
"<unk>",
|
100 |
+
"0"
|
101 |
+
],
|
102 |
+
"tokenizer_eos_token": [
|
103 |
+
"</s>",
|
104 |
+
"2"
|
105 |
+
],
|
106 |
+
"tokenizer_bos_token": [
|
107 |
+
"<s>",
|
108 |
+
"1"
|
109 |
+
],
|
110 |
+
"eot_token_id": 2,
|
111 |
+
"max_length": 4096,
|
112 |
+
"task_hashes": {
|
113 |
+
"araMath_v3": "e7f60b63c44ee90c76a61f37207fa1f812622b6662200911fcfd7dabe78ada66"
|
114 |
+
},
|
115 |
+
"model_source": "hf",
|
116 |
+
"model_name": "/tmp/7b-alpha-v1.27.2.25",
|
117 |
+
"model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
|
118 |
+
"system_instruction": null,
|
119 |
+
"system_instruction_sha": null,
|
120 |
+
"fewshot_as_multiturn": false,
|
121 |
+
"chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ bos_token + ' [INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + content.strip() + ' ' + eos_token }}{% endif %}{% endfor %}",
|
122 |
+
"chat_template_sha": "f1dff938141b507da4a409b6bb3431382088a97a963acd246a41f2f344ae831f",
|
123 |
+
"start_time": 1392959.193182268,
|
124 |
+
"end_time": 1393012.133225703,
|
125 |
+
"total_evaluation_time_seconds": "52.940043434966356"
|
126 |
+
}
|
evaluations/ar/Allam-7b-instruct-preview/araPro_0_shot.json
ADDED
@@ -0,0 +1,130 @@
|
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"araPro": {
|
4 |
+
"alias": "araPro",
|
5 |
+
"acc,none": 0.6970605878824235,
|
6 |
+
"acc_stderr,none": 0.006498724870364006,
|
7 |
+
"acc_norm,none": 0.6970605878824235,
|
8 |
+
"acc_norm_stderr,none": 0.006498724870364006
|
9 |
+
}
|
10 |
+
},
|
11 |
+
"group_subtasks": {
|
12 |
+
"araPro": []
|
13 |
+
},
|
14 |
+
"configs": {
|
15 |
+
"araPro": {
|
16 |
+
"task": "araPro",
|
17 |
+
"tag": [
|
18 |
+
"multiple_choice"
|
19 |
+
],
|
20 |
+
"dataset_path": "lm_eval/tasks/araPro/araPro.py",
|
21 |
+
"dataset_name": "araPro",
|
22 |
+
"dataset_kwargs": {
|
23 |
+
"trust_remote_code": true
|
24 |
+
},
|
25 |
+
"validation_split": "validation",
|
26 |
+
"test_split": "test",
|
27 |
+
"fewshot_split": "validation",
|
28 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.replace('.', '') if '.' in choice[:2] else choice\n \n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choice_num = ['choice1', 'choice2', 'choice3', 'choice4']\n choices = \"\".join(\n [f\"{key}. {remove_prefix(doc[choice_num[index]])}\\n\" for index, key in enumerate(keys)]\n )\n\n prompt = f\"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n #keys = [\"1\", \"2\", \"3\", \"4\"]\n keys = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys), \n \"choices\": keys,\n \"gold\": doc[\"answer\"]-1,\n } \n\n return out_doc\n \n return dataset.map(_process_docs)\n",
|
29 |
+
"doc_to_text": "query",
|
30 |
+
"doc_to_target": "gold",
|
31 |
+
"doc_to_choice": "{{choices}}",
|
32 |
+
"description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d",
|
33 |
+
"target_delimiter": " ",
|
34 |
+
"fewshot_delimiter": "\n\n",
|
35 |
+
"fewshot_config": {
|
36 |
+
"sampler": "balanced_cat"
|
37 |
+
},
|
38 |
+
"num_fewshot": 0,
|
39 |
+
"metric_list": [
|
40 |
+
{
|
41 |
+
"metric": "acc",
|
42 |
+
"aggregation": "mean",
|
43 |
+
"higher_is_better": true
|
44 |
+
},
|
45 |
+
{
|
46 |
+
"metric": "acc_norm",
|
47 |
+
"aggregation": "mean",
|
48 |
+
"higher_is_better": true
|
49 |
+
}
|
50 |
+
],
|
51 |
+
"output_type": "multiple_choice",
|
52 |
+
"repeats": 1,
|
53 |
+
"should_decontaminate": true,
|
54 |
+
"doc_to_decontamination_query": "Question",
|
55 |
+
"metadata": {
|
56 |
+
"version": 2.0
|
57 |
+
}
|
58 |
+
}
|
59 |
+
},
|
60 |
+
"versions": {
|
61 |
+
"araPro": 2.0
|
62 |
+
},
|
63 |
+
"n-shot": {
|
64 |
+
"araPro": 0
|
65 |
+
},
|
66 |
+
"higher_is_better": {
|
67 |
+
"araPro": {
|
68 |
+
"acc": true,
|
69 |
+
"acc_norm": true
|
70 |
+
}
|
71 |
+
},
|
72 |
+
"n-samples": {
|
73 |
+
"araPro": {
|
74 |
+
"original": 5001,
|
75 |
+
"effective": 5001
|
76 |
+
}
|
77 |
+
},
|
78 |
+
"config": {
|
79 |
+
"model": "hf",
|
80 |
+
"model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
|
81 |
+
"model_num_parameters": 7000559616,
|
82 |
+
"model_dtype": "torch.bfloat16",
|
83 |
+
"model_revision": "main",
|
84 |
+
"model_sha": "",
|
85 |
+
"batch_size": 1,
|
86 |
+
"batch_sizes": [],
|
87 |
+
"device": null,
|
88 |
+
"use_cache": null,
|
89 |
+
"limit": null,
|
90 |
+
"bootstrap_iters": 100000,
|
91 |
+
"gen_kwargs": null,
|
92 |
+
"random_seed": 0,
|
93 |
+
"numpy_seed": 1234,
|
94 |
+
"torch_seed": 1234,
|
95 |
+
"fewshot_seed": 1234
|
96 |
+
},
|
97 |
+
"git_hash": "b955b2950",
|
98 |
+
"date": 1739617164.0204737,
|
99 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma 
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
100 |
+
"transformers_version": "4.48.3",
|
101 |
+
"upper_git_hash": null,
|
102 |
+
"tokenizer_pad_token": [
|
103 |
+
"<unk>",
|
104 |
+
"0"
|
105 |
+
],
|
106 |
+
"tokenizer_eos_token": [
|
107 |
+
"</s>",
|
108 |
+
"2"
|
109 |
+
],
|
110 |
+
"tokenizer_bos_token": [
|
111 |
+
"<s>",
|
112 |
+
"1"
|
113 |
+
],
|
114 |
+
"eot_token_id": 2,
|
115 |
+
"max_length": 4096,
|
116 |
+
"task_hashes": {
|
117 |
+
"araPro": "01340c360a1565c46298c4c24dd3fdfe1ea614c6eef6e4d4f021f1da83da2584"
|
118 |
+
},
|
119 |
+
"model_source": "hf",
|
120 |
+
"model_name": "/tmp/7b-alpha-v1.27.2.25",
|
121 |
+
"model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
|
122 |
+
"system_instruction": null,
|
123 |
+
"system_instruction_sha": null,
|
124 |
+
"fewshot_as_multiturn": false,
|
125 |
+
"chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ bos_token + ' [INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + content.strip() + ' ' + eos_token }}{% endif %}{% endfor %}",
|
126 |
+
"chat_template_sha": "f1dff938141b507da4a409b6bb3431382088a97a963acd246a41f2f344ae831f",
|
127 |
+
"start_time": 1391853.516943726,
|
128 |
+
"end_time": 1392050.054185297,
|
129 |
+
"total_evaluation_time_seconds": "196.5372415711172"
|
130 |
+
}
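A quick consistency check on the araPro numbers above: the reported `acc_stderr,none` appears to match the standard error of a mean of 0/1 outcomes computed with the sample variance (dividing by `n - 1`). The sketch below recomputes it from `acc,none` and the `n-samples` count in this file. The helper name `mean_stderr` is an assumption made for illustration, and the formula is inferred from the numbers rather than taken from the harness source.

```python
import math


def mean_stderr(acc: float, n: int) -> float:
    """Standard error of the mean of n binary (0/1) scores.

    Assumes the sample variance (n - 1 in the denominator); this is an
    inference from the reported values, not a quote of the harness code.
    """
    return math.sqrt(acc * (1.0 - acc) / (n - 1))


# Values copied from the araPro results above.
acc = 0.6970605878824235
n_samples = 5001
reported_stderr = 0.006498724870364006

estimate = mean_stderr(acc, n_samples)
print(f"recomputed={estimate:.9f} reported={reported_stderr:.9f}")
```

If the two values agree to several decimal places, the reported accuracy, standard error, and sample count are mutually consistent.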
|
evaluations/ar/Allam-7b-instruct-preview/arabicmmlu_0_shot.json
ADDED
The diff for this file is too large to render.
See raw diff
|
|
evaluations/ar/Allam-7b-instruct-preview/etec_v2_0_shot.json
ADDED
@@ -0,0 +1,126 @@
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"etec_v2": {
|
4 |
+
"alias": "etec_v2",
|
5 |
+
"acc,none": 0.6666666666666666,
|
6 |
+
"acc_stderr,none": 0.010854826817097195,
|
7 |
+
"acc_norm,none": 0.6666666666666666,
|
8 |
+
"acc_norm_stderr,none": 0.010854826817097195
|
9 |
+
}
|
10 |
+
},
|
11 |
+
"group_subtasks": {
|
12 |
+
"etec_v2": []
|
13 |
+
},
|
14 |
+
"configs": {
|
15 |
+
"etec_v2": {
|
16 |
+
"task": "etec_v2",
|
17 |
+
"tag": [
|
18 |
+
"multiple_choice"
|
19 |
+
],
|
20 |
+
"dataset_path": "lm_eval/tasks/etec_v2/etec.py",
|
21 |
+
"dataset_name": "etec_v2",
|
22 |
+
"dataset_kwargs": {
|
23 |
+
"trust_remote_code": true
|
24 |
+
},
|
25 |
+
"validation_split": "validation",
|
26 |
+
"test_split": "test",
|
27 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n print(doc[\"label\"])\n keys_ar = [\"\u0623\", \"\u0628\", \"\u062c\", \"\u062f\"]\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": int(doc[\"label\"])-1,\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
|
28 |
+
"doc_to_text": "query",
|
29 |
+
"doc_to_target": "gold",
|
30 |
+
"doc_to_choice": "choices",
|
31 |
+
"description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d\n ",
|
32 |
+
"target_delimiter": " ",
|
33 |
+
"fewshot_delimiter": "\n\n",
|
34 |
+
"num_fewshot": 0,
|
35 |
+
"metric_list": [
|
36 |
+
{
|
37 |
+
"metric": "acc",
|
38 |
+
"aggregation": "mean",
|
39 |
+
"higher_is_better": true
|
40 |
+
},
|
41 |
+
{
|
42 |
+
"metric": "acc_norm",
|
43 |
+
"aggregation": "mean",
|
44 |
+
"higher_is_better": true
|
45 |
+
}
|
46 |
+
],
|
47 |
+
"output_type": "multiple_choice",
|
48 |
+
"repeats": 1,
|
49 |
+
"should_decontaminate": true,
|
50 |
+
"doc_to_decontamination_query": "query",
|
51 |
+
"metadata": {
|
52 |
+
"version": 0.0
|
53 |
+
}
|
54 |
+
}
|
55 |
+
},
|
56 |
+
"versions": {
|
57 |
+
"etec_v2": 0.0
|
58 |
+
},
|
59 |
+
"n-shot": {
|
60 |
+
"etec_v2": 0
|
61 |
+
},
|
62 |
+
"higher_is_better": {
|
63 |
+
"etec_v2": {
|
64 |
+
"acc": true,
|
65 |
+
"acc_norm": true
|
66 |
+
}
|
67 |
+
},
|
68 |
+
"n-samples": {
|
69 |
+
"etec_v2": {
|
70 |
+
"original": 1887,
|
71 |
+
"effective": 1887
|
72 |
+
}
|
73 |
+
},
|
74 |
+
"config": {
|
75 |
+
"model": "hf",
|
76 |
+
"model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
|
77 |
+
"model_num_parameters": 7000559616,
|
78 |
+
"model_dtype": "torch.bfloat16",
|
79 |
+
"model_revision": "main",
|
80 |
+
"model_sha": "",
|
81 |
+
"batch_size": 1,
|
82 |
+
"batch_sizes": [],
|
83 |
+
"device": null,
|
84 |
+
"use_cache": null,
|
85 |
+
"limit": null,
|
86 |
+
"bootstrap_iters": 100000,
|
87 |
+
"gen_kwargs": null,
|
88 |
+
"random_seed": 0,
|
89 |
+
"numpy_seed": 1234,
|
90 |
+
"torch_seed": 1234,
|
91 |
+
"fewshot_seed": 1234
|
92 |
+
},
|
93 |
+
"git_hash": "b955b2950",
|
94 |
+
"date": 1739617421.4265695,
|
95 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma 
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
96 |
+
"transformers_version": "4.48.3",
|
97 |
+
"upper_git_hash": null,
|
98 |
+
"tokenizer_pad_token": [
|
99 |
+
"<unk>",
|
100 |
+
"0"
|
101 |
+
],
|
102 |
+
"tokenizer_eos_token": [
|
103 |
+
"</s>",
|
104 |
+
"2"
|
105 |
+
],
|
106 |
+
"tokenizer_bos_token": [
|
107 |
+
"<s>",
|
108 |
+
"1"
|
109 |
+
],
|
110 |
+
"eot_token_id": 2,
|
111 |
+
"max_length": 4096,
|
112 |
+
"task_hashes": {
|
113 |
+
"etec_v2": "a0d87bf7eb82815b66ea544cb632aafb803526dee24b399f30fdc751be442b60"
|
114 |
+
},
|
115 |
+
"model_source": "hf",
|
116 |
+
"model_name": "/tmp/7b-alpha-v1.27.2.25",
|
117 |
+
"model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
|
118 |
+
"system_instruction": null,
|
119 |
+
"system_instruction_sha": null,
|
120 |
+
"fewshot_as_multiturn": false,
|
121 |
+
"chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ bos_token + ' [INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + content.strip() + ' ' + eos_token }}{% endif %}{% endfor %}",
|
122 |
+
"chat_template_sha": "f1dff938141b507da4a409b6bb3431382088a97a963acd246a41f2f344ae831f",
|
123 |
+
"start_time": 1392110.980523203,
|
124 |
+
"end_time": 1392198.883363127,
|
125 |
+
"total_evaluation_time_seconds": "87.90283992397599"
|
126 |
+
}
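To make the `process_docs` string recorded in the etec_v2 config easier to read, here is a minimal sketch of the same prompt construction applied to a single document. The sample document is hypothetical (the real dataset fields are only assumed to be `question`, `choices`, and a 1-based `label`, as the recorded code implies), and the debugging `print(doc["label"])` call from the recorded code is omitted.

```python
def format_example(doc: dict, keys: list[str]) -> str:
    """Build the Arabic question/choices/answer prompt for one document."""
    question = doc["question"].strip()
    choices = "".join(
        f"{key}. {choice}\n" for key, choice in zip(keys, doc["choices"])
    )
    return f"السؤال: {question}\n{choices}\nالاجابة:"


def process_doc(doc: dict) -> dict:
    """Mirror of the recorded _process_docs: Latin keys, 0-based gold index."""
    keys_en = ["A", "B", "C", "D"]
    return {
        "query": format_example(doc, keys_en),
        "choices": keys_en,
        "gold": int(doc["label"]) - 1,  # labels are assumed to be 1-based
    }


# Hypothetical sample document, for illustration only.
sample = {"question": " 2 + 2 = ? ", "choices": ["3", "4", "5", "6"], "label": "2"}
out = process_doc(sample)
print(out["query"])
print(out["gold"])
```

The model is then scored on which of the `choices` strings ("A" to "D") it ranks highest after the prompt, with `gold` as the index of the correct one.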
|
evaluations/ar/Allam-7b-instruct-preview/exams_ar_5_shot.json
ADDED
@@ -0,0 +1,121 @@
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"exams_ar": {
|
4 |
+
"alias": "exams_ar",
|
5 |
+
"acc,none": 0.515828677839851,
|
6 |
+
"acc_stderr,none": 0.021585885942816244,
|
7 |
+
"acc_norm,none": 0.515828677839851,
|
8 |
+
"acc_norm_stderr,none": 0.021585885942816244
|
9 |
+
}
|
10 |
+
},
|
11 |
+
"group_subtasks": {
|
12 |
+
"exams_ar": []
|
13 |
+
},
|
14 |
+
"configs": {
|
15 |
+
"exams_ar": {
|
16 |
+
"task": "exams_ar",
|
17 |
+
"tag": [
|
18 |
+
"multiple_choice"
|
19 |
+
],
|
20 |
+
"dataset_path": "lm_eval/tasks/exams_ar",
|
21 |
+
"dataset_name": "exams_ar",
|
22 |
+
"dataset_kwargs": {
|
23 |
+
"trust_remote_code": true
|
24 |
+
},
|
25 |
+
"test_split": "test",
|
26 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n\n def _process_docs(doc):\n def format_example(doc, keys):\n \"\"\"\n <prompt>\n \u0633\u0624\u0627\u0644:\n A. <choice1>\n B. <choice2>\n C. <choice3>\n D. <choice4>\n \u0627\u062c\u0627\u0628\u0629:\n \"\"\"\n \n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n def _format_subject(subject):\n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n\n keys = [\"A\", \"B\", \"C\", \"D\"]\n \n subject = doc['id'].split(\"-\")[0]\n description = f\"\ufed2\ufef4\ufee3\ufe8d \ufef2\ufee0\ufef3 \ufe84\ufeb4\ufe8c\ufedf\ufe93 \ufe8d\ufefc\ufea8\ufe98\ufef3\ufe8d\ufead \ufee2\ufee7 \ufee2\ufe98\ufecb\ufea9\ufea9 (\ufee2\ufecb \ufe8d\ufefa\ufe9f\ufe8e\ufe91\ufe8e\ufe97) \ufea1\ufeee\ufedf {_format_subject(subject)} \\n\" #\ufee2\ufee7 \ufed2\ufec0\ufee0\ufedb \ufe8e\ufea8\ufe97\ufead \ufe88\ufe9f\ufe8e\ufe91\ufe93 \ufeed\ufe8e\ufea3\ufea9\ufe93 \ufee2\ufee7 \ufe90\ufef4\ufee7 'A\u060c B\u060c C\u060c D' \ufea9\ufeee\ufee7 \ufeb5\ufeae\ufea3\\n\"\n\n out_doc = {\n \"idx\": doc[\"idx\"],\n \"id\": doc[\"id\"],\n 'dsecription': description,\n \"query\": format_example(doc, keys), # \"Question: \" + doc[\"question\"]['stem'] + \"\\nAnswer:\",\n \"choices\": keys,\n \"gold\": [\"A\", \"B\", \"C\", \"D\"].index(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_docs)\n",
|
27 |
+
"doc_to_text": "query",
|
28 |
+
"doc_to_target": "gold",
|
29 |
+
"doc_to_choice": "choices",
|
30 |
+
"description": "description",
|
31 |
+
"target_delimiter": " ",
|
32 |
+
"fewshot_delimiter": "\n\n",
|
33 |
+
"num_fewshot": 5,
|
34 |
+
"metric_list": [
|
35 |
+
{
|
36 |
+
"metric": "acc",
|
37 |
+
"aggregation": "mean",
|
38 |
+
"higher_is_better": true
|
39 |
+
},
|
40 |
+
{
|
41 |
+
"metric": "acc_norm",
|
42 |
+
"aggregation": "mean",
|
43 |
+
"higher_is_better": true
|
44 |
+
}
|
45 |
+
],
|
46 |
+
"output_type": "multiple_choice",
|
47 |
+
"repeats": 1,
|
48 |
+
"should_decontaminate": true,
|
49 |
+
"doc_to_decontamination_query": "query",
|
50 |
+
"metadata": {
|
51 |
+
"version": 0.0
|
52 |
+
}
|
53 |
+
}
|
54 |
+
},
|
55 |
+
"versions": {
|
56 |
+
"exams_ar": 0.0
|
57 |
+
},
|
58 |
+
"n-shot": {
|
59 |
+
"exams_ar": 5
|
60 |
+
},
|
61 |
+
"higher_is_better": {
|
62 |
+
"exams_ar": {
|
63 |
+
"acc": true,
|
64 |
+
"acc_norm": true
|
65 |
+
}
|
66 |
+
},
|
67 |
+
"n-samples": {
|
68 |
+
"exams_ar": {
|
69 |
+
"original": 537,
|
70 |
+
"effective": 537
|
71 |
+
}
|
72 |
+
},
|
73 |
+
"config": {
|
74 |
+
"model": "vllm",
|
75 |
+
"model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.8",
|
76 |
+
"batch_size": 1,
|
77 |
+
"batch_sizes": [],
|
78 |
+
"device": null,
|
79 |
+
"use_cache": null,
|
80 |
+
"limit": null,
|
81 |
+
"bootstrap_iters": 100000,
|
82 |
+
"gen_kwargs": null,
|
83 |
+
"random_seed": 0,
|
84 |
+
"numpy_seed": 1234,
|
85 |
+
"torch_seed": 1234,
|
86 |
+
"fewshot_seed": 1234
|
87 |
+
},
|
88 |
+
"git_hash": "8e1bd48d",
|
89 |
+
"date": 1735662207.0830526,
|
90 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single 
vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
91 |
+
"transformers_version": "4.47.1",
|
92 |
+
"upper_git_hash": null,
|
93 |
+
"tokenizer_pad_token": [
|
94 |
+
"<unk>",
|
95 |
+
"0"
|
96 |
+
],
|
97 |
+
"tokenizer_eos_token": [
|
98 |
+
"</s>",
|
99 |
+
"2"
|
100 |
+
],
|
101 |
+
"tokenizer_bos_token": [
|
102 |
+
"<s>",
|
103 |
+
"1"
|
104 |
+
],
|
105 |
+
"eot_token_id": 2,
|
106 |
+
"max_length": 4096,
|
107 |
+
"task_hashes": {
|
108 |
+
"exams_ar": "b1561abd56354d570ac16bf64163b0ee8dc6c507234b05f678576b09c26c644a"
|
109 |
+
},
|
110 |
+
"model_source": "vllm",
|
111 |
+
"model_name": "/tmp/7b-alpha-v1.27.2.25",
|
112 |
+
"model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
|
113 |
+
"system_instruction": null,
|
114 |
+
"system_instruction_sha": null,
|
115 |
+
"fewshot_as_multiturn": false,
|
116 |
+
"chat_template": null,
|
117 |
+
"chat_template_sha": null,
|
118 |
+
"start_time": 2867.397536365,
|
119 |
+
"end_time": 2948.510496752,
|
120 |
+
"total_evaluation_time_seconds": "81.11296038699993"
|
121 |
+
}
|
evaluations/ar/Allam-7b-instruct-preview/gat_0_shot.json
ADDED
@@ -0,0 +1,549 @@
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"gat": {
|
4 |
+
"acc,none": 0.4452527279568544,
|
5 |
+
"acc_stderr,none": 0.0038711388833064567,
|
6 |
+
"alias": "gat"
|
7 |
+
},
|
8 |
+
"gat_algebra": {
|
9 |
+
"alias": " - gat_algebra",
|
10 |
+
"acc,none": 0.40667903525046384,
|
11 |
+
"acc_stderr,none": 0.009463939247454995
|
12 |
+
},
|
13 |
+
"gat_analogy": {
|
14 |
+
"alias": " - gat_analogy",
|
15 |
+
"acc,none": 0.35919854280510016,
|
16 |
+
"acc_stderr,none": 0.009158766245747282
|
17 |
+
},
|
18 |
+
"gat_arithmetic": {
|
19 |
+
"alias": " - gat_arithmetic",
|
20 |
+
"acc,none": 0.40154582259845417,
|
21 |
+
"acc_stderr,none": 0.009406284814832203
|
22 |
+
},
|
23 |
+
"gat_association": {
|
24 |
+
"alias": " - gat_association",
|
25 |
+
"acc,none": 0.5464114832535886,
|
26 |
+
"acc_stderr,none": 0.015407801869520031
|
27 |
+
},
|
28 |
+
"gat_comparisons": {
|
29 |
+
"alias": " - gat_comparisons",
|
30 |
+
"acc,none": 0.34508196721311474,
|
31 |
+
"acc_stderr,none": 0.013616100682624904
|
32 |
+
},
|
33 |
+
"gat_completion": {
|
34 |
+
"alias": " - gat_completion",
|
35 |
+
"acc,none": 0.6057851239669422,
|
36 |
+
"acc_stderr,none": 0.014054411207805699
|
37 |
+
},
|
38 |
+
"gat_contextual": {
|
39 |
+
"alias": " - gat_contextual",
|
40 |
+
"acc,none": 0.3941717791411043,
|
41 |
+
"acc_stderr,none": 0.013537713096332765
|
42 |
+
},
|
43 |
+
"gat_geometry": {
|
44 |
+
"alias": " - gat_geometry",
|
45 |
+
"acc,none": 0.473972602739726,
|
46 |
+
"acc_stderr,none": 0.026171590093068537
|
47 |
+
},
|
48 |
+
"gat_reading": {
|
49 |
+
"alias": " - gat_reading",
|
50 |
+
"acc,none": 0.5727788279773157,
|
51 |
+
"acc_stderr,none": 0.009620311542503682
|
52 |
+
}
|
53 |
+
},
|
54 |
+
"groups": {
|
55 |
+
"gat": {
|
56 |
+
"acc,none": 0.4452527279568544,
|
57 |
+
"acc_stderr,none": 0.0038711388833064567,
|
58 |
+
"alias": "gat"
|
59 |
+
}
|
60 |
+
},
|
61 |
+
"group_subtasks": {
|
62 |
+
"gat": [
|
63 |
+
"gat_analogy",
|
64 |
+
"gat_association",
|
65 |
+
"gat_completion",
|
66 |
+
"gat_reading",
|
67 |
+
"gat_algebra",
|
68 |
+
"gat_arithmetic",
|
69 |
+
"gat_comparisons",
|
70 |
+
"gat_contextual",
|
71 |
+
"gat_geometry"
|
72 |
+
]
|
73 |
+
},
|
74 |
+
"configs": {
|
75 |
+
"gat_algebra": {
|
76 |
+
"task": "gat_algebra",
|
77 |
+
"dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
|
78 |
+
"dataset_name": "algebra",
|
79 |
+
"dataset_kwargs": {
|
80 |
+
"trust_remote_code": true
|
81 |
+
},
|
82 |
+
"test_split": "test",
|
83 |
+
"fewshot_split": "validation",
|
84 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
|
85 |
+
"doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
|
86 |
+
"doc_to_target": "{{label}}",
|
87 |
+
"doc_to_choice": [
|
88 |
+
"\u0623",
|
89 |
+
"\u0628",
|
90 |
+
"\u062c",
|
91 |
+
"\u062f"
|
92 |
+
],
|
93 |
+
"description": "",
|
94 |
+
"target_delimiter": " ",
|
95 |
+
"fewshot_delimiter": "\n\n",
|
96 |
+
"num_fewshot": 0,
|
97 |
+
"metric_list": [
|
98 |
+
{
|
99 |
+
"metric": "acc",
|
100 |
+
"aggregation": "mean",
|
101 |
+
"higher_is_better": true
|
102 |
+
}
|
103 |
+
],
|
104 |
+
"output_type": "multiple_choice",
|
105 |
+
"repeats": 1,
|
106 |
+
"should_decontaminate": false,
|
107 |
+
"metadata": {
|
108 |
+
"version": 0.0
|
109 |
+
}
|
110 |
+
},
|
111 |
+
"gat_analogy": {
|
112 |
+
"task": "gat_analogy",
|
113 |
+
"dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
|
114 |
+
"dataset_name": "analogy",
|
115 |
+
"dataset_kwargs": {
|
116 |
+
"trust_remote_code": true
|
117 |
+
},
|
118 |
+
"test_split": "test",
|
119 |
+
"fewshot_split": "validation",
|
120 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
|
121 |
+
"doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
|
122 |
+
"doc_to_target": "{{label}}",
|
123 |
+
"doc_to_choice": [
|
124 |
+
"\u0623",
|
125 |
+
"\u0628",
|
126 |
+
"\u062c",
|
127 |
+
"\u062f"
|
128 |
+
],
|
129 |
+
"description": "",
|
130 |
+
"target_delimiter": " ",
|
131 |
+
"fewshot_delimiter": "\n\n",
|
132 |
+
"num_fewshot": 0,
|
133 |
+
"metric_list": [
|
134 |
+
{
|
135 |
+
"metric": "acc",
|
136 |
+
"aggregation": "mean",
|
137 |
+
"higher_is_better": true
|
138 |
+
}
|
139 |
+
],
|
140 |
+
"output_type": "multiple_choice",
|
141 |
+
"repeats": 1,
|
142 |
+
"should_decontaminate": false,
|
143 |
+
"metadata": {
|
144 |
+
"version": 0.0
|
145 |
+
}
|
146 |
+
},
|
147 |
+
"gat_arithmetic": {
|
148 |
+
"task": "gat_arithmetic",
|
149 |
+
"dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
|
150 |
+
"dataset_name": "arithmetic",
|
151 |
+
"dataset_kwargs": {
|
152 |
+
"trust_remote_code": true
|
153 |
+
},
|
154 |
+
"test_split": "test",
|
155 |
+
"fewshot_split": "validation",
|
156 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
|
157 |
+
"doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
|
158 |
+
"doc_to_target": "{{label}}",
|
159 |
+
"doc_to_choice": [
|
160 |
+
"\u0623",
|
161 |
+
"\u0628",
|
162 |
+
"\u062c",
|
163 |
+
"\u062f"
|
164 |
+
],
|
165 |
+
"description": "",
|
166 |
+
"target_delimiter": " ",
|
167 |
+
"fewshot_delimiter": "\n\n",
|
168 |
+
"num_fewshot": 0,
|
169 |
+
"metric_list": [
|
170 |
+
{
|
171 |
+
"metric": "acc",
|
172 |
+
"aggregation": "mean",
|
173 |
+
"higher_is_better": true
|
174 |
+
}
|
175 |
+
],
|
176 |
+
"output_type": "multiple_choice",
|
177 |
+
"repeats": 1,
|
178 |
+
"should_decontaminate": false,
|
179 |
+
"metadata": {
|
180 |
+
"version": 0.0
|
181 |
+
}
|
182 |
+
},
|
183 |
+
"gat_association": {
|
184 |
+
"task": "gat_association",
|
185 |
+
"dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
|
186 |
+
"dataset_name": "association",
|
187 |
+
"dataset_kwargs": {
|
188 |
+
"trust_remote_code": true
|
189 |
+
},
|
190 |
+
"test_split": "test",
|
191 |
+
"fewshot_split": "validation",
|
192 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
|
193 |
+
"doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
|
194 |
+
"doc_to_target": "{{label}}",
|
195 |
+
"doc_to_choice": [
|
196 |
+
"\u0623",
|
197 |
+
"\u0628",
|
198 |
+
"\u062c",
|
199 |
+
"\u062f"
|
200 |
+
],
|
201 |
+
"description": "",
|
202 |
+
"target_delimiter": " ",
|
203 |
+
"fewshot_delimiter": "\n\n",
|
204 |
+
"num_fewshot": 0,
|
205 |
+
"metric_list": [
|
206 |
+
{
|
207 |
+
"metric": "acc",
|
208 |
+
"aggregation": "mean",
|
209 |
+
"higher_is_better": true
|
210 |
+
}
|
211 |
+
],
|
212 |
+
"output_type": "multiple_choice",
|
213 |
+
"repeats": 1,
|
214 |
+
"should_decontaminate": false,
|
215 |
+
"metadata": {
|
216 |
+
"version": 0.0
|
217 |
+
}
|
218 |
+
},
|
219 |
+
"gat_comparisons": {
|
220 |
+
"task": "gat_comparisons",
|
221 |
+
"dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
|
222 |
+
"dataset_name": "comparisons",
|
223 |
+
"dataset_kwargs": {
|
224 |
+
"trust_remote_code": true
|
225 |
+
},
|
226 |
+
"test_split": "test",
|
227 |
+
"fewshot_split": "validation",
|
228 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
|
229 |
+
"doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
|
230 |
+
"doc_to_target": "{{label}}",
|
231 |
+
"doc_to_choice": [
|
232 |
+
"\u0623",
|
233 |
+
"\u0628",
|
234 |
+
"\u062c",
|
235 |
+
"\u062f"
|
236 |
+
],
|
237 |
+
"description": "",
|
238 |
+
"target_delimiter": " ",
|
239 |
+
"fewshot_delimiter": "\n\n",
|
240 |
+
"num_fewshot": 0,
|
241 |
+
"metric_list": [
|
242 |
+
{
|
243 |
+
"metric": "acc",
|
244 |
+
"aggregation": "mean",
|
245 |
+
"higher_is_better": true
|
246 |
+
}
|
247 |
+
],
|
248 |
+
"output_type": "multiple_choice",
|
249 |
+
"repeats": 1,
|
250 |
+
"should_decontaminate": false,
|
251 |
+
"metadata": {
|
252 |
+
"version": 0.0
|
253 |
+
}
|
254 |
+
},
|
255 |
+
"gat_completion": {
|
256 |
+
"task": "gat_completion",
|
257 |
+
"dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
|
258 |
+
"dataset_name": "completion",
|
259 |
+
"dataset_kwargs": {
|
260 |
+
"trust_remote_code": true
|
261 |
+
},
|
262 |
+
"test_split": "test",
|
263 |
+
"fewshot_split": "validation",
|
264 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
|
265 |
+
"doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
|
266 |
+
"doc_to_target": "{{label}}",
|
267 |
+
"doc_to_choice": [
|
268 |
+
"\u0623",
|
269 |
+
"\u0628",
|
270 |
+
"\u062c",
|
271 |
+
"\u062f"
|
272 |
+
],
|
273 |
+
"description": "",
|
274 |
+
"target_delimiter": " ",
|
275 |
+
"fewshot_delimiter": "\n\n",
|
276 |
+
"num_fewshot": 0,
|
277 |
+
"metric_list": [
|
278 |
+
{
|
279 |
+
"metric": "acc",
|
280 |
+
"aggregation": "mean",
|
281 |
+
"higher_is_better": true
|
282 |
+
}
|
283 |
+
],
|
284 |
+
"output_type": "multiple_choice",
|
285 |
+
"repeats": 1,
|
286 |
+
"should_decontaminate": false,
|
287 |
+
"metadata": {
|
288 |
+
"version": 0.0
|
289 |
+
}
|
290 |
+
},
|
291 |
+
"gat_contextual": {
|
292 |
+
"task": "gat_contextual",
|
293 |
+
"dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
|
294 |
+
"dataset_name": "contextual",
|
295 |
+
"dataset_kwargs": {
|
296 |
+
"trust_remote_code": true
|
297 |
+
},
|
298 |
+
"test_split": "test",
|
299 |
+
"fewshot_split": "validation",
|
300 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
|
301 |
+
"doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
|
302 |
+
"doc_to_target": "{{label}}",
|
303 |
+
"doc_to_choice": [
|
304 |
+
"\u0623",
|
305 |
+
"\u0628",
|
306 |
+
"\u062c",
|
307 |
+
"\u062f"
|
308 |
+
],
|
309 |
+
"description": "\u0627\u0648\u062c\u062f \u0627\u0644\u062e\u0637\u0623 \u0627\u0644\u0633\u064a\u0627\u0642\u064a \u0641\u064a \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0627\u0644\u062a\u0627\u0644\u064a\u0629 \u0645\u0646 \u0628\u064a\u0646 \u0627\u0644\u062e\u064a\u0627\u0631\u0627\u062a:",
|
310 |
+
"target_delimiter": " ",
|
311 |
+
"fewshot_delimiter": "\n\n",
|
312 |
+
"num_fewshot": 0,
|
313 |
+
"metric_list": [
|
314 |
+
{
|
315 |
+
"metric": "acc",
|
316 |
+
"aggregation": "mean",
|
317 |
+
"higher_is_better": true
|
318 |
+
}
|
319 |
+
],
|
320 |
+
"output_type": "multiple_choice",
|
321 |
+
"repeats": 1,
|
322 |
+
"should_decontaminate": false,
|
323 |
+
"metadata": {
|
324 |
+
"version": 0.0
|
325 |
+
}
|
326 |
+
},
|
327 |
+
"gat_geometry": {
|
328 |
+
"task": "gat_geometry",
|
329 |
+
"dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
|
330 |
+
"dataset_name": "geometry",
|
331 |
+
"dataset_kwargs": {
|
332 |
+
"trust_remote_code": true
|
333 |
+
},
|
334 |
+
"test_split": "test",
|
335 |
+
"fewshot_split": "validation",
|
336 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
|
337 |
+
"doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
|
338 |
+
"doc_to_target": "{{label}}",
|
339 |
+
"doc_to_choice": [
|
340 |
+
"\u0623",
|
341 |
+
"\u0628",
|
342 |
+
"\u062c",
|
343 |
+
"\u062f"
|
344 |
+
],
|
345 |
+
"description": "",
|
346 |
+
"target_delimiter": " ",
|
347 |
+
"fewshot_delimiter": "\n\n",
|
348 |
+
"num_fewshot": 0,
|
349 |
+
"metric_list": [
|
350 |
+
{
|
351 |
+
"metric": "acc",
|
352 |
+
"aggregation": "mean",
|
353 |
+
"higher_is_better": true
|
354 |
+
}
|
355 |
+
],
|
356 |
+
"output_type": "multiple_choice",
|
357 |
+
"repeats": 1,
|
358 |
+
"should_decontaminate": false,
|
359 |
+
"metadata": {
|
360 |
+
"version": 0.0
|
361 |
+
}
|
362 |
+
},
|
363 |
+
"gat_reading": {
|
364 |
+
"task": "gat_reading",
|
365 |
+
"dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
|
366 |
+
"dataset_name": "reading",
|
367 |
+
"dataset_kwargs": {
|
368 |
+
"trust_remote_code": true
|
369 |
+
},
|
370 |
+
"test_split": "test",
|
371 |
+
"fewshot_split": "validation",
|
372 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
|
373 |
+
"doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
|
374 |
+
"doc_to_target": "{{label}}",
|
375 |
+
"doc_to_choice": [
|
376 |
+
"\u0623",
|
377 |
+
"\u0628",
|
378 |
+
"\u062c",
|
379 |
+
"\u062f"
|
380 |
+
],
|
381 |
+
"description": "",
|
382 |
+
"target_delimiter": " ",
|
383 |
+
"fewshot_delimiter": "\n\n",
|
384 |
+
"num_fewshot": 0,
|
385 |
+
"metric_list": [
|
386 |
+
{
|
387 |
+
"metric": "acc",
|
388 |
+
"aggregation": "mean",
|
389 |
+
"higher_is_better": true
|
390 |
+
}
|
391 |
+
],
|
392 |
+
"output_type": "multiple_choice",
|
393 |
+
"repeats": 1,
|
394 |
+
"should_decontaminate": false,
|
395 |
+
"metadata": {
|
396 |
+
"version": 0.0
|
397 |
+
}
|
398 |
+
}
|
399 |
+
},
|
400 |
+
"versions": {
|
401 |
+
"gat": 0,
|
402 |
+
"gat_algebra": 0.0,
|
403 |
+
"gat_analogy": 0.0,
|
404 |
+
"gat_arithmetic": 0.0,
|
405 |
+
"gat_association": 0.0,
|
406 |
+
"gat_comparisons": 0.0,
|
407 |
+
"gat_completion": 0.0,
|
408 |
+
"gat_contextual": 0.0,
|
409 |
+
"gat_geometry": 0.0,
|
410 |
+
"gat_reading": 0.0
|
411 |
+
},
|
412 |
+
"n-shot": {
|
413 |
+
"gat_algebra": 0,
|
414 |
+
"gat_analogy": 0,
|
415 |
+
"gat_arithmetic": 0,
|
416 |
+
"gat_association": 0,
|
417 |
+
"gat_comparisons": 0,
|
418 |
+
"gat_completion": 0,
|
419 |
+
"gat_contextual": 0,
|
420 |
+
"gat_geometry": 0,
|
421 |
+
"gat_reading": 0
|
422 |
+
},
|
423 |
+
"higher_is_better": {
|
424 |
+
"gat": {
|
425 |
+
"acc": true
|
426 |
+
},
|
427 |
+
"gat_algebra": {
|
428 |
+
"acc": true
|
429 |
+
},
|
430 |
+
"gat_analogy": {
|
431 |
+
"acc": true
|
432 |
+
},
|
433 |
+
"gat_arithmetic": {
|
434 |
+
"acc": true
|
435 |
+
},
|
436 |
+
"gat_association": {
|
437 |
+
"acc": true
|
438 |
+
},
|
439 |
+
"gat_comparisons": {
|
440 |
+
"acc": true
|
441 |
+
},
|
442 |
+
"gat_completion": {
|
443 |
+
"acc": true
|
444 |
+
},
|
445 |
+
"gat_contextual": {
|
446 |
+
"acc": true
|
447 |
+
},
|
448 |
+
"gat_geometry": {
|
449 |
+
"acc": true
|
450 |
+
},
|
451 |
+
"gat_reading": {
|
452 |
+
"acc": true
|
453 |
+
}
|
454 |
+
},
|
455 |
+
"n-samples": {
|
456 |
+
"gat_analogy": {
|
457 |
+
"original": 2745,
|
458 |
+
"effective": 2745
|
459 |
+
},
|
460 |
+
"gat_association": {
|
461 |
+
"original": 1045,
|
462 |
+
"effective": 1045
|
463 |
+
},
|
464 |
+
"gat_completion": {
|
465 |
+
"original": 1210,
|
466 |
+
"effective": 1210
|
467 |
+
},
|
468 |
+
"gat_reading": {
|
469 |
+
"original": 2645,
|
470 |
+
"effective": 2645
|
471 |
+
},
|
472 |
+
"gat_algebra": {
|
473 |
+
"original": 2695,
|
474 |
+
"effective": 2695
|
475 |
+
},
|
476 |
+
"gat_arithmetic": {
|
477 |
+
"original": 2717,
|
478 |
+
"effective": 2717
|
479 |
+
},
|
480 |
+
"gat_comparisons": {
|
481 |
+
"original": 1220,
|
482 |
+
"effective": 1220
|
483 |
+
},
|
484 |
+
"gat_contextual": {
|
485 |
+
"original": 1304,
|
486 |
+
"effective": 1304
|
487 |
+
},
|
488 |
+
"gat_geometry": {
|
489 |
+
"original": 365,
|
490 |
+
"effective": 365
|
491 |
+
}
|
492 |
+
},
|
493 |
+
"config": {
|
494 |
+
"model": "vllm",
|
495 |
+
"model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,tensor_parallel_size=1,data_parallel_size=2,gpu_memory_utilization=0.8",
|
496 |
+
"batch_size": 1,
|
497 |
+
"batch_sizes": [],
|
498 |
+
"device": null,
|
499 |
+
"use_cache": null,
|
500 |
+
"limit": null,
|
501 |
+
"bootstrap_iters": 100000,
|
502 |
+
"gen_kwargs": null,
|
503 |
+
"random_seed": 0,
|
504 |
+
"numpy_seed": 1234,
|
505 |
+
"torch_seed": 1234,
|
506 |
+
"fewshot_seed": 1234
|
507 |
+
},
|
508 |
+
"git_hash": "8e1bd48d",
|
509 |
+
"date": 1735664096.2650902,
|
510 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100 80GB PCIe\nGPU 1: NVIDIA A100 80GB PCIe\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 48\nOn-line CPU(s) list: 0-47\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V13 64-Core Processor\nCPU family: 25\nModel: 1\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 1\nStepping: 1\nBogoMIPS: 4890.87\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single 
vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr rdpru arat umip vaes vpclmulqdq rdpid fsrm\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 1.5 MiB (48 instances)\nL1i cache: 1.5 MiB (48 instances)\nL2 cache: 24 MiB (48 instances)\nL3 cache: 192 MiB (6 instances)\nNUMA node(s): 2\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Not affected\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Vulnerable\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
511 |
+
"transformers_version": "4.47.1",
|
512 |
+
"upper_git_hash": null,
|
513 |
+
"tokenizer_pad_token": [
|
514 |
+
"<unk>",
|
515 |
+
"0"
|
516 |
+
],
|
517 |
+
"tokenizer_eos_token": [
|
518 |
+
"</s>",
|
519 |
+
"2"
|
520 |
+
],
|
521 |
+
"tokenizer_bos_token": [
|
522 |
+
"<s>",
|
523 |
+
"1"
|
524 |
+
],
|
525 |
+
"eot_token_id": 2,
|
526 |
+
"max_length": 4096,
|
527 |
+
"task_hashes": {
|
528 |
+
"gat_analogy": "ede28dec097bfebe8a85a19fa27d001696858276df66254bdb70fc63231f1a83",
|
529 |
+
"gat_association": "5d82550d46c4f3cabf370185a8a23cc2eb5b08f1f0c5e210a8a712562a44bd08",
|
530 |
+
"gat_completion": "fc3c19dd7f1896696fec1bffc21182804c9b2f1fb8d8c882428a6bb4bb61e370",
|
531 |
+
"gat_reading": "93053b187a750d2e87f5488f2d0fda944f3da9195bb04d1c4dee9c4b56fa626a",
|
532 |
+
"gat_algebra": "77832c595eaaf156775c3dbb27da0915ef600ebf46a7113ae32a202b0359e8a6",
|
533 |
+
"gat_arithmetic": "6a498f75f5cc0ffd1b30f7a6293ba80d08f2a8876d5558d8e934bf57355ff0cc",
|
534 |
+
"gat_comparisons": "acb80c0ed8dd07e916a471189aef3a546efc289824b2cc50a32c11dc4c97c9c1",
|
535 |
+
"gat_contextual": "de063ed3b94011d74ee24a6532122c9d344fc15e42800db44f0849995a0bc37a",
|
536 |
+
"gat_geometry": "3e482885559a4404ee9e97556edc6e49959770a499f4ae2c58f18ad85b91a363"
|
537 |
+
},
|
538 |
+
"model_source": "vllm",
|
539 |
+
"model_name": "/tmp/7b-alpha-v1.27.2.25",
|
540 |
+
"model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
|
541 |
+
"system_instruction": null,
|
542 |
+
"system_instruction_sha": null,
|
543 |
+
"fewshot_as_multiturn": false,
|
544 |
+
"chat_template": null,
|
545 |
+
"chat_template_sha": null,
|
546 |
+
"start_time": 4756.376698655,
|
547 |
+
"end_time": 5124.76942052,
|
548 |
+
"total_evaluation_time_seconds": "368.39272186499966"
|
549 |
+
}
|
evaluations/ar/Allam-7b-instruct-preview/moe_ien_mcq_0_shot.json
ADDED
@@ -0,0 +1,127 @@
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"moe_ien_mcq": {
|
4 |
+
"alias": "moe_ien_mcq",
|
5 |
+
"acc,none": 0.9177177177177177,
|
6 |
+
"acc_stderr,none": 0.002749455634736978,
|
7 |
+
"acc_norm,none": 0.9177177177177177,
|
8 |
+
"acc_norm_stderr,none": 0.002749455634736978
|
9 |
+
}
|
10 |
+
},
|
11 |
+
"group_subtasks": {
|
12 |
+
"moe_ien_mcq": []
|
13 |
+
},
|
14 |
+
"configs": {
|
15 |
+
"moe_ien_mcq": {
|
16 |
+
"task": "moe_ien_mcq",
|
17 |
+
"dataset_path": "lm_eval/tasks/moe_ien_mcq/ien_moe_mcq.py",
|
18 |
+
"dataset_name": "moe_ien_mcq",
|
19 |
+
"dataset_kwargs": {
|
20 |
+
"trust_remote_code": true
|
21 |
+
},
|
22 |
+
"validation_split": "validation",
|
23 |
+
"test_split": "test",
|
24 |
+
"fewshot_split": "validation",
|
25 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.split(\". \", 1)[1] if \". \" in choice else choice\n\n def format_example(doc, keys):\n question = doc[\"Question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"Choices\"])]\n \n )\n prompt = f\"\\n\\n\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys = [\"A\", \"B\", \"C\", \"D\", \"E\", \"F\"][0:len(doc[\"Choices\"])]\n out_doc = {\n \"Query\": format_example(doc, keys), \n \"Choices\": keys,\n \"gold\": int(doc[\"Answer\"])-1, ## \n } \n return out_doc\n \n return dataset.map(_process_docs)\n",
|
26 |
+
"doc_to_text": "Query",
|
27 |
+
"doc_to_target": "gold",
|
28 |
+
"doc_to_choice": "{{Choices}}",
|
29 |
+
"description": "\u0641\u064a\u0645\u0627\u202f\u064a\u0644\u064a\u202f\u0623\u0633\u0626\u0644\u0629\u202f\u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631\u202f\u0645\u0646\u202f\u0645\u062a\u0639\u062f\u062f\u202f(\u0645\u0639\u202f\u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a)\u202f\u0641\u064a\u202f{{Subject}}",
|
30 |
+
"target_delimiter": " ",
|
31 |
+
"fewshot_delimiter": "\n\n",
|
32 |
+
"fewshot_config": {
|
33 |
+
"sampler": "balanced_cat"
|
34 |
+
},
|
35 |
+
"num_fewshot": 0,
|
36 |
+
"metric_list": [
|
37 |
+
{
|
38 |
+
"metric": "acc",
|
39 |
+
"aggregation": "mean",
|
40 |
+
"higher_is_better": true
|
41 |
+
},
|
42 |
+
{
|
43 |
+
"metric": "acc_norm",
|
44 |
+
"aggregation": "mean",
|
45 |
+
"higher_is_better": true
|
46 |
+
}
|
47 |
+
],
|
48 |
+
"output_type": "multiple_choice",
|
49 |
+
"repeats": 1,
|
50 |
+
"should_decontaminate": true,
|
51 |
+
"doc_to_decontamination_query": "Query",
|
52 |
+
"metadata": {
|
53 |
+
"version": 0.0
|
54 |
+
}
|
55 |
+
}
|
56 |
+
},
|
57 |
+
"versions": {
|
58 |
+
"moe_ien_mcq": 0.0
|
59 |
+
},
|
60 |
+
"n-shot": {
|
61 |
+
"moe_ien_mcq": 0
|
62 |
+
},
|
63 |
+
"higher_is_better": {
|
64 |
+
"moe_ien_mcq": {
|
65 |
+
"acc": true,
|
66 |
+
"acc_norm": true
|
67 |
+
}
|
68 |
+
},
|
69 |
+
"n-samples": {
|
70 |
+
"moe_ien_mcq": {
|
71 |
+
"original": 9990,
|
72 |
+
"effective": 9990
|
73 |
+
}
|
74 |
+
},
|
75 |
+
"config": {
|
76 |
+
"model": "hf",
|
77 |
+
"model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
|
78 |
+
"model_num_parameters": 7000559616,
|
79 |
+
"model_dtype": "torch.bfloat16",
|
80 |
+
"model_revision": "main",
|
81 |
+
"model_sha": "",
|
82 |
+
"batch_size": 1,
|
83 |
+
"batch_sizes": [],
|
84 |
+
"device": null,
|
85 |
+
"use_cache": null,
|
86 |
+
"limit": null,
|
87 |
+
"bootstrap_iters": 100000,
|
88 |
+
"gen_kwargs": null,
|
89 |
+
"random_seed": 0,
|
90 |
+
"numpy_seed": 1234,
|
91 |
+
"torch_seed": 1234,
|
92 |
+
"fewshot_seed": 1234
|
93 |
+
},
|
94 |
+
"git_hash": "b955b2950",
|
95 |
+
"date": 1739617571.8184838,
|
96 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma 
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
97 |
+
"transformers_version": "4.48.3",
|
98 |
+
"upper_git_hash": null,
|
99 |
+
"tokenizer_pad_token": [
|
100 |
+
"<unk>",
|
101 |
+
"0"
|
102 |
+
],
|
103 |
+
"tokenizer_eos_token": [
|
104 |
+
"</s>",
|
105 |
+
"2"
|
106 |
+
],
|
107 |
+
"tokenizer_bos_token": [
|
108 |
+
"<s>",
|
109 |
+
"1"
|
110 |
+
],
|
111 |
+
"eot_token_id": 2,
|
112 |
+
"max_length": 4096,
|
113 |
+
"task_hashes": {
|
114 |
+
"moe_ien_mcq": "504533b140426f12c89d975ef421328fc89d69af8719c420a1bf897ed4724191"
|
115 |
+
},
|
116 |
+
"model_source": "hf",
|
117 |
+
"model_name": "/tmp/7b-alpha-v1.27.2.25",
|
118 |
+
"model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
|
119 |
+
"system_instruction": null,
|
120 |
+
"system_instruction_sha": null,
|
121 |
+
"fewshot_as_multiturn": false,
|
122 |
+
"chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ bos_token + ' [INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + content.strip() + ' ' + eos_token }}{% endif %}{% endfor %}",
|
123 |
+
"chat_template_sha": "f1dff938141b507da4a409b6bb3431382088a97a963acd246a41f2f344ae831f",
|
124 |
+
"start_time": 1392261.292633723,
|
125 |
+
"end_time": 1392626.942167409,
|
126 |
+
"total_evaluation_time_seconds": "365.64953368599527"
|
127 |
+
}
|
evaluations/ar/Allam-7b-instruct-preview/moe_ien_tf_0_shot.json
ADDED
@@ -0,0 +1,129 @@
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"moe_ien_tf": {
|
4 |
+
"alias": "moe_ien_tf",
|
5 |
+
"acc,none": 0.8294693456980937,
|
6 |
+
"acc_stderr,none": 0.004929073554117403,
|
7 |
+
"acc_norm,none": 0.8294693456980937,
|
8 |
+
"acc_norm_stderr,none": 0.004929073554117403
|
9 |
+
}
|
10 |
+
},
|
11 |
+
"group_subtasks": {
|
12 |
+
"moe_ien_tf": []
|
13 |
+
},
|
14 |
+
"configs": {
|
15 |
+
"moe_ien_tf": {
|
16 |
+
"task": "moe_ien_tf",
|
17 |
+
"tag": [
|
18 |
+
"multiple_choice"
|
19 |
+
],
|
20 |
+
"dataset_path": "lm_eval/tasks/moe_ien_tf/moe_ien_tf.py",
|
21 |
+
"dataset_name": "moe_ien_tf",
|
22 |
+
"dataset_kwargs": {
|
23 |
+
"trust_remote_code": true
|
24 |
+
},
|
25 |
+
"validation_split": "validation",
|
26 |
+
"test_split": "test",
|
27 |
+
"fewshot_split": "validation",
|
28 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n keys=[\"\u0635\u062d\u064a\u062d\u0629\",\n \"\u062e\u0627\u0637\u0626\u0629\"\n ]\n #keys =[\"\u0635\u0648\u0627\u0628\",\n # \"\u062e\u0637\u0623\"]\n target_key = int(doc[\"Answer\"])-1\n\n out_doc = {\n \"query\": \"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" +doc[\"Question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\", \n \"choices\": keys,\n \"gold\": target_key,\n }\n return out_doc\n return dataset.map(_process_docs)\n",
|
29 |
+
"doc_to_text": "query",
|
30 |
+
"doc_to_target": "gold",
|
31 |
+
"doc_to_choice": "choices",
|
32 |
+
"description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{Subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d\u064a\u062d\u0629' \u0623\u0648 '\u062e\u0627\u0637\u0626\u0629' \u062f\u0648\u0646 \u0634\u0631\u062d ",
|
33 |
+
"target_delimiter": " ",
|
34 |
+
"fewshot_delimiter": "\n\n",
|
35 |
+
"fewshot_config": {
|
36 |
+
"sampler": "balanced_cat"
|
37 |
+
},
|
38 |
+
"num_fewshot": 0,
|
39 |
+
"metric_list": [
|
40 |
+
{
|
41 |
+
"metric": "acc",
|
42 |
+
"aggregation": "mean",
|
43 |
+
"higher_is_better": true
|
44 |
+
},
|
45 |
+
{
|
46 |
+
"metric": "acc_norm",
|
47 |
+
"aggregation": "mean",
|
48 |
+
"higher_is_better": true
|
49 |
+
}
|
50 |
+
],
|
51 |
+
"output_type": "multiple_choice",
|
52 |
+
"repeats": 1,
|
53 |
+
"should_decontaminate": false,
|
54 |
+
"metadata": {
|
55 |
+
"version": 2.0
|
56 |
+
}
|
57 |
+
}
|
58 |
+
},
|
59 |
+
"versions": {
|
60 |
+
"moe_ien_tf": 2.0
|
61 |
+
},
|
62 |
+
"n-shot": {
|
63 |
+
"moe_ien_tf": 0
|
64 |
+
},
|
65 |
+
"higher_is_better": {
|
66 |
+
"moe_ien_tf": {
|
67 |
+
"acc": true,
|
68 |
+
"acc_norm": true
|
69 |
+
}
|
70 |
+
},
|
71 |
+
"n-samples": {
|
72 |
+
"moe_ien_tf": {
|
73 |
+
"original": 5823,
|
74 |
+
"effective": 5823
|
75 |
+
}
|
76 |
+
},
|
77 |
+
"config": {
|
78 |
+
"model": "hf",
|
79 |
+
"model_args": "pretrained=/tmp/7b-alpha-v1.27.2.25,trust_remote_code=True,cache_dir=/tmp,parallelize=False",
|
80 |
+
"model_num_parameters": 7000559616,
|
81 |
+
"model_dtype": "torch.bfloat16",
|
82 |
+
"model_revision": "main",
|
83 |
+
"model_sha": "",
|
84 |
+
"batch_size": 1,
|
85 |
+
"batch_sizes": [],
|
86 |
+
"device": null,
|
87 |
+
"use_cache": null,
|
88 |
+
"limit": null,
|
89 |
+
"bootstrap_iters": 100000,
|
90 |
+
"gen_kwargs": null,
|
91 |
+
"random_seed": 0,
|
92 |
+
"numpy_seed": 1234,
|
93 |
+
"torch_seed": 1234,
|
94 |
+
"fewshot_seed": 1234
|
95 |
+
},
|
96 |
+
"git_hash": "b955b2950",
|
97 |
+
"date": 1739617995.3462336,
|
98 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
99 |
+
"transformers_version": "4.48.3",
|
100 |
+
"upper_git_hash": null,
|
101 |
+
"tokenizer_pad_token": [
|
102 |
+
"<unk>",
|
103 |
+
"0"
|
104 |
+
],
|
105 |
+
"tokenizer_eos_token": [
|
106 |
+
"</s>",
|
107 |
+
"2"
|
108 |
+
],
|
109 |
+
"tokenizer_bos_token": [
|
110 |
+
"<s>",
|
111 |
+
"1"
|
112 |
+
],
|
113 |
+
"eot_token_id": 2,
|
114 |
+
"max_length": 4096,
|
115 |
+
"task_hashes": {
|
116 |
+
"moe_ien_tf": "8701a646f6ea8b9bb96c028f817fbeabfb9031580f5054368b43d14d4a5a1270"
|
117 |
+
},
|
118 |
+
"model_source": "hf",
|
119 |
+
"model_name": "/tmp/7b-alpha-v1.27.2.25",
|
120 |
+
"model_name_sanitized": "__tmp__7b-alpha-v1.27.2.25",
|
121 |
+
"system_instruction": null,
|
122 |
+
"system_instruction_sha": null,
|
123 |
+
"fewshot_as_multiturn": false,
|
124 |
+
"chat_template": "{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{% set system_message = messages[0]['content'] %}{% else %}{% set loop_messages = messages %}{% set system_message = false %}{% endif %}{% for message in loop_messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 and system_message != false %}{% set content = '<<SYS>>\\n' + system_message + '\\n<</SYS>>\\n\\n' + message['content'] %}{% else %}{% set content = message['content'] %}{% endif %}{% if message['role'] == 'user' %}{{ bos_token + ' [INST] ' + content.strip() + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ ' ' + content.strip() + ' ' + eos_token }}{% endif %}{% endfor %}",
|
125 |
+
"chat_template_sha": "f1dff938141b507da4a409b6bb3431382088a97a963acd246a41f2f344ae831f",
|
126 |
+
"start_time": 1392684.818305694,
|
127 |
+
"end_time": 1392900.218863064,
|
128 |
+
"total_evaluation_time_seconds": "215.40055736992508"
|
129 |
+
}
|
evaluations/ar/Allam-7b-instruct-preview/openaimmlu_0_shot.json
ADDED
The diff for this file is too large to render.
See raw diff
|
evaluations/ar/Falcon3-7B-Instruct/acva_5_shot.json
ADDED
@@ -0,0 +1,123 @@
|
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"acva": {
|
4 |
+
"alias": "acva",
|
5 |
+
"acc,none": 0.6045924225028703,
|
6 |
+
"acc_stderr,none": 0.00523925695392083,
|
7 |
+
"acc_norm,none": 0.5897818599311137,
|
8 |
+
"acc_norm_stderr,none": 0.005270708411925859
|
9 |
+
}
|
10 |
+
},
|
11 |
+
"group_subtasks": {
|
12 |
+
"acva": []
|
13 |
+
},
|
14 |
+
"configs": {
|
15 |
+
"acva": {
|
16 |
+
"task": "acva",
|
17 |
+
"tag": [
|
18 |
+
"multiple_choice"
|
19 |
+
],
|
20 |
+
"dataset_path": "FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment",
|
21 |
+
"dataset_kwargs": {
|
22 |
+
"trust_remote_code": true
|
23 |
+
},
|
24 |
+
"test_split": "test",
|
25 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _format_subject(subject):\n \n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n \n def _generate_subject(doc):\n subject = _format_subject(doc[\"id\"].split(\"-\")[0])\n\n return subject\n \n def _process_docs(doc):\n keys = [\"\u0635\u062d\",\n \"\u062e\u0637\u0623\"]\n subject = _generate_subject(doc)\n gold = keys.index(doc['answer'])\n out_doc = {\n \"id\": doc[\"id\"],\n \"query\": \"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" + doc[\"question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\",\n \"choices\": keys,\n \"gold\": gold,\n \"subject\": subject,\n }\n \n return out_doc\n\n return dataset.map(_process_docs)\n",
|
26 |
+
"doc_to_text": "query",
|
27 |
+
"doc_to_target": "gold",
|
28 |
+
"doc_to_choice": "choices",
|
29 |
+
"description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d' \u0623\u0648 '\u062e\u0637\u0623' \u062f\u0648\u0646 \u0634\u0631\u062d",
|
30 |
+
"target_delimiter": " ",
|
31 |
+
"fewshot_delimiter": "\n\n",
|
32 |
+
"num_fewshot": 5,
|
33 |
+
"metric_list": [
|
34 |
+
{
|
35 |
+
"metric": "acc",
|
36 |
+
"aggregation": "mean",
|
37 |
+
"higher_is_better": true
|
38 |
+
},
|
39 |
+
{
|
40 |
+
"metric": "acc_norm",
|
41 |
+
"aggregation": "mean",
|
42 |
+
"higher_is_better": true
|
43 |
+
}
|
44 |
+
],
|
45 |
+
"output_type": "multiple_choice",
|
46 |
+
"repeats": 1,
|
47 |
+
"should_decontaminate": false,
|
48 |
+
"metadata": {
|
49 |
+
"version": 0.0
|
50 |
+
}
|
51 |
+
}
|
52 |
+
},
|
53 |
+
"versions": {
|
54 |
+
"acva": 0.0
|
55 |
+
},
|
56 |
+
"n-shot": {
|
57 |
+
"acva": 5
|
58 |
+
},
|
59 |
+
"higher_is_better": {
|
60 |
+
"acva": {
|
61 |
+
"acc": true,
|
62 |
+
"acc_norm": true
|
63 |
+
}
|
64 |
+
},
|
65 |
+
"n-samples": {
|
66 |
+
"acva": {
|
67 |
+
"original": 8710,
|
68 |
+
"effective": 8710
|
69 |
+
}
|
70 |
+
},
|
71 |
+
"config": {
|
72 |
+
"model": "hf",
|
73 |
+
"model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
|
74 |
+
"model_num_parameters": 7455550464,
|
75 |
+
"model_dtype": "torch.bfloat16",
|
76 |
+
"model_revision": "main",
|
77 |
+
"model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
|
78 |
+
"batch_size": 1,
|
79 |
+
"batch_sizes": [],
|
80 |
+
"device": null,
|
81 |
+
"use_cache": null,
|
82 |
+
"limit": null,
|
83 |
+
"bootstrap_iters": 100000,
|
84 |
+
"gen_kwargs": null,
|
85 |
+
"random_seed": 0,
|
86 |
+
"numpy_seed": 1234,
|
87 |
+
"torch_seed": 1234,
|
88 |
+
"fewshot_seed": 1234
|
89 |
+
},
|
90 |
+
"git_hash": "5e10e017",
|
91 |
+
"date": 1736889821.9957027,
|
92 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
93 |
+
"transformers_version": "4.48.0",
|
94 |
+
"upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
|
95 |
+
"tokenizer_pad_token": [
|
96 |
+
"<|pad|>",
|
97 |
+
"2023"
|
98 |
+
],
|
99 |
+
"tokenizer_eos_token": [
|
100 |
+
"<|endoftext|>",
|
101 |
+
"11"
|
102 |
+
],
|
103 |
+
"tokenizer_bos_token": [
|
104 |
+
null,
|
105 |
+
"None"
|
106 |
+
],
|
107 |
+
"eot_token_id": 11,
|
108 |
+
"max_length": 32768,
|
109 |
+
"task_hashes": {
|
110 |
+
"acva": "f573ae5740e68711d257f2dc4a23db7c6b1c04895364f1af4b4eb64bfab793a4"
|
111 |
+
},
|
112 |
+
"model_source": "hf",
|
113 |
+
"model_name": "tiiuae/Falcon3-7B-Instruct",
|
114 |
+
"model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
|
115 |
+
"system_instruction": null,
|
116 |
+
"system_instruction_sha": null,
|
117 |
+
"fewshot_as_multiturn": false,
|
118 |
+
"chat_template": null,
|
119 |
+
"chat_template_sha": null,
|
120 |
+
"start_time": 600072.370318618,
|
121 |
+
"end_time": 600217.222010416,
|
122 |
+
"total_evaluation_time_seconds": "144.85169179795776"
|
123 |
+
}
|
evaluations/ar/Falcon3-7B-Instruct/ar_ifeval_0_shot.json
ADDED
@@ -0,0 +1,142 @@
|
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"ar_ifeval": {
|
4 |
+
"alias": "ar_ifeval",
|
5 |
+
"prompt_level_strict_acc,none": 0.08582089552238806,
|
6 |
+
"prompt_level_strict_acc_stderr,none": 0.012109752724743699,
|
7 |
+
"inst_level_strict_acc,none": 0.47918088737201364,
|
8 |
+
"inst_level_strict_acc_stderr,none": "N/A",
|
9 |
+
"prompt_level_loose_acc,none": 0.13805970149253732,
|
10 |
+
"prompt_level_loose_acc_stderr,none": 0.014914035308708435,
|
11 |
+
"inst_level_loose_acc,none": 0.5276450511945392,
|
12 |
+
"inst_level_loose_acc_stderr,none": "N/A"
|
13 |
+
}
|
14 |
+
},
|
15 |
+
"group_subtasks": {
|
16 |
+
"ar_ifeval": []
|
17 |
+
},
|
18 |
+
"configs": {
|
19 |
+
"ar_ifeval": {
|
20 |
+
"task": "ar_ifeval",
|
21 |
+
"dataset_path": "lm_eval/tasks/ar_ifeval/ar_ifeval.py",
|
22 |
+
"dataset_name": "ar_ifeval",
|
23 |
+
"dataset_kwargs": {
|
24 |
+
"trust_remote_code": true
|
25 |
+
},
|
26 |
+
"test_split": "test",
|
27 |
+
"doc_to_text": "prompt",
|
28 |
+
"doc_to_target": 0,
|
29 |
+
"process_results": "def process_results(doc, results):\n\n response = results[0]\n out_strict = process_sample(doc, response, 'strict')\n out_loose = process_sample(doc, response, 'loose')\n\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
|
30 |
+
"description": "",
|
31 |
+
"target_delimiter": " ",
|
32 |
+
"fewshot_delimiter": "\n\n",
|
33 |
+
"num_fewshot": 0,
|
34 |
+
"metric_list": [
|
35 |
+
{
|
36 |
+
"metric": "prompt_level_strict_acc",
|
37 |
+
"aggregation": "mean",
|
38 |
+
"higher_is_better": true
|
39 |
+
},
|
40 |
+
{
|
41 |
+
"metric": "inst_level_strict_acc",
|
42 |
+
"aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
|
43 |
+
"higher_is_better": true
|
44 |
+
},
|
45 |
+
{
|
46 |
+
"metric": "prompt_level_loose_acc",
|
47 |
+
"aggregation": "mean",
|
48 |
+
"higher_is_better": true
|
49 |
+
},
|
50 |
+
{
|
51 |
+
"metric": "inst_level_loose_acc",
|
52 |
+
"aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
|
53 |
+
"higher_is_better": true
|
54 |
+
}
|
55 |
+
],
|
56 |
+
"output_type": "generate_until",
|
57 |
+
"generation_kwargs": {
|
58 |
+
"until": [],
|
59 |
+
"do_sample": false,
|
60 |
+
"temperature": 0.0,
|
61 |
+
"max_gen_toks": 1280
|
62 |
+
},
|
63 |
+
"repeats": 1,
|
64 |
+
"should_decontaminate": false,
|
65 |
+
"metadata": {
|
66 |
+
"version": 4.0
|
67 |
+
}
|
68 |
+
}
|
69 |
+
},
|
70 |
+
"versions": {
|
71 |
+
"ar_ifeval": 4.0
|
72 |
+
},
|
73 |
+
"n-shot": {
|
74 |
+
"ar_ifeval": 0
|
75 |
+
},
|
76 |
+
"higher_is_better": {
|
77 |
+
"ar_ifeval": {
|
78 |
+
"prompt_level_strict_acc": true,
|
79 |
+
"inst_level_strict_acc": true,
|
80 |
+
"prompt_level_loose_acc": true,
|
81 |
+
"inst_level_loose_acc": true
|
82 |
+
}
|
83 |
+
},
|
84 |
+
"n-samples": {
|
85 |
+
"ar_ifeval": {
|
86 |
+
"original": 536,
|
87 |
+
"effective": 536
|
88 |
+
}
|
89 |
+
},
|
90 |
+
"config": {
|
91 |
+
"model": "hf",
|
92 |
+
"model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
|
93 |
+
"model_num_parameters": 7455550464,
|
94 |
+
"model_dtype": "torch.bfloat16",
|
95 |
+
"model_revision": "main",
|
96 |
+
"model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
|
97 |
+
"batch_size": 1,
|
98 |
+
"batch_sizes": [],
|
99 |
+
"device": null,
|
100 |
+
"use_cache": null,
|
101 |
+
"limit": null,
|
102 |
+
"bootstrap_iters": 100000,
|
103 |
+
"gen_kwargs": null,
|
104 |
+
"random_seed": 0,
|
105 |
+
"numpy_seed": 1234,
|
106 |
+
"torch_seed": 1234,
|
107 |
+
"fewshot_seed": 1234
|
108 |
+
},
|
109 |
+
"git_hash": "b955b2950",
|
110 |
+
"date": 1739621196.897086,
|
111 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
112 |
+
"transformers_version": "4.48.3",
|
113 |
+
"upper_git_hash": null,
|
114 |
+
"tokenizer_pad_token": [
|
115 |
+
"<|pad|>",
|
116 |
+
"2023"
|
117 |
+
],
|
118 |
+
"tokenizer_eos_token": [
|
119 |
+
"<|endoftext|>",
|
120 |
+
"11"
|
121 |
+
],
|
122 |
+
"tokenizer_bos_token": [
|
123 |
+
null,
|
124 |
+
"None"
|
125 |
+
],
|
126 |
+
"eot_token_id": 11,
|
127 |
+
"max_length": 32768,
|
128 |
+
"task_hashes": {
|
129 |
+
"ar_ifeval": "ca837eed1e9f468712643d1fab81b7b48c88a8799239851476bdc889990e6b41"
|
130 |
+
},
|
131 |
+
"model_source": "hf",
|
132 |
+
"model_name": "tiiuae/Falcon3-7B-Instruct",
|
133 |
+
"model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
|
134 |
+
"system_instruction": null,
|
135 |
+
"system_instruction_sha": null,
|
136 |
+
"fewshot_as_multiturn": false,
|
137 |
+
"chat_template": "{%- if tools %}\n{{- '<|system|>\\n' }}\n{%- if messages[0]['role'] == 'system' %}\n{{- messages[0]['content'] }}\n{%- set remaining_messages = messages[1:] %}\n{%- else %}\n{%- set remaining_messages = messages %}\n{%- endif %}\n{{- 'You are a Falcon assistant skilled in function calling. You are helpful, respectful, and concise.\\n\\n# Tools\\n\\nYou have access to the following functions. You MUST use them to answer questions when needed. For each function call, you MUST return a JSON object inside <tool_call></tool_call> tags.\\n\\n<tools>' + tools|tojson(indent=2) + '</tools>\\n\\n# Output Format\\n\\nYour response MUST follow this format when making function calls:\\n<tool_call>\\n[\\n {\"name\": \"function_name\", \"arguments\": {\"arg1\": \"value1\", \"arg2\": \"value2\"}},\\n {\"name\": \"another_function\", \"arguments\": {\"arg\": \"value\"}}\\n]\\n</tool_call>\\nIf no function calls are needed, respond normally without the tool_call tags.\\n' }}\n{%- for message in remaining_messages %}\n{%- if message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if message.content %}\n{{- '<|assistant|>\\n' + message['content'] }}\n{%- endif %}\n{%- if message.tool_calls %}\n{{- '\\n<tool_call>\\n' }}\n{{- message.tool_calls|tojson(indent=2) }}\n{{- '\\n</tool_call>' }}\n{%- endif %}\n{{- eos_token + '\\n' }}\n{%- elif message['role'] == 'tool' %}\n{{- '<|assistant|>\\n<tool_response>\\n' + message['content'] + '\\n</tool_response>\\n' }}\n{%- endif %}\n{%- endfor %}\n{{- '<|assistant|>\\n' if add_generation_prompt }}\n{%- else %}\n{%- for message in messages %}\n{%- if message['role'] == 'system' %}\n{{- '<|system|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if not loop.last %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token + '\\n' }}\n{%- else %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token }}\n{%- endif %}\n{%- endif %}\n{%- if loop.last and add_generation_prompt %}\n{{- '<|assistant|>\\n' }}\n{%- endif %}\n{%- endfor %}\n{%- endif %}",
|
138 |
+
"chat_template_sha": "914ccd80356f5822d1a50d97546e37f60c04ed831fe431aa40346574ec266901",
|
139 |
+
"start_time": 1395880.012817552,
|
140 |
+
"end_time": 1401371.318791154,
|
141 |
+
"total_evaluation_time_seconds": "5491.305973601993"
|
142 |
+
}
|
evaluations/ar/Falcon3-7B-Instruct/araMath_v3_5_shot.json
ADDED
@@ -0,0 +1,126 @@
|
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"araMath_v3": {
|
4 |
+
"alias": "araMath_v3",
|
5 |
+
"acc,none": 0.5652892561983471,
|
6 |
+
"acc_stderr,none": 0.020170519477736983,
|
7 |
+
"acc_norm,none": 0.5652892561983471,
|
8 |
+
"acc_norm_stderr,none": 0.020170519477736983
|
9 |
+
}
|
10 |
+
},
|
11 |
+
"group_subtasks": {
|
12 |
+
"araMath_v3": []
|
13 |
+
},
|
14 |
+
"configs": {
|
15 |
+
"araMath_v3": {
|
16 |
+
"task": "araMath_v3",
|
17 |
+
"tag": [
|
18 |
+
"multiple_choice"
|
19 |
+
],
|
20 |
+
"dataset_path": "lm_eval/tasks/araMath_v3/araMath_v3.py",
|
21 |
+
"dataset_name": "araMath_v3",
|
22 |
+
"dataset_kwargs": {
|
23 |
+
"trust_remote_code": true
|
24 |
+
},
|
25 |
+
"validation_split": "validation",
|
26 |
+
"test_split": "test",
|
27 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def remove_prefix(choice):\n prefixes = [\"(A)\", \"(B)\", \"(C)\", \"(D)\"]\n for prefix in prefixes:\n if choice.startswith(prefix + \" \"):\n return choice[len(prefix) + 1:] \n return choice \n\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"options\"])]\n )\n\n prompt = f\"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": keys_en.index(doc[\"label\"]),\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
|
28 |
+
"doc_to_text": "query",
|
29 |
+
"doc_to_target": "gold",
|
30 |
+
"doc_to_choice": "{{choices}}",
|
31 |
+
"description": "\u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u0645\u0646 \u0628\u064a\u0646 'A\u060c B\u060c C\u060c D' \u062f\u0648\u0646 \u0634\u0631\u062d",
|
32 |
+
"target_delimiter": " ",
|
33 |
+
"fewshot_delimiter": "\n\n",
|
34 |
+
"num_fewshot": 5,
|
35 |
+
"metric_list": [
|
36 |
+
{
|
37 |
+
"metric": "acc",
|
38 |
+
"aggregation": "mean",
|
39 |
+
"higher_is_better": true
|
40 |
+
},
|
41 |
+
{
|
42 |
+
"metric": "acc_norm",
|
43 |
+
"aggregation": "mean",
|
44 |
+
"higher_is_better": true
|
45 |
+
}
|
46 |
+
],
|
47 |
+
"output_type": "multiple_choice",
|
48 |
+
"repeats": 1,
|
49 |
+
"should_decontaminate": true,
|
50 |
+
"doc_to_decontamination_query": "query",
|
51 |
+
"metadata": {
|
52 |
+
"version": 0.0
|
53 |
+
}
|
54 |
+
}
|
55 |
+
},
|
56 |
+
"versions": {
|
57 |
+
"araMath_v3": 0.0
|
58 |
+
},
|
59 |
+
"n-shot": {
|
60 |
+
"araMath_v3": 5
|
61 |
+
},
|
62 |
+
"higher_is_better": {
|
63 |
+
"araMath_v3": {
|
64 |
+
"acc": true,
|
65 |
+
"acc_norm": true
|
66 |
+
}
|
67 |
+
},
|
68 |
+
"n-samples": {
|
69 |
+
"araMath_v3": {
|
70 |
+
"original": 605,
|
71 |
+
"effective": 605
|
72 |
+
}
|
73 |
+
},
|
74 |
+
"config": {
|
75 |
+
"model": "hf",
|
76 |
+
"model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
|
77 |
+
"model_num_parameters": 7455550464,
|
78 |
+
"model_dtype": "torch.bfloat16",
|
79 |
+
"model_revision": "main",
|
80 |
+
"model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
|
81 |
+
"batch_size": 1,
|
82 |
+
"batch_sizes": [],
|
83 |
+
"device": null,
|
84 |
+
"use_cache": null,
|
85 |
+
"limit": null,
|
86 |
+
"bootstrap_iters": 100000,
|
87 |
+
"gen_kwargs": null,
|
88 |
+
"random_seed": 0,
|
89 |
+
"numpy_seed": 1234,
|
90 |
+
"torch_seed": 1234,
|
91 |
+
"fewshot_seed": 1234
|
92 |
+
},
|
93 |
+
"git_hash": "b955b2950",
|
94 |
+
"date": 1739621084.921236,
|
95 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma 
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
96 |
+
"transformers_version": "4.48.3",
|
97 |
+
"upper_git_hash": null,
|
98 |
+
"tokenizer_pad_token": [
|
99 |
+
"<|pad|>",
|
100 |
+
"2023"
|
101 |
+
],
|
102 |
+
"tokenizer_eos_token": [
|
103 |
+
"<|endoftext|>",
|
104 |
+
"11"
|
105 |
+
],
|
106 |
+
"tokenizer_bos_token": [
|
107 |
+
null,
|
108 |
+
"None"
|
109 |
+
],
|
110 |
+
"eot_token_id": 11,
|
111 |
+
"max_length": 32768,
|
112 |
+
"task_hashes": {
|
113 |
+
"araMath_v3": "b7e29b20c532c7420cc659c6586d56642070560abff0925ed01ad8f200d8e72b"
|
114 |
+
},
|
115 |
+
"model_source": "hf",
|
116 |
+
"model_name": "tiiuae/Falcon3-7B-Instruct",
|
117 |
+
"model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
|
118 |
+
"system_instruction": null,
|
119 |
+
"system_instruction_sha": null,
|
120 |
+
"fewshot_as_multiturn": false,
|
121 |
+
"chat_template": "{%- if tools %}\n{{- '<|system|>\\n' }}\n{%- if messages[0]['role'] == 'system' %}\n{{- messages[0]['content'] }}\n{%- set remaining_messages = messages[1:] %}\n{%- else %}\n{%- set remaining_messages = messages %}\n{%- endif %}\n{{- 'You are a Falcon assistant skilled in function calling. You are helpful, respectful, and concise.\\n\\n# Tools\\n\\nYou have access to the following functions. You MUST use them to answer questions when needed. For each function call, you MUST return a JSON object inside <tool_call></tool_call> tags.\\n\\n<tools>' + tools|tojson(indent=2) + '</tools>\\n\\n# Output Format\\n\\nYour response MUST follow this format when making function calls:\\n<tool_call>\\n[\\n {\"name\": \"function_name\", \"arguments\": {\"arg1\": \"value1\", \"arg2\": \"value2\"}},\\n {\"name\": \"another_function\", \"arguments\": {\"arg\": \"value\"}}\\n]\\n</tool_call>\\nIf no function calls are needed, respond normally without the tool_call tags.\\n' }}\n{%- for message in remaining_messages %}\n{%- if message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if message.content %}\n{{- '<|assistant|>\\n' + message['content'] }}\n{%- endif %}\n{%- if message.tool_calls %}\n{{- '\\n<tool_call>\\n' }}\n{{- message.tool_calls|tojson(indent=2) }}\n{{- '\\n</tool_call>' }}\n{%- endif %}\n{{- eos_token + '\\n' }}\n{%- elif message['role'] == 'tool' %}\n{{- '<|assistant|>\\n<tool_response>\\n' + message['content'] + '\\n</tool_response>\\n' }}\n{%- endif %}\n{%- endfor %}\n{{- '<|assistant|>\\n' if add_generation_prompt }}\n{%- else %}\n{%- for message in messages %}\n{%- if message['role'] == 'system' %}\n{{- '<|system|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if not loop.last %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token + '\\n' 
}}\n{%- else %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token }}\n{%- endif %}\n{%- endif %}\n{%- if loop.last and add_generation_prompt %}\n{{- '<|assistant|>\\n' }}\n{%- endif %}\n{%- endfor %}\n{%- endif %}",
|
122 |
+
"chat_template_sha": "914ccd80356f5822d1a50d97546e37f60c04ed831fe431aa40346574ec266901",
|
123 |
+
"start_time": 1395768.116667791,
|
124 |
+
"end_time": 1395816.745740765,
|
125 |
+
"total_evaluation_time_seconds": "48.629072973970324"
|
126 |
+
}
|
evaluations/ar/Falcon3-7B-Instruct/araPro_0_shot.json
ADDED
@@ -0,0 +1,130 @@
|
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"araPro": {
|
4 |
+
"alias": "araPro",
|
5 |
+
"acc,none": 0.41471705658868224,
|
6 |
+
"acc_stderr,none": 0.006967450316480296,
|
7 |
+
"acc_norm,none": 0.41471705658868224,
|
8 |
+
"acc_norm_stderr,none": 0.006967450316480296
|
9 |
+
}
|
10 |
+
},
|
11 |
+
"group_subtasks": {
|
12 |
+
"araPro": []
|
13 |
+
},
|
14 |
+
"configs": {
|
15 |
+
"araPro": {
|
16 |
+
"task": "araPro",
|
17 |
+
"tag": [
|
18 |
+
"multiple_choice"
|
19 |
+
],
|
20 |
+
"dataset_path": "lm_eval/tasks/araPro/araPro.py",
|
21 |
+
"dataset_name": "araPro",
|
22 |
+
"dataset_kwargs": {
|
23 |
+
"trust_remote_code": true
|
24 |
+
},
|
25 |
+
"validation_split": "validation",
|
26 |
+
"test_split": "test",
|
27 |
+
"fewshot_split": "validation",
|
28 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.replace('.', '') if '.' in choice[:2] else choice\n \n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choice_num = ['choice1', 'choice2', 'choice3', 'choice4']\n choices = \"\".join(\n [f\"{key}. {remove_prefix(doc[choice_num[index]])}\\n\" for index, key in enumerate(keys)]\n )\n\n prompt = f\"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n #keys = [\"1\", \"2\", \"3\", \"4\"]\n keys = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys), \n \"choices\": keys,\n \"gold\": doc[\"answer\"]-1,\n } \n\n return out_doc\n \n return dataset.map(_process_docs)\n",
|
29 |
+
"doc_to_text": "query",
|
30 |
+
"doc_to_target": "gold",
|
31 |
+
"doc_to_choice": "{{choices}}",
|
32 |
+
"description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d",
|
33 |
+
"target_delimiter": " ",
|
34 |
+
"fewshot_delimiter": "\n\n",
|
35 |
+
"fewshot_config": {
|
36 |
+
"sampler": "balanced_cat"
|
37 |
+
},
|
38 |
+
"num_fewshot": 0,
|
39 |
+
"metric_list": [
|
40 |
+
{
|
41 |
+
"metric": "acc",
|
42 |
+
"aggregation": "mean",
|
43 |
+
"higher_is_better": true
|
44 |
+
},
|
45 |
+
{
|
46 |
+
"metric": "acc_norm",
|
47 |
+
"aggregation": "mean",
|
48 |
+
"higher_is_better": true
|
49 |
+
}
|
50 |
+
],
|
51 |
+
"output_type": "multiple_choice",
|
52 |
+
"repeats": 1,
|
53 |
+
"should_decontaminate": true,
|
54 |
+
"doc_to_decontamination_query": "Question",
|
55 |
+
"metadata": {
|
56 |
+
"version": 2.0
|
57 |
+
}
|
58 |
+
}
|
59 |
+
},
|
60 |
+
"versions": {
|
61 |
+
"araPro": 2.0
|
62 |
+
},
|
63 |
+
"n-shot": {
|
64 |
+
"araPro": 0
|
65 |
+
},
|
66 |
+
"higher_is_better": {
|
67 |
+
"araPro": {
|
68 |
+
"acc": true,
|
69 |
+
"acc_norm": true
|
70 |
+
}
|
71 |
+
},
|
72 |
+
"n-samples": {
|
73 |
+
"araPro": {
|
74 |
+
"original": 5001,
|
75 |
+
"effective": 5001
|
76 |
+
}
|
77 |
+
},
|
78 |
+
"config": {
|
79 |
+
"model": "hf",
|
80 |
+
"model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
|
81 |
+
"model_num_parameters": 7455550464,
|
82 |
+
"model_dtype": "torch.bfloat16",
|
83 |
+
"model_revision": "main",
|
84 |
+
"model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
|
85 |
+
"batch_size": 1,
|
86 |
+
"batch_sizes": [],
|
87 |
+
"device": null,
|
88 |
+
"use_cache": null,
|
89 |
+
"limit": null,
|
90 |
+
"bootstrap_iters": 100000,
|
91 |
+
"gen_kwargs": null,
|
92 |
+
"random_seed": 0,
|
93 |
+
"numpy_seed": 1234,
|
94 |
+
"torch_seed": 1234,
|
95 |
+
"fewshot_seed": 1234
|
96 |
+
},
|
97 |
+
"git_hash": "b955b2950",
|
98 |
+
"date": 1739617143.3614087,
|
99 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma 
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
100 |
+
"transformers_version": "4.48.3",
|
101 |
+
"upper_git_hash": null,
|
102 |
+
"tokenizer_pad_token": [
|
103 |
+
"<|pad|>",
|
104 |
+
"2023"
|
105 |
+
],
|
106 |
+
"tokenizer_eos_token": [
|
107 |
+
"<|endoftext|>",
|
108 |
+
"11"
|
109 |
+
],
|
110 |
+
"tokenizer_bos_token": [
|
111 |
+
null,
|
112 |
+
"None"
|
113 |
+
],
|
114 |
+
"eot_token_id": 11,
|
115 |
+
"max_length": 32768,
|
116 |
+
"task_hashes": {
|
117 |
+
"araPro": "063166ad2e52146b6a051c978bf54b1397281e222da633e81fa50357d2409ee9"
|
118 |
+
},
|
119 |
+
"model_source": "hf",
|
120 |
+
"model_name": "tiiuae/Falcon3-7B-Instruct",
|
121 |
+
"model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
|
122 |
+
"system_instruction": null,
|
123 |
+
"system_instruction_sha": null,
|
124 |
+
"fewshot_as_multiturn": false,
|
125 |
+
"chat_template": "{%- if tools %}\n{{- '<|system|>\\n' }}\n{%- if messages[0]['role'] == 'system' %}\n{{- messages[0]['content'] }}\n{%- set remaining_messages = messages[1:] %}\n{%- else %}\n{%- set remaining_messages = messages %}\n{%- endif %}\n{{- 'You are a Falcon assistant skilled in function calling. You are helpful, respectful, and concise.\\n\\n# Tools\\n\\nYou have access to the following functions. You MUST use them to answer questions when needed. For each function call, you MUST return a JSON object inside <tool_call></tool_call> tags.\\n\\n<tools>' + tools|tojson(indent=2) + '</tools>\\n\\n# Output Format\\n\\nYour response MUST follow this format when making function calls:\\n<tool_call>\\n[\\n {\"name\": \"function_name\", \"arguments\": {\"arg1\": \"value1\", \"arg2\": \"value2\"}},\\n {\"name\": \"another_function\", \"arguments\": {\"arg\": \"value\"}}\\n]\\n</tool_call>\\nIf no function calls are needed, respond normally without the tool_call tags.\\n' }}\n{%- for message in remaining_messages %}\n{%- if message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if message.content %}\n{{- '<|assistant|>\\n' + message['content'] }}\n{%- endif %}\n{%- if message.tool_calls %}\n{{- '\\n<tool_call>\\n' }}\n{{- message.tool_calls|tojson(indent=2) }}\n{{- '\\n</tool_call>' }}\n{%- endif %}\n{{- eos_token + '\\n' }}\n{%- elif message['role'] == 'tool' %}\n{{- '<|assistant|>\\n<tool_response>\\n' + message['content'] + '\\n</tool_response>\\n' }}\n{%- endif %}\n{%- endfor %}\n{{- '<|assistant|>\\n' if add_generation_prompt }}\n{%- else %}\n{%- for message in messages %}\n{%- if message['role'] == 'system' %}\n{{- '<|system|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if not loop.last %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token + '\\n' 
}}\n{%- else %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token }}\n{%- endif %}\n{%- endif %}\n{%- if loop.last and add_generation_prompt %}\n{{- '<|assistant|>\\n' }}\n{%- endif %}\n{%- endfor %}\n{%- endif %}",
|
126 |
+
"chat_template_sha": "914ccd80356f5822d1a50d97546e37f60c04ed831fe431aa40346574ec266901",
|
127 |
+
"start_time": 1391826.416201954,
|
128 |
+
"end_time": 1394850.089034202,
|
129 |
+
"total_evaluation_time_seconds": "3023.672832248034"
|
130 |
+
}
|
evaluations/ar/Falcon3-7B-Instruct/arabicmmlu_0_shot.json
ADDED
The diff for this file is too large to render.
|
|
evaluations/ar/Falcon3-7B-Instruct/etec_v2_0_shot.json
ADDED
@@ -0,0 +1,126 @@
|
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"etec_v2": {
|
4 |
+
"alias": "etec_v2",
|
5 |
+
"acc,none": 0.3751987281399046,
|
6 |
+
"acc_stderr,none": 0.01114886834610489,
|
7 |
+
"acc_norm,none": 0.3751987281399046,
|
8 |
+
"acc_norm_stderr,none": 0.01114886834610489
|
9 |
+
}
|
10 |
+
},
|
11 |
+
"group_subtasks": {
|
12 |
+
"etec_v2": []
|
13 |
+
},
|
14 |
+
"configs": {
|
15 |
+
"etec_v2": {
|
16 |
+
"task": "etec_v2",
|
17 |
+
"tag": [
|
18 |
+
"multiple_choice"
|
19 |
+
],
|
20 |
+
"dataset_path": "lm_eval/tasks/etec_v2/etec.py",
|
21 |
+
"dataset_name": "etec_v2",
|
22 |
+
"dataset_kwargs": {
|
23 |
+
"trust_remote_code": true
|
24 |
+
},
|
25 |
+
"validation_split": "validation",
|
26 |
+
"test_split": "test",
|
27 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n print(doc[\"label\"])\n keys_ar = [\"\u0623\", \"\u0628\", \"\u062c\", \"\u062f\"]\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": int(doc[\"label\"])-1,\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
|
28 |
+
"doc_to_text": "query",
|
29 |
+
"doc_to_target": "gold",
|
30 |
+
"doc_to_choice": "choices",
|
31 |
+
"description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d\n ",
|
32 |
+
"target_delimiter": " ",
|
33 |
+
"fewshot_delimiter": "\n\n",
|
34 |
+
"num_fewshot": 0,
|
35 |
+
"metric_list": [
|
36 |
+
{
|
37 |
+
"metric": "acc",
|
38 |
+
"aggregation": "mean",
|
39 |
+
"higher_is_better": true
|
40 |
+
},
|
41 |
+
{
|
42 |
+
"metric": "acc_norm",
|
43 |
+
"aggregation": "mean",
|
44 |
+
"higher_is_better": true
|
45 |
+
}
|
46 |
+
],
|
47 |
+
"output_type": "multiple_choice",
|
48 |
+
"repeats": 1,
|
49 |
+
"should_decontaminate": true,
|
50 |
+
"doc_to_decontamination_query": "query",
|
51 |
+
"metadata": {
|
52 |
+
"version": 0.0
|
53 |
+
}
|
54 |
+
}
|
55 |
+
},
|
56 |
+
"versions": {
|
57 |
+
"etec_v2": 0.0
|
58 |
+
},
|
59 |
+
"n-shot": {
|
60 |
+
"etec_v2": 0
|
61 |
+
},
|
62 |
+
"higher_is_better": {
|
63 |
+
"etec_v2": {
|
64 |
+
"acc": true,
|
65 |
+
"acc_norm": true
|
66 |
+
}
|
67 |
+
},
|
68 |
+
"n-samples": {
|
69 |
+
"etec_v2": {
|
70 |
+
"original": 1887,
|
71 |
+
"effective": 1887
|
72 |
+
}
|
73 |
+
},
|
74 |
+
"config": {
|
75 |
+
"model": "hf",
|
76 |
+
"model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
|
77 |
+
"model_num_parameters": 7455550464,
|
78 |
+
"model_dtype": "torch.bfloat16",
|
79 |
+
"model_revision": "main",
|
80 |
+
"model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
|
81 |
+
"batch_size": 1,
|
82 |
+
"batch_sizes": [],
|
83 |
+
"device": null,
|
84 |
+
"use_cache": null,
|
85 |
+
"limit": null,
|
86 |
+
"bootstrap_iters": 100000,
|
87 |
+
"gen_kwargs": null,
|
88 |
+
"random_seed": 0,
|
89 |
+
"numpy_seed": 1234,
|
90 |
+
"torch_seed": 1234,
|
91 |
+
"fewshot_seed": 1234
|
92 |
+
},
|
93 |
+
"git_hash": "b955b2950",
|
94 |
+
"date": 1739620236.678696,
|
95 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma 
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
96 |
+
"transformers_version": "4.48.3",
|
97 |
+
"upper_git_hash": null,
|
98 |
+
"tokenizer_pad_token": [
|
99 |
+
"<|pad|>",
|
100 |
+
"2023"
|
101 |
+
],
|
102 |
+
"tokenizer_eos_token": [
|
103 |
+
"<|endoftext|>",
|
104 |
+
"11"
|
105 |
+
],
|
106 |
+
"tokenizer_bos_token": [
|
107 |
+
null,
|
108 |
+
"None"
|
109 |
+
],
|
110 |
+
"eot_token_id": 11,
|
111 |
+
"max_length": 32768,
|
112 |
+
"task_hashes": {
|
113 |
+
"etec_v2": "3a8dc6484af6c9538f122c1bbe5c6866dbe14df841fdf04ab7ff2b6437e8aeae"
|
114 |
+
},
|
115 |
+
"model_source": "hf",
|
116 |
+
"model_name": "tiiuae/Falcon3-7B-Instruct",
|
117 |
+
"model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
|
118 |
+
"system_instruction": null,
|
119 |
+
"system_instruction_sha": null,
|
120 |
+
"fewshot_as_multiturn": false,
|
121 |
+
"chat_template": "{%- if tools %}\n{{- '<|system|>\\n' }}\n{%- if messages[0]['role'] == 'system' %}\n{{- messages[0]['content'] }}\n{%- set remaining_messages = messages[1:] %}\n{%- else %}\n{%- set remaining_messages = messages %}\n{%- endif %}\n{{- 'You are a Falcon assistant skilled in function calling. You are helpful, respectful, and concise.\\n\\n# Tools\\n\\nYou have access to the following functions. You MUST use them to answer questions when needed. For each function call, you MUST return a JSON object inside <tool_call></tool_call> tags.\\n\\n<tools>' + tools|tojson(indent=2) + '</tools>\\n\\n# Output Format\\n\\nYour response MUST follow this format when making function calls:\\n<tool_call>\\n[\\n {\"name\": \"function_name\", \"arguments\": {\"arg1\": \"value1\", \"arg2\": \"value2\"}},\\n {\"name\": \"another_function\", \"arguments\": {\"arg\": \"value\"}}\\n]\\n</tool_call>\\nIf no function calls are needed, respond normally without the tool_call tags.\\n' }}\n{%- for message in remaining_messages %}\n{%- if message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if message.content %}\n{{- '<|assistant|>\\n' + message['content'] }}\n{%- endif %}\n{%- if message.tool_calls %}\n{{- '\\n<tool_call>\\n' }}\n{{- message.tool_calls|tojson(indent=2) }}\n{{- '\\n</tool_call>' }}\n{%- endif %}\n{{- eos_token + '\\n' }}\n{%- elif message['role'] == 'tool' %}\n{{- '<|assistant|>\\n<tool_response>\\n' + message['content'] + '\\n</tool_response>\\n' }}\n{%- endif %}\n{%- endfor %}\n{{- '<|assistant|>\\n' if add_generation_prompt }}\n{%- else %}\n{%- for message in messages %}\n{%- if message['role'] == 'system' %}\n{{- '<|system|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if not loop.last %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token + '\\n' 
}}\n{%- else %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token }}\n{%- endif %}\n{%- endif %}\n{%- if loop.last and add_generation_prompt %}\n{{- '<|assistant|>\\n' }}\n{%- endif %}\n{%- endfor %}\n{%- endif %}",
|
122 |
+
"chat_template_sha": "914ccd80356f5822d1a50d97546e37f60c04ed831fe431aa40346574ec266901",
|
123 |
+
"start_time": 1394919.684315533,
|
124 |
+
"end_time": 1394995.42617788,
|
125 |
+
"total_evaluation_time_seconds": "75.7418623471167"
|
126 |
+
}
|
evaluations/ar/Falcon3-7B-Instruct/exams_ar_5_shot.json
ADDED
@@ -0,0 +1,125 @@
|
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"exams_ar": {
|
4 |
+
"alias": "exams_ar",
|
5 |
+
"acc,none": 0.31843575418994413,
|
6 |
+
"acc_stderr,none": 0.020122499132803468,
|
7 |
+
"acc_norm,none": 0.31843575418994413,
|
8 |
+
"acc_norm_stderr,none": 0.020122499132803468
|
9 |
+
}
|
10 |
+
},
|
11 |
+
"group_subtasks": {
|
12 |
+
"exams_ar": []
|
13 |
+
},
|
14 |
+
"configs": {
|
15 |
+
"exams_ar": {
|
16 |
+
"task": "exams_ar",
|
17 |
+
"tag": [
|
18 |
+
"multiple_choice"
|
19 |
+
],
|
20 |
+
"dataset_path": "lm_eval/tasks/exams_ar",
|
21 |
+
"dataset_name": "exams_ar",
|
22 |
+
"dataset_kwargs": {
|
23 |
+
"trust_remote_code": true
|
24 |
+
},
|
25 |
+
"test_split": "test",
|
26 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n\n def _process_docs(doc):\n def format_example(doc, keys):\n \"\"\"\n <prompt>\n \u0633\u0624\u0627\u0644:\n A. <choice1>\n B. <choice2>\n C. <choice3>\n D. <choice4>\n \u0627\u062c\u0627\u0628\u0629:\n \"\"\"\n \n question = doc[\"question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {choice}\\n\" for key, choice in zip(keys, doc[\"choices\"])]\n )\n prompt = f\"\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n def _format_subject(subject):\n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n\n keys = [\"A\", \"B\", \"C\", \"D\"]\n \n subject = doc['id'].split(\"-\")[0]\n description = f\"\ufed2\ufef4\ufee3\ufe8d \ufef2\ufee0\ufef3 \ufe84\ufeb4\ufe8c\ufedf\ufe93 \ufe8d\ufefc\ufea8\ufe98\ufef3\ufe8d\ufead \ufee2\ufee7 \ufee2\ufe98\ufecb\ufea9\ufea9 (\ufee2\ufecb \ufe8d\ufefa\ufe9f\ufe8e\ufe91\ufe8e\ufe97) \ufea1\ufeee\ufedf {_format_subject(subject)} \\n\" #\ufee2\ufee7 \ufed2\ufec0\ufee0\ufedb \ufe8e\ufea8\ufe97\ufead \ufe88\ufe9f\ufe8e\ufe91\ufe93 \ufeed\ufe8e\ufea3\ufea9\ufe93 \ufee2\ufee7 \ufe90\ufef4\ufee7 'A\u060c B\u060c C\u060c D' \ufea9\ufeee\ufee7 \ufeb5\ufeae\ufea3\\n\"\n\n out_doc = {\n \"idx\": doc[\"idx\"],\n \"id\": doc[\"id\"],\n 'dsecription': description,\n \"query\": format_example(doc, keys), # \"Question: \" + doc[\"question\"]['stem'] + \"\\nAnswer:\",\n \"choices\": keys,\n \"gold\": [\"A\", \"B\", \"C\", \"D\"].index(doc[\"label\"]),\n }\n return out_doc\n\n return dataset.map(_process_docs)\n",
|
27 |
+
"doc_to_text": "query",
|
28 |
+
"doc_to_target": "gold",
|
29 |
+
"doc_to_choice": "choices",
|
30 |
+
"description": "description",
|
31 |
+
"target_delimiter": " ",
|
32 |
+
"fewshot_delimiter": "\n\n",
|
33 |
+
"num_fewshot": 5,
|
34 |
+
"metric_list": [
|
35 |
+
{
|
36 |
+
"metric": "acc",
|
37 |
+
"aggregation": "mean",
|
38 |
+
"higher_is_better": true
|
39 |
+
},
|
40 |
+
{
|
41 |
+
"metric": "acc_norm",
|
42 |
+
"aggregation": "mean",
|
43 |
+
"higher_is_better": true
|
44 |
+
}
|
45 |
+
],
|
46 |
+
"output_type": "multiple_choice",
|
47 |
+
"repeats": 1,
|
48 |
+
"should_decontaminate": true,
|
49 |
+
"doc_to_decontamination_query": "query",
|
50 |
+
"metadata": {
|
51 |
+
"version": 0.0
|
52 |
+
}
|
53 |
+
}
|
54 |
+
},
|
55 |
+
"versions": {
|
56 |
+
"exams_ar": 0.0
|
57 |
+
},
|
58 |
+
"n-shot": {
|
59 |
+
"exams_ar": 5
|
60 |
+
},
|
61 |
+
"higher_is_better": {
|
62 |
+
"exams_ar": {
|
63 |
+
"acc": true,
|
64 |
+
"acc_norm": true
|
65 |
+
}
|
66 |
+
},
|
67 |
+
"n-samples": {
|
68 |
+
"exams_ar": {
|
69 |
+
"original": 537,
|
70 |
+
"effective": 537
|
71 |
+
}
|
72 |
+
},
|
73 |
+
"config": {
|
74 |
+
"model": "hf",
|
75 |
+
"model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
|
76 |
+
"model_num_parameters": 7455550464,
|
77 |
+
"model_dtype": "torch.bfloat16",
|
78 |
+
"model_revision": "main",
|
79 |
+
"model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
|
80 |
+
"batch_size": 1,
|
81 |
+
"batch_sizes": [],
|
82 |
+
"device": null,
|
83 |
+
"use_cache": null,
|
84 |
+
"limit": null,
|
85 |
+
"bootstrap_iters": 100000,
|
86 |
+
"gen_kwargs": null,
|
87 |
+
"random_seed": 0,
|
88 |
+
"numpy_seed": 1234,
|
89 |
+
"torch_seed": 1234,
|
90 |
+
"fewshot_seed": 1234
|
91 |
+
},
|
92 |
+
"git_hash": "5e10e017",
|
93 |
+
"date": 1736889028.6416683,
|
94 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma 
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
95 |
+
"transformers_version": "4.48.0",
|
96 |
+
"upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
|
97 |
+
"tokenizer_pad_token": [
|
98 |
+
"<|pad|>",
|
99 |
+
"2023"
|
100 |
+
],
|
101 |
+
"tokenizer_eos_token": [
|
102 |
+
"<|endoftext|>",
|
103 |
+
"11"
|
104 |
+
],
|
105 |
+
"tokenizer_bos_token": [
|
106 |
+
null,
|
107 |
+
"None"
|
108 |
+
],
|
109 |
+
"eot_token_id": 11,
|
110 |
+
"max_length": 32768,
|
111 |
+
"task_hashes": {
|
112 |
+
"exams_ar": "f52ab3f14b240558420910fdb453ccb45c945cec187c0e60ea51cf6eff08973a"
|
113 |
+
},
|
114 |
+
"model_source": "hf",
|
115 |
+
"model_name": "tiiuae/Falcon3-7B-Instruct",
|
116 |
+
"model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
|
117 |
+
"system_instruction": null,
|
118 |
+
"system_instruction_sha": null,
|
119 |
+
"fewshot_as_multiturn": false,
|
120 |
+
"chat_template": null,
|
121 |
+
"chat_template_sha": null,
|
122 |
+
"start_time": 599279.04705073,
|
123 |
+
"end_time": 599692.233103212,
|
124 |
+
"total_evaluation_time_seconds": "413.1860524819931"
|
125 |
+
}
|
evaluations/ar/Falcon3-7B-Instruct/gat_0_shot.json
ADDED
@@ -0,0 +1,553 @@
|
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"gat": {
|
4 |
+
"acc,none": 0.27994481374639407,
|
5 |
+
"acc_stderr,none": 0.003542796359675536,
|
6 |
+
"alias": "gat"
|
7 |
+
},
|
8 |
+
"gat_algebra": {
|
9 |
+
"alias": " - gat_algebra",
|
10 |
+
"acc,none": 0.2571428571428571,
|
11 |
+
"acc_stderr,none": 0.008420562208967575
|
12 |
+
},
|
13 |
+
"gat_analogy": {
|
14 |
+
"alias": " - gat_analogy",
|
15 |
+
"acc,none": 0.24553734061930782,
|
16 |
+
"acc_stderr,none": 0.008216476082874105
|
17 |
+
},
|
18 |
+
"gat_arithmetic": {
|
19 |
+
"alias": " - gat_arithmetic",
|
20 |
+
"acc,none": 0.26573426573426573,
|
21 |
+
"acc_stderr,none": 0.008475894211016492
|
22 |
+
},
|
23 |
+
"gat_association": {
|
24 |
+
"alias": " - gat_association",
|
25 |
+
"acc,none": 0.24019138755980862,
|
26 |
+
"acc_stderr,none": 0.013221495215360054
|
27 |
+
},
|
28 |
+
"gat_comparisons": {
|
29 |
+
"alias": " - gat_comparisons",
|
30 |
+
"acc,none": 0.319672131147541,
|
31 |
+
"acc_stderr,none": 0.013357022766710734
|
32 |
+
},
|
33 |
+
"gat_completion": {
|
34 |
+
"alias": " - gat_completion",
|
35 |
+
"acc,none": 0.27520661157024795,
|
36 |
+
"acc_stderr,none": 0.012844683062506254
|
37 |
+
},
|
38 |
+
"gat_contextual": {
|
39 |
+
"alias": " - gat_contextual",
|
40 |
+
"acc,none": 0.26993865030674846,
|
41 |
+
"acc_stderr,none": 0.01229815625441917
|
42 |
+
},
|
43 |
+
"gat_geometry": {
|
44 |
+
"alias": " - gat_geometry",
|
45 |
+
"acc,none": 0.2876712328767123,
|
46 |
+
"acc_stderr,none": 0.023726723391354485
|
47 |
+
},
|
48 |
+
"gat_reading": {
|
49 |
+
"alias": " - gat_reading",
|
50 |
+
"acc,none": 0.3568998109640832,
|
51 |
+
"acc_stderr,none": 0.009317121354774414
|
52 |
+
}
|
53 |
+
},
|
54 |
+
"groups": {
|
55 |
+
"gat": {
|
56 |
+
"acc,none": 0.27994481374639407,
|
57 |
+
"acc_stderr,none": 0.003542796359675536,
|
58 |
+
"alias": "gat"
|
59 |
+
}
|
60 |
+
},
|
61 |
+
"group_subtasks": {
|
62 |
+
"gat": [
|
63 |
+
"gat_analogy",
|
64 |
+
"gat_association",
|
65 |
+
"gat_completion",
|
66 |
+
"gat_reading",
|
67 |
+
"gat_algebra",
|
68 |
+
"gat_arithmetic",
|
69 |
+
"gat_comparisons",
|
70 |
+
"gat_contextual",
|
71 |
+
"gat_geometry"
|
72 |
+
]
|
73 |
+
},
|
74 |
+
"configs": {
|
75 |
+
"gat_algebra": {
|
76 |
+
"task": "gat_algebra",
|
77 |
+
"dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
|
78 |
+
"dataset_name": "algebra",
|
79 |
+
"dataset_kwargs": {
|
80 |
+
"trust_remote_code": true
|
81 |
+
},
|
82 |
+
"test_split": "test",
|
83 |
+
"fewshot_split": "validation",
|
84 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
|
85 |
+
"doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
|
86 |
+
"doc_to_target": "{{label}}",
|
87 |
+
"doc_to_choice": [
|
88 |
+
"\u0623",
|
89 |
+
"\u0628",
|
90 |
+
"\u062c",
|
91 |
+
"\u062f"
|
92 |
+
],
|
93 |
+
"description": "",
|
94 |
+
"target_delimiter": " ",
|
95 |
+
"fewshot_delimiter": "\n\n",
|
96 |
+
"num_fewshot": 0,
|
97 |
+
"metric_list": [
|
98 |
+
{
|
99 |
+
"metric": "acc",
|
100 |
+
"aggregation": "mean",
|
101 |
+
"higher_is_better": true
|
102 |
+
}
|
103 |
+
],
|
104 |
+
"output_type": "multiple_choice",
|
105 |
+
"repeats": 1,
|
106 |
+
"should_decontaminate": false,
|
107 |
+
"metadata": {
|
108 |
+
"version": 0.0
|
109 |
+
}
|
110 |
+
},
|
111 |
+
"gat_analogy": {
|
112 |
+
"task": "gat_analogy",
|
113 |
+
"dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
|
114 |
+
"dataset_name": "analogy",
|
115 |
+
"dataset_kwargs": {
|
116 |
+
"trust_remote_code": true
|
117 |
+
},
|
118 |
+
"test_split": "test",
|
119 |
+
"fewshot_split": "validation",
|
120 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
|
121 |
+
"doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
|
122 |
+
"doc_to_target": "{{label}}",
|
123 |
+
"doc_to_choice": [
|
124 |
+
"\u0623",
|
125 |
+
"\u0628",
|
126 |
+
"\u062c",
|
127 |
+
"\u062f"
|
128 |
+
],
|
129 |
+
"description": "",
|
130 |
+
"target_delimiter": " ",
|
131 |
+
"fewshot_delimiter": "\n\n",
|
132 |
+
"num_fewshot": 0,
|
133 |
+
"metric_list": [
|
134 |
+
{
|
135 |
+
"metric": "acc",
|
136 |
+
"aggregation": "mean",
|
137 |
+
"higher_is_better": true
|
138 |
+
}
|
139 |
+
],
|
140 |
+
"output_type": "multiple_choice",
|
141 |
+
"repeats": 1,
|
142 |
+
"should_decontaminate": false,
|
143 |
+
"metadata": {
|
144 |
+
"version": 0.0
|
145 |
+
}
|
146 |
+
},
|
147 |
+
"gat_arithmetic": {
|
148 |
+
"task": "gat_arithmetic",
|
149 |
+
"dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
|
150 |
+
"dataset_name": "arithmetic",
|
151 |
+
"dataset_kwargs": {
|
152 |
+
"trust_remote_code": true
|
153 |
+
},
|
154 |
+
"test_split": "test",
|
155 |
+
"fewshot_split": "validation",
|
156 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
|
157 |
+
"doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
|
158 |
+
"doc_to_target": "{{label}}",
|
159 |
+
"doc_to_choice": [
|
160 |
+
"\u0623",
|
161 |
+
"\u0628",
|
162 |
+
"\u062c",
|
163 |
+
"\u062f"
|
164 |
+
],
|
165 |
+
"description": "",
|
166 |
+
"target_delimiter": " ",
|
167 |
+
"fewshot_delimiter": "\n\n",
|
168 |
+
"num_fewshot": 0,
|
169 |
+
"metric_list": [
|
170 |
+
{
|
171 |
+
"metric": "acc",
|
172 |
+
"aggregation": "mean",
|
173 |
+
"higher_is_better": true
|
174 |
+
}
|
175 |
+
],
|
176 |
+
"output_type": "multiple_choice",
|
177 |
+
"repeats": 1,
|
178 |
+
"should_decontaminate": false,
|
179 |
+
"metadata": {
|
180 |
+
"version": 0.0
|
181 |
+
}
|
182 |
+
},
|
183 |
+
"gat_association": {
|
184 |
+
"task": "gat_association",
|
185 |
+
"dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
|
186 |
+
"dataset_name": "association",
|
187 |
+
"dataset_kwargs": {
|
188 |
+
"trust_remote_code": true
|
189 |
+
},
|
190 |
+
"test_split": "test",
|
191 |
+
"fewshot_split": "validation",
|
192 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
|
193 |
+
"doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
|
194 |
+
"doc_to_target": "{{label}}",
|
195 |
+
"doc_to_choice": [
|
196 |
+
"\u0623",
|
197 |
+
"\u0628",
|
198 |
+
"\u062c",
|
199 |
+
"\u062f"
|
200 |
+
],
|
201 |
+
"description": "",
|
202 |
+
"target_delimiter": " ",
|
203 |
+
"fewshot_delimiter": "\n\n",
|
204 |
+
"num_fewshot": 0,
|
205 |
+
"metric_list": [
|
206 |
+
{
|
207 |
+
"metric": "acc",
|
208 |
+
"aggregation": "mean",
|
209 |
+
"higher_is_better": true
|
210 |
+
}
|
211 |
+
],
|
212 |
+
"output_type": "multiple_choice",
|
213 |
+
"repeats": 1,
|
214 |
+
"should_decontaminate": false,
|
215 |
+
"metadata": {
|
216 |
+
"version": 0.0
|
217 |
+
}
|
218 |
+
},
|
219 |
+
"gat_comparisons": {
|
220 |
+
"task": "gat_comparisons",
|
221 |
+
"dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
|
222 |
+
"dataset_name": "comparisons",
|
223 |
+
"dataset_kwargs": {
|
224 |
+
"trust_remote_code": true
|
225 |
+
},
|
226 |
+
"test_split": "test",
|
227 |
+
"fewshot_split": "validation",
|
228 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
|
229 |
+
"doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
|
230 |
+
"doc_to_target": "{{label}}",
|
231 |
+
"doc_to_choice": [
|
232 |
+
"\u0623",
|
233 |
+
"\u0628",
|
234 |
+
"\u062c",
|
235 |
+
"\u062f"
|
236 |
+
],
|
237 |
+
"description": "",
|
238 |
+
"target_delimiter": " ",
|
239 |
+
"fewshot_delimiter": "\n\n",
|
240 |
+
"num_fewshot": 0,
|
241 |
+
"metric_list": [
|
242 |
+
{
|
243 |
+
"metric": "acc",
|
244 |
+
"aggregation": "mean",
|
245 |
+
"higher_is_better": true
|
246 |
+
}
|
247 |
+
],
|
248 |
+
"output_type": "multiple_choice",
|
249 |
+
"repeats": 1,
|
250 |
+
"should_decontaminate": false,
|
251 |
+
"metadata": {
|
252 |
+
"version": 0.0
|
253 |
+
}
|
254 |
+
},
|
255 |
+
"gat_completion": {
|
256 |
+
"task": "gat_completion",
|
257 |
+
"dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
|
258 |
+
"dataset_name": "completion",
|
259 |
+
"dataset_kwargs": {
|
260 |
+
"trust_remote_code": true
|
261 |
+
},
|
262 |
+
"test_split": "test",
|
263 |
+
"fewshot_split": "validation",
|
264 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
|
265 |
+
"doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
|
266 |
+
"doc_to_target": "{{label}}",
|
267 |
+
"doc_to_choice": [
|
268 |
+
"\u0623",
|
269 |
+
"\u0628",
|
270 |
+
"\u062c",
|
271 |
+
"\u062f"
|
272 |
+
],
|
273 |
+
"description": "",
|
274 |
+
"target_delimiter": " ",
|
275 |
+
"fewshot_delimiter": "\n\n",
|
276 |
+
"num_fewshot": 0,
|
277 |
+
"metric_list": [
|
278 |
+
{
|
279 |
+
"metric": "acc",
|
280 |
+
"aggregation": "mean",
|
281 |
+
"higher_is_better": true
|
282 |
+
}
|
283 |
+
],
|
284 |
+
"output_type": "multiple_choice",
|
285 |
+
"repeats": 1,
|
286 |
+
"should_decontaminate": false,
|
287 |
+
"metadata": {
|
288 |
+
"version": 0.0
|
289 |
+
}
|
290 |
+
},
|
291 |
+
"gat_contextual": {
|
292 |
+
"task": "gat_contextual",
|
293 |
+
"dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
|
294 |
+
"dataset_name": "contextual",
|
295 |
+
"dataset_kwargs": {
|
296 |
+
"trust_remote_code": true
|
297 |
+
},
|
298 |
+
"test_split": "test",
|
299 |
+
"fewshot_split": "validation",
|
300 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
|
301 |
+
"doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
|
302 |
+
"doc_to_target": "{{label}}",
|
303 |
+
"doc_to_choice": [
|
304 |
+
"\u0623",
|
305 |
+
"\u0628",
|
306 |
+
"\u062c",
|
307 |
+
"\u062f"
|
308 |
+
],
|
309 |
+
"description": "\u0627\u0648\u062c\u062f \u0627\u0644\u062e\u0637\u0623 \u0627\u0644\u0633\u064a\u0627\u0642\u064a \u0641\u064a \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0627\u0644\u062a\u0627\u0644\u064a\u0629 \u0645\u0646 \u0628\u064a\u0646 \u0627\u0644\u062e\u064a\u0627\u0631\u0627\u062a:",
|
310 |
+
"target_delimiter": " ",
|
311 |
+
"fewshot_delimiter": "\n\n",
|
312 |
+
"num_fewshot": 0,
|
313 |
+
"metric_list": [
|
314 |
+
{
|
315 |
+
"metric": "acc",
|
316 |
+
"aggregation": "mean",
|
317 |
+
"higher_is_better": true
|
318 |
+
}
|
319 |
+
],
|
320 |
+
"output_type": "multiple_choice",
|
321 |
+
"repeats": 1,
|
322 |
+
"should_decontaminate": false,
|
323 |
+
"metadata": {
|
324 |
+
"version": 0.0
|
325 |
+
}
|
326 |
+
},
|
327 |
+
"gat_geometry": {
|
328 |
+
"task": "gat_geometry",
|
329 |
+
"dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
|
330 |
+
"dataset_name": "geometry",
|
331 |
+
"dataset_kwargs": {
|
332 |
+
"trust_remote_code": true
|
333 |
+
},
|
334 |
+
"test_split": "test",
|
335 |
+
"fewshot_split": "validation",
|
336 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
|
337 |
+
"doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
|
338 |
+
"doc_to_target": "{{label}}",
|
339 |
+
"doc_to_choice": [
|
340 |
+
"\u0623",
|
341 |
+
"\u0628",
|
342 |
+
"\u062c",
|
343 |
+
"\u062f"
|
344 |
+
],
|
345 |
+
"description": "",
|
346 |
+
"target_delimiter": " ",
|
347 |
+
"fewshot_delimiter": "\n\n",
|
348 |
+
"num_fewshot": 0,
|
349 |
+
"metric_list": [
|
350 |
+
{
|
351 |
+
"metric": "acc",
|
352 |
+
"aggregation": "mean",
|
353 |
+
"higher_is_better": true
|
354 |
+
}
|
355 |
+
],
|
356 |
+
"output_type": "multiple_choice",
|
357 |
+
"repeats": 1,
|
358 |
+
"should_decontaminate": false,
|
359 |
+
"metadata": {
|
360 |
+
"version": 0.0
|
361 |
+
}
|
362 |
+
},
|
363 |
+
"gat_reading": {
|
364 |
+
"task": "gat_reading",
|
365 |
+
"dataset_path": "lm_eval/tasks/gat/gat_data/gat.py",
|
366 |
+
"dataset_name": "reading",
|
367 |
+
"dataset_kwargs": {
|
368 |
+
"trust_remote_code": true
|
369 |
+
},
|
370 |
+
"test_split": "test",
|
371 |
+
"fewshot_split": "validation",
|
372 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n # def _process_doc(doc):\n \n # subject = doc['id'].split(\"-\")[0]\n # subject_ar = subtasks_ar[subtasks.index(subject)]\n # out_doc = {**doc, 'subject_ar': subject_ar}\n # print(subject_ar)\n # print(out_doc)\n # return out_doc\n\n return dataset\n",
|
373 |
+
"doc_to_text": "{{question}}\n\u0623. {{choices[0]}}\n\u0628. {{choices[1]}}\n\u062c. {{choices[2]}}\n\u062f. {{choices[3]}}\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:",
|
374 |
+
"doc_to_target": "{{label}}",
|
375 |
+
"doc_to_choice": [
|
376 |
+
"\u0623",
|
377 |
+
"\u0628",
|
378 |
+
"\u062c",
|
379 |
+
"\u062f"
|
380 |
+
],
|
381 |
+
"description": "",
|
382 |
+
"target_delimiter": " ",
|
383 |
+
"fewshot_delimiter": "\n\n",
|
384 |
+
"num_fewshot": 0,
|
385 |
+
"metric_list": [
|
386 |
+
{
|
387 |
+
"metric": "acc",
|
388 |
+
"aggregation": "mean",
|
389 |
+
"higher_is_better": true
|
390 |
+
}
|
391 |
+
],
|
392 |
+
"output_type": "multiple_choice",
|
393 |
+
"repeats": 1,
|
394 |
+
"should_decontaminate": false,
|
395 |
+
"metadata": {
|
396 |
+
"version": 0.0
|
397 |
+
}
|
398 |
+
}
|
399 |
+
},
|
400 |
+
"versions": {
|
401 |
+
"gat": 0,
|
402 |
+
"gat_algebra": 0.0,
|
403 |
+
"gat_analogy": 0.0,
|
404 |
+
"gat_arithmetic": 0.0,
|
405 |
+
"gat_association": 0.0,
|
406 |
+
"gat_comparisons": 0.0,
|
407 |
+
"gat_completion": 0.0,
|
408 |
+
"gat_contextual": 0.0,
|
409 |
+
"gat_geometry": 0.0,
|
410 |
+
"gat_reading": 0.0
|
411 |
+
},
|
412 |
+
"n-shot": {
|
413 |
+
"gat_algebra": 0,
|
414 |
+
"gat_analogy": 0,
|
415 |
+
"gat_arithmetic": 0,
|
416 |
+
"gat_association": 0,
|
417 |
+
"gat_comparisons": 0,
|
418 |
+
"gat_completion": 0,
|
419 |
+
"gat_contextual": 0,
|
420 |
+
"gat_geometry": 0,
|
421 |
+
"gat_reading": 0
|
422 |
+
},
|
423 |
+
"higher_is_better": {
|
424 |
+
"gat": {
|
425 |
+
"acc": true
|
426 |
+
},
|
427 |
+
"gat_algebra": {
|
428 |
+
"acc": true
|
429 |
+
},
|
430 |
+
"gat_analogy": {
|
431 |
+
"acc": true
|
432 |
+
},
|
433 |
+
"gat_arithmetic": {
|
434 |
+
"acc": true
|
435 |
+
},
|
436 |
+
"gat_association": {
|
437 |
+
"acc": true
|
438 |
+
},
|
439 |
+
"gat_comparisons": {
|
440 |
+
"acc": true
|
441 |
+
},
|
442 |
+
"gat_completion": {
|
443 |
+
"acc": true
|
444 |
+
},
|
445 |
+
"gat_contextual": {
|
446 |
+
"acc": true
|
447 |
+
},
|
448 |
+
"gat_geometry": {
|
449 |
+
"acc": true
|
450 |
+
},
|
451 |
+
"gat_reading": {
|
452 |
+
"acc": true
|
453 |
+
}
|
454 |
+
},
|
455 |
+
"n-samples": {
|
456 |
+
"gat_analogy": {
|
457 |
+
"original": 2745,
|
458 |
+
"effective": 2745
|
459 |
+
},
|
460 |
+
"gat_association": {
|
461 |
+
"original": 1045,
|
462 |
+
"effective": 1045
|
463 |
+
},
|
464 |
+
"gat_completion": {
|
465 |
+
"original": 1210,
|
466 |
+
"effective": 1210
|
467 |
+
},
|
468 |
+
"gat_reading": {
|
469 |
+
"original": 2645,
|
470 |
+
"effective": 2645
|
471 |
+
},
|
472 |
+
"gat_algebra": {
|
473 |
+
"original": 2695,
|
474 |
+
"effective": 2695
|
475 |
+
},
|
476 |
+
"gat_arithmetic": {
|
477 |
+
"original": 2717,
|
478 |
+
"effective": 2717
|
479 |
+
},
|
480 |
+
"gat_comparisons": {
|
481 |
+
"original": 1220,
|
482 |
+
"effective": 1220
|
483 |
+
},
|
484 |
+
"gat_contextual": {
|
485 |
+
"original": 1304,
|
486 |
+
"effective": 1304
|
487 |
+
},
|
488 |
+
"gat_geometry": {
|
489 |
+
"original": 365,
|
490 |
+
"effective": 365
|
491 |
+
}
|
492 |
+
},
|
493 |
+
"config": {
|
494 |
+
"model": "hf",
|
495 |
+
"model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
|
496 |
+
"model_num_parameters": 7455550464,
|
497 |
+
"model_dtype": "torch.bfloat16",
|
498 |
+
"model_revision": "main",
|
499 |
+
"model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
|
500 |
+
"batch_size": 1,
|
501 |
+
"batch_sizes": [],
|
502 |
+
"device": null,
|
503 |
+
"use_cache": null,
|
504 |
+
"limit": null,
|
505 |
+
"bootstrap_iters": 100000,
|
506 |
+
"gen_kwargs": null,
|
507 |
+
"random_seed": 0,
|
508 |
+
"numpy_seed": 1234,
|
509 |
+
"torch_seed": 1234,
|
510 |
+
"fewshot_seed": 1234
|
511 |
+
},
|
512 |
+
"git_hash": "5e10e017",
|
513 |
+
"date": 1736891004.0192773,
|
514 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma 
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
515 |
+
"transformers_version": "4.48.0",
|
516 |
+
"upper_git_hash": "f64fe2f2a86055aaecced603b56097fd79201711",
|
517 |
+
"tokenizer_pad_token": [
|
518 |
+
"<|pad|>",
|
519 |
+
"2023"
|
520 |
+
],
|
521 |
+
"tokenizer_eos_token": [
|
522 |
+
"<|endoftext|>",
|
523 |
+
"11"
|
524 |
+
],
|
525 |
+
"tokenizer_bos_token": [
|
526 |
+
null,
|
527 |
+
"None"
|
528 |
+
],
|
529 |
+
"eot_token_id": 11,
|
530 |
+
"max_length": 32768,
|
531 |
+
"task_hashes": {
|
532 |
+
"gat_analogy": "04ac010c48ed039457058b512b7ac0586c7c76a628da7caaf9aeb8f3e99ae5e3",
|
533 |
+
"gat_association": "2cbd868d220125bfcc54ae738592ad902191e4b7f804ce1772ae29e2d3bb3bf6",
|
534 |
+
"gat_completion": "74cf159ef4a3455a6a0e984fed8e9e9a12f0dc21fde95c2058216c5a711a4d31",
|
535 |
+
"gat_reading": "6f21934e536e7dca65361d01e5cafc27f8070c4f0dccf5a88c1fe071194b78a4",
|
536 |
+
"gat_algebra": "20750c926608570eaf87d29981e5ab49b2b097bd52d7f749c44ab4e175d9fdd2",
|
537 |
+
"gat_arithmetic": "c4b0c73c269d9eb3e8482fbda42e69191c28b95e75e1517d5f9142c6ef410204",
|
538 |
+
"gat_comparisons": "88bc22db186a50cab28938ec1fc332366fa0bc886bc98edf810cc9ae938405db",
|
539 |
+
"gat_contextual": "b8e88ff29b62b54eb834dca696304ca0fe1ce55d5cf7d0a9f0204456e3955be6",
|
540 |
+
"gat_geometry": "229545188469d0512a3297737f4ec7afe88d8a30e7e04f87b4982548e83b1e56"
|
541 |
+
},
|
542 |
+
"model_source": "hf",
|
543 |
+
"model_name": "tiiuae/Falcon3-7B-Instruct",
|
544 |
+
"model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
|
545 |
+
"system_instruction": null,
|
546 |
+
"system_instruction_sha": null,
|
547 |
+
"fewshot_as_multiturn": false,
|
548 |
+
"chat_template": null,
|
549 |
+
"chat_template_sha": null,
|
550 |
+
"start_time": 601254.206185867,
|
551 |
+
"end_time": 601373.470204397,
|
552 |
+
"total_evaluation_time_seconds": "119.26401853002608"
|
553 |
+
}
|
evaluations/ar/Falcon3-7B-Instruct/moe_ien_mcq_0_shot.json
ADDED
@@ -0,0 +1,127 @@
|
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"moe_ien_mcq": {
|
4 |
+
"alias": "moe_ien_mcq",
|
5 |
+
"acc,none": 0.5265265265265265,
|
6 |
+
"acc_stderr,none": 0.004995706870392996,
|
7 |
+
"acc_norm,none": 0.5265265265265265,
|
8 |
+
"acc_norm_stderr,none": 0.004995706870392996
|
9 |
+
}
|
10 |
+
},
|
11 |
+
"group_subtasks": {
|
12 |
+
"moe_ien_mcq": []
|
13 |
+
},
|
14 |
+
"configs": {
|
15 |
+
"moe_ien_mcq": {
|
16 |
+
"task": "moe_ien_mcq",
|
17 |
+
"dataset_path": "lm_eval/tasks/moe_ien_mcq/ien_moe_mcq.py",
|
18 |
+
"dataset_name": "moe_ien_mcq",
|
19 |
+
"dataset_kwargs": {
|
20 |
+
"trust_remote_code": true
|
21 |
+
},
|
22 |
+
"validation_split": "validation",
|
23 |
+
"test_split": "test",
|
24 |
+
"fewshot_split": "validation",
|
25 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.split(\". \", 1)[1] if \". \" in choice else choice\n\n def format_example(doc, keys):\n question = doc[\"Question\"].strip()\n \n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"Choices\"])]\n \n )\n prompt = f\"\\n\\n\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys = [\"A\", \"B\", \"C\", \"D\", \"E\", \"F\"][0:len(doc[\"Choices\"])]\n out_doc = {\n \"Query\": format_example(doc, keys), \n \"Choices\": keys,\n \"gold\": int(doc[\"Answer\"])-1, ## \n } \n return out_doc\n \n return dataset.map(_process_docs)\n",
|
26 |
+
"doc_to_text": "Query",
|
27 |
+
"doc_to_target": "gold",
|
28 |
+
"doc_to_choice": "{{Choices}}",
|
29 |
+
"description": "\u0641\u064a\u0645\u0627\u202f\u064a\u0644\u064a\u202f\u0623\u0633\u0626\u0644\u0629\u202f\u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631\u202f\u0645\u0646\u202f\u0645\u062a\u0639\u062f\u062f\u202f(\u0645\u0639\u202f\u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a)\u202f\u0641\u064a\u202f{{Subject}}",
|
30 |
+
"target_delimiter": " ",
|
31 |
+
"fewshot_delimiter": "\n\n",
|
32 |
+
"fewshot_config": {
|
33 |
+
"sampler": "balanced_cat"
|
34 |
+
},
|
35 |
+
"num_fewshot": 0,
|
36 |
+
"metric_list": [
|
37 |
+
{
|
38 |
+
"metric": "acc",
|
39 |
+
"aggregation": "mean",
|
40 |
+
"higher_is_better": true
|
41 |
+
},
|
42 |
+
{
|
43 |
+
"metric": "acc_norm",
|
44 |
+
"aggregation": "mean",
|
45 |
+
"higher_is_better": true
|
46 |
+
}
|
47 |
+
],
|
48 |
+
"output_type": "multiple_choice",
|
49 |
+
"repeats": 1,
|
50 |
+
"should_decontaminate": true,
|
51 |
+
"doc_to_decontamination_query": "Query",
|
52 |
+
"metadata": {
|
53 |
+
"version": 0.0
|
54 |
+
}
|
55 |
+
}
|
56 |
+
},
|
57 |
+
"versions": {
|
58 |
+
"moe_ien_mcq": 0.0
|
59 |
+
},
|
60 |
+
"n-shot": {
|
61 |
+
"moe_ien_mcq": 0
|
62 |
+
},
|
63 |
+
"higher_is_better": {
|
64 |
+
"moe_ien_mcq": {
|
65 |
+
"acc": true,
|
66 |
+
"acc_norm": true
|
67 |
+
}
|
68 |
+
},
|
69 |
+
"n-samples": {
|
70 |
+
"moe_ien_mcq": {
|
71 |
+
"original": 9990,
|
72 |
+
"effective": 9990
|
73 |
+
}
|
74 |
+
},
|
75 |
+
"config": {
|
76 |
+
"model": "hf",
|
77 |
+
"model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
|
78 |
+
"model_num_parameters": 7455550464,
|
79 |
+
"model_dtype": "torch.bfloat16",
|
80 |
+
"model_revision": "main",
|
81 |
+
"model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
|
82 |
+
"batch_size": 1,
|
83 |
+
"batch_sizes": [],
|
84 |
+
"device": null,
|
85 |
+
"use_cache": null,
|
86 |
+
"limit": null,
|
87 |
+
"bootstrap_iters": 100000,
|
88 |
+
"gen_kwargs": null,
|
89 |
+
"random_seed": 0,
|
90 |
+
"numpy_seed": 1234,
|
91 |
+
"torch_seed": 1234,
|
92 |
+
"fewshot_seed": 1234
|
93 |
+
},
|
94 |
+
"git_hash": "b955b2950",
|
95 |
+
"date": 1739620378.768502,
|
96 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma 
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
97 |
+
"transformers_version": "4.48.3",
|
98 |
+
"upper_git_hash": null,
|
99 |
+
"tokenizer_pad_token": [
|
100 |
+
"<|pad|>",
|
101 |
+
"2023"
|
102 |
+
],
|
103 |
+
"tokenizer_eos_token": [
|
104 |
+
"<|endoftext|>",
|
105 |
+
"11"
|
106 |
+
],
|
107 |
+
"tokenizer_bos_token": [
|
108 |
+
null,
|
109 |
+
"None"
|
110 |
+
],
|
111 |
+
"eot_token_id": 11,
|
112 |
+
"max_length": 32768,
|
113 |
+
"task_hashes": {
|
114 |
+
"moe_ien_mcq": "1ae93edb904d572143b5f36dd5dfcc4b901240916d4735ea328083598c912446"
|
115 |
+
},
|
116 |
+
"model_source": "hf",
|
117 |
+
"model_name": "tiiuae/Falcon3-7B-Instruct",
|
118 |
+
"model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
|
119 |
+
"system_instruction": null,
|
120 |
+
"system_instruction_sha": null,
|
121 |
+
"fewshot_as_multiturn": false,
|
122 |
+
"chat_template": "{%- if tools %}\n{{- '<|system|>\\n' }}\n{%- if messages[0]['role'] == 'system' %}\n{{- messages[0]['content'] }}\n{%- set remaining_messages = messages[1:] %}\n{%- else %}\n{%- set remaining_messages = messages %}\n{%- endif %}\n{{- 'You are a Falcon assistant skilled in function calling. You are helpful, respectful, and concise.\\n\\n# Tools\\n\\nYou have access to the following functions. You MUST use them to answer questions when needed. For each function call, you MUST return a JSON object inside <tool_call></tool_call> tags.\\n\\n<tools>' + tools|tojson(indent=2) + '</tools>\\n\\n# Output Format\\n\\nYour response MUST follow this format when making function calls:\\n<tool_call>\\n[\\n {\"name\": \"function_name\", \"arguments\": {\"arg1\": \"value1\", \"arg2\": \"value2\"}},\\n {\"name\": \"another_function\", \"arguments\": {\"arg\": \"value\"}}\\n]\\n</tool_call>\\nIf no function calls are needed, respond normally without the tool_call tags.\\n' }}\n{%- for message in remaining_messages %}\n{%- if message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if message.content %}\n{{- '<|assistant|>\\n' + message['content'] }}\n{%- endif %}\n{%- if message.tool_calls %}\n{{- '\\n<tool_call>\\n' }}\n{{- message.tool_calls|tojson(indent=2) }}\n{{- '\\n</tool_call>' }}\n{%- endif %}\n{{- eos_token + '\\n' }}\n{%- elif message['role'] == 'tool' %}\n{{- '<|assistant|>\\n<tool_response>\\n' + message['content'] + '\\n</tool_response>\\n' }}\n{%- endif %}\n{%- endfor %}\n{{- '<|assistant|>\\n' if add_generation_prompt }}\n{%- else %}\n{%- for message in messages %}\n{%- if message['role'] == 'system' %}\n{{- '<|system|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if not loop.last %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token + '\\n' 
}}\n{%- else %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token }}\n{%- endif %}\n{%- endif %}\n{%- if loop.last and add_generation_prompt %}\n{{- '<|assistant|>\\n' }}\n{%- endif %}\n{%- endfor %}\n{%- endif %}",
|
123 |
+
"chat_template_sha": "914ccd80356f5822d1a50d97546e37f60c04ed831fe431aa40346574ec266901",
|
124 |
+
"start_time": 1395061.894176973,
|
125 |
+
"end_time": 1395336.684131379,
|
126 |
+
"total_evaluation_time_seconds": "274.78995440597646"
|
127 |
+
}
|
evaluations/ar/Falcon3-7B-Instruct/moe_ien_tf_0_shot.json
ADDED
@@ -0,0 +1,129 @@
|
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"moe_ien_tf": {
|
4 |
+
"alias": "moe_ien_tf",
|
5 |
+
"acc,none": 0.576335222393955,
|
6 |
+
"acc_stderr,none": 0.006476086786980228,
|
7 |
+
"acc_norm,none": 0.576335222393955,
|
8 |
+
"acc_norm_stderr,none": 0.006476086786980228
|
9 |
+
}
|
10 |
+
},
|
11 |
+
"group_subtasks": {
|
12 |
+
"moe_ien_tf": []
|
13 |
+
},
|
14 |
+
"configs": {
|
15 |
+
"moe_ien_tf": {
|
16 |
+
"task": "moe_ien_tf",
|
17 |
+
"tag": [
|
18 |
+
"multiple_choice"
|
19 |
+
],
|
20 |
+
"dataset_path": "lm_eval/tasks/moe_ien_tf/moe_ien_tf.py",
|
21 |
+
"dataset_name": "moe_ien_tf",
|
22 |
+
"dataset_kwargs": {
|
23 |
+
"trust_remote_code": true
|
24 |
+
},
|
25 |
+
"validation_split": "validation",
|
26 |
+
"test_split": "test",
|
27 |
+
"fewshot_split": "validation",
|
28 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n keys=[\"\u0635\u062d\u064a\u062d\u0629\",\n \"\u062e\u0627\u0637\u0626\u0629\"\n ]\n #keys =[\"\u0635\u0648\u0627\u0628\",\n # \"\u062e\u0637\u0623\"]\n target_key = int(doc[\"Answer\"])-1\n\n out_doc = {\n \"query\": \"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" +doc[\"Question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\", \n \"choices\": keys,\n \"gold\": target_key,\n }\n return out_doc\n return dataset.map(_process_docs)\n",
|
29 |
+
"doc_to_text": "query",
|
30 |
+
"doc_to_target": "gold",
|
31 |
+
"doc_to_choice": "choices",
|
32 |
+
"description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{Subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d\u064a\u062d\u0629' \u0623\u0648 '\u062e\u0627\u0637\u0626\u0629' \u062f\u0648\u0646 \u0634\u0631\u062d ",
|
33 |
+
"target_delimiter": " ",
|
34 |
+
"fewshot_delimiter": "\n\n",
|
35 |
+
"fewshot_config": {
|
36 |
+
"sampler": "balanced_cat"
|
37 |
+
},
|
38 |
+
"num_fewshot": 0,
|
39 |
+
"metric_list": [
|
40 |
+
{
|
41 |
+
"metric": "acc",
|
42 |
+
"aggregation": "mean",
|
43 |
+
"higher_is_better": true
|
44 |
+
},
|
45 |
+
{
|
46 |
+
"metric": "acc_norm",
|
47 |
+
"aggregation": "mean",
|
48 |
+
"higher_is_better": true
|
49 |
+
}
|
50 |
+
],
|
51 |
+
"output_type": "multiple_choice",
|
52 |
+
"repeats": 1,
|
53 |
+
"should_decontaminate": false,
|
54 |
+
"metadata": {
|
55 |
+
"version": 2.0
|
56 |
+
}
|
57 |
+
}
|
58 |
+
},
|
59 |
+
"versions": {
|
60 |
+
"moe_ien_tf": 2.0
|
61 |
+
},
|
62 |
+
"n-shot": {
|
63 |
+
"moe_ien_tf": 0
|
64 |
+
},
|
65 |
+
"higher_is_better": {
|
66 |
+
"moe_ien_tf": {
|
67 |
+
"acc": true,
|
68 |
+
"acc_norm": true
|
69 |
+
}
|
70 |
+
},
|
71 |
+
"n-samples": {
|
72 |
+
"moe_ien_tf": {
|
73 |
+
"original": 5823,
|
74 |
+
"effective": 5823
|
75 |
+
}
|
76 |
+
},
|
77 |
+
"config": {
|
78 |
+
"model": "hf",
|
79 |
+
"model_args": "pretrained=tiiuae/Falcon3-7B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
|
80 |
+
"model_num_parameters": 7455550464,
|
81 |
+
"model_dtype": "torch.bfloat16",
|
82 |
+
"model_revision": "main",
|
83 |
+
"model_sha": "5563a370c1848366c7a095bde4bbff2cdb419cc6",
|
84 |
+
"batch_size": 1,
|
85 |
+
"batch_sizes": [],
|
86 |
+
"device": null,
|
87 |
+
"use_cache": null,
|
88 |
+
"limit": null,
|
89 |
+
"bootstrap_iters": 100000,
|
90 |
+
"gen_kwargs": null,
|
91 |
+
"random_seed": 0,
|
92 |
+
"numpy_seed": 1234,
|
93 |
+
"torch_seed": 1234,
|
94 |
+
"fewshot_seed": 1234
|
95 |
+
},
|
96 |
+
"git_hash": "b955b2950",
|
97 |
+
"date": 1739620722.9521024,
|
98 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.88\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma 
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
99 |
+
"transformers_version": "4.48.3",
|
100 |
+
"upper_git_hash": null,
|
101 |
+
"tokenizer_pad_token": [
|
102 |
+
"<|pad|>",
|
103 |
+
"2023"
|
104 |
+
],
|
105 |
+
"tokenizer_eos_token": [
|
106 |
+
"<|endoftext|>",
|
107 |
+
"11"
|
108 |
+
],
|
109 |
+
"tokenizer_bos_token": [
|
110 |
+
null,
|
111 |
+
"None"
|
112 |
+
],
|
113 |
+
"eot_token_id": 11,
|
114 |
+
"max_length": 32768,
|
115 |
+
"task_hashes": {
|
116 |
+
"moe_ien_tf": "ed81617ccb178d095c9a81fef15f5ba8b655782b26d36117f53c38b0a84e62e5"
|
117 |
+
},
|
118 |
+
"model_source": "hf",
|
119 |
+
"model_name": "tiiuae/Falcon3-7B-Instruct",
|
120 |
+
"model_name_sanitized": "tiiuae__Falcon3-7B-Instruct",
|
121 |
+
"system_instruction": null,
|
122 |
+
"system_instruction_sha": null,
|
123 |
+
"fewshot_as_multiturn": false,
|
124 |
+
"chat_template": "{%- if tools %}\n{{- '<|system|>\\n' }}\n{%- if messages[0]['role'] == 'system' %}\n{{- messages[0]['content'] }}\n{%- set remaining_messages = messages[1:] %}\n{%- else %}\n{%- set remaining_messages = messages %}\n{%- endif %}\n{{- 'You are a Falcon assistant skilled in function calling. You are helpful, respectful, and concise.\\n\\n# Tools\\n\\nYou have access to the following functions. You MUST use them to answer questions when needed. For each function call, you MUST return a JSON object inside <tool_call></tool_call> tags.\\n\\n<tools>' + tools|tojson(indent=2) + '</tools>\\n\\n# Output Format\\n\\nYour response MUST follow this format when making function calls:\\n<tool_call>\\n[\\n {\"name\": \"function_name\", \"arguments\": {\"arg1\": \"value1\", \"arg2\": \"value2\"}},\\n {\"name\": \"another_function\", \"arguments\": {\"arg\": \"value\"}}\\n]\\n</tool_call>\\nIf no function calls are needed, respond normally without the tool_call tags.\\n' }}\n{%- for message in remaining_messages %}\n{%- if message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if message.content %}\n{{- '<|assistant|>\\n' + message['content'] }}\n{%- endif %}\n{%- if message.tool_calls %}\n{{- '\\n<tool_call>\\n' }}\n{{- message.tool_calls|tojson(indent=2) }}\n{{- '\\n</tool_call>' }}\n{%- endif %}\n{{- eos_token + '\\n' }}\n{%- elif message['role'] == 'tool' %}\n{{- '<|assistant|>\\n<tool_response>\\n' + message['content'] + '\\n</tool_response>\\n' }}\n{%- endif %}\n{%- endfor %}\n{{- '<|assistant|>\\n' if add_generation_prompt }}\n{%- else %}\n{%- for message in messages %}\n{%- if message['role'] == 'system' %}\n{{- '<|system|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'user' %}\n{{- '<|user|>\\n' + message['content'] + '\\n' }}\n{%- elif message['role'] == 'assistant' %}\n{%- if not loop.last %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token + '\\n' 
}}\n{%- else %}\n{{- '<|assistant|>\\n' + message['content'] + eos_token }}\n{%- endif %}\n{%- endif %}\n{%- if loop.last and add_generation_prompt %}\n{{- '<|assistant|>\\n' }}\n{%- endif %}\n{%- endfor %}\n{%- endif %}",
|
125 |
+
"chat_template_sha": "914ccd80356f5822d1a50d97546e37f60c04ed831fe431aa40346574ec266901",
|
126 |
+
"start_time": 1395406.00589162,
|
127 |
+
"end_time": 1395704.54657667,
|
128 |
+
"total_evaluation_time_seconds": "298.54068504995666"
|
129 |
+
}
|
evaluations/ar/Falcon3-7B-Instruct/openaimmlu_0_shot.json
ADDED
The diff for this file is too large to render.
See raw diff
|
evaluations/ar/Llama-3.3-70B-Instruct/acva_5_shot.json
ADDED
@@ -0,0 +1,125 @@
|
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"acva": {
|
4 |
+
"alias": "acva",
|
5 |
+
"acc,none": 0.7847301951779564,
|
6 |
+
"acc_stderr,none": 0.004404205705558861,
|
7 |
+
"acc_norm,none": 0.769345579793341,
|
8 |
+
"acc_norm_stderr,none": 0.004513957617295361
|
9 |
+
}
|
10 |
+
},
|
11 |
+
"group_subtasks": {
|
12 |
+
"acva": []
|
13 |
+
},
|
14 |
+
"configs": {
|
15 |
+
"acva": {
|
16 |
+
"task": "acva",
|
17 |
+
"tag": [
|
18 |
+
"multiple_choice"
|
19 |
+
],
|
20 |
+
"dataset_path": "FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment",
|
21 |
+
"dataset_kwargs": {
|
22 |
+
"trust_remote_code": true
|
23 |
+
},
|
24 |
+
"validation_split": "validation",
|
25 |
+
"test_split": "test",
|
26 |
+
"fewshot_split": "validation",
|
27 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _format_subject(subject):\n \n arabic_words = subtasks_ar[subtasks.index(subject)]\n return arabic_words\n \n def _generate_subject(doc):\n subject = _format_subject(doc[\"id\"].split(\"-\")[0])\n\n return subject\n \n def _process_docs(doc):\n keys = [\"\u0635\u062d\",\n \"\u062e\u0637\u0623\"]\n subject = _generate_subject(doc)\n gold = keys.index(doc['answer'])\n out_doc = {\n \"id\": doc[\"id\"],\n \"query\": \"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644:\" + doc[\"question\"]+\"\\n\u0625\u062c\u0627\u0628\u0629:'\",\n \"choices\": keys,\n \"gold\": gold,\n \"subject\": subject,\n }\n \n return out_doc\n\n return dataset.map(_process_docs)\n",
|
28 |
+
"doc_to_text": "query",
|
29 |
+
"doc_to_target": "gold",
|
30 |
+
"doc_to_choice": "choices",
|
31 |
+
"description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0639\u0628\u0627\u0631\u0627\u062a \u0625\u0645\u0627 \u0635\u062d\u064a\u062d\u0629 \u0623\u0648 \u062e\u0627\u0637\u0626\u0629 \u062d\u0648\u0644 {{subject}}\n \u0627\u0644\u0631\u062c\u0627\u0621 \u062a\u0635\u0646\u064a\u0641 \u0627\u0644\u0639\u0628\u0627\u0631\u0629 \u0625\u0644\u0649 '\u0635\u062d' \u0623\u0648 '\u062e\u0637\u0623' \u062f\u0648\u0646 \u0634\u0631\u062d",
|
32 |
+
"target_delimiter": " ",
|
33 |
+
"fewshot_delimiter": "\n\n",
|
34 |
+
"num_fewshot": 5,
|
35 |
+
"metric_list": [
|
36 |
+
{
|
37 |
+
"metric": "acc",
|
38 |
+
"aggregation": "mean",
|
39 |
+
"higher_is_better": true
|
40 |
+
},
|
41 |
+
{
|
42 |
+
"metric": "acc_norm",
|
43 |
+
"aggregation": "mean",
|
44 |
+
"higher_is_better": true
|
45 |
+
}
|
46 |
+
],
|
47 |
+
"output_type": "multiple_choice",
|
48 |
+
"repeats": 1,
|
49 |
+
"should_decontaminate": false,
|
50 |
+
"metadata": {
|
51 |
+
"version": 1.0
|
52 |
+
}
|
53 |
+
}
|
54 |
+
},
|
55 |
+
"versions": {
|
56 |
+
"acva": 1.0
|
57 |
+
},
|
58 |
+
"n-shot": {
|
59 |
+
"acva": 5
|
60 |
+
},
|
61 |
+
"higher_is_better": {
|
62 |
+
"acva": {
|
63 |
+
"acc": true,
|
64 |
+
"acc_norm": true
|
65 |
+
}
|
66 |
+
},
|
67 |
+
"n-samples": {
|
68 |
+
"acva": {
|
69 |
+
"original": 8710,
|
70 |
+
"effective": 8710
|
71 |
+
}
|
72 |
+
},
|
73 |
+
"config": {
|
74 |
+
"model": "hf",
|
75 |
+
"model_args": "pretrained=meta-llama/Llama-3.3-70B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
|
76 |
+
"model_num_parameters": 70553706496,
|
77 |
+
"model_dtype": "torch.bfloat16",
|
78 |
+
"model_revision": "main",
|
79 |
+
"model_sha": "6f6073b423013f6a7d4d9f39144961bfbfbc386b",
|
80 |
+
"batch_size": "auto",
|
81 |
+
"batch_sizes": [
|
82 |
+
64
|
83 |
+
],
|
84 |
+
"device": null,
|
85 |
+
"use_cache": null,
|
86 |
+
"limit": null,
|
87 |
+
"bootstrap_iters": 100000,
|
88 |
+
"gen_kwargs": null,
|
89 |
+
"random_seed": 0,
|
90 |
+
"numpy_seed": 1234,
|
91 |
+
"torch_seed": 1234,
|
92 |
+
"fewshot_seed": 1234
|
93 |
+
},
|
94 |
+
"git_hash": "788a3672",
|
95 |
+
"date": 1737861513.0031924,
|
96 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma 
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
97 |
+
"transformers_version": "4.48.1",
|
98 |
+
"upper_git_hash": "086919bd66f4e15fdcd4b792a7b27a698c1ba091",
|
99 |
+
"tokenizer_pad_token": [
|
100 |
+
"<|finetune_right_pad_id|>",
|
101 |
+
"128004"
|
102 |
+
],
|
103 |
+
"tokenizer_eos_token": [
|
104 |
+
"<|eot_id|>",
|
105 |
+
"128009"
|
106 |
+
],
|
107 |
+
"tokenizer_bos_token": [
|
108 |
+
"<|begin_of_text|>",
|
109 |
+
"128000"
|
110 |
+
],
|
111 |
+
"eot_token_id": 128009,
|
112 |
+
"max_length": 131072,
|
113 |
+
"task_hashes": {},
|
114 |
+
"model_source": "hf",
|
115 |
+
"model_name": "meta-llama/Llama-3.3-70B-Instruct",
|
116 |
+
"model_name_sanitized": "meta-llama__Llama-3.3-70B-Instruct",
|
117 |
+
"system_instruction": null,
|
118 |
+
"system_instruction_sha": null,
|
119 |
+
"fewshot_as_multiturn": false,
|
120 |
+
"chat_template": null,
|
121 |
+
"chat_template_sha": null,
|
122 |
+
"start_time": 822799.725415956,
|
123 |
+
"end_time": 824041.525682158,
|
124 |
+
"total_evaluation_time_seconds": "1241.8002662019571"
|
125 |
+
}
|
evaluations/ar/Llama-3.3-70B-Instruct/ar_ifeval_0_shot.json
ADDED
@@ -0,0 +1,142 @@
|
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"ar_ifeval": {
|
4 |
+
"alias": "ar_ifeval",
|
5 |
+
"prompt_level_strict_acc,none": 0.7089552238805971,
|
6 |
+
"prompt_level_strict_acc_stderr,none": 0.019638685568678992,
|
7 |
+
"inst_level_strict_acc,none": 0.8860068259385665,
|
8 |
+
"inst_level_strict_acc_stderr,none": "N/A",
|
9 |
+
"prompt_level_loose_acc,none": 0.7947761194029851,
|
10 |
+
"prompt_level_loose_acc_stderr,none": 0.017460611985170207,
|
11 |
+
"inst_level_loose_acc,none": 0.9208191126279863,
|
12 |
+
"inst_level_loose_acc_stderr,none": "N/A"
|
13 |
+
}
|
14 |
+
},
|
15 |
+
"group_subtasks": {
|
16 |
+
"ar_ifeval": []
|
17 |
+
},
|
18 |
+
"configs": {
|
19 |
+
"ar_ifeval": {
|
20 |
+
"task": "ar_ifeval",
|
21 |
+
"dataset_path": "lm_eval/tasks/ar_ifeval/ar_ifeval.py",
|
22 |
+
"dataset_name": "ar_ifeval",
|
23 |
+
"dataset_kwargs": {
|
24 |
+
"trust_remote_code": true
|
25 |
+
},
|
26 |
+
"test_split": "test",
|
27 |
+
"doc_to_text": "prompt",
|
28 |
+
"doc_to_target": 0,
|
29 |
+
"process_results": "def process_results(doc, results):\n\n response = results[0]\n out_strict = process_sample(doc, response, 'strict')\n out_loose = process_sample(doc, response, 'loose')\n\n\n return {\n \"prompt_level_strict_acc\": out_strict.follow_all_instructions,\n \"inst_level_strict_acc\": out_strict.follow_instruction_list,\n \"prompt_level_loose_acc\": out_loose.follow_all_instructions,\n \"inst_level_loose_acc\": out_loose.follow_instruction_list,\n }\n",
|
30 |
+
"description": "",
|
31 |
+
"target_delimiter": " ",
|
32 |
+
"fewshot_delimiter": "\n\n",
|
33 |
+
"num_fewshot": 0,
|
34 |
+
"metric_list": [
|
35 |
+
{
|
36 |
+
"metric": "prompt_level_strict_acc",
|
37 |
+
"aggregation": "mean",
|
38 |
+
"higher_is_better": true
|
39 |
+
},
|
40 |
+
{
|
41 |
+
"metric": "inst_level_strict_acc",
|
42 |
+
"aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
|
43 |
+
"higher_is_better": true
|
44 |
+
},
|
45 |
+
{
|
46 |
+
"metric": "prompt_level_loose_acc",
|
47 |
+
"aggregation": "mean",
|
48 |
+
"higher_is_better": true
|
49 |
+
},
|
50 |
+
{
|
51 |
+
"metric": "inst_level_loose_acc",
|
52 |
+
"aggregation": "def agg_inst_level_acc(items):\n flat_items = [item for sublist in items for item in sublist]\n inst_level_acc = sum(flat_items) / len(flat_items)\n return inst_level_acc\n",
|
53 |
+
"higher_is_better": true
|
54 |
+
}
|
55 |
+
],
|
56 |
+
"output_type": "generate_until",
|
57 |
+
"generation_kwargs": {
|
58 |
+
"until": [],
|
59 |
+
"do_sample": false,
|
60 |
+
"temperature": 0.0,
|
61 |
+
"max_gen_toks": 1280
|
62 |
+
},
|
63 |
+
"repeats": 1,
|
64 |
+
"should_decontaminate": false,
|
65 |
+
"metadata": {
|
66 |
+
"version": 4.0
|
67 |
+
}
|
68 |
+
}
|
69 |
+
},
|
70 |
+
"versions": {
|
71 |
+
"ar_ifeval": 4.0
|
72 |
+
},
|
73 |
+
"n-shot": {
|
74 |
+
"ar_ifeval": 0
|
75 |
+
},
|
76 |
+
"higher_is_better": {
|
77 |
+
"ar_ifeval": {
|
78 |
+
"prompt_level_strict_acc": true,
|
79 |
+
"inst_level_strict_acc": true,
|
80 |
+
"prompt_level_loose_acc": true,
|
81 |
+
"inst_level_loose_acc": true
|
82 |
+
}
|
83 |
+
},
|
84 |
+
"n-samples": {
|
85 |
+
"ar_ifeval": {
|
86 |
+
"original": 536,
|
87 |
+
"effective": 536
|
88 |
+
}
|
89 |
+
},
|
90 |
+
"config": {
|
91 |
+
"model": "hf",
|
92 |
+
"model_args": "pretrained=meta-llama/Llama-3.3-70B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
|
93 |
+
"model_num_parameters": 70553706496,
|
94 |
+
"model_dtype": "torch.bfloat16",
|
95 |
+
"model_revision": "main",
|
96 |
+
"model_sha": "6f6073b423013f6a7d4d9f39144961bfbfbc386b",
|
97 |
+
"batch_size": 1,
|
98 |
+
"batch_sizes": [],
|
99 |
+
"device": null,
|
100 |
+
"use_cache": null,
|
101 |
+
"limit": null,
|
102 |
+
"bootstrap_iters": 100000,
|
103 |
+
"gen_kwargs": null,
|
104 |
+
"random_seed": 0,
|
105 |
+
"numpy_seed": 1234,
|
106 |
+
"torch_seed": 1234,
|
107 |
+
"fewshot_seed": 1234
|
108 |
+
},
|
109 |
+
"git_hash": "788a3672",
|
110 |
+
"date": 1738755018.193393,
|
111 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma 
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
112 |
+
"transformers_version": "4.48.2",
|
113 |
+
"upper_git_hash": null,
|
114 |
+
"tokenizer_pad_token": [
|
115 |
+
"<|finetune_right_pad_id|>",
|
116 |
+
"128004"
|
117 |
+
],
|
118 |
+
"tokenizer_eos_token": [
|
119 |
+
"<|eot_id|>",
|
120 |
+
"128009"
|
121 |
+
],
|
122 |
+
"tokenizer_bos_token": [
|
123 |
+
"<|begin_of_text|>",
|
124 |
+
"128000"
|
125 |
+
],
|
126 |
+
"eot_token_id": 128009,
|
127 |
+
"max_length": 131072,
|
128 |
+
"task_hashes": {
|
129 |
+
"ar_ifeval": "6bd5bfb26ee4f5909e16d66ee0e564fb2a5826815f16755272465c9e03f98a20"
|
130 |
+
},
|
131 |
+
"model_source": "hf",
|
132 |
+
"model_name": "meta-llama/Llama-3.3-70B-Instruct",
|
133 |
+
"model_name_sanitized": "meta-llama__Llama-3.3-70B-Instruct",
|
134 |
+
"system_instruction": null,
|
135 |
+
"system_instruction_sha": null,
|
136 |
+
"fewshot_as_multiturn": false,
|
137 |
+
"chat_template": "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- set date_string = \"26 Jul 2024\" %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message + builtin tools #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if builtin_tools is defined or tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{%- if builtin_tools is defined %}\n {{- \"Tools: \" + builtin_tools | reject('equalto', 'code_interpreter') | join(\", \") + \"\\n\\n\"}}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {%- if builtin_tools is defined and tool_call.name in builtin_tools %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- \"<|python_tag|>\" + tool_call.name + \".call(\" }}\n {%- for arg_name, arg_val in tool_call.arguments | items %}\n {{- arg_name + '=\"' + arg_val + '\"' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \")\" }}\n {%- else %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {%- endif %}\n {%- if builtin_tools is defined %}\n {#- This means we're in ipython mode #}\n {{- \"<|eom_id|>\" }}\n {%- else %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n",
|
138 |
+
"chat_template_sha": "e10ca381b1ccc5cf9db52e371f3b6651576caee0a630b452e2816b2d404d4b65",
|
139 |
+
"start_time": 744977.123888747,
|
140 |
+
"end_time": 758450.608805326,
|
141 |
+
"total_evaluation_time_seconds": "13473.484916579095"
|
142 |
+
}
|
evaluations/ar/Llama-3.3-70B-Instruct/araMath_v3_5_shot.json
ADDED
@@ -0,0 +1,126 @@
|
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"araMath_v3": {
|
4 |
+
"alias": "araMath_v3",
|
5 |
+
"acc,none": 0.7090909090909091,
|
6 |
+
"acc_stderr,none": 0.01848039016780232,
|
7 |
+
"acc_norm,none": 0.7090909090909091,
|
8 |
+
"acc_norm_stderr,none": 0.01848039016780232
|
9 |
+
}
|
10 |
+
},
|
11 |
+
"group_subtasks": {
|
12 |
+
"araMath_v3": []
|
13 |
+
},
|
14 |
+
"configs": {
|
15 |
+
"araMath_v3": {
|
16 |
+
"task": "araMath_v3",
|
17 |
+
"tag": [
|
18 |
+
"multiple_choice"
|
19 |
+
],
|
20 |
+
"dataset_path": "lm_eval/tasks/araMath_v3/araMath_v3.py",
|
21 |
+
"dataset_name": "araMath_v3",
|
22 |
+
"dataset_kwargs": {
|
23 |
+
"trust_remote_code": true
|
24 |
+
},
|
25 |
+
"validation_split": "validation",
|
26 |
+
"test_split": "test",
|
27 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc):\n def remove_prefix(choice):\n prefixes = [\"(A)\", \"(B)\", \"(C)\", \"(D)\"]\n for prefix in prefixes:\n if choice.startswith(prefix + \" \"):\n return choice[len(prefix) + 1:] \n return choice \n\n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n choices = \"\".join(\n [f\"{key}. {remove_prefix(choice)}\\n\" for key, choice in zip(keys, doc[\"options\"])]\n )\n\n prompt = f\"\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices}\\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n keys_en = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys_en),\n \"choices\": keys_en,\n \"gold\": keys_en.index(doc[\"label\"]),\n }\n return out_doc\n \n return dataset.map(_process_docs)\n",
|
28 |
+
"doc_to_text": "query",
|
29 |
+
"doc_to_target": "gold",
|
30 |
+
"doc_to_choice": "{{choices}}",
|
31 |
+
"description": "\u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u0645\u0646 \u0628\u064a\u0646 'A\u060c B\u060c C\u060c D' \u062f\u0648\u0646 \u0634\u0631\u062d",
|
32 |
+
"target_delimiter": " ",
|
33 |
+
"fewshot_delimiter": "\n\n",
|
34 |
+
"num_fewshot": 5,
|
35 |
+
"metric_list": [
|
36 |
+
{
|
37 |
+
"metric": "acc",
|
38 |
+
"aggregation": "mean",
|
39 |
+
"higher_is_better": true
|
40 |
+
},
|
41 |
+
{
|
42 |
+
"metric": "acc_norm",
|
43 |
+
"aggregation": "mean",
|
44 |
+
"higher_is_better": true
|
45 |
+
}
|
46 |
+
],
|
47 |
+
"output_type": "multiple_choice",
|
48 |
+
"repeats": 1,
|
49 |
+
"should_decontaminate": true,
|
50 |
+
"doc_to_decontamination_query": "query",
|
51 |
+
"metadata": {
|
52 |
+
"version": 0.0
|
53 |
+
}
|
54 |
+
}
|
55 |
+
},
|
56 |
+
"versions": {
|
57 |
+
"araMath_v3": 0.0
|
58 |
+
},
|
59 |
+
"n-shot": {
|
60 |
+
"araMath_v3": 5
|
61 |
+
},
|
62 |
+
"higher_is_better": {
|
63 |
+
"araMath_v3": {
|
64 |
+
"acc": true,
|
65 |
+
"acc_norm": true
|
66 |
+
}
|
67 |
+
},
|
68 |
+
"n-samples": {
|
69 |
+
"araMath_v3": {
|
70 |
+
"original": 605,
|
71 |
+
"effective": 605
|
72 |
+
}
|
73 |
+
},
|
74 |
+
"config": {
|
75 |
+
"model": "hf",
|
76 |
+
"model_args": "pretrained=meta-llama/Llama-3.3-70B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
|
77 |
+
"model_num_parameters": 70553706496,
|
78 |
+
"model_dtype": "torch.bfloat16",
|
79 |
+
"model_revision": "main",
|
80 |
+
"model_sha": "6f6073b423013f6a7d4d9f39144961bfbfbc386b",
|
81 |
+
"batch_size": 1,
|
82 |
+
"batch_sizes": [],
|
83 |
+
"device": null,
|
84 |
+
"use_cache": null,
|
85 |
+
"limit": null,
|
86 |
+
"bootstrap_iters": 100000,
|
87 |
+
"gen_kwargs": null,
|
88 |
+
"random_seed": 0,
|
89 |
+
"numpy_seed": 1234,
|
90 |
+
"torch_seed": 1234,
|
91 |
+
"fewshot_seed": 1234
|
92 |
+
},
|
93 |
+
"git_hash": "788a3672",
|
94 |
+
"date": 1738750317.5038416,
|
95 |
+
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma 
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
|
96 |
+
"transformers_version": "4.48.2",
|
97 |
+
"upper_git_hash": null,
|
98 |
+
"tokenizer_pad_token": [
|
99 |
+
"<|finetune_right_pad_id|>",
|
100 |
+
"128004"
|
101 |
+
],
|
102 |
+
"tokenizer_eos_token": [
|
103 |
+
"<|eot_id|>",
|
104 |
+
"128009"
|
105 |
+
],
|
106 |
+
"tokenizer_bos_token": [
|
107 |
+
"<|begin_of_text|>",
|
108 |
+
"128000"
|
109 |
+
],
|
110 |
+
"eot_token_id": 128009,
|
111 |
+
"max_length": 131072,
|
112 |
+
"task_hashes": {
|
113 |
+
"araMath_v3": "154ea94d6776e7d3980c98343cec49115ef3dc4dab8897fb4668f68494d55c76"
|
114 |
+
},
|
115 |
+
"model_source": "hf",
|
116 |
+
"model_name": "meta-llama/Llama-3.3-70B-Instruct",
|
117 |
+
"model_name_sanitized": "meta-llama__Llama-3.3-70B-Instruct",
|
118 |
+
"system_instruction": null,
|
119 |
+
"system_instruction_sha": null,
|
120 |
+
"fewshot_as_multiturn": false,
|
121 |
+
"chat_template": "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- set date_string = \"26 Jul 2024\" %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message + builtin tools #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if builtin_tools is defined or tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{%- if builtin_tools is defined %}\n {{- \"Tools: \" + builtin_tools | reject('equalto', 'code_interpreter') | join(\", \") + \"\\n\\n\"}}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {%- if builtin_tools is defined and tool_call.name in builtin_tools %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- \"<|python_tag|>\" + tool_call.name + \".call(\" }}\n {%- for arg_name, arg_val in tool_call.arguments | items %}\n {{- arg_name + '=\"' + arg_val + '\"' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \")\" }}\n {%- else %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {%- endif %}\n {%- if builtin_tools is defined %}\n {#- This means we're in ipython mode #}\n {{- \"<|eom_id|>\" }}\n {%- else %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n",
|
122 |
+
"chat_template_sha": "e10ca381b1ccc5cf9db52e371f3b6651576caee0a630b452e2816b2d404d4b65",
|
123 |
+
"start_time": 740276.643313964,
|
124 |
+
"end_time": 740434.169818474,
|
125 |
+
"total_evaluation_time_seconds": "157.5265045099659"
|
126 |
+
}
|
evaluations/ar/Llama-3.3-70B-Instruct/araPro_0_shot.json
ADDED
@@ -0,0 +1,130 @@
|
1 |
+
{
|
2 |
+
"results": {
|
3 |
+
"araPro": {
|
4 |
+
"alias": "araPro",
|
5 |
+
"acc,none": 0.7048590281943611,
|
6 |
+
"acc_stderr,none": 0.006450314388729491,
|
7 |
+
"acc_norm,none": 0.7048590281943611,
|
8 |
+
"acc_norm_stderr,none": 0.006450314388729491
|
9 |
+
}
|
10 |
+
},
|
11 |
+
"group_subtasks": {
|
12 |
+
"araPro": []
|
13 |
+
},
|
14 |
+
"configs": {
|
15 |
+
"araPro": {
|
16 |
+
"task": "araPro",
|
17 |
+
"tag": [
|
18 |
+
"multiple_choice"
|
19 |
+
],
|
20 |
+
"dataset_path": "lm_eval/tasks/araPro/araPro.py",
|
21 |
+
"dataset_name": "araPro",
|
22 |
+
"dataset_kwargs": {
|
23 |
+
"trust_remote_code": true
|
24 |
+
},
|
25 |
+
"validation_split": "validation",
|
26 |
+
"test_split": "test",
|
27 |
+
"fewshot_split": "validation",
|
28 |
+
"process_docs": "def process_docs(dataset: datasets.Dataset) -> datasets.Dataset:\n def _process_docs(doc): \n def remove_prefix(choice):\n return choice.replace('.', '') if '.' in choice[:2] else choice\n \n def format_example(doc, keys):\n question = doc[\"question\"].strip()\n \n choice_num = ['choice1', 'choice2', 'choice3', 'choice4']\n choices = \"\".join(\n [f\"{key}. {remove_prefix(doc[choice_num[index]])}\\n\" for index, key in enumerate(keys)]\n )\n\n prompt = f\"\\n\\n\\n\u0627\u0644\u0633\u0624\u0627\u0644: {question}\\n{choices} \\n\u0627\u0644\u0627\u062c\u0627\u0628\u0629:\"\n return prompt\n\n #keys = [\"1\", \"2\", \"3\", \"4\"]\n keys = [\"A\", \"B\", \"C\", \"D\"]\n out_doc = {\n \"query\": format_example(doc, keys), \n \"choices\": keys,\n \"gold\": doc[\"answer\"]-1,\n } \n\n return out_doc\n \n return dataset.map(_process_docs)\n",
      "doc_to_text": "query",
      "doc_to_target": "gold",
      "doc_to_choice": "{{choices}}",
"description": "\u0641\u064a\u0645\u0627 \u064a\u0644\u064a \u0623\u0633\u0626\u0644\u0629 \u0627\u0644\u0627\u062e\u062a\u064a\u0627\u0631 \u0645\u0646 \u0645\u062a\u0639\u062f\u062f (\u0645\u0639 \u0627\u0644\u0625\u062c\u0627\u0628\u0627\u062a) \u0645\u0646 \u0641\u0636\u0644\u0643 \u0627\u062e\u062a\u0631 \u0625\u062c\u0627\u0628\u0629 \u0648\u0627\u062d\u062f\u0629 \u062f\u0648\u0646 \u0634\u0631\u062d",
      "target_delimiter": " ",
      "fewshot_delimiter": "\n\n",
      "fewshot_config": {
        "sampler": "balanced_cat"
      },
      "num_fewshot": 0,
      "metric_list": [
        {
          "metric": "acc",
          "aggregation": "mean",
          "higher_is_better": true
        },
        {
          "metric": "acc_norm",
          "aggregation": "mean",
          "higher_is_better": true
        }
      ],
      "output_type": "multiple_choice",
      "repeats": 1,
      "should_decontaminate": true,
      "doc_to_decontamination_query": "Question",
      "metadata": {
        "version": 2.0
      }
    }
  },
  "versions": {
    "araPro": 2.0
  },
  "n-shot": {
    "araPro": 0
  },
  "higher_is_better": {
    "araPro": {
      "acc": true,
      "acc_norm": true
    }
  },
  "n-samples": {
    "araPro": {
      "original": 5001,
      "effective": 5001
    }
  },
  "config": {
    "model": "hf",
    "model_args": "pretrained=meta-llama/Llama-3.3-70B-Instruct,trust_remote_code=True,cache_dir=/tmp,parallelize=True",
    "model_num_parameters": 70553706496,
    "model_dtype": "torch.bfloat16",
    "model_revision": "main",
    "model_sha": "6f6073b423013f6a7d4d9f39144961bfbfbc386b",
    "batch_size": 1,
    "batch_sizes": [],
    "device": null,
    "use_cache": null,
    "limit": null,
    "bootstrap_iters": 100000,
    "gen_kwargs": null,
    "random_seed": 0,
    "numpy_seed": 1234,
    "torch_seed": 1234,
    "fewshot_seed": 1234
  },
  "git_hash": "788a3672",
  "date": 1738742514.712935,
"pretty_env_info": "PyTorch version: 2.4.0+cu121\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: Ubuntu 22.04.3 LTS (x86_64)\nGCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0\nClang version: Could not collect\nCMake version: version 3.27.1\nLibc version: glibc-2.35\n\nPython version: 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] (64-bit runtime)\nPython platform: Linux-5.15.0-1064-azure-x86_64-with-glibc2.35\nIs CUDA available: True\nCUDA runtime version: 12.2.128\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: \nGPU 0: NVIDIA A100-SXM4-80GB\nGPU 1: NVIDIA A100-SXM4-80GB\nGPU 2: NVIDIA A100-SXM4-80GB\nGPU 3: NVIDIA A100-SXM4-80GB\nGPU 4: NVIDIA A100-SXM4-80GB\nGPU 5: NVIDIA A100-SXM4-80GB\nGPU 6: NVIDIA A100-SXM4-80GB\nGPU 7: NVIDIA A100-SXM4-80GB\n\nNvidia driver version: 535.161.08\ncuDNN version: Probably one of the following:\n/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4\n/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nAddress sizes: 48 bits physical, 48 bits virtual\nByte Order: Little Endian\nCPU(s): 96\nOn-line CPU(s) list: 0-95\nVendor ID: AuthenticAMD\nModel name: AMD EPYC 7V12 64-Core Processor\nCPU family: 23\nModel: 49\nThread(s) per core: 1\nCore(s) per socket: 48\nSocket(s): 2\nStepping: 0\nBogoMIPS: 4890.89\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq ssse3 fma 
cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru arat umip rdpid\nHypervisor vendor: Microsoft\nVirtualization type: full\nL1d cache: 3 MiB (96 instances)\nL1i cache: 3 MiB (96 instances)\nL2 cache: 48 MiB (96 instances)\nL3 cache: 384 MiB (24 instances)\nNUMA node(s): 4\nNUMA node0 CPU(s): 0-23\nNUMA node1 CPU(s): 24-47\nNUMA node2 CPU(s): 48-71\nNUMA node3 CPU(s): 72-95\nVulnerability Gather data sampling: Not affected\nVulnerability Itlb multihit: Not affected\nVulnerability L1tf: Not affected\nVulnerability Mds: Not affected\nVulnerability Meltdown: Not affected\nVulnerability Mmio stale data: Not affected\nVulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled\nVulnerability Spec rstack overflow: Mitigation; safe RET, no microcode\nVulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp\nVulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization\nVulnerability Spectre v2: Mitigation; Retpolines; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected\nVulnerability Srbds: Not affected\nVulnerability Tsx async abort: Not affected\n\nVersions of relevant libraries:\n[pip3] numpy==1.26.4\n[pip3] onnx==1.14.0\n[pip3] pytorch-lightning==2.0.7\n[pip3] pytorch-quantization==2.1.2\n[pip3] torch==2.4.0\n[pip3] torch-tensorrt==2.0.0.dev0\n[pip3] torchaudio==2.1.0\n[pip3] torchdata==0.7.0a0\n[pip3] torchmetrics==1.2.0\n[pip3] torchvision==0.19.0\n[pip3] triton==3.0.0\n[conda] Could not collect",
  "transformers_version": "4.48.2",
  "upper_git_hash": null,
  "tokenizer_pad_token": [
    "<|finetune_right_pad_id|>",
    "128004"
  ],
  "tokenizer_eos_token": [
    "<|eot_id|>",
    "128009"
  ],
  "tokenizer_bos_token": [
    "<|begin_of_text|>",
    "128000"
  ],
  "eot_token_id": 128009,
  "max_length": 131072,
  "task_hashes": {
    "araPro": "ab4849e5668de72a27844a2a354787cbce92af5027f46a32300417b41913c5db"
  },
  "model_source": "hf",
  "model_name": "meta-llama/Llama-3.3-70B-Instruct",
  "model_name_sanitized": "meta-llama__Llama-3.3-70B-Instruct",
  "system_instruction": null,
  "system_instruction_sha": null,
  "fewshot_as_multiturn": false,
"chat_template": "{{- bos_token }}\n{%- if custom_tools is defined %}\n {%- set tools = custom_tools %}\n{%- endif %}\n{%- if not tools_in_user_message is defined %}\n {%- set tools_in_user_message = true %}\n{%- endif %}\n{%- if not date_string is defined %}\n {%- set date_string = \"26 Jul 2024\" %}\n{%- endif %}\n{%- if not tools is defined %}\n {%- set tools = none %}\n{%- endif %}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n {%- set system_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n{%- else %}\n {%- set system_message = \"\" %}\n{%- endif %}\n\n{#- System message + builtin tools #}\n{{- \"<|start_header_id|>system<|end_header_id|>\\n\\n\" }}\n{%- if builtin_tools is defined or tools is not none %}\n {{- \"Environment: ipython\\n\" }}\n{%- endif %}\n{%- if builtin_tools is defined %}\n {{- \"Tools: \" + builtin_tools | reject('equalto', 'code_interpreter') | join(\", \") + \"\\n\\n\"}}\n{%- endif %}\n{{- \"Cutting Knowledge Date: December 2023\\n\" }}\n{{- \"Today Date: \" + date_string + \"\\n\\n\" }}\n{%- if tools is not none and not tools_in_user_message %}\n {{- \"You have access to the following functions. To call a function, please respond with JSON for a function call.\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n{%- endif %}\n{{- system_message }}\n{{- \"<|eot_id|>\" }}\n\n{#- Custom tools are passed in a user message with some extra guidance #}\n{%- if tools_in_user_message and not tools is none %}\n {#- Extract the first user message so we can plug it in here #}\n {%- if messages | length != 0 %}\n {%- set first_user_message = messages[0]['content']|trim %}\n {%- set messages = messages[1:] %}\n {%- else %}\n {{- raise_exception(\"Cannot put tools in the first user message when there's no first user message!\") }}\n{%- endif %}\n {{- '<|start_header_id|>user<|end_header_id|>\\n\\n' -}}\n {{- \"Given the following functions, please respond with a JSON for a function call \" }}\n {{- \"with its proper arguments that best answers the given prompt.\\n\\n\" }}\n {{- 'Respond in the format {\"name\": function name, \"parameters\": dictionary of argument name and its value}.' 
}}\n {{- \"Do not use variables.\\n\\n\" }}\n {%- for t in tools %}\n {{- t | tojson(indent=4) }}\n {{- \"\\n\\n\" }}\n {%- endfor %}\n {{- first_user_message + \"<|eot_id|>\"}}\n{%- endif %}\n\n{%- for message in messages %}\n {%- if not (message.role == 'ipython' or message.role == 'tool' or 'tool_calls' in message) %}\n {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\\n\\n'+ message['content'] | trim + '<|eot_id|>' }}\n {%- elif 'tool_calls' in message %}\n {%- if not message.tool_calls|length == 1 %}\n {{- raise_exception(\"This model only supports single tool-calls at once!\") }}\n {%- endif %}\n {%- set tool_call = message.tool_calls[0].function %}\n {%- if builtin_tools is defined and tool_call.name in builtin_tools %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- \"<|python_tag|>\" + tool_call.name + \".call(\" }}\n {%- for arg_name, arg_val in tool_call.arguments | items %}\n {{- arg_name + '=\"' + arg_val + '\"' }}\n {%- if not loop.last %}\n {{- \", \" }}\n {%- endif %}\n {%- endfor %}\n {{- \")\" }}\n {%- else %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' -}}\n {{- '{\"name\": \"' + tool_call.name + '\", ' }}\n {{- '\"parameters\": ' }}\n {{- tool_call.arguments | tojson }}\n {{- \"}\" }}\n {%- endif %}\n {%- if builtin_tools is defined %}\n {#- This means we're in ipython mode #}\n {{- \"<|eom_id|>\" }}\n {%- else %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n {%- elif message.role == \"tool\" or message.role == \"ipython\" %}\n {{- \"<|start_header_id|>ipython<|end_header_id|>\\n\\n\" }}\n {%- if message.content is mapping or message.content is iterable %}\n {{- message.content | tojson }}\n {%- else %}\n {{- message.content }}\n {%- endif %}\n {{- \"<|eot_id|>\" }}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|start_header_id|>assistant<|end_header_id|>\\n\\n' }}\n{%- endif %}\n",
  "chat_template_sha": "e10ca381b1ccc5cf9db52e371f3b6651576caee0a630b452e2816b2d404d4b65",
  "start_time": 732473.787962617,
  "end_time": 736407.61692168,
  "total_evaluation_time_seconds": "3933.8289590630447"
}
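
The reported numbers in this araPro result file can be sanity-checked by hand. The sketch below is not part of the commit: it copies `acc,none`, `acc_stderr,none`, the `effective` sample count, and the timing fields from the JSON above, and assumes the standard error is the usual sample-based standard error of a mean over 0/1 outcomes (dividing by n - 1). The helper name `binary_mean_stderr` is hypothetical.

```python
import math

# Values copied from the araPro result file for meta-llama/Llama-3.3-70B-Instruct.
acc = 0.7048590281943611
reported_stderr = 0.006450314388729491
n_samples = 5001
start_time = 732473.787962617
end_time = 736407.61692168
reported_seconds = float("3933.8289590630447")  # stored as a string in the JSON


def binary_mean_stderr(p: float, n: int) -> float:
    """Standard error of the mean of n 0/1 outcomes with mean p.

    Assumption: the sample variance (n - 1 in the denominator) is used,
    which simplifies to sqrt(p * (1 - p) / (n - 1)).
    """
    return math.sqrt(p * (1.0 - p) / (n - 1))


# Number of correctly answered questions implied by the accuracy.
n_correct = round(acc * n_samples)

# Recomputed standard error and wall-clock duration.
stderr = binary_mean_stderr(acc, n_samples)
elapsed = end_time - start_time

print(f"correct answers: {n_correct} of {n_samples}")
print(f"recomputed stderr: {stderr:.9f} (reported {reported_stderr:.9f})")
print(f"elapsed seconds: {elapsed:.3f} (reported {reported_seconds:.3f})")
```

Under that assumption the recomputed standard error agrees with the reported one to several significant digits, and `end_time - start_time` reproduces `total_evaluation_time_seconds`, roughly 65.6 minutes for the 5001 zero-shot questions at `batch_size` 1.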