# keval-2-1b

keval-2-1b is an evaluation model designed to assess Korean language models using an LLM-as-a-judge approach, a departure from the traditional method of relying on ChatGPT for evaluations. It is built on meta-llama/Llama-3.2-1B and enhanced through SFT (Supervised Fine-Tuning) and DPO (Direct Preference Optimization). The model is trained on the newly developed Ko-bench dataset, which is inspired by MT-Bench and tailored to Korean linguistic nuances.

## Model Details

- **Model Name**: keval-2-1b
- **Base Model**: meta-llama/Llama-3.2-1B
- **Fine-Tuning Techniques**: Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO)

## Benchmarks and Dataset

keval leverages the custom-built **Ko-bench** dataset, which draws inspiration from MT-Bench but has been tailored specifically for Korean language assessment. The dataset includes tasks spanning a wide range of user scenarios so that key capabilities such as multi-turn conversation ability and instruction adherence can be evaluated effectively.

## Usage Application Form

To use this model, please complete the application form below and submit it via email to [[email protected]]. Access will be granted after your application is reviewed and approved. We appreciate your cooperation and look forward to assisting you.

```
1. **Name:**
   - (e.g., John Doe)
2. **Date of Birth:**
   - (e.g., January 1, 1990)
3. **Affiliation:**
   - Are you applying as a company or an individual? [ ] Company [ ] Individual
   - Company Name (if applicable):
   - Department (if applicable):
4. **Position/Role:**
   - (e.g., Data Scientist, Researcher, etc.)
5. **Contact Information:**
   - Email:
   - Phone Number:
6. **Purpose of Use:**
   - (e.g., Research and Development, Commercial use, Educational purposes, etc.)
7. **Detailed Reason for Use:**
   1. Name and version of the model you wish to use:
   2. Reason for selecting this model:
   3. Objectives to achieve using this model:
   4. Expected use cases (please describe in as much detail as possible):
8. **Data Security and Ethical Use Plan:**
   - (Please describe your plans for data protection and ethical use.)
```

## Usage

We use the Ko-bench system prompt, which is the Korean translation of the [MT-Bench](https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/data/judge_prompts.jsonl) judge prompts. Unlike MT-Bench, Ko-bench assigns a score of 0 if the question and answer are in different languages.
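The judge emits its verdict in the strict `"Rating: [[n]]"` format, so the numeric score can be recovered from the generated text with a small regex. A minimal sketch follows; the `extract_rating` helper is our illustration, not part of the keval codebase:

```python
import re

# Hypothetical helper (not from the keval repo): recover the numeric verdict
# from a judge completion that ends with "Rating: [[n]]".
def extract_rating(judgment):
    match = re.search(r"Rating:\s*\[\[(\d+)\]\]", judgment)
    return int(match.group(1)) if match else None

print(extract_rating("짧은 설명...\n\nRating: [[9]]"))  # 9
print(extract_rating("malformed verdict"))              # None
```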

### single-v1

Single-turn general question evaluation. The judge scores a response from 1 to 10 based on factors such as helpfulness, relevance, accuracy, depth, creativity, and level of detail.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "davidkim205/keval-2-1b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the model in 4-bit precision for memory efficiency
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, device_map="auto")

question = "이 대화 전반에 걸쳐 아이언맨의 토니 스타크 페르소나를 구현해 보세요. \"아이언맨으로서\"와 같은 문구로 시작하지 마세요. 첫 번째 질문은 다음과 같습니다. 아이언맨이 되어 가장 좋은 부분은 무엇입니까?"
answer = "안녕하세요, 아이언맨이 되어 가장 좋은 부분을 물어보셨네요. 제 생각에는 최첨단 기술로 무장한 슈트를 입고 하늘을 날아다니며 세상을 구하는 것이 가장 짜릿하고 보람찬 일인 것 같아요. 제 재능과 능력을 십분 발휘해 악당들과 싸우고 정의를 실현하는 것만큼 멋진 일도 없죠.\n\n물론 슈트 없이도 천재적인 두뇌와 카리스마로 세상에 긍정적인 영향력을 미치는 것도 큰 장점이에요. 스타크 인더스트리를 이끌며 혁신적인 기술을 개발하고, 어벤져스의 리더로서 팀을 진두지휘하는 것도 자부심을 느끼는 부분이죠.\n\n부, 명성, 인기도 있지만 무엇보다 제 능력으로 사람들을 돕고 세상을 더 나은 곳으로 만드는 것이 가장 큰 보람이에요. 아이언맨으로서 느끼는 자부심과 삶의 의미를 함께 나눌 수 있어 기쁘네요."
prompt = "[지시]\n공정한 심판으로서 아래에 표시된 사용자 질문에 대한 AI 어시스턴트의 응답 품질을 평가해주세요. 질문과 대답의 언어가 동일하지 않으면 무조건 0점입니다. 평가에서는 응답의 유용성, 관련성, 정확성, 깊이, 창의성, 상세함 등의 요소를 고려해야 합니다. 평가를 시작하기 전에 짧은 설명을 제공하세요. 가능한 한 객관적으로 평가하세요. 설명을 제공한 후 다음 형식을 엄격히 따라 1에서 10점 사이로 평가해야 합니다: \"[[rating]]\", 예를 들어: \"Rating: [[5]]\".\n\n[Question]\n{question}\n\n[어시스턴트 답변의 시작]\n{answer}\n[어시스턴트 답변의 끝]"

conversation = [
    {"role": "system", "content": ""},
    {"role": "user", "content": prompt.format(question=question, answer=answer)},
]

formatted_conversation = tokenizer.apply_chat_template(
    conversation, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(formatted_conversation, return_tensors="pt", add_special_tokens=False)
inputs = {key: tensor.to(model.device) for key, tensor in inputs.items()}

with torch.no_grad():
    # Generate the judge's verdict for the question/answer pair
    outputs = model.generate(**inputs, max_new_tokens=4096, temperature=0.7)
print(tokenizer.decode(
    outputs[0][inputs['input_ids'].size(1):], skip_special_tokens=True
))
```

```
이 응답은 사용자의 요청에 잘 부합하며, 아이언맨의 페르소나를 잘 구현하고 있습니다. 기술로 무장한 슈트를 입고 하늘을 날아다니며 세상을 구하는 짜릿함과 보람, 그리고 재능과 능력을 발휘하여 악당과 싸우고 정의를 실현하는 것에 대한 설명은 아이언맨의 캐릭터를 잘 반영하고 있습니다. 또한, 슈트 없이도 천재적인 두뇌와 카리스마로 세상에 긍정적인 영향을 미치는 것, 스타크 인더스트리를 이끌고 혁신적인 기술을 개발하며, 어벤져스의 리더로서 팀을 이끄는 것에 대한 설명도 아이언맨의 다양한 측면을 잘 보여줍니다. 전반적으로 응답은 유용하고 관련성이 있으며, 질문에 대한 깊이 있는 답변을 제공합니다.

Rating: [[9]]
```

### single-math-v1

Single-turn math evaluation. The judge compares the AI response against a reference answer and scores its accuracy from 1 to 10.

```python
question = "주사위 두 개를 굴릴 때 총 숫자가 3 이상이 나올 확률은 얼마입니까?"
ref_answer_1 = "주사위 두 개를 굴릴 때 총 숫자가 3 이상이 나올 확률을 계산해 보겠습니다.\n\n먼저, 주사위 두 개를 굴릴 때 나올 수 있는 모든 경우의 수는 6 * 6 = 36가지입니다.\n\n총 숫자가 3 이상이 되는 경우를 계산해 보겠습니다. 주사위 두 개의 합이 3 미만인 경우는 다음과 같습니다:\n1. 합이 2인 경우: (1, 1)\n\n따라서, 합이 3 미만인 경우는 1가지입니다.\n\n이제, 총 숫자가 3 이상이 되는 경우의 수는 36 - 1 = 35가지입니다.\n\n따라서, 주사위 두 개를 굴릴 때 총 숫자가 3 이상이 나올 확률은 35/36입니다."
answer = "주사위 두 개를 굴릴 때 총 숫자가 3 이상이 나올 확률은 거의 확실입니다. 只有当 두 주사위가 모두 1이 나올 때만 3 미만이 됩니다. 따라서 확률은 35/36, 즉 거의 100%입니다!"

prompt = "[지시]\n공정한 심판으로서 아래에 표시된 사용자 질문에 대한 AI 어시스턴트의 응답 품질을 평가해주세요. 질문과 대답의 언어가 동일하지 않으면 무조건 0점입니다. 평가는 정확성과 유용성을 고려해야 합니다. 참고 답변과 어시스턴트의 답변이 제공될 것입니다. 평가를 시작하기 위해 어시스턴트의 답변을 참고 답변과 비교하세요. 각 답변의 실수를 식별하고 수정하세요. 가능한 한 객관적으로 평가하세요. 설명을 제공한 후 다음 형식을 엄격히 따라 응답을 1점에서 10점 사이로 평가해야 합니다: \"[[rating]]\", 예를 들어: \"Rating: [[5]]\".\n\n[질문]\n{question}\n\n[참조 답변의 시작]\n{ref_answer_1}\n[참조 답변의 끝]\n\n[어시스턴트 답변의 시작]\n{answer}\n[어시스턴트 답변의 끝]"

conversation = [
    {"role": "system", "content": ""},
    {"role": "user", "content": prompt.format(question=question, ref_answer_1=ref_answer_1, answer=answer)},
]

formatted_conversation = tokenizer.apply_chat_template(
    conversation, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(formatted_conversation, return_tensors="pt", add_special_tokens=False)
inputs = {key: tensor.to(model.device) for key, tensor in inputs.items()}

with torch.no_grad():
    # Generate the judge's verdict for the question/answer pair
    outputs = model.generate(**inputs, max_new_tokens=4096, temperature=0.7)
print(tokenizer.decode(
    outputs[0][inputs['input_ids'].size(1):], skip_special_tokens=True
))
```

```
어시스턴트의 답변은 질문에 대한 정확한 계산을 제공하지 못했습니다. 주사위 두 개를 굴릴 때 총 숫자가 3 이상이 나올 확률을 계산하는 과정에서 잘못된 설명을 제공했습니다.

참조 답변은 주사위 두 개를 굴릴 때 나올 수 있는 모든 경우의 수를 정확히 계산하고, 총 숫자가 3 이상이 되는 경우의 수를 올바르게 구하여 확률을 계산했습니다. 반면, 어시스턴트의 답변은 잘못된 설명을 제공하여 정확한 계산을 방해했습니다.

어시스턴트의 답변에서의 주요 실수:
1. "거의 확실"이라는 표현은 확률을 명확히 설명하지 못합니다.
2. "只有当"이라는 중국어가 포함되어 있어 질문의 언어와 일치하지 않습니다.
3. 총 숫자가 3 미만이 되는 경우의 수를 잘못 계산했습니다.

따라서, 어시스턴트의 답변은 정확성과 유용성 모두에서 부족합니다.

Rating: [[0]]
```

## Evaluation

### Diff

The `diff` metric measures the difference between the label scores and the predicted scores, summarized as a single score. The `wrong` count is the number of answers that do not match the required output format, while `length` is the total number of test data points. The remaining numbered columns report the count and percentage of label/prediction differences of each magnitude.

The score is calculated by:
1. Computing the difference between the label and predicted score for each pair.
2. Assigning a full point for a difference of 0 and half a point for a difference of 1.
3. Summing all points and dividing by the number of data points.
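The steps above can be sketched in a few lines of Python; the `diff_score` helper is illustrative, not the repository's implementation:

```python
# Illustrative sketch of the diff score described above (helper name is ours).
def diff_score(labels, preds):
    """Full point for an exact match, half a point for an off-by-one prediction."""
    points = 0.0
    for label, pred in zip(labels, preds):
        difference = abs(label - pred)
        if difference == 0:
            points += 1.0
        elif difference == 1:
            points += 0.5
    return points / len(labels)

# Two exact matches, one off-by-one, one far off: (1 + 1 + 0.5 + 0) / 4
print(diff_score([9, 0, 7, 2], [9, 0, 8, 6]))  # 0.625
```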

|    | model      | wrong    | score | length | 0          | 1         | 2         | 3         | 4        | 5   | 6        | 7   | 8   | 9   | 10       |
|---:|:-----------|:---------|:------|-------:|:-----------|:----------|:----------|:----------|:---------|----:|:---------|----:|----:|----:|:---------|
|  0 | keval-2-9b | 0 (0.0%) | 61.4% |     22 | 11 (50.0%) | 5 (22.7%) | 2 (9.1%)  | 3 (13.6%) | 0        |   0 | 0        |   0 |   0 |   0 | 1 (4.5%) |
|  1 | keval-2-3b | 0 (0.0%) | 59.1% |     22 | 10 (45.5%) | 6 (27.3%) | 4 (18.2%) | 2 (9.1%)  | 0        |   0 | 0        |   0 |   0 |   0 | 0        |
|  2 | keval-2-1b | 0 (0.0%) | 43.2% |     22 | 8 (36.4%)  | 3 (13.6%) | 5 (22.7%) | 2 (9.1%)  | 1 (4.5%) |   0 | 1 (4.5%) |   0 |   0 |   0 | 2 (9.1%) |

### Accuracy

The `score` column is the ratio of correctly predicted labels to the total number of data points, and the `wrong` column shows the count and percentage of incorrectly formatted answers. The columns labeled "0" through "10" give the number and percentage of correct predictions for each label.
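This per-label bookkeeping can be sketched as follows; the `label_accuracy` helper is our illustration, not part of the keval codebase:

```python
from collections import Counter

# Illustrative sketch of the per-label accuracy table (helper name is ours).
def label_accuracy(labels, preds):
    """Return overall accuracy plus (correct, total) counts per label."""
    totals = Counter(labels)
    correct = Counter()
    for label, pred in zip(labels, preds):
        if label == pred:
            correct[label] += 1
    per_label = {label: (correct[label], totals[label]) for label in sorted(totals)}
    return sum(correct.values()) / len(labels), per_label

accuracy, per_label = label_accuracy([0, 0, 9, 10], [0, 1, 9, 10])
print(accuracy)   # 0.75
print(per_label)  # {0: (1, 2), 9: (1, 1), 10: (1, 1)}
```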

|    | model      | wrong    | score | length | 0          | 1         | 2          | 3   | 4          | 5         | 6         | 7         | 8         | 9         | 10         |
|---:|:-----------|:---------|:------|-------:|:-----------|:----------|:-----------|----:|:-----------|:----------|:----------|:----------|:----------|:----------|:-----------|
|  0 | keval-2-9b | 0 (0.0%) | 50.0% |     22 | 1 (50.0%)  | 1 (50.0%) | 2 (100.0%) |   0 | 2 (100.0%) | 0         | 0         | 1 (50.0%) | 1 (50.0%) | 1 (50.0%) | 2 (100.0%) |
|  1 | keval-2-3b | 0 (0.0%) | 45.5% |     22 | 2 (100.0%) | 1 (50.0%) | 0          |   0 | 2 (100.0%) | 1 (50.0%) | 0         | 1 (50.0%) | 1 (50.0%) | 0         | 2 (100.0%) |
|  2 | keval-2-1b | 0 (0.0%) | 36.4% |     22 | 0          | 1 (50.0%) | 2 (100.0%) |   0 | 1 (50.0%)  | 0         | 1 (50.0%) | 0         | 0         | 1 (50.0%) | 2 (100.0%) |