@@ -1,199 +1,165 @@
----
-library_name: transformers
-tags: []
----
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
 ## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
 ## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]

+# keval-2-1b
+keval-2-1b is an evaluation model designed to assess Korean language models using an LLM-as-a-judge approach, as an alternative to the conventional practice of using ChatGPT as the evaluator. keval is built on the Llama-3.2-1B base model and fine-tuned with SFT (Supervised Fine-Tuning) and DPO (Direct Preference Optimization). It is trained on the newly developed Ko-Bench dataset, which is inspired by MT-Bench and tailored to Korean linguistic nuances.
 ## Model Details
+- **Model Name**: keval-2-1b
+- **Base Model**: meta-llama/Llama-3.2-1B
+- **Fine-Tuning Techniques**: Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO)
+## Benchmarks and Dataset
+keval uses the custom-built **Ko-Bench** dataset, which draws inspiration from MT-Bench but is tailored specifically to Korean language assessment. The dataset includes tasks spanning a wide range of user scenarios, so that it can evaluate key capabilities such as multi-turn conversation and instruction adherence.
+## Usage Application Form
+To use this model, please complete the application form and submit it via email [[email protected]].
+Access will be granted after your application is reviewed and approved.
+We appreciate your cooperation and look forward to assisting you.
+```
+1. **Name:**
+- (e.g., John Doe)
+2. **Date of Birth:**
+- (e.g., January 1, 1990)
+3. **Affiliation:**
+- Are you applying as a company or an individual? [ ] Company [ ] Individual
+- Company Name (if applicable):
+- Department (if applicable):
+4. **Position/Role:**
+- (e.g., Data Scientist, Researcher, etc.)
+5. **Contact Information:**
+- Email:
+- Phone Number:
+6. **Purpose of Use:**
+- (e.g., Research and Development, Commercial use, Educational purposes, etc.)
+7. **Detailed Reason for Use:**
+- 1. Name and version of the model you wish to use:
+- 2. Reason for selecting this model:
+- 3. Objectives to achieve using this model:
+- 4. Expected use cases (please describe in as much detail as possible):
+8. **Data Security and Ethical Use Plan:**
+- (Please describe your plans for data protection and ethical use.)
+```
+## Usage
+We use the Ko-Bench system prompts, which are Korean translations of the [MT-Bench](https://github.com/lm-sys/FastChat/blob/main/fastchat/llm_judge/data/judge_prompts.jsonl) judge prompts. Unlike MT-Bench, Ko-Bench assigns a score of 0 if the question and answer are in different languages.
+### single-v1
+Single-response general question evaluation. It scores responses from 1 to 10 based on various quality factors.
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+import torch
+model_id = "davidkim205/keval-2-1b"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+# Load the model in 4-bit precision for memory efficiency (requires the bitsandbytes package)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
+    device_map="auto",
+)
+question = "이 대화 전반에 걸쳐 아이언맨의 토니 스타크 페르소나를 구현해 보세요. \"아이언맨으로서\"와 같은 문구로 시작하지 마세요. 첫 번째 질문은 다음과 같습니다. 아이언맨이 되어 가장 좋은 부분은 무엇입니까?"
+answer = "안녕하세요, 아이언맨이 되어 가장 좋은 부분을 물어보셨네요. 제 생각에는 최첨단 기술로 무장한 슈트를 입고 하늘을 날아다니며 세상을 구하는 것이 가장 짜릿하고 보람찬 일인 것 같아요. 제 재능과 능력을 십분 발휘해 악당들과 싸우고 정의를 실현하는 것만큼 멋진 일도 없죠.\n\n물론 슈트 없이도 천재적인 두뇌와 카리스마로 세상에 긍정적인 영향력을 미치는 것도 큰 장점이에요. 스타크 인더스트리를 이끌며 혁신적인 기술을 개발하고, 어벤져스의 리더로서 팀을 잘 이끄는 것도 자부심을 느끼는 부분이죠.\n\n돈, 명성, 인기도 있지만 무엇보다 제 능력으로 사람들을 돕고 세상을 더 나은 곳으로 만드는 것이 가장 큰 보람이에요. 아이언맨으로서 느끼는 자부심과 삶의 의미를 함께 나눌 수 있어 기쁘네요."
+prompt = "[지시]\n 공정한 심판으로서 아래에 표시된 사용자 질문에 대한 AI 어시스턴트의 응답 품질을 평가해주세요. 질문과 대답의 언어가 동일하지 않으면 무조건 0점입니다. 평가에서는 응답의 유용성, 관련성, 정확성, 깊이, 창의성, 상세함 등의 요소를 고려해야 합니다. 평가를 시작하기 전에 짧은 설명을 제공하세요. 가능한 한 객관적으로 평가하세요. 설명을 제공한 후 다음 형식을 엄격히 따라 1에서 10점 사이로 평가해야 합니다: \"[[rating]]\", 예를 들어: \"Rating: [[5]]\".\n\n[Question]\n{question}\n\n[어시스턴트 답변의 시작]\n{answer}\n[어시스턴트 답변의 끝]"
+conversation = [
+    {"role": "system", "content": ""},
+    {"role": "user", "content": prompt.format(question=question, answer=answer)}
+]
+formatted_conversation = tokenizer.apply_chat_template(
+    conversation, tokenize=False, add_generation_prompt=True
+)
+inputs = tokenizer(formatted_conversation, return_tensors="pt", add_special_tokens=False)
+inputs = {key: tensor.to(model.device) for key, tensor in inputs.items()}
+with torch.no_grad():
+    # Generate the output response based on the input tokens
+    outputs = model.generate(**inputs, max_new_tokens=4096, temperature=0.7)
+    print(tokenizer.decode(
+        outputs[0][inputs['input_ids'].size(1):], skip_special_tokens=True
+    ))
+```
+```
+이 응답은 사용자의 요청에 잘 부합하며, 아이언맨의 페르소나를 잘 구현하고 있습니다. 기술로 무장한 슈트를 입고 하늘을 날아다니며 세상을 구하는 짜릿함과 보람, 그리고 재능과 능력을 발휘하여 악당과 싸우고 정의를 실현하는 것에 대한 설명은 아이언맨의 캐릭터를 잘 반영하고 있습니다. 또한, 슈트 없이도 천재적인 두뇌와 카리스마로 세상에 긍정적인 영향을 미치는 것, 스타크 인더스트리를 이끌고 혁신적인 기술을 개발하며, 어벤져스의 리더로서 팀을 이끄는 것에 대한 설명도 아이언맨의 다양한 측면을 잘 보여줍니다. 전반적으로 응답은 유용하고 관련성이 있으며, 질문에 대한 깊이 있는 답변을 제공합니다.
+Rating: [[9]]
+```
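The judge's reply ends with a rating in the `[[N]]` format that the prompt requests. A minimal sketch of extracting that score follows; `parse_rating` is a hypothetical helper that is not part of this model's release, and the regular expression is an assumption based on the output format shown above. Returning `None` for replies without a usable rating is one plausible way to identify the badly formatted answers that the evaluation tables count as `wrong`.

```python
import re
from typing import Optional

# Assumed pattern: an integer wrapped in double square brackets, e.g. "[[9]]".
RATING_PATTERN = re.compile(r"\[\[(\d+)\]\]")


def parse_rating(judgment: str) -> Optional[int]:
    """Return the last [[N]] rating found in the judge output, or None."""
    matches = RATING_PATTERN.findall(judgment)
    if not matches:
        return None  # the reply did not follow the required format
    rating = int(matches[-1])
    # Ko-Bench allows 0 (language mismatch) in addition to the 1-10 scale.
    return rating if 0 <= rating <= 10 else None


print(parse_rating("전반적으로 응답은 유용합니다.\nRating: [[9]]"))  # 9
print(parse_rating("no rating in this reply"))  # None
```

Taking the last match rather than the first is a design choice: the explanation precedes the rating, so any bracketed number quoted earlier in the explanation is ignored.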
+### single-math-v1
+Single-response math evaluation. It compares an AI response to a reference answer and scores accuracy.
+```python
+question = "주사위 두 개를 굴릴 때 총 숫자가 3 이상이 나올 확률은 얼마입니까?"
+ref_answer_1 = "주사위 두 개를 굴릴 때 총 숫자가 3 이상이 나올 확률을 계산해 보겠습니다.\n\n먼저, 주사위 두 개를 굴릴 때 나올 수 있는 모든 경우의 수는 6 * 6 = 36가지입니다.\n\n총 숫자가 3 이상이 되는 경우를 계산해 보겠습니다. 주사위 두 개의 합이 3 미만인 경우는 다음과 같습니다:\n1. 합이 2인 경우: (1, 1)\n\n따라서, 합이 3 미만인 경우는 1가지입니다.\n\n이제, 총 숫자가 3 이상이 되는 경우의 수는 36 - 1 = 35가지입니다.\n\n따라서, 주사위 두 개를 굴릴 때 총 숫자가 3 이상이 나올 확률은 35/36입니다."
+answer = "주사위 두 개를 굴릴 때 총 숫자가 3 이상이 나올 확률은 거의 항상입니다. 只有当 두 주사위가 모두 1이 나올 때만 3 미만이 됩니다. 따라서 확률은 35/36, 즉 거의 100%입니다!"
+prompt = "[지시]\n공정한 심판으로서 아래에 표시된 사용자 질문에 대한 AI 어시스턴트의 응답 품질을 평가해주세요. 질문과 대답의 언어가 동일하지 않으면 무조건 0점입니다. 평가는 정확성과 유용성을 고려해야 합니다. 참고 답변과 어시스턴트의 답변이 제공될 것입니다. 평가를 시작하기 위해 어시스턴트의 답변을 참고 답변과 비교하세요. 각 답변의 실수를 식별하고 수정하세요. 가능한 한 객관적으로 평가하세요. 설명을 제공한 후 다음 형식을 엄격히 따라 응답을 1점에서 10점 사이로 평가해야 합니다: \"[[rating]]\", 예를 들어: \"Rating: [[5]]\".\n\n[질문]\n{question}\n\n[참조 답변의 시작]\n{ref_answer_1}\n[참조 답변의 끝]\n\n[어시스턴트 답변의 시작]\n{answer}\n[어시스턴트 답변의 끝]"
+conversation = [
+    {"role": "system", "content": ""},
+    {"role": "user", "content": prompt.format(question=question, ref_answer_1=ref_answer_1, answer=answer)}
+]
+formatted_conversation = tokenizer.apply_chat_template(
+    conversation, tokenize=False, add_generation_prompt=True
+)
+inputs = tokenizer(formatted_conversation, return_tensors="pt", add_special_tokens=False)
+inputs = {key: tensor.to(model.device) for key, tensor in inputs.items()}
+with torch.no_grad():
+    # Generate the output response based on the input tokens
+    outputs = model.generate(**inputs, max_new_tokens=4096, temperature=0.7)
+    print(tokenizer.decode(
+        outputs[0][inputs['input_ids'].size(1):], skip_special_tokens=True
+    ))
+```
+```
+어시스턴트의 답변은 질문에 대한 정확한 계산을 제공하지 못했습니다. 주사위 두 개를 굴릴 때 총 숫자가 3 이상이 나올 확률을 계산하는 과정에서 잘못된 설명을 제공했습니다.
+참조 답변은 주사위 두 개를 굴릴 때 나올 수 있는 모든 경우의 수를 정확히 계산하고, 총 숫자가 3 이상이 되는 경우의 수를 올바르게 구하여 확률을 계산했습니다. 반면, 어시스턴트의 답변은 잘못된 설명을 제공하여 정확한 계산을 방해했습니다.
+어시스턴트의 답변에서의 주요 실수:
+1. "거의 항상"이라는 표현은 확률을 명확히 설명하지 못합니다.
+2. "只有当"이라는 중국어가 포함되어 있어 질문의 언어와 일치하지 않습니다.
+3. 총 숫자가 3 미만이 되는 경우의 수를 잘못 계산했습니다.
+따라서, 어시스턴트의 답변은 정확성과 유용성 모두에서 부족합니다.
+Rating: [[0]]
+```
 ## Evaluation
+### Diff
+`diff` is the absolute difference between the label score and the predicted score for each test example, and the `score` column summarizes these differences as a single percentage. The `wrong` column counts responses that do not follow the required output format, and `length` is the total number of test examples. The columns labeled 0 through 10 give the count and percentage of examples with that difference between label and predicted score.
+The score is calculated as follows:
+  1. Calculate the difference between the label and predicted score for each pair.
+  2. Assign a full point for a difference of 0 and half a point for a difference of 1; larger differences receive no points.
+  3. Divide the sum of all points by the number of data points.
+|    | model      | wrong    | score   |   length | 0          | 1         | 2         | 3         | 4        |   5 | 6        |   7 |   8 |   9 | 10       |
+|---:|:-----------|:---------|:--------|---------:|:-----------|:----------|:----------|:----------|:---------|----:|:---------|----:|----:|----:|:---------|
+|  0 | keval-2-9b | 0 (0.0%) | 61.4%   |       22 | 11 (50.0%) | 5 (22.7%) | 2 (9.1%)  | 3 (13.6%) | 0        |   0 | 0        |   0 |   0 |   0 | 1 (4.5%) |
+|  1 | keval-2-3b | 0 (0.0%) | 59.1%   |       22 | 10 (45.5%) | 6 (27.3%) | 4 (18.2%) | 2 (9.1%)  | 0        |   0 | 0        |   0 |   0 |   0 | 0        |
+|  2 | keval-2-1b | 0 (0.0%) | 43.2%   |       22 | 8 (36.4%)  | 3 (13.6%) | 5 (22.7%) | 2 (9.1%)  | 1 (4.5%) |   0 | 1 (4.5%) |   0 |   0 |   0 | 2 (9.1%) |
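The scoring rule described above can be sketched in a few lines. `diff_score` is a hypothetical name, and the use of the absolute difference is an assumption inferred from the non-negative column labels; this is not the authors' evaluation script. Because the per-example labels and predictions are not published in this card, the example rebuilds only the difference counts from the keval-2-1b row, pairing each difference with a placeholder label of 0.

```python
from typing import Sequence


def diff_score(labels: Sequence[int], predictions: Sequence[int]) -> float:
    """Average points: 1.0 for an exact match, 0.5 when off by one, else 0."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and equal in length")
    points = 0.0
    for label, prediction in zip(labels, predictions):
        gap = abs(label - prediction)  # assumed: absolute difference
        if gap == 0:
            points += 1.0
        elif gap == 1:
            points += 0.5
    return points / len(labels)


# Difference counts taken from the keval-2-1b row of the table.
gaps = [0] * 8 + [1] * 3 + [2] * 5 + [3] * 2 + [4] + [6] + [10] * 2
placeholder_labels = [0] * len(gaps)  # only the gaps matter for the score
print(f"{diff_score(placeholder_labels, gaps):.1%}")  # 43.2%
```

The result, (8 + 3 × 0.5) / 22, matches the 43.2% reported for keval-2-1b.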
+### Accuracy
+The `score` column is the ratio of exactly matching predictions to the total number of data points. The `wrong` column shows the count and percentage of incorrectly formatted answers. The columns labeled 0 through 10 give, for each label value, the number of correct predictions and their share of the examples with that label.
+|    | model      | wrong    | score   |   length | 0          | 1         | 2          |   3 | 4          | 5         | 6         | 7         | 8         | 9         | 10         |
+|---:|:-----------|:---------|:--------|---------:|:-----------|:----------|:-----------|----:|:-----------|:----------|:----------|:----------|:----------|:----------|:-----------|
+|  0 | keval-2-9b | 0 (0.0%) | 50.0%   |       22 | 1 (50.0%)  | 1 (50.0%) | 2 (100.0%) |   0 | 2 (100.0%) | 0         | 0         | 1 (50.0%) | 1 (50.0%) | 1 (50.0%) | 2 (100.0%) |
+|  1 | keval-2-3b | 0 (0.0%) | 45.5%   |       22 | 2 (100.0%) | 1 (50.0%) | 0          |   0 | 2 (100.0%) | 1 (50.0%) | 0         | 1 (50.0%) | 1 (50.0%) | 0         | 2 (100.0%) |
+|  2 | keval-2-1b | 0 (0.0%) | 36.4%   |       22 | 0          | 1 (50.0%) | 2 (100.0%) |   0 | 1 (50.0%)  | 0         | 1 (50.0%) | 0         | 0         | 1 (50.0%) | 2 (100.0%) |
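The accuracy columns can be computed in the same spirit. The sketch below is a hedged illustration, not the authors' code: `accuracy_report` is a hypothetical helper, and the labels and predictions are toy values invented for the example, not data from the evaluation set.

```python
from collections import defaultdict
from typing import Dict, Sequence, Tuple


def accuracy_report(
    labels: Sequence[int], predictions: Sequence[int]
) -> Tuple[float, Dict[int, float]]:
    """Return overall exact-match accuracy and accuracy per label value."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and equal in length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for label, prediction in zip(labels, predictions):
        total[label] += 1
        if label == prediction:
            correct[label] += 1
    overall = sum(correct.values()) / len(labels)
    per_label = {label: correct[label] / total[label] for label in sorted(total)}
    return overall, per_label


# Toy data for illustration only.
toy_labels = [0, 0, 5, 5, 10, 10]
toy_predictions = [0, 1, 5, 5, 9, 10]
overall, per_label = accuracy_report(toy_labels, toy_predictions)
print(f"{overall:.1%}")  # 66.7%
print(per_label)  # {0: 0.5, 5: 1.0, 10: 0.5}
```

Unlike the diff score, this metric gives no partial credit: a prediction that is off by one counts the same as one that is off by ten.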