🌟 HiCUPID Evaluation Model

📌 Model Description

This model is an evaluation model for the HiCUPID dataset. It assesses personalization and logical validity in model-generated answers to questions.

Training Data: 400K GPT-4o evaluation samples, distilled into this model via Supervised Fine-Tuning (SFT).
Alignment: Well-aligned with human evaluations for reliable assessment.
Base Model: Built on Llama-3.2-3B-Instruct with LoRA (Rank 128) PEFT fine-tuning.
More Details: Check out our paper:
- 📖 "Exploring the Potential of LLMs as Personalized Assistants: Dataset, Evaluation, and Analysis."

🚀 Usage

There are three types of questions in the evaluation:

Persona
Profile
Schedule

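Each type has its own evaluation prompt. The persona and profile prompts compare two answers (A and B) on personalization and logical validity, while the schedule prompt judges a single answer on personalization with a YES/NO verdict.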

EVALUATION_PROMPTS = {
    "persona": [
        {
            "role": "user",
            "content": """
Evaluate two responses (A and B) to a given question based on the following criteria:

1. Personalization: Does the response effectively consider the user's provided personal information?
2. Logical Validity: Is the response logically sound and relevant to the question?

For each criterion, provide a brief one-line comparison of the two responses and select the better response (A, B, or Tie).

- Ensure the comparisons are concise and directly address the criteria.
- If both answers are equally strong or weak in a category, mark it as a Tie.
- Do not use bold font.

Output Format:
1. Personalization: [Brief comparison of A and B]
2. Logical Validity: [Brief comparison of A and B]
Better Response: [A/B/Tie]

Input:
- User's Personal Information:
  - Characteristics:
    - The user {relation} {entity}
- Question: {question}
- Answer (A): {answer_a}
- Answer (B): {answer_b}
""".strip(),
        },
    ],
    "profile": [
        {
            "role": "user",
            "content": """
Evaluate two responses (A and B) to a given question based on the following criteria:

1. Personalization: Does the response effectively consider the user's provided personal information?
2. Logical Validity: Is the response logically sound and relevant to the question?

For each criterion, provide a brief one-line comparison of the two responses and select the better response (A, B, or Tie).

- Ensure the comparisons are concise and directly address the criteria.
- If both answers are equally strong or weak in a category, mark it as a Tie.
- Do not use bold font.

Output Format:
1. Personalization: [Brief comparison of A and B]
2. Logical Validity: [Brief comparison of A and B]
Better Response: [A/B/Tie]

Input:
- User's Personal Information:
  - Profile:
    - Age: {age}
    - Gender: {gender}
    - Personality: {personality}
    - Occupation: {occupation}
    - Income Range: {income_range}
  - Characteristics:
    - The user {relation} {entity}
- Question: {question}
- Answer (A): {answer_a}
- Answer (B): {answer_b}
""".strip(),
        },
    ],
    "schedule": [
        {
            "role": "user",
            "content": """
Evaluate the given answer based on the provided question and user's schedule.

The evaluation criterion is as follows:
- Personalization: Does the answer acknowledge and appropriately consider the user's schedule?

Provide a short one-line evaluation for the criterion and indicate whether it satisfies the criterion with YES or NO. Do not use bold font.

Output Format:
- Personalization: [Short one-line evaluation]
Satisfaction: [YES/NO]

Example 1:
- User's Schedule:
  - Datetime: 11 AM on March 13, 2025
  - Event: Attend a networking brunch with fellow sports journalists and industry experts.
- Question: Can you add a meeting with the editorial team to my calendar for 11 AM on March 13, 2025?
- Answer: It seems you already have a networking brunch with fellow sports journalists and industry experts scheduled for 11 AM on March 13, 2025. Would you like to replace this event with the editorial team meeting, or should we consider rescheduling one of them?

- Personalization: The answer acknowledges the existing event and offers alternatives.
Satisfaction: YES

Example 2:
- User's Schedule:
  - Datetime: 11 AM on March 13, 2025
  - Event: Attend a networking brunch with fellow sports journalists and industry experts.
- Question: Can you add a meeting with the editorial team to my calendar for 11 AM on March 13, 2025?
- Answer: Certainly, I can add the meeting with the editorial team to your calendar for 11 AM on March 13, 2025.

- Personalization: The answer ignores the existing event in the user's schedule.
Satisfaction: NO

Input:
- User's Schedule:
  - Datetime: {datetime}
  - Event: {event}
- Question: {question}
- Answer: {answer}
""".strip(),
        },
    ],
}
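
The templates use named placeholders such as `{question}` and `{answer}`, so one way to fill them is Python's `str.format`. The following is a minimal sketch under that assumption: the helper name `build_prompt` and the shortened stand-in dictionary `DEMO_PROMPTS` are illustrative and not part of the released code. In real use you would pass the full `EVALUATION_PROMPTS` dictionary instead.

```python
import copy

# Shortened stand-in for EVALUATION_PROMPTS so this sketch runs on its own
# (hypothetical; pass the full dictionary above in real use).
DEMO_PROMPTS = {
    "schedule": [
        {
            "role": "user",
            "content": (
                "Input:\n"
                "- User's Schedule:\n"
                "  - Datetime: {datetime}\n"
                "  - Event: {event}\n"
                "- Question: {question}\n"
                "- Answer: {answer}"
            ),
        },
    ],
}


def build_prompt(prompts, question_type, **fields):
    """Return a chat-format message list with every placeholder filled in.

    Raises KeyError if the question type is unknown or if a placeholder
    required by the template is missing from `fields`.
    """
    # Deep-copy so the shared template is never modified in place.
    messages = copy.deepcopy(prompts[question_type])
    for message in messages:
        message["content"] = message["content"].format(**fields)
    return messages


messages = build_prompt(
    DEMO_PROMPTS,
    "schedule",
    datetime="11 AM on March 13, 2025",
    event="Attend a networking brunch.",
    question="Can you add a meeting at 11 AM on March 13, 2025?",
    answer="Certainly, I can add the meeting.",
)
print(messages[0]["content"])
```

The returned list has the same shape as the `prompt` variable in the Quick Usage code, so it can be handed to `tokenizer.apply_chat_template` unchanged.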

💡 Quick Usage

Fill in the prompt for the relevant question type and pass it to the model through the chat template. The example below evaluates a persona question.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "12kimih/Llama-3.2-3B-HiCUPID"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
# Left padding keeps each prompt adjacent to its generated tokens when batching.
tokenizer = AutoTokenizer.from_pretrained(model_id, padding_side="left")
# Reuse the EOS token as the padding token.
tokenizer.pad_token = tokenizer.eos_token

prompt = [
   {
       "role": "user",
       "content": """
Evaluate two responses (A and B) to a given question based on the following criteria:

1. Personalization: Does the response effectively consider the user's provided personal information?
2. Logical Validity: Is the response logically sound and relevant to the question?

For each criterion, provide a brief one-line comparison of the two responses and select the better response (A, B, or Tie).

- Ensure the comparisons are concise and directly address the criteria.
- If both answers are equally strong or weak in a category, mark it as a Tie.
- Do not use bold font.

Output Format:
1. Personalization: [Brief comparison of A and B]
2. Logical Validity: [Brief comparison of A and B]
Better Response: [A/B/Tie]

Input:
- User's Personal Information:
  - Characteristics:
    - The user has not been to New Zealand
- Question: What wildlife is unique to certain regions?
- Answer (A): You're interested in wildlife unique to certain regions. As a fan of Mark Rober's videos, you might enjoy learning about the unique wildlife found in different parts of the world. For example, the Amazon rainforest is home to over 1,300 species of birds, while the Galapagos Islands are known for their giant tortoises and marine iguanas.
- Answer (B): New Zealand is home to unique wildlife such as the kiwi bird, tuatara, and Hector's dolphin.
""".strip(),
   }
]

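# Apply the chat template, tokenize, and move the tensors to the model's device.
inputs = tokenizer.apply_chat_template(prompt, tokenize=True, add_generation_prompt=True, padding=True, return_dict=True, return_tensors="pt").to(model.device)
# Greedy decoding keeps the evaluation deterministic.
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
# Slice off the prompt tokens so only the newly generated text is decoded.
decoded_outputs = tokenizer.batch_decode(outputs[:, inputs["input_ids"].size(1):], skip_special_tokens=True)
print(decoded_outputs[0])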

Example output:

1. Personalization: A attempts to personalize by referencing a fan of Mark Rober's videos, but this is irrelevant to the user's provided information, while B directly considers the user's lack of experience with New Zealand.
2. Logical Validity: A provides a broader and more general answer about unique wildlife, but it is less specific and relevant to the question, whereas B is more focused and directly answers the question with examples from New Zealand.
Better Response: B
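
Because the prompts fix the output format, the final verdict can be pulled out with a regular expression. The sketch below assumes the model follows the "Output Format" section of each prompt exactly. The function name `parse_verdict` is hypothetical and not part of the released code. It returns `None` when no verdict line is found, so malformed generations can be counted or retried rather than silently misread.

```python
import re

# Final-verdict lines as specified in the prompts' "Output Format" sections.
_PAIRWISE = re.compile(r"^Better Response:\s*(A|B|Tie)\s*$", re.MULTILINE)
_SCHEDULE = re.compile(r"^Satisfaction:\s*(YES|NO)\s*$", re.MULTILINE)


def parse_verdict(text, question_type):
    """Extract the final verdict from the evaluator's output, or None if absent."""
    # Schedule prompts end with a YES/NO line; persona and profile prompts
    # end with an A/B/Tie line.
    pattern = _SCHEDULE if question_type == "schedule" else _PAIRWISE
    match = pattern.search(text)
    return match.group(1) if match else None


sample = (
    "1. Personalization: B directly considers the user's information.\n"
    "2. Logical Validity: B is more focused on the question.\n"
    "Better Response: B"
)
print(parse_verdict(sample, "persona"))  # B
```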

📂 Full Dataset Evaluation

For large-scale evaluation, check out our GitHub repository for scripts and detailed instructions.

📝 License

This project is licensed under the Apache-2.0 license. See the LICENSE file for details.

🔖 Citation

If you use this project in your research or work, please consider citing it. Here’s a suggested citation format in BibTeX:

@misc{mok2025exploringpotentialllmspersonalized,
      title={Exploring the Potential of LLMs as Personalized Assistants: Dataset, Evaluation, and Analysis}, 
      author={Jisoo Mok and Ik-hwan Kim and Sangkwon Park and Sungroh Yoon},
      year={2025},
      eprint={2506.01262},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.01262}, 
}
