CompassJudger-1

๐Ÿค— Hugging Face   |   ๐Ÿค– ModelScope   |    ๐Ÿ“‘ Paper    |    ๐ŸŽ–๏ธ Leaderboard   

Introduction

The CompassJudger-1 series are an All-in-one Judge Models introduced by Opencompass. These models not only excel in various evaluation methods through scoring and comparison but also can output reviews with assessment details in a specified format, making them suitable for any evaluation dataset. Moreover, they can perform general tasks akin to a typical instruction model, thus serving as a versatile tool with strong generalization and judging capabilities.

  • Comprehensive Evaluation Capabilities: CompassJudger-1 is capable of executing multiple evaluation methods, including but not limited to scoring, comparison, and providing detailed assessment feedback.
  • Formatted Output: Supports outputting in a specific format as per instructions, facilitating further analysis and understanding of the evaluation results.
  • Versatility: In addition to its evaluation functions, CompassJudger-1 can also act as a universal instruction model to accomplish daily tasks. It also supports model inference acceleration methods such as vLLM and LMdeploy.

Quick Start

Here provides a code to show you how to load the tokenizer and model and how to generate contents.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "opencompass/CompassJudger-1-7B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

prompt = """your prompt"""

messages = [
    {"role": "user", "content": prompt}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=2048
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

We also provide some examples for different usage situations:

General Chat

**Input**: Hello, can you help me to judge something?

**Output**: Of course! I'd be happy to help you make a judgment or provide any assistance you need. Please tell me what you're looking to evaluate or understand.

Judge as Reward Model

**Input**: ```Please read the dialogue between the two assistants and the user to determine which assistant performed better during the conversation.Here is the dialogue content:
[Dialogue Begin]
User: What is a 5-letter word that starts with the letter \"A\" and contains the letters \"D\", \"R\", and \"O\" where \"D\" is not the second letter?
Assistant A: Aardvark.
Assistant B: The word that meets the given criteria is \"adroit\".
User: \"D\" shouldn't be the second letter and the word must be a 5-letter word.
Assistant A: Aardvark.
Assistant B: I apologize for the confusion. A 5-letter word that starts with the letter \"A\" and contains the letters \"D\", \"R\", and \"O\" where \"D\" is not the second letter is \"ardor\".
[Dialogue End]
If you believe Assistant A performed better, please output A directly.\nIf you believe Assistant B performed better, please output B directly.\nDo not output any other content, just the option. Please output:```

**Output**: B

Point-wise Judge

**Input**: ```ไฝ ๆ˜ฏไธ€ไธชๆ“…้•ฟ่ฏ„ไปทๆ–‡ๆœฌ่ดจ้‡็š„ๅŠฉๆ‰‹ใ€‚\n่ฏทไฝ ไปฅๅ…ฌๆญฃ็š„่ฏ„ๅˆค่€…็š„่บซไปฝ๏ผŒ่ฏ„ไผฐไธ€ไธชAIๅŠฉๆ‰‹ๅฏนไบŽ็”จๆˆทๆ้—ฎ็š„ๅ›ž็ญ”็š„่ดจ้‡ใ€‚็”ฑไบŽๆ‚จ่ฏ„ไผฐ็š„ๅ›ž็ญ”็ฑปๅž‹ๆ˜ฏ่ง’่‰ฒๆ‰ฎๆผ”๏ผŒๅ› ๆญคไฝ ้œ€่ฆไปŽไธ‹้ข็š„ๅ‡ ไธช็ปดๅบฆๅฏนๅ›ž็ญ”่ฟ›่กŒ่ฏ„ไผฐ:\n1. ไบ‹ๅฎžๆญฃ็กฎๆ€ง: ๅ›ž็ญ”ไธญๆไพ›็š„ไฟกๆฏๆ˜ฏๅฆๅ‡†็กฎๆ— ่ฏฏ๏ผŒๆ˜ฏๅฆๅŸบไบŽๅฏไฟก็š„ไบ‹ๅฎžๅ’Œๆ•ฐๆฎใ€‚\n2. ๆปก่ถณ็”จๆˆท้œ€ๆฑ‚: ๅ›ž็ญ”ๆ˜ฏๅฆๆปก่ถณไบ†็”จๆˆทๆๅ‡บ้—ฎ้ข˜็š„็›ฎ็š„ๅ’Œ้œ€ๆฑ‚๏ผŒๆ˜ฏๅฆๅฏน้—ฎ้ข˜่ฟ›่กŒไบ†ๅ…จ้ข่€Œๆฐๅฝ“็š„ๅ›žๅบ”ใ€‚\n3. ้€ป่พ‘่ฟž่ดฏๆ€ง: ๅ›ž็ญ”ๆ˜ฏๅฆๅœจๆ•ดไฝ“ไธŠไฟๆŒไธ€่‡ด๏ผŒๆ˜ฏๅฆๅœจไธๅŒ้ƒจๅˆ†ไน‹้—ดไฟๆŒ้€ป่พ‘่ฟž่ดฏๆ€ง๏ผŒ้ฟๅ…ไบ†่‡ช็›ธ็Ÿ›็›พใ€‚\n4. ๅˆ›้€ ๆ€ง: ๅ›ž็ญ”ๆ˜ฏๅฆๅ…ทๆœ‰ๅˆ›ๆ–ฐๆ€งๆˆ–็‹ฌ็‰นๆ€ง๏ผŒๆ˜ฏๅฆๆไพ›ไบ†ๆ–ฐ้ข–็š„่ง่งฃๆˆ–่งฃๅ†ณๆ–นๆณ•ใ€‚\n5. ไธฐๅฏŒๅบฆ: ๅ›ž็ญ”ๅŒ…ๅซไธฐๅฏŒ็š„ไฟกๆฏใ€ๆทฑๅบฆใ€ไธŠไธ‹ๆ–‡่€ƒ่™‘ใ€ๅคšๆ ทๆ€งใ€่ฏฆ็ป†่งฃ้‡Šๅ’Œๅฎžไพ‹๏ผŒไปฅๆปก่ถณ็”จๆˆท้œ€ๆฑ‚ๅนถๆไพ›ๅ…จ้ข็†่งฃใ€‚\nๆˆ‘ไปฌไผš็ป™ๆ‚จๆไพ›็”จๆˆท็š„ๆ้—ฎ๏ผŒ้ซ˜่ดจ้‡็š„ๅ‚่€ƒ็ญ”ๆกˆ๏ผŒๅ’Œ้œ€่ฆไฝ ่ฏ„ไผฐ็š„AIๅŠฉๆ‰‹็š„็ญ”ๆกˆใ€‚ๅฝ“ไฝ ๅผ€ๅง‹ไฝ ็š„่ฏ„ไผฐๆ—ถ๏ผŒไฝ ้œ€่ฆๆŒ‰็…ง้ตๅฎˆไปฅไธ‹็š„ๆต็จ‹๏ผš\n1. ๅฐ†AIๅŠฉๆ‰‹็š„็ญ”ๆกˆไธŽๅ‚่€ƒ็ญ”ๆกˆ่ฟ›่กŒๆฏ”่พƒ๏ผŒๆŒ‡ๅ‡บAIๅŠฉๆ‰‹็š„็ญ”ๆกˆๆœ‰ๅ“ชไบ›ไธ่ถณ๏ผŒๅนถ่ฟ›ไธ€ๆญฅ่งฃ้‡Šใ€‚\n2. ไปŽไธๅŒ็ปดๅบฆๅฏนAIๅŠฉๆ‰‹็š„็ญ”ๆกˆ่ฟ›่กŒ่ฏ„ไปท๏ผŒๅœจๆฏไธช็ปดๅบฆ็š„่ฏ„ไปทไน‹ๅŽ๏ผŒ็ป™ๆฏไธ€ไธช็ปดๅบฆไธ€ไธช1๏ฝž10็š„ๅˆ†ๆ•ฐใ€‚\n3. ๆœ€ๅŽ๏ผŒ็ปผๅˆๆฏไธช็ปดๅบฆ็š„่ฏ„ไผฐ๏ผŒๅฏนAIๅŠฉๆ‰‹็š„ๅ›ž็ญ”็ป™ๅ‡บไธ€ไธช1๏ฝž10็š„็ปผๅˆๅˆ†ๆ•ฐใ€‚\n4. ไฝ ็š„ๆ‰“ๅˆ†้œ€่ฆๅฐฝๅฏ่ƒฝไธฅๆ ผ๏ผŒๅนถไธ”่ฆ้ตๅฎˆไธ‹้ข็š„่ฏ„ๅˆ†่ง„ๅˆ™๏ผšๆ€ป็š„ๆฅ่ฏด๏ผŒๆจกๅž‹ๅ›ž็ญ”็š„่ดจ้‡่ถŠ้ซ˜๏ผŒๅˆ™ๅˆ†ๆ•ฐ่ถŠ้ซ˜ใ€‚ๅ…ถไธญ๏ผŒไบ‹ๅฎžๆญฃ็กฎๆ€งๅ’Œๆปก่ถณ็”จๆˆท้œ€ๆฑ‚่ฟ™ไธคไธช็ปดๅบฆๆ˜ฏๆœ€้‡่ฆ็š„๏ผŒ่ฟ™ไธคไธช็ปดๅบฆ็š„ๅˆ†ๆ•ฐไธปๅฏผไบ†ๆœ€ๅŽ็š„็ปผๅˆๅˆ†ๆ•ฐใ€‚ๅฝ“ๆจกๅž‹ๅ›ž็ญ”ๅญ˜ๅœจไธŽ้—ฎ้ข˜ไธ็›ธๅ…ณ๏ผŒๆˆ–่€…ๆœ‰ๆœฌ่ดจๆ€ง็š„ไบ‹ๅฎž้”™่ฏฏ๏ผŒๆˆ–็”Ÿๆˆไบ†ๆœ‰ๅฎณๅ†…ๅฎนๆ—ถ๏ผŒๆ€ปๅˆ†ๅฟ…้กปๆ˜ฏ1ๅˆฐ2ๅˆ†๏ผ›ๅฝ“ๆจกๅž‹ๅ›ž็ญ”ๆฒกๆœ‰ไธฅ้‡้”™่ฏฏ่€Œไธ”ๅŸบๆœฌๆ— ๅฎณ๏ผŒไฝ†ๆ˜ฏ่ดจ้‡่พƒไฝŽ๏ผŒๆฒกๆœ‰ๆปก่ถณ็”จๆˆท้œ€ๆฑ‚๏ผŒๆ€ปๅˆ†ไธบ3ๅˆฐ4ๅˆ†๏ผ›ๅฝ“ๆจกๅž‹ๅ›ž็ญ”ๅŸบๆœฌๆปก่ถณ็”จๆˆท่ฆๆฑ‚๏ผŒไฝ†ๆ˜ฏๅœจ้ƒจๅˆ†็ปดๅบฆไธŠ่กจ็Žฐ่พƒๅทฎ๏ผŒ่ดจ้‡ไธญ็ญ‰๏ผŒๆ€ปๅˆ†ๅฏไปฅๅพ—5ๅˆฐ6ๅˆ†๏ผ›ๅฝ“ๆจกๅž‹ๅ›ž็ญ”่ดจ้‡ไธŽๅ‚่€ƒ็ญ”ๆกˆ็›ธ่ฟ‘๏ผŒๅœจๆ‰€ๆœ‰็ปดๅบฆไธŠ่กจ็Žฐ่‰ฏๅฅฝ๏ผŒๆ€ปๅˆ†ๅพ—7ๅˆฐ8ๅˆ†๏ผ›ๅชๆœ‰ๅฝ“ๆจกๅž‹ๅ›ž็ญ”่ดจ้‡ๆ˜พ่‘—่ถ…่ฟ‡ๅ‚่€ƒ็ญ”ๆกˆ๏ผŒๅ……ๅˆ†ๅœฐ่งฃๅ†ณไบ†็”จๆˆท้—ฎ้ข˜ๅ’Œๆ‰€ๆœ‰้œ€ๆฑ‚๏ผŒๅนถไธ”ๅœจๆ‰€ๆœ‰็ปดๅบฆไธŠ้ƒฝๆŽฅ่ฟ‘ๆปกๅˆ†็š„ๆƒ…ๅ†ตไธ‹๏ผŒๆ‰่ƒฝๅพ—9ๅˆฐ10ๅˆ†ใ€‚ไฝœไธบ็คบไพ‹๏ผŒๅ‚่€ƒ็ญ”ๆกˆๅฏไปฅๅพ—ๅˆฐ8ๅˆ†ใ€‚\n่ฏท่ฎฐไฝ๏ผŒไฝ ๅฟ…้กปๅœจไฝ ๆ‰“ๅˆ†ๅ‰่ฟ›่กŒ่ฏ„ไปทๅ’Œ่งฃ้‡Šใ€‚ๅœจไฝ ๅฏนๆฏไธช็ปดๅบฆ็š„่งฃ้‡Šไน‹ๅŽ๏ผŒ้œ€่ฆๅŠ ไธŠๅฏน่ฏฅ็ปดๅบฆ็š„ๆ‰“ๅˆ†ใ€‚ไน‹ๅŽ๏ผŒๅœจไฝ ๅ›ž็ญ”็š„ๆœซๅฐพ๏ผŒๆŒ‰็…งไปฅไธ‹ๅญ—ๅ…ธๆ ผๅผ๏ผˆๅŒ…ๆ‹ฌๆ‹ฌๅท๏ผ‰่ฟ”ๅ›žไฝ ๆ‰€ๆœ‰็š„ๆ‰“ๅˆ†็ป“ๆžœ๏ผŒๅนถ็กฎไฟไฝ ็š„ๆ‰“ๅˆ†็ป“ๆžœๆ˜ฏๆ•ดๆ•ฐ๏ผš\n{'็ปดๅบฆไธ€': ๆ‰“ๅˆ†, '็ปดๅบฆไบŒ': ๆ‰“ๅˆ†, ..., '็ปผๅˆๅพ—ๅˆ†': ๆ‰“ๅˆ†}๏ผŒไพ‹ๅฆ‚๏ผš{'ไบ‹ๅฎžๆญฃ็กฎๆ€ง': 9, 'ๆปก่ถณ็”จๆˆท้œ€ๆฑ‚': 6, ..., '็ปผๅˆๅพ—ๅˆ†': 7}ใ€‚\n็”จๆˆท็š„ๆ้—ฎ๏ผš ๆˆ‘ๆƒณ่ฎฉไฝ ๅ‡ๆ‰ฎๆˆๅ’Œไธ€ไธช20ๅฒ็š„ๅฅณ็”Ÿ่Šๅคฉ็š„่Šๅคฉๆก†ใ€‚ๆˆ‘่พ“ๅ…ฅๅ’Œ20ๅฒๅฅณ็”Ÿ็š„่Šๅคฉๅ†…ๅฎน๏ผŒไฝ ไป…ไป…ๅช่พ“ๅ‡บไธ€ไธช20ๅฒๅฅณ็”Ÿๅฏ่ƒฝไผšๅ›žๅค็š„ๅ†…ๅฎน๏ผŒไธ่ฆ่พ“ๅ‡บไปปไฝ•ๅคšไฝ™็š„ๆ–‡ๆœฌใ€‚ไฝ ๅนณๆ—ถๆœ‰ไป€ไนˆ็ˆฑๅฅฝๅ‘€๏ผŸ\n[ๅ‚่€ƒ็ญ”ๆกˆๅผ€ๅง‹]\nๆˆ‘ๅ–œๆฌข้˜…่ฏป๏ผŒๅฐคๅ…ถๆ˜ฏๅฐ่ฏดๅ’Œ่ฏ—่ฏ๏ผŒ่ฟ˜ๆœ‰ๅฐฑๆ˜ฏๅ–œๆฌขๅฌ้Ÿณไนๅ’Œ็”ป็”ปใ€‚\n[ๅ‚่€ƒ็ญ”ๆกˆ็ป“ๆŸ]\n[ๅŠฉๆ‰‹็š„็ญ”ๆกˆๅผ€ๅง‹]\nๅ–œๆฌขๅฌ้Ÿณไนๅ’Œ็œ‹็”ตๅฝฑใ€‚ไฝ ๅ‘ข๏ผŸ\n[ๅŠฉๆ‰‹็š„็ญ”ๆกˆ็ป“ๆŸ]```

**Output**: ```ๅฐ†AIๅŠฉๆ‰‹็š„็ญ”ๆกˆไธŽๅ‚่€ƒ็ญ”ๆกˆ่ฟ›่กŒๆฏ”่พƒ๏ผŒๆŒ‡ๅ‡บAIๅŠฉๆ‰‹็š„็ญ”ๆกˆๆœ‰ๅ“ชไบ›ไธ่ถณ๏ผŒๅนถ่ฟ›ไธ€ๆญฅ่งฃ้‡Šใ€‚AIๅŠฉๆ‰‹็š„็ญ”ๆกˆๆไพ›ไบ†ไธคไธช็ˆฑๅฅฝ๏ผšๅฌ้Ÿณไนๅ’Œ็œ‹็”ตๅฝฑ๏ผŒๅนถไธ”ไปฅไธ€็งๅ้—ฎๅฅ็ป“ๆŸ๏ผŒ่ฏข้—ฎๅฏนๆ–น็š„็ˆฑๅฅฝใ€‚ไธŽๅ‚่€ƒ็ญ”ๆกˆ็›ธๆฏ”๏ผŒAIๅŠฉๆ‰‹็š„ๅ›ž็ญ”ๅœจๅ†…ๅฎนไธŠ่พƒไธบ็ฎ€ๅ•๏ผŒๆฒกๆœ‰ๆไพ›่ถณๅคŸ็š„ไฟกๆฏๆฅๅฑ•็คบๅ…ถไธฐๅฏŒๅบฆๅ’Œๅˆ›้€ ๆ€งใ€‚ๅŒๆ—ถ๏ผŒๅ‚่€ƒ็ญ”ๆกˆไธญๆๅˆฐ็š„โ€œ้˜…่ฏปโ€ๅ’Œโ€œ็”ป็”ปโ€ๆ˜พ็คบไบ†ๆ›ดๅคš็š„ๅคšๆ ทๆ€งๅ’Œไธชๆ€งๅŒ–็š„็ˆฑๅฅฝ๏ผŒ่€Œไธไป…ไป…ๆ˜ฏๅฌ้Ÿณไนๅ’Œ็œ‹็”ตๅฝฑใ€‚ไปŽไธๅŒ็ปดๅบฆๅฏนAIๅŠฉๆ‰‹็š„็ญ”ๆกˆ่ฟ›่กŒ่ฏ„ไปท๏ผš
    ไบ‹ๅฎžๆญฃ็กฎๆ€ง๏ผšๅœจ่ฟ™ไธชๆƒ…ๅขƒไธญ๏ผŒไบ‹ๅฎžๆญฃ็กฎๆ€งๅนถไธๆ˜ฏไธ€ไธชไธป่ฆ็š„่ฏ„ไผฐๆ ‡ๅ‡†๏ผŒๅ› ไธบๅ›ž็ญ”็š„ๆ˜ฏๅ‡่ฎพ็š„ๅœบๆ™ฏใ€‚็„ถ่€Œ๏ผŒAIๅŠฉๆ‰‹็š„ๅ›ž็ญ”ๆฒกๆœ‰ๆถ‰ๅŠไปปไฝ•ไธๅ‡†็กฎ็š„ไฟกๆฏ๏ผŒๅ› ๆญคๅœจ่ฟ™ไธช็ปดๅบฆไธŠๆ˜ฏๅˆ้€‚็š„ใ€‚ๆ‰“ๅˆ†๏ผš8ใ€‚
    ๆปก่ถณ็”จๆˆท้œ€ๆฑ‚๏ผš็”จๆˆทๅธŒๆœ›ๅพ—ๅˆฐไธ€ไธช20ๅฒๅฅณ็”Ÿๅฏ่ƒฝไผšๅ›žๅค็š„ๅ†…ๅฎนใ€‚AIๅŠฉๆ‰‹็š„ๅ›ž็ญ”ๆไพ›ไบ†ไธ€ไบ›็ˆฑๅฅฝ๏ผŒไฝ†ๆ˜ฏๅ›ž็ญ”็š„ๅ†…ๅฎน่พƒไธบ็ฎ€ๅ•๏ผŒๆฒกๆœ‰ๅ……ๅˆ†ๆปก่ถณ็”จๆˆทๅฏ่ƒฝๆœŸๅพ…็š„ไธฐๅฏŒๆ€งๅ’Œๅคšๆ ทๆ€งใ€‚ๆ‰“ๅˆ†๏ผš5ใ€‚
    ้€ป่พ‘่ฟž่ดฏๆ€ง๏ผšAIๅŠฉๆ‰‹็š„ๅ›ž็ญ”ๅœจ้€ป่พ‘ไธŠๆ˜ฏ่ฟž่ดฏ็š„๏ผŒๆไพ›ไบ†ไธ€ไธชๅˆ็†็š„็ˆฑๅฅฝ๏ผŒๅนถไปฅไธ€็ง่‡ช็„ถ็š„ๆ–นๅผ็ป“ๆŸๅฏน่ฏใ€‚ๆ‰“ๅˆ†๏ผš8ใ€‚
    ๅˆ›้€ ๆ€ง๏ผšAIๅŠฉๆ‰‹็š„ๅ›ž็ญ”็ผบไนๅˆ›้€ ๆ€ง๏ผŒๆไพ›็š„็ˆฑๅฅฝๆฏ”่พƒๅธธ่ง๏ผŒๆฒกๆœ‰ๅฑ•็คบๅ‡บๆ–ฐ้ข–ๆˆ–็‹ฌ็‰น็š„่ง่งฃใ€‚ๆ‰“ๅˆ†๏ผš4ใ€‚
    ไธฐๅฏŒๅบฆ๏ผšAIๅŠฉๆ‰‹็š„ๅ›ž็ญ”่พƒไธบ็ฎ€็Ÿญ๏ผŒ็ผบไนๆทฑๅบฆๅ’Œ็ป†่Š‚๏ผŒๆœช่ƒฝๆไพ›ๅคšๆ ทๅŒ–็š„็ˆฑๅฅฝๆˆ–่ฏฆ็ป†็š„่งฃ้‡Š๏ผŒๅ› ๆญคๅœจไธฐๅฏŒๅบฆไธŠ่กจ็Žฐ่พƒๅทฎใ€‚ๆ‰“ๅˆ†๏ผš4ใ€‚
    ็ปผๅˆๆฏไธช็ปดๅบฆ็š„่ฏ„ไผฐ๏ผŒๅฏนAIๅŠฉๆ‰‹็š„ๅ›ž็ญ”็ป™ๅ‡บไธ€ไธช1๏ฝž10็š„็ปผๅˆๅˆ†ๆ•ฐใ€‚
็ปผๅˆๆฅ็œ‹๏ผŒAIๅŠฉๆ‰‹็š„ๅ›ž็ญ”ๅŸบๆœฌๆปก่ถณไบ†็”จๆˆท็š„้œ€ๆฑ‚๏ผŒไฝ†็ผบไนไธฐๅฏŒๅบฆๅ’Œๅˆ›้€ ๆ€ง๏ผŒไธŽๅ‚่€ƒ็ญ”ๆกˆ็›ธๆฏ”่ฟ˜ๆœ‰ไธ€ๅฎšๅทฎ่ทใ€‚็ปผๅˆๅพ—ๅˆ†๏ผš5ใ€‚
{'ไบ‹ๅฎžๆญฃ็กฎๆ€ง': 8, 'ๆปก่ถณ็”จๆˆท้œ€ๆฑ‚': 5, '้€ป่พ‘่ฟž่ดฏๆ€ง': 8, 'ๅˆ›้€ ๆ€ง': 4, 'ไธฐๅฏŒๅบฆ': 4, '็ปผๅˆๅพ—ๅˆ†': 5}```

Pair-wise Judge

**Input**: ```Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user prompt displayed below. You will be given assistant A's answer and assistant B's answer. Your job is to evaluate which assistant's answer is better.\n\nBegin your evaluation by generating your own answer to the prompt. You must provide your answers before judging any answers.\n\nWhen evaluating the assistants' answers, compare both assistants' answers with your answer. You must identify and correct any mistakes or inaccurate information.\n\nThen consider if the assistant's answers are helpful, relevant, and concise. Helpful means the answer correctly responds to the prompt or follows the instructions. Note when user prompt has any ambiguity or more than one interpretation, it is more helpful and appropriate to ask for clarifications or more information from the user than providing an answer based on assumptions. Relevant means all parts of the response closely connect or are appropriate to what is being asked. Concise means the response is clear and not verbose or excessive.\n\nThen consider the creativity and novelty of the assistant's answers when needed. Finally, identify any missing important information in the assistants' answers that would be beneficial to include when responding to the user prompt.\n\nAfter providing your explanation, you must output only one of the following choices as your final verdict with a label:\n\n1. Assistant A is significantly better: [[A>>B]]\n2. Assistant A is slightly better: [[A>B]]\n3. Tie, relatively the same: [[A=B]]\n4. Assistant B is slightly better: [[B>A]]\n5. Assistant B is significantly better: [[B>>A]]\n\nExample output: \"My final verdict is tie: [[A=B]]\".\n<|User Prompt|>\nUse ABC notation to write a melody in the style of a folk tune.\n\n<|The Start of Assistant A's Answer|>\nX:1\nT:Untitled Folk Tune\nM:4/4\nL:1/8\nK:G\n|:G2A2|B2A2|G2E2|D4|E2F2|G2F2|E2C2|B,4|\nA2B2|c2B2|A2F2|E4|D2E2|F2E2|D2B,2|C4:|\n<|The End of Assistant A's Answer|>\n\n<|The Start of Assistant B's Answer|>\nSure, I'll create a simple folk tune in ABC notation for you. ABC notation is a simple way to write down music using letters to represent notes and various symbols to represent timing and other musical elements. Here's a short, cheerful folk tune:\n\n\nX:1\nT:Folk Tune\nM:4/4\nL:1/4\nK:G\nGABc dedB|ABAF GFE2|GABc dedB|1 ABAG FGA2:|2 ABAG FGAB|\nc2ec B2dB|c2ec BAGB|c2ec B2dB|1 AGFE D3B:|2 AGFE D3z|]\n\n\nHere's a breakdown of what each part means:\n\n- `X:` The index number of the tune.\n- `T:` The title of the tune.\n- `M:` The meter, in this case, 4/4 time.\n- `L:` The default note length, in this case, 1/4.\n- `K:` The key of the tune, in this case, G major.\n\nThe tune itself is written in two parts, each ending with a repeat mark (`:|`). The first part (`GABc dedB|ABAF GFE2|...`) is repeated with a different ending (`1 ABAG FGA2:|2 ABAG FGAB|`) to add variety. The second part (`c2ec B2dB|c2ec BAGB|...`) also has a repeat with a different ending, and it ends with a long note (`D3B` or `D3z`) to give it a sense of finality.\n\nThis tune is designed to be simple and catchy, with a mix of stepwise motion and skips to give it a lively folk feel. You can play it on any melodic instrument, and it's suitable for singing as well if you add lyrics.\n<|The End of Assistant B's Answer|>\n```

**Output**: ```To generate a melody in the style of a folk tune using ABC notation, I would create a simple and repetitive melody, typical of folk music. Here is my attempt: X:1 T:Folk Tune Example M:4/4 L:1/8 K:D
|: D2 E2 | F2 E2 | D2 A2 | B4 |
|  A2 B2 | c2 B2 | A2 F2 | E4 :|
|  D2 E2 | f2 e2 | d2 A2 | G4 |
|  F2 G2 | A2 G2 | F2 D2 | E4 :|
:|
Explanation: `X:` is the tune's index number. `T:` is the title of the tune. `M:` specifies the meter (4/4 time). `L:` sets the default note length to 1/8. `K:` indicates the key (D major).
The melody is constructed with simple intervals and repetitive phrases, which are characteristic of folk music. The tune has a clear structure with two verses and a repeat sign (`:`) to indicate repetition.
Now, let's evaluate the responses from Assistant A and Assistant B.
Assistant A's Answer: The melody provided by Assistant A is concise and follows the standard ABC notation. The choice of notes and structure is simple, which is fitting for a folk tune. The use of repetition and simple progression is consistent with folk music characteristics. The melody is clear and could be easily played by a musician.
Assistant B's Answer: Assistant B provides a detailed explanation of ABC notation, which is helpful for context. The melody includes more complex elements, such as varied repeat sections and additional symbols like `1` and `2`, which might not be typical for a straightforward folk tune. The explanation about the structure and the breakdown of each part is informative and adds value to the response. However, the use of `1` and `2` in the melody structure is unconventional and might confuse someone unfamiliar with ABC notation.
Evaluation: Both assistants provide a melody in ABC notation. However, Assistant B's response includes unnecessary complexity for a folk tune and might be confusing due to unconventional use of symbols. Assistant A's response is more straightforward and adheres closely to the expectations of a folk tune in ABC notation.
My final verdict is that Assistant A's answer is slightly better due to its simplicity and directness in creating a folk tune: [[A>B]].```

Response Critique

**Input**: ```ๅธฎๆˆ‘้’ˆๅฏนไธ‹้ข็š„ๅ›ž็ญ”็ป™ๅ‡บไฟฎๆ”น็š„ๆ„่ง๏ผŒ็”จๆˆท้—ฎ้ข˜๏ผšๅ†™ไธ€้ฆ–็ง‹ๅคฉ็š„่ฏ—๏ผŒไธ่ฆๆ่ฝๅถไธคไธชๅญ—๏ผŒไฝ†่ฆ่ƒฝไฝ“็Žฐ่ฝๅถ็š„ๆƒ…ๆ™ฏ ๆจกๅž‹ๅ›ž็ญ”๏ผš็ง‹้ฃŽ่ตท๏ผŒๆผซๅคฉ้ป„๏ผŒ้ฃ˜้›ถๆ˜ฏ้‚ฃ็ฆปๆžๅฟ™ใ€‚
็ฟ ็ปฟๆ›พๆต“ไปŠๅทฒๆ™š๏ผŒๅฒๆœˆๆ‚ ๆ‚ ๏ผŒ่ฐๆ›ฟ่ฏ‰็ฆปไผคใ€‚
ๆญฅๅฑฅๅพ๏ผŒๅฝฑ่‡ช้•ฟ๏ผŒๆ ‘ๆขข็จ€็–ๆ˜ ๆ–œ้˜ณใ€‚
ๆ›พ็ป็นๅŽ็š†่ฟ‡็œผ๏ผŒๅฆ‚ไปŠๅฏ‚ๅฏž๏ผŒ้“บๆปกๅค้“ๆ—ใ€‚
ๅฒๆœˆๆฒณ๏ผŒ้™ๆ‚„ๆ‚„๏ผŒ่ฝป่ˆŸๅทฒ่ฟ‡ไธ‡้‡ๅฑฑใ€‚
้‚ฃๅนดๅคๅคฉ๏ผŒ็ฌ‘่ฏญๆ‰ฌ๏ผŒ่€ŒไปŠๆƒŸไฝ™๏ผŒ่ฝๅถ็š„ๆฒ‰้ฆ™ใ€‚```

**Output**: ```่ฟ™้ฆ–่ฏ—้žๅธธๆœ‰ๆ„ๅขƒ๏ผŒๆ•ดไฝ“ไธŠๅพˆๅฅฝๅœฐไฝ“็Žฐไบ†็ง‹ๅคฉ็š„ๆฐ›ๅ›ดไปฅๅŠไธŽ่ฝๅถ็›ธๅ…ณ่”็š„็ฆปๅˆซไธŽๅ˜่ฟๆ„Ÿใ€‚ไธ่ฟ‡็”จๆˆท่ฆๆฑ‚ไธญ็‰นๅˆซๅผบ่ฐƒไบ†โ€œไธ่ฆๆ่ฝๅถไธคไธชๅญ—๏ผŒไฝ†่ฆ่ƒฝไฝ“็Žฐ่ฝๅถ็š„ๆƒ…ๆ™ฏโ€ใ€‚ๅŸบไบŽๆญค๏ผŒๆˆ‘ไปฌๅฏไปฅๅฏน่ฏ—ไฝœ่ฟ›่กŒไธ€ไบ›่ฐƒๆ•ด๏ผŒไปฅๆ›ดๅŠ ่ดดๅˆ็”จๆˆท็š„้œ€ๆฑ‚ใ€‚่ฟ™้‡Œ็ป™ๅ‡บไธ€ไบ›ๅปบ่ฎฎ๏ผš
    โ€œ็ง‹้ฃŽ่ตท๏ผŒๆผซๅคฉ้ป„๏ผŒ้ฃ˜้›ถๆ˜ฏ้‚ฃ็ฆปๆžๅฟ™ใ€‚โ€
    ๅฏไปฅๆ”นไธบ๏ผšโ€œ็ง‹้ฃŽ่ตท๏ผŒๆผซๅคฉ้ป„๏ผŒ็ฉบไธญ่ˆžๅŠจ็ฆปๆžๅฟ™ใ€‚โ€
    ่ฟ™ๆ ทๅฐฑ้ฟๅ…ไบ†็›ดๆŽฅไฝฟ็”จโ€œ้ฃ˜้›ถโ€ไธ€่ฏ๏ผŒๅŒๆ—ถไพ็„ถไผ ่พพไบ†ๅถๅญไปŽๆ ‘ไธŠ้ฃ˜่ฝ็š„ๆƒ…ๆ™ฏใ€‚
    โ€œ็ฟ ็ปฟๆ›พๆต“ไปŠๅทฒๆ™š๏ผŒๅฒๆœˆๆ‚ ๆ‚ ๏ผŒ่ฐๆ›ฟ่ฏ‰็ฆปไผคใ€‚โ€
    ๅฏไปฅๆ”นไธบ๏ผšโ€œ็ฟ ็ปฟๆ›พๆต“ไปŠๅทฒๆ™š๏ผŒๅฒๆœˆๆ‚ ้•ฟ๏ผŒๆ•…ไบ‹ๆ‚„ๆ— ๅฃฐใ€‚โ€
    ่ฟ™ๆ ทๆ”นๅŠจๅŽ๏ผŒ่™ฝ็„ถๆฒกๆœ‰็›ดๆŽฅ่กจ่พพ็ฆปไผค๏ผŒไฝ†ไพ็„ถ่ƒฝ่ฎฉไบบๆ„Ÿๅ—ๅˆฐ้š็€ๆ—ถ้—ด็š„ๆต้€๏ผŒๆ›พ็ป็š„็นๅŽๅทฒ็ป้€ๅŽป็š„ๅฏ‚ๅฏฅใ€‚
    โ€œๆญฅๅฑฅๅพ๏ผŒๅฝฑ่‡ช้•ฟ๏ผŒๆ ‘ๆขข็จ€็–ๆ˜ ๆ–œ้˜ณใ€‚โ€
    ่ฟ™ไธ€ๅฅๅทฒ็ปๅพˆๅฅฝๅœฐๆ็ป˜ไบ†ๅ‚ๆ™šๆ—ถๅˆ†็š„ๅฎ้™ไธŽๅญค็‹ฌ๏ผŒๅฏไปฅไฟๆŒไธๅ˜ใ€‚
    โ€œๆ›พ็ป็นๅŽ็š†่ฟ‡็œผ๏ผŒๅฆ‚ไปŠๅฏ‚ๅฏž๏ผŒ้“บๆปกๅค้“ๆ—ใ€‚โ€
    ๅฏไปฅๆ”นไธบ๏ผšโ€œๅพ€ๆ˜”็นๅŽๆˆ่ฟฝๅฟ†๏ผŒๅฆ‚ไปŠๅฏ‚ๅฏž๏ผŒๅค้“ๆ˜ ๆ–œ้˜ณใ€‚โ€
    ่ฟ™ๆ ทๆ”นๅŠจๅŽ๏ผŒๆ—ข่กจ่พพไบ†ๆ›พ็ป็š„็นๅŽ๏ผŒๅˆ็ชๅ‡บไบ†ๅฝ“ๅ‰็š„ๅฏ‚ๅฏžไธŽ่ฝๅฏžใ€‚
    โ€œๅฒๆœˆๆฒณ๏ผŒ้™ๆ‚„ๆ‚„๏ผŒ่ฝป่ˆŸๅทฒ่ฟ‡ไธ‡้‡ๅฑฑใ€‚โ€
    ่ฟ™ไธ€ๅฅๅฏไปฅไฟๆŒไธๅ˜ใ€‚
    โ€œ้‚ฃๅนดๅคๅคฉ๏ผŒ็ฌ‘่ฏญๆ‰ฌ๏ผŒ่€ŒไปŠๆƒŸไฝ™๏ผŒ่ฝๅถ็š„ๆฒ‰้ฆ™ใ€‚โ€
    ๅฏไปฅๆ”นไธบ๏ผšโ€œ้‚ฃๅนดๅคๅคฉ๏ผŒ็ฌ‘่ฏญๆ‰ฌ๏ผŒ่€ŒไปŠๆƒŸไฝ™๏ผŒๆฒ‰้ฆ™ๆผซๆผซใ€‚โ€
    ้ฟๅ…ไบ†็›ดๆŽฅๆๅˆฐโ€œ่ฝๅถโ€๏ผŒๅŒๆ—ถไฟ็•™ไบ†็ง‹ๅคฉ็š„ๆ„Ÿๆ€€ไธŽๅฒๆœˆ็š„ๅ˜่ฟใ€‚
ไฟฎๆ”นๅŽ็š„่ฏ—ไฝœๅฆ‚ไธ‹๏ผš
็ง‹้ฃŽ่ตท๏ผŒๆผซๅคฉ้ป„๏ผŒ็ฉบไธญ่ˆžๅŠจ็ฆปๆžๅฟ™ใ€‚
็ฟ ็ปฟๆ›พๆต“ไปŠๅทฒๆ™š๏ผŒๅฒๆœˆๆ‚ ้•ฟ๏ผŒๆ•…ไบ‹ๆ‚„ๆ— ๅฃฐใ€‚
ๆญฅๅฑฅๅพ๏ผŒๅฝฑ่‡ช้•ฟ๏ผŒๆ ‘ๆขข็จ€็–ๆ˜ ๆ–œ้˜ณใ€‚
ๆ›พ็ป็นๅŽๆˆ่ฟฝๅฟ†๏ผŒๅฆ‚ไปŠๅฏ‚ๅฏž๏ผŒๅค้“ๆ˜ ๆ–œ้˜ณใ€‚
ๅฒๆœˆๆฒณ๏ผŒ้™ๆ‚„ๆ‚„๏ผŒ่ฝป่ˆŸๅทฒ่ฟ‡ไธ‡้‡ๅฑฑใ€‚
้‚ฃๅนดๅคๅคฉ๏ผŒ็ฌ‘่ฏญๆ‰ฌ๏ผŒ่€ŒไปŠๆƒŸไฝ™๏ผŒๆฒ‰้ฆ™ๆผซๆผซใ€‚
่ฟ™้ฆ–ไฟฎๆ”นๅŽ็š„่ฏ—ไพ็„ถๆ็ป˜ไบ†็ง‹ๅคฉ็š„ๆ™ฏ่‰ฒไธŽๅญฃ่Š‚ๅธฆๆฅ็š„ๆ€็ปช๏ผŒๅŒๆ—ถๅทงๅฆ™ๅœฐ้ฟๅ…ไบ†ไฝฟ็”จโ€œ่ฝๅถโ€ไธ€่ฏ๏ผŒ่€Œ้€š่ฟ‡ๅ…ถไป–ๆ„่ฑกไผ ่พพไบ†่ฝๅถ็š„ๆƒ…ๆ™ฏใ€‚```

JudgerBench

We have also established a new benchmark named JudgerBench, aimed at standardizing the evaluation capabilities of different judging models, thereby helping to identify more effective evaluator models. To test your judge model on JudgerBench, please follow below code with Opencompass: Change the models to your models in configs/eval_judgerbench.py then run

git clone https://github.com/open-compass/opencompass opencompass
cd opencompass
pip install -e .
python run.py configs/eval_judgerbench.py --mode all --reuse latest

We also provided a leaderboard for JudgerBench: https://huggingface.co/spaces/opencompass/judgerbench_leaderboard

If you want to add your model to this leaderboard, welcome to add an issue in this Repository.

Use CompassJudger-1 to Test Subjective Datasets in OpenCompass

If you wish to evaluate common subjective datasets using CompassJudger-1 in Opencompass, take the evaluation of Alignbench as an example. Please follow the code below:

You need to setup three items first:

  • 1.datasets (The subjective datasets you want to test)
  • 2.models (The models you want to test on the subjective datasets)
  • 3.judge_models (Which judge models you want to use as evaluator)

For more settings, please refer to the advanced guidance in OpenCompass.

from mmengine.config import read_base

with read_base():
    from opencompass.configs.datasets.subjective.alignbench.alignbench_judgeby_critiquellm import alignbench_datasets
    from opencompass.configs.models.qwen2_5.lmdeploy_qwen2_5_1_5b_instruct import models as lmdeploy_qwen2_5_1_5b_instruct 
from opencompass.models import HuggingFaceCausalLM, HuggingFace, HuggingFaceChatGLM3, OpenAI, TurboMindModelwithChatTemplate
from opencompass.partitioners import NaivePartitioner, SizePartitioner
from opencompass.partitioners.sub_naive import SubjectiveNaivePartitioner
from opencompass.partitioners.sub_size import SubjectiveSizePartitioner
from opencompass.partitioners.sub_num_worker import SubjectiveNumWorkerPartitioner
from opencompass.runners import LocalRunner
from opencompass.runners import SlurmSequentialRunner
from opencompass.tasks import OpenICLInferTask
from opencompass.tasks.subjective_eval import SubjectiveEvalTask
from opencompass.summarizers import SubjectiveSummarizer

api_meta_template = dict(
    round=[
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True),
    ]
)

# -------------Inference Stage ----------------------------------------
models = [*lmdeploy_qwen2_5_1_5b_instruct] # add models you want
datasets = [*alignbench_datasets] # add datasets you want


infer = dict(
    partitioner=dict(type=NaivePartitioner),
    runner=dict(type=LocalRunner, max_num_workers=16, task=dict(type=OpenICLInferTask)),
)
# -------------Evalation Stage ----------------------------------------

## ------------- JudgeLLM Configuration
judge_models = [dict(
    dict(
        type=TurboMindModelwithChatTemplate,
        abbr='CompassJudger-1-7B-Instruct',
        path='opencompass/CompassJudger-1-7B-Instruct',
        engine_config=dict(session_len=16384, max_batch_size=16, tp=1),
        gen_config=dict(top_k=1, temperature=1e-6, top_p=0.9, max_new_tokens=2048),
        max_seq_len=16384,
        max_out_len=2048,
        batch_size=16,
        run_cfg=dict(num_gpus=1),
    )]

## ------------- Evaluation Configuration
eval = dict(
    partitioner=dict(type=SubjectiveNaivePartitioner, models=models, judge_models=judge_models,),
    runner=dict(type=LocalRunner, max_num_workers=16, task=dict(type=SubjectiveEvalTask)),
)

summarizer = dict(type=SubjectiveSummarizer, function='subjective')
work_dir = 'outputs/subjective/'

Then run:

python run.py configs/eval_subjective.py --mode all --reuse latest

For more detailed subjective evaluation guidelines, please refer to: https://github.com/open-compass/opencompass/blob/main/docs/en/advanced_guides/subjective_evaluation.md

Subjective Evaluation Leaderboard by CompassJudger-1

To facilitate better comparisons within the community, we have tested the subjective performance of some models using CompassJudger-1.

See in: https://huggingface.co/spaces/opencompass/judgerbench_leaderboard

If you want to add your model to this leaderboard, welcome to add an issue in this Repository.

Citation

@article{cao2024compass,
  title={CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution},
  author={Maosong Cao, Alexander Lam, Haodong Duan, Hongwei Liu, Songyang Zhang, Kai Chen},
  journal={arXiv preprint arXiv:2410.16256},
  year={2024}
}

Acknowledge

Downloads last month
168
Safetensors
Model size
32.8B params
Tensor type
BF16
ยท
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Model tree for opencompass/CompassJudger-1-32B-Instruct

Base model

Qwen/Qwen2.5-32B
Finetuned
(92)
this model
Quantizations
3 models

Collection including opencompass/CompassJudger-1-32B-Instruct