
Data Annotation with Argilla Spaces

Authored by: Moritz Laurer

This notebook demonstrates a workflow for systematically evaluating LLM outputs and creating LLM training data. You can start by using this notebook to evaluate the zero-shot performance of your favorite LLM on your specific task without any fine-tuning. If you want to improve performance, you can then easily reuse this workflow to create training data.

Example use case: code generation. In this tutorial, we demonstrate how to create high-quality test and training data for code generation tasks. The same workflow can, however, be adapted to any other task relevant to your specific use case.

In this notebook, we:

  1. Download data for the example task.
  2. Prompt two LLMs to respond to these tasks. This produces "synthetic data" to speed up manual data creation.
  3. Create an Argilla annotation interface on HF Spaces to compare and evaluate the outputs of the two LLMs.
  4. Upload the example data and the zero-shot LLM responses to the Argilla annotation interface.
  5. Download the annotated data.

You can adapt this notebook to your own needs, e.g. by using a different LLM and API provider in step (2) or by adapting the annotation task in step (3).

Install required packages and connect to the HF Hub

!pip install argilla~=2.0.0
!pip install transformers~=4.40.0
!pip install datasets~=2.19.0
!pip install huggingface_hub~=0.23.2
# Login to the HF Hub. We recommend using this login method 
# to avoid the need to explicitly store your HF token in variables 
import huggingface_hub
!git config --global credential.helper store
huggingface_hub.login(add_to_git_credential=True)

Download data for the example task

First, we download an example dataset that contains code generation tasks for LLMs. We want to evaluate how well two different LLMs perform on these code generation tasks. We use instructions from the bigcode/self-oss-instruct-sc2-exec-filter-50k dataset, which was used to train the StarCoder2-Instruct model.

>>> from datasets import load_dataset

>>> # Small sample for faster testing
>>> dataset_codetask = load_dataset("bigcode/self-oss-instruct-sc2-exec-filter-50k", split="train[:3]")
>>> print("Dataset structure:\n", dataset_codetask, "\n")

>>> # We are only interested in the instructions/prompts provided in the dataset
>>> instructions_lst = dataset_codetask["instruction"]
>>> print("Example instructions:\n", instructions_lst[:2])
Dataset structure:
 Dataset({
    features: ['fingerprint', 'sha1', 'seed', 'response', 'concepts', 'prompt', 'instruction', 'id'],
    num_rows: 3
}) 

Example instructions:
 ['Write a Python function named `get_value` that takes a matrix (represented by a list of lists) and a tuple of indices, and returns the value at that index in the matrix. The function should handle index out of range errors by returning None.', 'Write a Python function `check_collision` that takes a list of `rectangles` as input and checks if there are any collisions between any two rectangles. A rectangle is represented as a tuple (x, y, w, h) where (x, y) is the top-left corner of the rectangle, `w` is the width, and `h` is the height.\n\nThe function should return True if any pair of rectangles collide, and False otherwise. Use an iterative approach and check for collisions based on the bounding box collision detection algorithm. If a collision is found, return True immediately without checking for more collisions.']

Prompt two LLMs to respond to the example tasks

Formatting the instructions with a chat_template

Before sending the instructions to an LLM API, we need to format the instructions with the correct chat_template for each model we want to evaluate. This essentially entails wrapping some special tokens around the instructions. See the docs for details on chat templates.

>>> # Apply correct chat formatting to instructions from the dataset
>>> from transformers import AutoTokenizer

>>> models_to_compare = ["mistralai/Mixtral-8x7B-Instruct-v0.1", "meta-llama/Meta-Llama-3-70B-Instruct"]


>>> def format_prompt(prompt, tokenizer):
...     messages = [{"role": "user", "content": prompt}]
...     messages_tokenized = tokenizer.apply_chat_template(
...         messages, tokenize=False, add_generation_prompt=True, return_tensors="pt"
...     )
...     return messages_tokenized


>>> prompts_formatted_dic = {}
>>> for model in models_to_compare:
...     tokenizer = AutoTokenizer.from_pretrained(model)

...     prompt_formatted = []
...     for instruction in instructions_lst:
...         prompt_formatted.append(format_prompt(instruction, tokenizer))

...     prompts_formatted_dic.update({model: prompt_formatted})


>>> print(
...     f"\nFirst prompt formatted for {models_to_compare[0]}:\n\n",
...     prompts_formatted_dic[models_to_compare[0]][0],
...     "\n\n",
... )
>>> print(
...     f"First prompt formatted for {models_to_compare[1]}:\n\n",
...     prompts_formatted_dic[models_to_compare[1]][0],
...     "\n\n",
... )
First prompt formatted for mistralai/Mixtral-8x7B-Instruct-v0.1:

 [INST] Write a Python function named `get_value` that takes a matrix (represented by a list of lists) and a tuple of indices, and returns the value at that index in the matrix. The function should handle index out of range errors by returning None. [/INST] 


First prompt formatted for meta-llama/Meta-Llama-3-70B-Instruct:

 <|begin_of_text|><|start_header_id|>user<|end_header_id|>

Write a Python function named `get_value` that takes a matrix (represented by a list of lists) and a tuple of indices, and returns the value at that index in the matrix. The function should handle index out of range errors by returning None.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Send the instructions to the HF Inference API

Now we can send the instructions to the APIs of both LLMs to obtain outputs that we can evaluate. We first need to define some parameters for generating the responses correctly. Hugging Face's LLM APIs are powered by Text Generation Inference (TGI) containers. See the TGI OpenAPI specification here and the explanations of the different parameters in the Transformers generation docs.

generation_params = dict(
    # we use low temperature and top_p to reduce creativity and increase likelihood of highly probable tokens
    temperature=0.2,
    top_p=0.60,
    top_k=None,
    repetition_penalty=1.0,
    do_sample=True,
    max_new_tokens=512 * 2,
    return_full_text=False,
    seed=42,
    # details=True,
    # stop=["<|END_OF_TURN_TOKEN|>"],
    # grammar={"type": "json"}
    max_time=None,
    stream=False,
    use_cache=False,
    wait_for_model=False,
)

Now we can send standard API requests to the Serverless Inference API (docs). Note that the Serverless Inference API is mostly intended for testing and is rate limited. For testing without rate limits, you can create your own API via Hugging Face Dedicated Endpoints (docs). See also our corresponding tutorials on the Inference API in the Open-Source AI Cookbook.


The code below will be updated once the Inference API examples are finalized.

>>> import requests
>>> from tqdm.auto import tqdm


>>> # Hint: use asynchronous API calls (and dedicated endpoints) to increase speed
>>> def query(payload=None, api_url=None):
...     response = requests.post(api_url, headers=headers, json=payload)
...     return response.json()


>>> headers = {"Authorization": f"Bearer {huggingface_hub.get_token()}"}

>>> output_dic = {}
>>> for model in models_to_compare:
...     # Create API urls for each model
...     # When using dedicated endpoints, you can reuse the same code and simply replace this URL
...     api_url = "https://api-inference.huggingface.co/models/" + model

...     # send requests to API
...     output_lst = []
...     # use the prompts formatted with each model's own chat template
...     for prompt in tqdm(prompts_formatted_dic[model]):
...         output = query(payload={"inputs": prompt, "parameters": {**generation_params}}, api_url=api_url)
...         output_lst.append(output[0]["generated_text"])

...     output_dic.update({model: output_lst})

>>> print(f"---First generation of {models_to_compare[0]}:\n{output_dic[models_to_compare[0]][0]}\n\n")
>>> print(f"---First generation of {models_to_compare[1]}:\n{output_dic[models_to_compare[1]][0]}")
---First generation of mistralai/Mixtral-8x7B-Instruct-v0.1:
Here's a Python function that meets your requirements:

```python
def get_value(matrix, indices):
    try:
        return matrix[indices[0]][indices[1]]
    except IndexError:
        return None
```

This function takes a matrix (represented by a list of lists) and a tuple of indices as input. It first tries to access the value at the given indices in the matrix. If the indices are out of range, it catches the `IndexError` exception and returns `None`.


---First generation of meta-llama/Meta-Llama-3-70B-Instruct:
Here is a Python function that does what you described:
```
def get_value(matrix, indices):
    try:
        row, col = indices
        return matrix[row][col]
    except IndexError:
        return None
```
Here's an explanation of how the function works:

1. The function takes two arguments: `matrix` (a list of lists) and `indices` (a tuple of two integers, representing the row and column indices).
2. The function tries to access the value at the specified indices using `matrix[row][col]`.
3. If the indices are out of range (i.e., `row` or `col` is greater than the length of the corresponding dimension of the matrix), an `IndexError` exception is raised.
4. The `except` block catches the `IndexError` exception and returns `None` instead of raising an error.

Here's an example usage of the function:
```
matrix = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

print(get_value(matrix, (0, 0)))  # prints 1
print(get_value(matrix, (1, 1)))  # prints 5
print(get_value(matrix, (3, 0)))  # prints None (out of range)
print(get_value(matrix, (0, 3)))  # prints None (out of range)
```
I hope this helps! Let me know if you have any questions.
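
As an alternative to the raw requests calls above, you can also query the same models with the InferenceClient from huggingface_hub, which wraps the same Inference API with built-in error handling. The sketch below is optional and assumes the models_to_compare, prompts_formatted_dic, and generation_params objects defined earlier; it is not required for the rest of the notebook.

# Optional alternative sketch: query the models via huggingface_hub's InferenceClient
# instead of raw requests calls. Assumes the variables defined in the cells above.
from huggingface_hub import InferenceClient

output_dic_client = {}
for model in models_to_compare:
    client_inference = InferenceClient(model=model, token=huggingface_hub.get_token())

    output_lst = []
    for prompt in tqdm(prompts_formatted_dic[model]):
        generated_text = client_inference.text_generation(
            prompt,
            do_sample=generation_params["do_sample"],
            temperature=generation_params["temperature"],
            top_p=generation_params["top_p"],
            repetition_penalty=generation_params["repetition_penalty"],
            max_new_tokens=generation_params["max_new_tokens"],
            return_full_text=generation_params["return_full_text"],
            seed=generation_params["seed"],
        )
        output_lst.append(generated_text)

    output_dic_client.update({model: output_lst})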

Store the LLM outputs in a dataset

Now we can store the LLM outputs in a dataset together with the original instructions.

# create a HF dataset with the instructions and model outputs
from datasets import Dataset

dataset = Dataset.from_dict(
    {
        "instructions": instructions_lst,
        "response_model_1": output_dic[models_to_compare[0]],
        "response_model_2": output_dic[models_to_compare[1]],
    }
)

dataset

Create and configure your Argilla dataset

We use Argilla, a collaboration tool for AI engineers and domain experts who need to build high-quality datasets for their projects.

We run Argilla via an HF Space, which you can set up with just a few clicks without any local setup. You can create the HF Argilla Space by following these instructions. For further configuration of HF Argilla Spaces, see the detailed documentation. If you prefer, you can also run Argilla locally via Argilla's Docker containers (see the Argilla docs).
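
If you prefer to create the Space programmatically instead of clicking through the UI, a sketch along the following lines should work. It assumes that the public Argilla template Space is called argilla/argilla-template-space; check the Argilla documentation for the current template name and the recommended Space settings (e.g. persistent storage).

# Minimal sketch for creating an Argilla Space programmatically.
# Assumption: "argilla/argilla-template-space" is the public template Space to duplicate;
# verify the current template name and settings in the Argilla docs.
from huggingface_hub import duplicate_space

new_space = duplicate_space(
    from_id="argilla/argilla-template-space",  # template Space to copy
    to_id="your-username/my-argilla-space",  # hypothetical name for your new Space
    private=False,
)
print(new_space)  # URL of the newly created Space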

Argilla login screen

Programmatically interact with Argilla

Before we can tailor the dataset to our specific task and upload the data that will be shown in the UI, we need to set up a few things first.

Connect this notebook to Argilla: We can now connect this notebook to Argilla to programmatically configure your dataset and upload/download data.

# After starting the Argilla Space (or local docker container) you can connect to the Space with the code below.
import argilla as rg

client = rg.Argilla(
    api_url="https://username-spacename.hf.space",  # Locally: "http://localhost:6900"
    api_key="your-apikey",  # You'll find it in the UI "My Settings > API key"
    # To use a private HF Argilla Space, also pass your HF token
    headers={"Authorization": f"Bearer {huggingface_hub.get_token()}"},
)
user = client.me
user

Write good annotator guidelines

Writing good guidelines for your human annotators is just as important (and difficult) as writing good training code. Good instructions should fulfill the following criteria:

  • Simple and clear: The guidelines should be simple and clear to understand for people who do not know anything about your task. Always ask at least one colleague to reread the guidelines to make sure there are no ambiguities.
  • Reproducible and explicit: All the information for doing the annotation task should be contained in the guidelines. A common mistake is to create informal interpretations of the guidelines during chats with selected annotators. Future annotators will not have this information and are likely to do the task differently than intended if it is not made explicit in the guidelines.
  • Short and comprehensive: The guidelines should be as short as possible, while containing all necessary information. Annotators tend not to read long guidelines properly, so try to keep them as short as possible while remaining comprehensive.

Note that writing annotation guidelines is an iterative process. It is good practice to do a few dozen annotations yourself and refine the guidelines based on what you learn from the data before assigning the task to others. Versioning the guidelines can also help as the task evolves over time. See further tips in this blog post.

annotator_guidelines = """\
Your task is to evaluate the responses of two LLMs to code generation tasks. 

First, you need to score each response on a scale from 0 to 7. You add points to your final score based on the following criteria:
- Add up to +2 points, if the code is properly commented, with inline comments and doc strings for functions.
- Add up to +2 points, if the code contains a good example for testing. 
- Add up to +3 points, if the code runs and works correctly. Copy the code into an IDE and test it with at least two different inputs. Attribute one point if the code is overall correct, but has some issues. Attribute three points if the code is fully correct and robust against different scenarios. 
Your resulting final score can be any value between 0 to 7. 

If both responses have a final score of <= 4, select one response and correct it manually in the text field. 
The corrected response must fulfill all criteria from above. 
"""

rating_tooltip = """\
- Add up to +2 points, if the code is properly commented, with inline comments and doc strings for functions.
- Add up to +2 points, if the code contains a good example for testing. 
- Add up to +3 points, if the code runs and works correctly. Copy the code into an IDE and test it with at least two different inputs. Attribute one point if the code works mostly correctly, but has some issues. Attribute three points if the code is fully correct and robust against different scenarios. 
"""

Cumulative ratings vs. Likert scales: Note that the guidelines above ask annotators to do cumulative ratings by adding points for explicit criteria. An alternative approach are "Likert scales", where annotators are asked to rate responses on one continuous scale, e.g. from 1 (very bad) over 3 (mediocre) to 5 (very good). We generally recommend cumulative ratings, because they force you and the annotators to make quality criteria explicit, whereas simply rating a response as "4" (good) is ambiguous and will be interpreted differently by different annotators.
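
To make the cumulative scoring from the guidelines concrete, here is a small illustrative helper (not part of the pipeline) that adds up the points for the three criteria, yielding a final score between 0 and 7:

# Illustrative only: cumulative scoring as described in the annotator guidelines.
# Each argument is the number of points an annotator assigns for one criterion.
def cumulative_score(comment_points: int, example_points: int, correctness_points: int) -> int:
    assert 0 <= comment_points <= 2  # proper inline comments and doc strings
    assert 0 <= example_points <= 2  # good example for testing
    assert 0 <= correctness_points <= 3  # code runs and works correctly
    return comment_points + example_points + correctness_points

print(cumulative_score(2, 1, 3))  # e.g. well commented, decent example, fully correct -> 6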

Tailor your Argilla dataset to your specific task

Now we can create our own code-llm task with the fields, questions, and metadata required for annotation. For more information on configuring Argilla datasets, see the Argilla docs.

dataset_argilla_name = "code-llm"
workspace_name = "argilla"
reuse_existing_dataset = False  # for easier iterative testing

# Configure your dataset settings
settings = rg.Settings(
    # The overall annotation guidelines, which human annotators can refer back to inside of the interface
    guidelines="my guidelines",
    fields=[
        rg.TextField(name="instruction", title="Instruction:", use_markdown=True, required=True),
        rg.TextField(
            name="generation_1",
            title="Response model 1:",
            use_markdown=True,
            required=True,
        ),
        rg.TextField(
            name="generation_2",
            title="Response model 2:",
            use_markdown=True,
            required=True,
        ),
    ],
    # These are the questions we ask annotators about the fields in the dataset
    questions=[
        rg.RatingQuestion(
            name="score_response_1",
            title="Your score for the response of model 1:",
            description="0=very bad, 7=very good",
            values=[0, 1, 2, 3, 4, 5, 6, 7],
            required=True,
        ),
        rg.RatingQuestion(
            name="score_response_2",
            title="Your score for the response of model 2:",
            description="0=very bad, 7=very good",
            values=[0, 1, 2, 3, 4, 5, 6, 7],
            required=True,
        ),
        rg.LabelQuestion(
            name="which_response_corrected",
            title="If both responses score below 4, select a response to correct:",
            description="Select the response you will correct in the text field below.",
            labels=["Response 1", "Response 2", "Combination of both", "Neither"],
            required=False,
        ),
        rg.TextQuestion(
            name="correction",
            title="Paste the selected response below and correct it manually:",
            description="Your corrected response must fulfill all criteria from the annotation guidelines.",
            use_markdown=True,
            required=False,
        ),
        rg.TextQuestion(
            name="comments",
            title="Annotator Comments",
            description="Add any additional comments here. E.g.: edge cases, issues with the interface etc.",
            use_markdown=True,
            required=False,
        ),
    ],
    metadata=[
        rg.TermsMetadataProperty(
            name="source-dataset",
            title="Original dataset source",
        ),
    ],
    allow_extra_metadata=False,
)

if reuse_existing_dataset:
    dataset_argilla = client.datasets(dataset_argilla_name, workspace=workspace_name)
else:
    dataset_argilla = rg.Dataset(
        name=dataset_argilla_name,
        settings=settings,
        workspace=workspace_name,
    )
    if client.datasets(dataset_argilla_name, workspace=workspace_name) is not None:
        client.datasets(dataset_argilla_name, workspace=workspace_name).delete()
    dataset_argilla = dataset_argilla.create()

dataset_argilla

After running the code above, you will see the new custom code-llm dataset in Argilla (alongside any other datasets you might have created before).

Load the data to Argilla

At this point, the dataset is still empty. Let's load some data with the code below.

# Iterate over the samples in the dataset
records = [
    rg.Record(
        fields={
            "instruction": example["instructions"],
            "generation_1": example["response_model_1"],
            "generation_2": example["response_model_2"],
        },
        metadata={
            "source-dataset": "bigcode/self-oss-instruct-sc2-exec-filter-50k",
        },
        # Optional: add suggestions from an LLM-as-a-judge system
        # They will be indicated with a sparkle icon and shown as pre-filled responses
        # It will speed up manual annotation
        # suggestions=[
        #     rg.Suggestion(
        #         question_name="score_response_1",
        #         value=example["llm_judge_rating"],
        #         agent="llama-3-70b-instruct",
        #     ),
        # ],
    )
    for example in dataset
]

try:
    dataset_argilla.records.log(records)
except Exception as e:
    print("Exception:", e)

The Argilla annotation UI will look similar to this:

Argilla UI

Annotate

That's it, we've created our Argilla dataset and we can now start annotating in the UI! By default, records are marked as completed after one annotation. Check these guides to learn how to automatically distribute the annotation task and how to annotate in Argilla.

Important: If you use Argilla in an HF Space, make sure to activate persistent storage so that your data is safely stored and not automatically deleted after a while. For production settings, make sure that persistent storage is activated before making any annotations to avoid data loss.
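
If you want each record to be annotated by more than one person before it is marked as completed, recent Argilla versions let you configure this through task distribution settings. A minimal sketch, assuming your Argilla SDK (~2.x) exposes rg.TaskDistribution (check the Argilla docs for your version):

# Minimal sketch: require 2 submitted annotations per record before it is marked completed.
# Assumption: rg.TaskDistribution is available in your Argilla SDK version; see the Argilla docs.
settings_multi_annotator = rg.Settings(
    guidelines=annotator_guidelines,
    fields=settings.fields,  # reuse the fields defined above
    questions=settings.questions,  # reuse the questions defined above
    distribution=rg.TaskDistribution(min_submitted=2),
)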

Download annotated data

After annotating, you can pull the data from Argilla and store and process it locally in any tabular format (see the docs). You can also download a filtered version of the dataset (docs).

annotated_dataset = client.datasets(dataset_argilla_name, workspace=workspace_name)

hf_dataset = annotated_dataset.records.to_datasets()

# This HF dataset can then be formatted, stored and processed into any tabular data format
hf_dataset.to_pandas()
# Store the dataset locally
hf_dataset.to_csv("argilla-dataset-local.csv")  # Save as CSV
# hf_dataset.to_json("argilla-dataset-local.json")  # Save as JSON
# hf_dataset.save_to_disk("argilla-dataset-local")  # Save as a `datasets.Dataset` in the local filesystem
# hf_dataset.to_parquet()  # Save as Parquet
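
For example, to download only records that already have at least one submitted response, you can pass a query with a status filter when iterating over the records. A minimal sketch, assuming the Argilla 2.x query/filter API (rg.Query and rg.Filter) and its "response.status" filter field:

# Minimal sketch: download a filtered version of the dataset with only submitted responses.
# Assumption: the Argilla 2.x filter syntax ("response.status", "==", "submitted"); see the Argilla docs.
submitted_filter = rg.Query(filter=rg.Filter(("response.status", "==", "submitted")))
submitted_records = annotated_dataset.records(query=submitted_filter).to_list(flatten=True)
print(f"Number of records with submitted responses: {len(submitted_records)}")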

Next steps

That's it! You've created synthetic LLM data with the HF Inference API, created a dataset in Argilla, uploaded the LLM data to Argilla, evaluated/corrected the data, and downloaded the annotated data in a simple tabular format for downstream use.

We have specifically designed this pipeline and the interface with two main use cases in mind:

  1. Evaluation: You can now simply use the numeric scores in the score_response_1 and score_response_2 columns to calculate which model performed better overall. You can also inspect responses with very low or very high ratings for a detailed error analysis. As you test or train different models, you can reuse this pipeline and track improvements of different models over time. A simple aggregation sketch follows after this list.
  2. Training: Once you have annotated enough data, you can create a train-test split and fine-tune your own model. You can either use highly rated response texts for supervised fine-tuning with the TRL SFTTrainer, or you can directly use the ratings for preference fine-tuning techniques like DPO with the TRL DPOTrainer. See the TRL docs for the pros and cons of different LLM fine-tuning techniques.
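
For the evaluation use case, a simple way to aggregate the annotations is to average the scores per model. The sketch below assumes the exported data has been flattened so that the scores are available as plain numeric columns named score_response_1 and score_response_2; the exact column names in your export may differ.

# Illustrative aggregation sketch: compare the two models by their average annotated scores.
# Assumption: the flattened export contains numeric columns "score_response_1" and "score_response_2";
# adjust the column names to match your actual export.
df = hf_dataset.to_pandas()

mean_score_model_1 = df["score_response_1"].mean()
mean_score_model_2 = df["score_response_2"].mean()
print(f"Average score for {models_to_compare[0]}: {mean_score_model_1:.2f}")
print(f"Average score for {models_to_compare[1]}: {mean_score_model_2:.2f}")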

Adapt and improve: Many things could be improved to tailor this pipeline to your specific use case. For example, you could prompt an LLM to evaluate the outputs of the two LLMs with instructions very similar to the guidelines for human annotators (an "LLM-as-a-judge" approach). This can help further speed up your evaluation pipeline. See our LLM-as-a-judge recipe for an example implementation and our overall Open-Source AI Cookbook for many other ideas.
