NLP Course documentation

Annotate your dataset


Now it is time to start working from the Argilla UI to annotate our dataset.

Align your team with annotation guidelines

Before you start annotating your dataset, it is always good practice to write some guidelines, especially if you’re working as part of a team. This will help you align on the task and the use of the different labels, and resolve questions or conflicts when they come up.

In Argilla, you can go to your dataset settings page in the UI and modify the guidelines and the descriptions of your questions to help with alignment.

Screenshot of the Dataset Settings page in Argilla.

If you want to dive deeper into the topic of how to write good guidelines, we recommend reading this blog post and the bibliographical references mentioned there.

Distribute the task

In the dataset settings page, you can also change the dataset distribution settings. This will help you annotate more efficiently when you’re working as part of a team. The default value for the minimum submitted responses is 1, meaning that as soon as a record has 1 submitted response it will be considered complete and count towards the progress in your dataset.

Sometimes, you want to have more than one submitted response per record, for example, if you want to analyze the inter-annotator agreement in your task. In that case, make sure to change this setting to a higher number, but always smaller than or equal to the total number of annotators. If you’re working on the task alone, keep this setting at 1.
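To see why multiple responses per record are useful: once each record has a submitted response from two annotators, you can quantify how consistently they label. Below is a small, self-contained sketch (not part of the Argilla API) computing Cohen’s kappa from two made-up lists of submitted labels:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two annotators who labelled the same records."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of records where both annotators agree
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label distribution
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(
        (counts_a[label] / n) * (counts_b[label] / n) for label in counts_a
    )
    return (p_observed - p_expected) / (1 - p_expected)

# Made-up responses from two annotators on the same six records
ann_1 = ["positive", "negative", "positive", "positive", "negative", "positive"]
ann_2 = ["positive", "negative", "negative", "positive", "negative", "positive"]
kappa = cohens_kappa(ann_1, ann_2)  # ≈ 0.67, usually read as substantial agreement
```

A kappa close to 1 means the annotators agree far more than chance would predict; values near 0 suggest the guidelines need clarifying.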

Annotate records

💡 If you are deploying Argilla in a Hugging Face Space, your team members will be able to log in using Hugging Face OAuth. Otherwise, you may need to create users for them following this guide.

When you open your dataset, you will notice that the first question is already filled in with some suggested labels. That’s because in the previous section we mapped our question called label to the label_text column in the dataset, so we only need to review and correct the existing labels:

Screenshot of the dataset in Argilla.

For the token classification question, we’ll need to add all labels manually, as we didn’t include any suggestions. This is how it might look once the spans are annotated:

Screenshot of the dataset in Argilla with spans annotated.

As you move through the different records, there are different actions you can take:

  • submit your responses, once you’re done with the record.
  • save them as a draft, if you want to come back to them later.
  • discard them, if the record shouldn’t be part of the dataset or you don’t intend to respond to it.

In the next section, you will learn how you can export and use those annotations.
