Load your dataset to Argilla
Depending on the NLP task that you’re working with and the specific use case or application, your data and the annotation task will look different. For this section of the course, we’ll use a dataset of news articles to complete two tasks: a text classification on the topic of each text and a token classification to identify the named entities mentioned.
It is possible to import datasets from the Hub using the Argilla UI directly, but we’ll be using the SDK to learn how we can make further edits to the data if needed.
Configure your dataset
The first step is to connect to our Argilla instance as we did in the previous section:
import argilla as rg
HF_TOKEN = "..." # only for private spaces
client = rg.Argilla(
    api_url="...",
    api_key="...",
    headers={"Authorization": f"Bearer {HF_TOKEN}"},  # only for private spaces
)
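Hard-coding credentials in a notebook makes them easy to leak. As a minimal sketch, they could be read from environment variables instead. The variable names `ARGILLA_API_URL`, `ARGILLA_API_KEY` and `HF_TOKEN`, and the helper `argilla_kwargs`, are assumptions made for this example and do not come from the course:

```python
import os


def argilla_kwargs(env=None):
    """Build keyword arguments for rg.Argilla from environment variables.

    The variable names are assumptions for this sketch, not part of the course.
    """
    env = os.environ if env is None else env
    kwargs = {
        "api_url": env["ARGILLA_API_URL"],
        "api_key": env["ARGILLA_API_KEY"],
    }
    # The Authorization header is only needed for private Spaces.
    token = env.get("HF_TOKEN")
    if token:
        kwargs["headers"] = {"Authorization": f"Bearer {token}"}
    return kwargs


# Usage (requires the variables to be set and argilla to be installed):
# import argilla as rg
# client = rg.Argilla(**argilla_kwargs())
```

Passing a plain dictionary as `env` makes the helper easy to try out without touching the real environment.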
We can now think about the settings of our dataset in Argilla. These represent the annotation task we’ll do over our data. First, we can load the dataset from the Hub and inspect its features, so that we can make sure that we configure the dataset correctly.
from datasets import load_dataset
data = load_dataset("SetFit/ag_news", split="train")
data.features
These are the features of our dataset:
{'text': Value(dtype='string', id=None),
'label': Value(dtype='int64', id=None),
'label_text': Value(dtype='string', id=None)}
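Since label and label_text appear to carry the same information in two forms, it can be worth checking that they agree one-to-one before relying on either of them. The sketch below does this on a few made-up rows (they are not taken from the dataset); the helper name `label_names` is hypothetical. Because iterating over a loaded dataset yields one dictionary per row, the same function should also accept `data` directly.

```python
def label_names(rows):
    """Return {label id: label name}, raising if the two columns disagree."""
    id_to_name = {}
    for row in rows:
        name = id_to_name.setdefault(row["label"], row["label_text"])
        if name != row["label_text"]:
            raise ValueError(
                f"label {row['label']} maps to both {name!r} and {row['label_text']!r}"
            )
    # Two different ids must not share one name either.
    if len(set(id_to_name.values())) != len(id_to_name):
        raise ValueError("two label ids share the same name")
    return id_to_name


# Made-up rows with the same three columns as the dataset.
sample_rows = [
    {"text": "Stocks rally after earnings report", "label": 2, "label_text": "Business"},
    {"text": "Home team wins the final", "label": 1, "label_text": "Sports"},
    {"text": "Markets close higher", "label": 2, "label_text": "Business"},
]

names = label_names(sample_rows)
print(sorted(names.items()))
```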
It contains a text and also some initial labels for the text classification. We’ll add those to our dataset settings together with a spans question for the named entities:
settings = rg.Settings(
    fields=[rg.TextField(name="text")],
    questions=[
        rg.LabelQuestion(
            name="label", title="Classify the text:", labels=data.unique("label_text")
        ),
        rg.SpanQuestion(
            name="entities",
            title="Highlight all the entities in the text:",
            labels=["PERSON", "ORG", "LOC", "EVENT"],
            field="text",
        ),
    ],
)
Let’s dive a bit deeper into what these settings mean. First, we’ve defined fields, which hold the information that we’ll be annotating. In this case, we only have one field and it comes in the form of text, so we’ve chosen a TextField.
Then, we define questions that represent the tasks that we want to perform on our data:
- For the text classification task, we’ve chosen a LabelQuestion and used the unique values of the label_text column as our labels, to make sure that the question is compatible with the labels that already exist in the dataset.
- For the token classification task, we’ll need a SpanQuestion. We’ve defined a set of labels that we’ll be using for that task, plus the field on which we’ll be drawing the spans.
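A span answer is tied to character positions in the field it is drawn on. The sketch below illustrates that idea in plain Python, assuming a span is described by a start offset, an exclusive end offset and a label. This dictionary layout is an assumption for illustration, so check the Argilla docs for the exact format; the helper `find_spans` and the example sentence are made up.

```python
def find_spans(text, entities):
    """Locate each (surface form, label) pair in text and return span dicts.

    Only the first occurrence of each surface form is used; entities that do
    not appear in the text are skipped.
    """
    spans = []
    for surface, label in entities:
        start = text.find(surface)
        if start == -1:
            continue
        spans.append({"start": start, "end": start + len(surface), "label": label})
    return spans


text = "Apple opened a new office in Paris."
spans = find_spans(text, [("Apple", "ORG"), ("Paris", "LOC"), ("Berlin", "LOC")])
for span in spans:
    # Slicing the text with the offsets recovers the highlighted entity.
    print(span["label"], repr(text[span["start"]:span["end"]]))
```

Keeping offsets rather than the entity strings themselves means that the same word can be labeled at one position and left unlabeled at another.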
To learn more about all the available types of fields and questions and other advanced settings, like metadata and vectors, go to the Argilla docs.
Upload the dataset
Now that we’ve defined some settings, we can create the dataset:
dataset = rg.Dataset(name="ag_news", settings=settings)
dataset.create()
The dataset now appears in our Argilla instance, but you will see that it’s empty:
Now we need to add the records that we’ll be annotating, i.e., the rows in our dataset. To do that, we simply need to log the data as records and provide a mapping for those elements that don’t have the same name in the Hub and Argilla datasets:
dataset.records.log(data, mapping={"label_text": "label"})
In our mapping, we’ve specified that the label_text column in the dataset should be mapped to the question with the name label. In this way, we’ll use the existing labels in the dataset as pre-annotations so we can annotate faster.
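The effect of the mapping can also be made explicit by building the rows yourself, so that every key already matches a field or question name. This is only a sketch: the helper `to_record_dicts` and the sample row are made up, and it assumes that records can be logged as a list of dictionaries keyed by field and question names.

```python
def to_record_dicts(rows):
    """Keep only the columns Argilla needs, renamed to the field and question names."""
    return [{"text": row["text"], "label": row["label_text"]} for row in rows]


# Made-up row with the same three columns as the dataset.
sample_rows = [
    {"text": "Home team wins the final", "label": 1, "label_text": "Sports"},
]
records = to_record_dicts(sample_rows)
print(records)

# With a live dataset, these could then be logged without a mapping (assumption):
# dataset.records.log(to_record_dicts(data))
```

The mapping argument is shorter for a single renamed column; building the dictionaries by hand becomes useful when the rows need further edits before logging.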
While the records continue to log, you can already start working with your dataset in the Argilla UI. At this point, it should look like this:
Now our dataset is ready to start annotating!