NLP Course documentation

Use your annotated dataset



We will now learn how to export and use the data that we have annotated in Argilla.

Load the dataset

First, we’ll need to make sure that we’re connected to our Argilla instance as in the previous steps:

import argilla as rg

HF_TOKEN = "..."  # only for private spaces

client = rg.Argilla(
    api_url="...",
    api_key="...",
    headers={"Authorization": f"Bearer {HF_TOKEN}"},  # only for private spaces
)

And now, we’ll load the dataset that we’ll be working with:

dataset = client.datasets(name="ag_news")

Loading the dataset and accessing its records with dataset.records is enough to start using them in your own pipelines. However, we’ll also cover a few optional operations, like filtering the records and exporting your dataset to the Hugging Face Hub.
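As a rough illustration of using the records in your own pipeline, the sketch below pulls one field out of every record. It assumes each record exposes a dict-like `fields` attribute and that the dataset has a field named "text"; both the field name and the helper `collect_texts` are hypothetical and chosen only for this example. Stand-in records are used so the snippet runs without an Argilla server.

```python
from types import SimpleNamespace


def collect_texts(records, field_name="text"):
    """Return the value of one field for every record in an iterable.

    `field_name` is an assumption; use the field names defined in
    your own dataset settings.
    """
    return [record.fields[field_name] for record in records]


# Stand-in records (not real Argilla objects) so the sketch is runnable
# offline. With a live connection you would pass `dataset.records` instead.
demo_records = [
    SimpleNamespace(fields={"text": "Stocks rally as markets open"}),
    SimpleNamespace(fields={"text": "Local team wins the final"}),
]

texts = collect_texts(demo_records)
print(texts)
```

With a connected client, the same helper could be called as `collect_texts(dataset.records)`, provided your dataset really has a field with that name.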

Filter the dataset

Sometimes you only want to use the records that have been completed, so we will first filter the records in our dataset based on their status:

status_filter = rg.Query(filter=rg.Filter([("status", "==", "completed")]))

filtered_records = dataset.records(status_filter)

⚠️ Note that records with completed status (i.e., records that meet the minimum number of submitted responses configured in the task distribution settings) can have more than one response, and each response can have any of the statuses submitted, draft, or discarded.
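Because a completed record can still carry draft or discarded responses, you may want to keep only the submitted ones before using the annotations. The following is a minimal sketch, assuming that iterating over `record.responses` yields objects with `question_name`, `value`, and `status` attributes. The helper `submitted_values` and the question name "label" are hypothetical, and the records here are stand-ins so that the snippet runs without an Argilla server.

```python
from types import SimpleNamespace


def submitted_values(record, question_name):
    """Collect the values of submitted responses to one question."""
    values = []
    for response in record.responses:
        # The status may be a plain string or an enum member;
        # normalize it to a string before comparing.
        status = getattr(response.status, "value", response.status)
        if response.question_name == question_name and status == "submitted":
            values.append(response.value)
    return values


# Stand-in record (not a real Argilla object) with mixed response statuses.
demo_record = SimpleNamespace(
    responses=[
        SimpleNamespace(question_name="label", value="Sports", status="submitted"),
        SimpleNamespace(question_name="label", value="World", status="draft"),
        SimpleNamespace(question_name="label", value="Sports", status="submitted"),
        SimpleNamespace(question_name="label", value="Business", status="discarded"),
    ]
)

labels = submitted_values(demo_record, "label")
print(labels)
```

The same function could be applied to each record in `filtered_records`, assuming the response objects match the attributes described above.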

Learn more about querying and filtering records in the Argilla docs.

Export to the Hub

We can now export our annotations to the Hugging Face Hub, so we can share them with others. To do this, we’ll need to convert the records into a 🤗 Dataset and then push it to the Hub:

filtered_records.to_datasets().push_to_hub("argilla/ag_news_annotated")

Alternatively, we can directly export the complete Argilla dataset (including pending records) like this:

dataset.to_hub(repo_id="argilla/ag_news_annotated")

This is a convenient option when others want to open the dataset in their own Argilla instances: the settings are saved automatically, so they can import the full dataset with a single line of code:

dataset = rg.Dataset.from_hub(repo_id="argilla/ag_news_annotated")