📥 Export a dataset#
Your Argilla instance always keeps your datasets and annotations saved and accessible. However, if you'd like to save a dataset locally or to the Hugging Face Hub, this section covers the methods for doing so.
Feedback Dataset#
Note
The dataset class covered in this section is the FeedbackDataset. This fully configurable dataset will replace the DatasetForTextClassification, DatasetForTokenClassification, and DatasetForText2Text in Argilla 2.0. Not sure which dataset to use? Check out our section on choosing a dataset.
Pull from Argilla#
The first step is to pull a dataset from Argilla with the FeedbackDataset.from_argilla() method. This method returns a new instance of FeedbackDataset with the same guidelines, fields, questions, and records (including responses, if any) as the dataset in Argilla.
Note
From Argilla 1.14.0, calling from_argilla will pull the FeedbackDataset from Argilla, but the instance will be remote, which means that additions, updates, and deletions of records are pushed to Argilla as soon as they are made. This is a change from previous versions of Argilla, where you had to call push_to_argilla again to push the changes to Argilla.
import argilla as rg
remote_dataset = rg.FeedbackDataset.from_argilla("my-dataset", workspace="my-workspace")
local_dataset = remote_dataset.pull(max_records=100) # get first 100 records
If your dataset includes vectors, by default these will not get pulled with the rest of the dataset in order to improve performance. If you would like to pull the vectors in your records, you will need to specify it like so:
# Option 1: pull every vector stored with the records
remote_dataset = rg.FeedbackDataset.from_argilla(
    name="my-dataset",
    workspace="my-workspace",
    with_vectors="all"
)
# Option 2: pull only the vectors listed by name
remote_dataset = rg.FeedbackDataset.from_argilla(
    name="my-dataset",
    workspace="my-workspace",
    with_vectors=["my_vectors"]
)
At this point, you can do any post-processing you may need with this dataset, e.g., unifying responses from multiple annotators. Once you're happy with the result, you can choose among the following options to save it.
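As an illustration of the kind of post-processing mentioned above, the following is a minimal sketch of unifying responses by majority vote. It works on plain Python dictionaries, not on Argilla objects: the records structure, the majority_vote helper, and the field names are hypothetical and only show the idea.

```python
from collections import Counter

# Hypothetical, simplified records: each one carries the answers that
# different annotators gave. This is NOT Argilla's record schema.
records = [
    {"text": "Great product", "responses": ["positive", "positive", "negative"]},
    {"text": "Not for me", "responses": ["negative", "negative"]},
    {"text": "No answers yet", "responses": []},
]


def majority_vote(responses):
    """Return the most frequent response, or None when there are none."""
    if not responses:
        return None
    # most_common(1) returns [(value, count)]; on a tie, the value that
    # was encountered first wins.
    return Counter(responses).most_common(1)[0][0]


# One unified label per record
unified = [
    {"text": rec["text"], "label": majority_vote(rec["responses"])}
    for rec in records
]
```

Majority voting is only one possible strategy; depending on the task, you might prefer to keep disagreements visible or to weight annotators differently.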
Push back to Argilla#
When using a FeedbackDataset pulled from Argilla via FeedbackDataset.from_argilla, you can always push the dataset back to Argilla, for example to clone it or to explore it after post-processing.
# This publishes the dataset with its records to Argilla and returns the dataset in Argilla
remote_dataset = dataset.push_to_argilla(name="my-dataset", workspace="my-workspace")
local_dataset = remote_dataset.pull(max_records=100) # get first 100 records
# This publishes the dataset with its records to Argilla and turns the dataset object into a dataset in Argilla
dataset.push_to_argilla(name="my-dataset", workspace="my-workspace")
Additionally, you can still clone local FeedbackDataset datasets that have neither been pushed to nor pulled from Argilla by calling push_to_argilla.
dataset.push_to_argilla(name="my-dataset-clone", workspace="my-workspace")
Push to the Hugging Face Hub#
It is also possible to save a FeedbackDataset to the Hugging Face Hub for persistence and load it back later. The methods push_to_huggingface and from_huggingface allow you to push to or pull from the Hugging Face Hub, respectively.
When pushing a FeedbackDataset to the Hugging Face Hub, you can pass the generate_card parameter to control whether a Dataset Card is generated and pushed too. generate_card is True by default, so the card is always generated unless generate_card=False is specified.
# Push to HuggingFace Hub
dataset.push_to_huggingface("argilla/my-dataset")
# Push to HuggingFace Hub as private
dataset.push_to_huggingface("argilla/my-dataset", private=True, token="...")
Note that the FeedbackDataset.push_to_huggingface() method uploads not just the dataset records but also a configuration file named argilla.yaml, which contains the dataset configuration, i.e., the fields, questions, and guidelines, if any. This way you can load any FeedbackDataset that has been pushed to the Hub back into Argilla using the from_huggingface method. Take a look at all public Argilla-compatible datasets on the Hugging Face Hub.
# Load a public dataset
dataset = rg.FeedbackDataset.from_huggingface("argilla/my-dataset")
# Load a private dataset
dataset = rg.FeedbackDataset.from_huggingface("argilla/my-dataset", use_auth_token=True)
Save to disk#
Additionally, due to the integration with 🤗 Datasets, you can also export the records of a FeedbackDataset locally in your preferred format by first converting the FeedbackDataset to a datasets.Dataset using the method format_as("datasets"). Then, you can export the datasets.Dataset to CSV, JSON, Parquet, etc. Check all the options in the 🤗 Datasets documentation.
hf_dataset = dataset.format_as("datasets")
hf_dataset.save_to_disk("sharegpt-prompt-rating-mini") # Save as a `datasets.Dataset` in the local filesystem
hf_dataset.to_csv("sharegpt-prompt-rating-mini.csv") # Save as CSV
hf_dataset.to_json("sharegpt-prompt-rating-mini.json") # Save as JSON
hf_dataset.to_parquet("sharegpt-prompt-rating-mini.parquet") # Save as Parquet
Note
This workaround will just export the records into the desired format, not the dataset configuration. If you want to load the records back into Argilla, you will need to create a FeedbackDataset and add the records as explained in the corresponding guides.
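Because only the records are exported, reading them back is plain file handling. The sketch below assumes the file was written as JSON Lines (one JSON object per line), which is the default output of datasets.Dataset.to_json as far as we know. The read_jsonl helper, the sample file, and its column names are hypothetical stand-ins for a real export.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Stand-in for an exported file; the column names are hypothetical.
sample_path = Path(tempfile.mkdtemp()) / "export.json"
sample_path.write_text(
    '{"prompt": "Hello", "rating": 4}\n{"prompt": "Bye", "rating": 2}\n',
    encoding="utf-8",
)

# Each row is a plain dict that can be turned into a record again
rows = read_jsonl(sample_path)
```

The resulting rows carry no dataset configuration, so the fields, questions, and guidelines still have to be defined again on a new FeedbackDataset before the records are added.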
Other datasets#
Note
The records classes covered in this section correspond to three datasets: DatasetForTextClassification, DatasetForTokenClassification, and DatasetForText2Text. These will be deprecated in Argilla 2.0 and replaced by the fully configurable FeedbackDataset class. Not sure which dataset to use? Check out our section on choosing a dataset.
Pull from Argilla#
You can simply load the dataset from Argilla using the rg.load() function.
import argilla as rg
# load your annotated dataset from the Argilla web app
dataset_rg = rg.load("my_dataset")
For convenience, Argilla offers conversions to a Hugging Face Dataset and a pandas DataFrame.
# export your Argilla Dataset to a datasets Dataset
dataset_ds = dataset_rg.to_datasets()
# export to a pandas DataFrame
df = dataset_rg.to_pandas()
Push back to Argilla#
When using other datasets pulled from Argilla via rg.load, you can always push the dataset back to Argilla. This can be done using the rg.log() function, just like you did when pushing records to Argilla for the first time. Records that don't already exist in the dataset will be added to it; existing records will be updated.
import argilla as rg
dataset_rg = rg.load("my_dataset")
# loop through the records and change them
rg.log(dataset_rg, name="my_dataset")
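The add-or-update behavior described above can be pictured as an upsert. The following is a conceptual sketch on plain dictionaries, not Argilla's implementation: the upsert helper is hypothetical, and matching records on an "id" key is an assumption made for the illustration.

```python
def upsert(existing, incoming):
    """Merge incoming records into existing ones, matching on "id"."""
    # Index the stored records by id (assumption: ids identify records)
    merged = {rec["id"]: rec for rec in existing}
    for rec in incoming:
        # Replaces the stored record when the id matches, adds it otherwise
        merged[rec["id"]] = rec
    return list(merged.values())


stored = [
    {"id": 1, "text": "a", "label": None},
    {"id": 2, "text": "b", "label": None},
]
changed = [
    {"id": 2, "text": "b", "label": "spam"},  # already stored: updated
    {"id": 3, "text": "c", "label": "ham"},   # new: added
]

result = upsert(stored, changed)
```

Here the record with id 1 is left untouched, the record with id 2 is replaced by its changed version, and the record with id 3 is appended.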
Push to the Hugging Face Hub#
You can push your dataset in the form of a Hugging Face Dataset directly to the Hub. Just use the to_datasets() transformation as explained in the previous section and push the dataset:
# push the dataset to the Hugging Face Hub
dataset_ds.push_to_hub("my_dataset")
Save to disk#
Your dataset will always be safe and accessible from Argilla, but in case you need to share or save it somewhere else, you can also save it locally. To do that, we recommend first formatting the dataset as a Hugging Face Dataset or pandas DataFrame and then using the methods native to these libraries to export it as CSV, JSON, Parquet, etc.
# save locally using Hugging Face datasets
import argilla as rg
# load your annotated dataset from the Argilla web app
dataset_rg = rg.load("my_dataset")
# export your Argilla Dataset to a datasets Dataset
dataset_ds = dataset_rg.to_datasets()
dataset_ds.save_to_disk("my_dataset") # Save as a `datasets.Dataset` in the local filesystem
dataset_ds.to_csv("my_dataset.csv") # Save as CSV
dataset_ds.to_json("my_dataset.json") # Save as JSON
dataset_ds.to_parquet("my_dataset.parquet") # Save as Parquet
# save locally using Pandas DataFrame
import argilla as rg
# load your annotated dataset from the Argilla web app
dataset_rg = rg.load("my_dataset")
# export your Argilla Dataset to a Pandas DataFrame
df = dataset_rg.to_pandas()
df.to_csv("my_dataset.csv") # Save as CSV
df.to_json("my_dataset.json") # Save as JSON
df.to_parquet("my_dataset.parquet") # Save as Parquet