Run Argilla with a Transformer in an active learning loop and a free GPU in your browser#
In this tutorial, you will learn how to set up a complete active learning loop in Google Colab with a GPU as the backend. It is based on the small-text active learning tutorial; the main difference is that this version is designed to run in a Google Colab notebook with a GPU for a more efficient active learning loop with Transformer models. It is recommended to follow this tutorial directly on Google Colab. You can open the Colab notebook via this hyperlink, create your own copy and modify it for your own use cases.
⚠️ Note that this notebook requires manual input to start Argilla in a terminal and to input an ngrok token. Please read the instructions for each cell. If you do not follow the instructions and execute everything in the correct order, the code will fail. If you face an error, restarting your runtime can resolve several issues. ⚠️
The notebook was contributed by Moritz Laurer.
Initial setup on Google Colab#
In the Colab interface, you can choose a CPU (for initial testing) or a GPU (for an efficient active learning loop) by clicking Runtime > Change runtime type > Hardware accelerator in the menu in the top left. Once you have chosen your hardware, install the required packages.
[ ]:
%pip install "argilla[server, listeners]==1.16.0"
%pip install "transformers[sentencepiece]~=4.25.1"
%pip install "datasets~=2.7.1"
%pip install "small-text[transformers]~=1.3.2"
%pip install "colab-xterm~=0.1.2"
%pip install "pyngrok~=5.2.1"
%pip install "colab-xterm~=0.1.2"
[ ]:
# info on the hardware you are using - either a CPU or GPU
!nvidia-smi
# info on available ram
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('\n\nYour runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))
Enable Telemetry#
We gain valuable insights from how you interact with our tutorials. To help us offer you the most suitable content, running the following lines of code lets us know that this tutorial is serving you effectively. The data is entirely anonymous, and you can skip this step if you prefer. For more info, please check out the Telemetry page.
[ ]:
try:
    from argilla.utils.telemetry import tutorial_running
    tutorial_running()
except ImportError:
    print("Telemetry is introduced in Argilla 1.20.0 and not found in the current installation. Skipping telemetry.")
Install Elasticsearch#
Elasticsearch is a requirement for using Argilla. The Docker installation of Elasticsearch recommended by Argilla does not work in Google Colab, as Colab does not support Docker. Elasticsearch therefore needs to be installed "manually" with the following code.
[ ]:
%%bash
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.2-linux-x86_64.tar.gz -q
tar -xzf elasticsearch-7.10.2-linux-x86_64.tar.gz
chown -R daemon:daemon elasticsearch-7.10.2
[ ]:
%%bash --bg
sudo -u daemon -- elasticsearch-7.10.2/bin/elasticsearch
[ ]:
import time
time.sleep(30)  # sleeping to give Elasticsearch time to start up; otherwise downstream code will fail
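As an alternative to the fixed sleep, the following optional cell polls the Elasticsearch endpoint until it responds. This is a minimal sketch, assuming Elasticsearch runs on its default port 9200.
[ ]:
# optional alternative to the fixed sleep: poll Elasticsearch until it responds
import time
import requests

for _ in range(30):
    try:
        if requests.get("http://localhost:9200").status_code == 200:
            print("Elasticsearch is up")
            break
    except requests.exceptions.ConnectionError:
        pass
    time.sleep(2)
else:
    print("Elasticsearch did not respond yet - wait a bit longer before continuing")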
Start the Argilla localhost in a terminal#
You now need to start the Argilla server on localhost in a separate terminal. We cannot simply run !argilla server start
in a code cell on Colab, because the cell would run indefinitely and block us from running other cells. We therefore need to open a separate terminal to run Argilla.
Option with Colab Pro: Open the Colab Pro terminal (button to the bottom left) and type in the terminal:
argilla server start
Option without Colab Pro: Run the following code cell to get a free terminal window with xterm directly below the code cell. Then type
argilla server start
in the terminal window.
[ ]:
# create a terminal to run Argilla with, in case you don't have Colab Pro.
# type "argilla server start" into the terminal that appears below this code cell.
%load_ext colabxterm
%xterm
The terminal window above should now display something like:
… INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:6900 (Press CTRL+C to quit)
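Before creating the public link, you can optionally verify from a code cell that the server answers on port 6900. This is a minimal sketch using requests; any response at all means Uvicorn is up.
[ ]:
# optional: check from a code cell that the Argilla server answers on port 6900
import requests

try:
    response = requests.get("http://localhost:6900")
    print(f"Argilla server responded with status code {response.status_code}")
except requests.exceptions.ConnectionError:
    print("Argilla server is not reachable yet - check the terminal above")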
Create a public link to Argilla localhost with ngrok#
We now have a virtual machine from Google running Argilla on localhost, but we cannot access it yet. ngrok is a service designed to create public links to a localhost, so we can use it to create a public link to the Argilla instance running on the Google machine. Note that anyone with this (temporary) public link can access the (temporary) localhost. In order to use ngrok, you need to create a free account, which only takes a minute following the instructions here. With the free account, you receive an access token. Once you have your access token, run the following cell and copy the token into the input prompt.
[ ]:
import getpass
from pyngrok import ngrok, conf
print("Enter your authtoken, which can be copied from https://dashboard.ngrok.com/auth")
print("You need to create a free ngrok account to get an authtoken. The token looks something like this: ASDO1283YZaDu95vysXYIUXZXYRR_54YfASDIb8cpNfVoz349587")
conf.get_default().auth_token = getpass.getpass()
# if the above does not work, you can try:
# ngrok.set_auth_token("<INSERT_YOUR_NGROK_AUTHTOKEN>")
[ ]:
# disconnect all existing tunnels to avoid issues when rerunning cells
[ngrok.disconnect(tunnel.public_url) for tunnel in ngrok.get_tunnels()]
# create the public link
# ! check whether this is actually the localhost port Argilla is running on via the terminal above
ngrok_tunnel = ngrok.connect(6900) # insert the port number Argilla is running on. e.g. 6900 if the terminal displays something like "Uvicorn running on http://0.0.0.0:6900"
print("You can now access the Argilla localhost with the public link below. (It should look something like 'http://X03b-34-XXX-237-25.ngrok.io')\n")
print(f"Your ngrok public link: {ngrok_tunnel}\n")
print("After clicking on the link, there will be a warning, which you can ignore")
print("You can then login with the default argilla username 'argilla' and password '1234'")
Log data to Argilla and start your active learning loop with small-text#
If you click on your public link above, you should be able to access Argilla, but there is no data logged to Argilla yet. The following code downloads an example dataset and logs it to Argilla. You can change the following code to download any other dataset you want to annotate. The following code follows the active learning with small-text tutorial and therefore contains fewer explanations.
[ ]:
# load dataset
import datasets
dataset_name = "trec"
dataset_hf = datasets.load_dataset(dataset_name, version=datasets.Version("2.0.0"))
# we work with only a sixth of the texts of the dataset for faster testing
dataset_hf["train"] = dataset_hf["train"].shard(num_shards=6, index=0)
[ ]:
## choose the transformer and load tokenizer
import torch
from transformers import AutoTokenizer
# Choose transformer model: In non-gpu environments we use a tiny model to increase efficiency
if not torch.cuda.is_available():
    transformer_model = "prajjwal1/bert-tiny"
    print(f"No GPU is available, we therefore use the small model '{transformer_model}' for the active learning loop.\n")
else:
    transformer_model = "microsoft/deberta-v3-xsmall"  # "bert-base-uncased"
    print(f"A GPU is available, we can therefore use '{transformer_model}' for the active learning loop.\n")
# Init tokenizer
tokenizer = AutoTokenizer.from_pretrained(transformer_model)
[ ]:
## create small_text transformersdataset object
import numpy as np
from small_text import TransformersDataset
num_classes = dataset_hf["train"].features["coarse_label"].num_classes
target_labels = np.arange(num_classes)
train_text = [row["text"] for row in dataset_hf["train"]]
train_labels = np.array([row["coarse_label"] for row in dataset_hf["train"]])
# Create the dataset for small-text
dataset_st = TransformersDataset.from_arrays(
    train_text, train_labels, tokenizer, target_labels=target_labels
)
# Create test dataset
test_text = [row["text"] for row in dataset_hf["test"]]
test_labels = np.array([row["coarse_label"] for row in dataset_hf["test"]])
dataset_test = TransformersDataset.from_arrays(
    test_text, test_labels, tokenizer, target_labels=np.arange(num_classes)
)
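As a quick optional sanity check, you can print the sizes of the training pool and the test set created above. This is a small sketch, not part of the original tutorial.
[ ]:
# optional sanity check: size of the training pool and the test set
print(f"Training pool size: {len(dataset_st)}, test set size: {len(dataset_test)}")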
[ ]:
## setting up the active learner
from small_text import (
    BreakingTies,
    PoolBasedActiveLearner,
    TransformerBasedClassificationFactory,
    TransformerModelArguments,
)
# Define our classifier
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device: ", device)
num_epochs = 5 # higher values of around 40 will probably improve performance on small datasets, but the active learning loop will take longer
clf_factory = TransformerBasedClassificationFactory(
    TransformerModelArguments(transformer_model),
    num_classes=num_classes,
    kwargs={"device": device, "num_epochs": num_epochs, "lr": 2e-05,
            "mini_batch_size": 8, "early_stopping_no_improvement": 5},
)
# Define our query strategy
query_strategy = BreakingTies()
# Use the active learner with a pool containing all unlabeled data
active_learner = PoolBasedActiveLearner(clf_factory, query_strategy, dataset_st)
[ ]:
## draw an initial sample for the first annotation round
# https://small-text.readthedocs.io/en/v1.1.1/components/initialization.html
from small_text import random_initialization, random_initialization_stratified, random_initialization_balanced
import numpy as np
# Fix seed for reproducibility
np.random.seed(42)
# Number of samples in our queried batches
NUM_SAMPLES = 10
# Draw an initial subset from the data pool
#initial_indices = random_initialization(dataset_st, NUM_SAMPLES)
#initial_indices = random_initialization_balanced(train_labels, NUM_SAMPLES)
initial_indices = random_initialization_stratified(train_labels, NUM_SAMPLES)
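Optionally, you can inspect the label distribution of the initial sample to check that the stratified draw behaves as expected. This is a quick sanity check, not part of the original loop.
[ ]:
# optional: label distribution of the initial stratified sample
from collections import Counter

print(Counter(train_labels[initial_indices].tolist()))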
[ ]:
### log the first data to Argilla
import argilla as rg
# Choose a name for the dataset
DATASET_NAME = f"{dataset_name}-with-active-learning"
# Define labeling schema
labels = dataset_hf["train"].features["coarse_label"].names
settings = rg.TextClassificationSettings(label_schema=labels)
# Create dataset with a label schema
rg.configure_dataset_settings(name=DATASET_NAME, settings=settings)
# Create records from the initial batch
records = [
    rg.TextClassificationRecord(
        text=dataset_hf["train"]["text"][idx],
        metadata={"batch_id": 0},
        id=idx.item(),
    )
    for idx in initial_indices
]
# Log initial records to Argilla
rg.log(records, DATASET_NAME)
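To confirm that the initial records arrived, you can optionally load the dataset back from Argilla and check its size. This is not required for the loop to work.
[ ]:
# optional: confirm the initial records were logged to Argilla
print(f"Records currently in '{DATASET_NAME}': {len(rg.load(DATASET_NAME))}")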
[ ]:
### create active learning loop
from argilla.listeners import listener
from sklearn.metrics import accuracy_score
# Define some helper variables
LABEL2INT = dataset_hf["train"].features["coarse_label"].str2int
ACCURACIES = []
# Set up the active learning loop with the listener decorator
@listener(
    dataset=DATASET_NAME,
    query="status:Validated AND metadata.batch_id:{batch_id}",
    condition=lambda search: search.total == NUM_SAMPLES,
    execution_interval_in_seconds=3,
    batch_id=0,
)
def active_learning_loop(records, ctx):
    # 1. Update active learner
    print(f"Updating with batch_id {ctx.query_params['batch_id']} ...")
    y = np.array([LABEL2INT(rec.annotation) for rec in records])

    # initial update
    if ctx.query_params["batch_id"] == 0:
        indices = np.array([rec.id for rec in records])
        active_learner.initialize_data(indices, y)
    # update with the prior queried indices
    else:
        active_learner.update(y)
    print("Done!")

    # 2. Query active learner
    print("Querying new data points ...")
    queried_indices = active_learner.query(num_samples=NUM_SAMPLES)
    new_batch = ctx.query_params["batch_id"] + 1
    new_records = [
        rg.TextClassificationRecord(
            text=dataset_hf["train"]["text"][idx],
            metadata={"batch_id": new_batch},
            id=idx.item(),
        )
        for idx in queried_indices
    ]

    # 3. Log the batch to Argilla
    rg.log(new_records, DATASET_NAME)

    # 4. Evaluate current classifier on the test set
    print("Evaluating current classifier ...")
    accuracy = accuracy_score(
        dataset_test.y,
        active_learner.classifier.predict(dataset_test),
    )
    ACCURACIES.append(accuracy)
    ctx.query_params["batch_id"] = new_batch
    print("Done!")

    print("Waiting for annotations ...")
active_learning_loop.start()
Start annotating in the browser via the ngrok link#
[ ]:
print(f"You can now start annotating with active learning in the background!")
print(f"The public link for accessing the annotation interface is: {ngrok_tunnel}")
You can now start annotating with active learning in the background!
The public link for accessing the annotation interface is: NgrokTunnel: "http://30b0-34-124-178-185.ngrok.io" -> "http://localhost:6900"
After each round of 10 newly annotated texts, the active learner is retrained and recommends a new batch of 10 texts. You therefore need to manually annotate exactly 10 texts before a new batch appears.
⚠️ Note that it will take a while until the active learner has been retrained and has analyzed all remaining data to recommend new records; this will probably take several minutes. Refresh the Argilla window after a few minutes and a new batch of 10 texts should automatically appear in the interface. If it does not appear immediately, double-check that you really annotated all 10 new texts and wait a bit longer. ⚠️
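While you wait, you can check the test accuracies recorded so far from a separate code cell; the list stays empty until the first retraining has finished.
[ ]:
# check the test accuracies recorded so far (empty until the first retraining has finished)
print(ACCURACIES)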
[ ]:
# when you are done, stop active learning loop
active_learning_loop.stop()
[ ]:
# plot learning progress over different active learning iterations
import pandas as pd
pd.Series(ACCURACIES).plot(xlabel="Iteration", ylabel="Accuracy")
Extract annotated data for downstream use#
[ ]:
## https://docs.v1.argilla.io/en/latest/getting_started/quickstart.html#Manual-extraction
# load your annotations
dataset_annotated = rg.load(DATASET_NAME)
# convert to Hugging Face format
dataset_annotated = dataset_annotated.prepare_for_training()
# now you can write your annotations to .csv, use them for training etc.
df_annotations = pd.DataFrame(dataset_annotated)
df_annotations.head()
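For example, to persist the annotations for later use, you could write the dataframe to disk; the filename below is just an example.
[ ]:
# write the annotated data to a .csv file for downstream use (the filename is just an example)
df_annotations.to_csv("trec_annotations.csv", index=False)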
Summary#
In this tutorial, we saw how you can embed Argilla in an active learning loop on a free GPU in Google Colab. We relied on small-text to use a Hugging Face Transformer within an active learning setup. In the end, we gathered a sample-efficient dataset by annotating only the most informative records for the model.
Argilla makes it very easy to use a dedicated annotation team or subject matter experts as an oracle for your active learning system. They only interact with the Argilla UI and do not have to worry about training or querying the system. We encourage you to try out active learning in your next project and make your life and your annotators' lives a little easier.