Multi-label text classification with weak supervision#
In this tutorial we use Argilla and weak supervision to tackle two multi-label classification datasets:
The first dataset is a curated version of GoEmotions, a dataset intended for multi-label emotion classification.
We inspect the dataset in Argilla, come up with good heuristics, and combine them with a label model to train a weakly supervised Hugging Face transformer.
In the second dataset, we categorize research papers by topic based on their titles, which is a multi-label topic classification problem.
We repeat the process of finding good heuristics, combining them with a label model, and finally training a lightweight downstream model with sklearn.
Note
The Snorkel and FlyingSquid label models do not support multi-label classification out of the box.
Running Argilla#
For this tutorial, you will need to have an Argilla server running. There are two main options for deploying and running Argilla:
Deploy Argilla on Hugging Face Spaces: If you want to run tutorials with external notebooks (e.g., Google Colab) and you have an account on Hugging Face, you can deploy Argilla on Spaces with a few clicks:
For details about configuring your deployment, check the official Hugging Face Hub guide.
Launch Argilla using Argilla's quickstart Docker image: This is the recommended option if you want Argilla running on your local machine. Note that this option will only let you run the tutorial locally and not with an external notebook service.
For more information on deployment options, please check the Deployment section of the documentation.
Tip
This tutorial is a Jupyter Notebook. There are two options to run it:
Use the Open in Colab button at the top of this page. This option allows you to run the notebook directly on Google Colab. Don't forget to change the runtime type to GPU for faster model training and inference.
Download the .ipynb file by clicking on the View source link at the top of the page. This option allows you to download the notebook and run it on your local machine or on a Jupyter notebook tool of your choice.
Setup#
For this tutorial, you'll need to install the Argilla client and a few third-party libraries using pip:
[ ]:
%pip install argilla datasets "transformers[torch]" scikit-multilearn ipywidgets -qqq
Let's import the Argilla module for reading and writing data:
[ ]:
import argilla as rg
If you are running Argilla using the Docker quickstart image or Hugging Face Spaces, you need to init the Argilla client with the URL and API_KEY:
[ ]:
# Replace api_url with your HF Spaces URL if you are using Spaces
# Replace api_key if you configured a custom API key
rg.init(
api_url="http://localhost:6900",
api_key="admin.apikey"
)
If you're running a private Hugging Face Space, you will also need to set the HF_TOKEN as follows:
[ ]:
# # Set the HF_TOKEN environment variable
# import os
# os.environ['HF_TOKEN'] = "your-hf-token"
# # Replace api_url with your HF Spaces URL
# # Replace api_key if you configured a custom API key
# rg.init(
# api_url="https://[your-owner-name]-[your_space_name].hf.space",
# api_key="admin.apikey",
# extra_headers={"Authorization": f"Bearer {os.environ['HF_TOKEN']}"},
# )
Now let's include the imports we need:
[ ]:
from datasets import load_dataset
from argilla.labeling.text_classification import (
    Rule,
    WeakMultiLabels,
    add_rules,
    delete_rules,
    update_rules,
    MajorityVoter,
)
Enable Telemetry#
We gain valuable insights from how you interact with our tutorials. To help us offer you the most suitable content, running the following lines of code lets us know that this tutorial is serving you effectively. This is entirely anonymous, and you can skip this step if you prefer. For more info, please check out the Telemetry page.
[ ]:
try:
from argilla.utils.telemetry import tutorial_running
tutorial_running()
except ImportError:
print("Telemetry is introduced in Argilla 1.20.0 and not found in the current installation. Skipping telemetry.")
GoEmotions#
The original GoEmotions is a challenging dataset intended for multi-label emotion classification. For this tutorial, we simplify it a bit by selecting only 6 out of the 28 emotions: admiration, annoyance, approval, curiosity, gratitude, optimism. We also try to accentuate the multi-label part of the dataset by down-sampling the examples that are classified with only one label. See Appendix A for all the details of this preprocessing step.
Define rules#
Let us start by downloading our curated version of the dataset from the Hugging Face Hub, and logging it to Argilla:
[5]:
# Download preprocessed dataset
ds_rb = rg.read_datasets(
load_dataset("argilla/go_emotions_multi-label", split="train"),
task="TextClassification",
)
[ ]:
# Log dataset to Argilla to find good heuristics
rg.log(ds_rb, name="go_emotions")
After uploading the dataset, we can explore and inspect it to find good heuristic rules. For this, we highly recommend the dedicated Define rules mode of the Argilla web app, which allows you to quickly iterate over heuristic rules, compute their metrics and save them.
Here we copy the rules we found via the web app into the notebook, so you can easily follow along with the tutorial.
[7]:
# Define our heuristic rules (they can surely be improved)
rules = [
Rule("thank*", "gratitude"),
Rule("appreciate", "gratitude"),
Rule("text:(thanks AND good)", ["admiration", "gratitude"]),
Rule("advice", "admiration"),
Rule("amazing", "admiration"),
Rule("awesome", "admiration"),
Rule("impressed", "admiration"),
Rule("text:(good AND (point OR call OR idea OR job))", "admiration"),
Rule("legend", "admiration"),
Rule("exactly", "approval"),
Rule("agree", "approval"),
Rule("yeah", "optimism"),
Rule("suck", "annoyance"),
Rule("pissed", "annoyance"),
Rule("annoying", "annoyance"),
Rule("ruined", "annoyance"),
Rule("hoping", "optimism"),
Rule("joking", ["optimism", "admiration"]),
Rule('text:("good luck")', "optimism"),
Rule('"nice day"', "optimism"),
Rule('"what is"', "curiosity"),
Rule('"can you"', "curiosity"),
Rule('"would you"', "curiosity"),
Rule('"do you"', ["curiosity", "admiration"]),
Rule('"great"', ["annoyance"])
]
We go on and apply these heuristic rules to our dataset, creating our weak label matrix. Since we are dealing with a multi-label classification task, the weak label matrix has 3 dimensions:
Dimensions of the weak multi-label matrix: number of records x number of rules x number of labels
It is filled with 0s and 1s, depending on whether the rule voted for the respective label or not. If the rule abstained for a given record, the matrix is filled with -1.
We can call the weak_labels.summary() method to check the precision of each rule as well as our total coverage of the dataset.
[10]:
# Compute the weak labels for our dataset given the rules.
# If your dataset already contains rules you can omit the rules argument.
add_rules(dataset="go_emotions", rules=rules)
weak_labels = WeakMultiLabels("go_emotions")
# Check coverage/precision of our rules
weak_labels.summary()
[10]:
rule | label | coverage | annotated_coverage | overlaps | correct | incorrect | precision |
---|---|---|---|---|---|---|---|
thank* | {gratitude} | 0.199382 | 0.198925 | 0.048004 | 74 | 0 | 1.000000 |
appreciate | {gratitude} | 0.016397 | 0.021505 | 0.009981 | 7 | 1 | 0.875000 |
text:(thanks AND good) | {admiration, gratitude} | 0.007842 | 0.010753 | 0.007842 | 8 | 0 | 1.000000 |
advice | {admiration} | 0.008317 | 0.008065 | 0.007605 | 3 | 0 | 1.000000 |
amazing | {admiration} | 0.025428 | 0.021505 | 0.004990 | 8 | 0 | 1.000000 |
awesome | {admiration} | 0.025190 | 0.034946 | 0.007605 | 12 | 1 | 0.923077 |
impressed | {admiration} | 0.002139 | 0.005376 | 0.000000 | 2 | 0 | 1.000000 |
text:(good AND (point OR call OR idea OR job)) | {admiration} | 0.008555 | 0.018817 | 0.003089 | 7 | 0 | 1.000000 |
legend | {admiration} | 0.001901 | 0.002688 | 0.000475 | 1 | 0 | 1.000000 |
exactly | {approval} | 0.007842 | 0.010753 | 0.002376 | 3 | 1 | 0.750000 |
agree | {approval} | 0.016873 | 0.021505 | 0.003327 | 6 | 2 | 0.750000 |
yeah | {optimism} | 0.024952 | 0.021505 | 0.006179 | 2 | 6 | 0.250000 |
suck | {annoyance} | 0.002139 | 0.008065 | 0.000475 | 3 | 0 | 1.000000 |
pissed | {annoyance} | 0.002139 | 0.008065 | 0.000713 | 2 | 1 | 0.666667 |
annoying | {annoyance} | 0.003327 | 0.018817 | 0.001188 | 7 | 0 | 1.000000 |
ruined | {annoyance} | 0.000713 | 0.002688 | 0.000238 | 1 | 0 | 1.000000 |
hoping | {optimism} | 0.003565 | 0.005376 | 0.000713 | 2 | 0 | 1.000000 |
joking | {admiration, optimism} | 0.000238 | 0.000000 | 0.000000 | 0 | 0 | NaN |
text:("good luck") | {optimism} | 0.015209 | 0.018817 | 0.002614 | 4 | 3 | 0.571429 |
"nice day" | {optimism} | 0.000713 | 0.005376 | 0.000000 | 2 | 0 | 1.000000 |
"what is" | {curiosity} | 0.004040 | 0.005376 | 0.001188 | 2 | 0 | 1.000000 |
"can you" | {curiosity} | 0.004278 | 0.008065 | 0.000713 | 3 | 0 | 1.000000 |
"would you" | {curiosity} | 0.000951 | 0.005376 | 0.000238 | 2 | 0 | 1.000000 |
"do you" | {admiration, curiosity} | 0.010932 | 0.018817 | 0.002376 | 7 | 7 | 0.500000 |
"great" | {annoyance} | 0.055133 | 0.061828 | 0.016873 | 1 | 22 | 0.043478 |
total | {approval, gratitude, admiration, optimism, cu... | 0.379753 | 0.448925 | 0.060361 | 169 | 44 | 0.793427 |
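As a sanity check, you can also look at the raw weak label matrix to confirm the dimensions described above (number of records x number of rules x number of labels). The following is a minimal sketch, assuming WeakMultiLabels exposes a matrix() method that returns the matrix as a NumPy array (as the single-label WeakLabels class does):
[ ]:
# Inspect the raw weak label matrix (assumes a matrix() method returning a NumPy array)
matrix = weak_labels.matrix()
print(matrix.shape)        # (number of records, number of rules, number of labels)
print(weak_labels.labels)  # label names corresponding to the last dimension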
We can observe that "joking" does not have any support, and that "do you" is not informative because its correct/incorrect ratio equals 1. We can delete these two rules from the dataset using the delete_rules method.
[13]:
rules_to_delete = [
Rule("joking", ["optimism", "admiration"]),
Rule('"do you"', ["curiosity", "admiration"])]
delete_rules(dataset="go_emotions", rules=rules_to_delete)
weak_labels = WeakMultiLabels("go_emotions")
weak_labels.summary()
[13]:
rule | label | coverage | annotated_coverage | overlaps | correct | incorrect | precision |
---|---|---|---|---|---|---|---|
thank* | {gratitude} | 0.199382 | 0.198925 | 0.047766 | 74 | 0 | 1.000000 |
appreciate | {gratitude} | 0.016397 | 0.021505 | 0.009743 | 7 | 1 | 0.875000 |
text:(thanks AND good) | {admiration, gratitude} | 0.007842 | 0.010753 | 0.007842 | 8 | 0 | 1.000000 |
advice | {admiration} | 0.008317 | 0.008065 | 0.007367 | 3 | 0 | 1.000000 |
amazing | {admiration} | 0.025428 | 0.021505 | 0.004990 | 8 | 0 | 1.000000 |
awesome | {admiration} | 0.025190 | 0.034946 | 0.007129 | 12 | 1 | 0.923077 |
impressed | {admiration} | 0.002139 | 0.005376 | 0.000000 | 2 | 0 | 1.000000 |
text:(good AND (point OR call OR idea OR job)) | {admiration} | 0.008555 | 0.018817 | 0.003089 | 7 | 0 | 1.000000 |
legend | {admiration} | 0.001901 | 0.002688 | 0.000475 | 1 | 0 | 1.000000 |
exactly | {approval} | 0.007842 | 0.010753 | 0.002139 | 3 | 1 | 0.750000 |
agree | {approval} | 0.016873 | 0.021505 | 0.003327 | 6 | 2 | 0.750000 |
yeah | {optimism} | 0.024952 | 0.021505 | 0.006179 | 2 | 6 | 0.250000 |
suck | {annoyance} | 0.002139 | 0.008065 | 0.000475 | 3 | 0 | 1.000000 |
pissed | {annoyance} | 0.002139 | 0.008065 | 0.000475 | 2 | 1 | 0.666667 |
annoying | {annoyance} | 0.003327 | 0.018817 | 0.001188 | 7 | 0 | 1.000000 |
ruined | {annoyance} | 0.000713 | 0.002688 | 0.000238 | 1 | 0 | 1.000000 |
hoping | {optimism} | 0.003565 | 0.005376 | 0.000713 | 2 | 0 | 1.000000 |
text:("good luck") | {optimism} | 0.015209 | 0.018817 | 0.002614 | 4 | 3 | 0.571429 |
"nice day" | {optimism} | 0.000713 | 0.005376 | 0.000000 | 2 | 0 | 1.000000 |
"what is" | {curiosity} | 0.004040 | 0.005376 | 0.001188 | 2 | 0 | 1.000000 |
"can you" | {curiosity} | 0.004278 | 0.008065 | 0.000713 | 3 | 0 | 1.000000 |
"would you" | {curiosity} | 0.000951 | 0.005376 | 0.000238 | 2 | 0 | 1.000000 |
"great" | {annoyance} | 0.055133 | 0.061828 | 0.016397 | 1 | 22 | 0.043478 |
total | {approval, gratitude, admiration, optimism, cu... | 0.370960 | 0.435484 | 0.058222 | 162 | 37 | 0.814070 |
We can observe that the following rules are not working well:
Rule('"great"', ["annoyance"])
Rule("yeah", "optimism")
Let's update these two rules as follows:
Rule('"great"', ["admiration"])
Rule("yeah", "approval")
[14]:
rules_to_update = [
Rule('"great"', ["admiration"]),
Rule("yeah", "approval")]
update_rules(dataset="go_emotions", rules=rules_to_update)
Let us run weak labeling again with the final rules of the dataset:
[17]:
weak_labels = WeakMultiLabels(dataset="go_emotions")
weak_labels.summary()
[17]:
rule | label | coverage | annotated_coverage | overlaps | correct | incorrect | precision |
---|---|---|---|---|---|---|---|
thank* | {gratitude} | 0.199382 | 0.198925 | 0.047766 | 74 | 0 | 1.000000 |
appreciate | {gratitude} | 0.016397 | 0.021505 | 0.009743 | 7 | 1 | 0.875000 |
text:(thanks AND good) | {admiration, gratitude} | 0.007842 | 0.010753 | 0.007842 | 8 | 0 | 1.000000 |
advice | {admiration} | 0.008317 | 0.008065 | 0.007367 | 3 | 0 | 1.000000 |
amazing | {admiration} | 0.025428 | 0.021505 | 0.004990 | 8 | 0 | 1.000000 |
awesome | {admiration} | 0.025190 | 0.034946 | 0.007129 | 12 | 1 | 0.923077 |
impressed | {admiration} | 0.002139 | 0.005376 | 0.000000 | 2 | 0 | 1.000000 |
text:(good AND (point OR call OR idea OR job)) | {admiration} | 0.008555 | 0.018817 | 0.003089 | 7 | 0 | 1.000000 |
legend | {admiration} | 0.001901 | 0.002688 | 0.000475 | 1 | 0 | 1.000000 |
exactly | {approval} | 0.007842 | 0.010753 | 0.002139 | 3 | 1 | 0.750000 |
agree | {approval} | 0.016873 | 0.021505 | 0.003327 | 6 | 2 | 0.750000 |
yeah | {approval} | 0.024952 | 0.021505 | 0.006179 | 5 | 3 | 0.625000 |
suck | {annoyance} | 0.002139 | 0.008065 | 0.000475 | 3 | 0 | 1.000000 |
pissed | {annoyance} | 0.002139 | 0.008065 | 0.000475 | 2 | 1 | 0.666667 |
annoying | {annoyance} | 0.003327 | 0.018817 | 0.001188 | 7 | 0 | 1.000000 |
ruined | {annoyance} | 0.000713 | 0.002688 | 0.000238 | 1 | 0 | 1.000000 |
hoping | {optimism} | 0.003565 | 0.005376 | 0.000713 | 2 | 0 | 1.000000 |
text:("good luck") | {optimism} | 0.015209 | 0.018817 | 0.002614 | 4 | 3 | 0.571429 |
"nice day" | {optimism} | 0.000713 | 0.005376 | 0.000000 | 2 | 0 | 1.000000 |
"what is" | {curiosity} | 0.004040 | 0.005376 | 0.001188 | 2 | 0 | 1.000000 |
"can you" | {curiosity} | 0.004278 | 0.008065 | 0.000713 | 3 | 0 | 1.000000 |
"would you" | {curiosity} | 0.000951 | 0.005376 | 0.000238 | 2 | 0 | 1.000000 |
"great" | {admiration} | 0.055133 | 0.061828 | 0.016397 | 19 | 4 | 0.826087 |
total | {approval, gratitude, admiration, optimism, cu... | 0.370960 | 0.435484 | 0.058222 | 183 | 16 | 0.919598 |
Let's say we want to try out a new rule:
[20]:
optimism_rule = Rule("wish*", "optimism")
optimism_rule.apply(dataset="go_emotions")
optimism_rule.metrics(dataset="go_emotions")
[20]:
{'coverage': 0.006178707224334601,
'annotated_coverage': 0.0,
'correct': 0,
'incorrect': 0,
'precision': None}
The optimism_rule is not informative, so we don't add it to the dataset.
Let's try a rule for the curiosity class:
[23]:
curiosity_rule = Rule("could you", "curiosity")
curiosity_rule.apply("go_emotions")
curiosity_rule.metrics(dataset="go_emotions")
[23]:
{'coverage': 0.005465779467680608,
'annotated_coverage': 0.002688172043010753,
'correct': 1,
'incorrect': 0,
'precision': 1.0}
The curiosity_rule has positive support, so we can add it to the dataset as follows:
[24]:
curiosity_rule.add_to_dataset(dataset="go_emotions")
Let's apply weak labeling again with the final rule set:
[26]:
weak_labels = WeakMultiLabels(dataset="go_emotions")
weak_labels.summary()
[26]:
rule | label | coverage | annotated_coverage | overlaps | correct | incorrect | precision |
---|---|---|---|---|---|---|---|
thank* | {gratitude} | 0.199382 | 0.198925 | 0.048004 | 74 | 0 | 1.000000 |
appreciate | {gratitude} | 0.016397 | 0.021505 | 0.009743 | 7 | 1 | 0.875000 |
text:(thanks AND good) | {admiration, gratitude} | 0.007842 | 0.010753 | 0.007842 | 8 | 0 | 1.000000 |
advice | {admiration} | 0.008317 | 0.008065 | 0.007367 | 3 | 0 | 1.000000 |
amazing | {admiration} | 0.025428 | 0.021505 | 0.004990 | 8 | 0 | 1.000000 |
awesome | {admiration} | 0.025190 | 0.034946 | 0.007367 | 12 | 1 | 0.923077 |
impressed | {admiration} | 0.002139 | 0.005376 | 0.000000 | 2 | 0 | 1.000000 |
text:(good AND (point OR call OR idea OR job)) | {admiration} | 0.008555 | 0.018817 | 0.003089 | 7 | 0 | 1.000000 |
legend | {admiration} | 0.001901 | 0.002688 | 0.000475 | 1 | 0 | 1.000000 |
exactly | {approval} | 0.007842 | 0.010753 | 0.002139 | 3 | 1 | 0.750000 |
agree | {approval} | 0.016873 | 0.021505 | 0.003565 | 6 | 2 | 0.750000 |
yeah | {approval} | 0.024952 | 0.021505 | 0.006179 | 5 | 3 | 0.625000 |
suck | {annoyance} | 0.002139 | 0.008065 | 0.000475 | 3 | 0 | 1.000000 |
pissed | {annoyance} | 0.002139 | 0.008065 | 0.000475 | 2 | 1 | 0.666667 |
annoying | {annoyance} | 0.003327 | 0.018817 | 0.001188 | 7 | 0 | 1.000000 |
ruined | {annoyance} | 0.000713 | 0.002688 | 0.000238 | 1 | 0 | 1.000000 |
hoping | {optimism} | 0.003565 | 0.005376 | 0.000713 | 2 | 0 | 1.000000 |
text:("good luck") | {optimism} | 0.015209 | 0.018817 | 0.002614 | 4 | 3 | 0.571429 |
"nice day" | {optimism} | 0.000713 | 0.005376 | 0.000000 | 2 | 0 | 1.000000 |
"what is" | {curiosity} | 0.004040 | 0.005376 | 0.001188 | 2 | 0 | 1.000000 |
"can you" | {curiosity} | 0.004278 | 0.008065 | 0.000713 | 3 | 0 | 1.000000 |
"would you" | {curiosity} | 0.000951 | 0.005376 | 0.000475 | 2 | 0 | 1.000000 |
"great" | {admiration} | 0.055133 | 0.061828 | 0.016397 | 19 | 4 | 0.826087 |
could you | {curiosity} | 0.005466 | 0.002688 | 0.001188 | 1 | 0 | 1.000000 |
total | {approval, gratitude, admiration, optimism, cu... | 0.375238 | 0.435484 | 0.059173 | 184 | 16 | 0.920000 |
Create training set#
When we are happy with our heuristics, it is time to combine them and compute weak labels for the training of our downstream model. For this, we will use the MajorityVoter. In the multi-label case, it sets the probability of a label to 0 or 1 depending on whether at least one non-abstaining rule voted for the respective label or not.
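To build some intuition for this behavior, here is a small standalone sketch (not the library implementation) that mimics the described voting logic on a toy weak label matrix:
[ ]:
import numpy as np

# Toy weak label matrix: 2 records x 3 rules x 2 labels
# 1 = rule voted for the label, 0 = rule did not vote for it, -1 = rule abstained
toy_matrix = np.array(
    [
        [[1, 0], [-1, -1], [0, 1]],      # record 1: each label received one vote
        [[-1, -1], [-1, -1], [-1, -1]],  # record 2: all rules abstained
    ]
)

# A label gets probability 1 if at least one non-abstaining rule voted for it
probabilities = (toy_matrix == 1).any(axis=1).astype(float)
print(probabilities)  # [[1. 1.] [0. 0.]]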
[ ]:
# Use the majority voter as the label model
label_model = MajorityVoter(weak_labels)
From our label model, we get the training records together with their weak labels and probabilities. We will use the weak labels with a probability greater than 0.5 as labels for our training, and hence copy them to the annotation property of our records.
[ ]:
# Get records with the predictions from the label model to train a down-stream model
train_rg = label_model.predict()
# Copy label model predictions to annotation with a threshold of 0.5
for rec in train_rg:
rec.annotation = [pred[0] for pred in rec.prediction if pred[1] > 0.5]
We extract the test set with manual annotations from our WeakMultiLabels object:
[ ]:
# Get records with manual annotations to use as test set for the down-stream model
test_rg = rg.DatasetForTextClassification(weak_labels.records(has_annotation=True))
We will use the convenient DatasetForTextClassification.prepare_for_training() method to create datasets optimized for training with the Hugging Face transformers library:
[ ]:
from datasets import DatasetDict
# Create dataset dictionary and shuffle training set
ds = DatasetDict(
train=train_rg.prepare_for_training().shuffle(seed=42),
test=test_rg.prepare_for_training(),
)
Train a transformer downstream model#
The following steps are basically a copy & paste from the amazing documentation of the Hugging Face transformers library.
First, we will load the tokenizer corresponding to our model, which we choose to be a distilled version of the famous BERT model.
Note
Since we will use a full-blown transformer as a downstream model (albeit a distilled one), we recommend executing the following code on a machine with a GPU, or in a Google Colab with a GPU backend enabled.
[ ]:
from transformers import AutoTokenizer
# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
Afterward, we tokenize our data:
[ ]:
def tokenize_func(examples):
return tokenizer(examples["text"], padding="max_length", truncation=True)
# Tokenize the data
tokenized_ds = ds.map(tokenize_func, batched=True)
The transformer model expects our labels to follow the common multi-label format of binary vectors, so let us use sklearn for this transformation.
[ ]:
from sklearn.preprocessing import MultiLabelBinarizer
# Turn labels into multi-label format
mb = MultiLabelBinarizer()
mb.fit(ds["test"]["label"])
def binarize_labels(examples):
return {"label": mb.transform(examples["label"])}
binarized_tokenized_ds = tokenized_ds.map(binarize_labels, batched=True)
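Since we will later pair the per-label F1 scores with the label names from the dataset features, it is worth a quick check that the alphabetical order used by the MultiLabelBinarizer matches that order:
[ ]:
# Sanity check: MultiLabelBinarizer sorts classes alphabetically; this order should
# match the label feature names we use below when naming the per-label F1 scores
print(mb.classes_)
print(ds["train"].features["label"][0].names)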
Before we start the training, it is important to define our metric for the evaluation. Here we settle on the commonly used micro averaged F1 metric, but we will also keep track of the F1 per label, for a more in-depth error analysis afterward.
[ ]:
from datasets import load_metric
import numpy as np
# Define our metrics
metric = load_metric("f1", config_name="multilabel")
def compute_metrics(eval_pred):
logits, labels = eval_pred
# apply sigmoid
predictions = (1.0 / (1 + np.exp(-logits))) > 0.5
# f1 micro averaged
metrics = metric.compute(
predictions=predictions, references=labels, average="micro"
)
# f1 per label
per_label_metric = metric.compute(
predictions=predictions, references=labels, average=None
)
for label, f1 in zip(
ds["train"].features["label"][0].names, per_label_metric["f1"]
):
metrics[f"f1_{label}"] = f1
return metrics
Now we are ready to load our pre-trained transformer model and prepare it for our task: multi-label text classification with 6 labels.
[ ]:
from transformers import AutoModelForSequenceClassification
# Init our down-stream model
model = AutoModelForSequenceClassification.from_pretrained(
"distilbert-base-uncased", problem_type="multi_label_classification", num_labels=6
)
The only things missing for the training are the Trainer and its TrainingArguments. To keep it simple, we mostly rely on the default arguments, which often work out of the box, but tweak the batch size a bit to train faster. We also checked that 2 epochs are enough for our rather small dataset.
[ ]:
from transformers import TrainingArguments
# Set our training arguments
training_args = TrainingArguments(
output_dir="test_trainer",
evaluation_strategy="epoch",
num_train_epochs=2,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
)
[ ]:
from transformers import Trainer
# Init the trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=binarized_tokenized_ds["train"],
eval_dataset=binarized_tokenized_ds["test"],
compute_metrics=compute_metrics,
)
[ ]:
# Train the down-stream model
trainer.train()
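If you want to recompute the final metrics on the test split after training (in addition to the per-epoch evaluations printed above), you can call the Trainer's evaluate method; note that the metric keys returned by our compute_metrics function get an "eval_" prefix:
[ ]:
# Evaluate the final model on the test split; returns e.g. "eval_f1" and "eval_f1_<label>"
eval_metrics = trainer.evaluate()
print(eval_metrics["eval_f1"])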
We achieved a micro-averaged F1 of about 0.54, which is not perfect, but a good baseline for this challenging dataset. When inspecting the F1s per label, we clearly see that the worst-performing labels are the ones with the poorest heuristics in terms of accuracy and coverage, which comes as no surprise.
Research topic dataset#
After covering a multi-label emotion classification task, we will try to do the same for a multi-label classification task related to topic modeling. In this dataset, research papers were classified with 6 non-exclusive labels based on their title and abstract.
We will try to classify the papers only based on the title, which is considerably harder, but allows us to quickly scan through the data and come up with heuristics. See Appendix B for all the details of the minimal data preprocessing.
Define rules#
Let us start by downloading our preprocessed dataset from the Hugging Face Hub, and logging it to Argilla:
[ ]:
# Download preprocessed dataset
ds_rb = rg.read_datasets(
load_dataset("argilla/research_titles_multi-label", split="train"),
task="TextClassification",
)
[ ]:
# Log dataset to Argilla to find good heuristics
rg.log(ds_rb, "research_titles")
After uploading the dataset, we can explore and inspect it to find good heuristic rules. For this, we highly recommend the dedicated Define rules mode of the Argilla web app, which allows you to quickly iterate over heuristic rules, compute their metrics and save them.
Here we copy the rules we found via the web app into the notebook, so you can easily follow along with the tutorial.
[29]:
# Define our heuristic rules (can probably be improved)
rules = [
Rule("stock*", "Quantitative Finance"),
Rule("*asset*", "Quantitative Finance"),
Rule("pric*", "Quantitative Finance"),
Rule("economy", "Quantitative Finance"),
Rule("deep AND neural AND network*", "Computer Science"),
Rule("convolutional", "Computer Science"),
Rule("allocat* AND *net*", "Computer Science"),
Rule("program", "Computer Science"),
Rule("classification* AND (label* OR deep)", "Computer Science"),
Rule("scattering", "Physics"),
Rule("astro*", "Physics"),
Rule("optical", "Physics"),
Rule("ray", "Physics"),
Rule("entangle*", "Physics"),
Rule("*algebra*", "Mathematics"),
Rule("spaces", "Mathematics"),
Rule("operators", "Mathematics"),
Rule("estimation", "Statistics"),
Rule("mixture", "Statistics"),
Rule("gaussian", "Statistics"),
Rule("gene", "Quantitative Biology"),
]
We go on and apply these heuristic rules to our dataset, creating our weak label matrix. As mentioned in the GoEmotions section, the weak label matrix will have 3 dimensions and values of -1, 0 and 1.
Let us get an overview of our heuristics and how they perform:
[31]:
# Compute the weak labels for our dataset given the rules
# If your dataset already contains rules you can omit the rules argument.
add_rules(dataset="research_titles", rules=rules)
weak_labels = WeakMultiLabels("research_titles")
weak_labels.summary()
[31]:
rule | label | coverage | annotated_coverage | overlaps | correct | incorrect | precision |
---|---|---|---|---|---|---|---|
stock* | {Quantitative Finance} | 0.000954 | 0.000715 | 0.000191 | 3 | 0 | 1.000000 |
*asset* | {Quantitative Finance} | 0.000477 | 0.000715 | 0.000238 | 3 | 0 | 1.000000 |
pric* | {Quantitative Finance} | 0.003433 | 0.003337 | 0.000668 | 9 | 5 | 0.642857 |
economy | {Quantitative Finance} | 0.000238 | 0.000238 | 0.000000 | 1 | 0 | 1.000000 |
deep AND neural AND network* | {Computer Science} | 0.009155 | 0.010250 | 0.002575 | 32 | 11 | 0.744186 |
convolutional | {Computer Science} | 0.010109 | 0.009297 | 0.002003 | 32 | 7 | 0.820513 |
allocat* AND *net* | {Computer Science} | 0.000763 | 0.000715 | 0.000000 | 3 | 0 | 1.000000 |
program | {Computer Science} | 0.002623 | 0.003099 | 0.000095 | 11 | 2 | 0.846154 |
classification* AND (label* OR deep) | {Computer Science} | 0.003338 | 0.004052 | 0.001287 | 14 | 3 | 0.823529 |
scattering | {Physics} | 0.004053 | 0.002861 | 0.000572 | 10 | 2 | 0.833333 |
astro* | {Physics} | 0.003099 | 0.004052 | 0.000477 | 17 | 0 | 1.000000 |
optical | {Physics} | 0.007105 | 0.006913 | 0.000811 | 27 | 2 | 0.931034 |
ray | {Physics} | 0.005865 | 0.007390 | 0.000668 | 27 | 4 | 0.870968 |
entangle* | {Physics} | 0.002623 | 0.002861 | 0.000048 | 11 | 1 | 0.916667 |
*algebra* | {Mathematics} | 0.014829 | 0.018355 | 0.000429 | 70 | 7 | 0.909091 |
spaces | {Mathematics} | 0.010586 | 0.009774 | 0.001287 | 38 | 3 | 0.926829 |
operators | {Mathematics} | 0.006151 | 0.005959 | 0.001192 | 22 | 3 | 0.880000 |
estimation | {Statistics} | 0.021266 | 0.021216 | 0.001621 | 65 | 24 | 0.730337 |
mixture | {Statistics} | 0.003290 | 0.003099 | 0.000906 | 10 | 3 | 0.769231 |
gaussian | {Statistics} | 0.009250 | 0.011204 | 0.001526 | 36 | 11 | 0.765957 |
gene | {Quantitative Biology} | 0.001287 | 0.001669 | 0.000143 | 6 | 1 | 0.857143 |
total | {Mathematics, Quantitative Biology, Physics, Q... | 0.111911 | 0.118951 | 0.008154 | 447 | 89 | 0.833955 |
Now consider the case where we have come up with new rules and want to add them to the dataset:
[32]:
additional_rules = [
Rule("trading", "Quantitative Finance"),
Rule("finance", "Quantitative Finance"),
Rule("memor* AND (design* OR network*)", "Computer Science"),
Rule("system* AND design*", "Computer Science"),
Rule("material*", "Physics"),
Rule("spin", "Physics"),
Rule("magnetic", "Physics"),
Rule("manifold* AND (NOT learn*)", "Mathematics"),
Rule("equation", "Mathematics"),
Rule("regression", "Statistics"),
Rule("bayes*", "Statistics"),
]
[35]:
add_rules(dataset="research_titles", rules=additional_rules)
weak_labels = WeakMultiLabels("research_titles")
weak_labels.summary()
[35]:
rule | label | coverage | annotated_coverage | overlaps | correct | incorrect | precision |
---|---|---|---|---|---|---|---|
stock* | {Quantitative Finance} | 0.000954 | 0.000715 | 0.000334 | 3 | 0 | 1.000000 |
*asset* | {Quantitative Finance} | 0.000477 | 0.000715 | 0.000286 | 3 | 0 | 1.000000 |
pric* | {Quantitative Finance} | 0.003433 | 0.003337 | 0.000715 | 9 | 5 | 0.642857 |
economy | {Quantitative Finance} | 0.000238 | 0.000238 | 0.000000 | 1 | 0 | 1.000000 |
deep AND neural AND network* | {Computer Science} | 0.009155 | 0.010250 | 0.002909 | 32 | 11 | 0.744186 |
convolutional | {Computer Science} | 0.010109 | 0.009297 | 0.002241 | 32 | 7 | 0.820513 |
allocat* AND *net* | {Computer Science} | 0.000763 | 0.000715 | 0.000000 | 3 | 0 | 1.000000 |
program | {Computer Science} | 0.002623 | 0.003099 | 0.000143 | 11 | 2 | 0.846154 |
classification* AND (label* OR deep) | {Computer Science} | 0.003338 | 0.004052 | 0.001335 | 14 | 3 | 0.823529 |
scattering | {Physics} | 0.004053 | 0.002861 | 0.001001 | 10 | 2 | 0.833333 |
astro* | {Physics} | 0.003099 | 0.004052 | 0.000620 | 17 | 0 | 1.000000 |
optical | {Physics} | 0.007105 | 0.006913 | 0.001097 | 27 | 2 | 0.931034 |
ray | {Physics} | 0.005865 | 0.007390 | 0.001192 | 27 | 4 | 0.870968 |
entangle* | {Physics} | 0.002623 | 0.002861 | 0.000095 | 11 | 1 | 0.916667 |
*algebra* | {Mathematics} | 0.014829 | 0.018355 | 0.000620 | 70 | 7 | 0.909091 |
spaces | {Mathematics} | 0.010586 | 0.009774 | 0.001860 | 38 | 3 | 0.926829 |
operators | {Mathematics} | 0.006151 | 0.005959 | 0.001526 | 22 | 3 | 0.880000 |
estimation | {Statistics} | 0.021266 | 0.021216 | 0.003385 | 65 | 24 | 0.730337 |
mixture | {Statistics} | 0.003290 | 0.003099 | 0.001287 | 10 | 3 | 0.769231 |
gaussian | {Statistics} | 0.009250 | 0.011204 | 0.002766 | 36 | 11 | 0.765957 |
gene | {Quantitative Biology} | 0.001287 | 0.001669 | 0.000191 | 6 | 1 | 0.857143 |
trading | {Quantitative Finance} | 0.000954 | 0.000238 | 0.000191 | 1 | 0 | 1.000000 |
finance | {Quantitative Finance} | 0.000048 | 0.000238 | 0.000000 | 1 | 0 | 1.000000 |
memor* AND (design* OR network*) | {Computer Science} | 0.001383 | 0.002145 | 0.000286 | 9 | 0 | 1.000000 |
system* AND design* | {Computer Science} | 0.001144 | 0.002384 | 0.000238 | 9 | 1 | 0.900000 |
material* | {Physics} | 0.004148 | 0.003099 | 0.000238 | 10 | 3 | 0.769231 |
spin | {Physics} | 0.013542 | 0.015018 | 0.002146 | 60 | 3 | 0.952381 |
magnetic | {Physics} | 0.011301 | 0.012872 | 0.002432 | 49 | 5 | 0.907407 |
manifold* AND (NOT learn*) | {Mathematics} | 0.007057 | 0.008343 | 0.000858 | 28 | 7 | 0.800000 |
equation | {Mathematics} | 0.010681 | 0.007867 | 0.000954 | 24 | 9 | 0.727273 |
regression | {Statistics} | 0.009393 | 0.009058 | 0.002575 | 33 | 5 | 0.868421 |
bayes* | {Statistics} | 0.015306 | 0.014779 | 0.003147 | 49 | 13 | 0.790323 |
total | {Mathematics, Quantitative Biology, Physics, Q... | 0.176616 | 0.185936 | 0.017833 | 720 | 135 | 0.842105 |
Let's create new rules and check their effect; if they are informative enough, we can proceed by adding them to the dataset.
[36]:
# create a statistics rule and get its metrics
statistics_rule = Rule("sample", "Statistics")
statistics_rule.apply("research_titles")
statistics_rule.metrics("research_titles")
[36]:
{'coverage': 0.004672897196261682,
'annotated_coverage': 0.004529201430274136,
'correct': 17,
'incorrect': 2,
'precision': 0.8947368421052632}
[37]:
# add the statistics_rule to the research_titles dataset
statistics_rule.add_to_dataset("research_titles")
[38]:
finance_rule = Rule("risk", "Quantitative Finance")
finance_rule.apply("research_titles")
finance_rule.metrics("research_titles")
[38]:
{'coverage': 0.004815945069616631,
'annotated_coverage': 0.004290822407628129,
'correct': 1,
'incorrect': 17,
'precision': 0.05555555555555555}
[39]:
finance_rule.add_to_dataset("research_titles")
Our assumption does not seem correct; let us update this rule:
[40]:
rule = Rule("risk", "Statistics")
[41]:
rule.metrics("research_titles")
[41]:
{'coverage': 0.004815945069616631,
'annotated_coverage': 0.004290822407628129,
'correct': 11,
'incorrect': 7,
'precision': 0.6111111111111112}
[42]:
rule.update_at_dataset("research_titles")
[43]:
quantitative_biology_rule = Rule("dna", "Quantitative Biology")
[44]:
quantitative_biology_rule.metrics("research_titles")
[44]:
{'coverage': 0.0013351134846461949,
'annotated_coverage': 0.0011918951132300357,
'correct': 4,
'incorrect': 1,
'precision': 0.8}
[45]:
quantitative_biology_rule.add_to_dataset("research_titles")
Let's see the final summary with the newly added rules:
[47]:
weak_labels = WeakMultiLabels("research_titles")
weak_labels.summary()
[47]:
rule | label | coverage | annotated_coverage | overlaps | correct | incorrect | precision |
---|---|---|---|---|---|---|---|
stock* | {Quantitative Finance} | 0.000954 | 0.000715 | 0.000334 | 3 | 0 | 1.000000 |
*asset* | {Quantitative Finance} | 0.000477 | 0.000715 | 0.000334 | 3 | 0 | 1.000000 |
pric* | {Quantitative Finance} | 0.003433 | 0.003337 | 0.000811 | 9 | 5 | 0.642857 |
economy | {Quantitative Finance} | 0.000238 | 0.000238 | 0.000048 | 1 | 0 | 1.000000 |
deep AND neural AND network* | {Computer Science} | 0.009155 | 0.010250 | 0.002956 | 32 | 11 | 0.744186 |
convolutional | {Computer Science} | 0.010109 | 0.009297 | 0.002336 | 32 | 7 | 0.820513 |
allocat* AND *net* | {Computer Science} | 0.000763 | 0.000715 | 0.000048 | 3 | 0 | 1.000000 |
program | {Computer Science} | 0.002623 | 0.003099 | 0.000191 | 11 | 2 | 0.846154 |
classification* AND (label* OR deep) | {Computer Science} | 0.003338 | 0.004052 | 0.001335 | 14 | 3 | 0.823529 |
scattering | {Physics} | 0.004053 | 0.002861 | 0.001049 | 10 | 2 | 0.833333 |
astro* | {Physics} | 0.003099 | 0.004052 | 0.000668 | 17 | 0 | 1.000000 |
optical | {Physics} | 0.007105 | 0.006913 | 0.001097 | 27 | 2 | 0.931034 |
ray | {Physics} | 0.005865 | 0.007390 | 0.001240 | 27 | 4 | 0.870968 |
entangle* | {Physics} | 0.002623 | 0.002861 | 0.000095 | 11 | 1 | 0.916667 |
*algebra* | {Mathematics} | 0.014829 | 0.018355 | 0.000620 | 70 | 7 | 0.909091 |
spaces | {Mathematics} | 0.010586 | 0.009774 | 0.001860 | 38 | 3 | 0.926829 |
operators | {Mathematics} | 0.006151 | 0.005959 | 0.001574 | 22 | 3 | 0.880000 |
estimation | {Statistics} | 0.021266 | 0.021216 | 0.003862 | 65 | 24 | 0.730337 |
mixture | {Statistics} | 0.003290 | 0.003099 | 0.001335 | 10 | 3 | 0.769231 |
gaussian | {Statistics} | 0.009250 | 0.011204 | 0.003052 | 36 | 11 | 0.765957 |
gene | {Quantitative Biology} | 0.001287 | 0.001669 | 0.000191 | 6 | 1 | 0.857143 |
trading | {Quantitative Finance} | 0.000954 | 0.000238 | 0.000191 | 1 | 0 | 1.000000 |
finance | {Quantitative Finance} | 0.000048 | 0.000238 | 0.000000 | 1 | 0 | 1.000000 |
memor* AND (design* OR network*) | {Computer Science} | 0.001383 | 0.002145 | 0.000286 | 9 | 0 | 1.000000 |
system* AND design* | {Computer Science} | 0.001144 | 0.002384 | 0.000238 | 9 | 1 | 0.900000 |
material* | {Physics} | 0.004148 | 0.003099 | 0.000238 | 10 | 3 | 0.769231 |
spin | {Physics} | 0.013542 | 0.015018 | 0.002146 | 60 | 3 | 0.952381 |
magnetic | {Physics} | 0.011301 | 0.012872 | 0.002432 | 49 | 5 | 0.907407 |
manifold* AND (NOT learn*) | {Mathematics} | 0.007057 | 0.008343 | 0.000858 | 28 | 7 | 0.800000 |
equation | {Mathematics} | 0.010681 | 0.007867 | 0.001001 | 24 | 9 | 0.727273 |
regression | {Statistics} | 0.009393 | 0.009058 | 0.002718 | 33 | 5 | 0.868421 |
bayes* | {Statistics} | 0.015306 | 0.014779 | 0.003481 | 49 | 13 | 0.790323 |
sample | {Statistics} | 0.004673 | 0.004529 | 0.000811 | 17 | 2 | 0.894737 |
risk | {Statistics} | 0.004816 | 0.004291 | 0.001097 | 11 | 7 | 0.611111 |
dna | {Quantitative Biology} | 0.001335 | 0.001192 | 0.000143 | 4 | 1 | 0.800000 |
total | {Mathematics, Quantitative Biology, Physics, Q... | 0.185390 | 0.194041 | 0.019788 | 752 | 145 | 0.838350 |
Create training set#
When we are happy with our heuristics, it is time to combine them and compute weak labels for the training of our downstream model. As for the GoEmotions dataset, we will use the simple MajorityVoter.
[48]:
# Use the majority voter as the label model
label_model = MajorityVoter(weak_labels)
From our label model, we get the training records together with their weak labels and probabilities. Since we are going to train an sklearn model, we will put the records in a pandas DataFrame, which generally integrates well with the sklearn ecosystem.
[49]:
train_df = label_model.predict().to_pandas()
Before training our model, we need to extract the training labels from the label model predictions and transform them into a multi-label compatible format.
[50]:
# Create labels in multi-label format, we will use a threshold of 0.5 for the probability
def multi_label_binarizer(predictions, threshold=0.5):
predicted_labels = [label for label, prob in predictions if prob > threshold]
binary_labels = [
1 if label in predicted_labels else 0 for label in weak_labels.labels
]
return binary_labels
train_df["label"] = train_df.prediction.map(multi_label_binarizer)
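As a quick sanity check, you can look at how many positive weak labels each class received in the training set:
[ ]:
import numpy as np

# Count how many training records the label model assigned to each class
label_counts = np.array(train_df.label.tolist()).sum(axis=0)
print(dict(zip(weak_labels.labels, label_counts)))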
Now, let us define our downstream model and train it.
We will use the scikit-multilearn library to wrap a multinomial Naive Bayes classifier that is suitable for classification with discrete features (e.g., word counts for text classification). The BinaryRelevance class transforms the multi-label problem with L labels into L single-label binary classification problems, so in the end we automatically fit L Naive Bayes classifiers to our data.
The features for our classifier will be the counts of different word n-grams: that is, for each example, we count the number of contiguous sequences of n words, where n goes from 1 to 5. We extract these features with the CountVectorizer.
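Note that CountVectorizer counts single words (unigrams) by default; to actually count n-grams of length 1 up to 5 as described, you can pass its ngram_range argument. The pipeline below sticks to the default settings unless you add it, for example:
[ ]:
from sklearn.feature_extraction.text import CountVectorizer

# Count contiguous word sequences of length 1 up to 5 (unigrams to 5-grams)
vectorizer = CountVectorizer(ngram_range=(1, 5))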
Finally, we will put our feature extractor and multi-label classifier in a sklearn pipeline that makes fitting and scoring the model a breeze.
[51]:
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
# Define our down-stream model
classifier = Pipeline(
[("vect", CountVectorizer()), ("clf", BinaryRelevance(MultinomialNB()))]
)
Training the model is as easy as calling the fit method on our pipeline, providing our training texts and training labels.
[52]:
import numpy as np
# Fit the down-stream classifier
classifier.fit(
X=train_df.text,
y=np.array(train_df.label.tolist()),
)
[52]:
Pipeline(steps=[('vect', CountVectorizer()), ('clf', BinaryRelevance(classifier=MultinomialNB(), require_dense=[True, True]))])
To score our trained model, we retrieve its predictions on the test set and use sklearn's classification_report to get all important classification metrics in a nicely formatted string.
[53]:
# Get predictions for test set
predictions = classifier.predict(
X=[rec.text for rec in weak_labels.records(has_annotation=True)]
)
[54]:
from sklearn.metrics import classification_report
# Compute metrics
print(
classification_report(
weak_labels.annotation(), predictions, target_names=weak_labels.labels
)
)
precision recall f1-score support
Computer Science 0.81 0.24 0.38 1740
Mathematics 0.79 0.58 0.67 1141
Physics 0.88 0.65 0.74 1186
Quantitative Biology 0.67 0.02 0.04 109
Quantitative Finance 0.46 0.13 0.21 45
Statistics 0.52 0.69 0.60 1069
micro avg 0.71 0.49 0.58 5290
macro avg 0.69 0.39 0.44 5290
weighted avg 0.76 0.49 0.56 5290
samples avg 0.58 0.52 0.53 5290
We obtain a micro-averaged F1 score of around 0.59, which again is not perfect but can serve as a decent baseline for future improvements. Looking at the F1 per label, we see that the main problem is the recall of our heuristics and we should either define more of them or try to find more general ones.
Summary#
In this tutorial we saw how you can use Argilla to tackle multi-label text classification problems with weak supervision. We showed you how to train two downstream models on two different multi-label datasets using the discovered heuristics.
For the emotion classification task, we trained a full-blown transformer model with Hugging Face, while for the topic classification task, we relied on a more lightweight Bayes classifier from sklearn. Although the results are not perfect, they can serve as a good baseline for future improvements.
So the next time you encounter a multi-label classification problem, maybe try out weak supervision with Argilla and save some time for your annotation team.
Appendix A#
This appendix summarizes the preprocessing steps for our curated GoEmotions dataset. The goal was to limit the labels, and down-sample single-label annotations to move the focus to multi-label outputs.
[ ]:
# load original dataset and check label frequencies
import pandas as pd
import datasets
go_emotions = datasets.load_dataset("go_emotions")
df = go_emotions["test"].to_pandas()
def int2str(i):
    return go_emotions["train"].features["labels"].feature.int2str(int(i))
label_freq = []
idx_multi = df.labels.map(lambda x: len(x) > 1)
df["is_single"] = df.labels.map(lambda x: 0 if len(x) > 1 else 1)
df[idx_multi].labels.map(lambda x: [label_freq.append(int(l)) for l in x])
pd.Series(label_freq).value_counts()
[ ]:
# limit labels, down-sample single-label annotations and create Argilla records
import argilla as rg
def create(split: str) -> pd.DataFrame:
df = go_emotions[split].to_pandas()
df["is_single"] = df.labels.map(lambda x: 0 if len(x) > 1 else 1)
# ['admiration', 'approval', 'annoyance', 'gratitude', 'curiosity', 'optimism', 'amusement']
idx_most_common = df.labels.map(
lambda x: all([int(label) in [0, 4, 3, 15, 7, 15, 20] for label in x])
)
df_multi = df[(df.is_single == 0) & idx_most_common]
df_single = df[idx_most_common].sample(
3 * len(df_multi), weights="is_single", axis=0, random_state=42
)
return pd.concat([df_multi, df_single]).sample(frac=1, random_state=42)
def make_records(row, is_train: bool) -> rg.TextClassificationRecord:
annotation = [int2str(i) for i in row.labels] if not is_train else None
return rg.TextClassificationRecord(
inputs=row.text,
annotation=annotation,
multi_label=True,
id=row.id,
)
train_recs = create("train").apply(make_records, axis=1, is_train=True)
test_recs = create("test").apply(make_records, axis=1, is_train=False)
records = train_recs.to_list() + test_recs.tolist()
Appendix B#
This appendix summarizes the minimal preprocessing done to this multi-label classification dataset from Kaggle. You can download the original data (train.csv) by following the Kaggle link.
The preprocessing consists of extracting only the title from each research paper and splitting the data into a train and validation set.
[ ]:
# Extract the title and split the data
import pandas as pd
import argilla as rg
from sklearn.model_selection import train_test_split
df = pd.read_csv("train.csv")
_, test_id = train_test_split(df.ID, test_size=0.2, random_state=42)
labels = [
"Computer Science",
"Physics",
"Mathematics",
"Statistics",
"Quantitative Biology",
"Quantitative Finance",
]
def make_record(row):
annotation = [label for label in labels if row[label] == 1]
return rg.TextClassificationRecord(
inputs=row.TITLE,
# inputs={"title": row.TITLE, "abstract": row.ABSTRACT},
annotation=annotation if row.ID in test_id else None,
multi_label=True,
id=row.ID,
)
records = df.apply(make_record, axis=1)