🎨 Label data with semantic search and Sentence Transformers#
In this tutorial, you'll learn to use Sentence Transformer embeddings and semantic search to make data labeling significantly faster. It will walk you through the following steps:
💾 use sentence transformers to generate embeddings of a dataset with banking customer requests
🚀 upload the dataset into Argilla for data labeling
🏷 use the similarity search feature to efficiently find and bulk-label semantically related examples
Introduction#
In this tutorial, we'll use the power of embeddings to make data labeling (and curation) more efficient. The idea of exploiting embeddings for labeling is not new, and there are several cool, standalone libraries to label data using embeddings.

Starting from version 1.2.0, Argilla gives you a way to leverage embedding-based similarity together with all the other workflows it already provides: search-based bulk labeling, programmatic labeling using search queries, model pre-annotation, and human-in-the-loop workflows. This also means you can combine keyword searches and filters with this new similarity search feature. Without any vendor or model lock-in, you can use ANY embedding or encoding method, including but not limited to Sentence Transformers, OpenAI, or Co:here. If you want a deep dive, you can check the Semantic similarity deep-dive, but this tutorial will show you the basics to get started.
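As a quick preview of what that flexibility looks like in practice: any encoder that turns a list of texts into one float vector per text can be plugged in, because Argilla only expects vectors as a dictionary mapping a vector name to a list of floats. This is a minimal sketch; the alternative model name below is just an illustration (the tutorial itself uses all-MiniLM-L6-v2, set up in the Setup section).
[ ]:
from sentence_transformers import SentenceTransformer

# Any encoder works; this swaps in a different Sentence Transformers model
# purely to illustrate the vector format Argilla expects.
encoder = SentenceTransformer("all-mpnet-base-v2")

texts = ["How do I locate my card?", "My card has not arrived yet."]
embeddings = encoder.encode(texts)

# Argilla expects one {vector_name: [floats]} dictionary per record
vectors = [{"all-mpnet-base-v2": emb.tolist()} for emb in embeddings]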
Let's do it!
Running Argilla#
For this tutorial, you will need to have an Argilla server running. There are two main options for deploying and running Argilla:
Deploy Argilla on Hugging Face Spaces: If you want to run tutorials with external notebooks (e.g., Google Colab) and you have an account on Hugging Face, you can deploy Argilla on Spaces with a few clicks:
For details about configuring your deployment, check the official Hugging Face Hub guide.
Launch Argilla using Argilla's quickstart Docker image: This is the recommended option if you want Argilla running on your local machine. Note that this option will only let you run the tutorial locally and not with an external notebook service.
For more information on deployment options, please check the Deployment section of the documentation.
Tip
This tutorial is a Jupyter Notebook. There are two options to run it:
Use the Open in Colab button at the top of this page. This option allows you to run the notebook directly on Google Colab. Don't forget to change the runtime type to GPU for faster model training and inference.
Download the .ipynb file by clicking on the View source link at the top of the page. This option allows you to download the notebook and run it on your local machine or on a Jupyter notebook tool of your choice.
Setup#
For this tutorial, you'll need Argilla's Python client and a few third-party libraries that can be installed via pip:
[ ]:
%pip install argilla datasets==2.8.0 sentence-transformers==2.2.2 -qqq
Let's import the Argilla module for reading and writing data:
[ ]:
import argilla as rg
If you are running Argilla using the Docker quickstart image or Hugging Face Spaces, you need to init the Argilla client with the URL and API_KEY:
[ ]:
# Replace api_url with your HF Spaces URL if using Spaces
# Replace api_key if you configured a custom API key
rg.init(
    api_url="http://localhost:6900",
    api_key="admin.apikey"
)
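Optionally, you can run a quick sanity check that the client is connected. This is a hedged sketch assuming the Argilla 1.x client API, where rg.get_workspace() returns the workspace the client currently points to:
[ ]:
# Sanity check: prints the active workspace if the connection works
print(rg.get_workspace())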
If you're running a private Hugging Face Space, you will also need to set the HF_TOKEN as follows:
[ ]:
# # Set the HF_TOKEN environment variable
# import os
# os.environ['HF_TOKEN'] = "your-hf-token"

# # Replace api_url with your HF Spaces URL
# # Replace api_key if you configured a custom API key
# rg.init(
#     api_url="https://[your-owner-name]-[your_space_name].hf.space",
#     api_key="admin.apikey",
#     extra_headers={"Authorization": f"Bearer {os.environ['HF_TOKEN']}"},
# )
Let's add the imports we need:
[ ]:
from sentence_transformers import SentenceTransformer
from datasets import load_dataset
Enable Telemetry#
We gain valuable insights from how you interact with our tutorials. Running the following lines of code helps us understand whether this tutorial is serving you effectively, so we can offer you the most suitable content. The data is entirely anonymous, and you can choose to skip this step if you prefer. For more info, please check out the Telemetry page.
[ ]:
try:
    from argilla.utils.telemetry import tutorial_running

    tutorial_running()
except ImportError:
    print("Telemetry is introduced in Argilla 1.20.0 and not found in the current installation. Skipping telemetry.")
💾 Downloading and embedding your dataset#
The code below will load the banking customer requests dataset from the Hub, encode the text field, and create a vectors field that contains a single key (mini-lm-sentence-transformers). Because you'll be labeling the dataset from scratch, it will also remove the label field, which contains the original intent labels.
[ ]:
# Define a fast sentence transformers model
encoder = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")

# Load the dataset with banking customer requests
dataset = load_dataset("banking77", split="test")

# Encode the text field using batched computation
dataset = dataset.map(
    lambda batch: {"vectors": encoder.encode(batch["text"])},
    batch_size=32,
    batched=True,
)

# Remove the original labels because you'll be labeling from scratch
dataset = dataset.remove_columns("label")

# Turn vectors into a dictionary keyed by the vector name
dataset = dataset.map(
    lambda r: {"vectors": {"mini-lm-sentence-transformers": r["vectors"]}}
)
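Before uploading, you can optionally sanity-check the embeddings. This is a small added check, not part of the original steps: all-MiniLM-L6-v2 produces 384-dimensional vectors, so each record's vector should have length 384.
[ ]:
# Inspect one record's vector: all-MiniLM-L6-v2 outputs 384 dimensions
example_vector = dataset[0]["vectors"]["mini-lm-sentence-transformers"]
print(len(example_vector))  # 384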
Our dataset now contains a vectors field with the embedding vector generated by the sentence transformer model.
[10]:
dataset.to_pandas().head()
[10]:
|   | text | vectors |
|---|------|---------|
| 0 | How do I locate my card? | {'mini-lm-sentence-transformers': [-0.01016708... |
| 1 | I still have not received my new card, I order... | {'mini-lm-sentence-transformers': [-0.04284123... |
| 2 | I ordered a card but it has not arrived. Help ... | {'mini-lm-sentence-transformers': [-0.03365558... |
| 3 | Is there a way to know when my card will arrive? | {'mini-lm-sentence-transformers': [0.012195908... |
| 4 | My card has not arrived yet. | {'mini-lm-sentence-transformers': [-0.04361863... |
🚀 Upload dataset into Argilla#
The original banking77 dataset is an intent classification dataset with dozens of labels (lost_card, card_arrival, etc.). To keep this tutorial simple, we define a simplified labeling scheme with higher-level classes: ["change_details", "card", "atm", "top_up", "balance", "transfer", "exchange_rate", "pin"].
Let's define the dataset settings, configure the dataset, and upload our dataset with vectors.
[ ]:
rg_ds = rg.DatasetForTextClassification.from_datasets(dataset)

# Our labeling scheme
settings = rg.TextClassificationSettings(
    label_schema=["change_details", "card", "atm", "top_up", "balance", "transfer", "exchange_rate", "pin"]
)
rg.configure_dataset_settings(name="banking77-topics", settings=settings)

rg.log(
    name="banking77-topics",
    records=rg_ds,
    chunk_size=50,
)
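Once logging finishes, you can optionally load a record back to confirm the vectors were stored. A minimal hedged check, assuming the Argilla 1.x rg.load API:
[ ]:
# Load one record back and confirm the vector key is attached
check = rg.load(name="banking77-topics", limit=1)
print(check[0].vectors.keys())  # expect: mini-lm-sentence-transformers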
🏷 Bulk labeling with the find similar action#
Now that our banking77-topics dataset is available in the Argilla UI, we can start annotating our data by leveraging semantic similarity search. The workflow is as follows:
1. Label a record (e.g., "Change my information" with the label change_details) and then click on Find similar on the top-right of the record.
2. As a result, you'll get a list of the most similar records, sorted by similarity in descending order.
3. You can now review the records and assign either the change_details label or any other. For our use case, we see that most of the suggested records fall into the same category.
Let's see it step-by-step:
Label a record#
Using the hand-labeling mode, you can label a record like the one below:
Now, if you want to find semantically similar records or even duplicates of this record, you can use the Find similar button.
Find similar#
As a result, you'll get a list of the 50 most similar records.
Note
Remember that you can combine this similarity search with the other search features: keywords, the query string DSL, and filters. For example, if you have filters enabled, the Find similar action will return the most similar records within the filtered subset.
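The same similarity search is also available from the Python client. Here is a minimal sketch, assuming the Argilla 1.x rg.load API and its vector argument (the query text is just an example):
[ ]:
# Encode a query text and retrieve the most similar records programmatically
query_vector = encoder.encode("Change my information").tolist()

similar = rg.load(
    name="banking77-topics",
    vector=("mini-lm-sentence-transformers", query_vector),
    limit=50,
)
# You can combine it with keyword search, e.g. by also passing query="details"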
As you can see, the model is effectively capturing similar meanings without any need for explicit shared words: e.g., details vs. information.
Review records#
At this point, you can label the records one by one, or scroll down to review them before using the bulk-labeling button at the top of the records list.
Bulk label#
For this tutorial, our labels are sufficiently fine-grained for the embeddings to group records that fall under the same topic. In this case, it is safe to use the bulk labeling feature directly, effectively labeling 50 semantically similar examples after a quick review.
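If you prefer to do the equivalent from the Python client, a hedged sketch could look like the following (assuming the Argilla 1.x API, reusing the query_vector from the sketch above, and assuming you have reviewed the retrieved records first):
[ ]:
# Retrieve the 50 most similar records and assign them the same label
records = rg.load(
    name="banking77-topics",
    vector=("mini-lm-sentence-transformers", query_vector),
    limit=50,
)
for record in records:
    record.annotation = "change_details"

# Log the annotated records back to the dataset
rg.log(records=records, name="banking77-topics")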
Warning
For other use cases, you might need to be more careful and combine this feature with search queries and filters. For quick experimentation, you can also assume you'll make some labeling errors and then use tools like cleanlab for detecting label errors.
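As an illustration of that last point, here is a minimal hedged sketch of cleanlab's find_label_issues, which flags likely label errors given your annotations and a trained model's out-of-sample predicted probabilities. The arrays below are placeholders, not real results:
[ ]:
import numpy as np
from cleanlab.filter import find_label_issues

# Placeholder annotations (integer class ids) and model probabilities
labels = np.array([0, 2, 1, 1])
pred_probs = np.array([
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.05, 0.05, 0.90],
])

# Indices of records whose labels look suspicious, worth re-reviewing
issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(issues)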
Summary#
In this tutorial, you learned how to use similarity search for data labeling in Argilla by embedding your raw data with Sentence Transformers.