🔦 Semantic Search#
This guide gives an overview of the semantic search features. Since 1.19.0, Argilla supports adding vectors to Feedback Datasets (other datasets have included this feature since 1.2.0), which can then be used to find the records most similar to a given one. This feature combines vector or semantic search with more traditional search (keyword and filter-based).
Vector search leverages machine learning to capture rich semantic features by embedding items (text, video, images, etc.) into a vector space, which can then be used to find “semantically” similar items.
In this guide, you’ll find how to:
Set up your Elasticsearch or Opensearch endpoint with vector search support.
Encode text into vectors for Argilla records.
Use semantic search.
The next section gives a general overview of how semantic search works in Argilla.
How it works#
Semantic search in Argilla works as follows:
1. One or several vectors can be included in the vectors field of Argilla Records. The vectors field accepts a dictionary where keys represent the names and values contain the actual vectors, because certain use cases might require several vectors. Note that for a FeedbackDataset you will also need to configure VectorSettings in your dataset.
2. The vectors are stored at indexing time, once the records are logged with add_records or update_records in a FeedbackDataset, or with rg.log in older datasets.
3. If you have stored vectors in your dataset, you can use the semantic search feature in Argilla UI and the Python SDK.
In future versions, embedding services might be developed to facilitate steps 1 and 2 and associate vectors to records automatically.
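The dictionary shape described above (vector names as keys, lists of numbers as values) can be sketched in plain Python. The two encoders below are hypothetical toy functions invented for this illustration; in practice they would be replaced by whatever encoding mechanism you choose.

```python
from typing import Callable, Dict, List

Vector = List[float]


def char_stats_encoder(text: str) -> Vector:
    """Toy 3-dimensional vector: length, vowel count, whitespace count."""
    vowels = sum(ch in "aeiou" for ch in text.lower())
    spaces = sum(ch.isspace() for ch in text)
    return [float(len(text)), float(vowels), float(spaces)]


def word_stats_encoder(text: str) -> Vector:
    """Toy 2-dimensional vector: word count and mean word length."""
    words = text.split()
    mean_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return [float(len(words)), mean_len]


def build_vectors(text: str, encoders: Dict[str, Callable[[str], Vector]]) -> Dict[str, Vector]:
    """Return a {vector_name: vector} dictionary, one entry per encoder."""
    return {name: encode(text) for name, encode in encoders.items()}


# Hypothetical vector names; each name maps to its own vector.
encoders = {"char_stats": char_stats_encoder, "word_stats": word_stats_encoder}
vectors = build_vectors("I lost my card", encoders)
print(vectors)
```

Each vector name may have a different dimensionality, which is why a record can carry several vectors side by side.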
Note
It’s completely up to the user which encoding or embedding mechanism to use for producing these vectors. In the “Encode text fields” section of this document you will find several examples and details about this process, using open-source libraries (e.g., Hugging Face) as well as paid services (e.g., Cohere or OpenAI).
Currently, Argilla uses vector search only for searching similar records (nearest neighbors) of a given vector. This can be leveraged from Argilla UI as well as the Python Client. In the future, vector search could be leveraged as well for free text queries using Argilla UI.
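To make the idea of a nearest-neighbor search concrete, here is a minimal brute-force sketch in plain Python. It is only a conceptual illustration: the search engines behind Argilla use their own indexing and scoring, and the choice of cosine similarity, the record ids, and the function names are assumptions made for this example.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_neighbors(
    query: List[float], stored: Dict[str, List[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(record_id, cosine_similarity(query, vec)) for record_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical stored vectors keyed by record id.
stored = {
    "rec-1": [1.0, 0.0],
    "rec-2": [0.8, 0.6],
    "rec-3": [0.0, 1.0],
}
neighbors = nearest_neighbors([1.0, 0.1], stored, k=2)
for record_id, score in neighbors:
    print(record_id, round(score, 3))
```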
Setup vector search support#
In order to use this feature you should use Elasticsearch at least version 8.5.x or Opensearch 2.4.0. We provide pre-configured docker-compose files in the root of Argilla’s Github repository.
Warning
If you had Argilla running with Elasticsearch 7.1.0 you need to migrate to at least version 8.5.x. Please check the section “Migrating from Elasticsearch 7.1.0 to 8.5”.
Elasticsearch backend#
If you don’t have another instance of Elasticsearch or Opensearch running, or don’t want to keep previous Argilla datasets, you can launch a clean instance of Elasticsearch by downloading the docker-compose.elasticsearch.yaml and running:
docker-compose -f docker-compose.elasticsearch.yaml up
Migrate from 7.1.0 to 8.5#
Warning
If you had Argilla running with Elasticsearch 7.1.0 you need to migrate to at least version 8.5.x. Before following the process described below, please read the official Elasticsearch Migration Guide carefully.
In order to migrate from Elasticsearch 7.1.0 and keep your datasets you can follow this process:
1. Stop your current Elasticsearch service (we assume a migration for a docker-compose setup).
2. Set the Elasticsearch image to 7.17.x in your docker-compose.
3. Start the Elasticsearch service again.
4. Once it is up and running, stop it again and set the Elasticsearch image to 8.5.x.
5. Finally, start the Elasticsearch service again. Data should be migrated properly.
Once the service is up you can launch the Argilla Server with python -m argilla server start.
Opensearch backend#
If you don’t have another instance of Elasticsearch or Opensearch running, or don’t want to keep previous Argilla datasets, you can launch a clean instance of Opensearch by downloading the docker-compose.opensearch.yaml file and running:
docker-compose -f docker-compose.opensearch.yaml up
Once the service is up you can launch the Argilla Server with ARGILLA_SEARCH_ENGINE=opensearch python -m argilla server start.
Warning
For vector search in OpenSearch, filters are applied in a post_filter step, because a bug makes queries that combine filtering and knn fail from Argilla. See https://github.com/opensearch-project/k-NN/issues/1286
This may lead to unexpected results when combining filtering with vector search on this engine.
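One kind of unexpected result that post-filtering can produce is getting fewer matches than requested. The sketch below is a generic, conceptual illustration in plain Python, not the query Argilla or OpenSearch actually runs; the records, labels, and function names are hypothetical.

```python
import math
from typing import Callable, Dict, List

Record = Dict[str, object]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def knn(query: List[float], records: List[Record], k: int) -> List[Record]:
    """Brute-force k nearest neighbors by cosine similarity."""
    return sorted(records, key=lambda r: cosine(query, r["vector"]), reverse=True)[:k]


def knn_post_filter(query: List[float], records: List[Record], k: int,
                    keep: Callable[[Record], bool]) -> List[Record]:
    """Take the k nearest records first, then drop those failing the filter."""
    return [r for r in knn(query, records, k) if keep(r)]


def knn_pre_filter(query: List[float], records: List[Record], k: int,
                   keep: Callable[[Record], bool]) -> List[Record]:
    """Apply the filter first, then take the k nearest among the survivors."""
    return knn(query, [r for r in records if keep(r)], k)


def is_card(record: Record) -> bool:
    return record["label"] == "card"


records = [
    {"id": "a", "label": "card", "vector": [1.0, 0.0]},
    {"id": "b", "label": "loan", "vector": [0.9, 0.1]},
    {"id": "c", "label": "loan", "vector": [0.8, 0.2]},
    {"id": "d", "label": "card", "vector": [0.0, 1.0]},
]
post_ids = [r["id"] for r in knn_post_filter([1.0, 0.0], records, k=2, keep=is_card)]
pre_ids = [r["id"] for r in knn_pre_filter([1.0, 0.0], records, k=2, keep=is_card)]
print("post-filter:", post_ids)
print("pre-filter:", pre_ids)
```

In this toy data the post-filtered search returns a single record even though two were requested, because one of the two nearest neighbors does not pass the filter.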
Add vectors to your data#
The first and most important thing to do before leveraging semantic search is to turn text into a numerical representation: a vector. In practical terms, you can think of a vector as an array or list of numbers. You can associate this list of numbers with an Argilla Record by using the aforementioned vectors field. But the question is: how do you create these vectors?
Over the years, many approaches have been used to turn text into numerical representations. The goal is to “encode” meaning, context, topics, etc., so that the result can be used to find “semantically” similar text. Some of these approaches are LSA (Latent Semantic Analysis), tf-idf, LDA (Latent Dirichlet Allocation), or doc2Vec. More recent methods fall in the category of “neural” methods, which leverage the power of large neural networks to embed text into dense vectors (a large array of real numbers). These methods have demonstrated a great ability to capture semantic features and are powering a new wave of technologies that fall under categories like neural search, semantic search, or vector search. Most of them involve using a large language model to encode the full context of a textual snippet, such as a sentence, a paragraph, and more lately larger documents.
Note
In the context of Argilla, we intentionally use the term vector rather than embedding to emphasize that users can leverage methods other than neural ones, which might be cheaper to compute or more useful for their use cases.
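As an illustration of a non-neural vector, here is a minimal tf-idf sketch in plain Python. The function names (tokenize, tfidf_vectors), the particular idf formula, and the vector name "tfidf" are assumptions chosen for this example and are not part of Argilla.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def tfidf_vectors(texts: List[str]) -> Tuple[List[str], List[List[float]]]:
    """Return the sorted vocabulary and one tf-idf vector per text."""
    docs = [tokenize(t) for t in texts]
    vocabulary = sorted({tok for doc in docs for tok in doc})
    # Number of documents in which each token appears.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(len(docs) / doc_freq[tok]) + 1.0 for tok in vocabulary}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid dividing by zero for empty texts
        vectors.append([counts[tok] / total * idf[tok] for tok in vocabulary])
    return vocabulary, vectors


texts = ["card lost", "card arrived", "loan rate"]
vocabulary, vectors = tfidf_vectors(texts)

# Wrap each vector in the {name: vector} shape used by the vectors field.
record_vectors: List[Dict[str, List[float]]] = [{"tfidf": v} for v in vectors]
print(vocabulary)
print(record_vectors[0])
```

A real corpus would yield a much larger vocabulary than this toy example, so the resulting vectors would likely need to be reduced in size before being stored.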
In the next sections, we show how to encode text using different models and services and how to add them to Argilla records.
Warning
If you run into issues when logging records with large vectors using rg.log, we recommend using a smaller chunk_size, as shown in the following examples.
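The idea behind a smaller chunk size is simply that fewer records travel in each request. The sketch below illustrates that idea in plain Python; the chunked helper is hypothetical and does not describe how rg.log is implemented internally.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], chunk_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(items), chunk_size):
        yield list(items[start:start + chunk_size])


# Stand-in for a list of records; integers keep the example self-contained.
records = list(range(120))
sizes = [len(chunk) for chunk in chunked(records, chunk_size=50)]
print(sizes)
```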
Sentence Transformers#
SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. There are dozens of pre-trained models available on the Hugging Face Hub.
Given its versatile and open-source nature, we have added a native integration with SentenceTransformers. It allows you to easily add embeddings to your records or datasets using the SentenceTransformersExtractor, which is based on the sentence-transformers library. This integration can be found here.
OpenAI Embeddings#
OpenAI provides an API endpoint called Embeddings to get a vector representation of a given input that can be easily consumed by machine learning models and algorithms.
Warning
Due to the vector dimension limitation of Elasticsearch and Opensearch Lucene-based engines, currently, you can only use the text-similarity-ada-001 model, which produces vectors of 1024 dimensions.
The code below will load a dataset from the Hub, encode the text field using the Embeddings endpoint, and create the vectors field, which will contain only one key (text-similarity-ada-001).
To run the code below you need to install openai and datasets with pip: pip install openai datasets.
You also need to set up your OpenAI API key as shown below.
import openai
from datasets import load_dataset
# Note: openai.Embedding.create is the interface of openai versions before 1.0
openai.api_key = "<your api key goes here>"
# Load dataset
dataset = load_dataset("banking77", split="test")
def get_embedding(texts, model="text-similarity-ada-001"):
    response = openai.Embedding.create(input=texts, model=model)
    vectors = [item["embedding"] for item in response["data"]]
    return vectors
# Encode text. Get only 500 vectors for testing, remove the select to do the full dataset
dataset = dataset.select(range(500)).map(
    lambda batch: {"vectors": get_embedding(batch["text"])},
    batch_size=16,
    batched=True,
)
# Turn vectors into a dictionary
dataset = dataset.map(
    lambda r: {"vectors": {"text-similarity-ada-001": r["vectors"]}}
)
Cohere Co.Embed#
Cohere Co.Embed is an API endpoint by Cohere that takes a piece of text and turns it into a vector embedding.
Warning
Due to the vector dimension limitation of Elasticsearch and Opensearch Lucene-based engines, currently, you can only use the small model, which produces vectors of 1024 dimensions.
The code below will load a dataset from the Hub, encode the text field using the Co.Embed endpoint, and create the vectors field, which will contain only one key (cohere-embed).
To run the code below you need to install cohere and datasets with pip: pip install cohere datasets.
You also need to set up your Cohere API key as shown below.
import cohere
from datasets import load_dataset
api_key = "<your api key goes here>"
co = cohere.Client(api_key)
# Load dataset
dataset = load_dataset("banking77", split="test")
def get_embedding(texts):
    return co.embed(texts, model="small").embeddings
# Encode text. Get only 1000 vectors for testing, remove the select to do the full dataset
dataset = dataset.select(range(1000)).map(
    lambda batch: {"vectors": get_embedding(batch["text"])},
    batch_size=16,
    batched=True,
)
# Turn vectors into a dictionary
dataset = dataset.map(
    lambda r: {"vectors": {"cohere-embed": r["vectors"]}}
)
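Whichever service produced the vectors, it can be worth checking them before adding them to a dataset. The sketch below verifies that every vector with the same name has the same length and stays within a dimension limit. The check_vectors helper is hypothetical, and the default limit of 1024 is an assumption taken from the warnings above; adjust it to match your search engine.

```python
from typing import Dict, Iterable, List

MAX_DIMS = 1024  # assumption: limit mentioned in the warnings above


def check_vectors(
    records: Iterable[Dict[str, List[float]]], max_dims: int = MAX_DIMS
) -> Dict[str, int]:
    """Return {vector_name: dimensions}; raise ValueError on bad vectors."""
    dimensions: Dict[str, int] = {}
    for index, vectors in enumerate(records):
        for name, vector in vectors.items():
            size = len(vector)
            if size > max_dims:
                raise ValueError(
                    f"record {index}: vector '{name}' has {size} dimensions (limit {max_dims})"
                )
            # The first vector seen for a name fixes the expected size.
            expected = dimensions.setdefault(name, size)
            if size != expected:
                raise ValueError(
                    f"record {index}: vector '{name}' has {size} dimensions, expected {expected}"
                )
    return dimensions


# Stand-in data shaped like the "vectors" column built above.
sample = [{"cohere-embed": [0.1] * 1024}, {"cohere-embed": [0.2] * 1024}]
dims = check_vectors(sample)
print(dims)
```

With the dataset built in the examples above, the same check could be run as check_vectors(rec["vectors"] for rec in dataset).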
Configure your dataset#
Our dataset now contains a vectors field with the embedding vector generated by our preferred model. This dataset can be transformed into an Argilla Dataset in the following ways:
Let’s first configure a Feedback Dataset that includes vector settings:
import argilla as rg
local_ds = rg.FeedbackDataset(
    fields=[
        rg.TextField(name="text")
    ],
    questions=[
        rg.MultiLabelQuestion(
            name="topic",
            title="Select the topics mentioned in the text:",
            labels=dataset.info.features["label"].names,  # these are the labels in the original dataset
        )
    ],
    vectors_settings=[
        rg.VectorSettings(name=key, dimensions=len(value))
        for key, value in dataset[0]["vectors"].items()
    ]
)
remote_ds = local_ds.push_to_argilla("banking77", workspace="admin")
Now we can create records and add them to the dataset:
records = [
    rg.FeedbackRecord(
        fields={"text": rec["text"]},
        vectors=rec["vectors"]
    )
    for rec in dataset
]
remote_ds.add_records(records)
For other datasets, you can use the DatasetForTextClassification.from_datasets method. The resulting dataset can then be logged into Argilla as follows:
import argilla as rg
rg_ds = rg.DatasetForTextClassification.from_datasets(dataset, annotation="label")
rg.log(
    name="banking77",
    records=rg_ds,
    chunk_size=50,
)
Use semantic search#
This section introduces how to use the semantic search feature from Argilla UI and Argilla Python client.
Argilla UI#
In Feedback datasets, you can also retrieve records based on their similarity with another record. To do that, make sure you have added vectors_settings to your dataset configuration and that your records include vectors.
In the UI, go to the record you’d like to use for the semantic search and click on Find similar at the top right corner of the record card. If there is more than one vector, you will be asked to select which vector to use. You can also select whether you want the most or least similar records and the number of results you would like to see.
At any time, you can expand or collapse the record that was used for the search as a reference. If you want to undo the search, just click on the cross next to the reference record.
Within the Argilla UI, it is possible to select a record that has an attached vector to start semantic searching by clicking the “Find similar” button. After labeling, the “Remove similar record filter” button can be pressed to close the specific search and continue with your labeling session.
Argilla Python client#
To find records similar to a given vector, first we need to produce that vector of reference. Let’s see how we can do that with the different frameworks that we used before:
Warning
In order to get good results, make sure you are using the same encoder model for generating the vector used for the query. For example, if your dataset has been encoded with the bge-small-en model from sentence transformers, make sure to use the same model for encoding the text to be used for querying. Another option is to use an existing record in your dataset, which already contains a vector.
# Sentence Transformers
from sentence_transformers import SentenceTransformer
encoder = SentenceTransformer("BAAI/bge-small-en", device="cpu")
vector = encoder.encode("I lost my credit card. What should I do?").tolist()
# OpenAI
vector = openai.Embedding.create(
    input=["I lost my credit card. What should I do?"],
    model="text-similarity-ada-001"
)["data"][0]["embedding"]
# Cohere
vector = co.embed(["I lost my credit card. What should I do?"], model="small").embeddings[0]
Now that we have our reference vector, we can do a semantic search in the Python SDK:
In the Python SDK, you can also get a list of feedback records that are semantically close to a given embedding with the find_similar_records method. These are the arguments of this function:
vector_name: The name of the vector to use in the search.
value: A vector to use for the similarity search in the form of a List[float]. It is necessary to include a value or a record.
record: A FeedbackRecord to use as part of the search. It is necessary to include a value or a record.
max_results (optional): The maximum number of results for this search. The default is 50.
This returns a list of tuples with the records and their similarity score (between 0 and 1).
ds = rg.FeedbackDataset.from_argilla("my_dataset", workspace="my_workspace")
# using text embeddings
similar_records = ds.find_similar_records(
    vector_name="my_vector",
    value=embedder_model.embeddings("My text is here")
    # value=embedder_model.embeddings("My text is here").tolist() # for numpy arrays
)
# using another record
similar_records = ds.find_similar_records(
    vector_name="my_vector",
    record=ds.records[0],
    max_results=5
)
# work with the resulting tuples
for record, score in similar_records:
    ...
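Because the result is a list of (record, score) tuples, it can be post-processed with ordinary Python. The sketch below keeps only results above a score threshold and sorts them from most to least similar. The filter_by_score helper, the threshold value, and the stand-in results are hypothetical and only illustrate the tuple shape described above.

```python
from typing import Any, List, Tuple


def filter_by_score(
    results: List[Tuple[Any, float]], min_score: float
) -> List[Tuple[Any, float]]:
    """Keep (record, score) pairs with score >= min_score, best first."""
    kept = [(record, score) for record, score in results if score >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Stand-in for the output of a similarity search: strings replace real records.
similar_records_stub = [("record-a", 0.93), ("record-b", 0.41), ("record-c", 0.78)]
close_matches = filter_by_score(similar_records_stub, min_score=0.75)
for record, score in close_matches:
    print(record, score)
```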
You can also combine filters and semantic search like this:
similar_records = (
    dataset
    .filter_by(metadata=[rg.TermsMetadataFilter(values=["Positive"])])
    .find_similar_records(vector_name="vector", value=model.encode("Another text").tolist())
)
For older datasets, the rg.load method includes a vector parameter, which can be used to retrieve records similar to a given vector, and a limit parameter to indicate the number of records to be retrieved. The vector parameter accepts a tuple with the key of the target vector (this should match one of the keys of the vectors dictionary) and the query vector itself.
In addition, the vector parameter can be combined with the query parameter to combine vector search with traditional search.
ds = rg.load(
    name="banking77-openai",
    vector=("my-vector-name", vector),
    limit=20,
    query="annotated_as:card_arrival"
)