🔦 Semantic Search#

This guide gives an overview of the semantic search features. Since 1.19.0 Argilla supports adding vectors to Feedback Datasets (other datasets include this feature since 1.2.0) which can then be used for finding the most similar records to a given one. This feature uses vector or semantic search combined with more traditional search (keyword and filter-based).

Vector search leverages machine learning to capture rich semantic features by embedding items (text, video, images, etc.) into a vector space, which can then be used to find “semantically” similar items.
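To make the idea of "similar items in a vector space" concrete, here is a minimal sketch of a nearest-neighbor lookup using cosine similarity. The three-dimensional vectors and item names are invented for illustration, and the brute-force ranking is an assumption chosen for readability; it is not how Elasticsearch or Opensearch perform the search internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, items, k=2):
    """Return the names of the k items whose vectors are closest to the query."""
    ranked = sorted(
        items,
        key=lambda name: cosine_similarity(query, items[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy "embeddings" (hypothetical values); real models produce hundreds of dimensions.
items = {
    "lost card": [0.9, 0.1, 0.0],
    "stolen card": [0.8, 0.2, 0.1],
    "exchange rate": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]

print(most_similar(query, items))
```

The two card-related items rank first because their vectors point in nearly the same direction as the query, regardless of vector length.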

In this guide, you’ll find how to:

Set up your Elasticsearch or Opensearch endpoint with vector search support.
Encode text into vectors for Argilla records.
Use semantic search.

The next section gives a general overview of how semantic search works in Argilla.

How it works#

Semantic search in Argilla works as follows:

1. One or several vectors can be included in the vectors field of Argilla Records. The vectors field accepts a dictionary where keys represent the names and values contain the actual vectors. This is the case because certain use cases might require using several vectors. Note that for a FeedbackDataset you will also need to configure VectorSettings in your dataset.
2. The vectors are stored at indexing time, once the records are logged with add_records or update_records in a FeedbackDataset, or with rg.log in older datasets.
3. If you have stored vectors in your dataset, you can use the semantic search feature in Argilla UI and the Python SDK.
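The shape of the vectors field described in the steps above can be sketched with plain Python dictionaries. The vector names, the `expected_dimensions` mapping and the `validate_vectors` helper are hypothetical and not part of the Argilla API; they only illustrate that each named vector should match the dimensions configured for it.

```python
def validate_vectors(vectors, expected_dimensions):
    """Check each named vector against its configured number of dimensions.

    `vectors` maps vector names to lists of floats; `expected_dimensions`
    maps vector names to the dimension configured for the dataset.
    Returns a list of human-readable problems (empty if everything matches).
    """
    problems = []
    for name, values in vectors.items():
        if name not in expected_dimensions:
            problems.append(f"unknown vector name: {name}")
        elif len(values) != expected_dimensions[name]:
            problems.append(
                f"{name}: expected {expected_dimensions[name]} dimensions, "
                f"got {len(values)}"
            )
    return problems


# Hypothetical configuration: two named vectors with different sizes.
expected_dimensions = {"sentence": 4, "keywords": 2}

# One record carrying several vectors, keyed by name.
record_vectors = {
    "sentence": [0.12, -0.40, 0.33, 0.08],
    "keywords": [1.0, 0.0],
}

print(validate_vectors(record_vectors, expected_dimensions))
```

Running a check like this before logging records can surface a mismatch between the encoder output and the dataset configuration early.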

In future versions, embedding services might be developed to facilitate steps 1 and 2 and associate vectors to records automatically.

Note

It’s completely up to the user which encoding or embedding mechanism to use for producing these vectors. In the “Add vectors to your data” section of this document you will find several examples and details about this process, using open-source libraries (e.g., Hugging Face) as well as paid services (e.g., Cohere or OpenAI).

Currently, Argilla uses vector search only for searching similar records (nearest neighbors) of a given vector. This can be leveraged from Argilla UI as well as the Python Client. In the future, vector search could be leveraged as well for free text queries using Argilla UI.

Setup vector search support#

In order to use this feature you should use Elasticsearch version 8.5.x or later, or Opensearch 2.4.0 or later. We provide pre-configured docker-compose files in the root of Argilla’s GitHub repository.
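As a rough sketch, the version requirement above can be expressed as a small check. The `MINIMUM_VERSIONS` table and both helper functions are hypothetical names introduced here; they only encode the minimum versions stated in this guide and do not query a running cluster.

```python
# Minimum versions taken from this guide (assumed to mean "this version or later").
MINIMUM_VERSIONS = {
    "elasticsearch": (8, 5, 0),
    "opensearch": (2, 4, 0),
}


def parse_version(version):
    """Turn a string such as '8.5.3' or '8.5' into a 3-tuple of integers."""
    parts = [int(part) for part in version.split(".")]
    parts += [0] * (3 - len(parts))  # pad '8.5' to (8, 5, 0)
    return tuple(parts[:3])


def supports_vector_search(engine, version):
    """Return True if the engine version meets the minimum stated in this guide."""
    return parse_version(version) >= MINIMUM_VERSIONS[engine]


print(supports_vector_search("elasticsearch", "7.17.9"))
print(supports_vector_search("opensearch", "2.4.0"))
```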

Warning

If you had Argilla running with Elasticsearch 7.1.0 you need to migrate to at least version 8.5.x. Please check the section “Migrate from 7.1.0 to 8.5”.

Elasticsearch backend#

If you don’t have another instance of Elasticsearch or Opensearch running, or don’t want to keep previous Argilla datasets, you can launch a clean instance of Elasticsearch by downloading the docker-compose.elasticsearch.yaml and running:

docker-compose -f docker-compose.elasticsearch.yaml up

Migrate from 7.1.0 to 8.5#

Warning

If you had Argilla running with Elasticsearch 7.1.0 you need to migrate to at least version 8.5.x. Before following the process described below, please read the official Elasticsearch Migration Guide carefully.

In order to migrate from Elasticsearch 7.1.0 and keep your datasets you can follow this process:

Stop your current Elasticsearch service (we assume a migration for a docker-compose setup).
Set the Elasticsearch image to 7.17.x in your docker-compose.
Start the Elasticsearch service again.
Once it is up and running, stop it again and set the Elasticsearch image to 8.5.x.
Finally, start the Elasticsearch service again. Data should be migrated properly.

Once the service is up you can launch the Argilla Server with python -m argilla server start.

Opensearch backend#

If you don’t have another instance of Elasticsearch or Opensearch running, or don’t want to keep previous Argilla datasets, you can launch a clean instance of Opensearch by downloading the docker-compose.opensearch.yaml file and running:

docker-compose -f docker-compose.opensearch.yaml up

Once the service is up you can launch the Argilla Server with ARGILLA_SEARCH_ENGINE=opensearch python -m argilla server start.

Warning

For vector search in OpenSearch, filtering is applied in a post_filter step, since there is a bug that makes queries combining filtering and knn fail when sent from Argilla. See https://github.com/opensearch-project/k-NN/issues/1286

This may result in unexpected results when combining filtering with vector search with this engine.
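The following toy sketch illustrates why post-filtering can be surprising: the nearest neighbors are selected first and the filter is applied afterwards, so fewer records than requested may come back. All function names, records and labels are invented for illustration, and the brute-force search is an assumption that stands in for the engine's actual knn implementation.

```python
def nearest(query, records, k):
    """Return the k records closest to the query (squared Euclidean distance)."""
    def distance(record):
        return sum((q - v) ** 2 for q, v in zip(query, record["vector"]))
    return sorted(records, key=distance)[:k]


def search_with_post_filter(query, records, k, predicate):
    """Take the k nearest neighbors first, then drop those failing the filter."""
    return [r for r in nearest(query, records, k) if predicate(r)]


def search_with_pre_filter(query, records, k, predicate):
    """Filter first, then take the k nearest neighbors among the survivors."""
    return nearest(query, [r for r in records if predicate(r)], k)


def is_card(record):
    return record["label"] == "card"


records = [
    {"id": 1, "label": "card", "vector": [0.0, 0.1]},
    {"id": 2, "label": "loan", "vector": [0.15, 0.0]},
    {"id": 3, "label": "loan", "vector": [0.2, 0.2]},
    {"id": 4, "label": "card", "vector": [0.9, 0.9]},
]
query = [0.0, 0.0]

post = search_with_post_filter(query, records, k=2, predicate=is_card)
pre = search_with_pre_filter(query, records, k=2, predicate=is_card)
print([r["id"] for r in post])  # only one result although k=2
print([r["id"] for r in pre])   # two results
```

With post-filtering, record 2 takes one of the two neighbor slots and is then discarded by the filter, so a single record is returned even though two matching records exist.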

Add vectors to your data#

The first and most important thing to do before leveraging semantic search is to turn text into a numerical representation: a vector. In practical terms, you can think of a vector as an array or list of numbers. You can associate this list of numbers with an Argilla Record by using the aforementioned vectors field. But the question is: how do you create these vectors?

Over the years, many approaches have been used to turn text into numerical representations. The goal is to “encode” meaning, context, topics, etc., so that the result can be used to find “semantically” similar text. Some of these approaches are LSA (Latent Semantic Analysis), tf-idf, LDA (Latent Dirichlet Allocation), or doc2Vec. More recent methods fall in the category of “neural” methods, which leverage the power of large neural networks to embed text into dense vectors (large arrays of real numbers). These methods have demonstrated a great ability to capture semantic features and are powering a new wave of technologies that fall under categories like neural search, semantic search, or vector search. Most of them involve using a large language model to encode the full context of a textual snippet, such as a sentence, a paragraph, and more recently larger documents.

Note

In the context of Argilla, we intentionally use the term vector in favor of embedding to emphasize that users can leverage methods other than neural, which might be cheaper to compute or be more useful for their use cases.
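As an illustration of a non-neural option, here is a minimal tf-idf sketch using only the standard library. The tokenization (lowercase, whitespace split), the unsmoothed weighting formula, the example texts and the "tfidf" vector name are simplifying assumptions made for this example; a production setup would more likely rely on an established library.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    """Build one tf-idf vector per non-empty text over a shared, sorted vocabulary."""
    tokenized = [text.lower().split() for text in texts]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)
    # Document frequency: in how many texts does each token appear?
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([
            (counts[token] / len(tokens)) * math.log(n_docs / doc_freq[token])
            for token in vocabulary
        ])
    return vocabulary, vectors


texts = ["my card was stolen", "my card arrived", "what is the exchange rate"]
vocabulary, vectors = tfidf_vectors(texts)

# Each vector is a plain list of floats, keyed by a hypothetical vector name.
record_vectors = [{"tfidf": vector} for vector in vectors]
print(len(vocabulary), len(record_vectors[0]["tfidf"]))
```

Every vector has one entry per vocabulary term, so all records share the same number of dimensions, which is what a vector field requires.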

In the next sections, we show how to encode text using different models and services and how to add them to Argilla records.

Warning

If you run into issues when logging records with large vectors using rg.log, we recommend using a smaller chunk_size, as shown in the following examples.
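Conceptually, a smaller chunk size means that fewer records (and therefore fewer vectors) travel in each request. The helper below is a hypothetical illustration of that batching idea with toy records; it is not Argilla's implementation of chunk_size.

```python
def chunked(records, chunk_size):
    """Yield consecutive slices of `records` with at most `chunk_size` items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(records), chunk_size):
        yield records[start:start + chunk_size]


# Seven toy records, each with a small made-up vector.
records = [
    {"text": f"example {i}", "vectors": {"toy": [float(i)] * 4}}
    for i in range(7)
]

batch_sizes = [len(batch) for batch in chunked(records, chunk_size=3)]
print(batch_sizes)  # [3, 3, 1]
```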

Sentence Transformers#

SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. There are dozens of pre-trained models available on the Hugging Face Hub.

Because SentenceTransformers is so widely used, versatile, and open source, we have added a native integration with it. This integration allows you to easily add embeddings to your records or datasets using the SentenceTransformersExtractor, which is based on the sentence-transformers library. See the documentation of the SentenceTransformersExtractor integration for details.

OpenAI `Embeddings`#

OpenAI provides an API endpoint called Embeddings to get a vector representation of a given input that can be easily consumed by machine learning models and algorithms.

Warning

Due to the vector dimension limitation of Elasticsearch and Opensearch Lucene-based engines, currently, you can only use the text-similarity-ada-001 model which produces vectors of 1024 dimensions.

The code below will load a dataset from the Hub, encode the text field using the Embeddings endpoint, and create the vectors field, which will contain only one key (text-similarity-ada-001).

To run the code below you need to install openai and datasets with pip: pip install openai datasets.

You also need to set up your OpenAI API key as shown below.

import openai
from datasets import load_dataset

openai.api_key = "<your api key goes here>"

# Load dataset
dataset = load_dataset("banking77", split="test")

def get_embedding(texts, model="text-similarity-ada-001"):
    response = openai.Embedding.create(input = texts, model=model)
    vectors = [item["embedding"] for item in response["data"]]
    return vectors

# Encode text. Get only 500 vectors for testing, remove the select to do the full dataset
dataset = dataset.select(range(500)).map(lambda batch: {"vectors": get_embedding(batch["text"])}, batch_size=16, batched=True)

# Turn vectors into a dictionary
dataset = dataset.map(
    lambda r: {"vectors": {"text-similarity-ada-001": r["vectors"]}}
)

Cohere `Co.Embed`#

Cohere Co.Embed is an API endpoint by Cohere that takes a piece of text and turns it into a vector embedding.

Warning

Due to the vector dimension limitation of Elasticsearch and Opensearch Lucene-based engines, currently, you can only use the small model which produces vectors of 1024 dimensions.

The code below will load a dataset from the Hub, encode the text field using the Co.Embed endpoint, and create the vectors field, which will contain only one key (cohere-embed).

To run the code below you need to install cohere and datasets with pip: pip install cohere datasets.

You also need to set up your Cohere API key as shown below.

import cohere
from datasets import load_dataset

api_key = "<your api key goes here>"
co = cohere.Client(api_key)

# Load dataset
dataset = load_dataset("banking77", split="test")

def get_embedding(texts):
    return co.embed(texts, model="small").embeddings

# Encode text. Get only 1000 vectors for testing, remove the select to do the full dataset
dataset = dataset.select(range(1000)).map(lambda batch: {"vectors": get_embedding(batch["text"])}, batch_size=16, batched=True)

# Turn vectors into a dictionary
dataset = dataset.map(
    lambda r: {"vectors": {"cohere-embed": r["vectors"]}}
)

Configure your dataset#

Our dataset now contains a vectors field with the embedding vector generated by our preferred model. This dataset can be transformed into an Argilla Dataset in the following ways:

FeedbackDataset

Let’s first configure a Feedback Dataset that includes vector settings:

import argilla as rg

local_ds = rg.FeedbackDataset(
    fields=[
        rg.TextField(name="text")
    ],
    questions=[
        rg.MultiLabelQuestion(
            name="topic",
            title="Select the topics mentioned in the text:",
            labels=dataset.info.features['label'].names, #these are the labels in the original dataset
        )
    ],
    vectors_settings=[
        rg.VectorSettings(name=key, dimensions=len(value))
        for key,value in dataset[0]["vectors"].items()
    ]
)
remote_ds = local_ds.push_to_argilla("banking77", workspace="admin")

Now we can create records and add them to the dataset:

records = [
    rg.FeedbackRecord(
        fields={"text": rec["text"]},
        vectors=rec["vectors"]
    )
    for rec in dataset
]
remote_ds.add_records(records)

Older datasets

You can use the DatasetForTextClassification.from_datasets method. Then, this dataset can be logged into Argilla as follows:

import argilla as rg

rg_ds = rg.DatasetForTextClassification.from_datasets(dataset, annotation="label")

rg.log(
    name="banking77",
    records=rg_ds,
    chunk_size=50,
)

Use semantic search#

This section introduces how to use the semantic search feature from Argilla UI and Argilla Python client.

Argilla UI#

FeedbackDataset

In Feedback datasets, you can also retrieve records based on their similarity with another record. To do that, make sure you have added vectors_settings to your dataset configuration and that your records include vectors.

In the UI, go to the record you’d like to use for the semantic search and click on Find similar at the top right corner of the record card. If there is more than one vector, you will be asked to select which vector to use. You can also select whether you want the most or least similar records and the number of results you would like to see.

At any time, you can expand or collapse the record that was used for the search as a reference. If you want to undo the search, just click on the cross next to the reference record.

Snapshot of semantic search in a Feedback Dataset from Argilla's UI

Older datasets

Within the Argilla UI, it is possible to select a record that has an attached vector to start semantic searching by clicking the “Find similar” button. After labeling, the “Remove similar record filter” button can be pressed to close the specific search and continue with your labeling session.

Screenshot of Argilla UI

Argilla Python client#

To find records similar to a given vector, first we need to produce that vector of reference. Let’s see how we can do that with the different frameworks that we used before:

Warning

In order to get good results, make sure you are using the same encoder model for generating the vector used for the query. For example, if your dataset has been encoded with the bge-small-en model from sentence transformers, make sure to use the same model for encoding the text to be used for querying. Another option is to use an existing record in your dataset, which already contains a vector.

Sentence Transformers

from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("BAAI/bge-small-en", device="cpu")

vector = encoder.encode("I lost my credit card. What should I do?").tolist()

OpenAI Embeddings

vector = openai.Embedding.create(
    input = ["I lost my credit card. What should I do?"],
    model="text-similarity-ada-001"
)["data"][0]["embedding"]

Cohere co.Embed

vector = co.embed(["I lost my credit card. What should I do?"], model="small").embeddings[0]

Now that we have our reference vector, we can do a semantic search in the Python SDK:

Feedback Datasets

In the Python SDK, you can also get a list of feedback records that are semantically close to a given embedding with the find_similar_records method. These are the arguments of this function:

vector_name: The name of the vector to use in the search.
value: A vector to use for the similarity search in the form of a List[float]. It is necessary to include a value or a record.
record: A FeedbackRecord to use as part of the search. It is necessary to include a value or a record.
max_results (optional): The maximum number of results for this search. The default is 50.

This returns a list of tuples, each containing a record and its similarity score (between 0 and 1).

ds = rg.FeedbackDataset.from_argilla("my_dataset", workspace="my_workspace")

# using text embeddings
similar_records =  ds.find_similar_records(
    vector_name="my_vector",
    value=embedder_model.embeddings("My text is here")
    # value=embedder_model.embeddings("My text is here").tolist() # for numpy arrays
)

# using another record
similar_records =  ds.find_similar_records(
    vector_name="my_vector",
    record=ds.records[0],
    max_results=5
)

# work with the resulting tuples
for record, score in similar_records:
    ...
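As a sketch of what working with these tuples might look like, the helper below keeps only the matches above a score threshold and orders them from most to least similar. The `keep_confident` function, the 0.8 threshold and the stand-in results (plain dictionaries instead of feedback records, with made-up scores) are assumptions for illustration.

```python
def keep_confident(similar_records, min_score=0.8):
    """Keep (record, score) pairs at or above min_score, best match first."""
    kept = [
        (record, score)
        for record, score in similar_records
        if score >= min_score
    ]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Stand-in results: plain dicts replace feedback records, scores are invented.
similar_records = [
    ({"text": "My card never arrived"}, 0.74),
    ({"text": "I lost my credit card"}, 0.93),
    ({"text": "Card was stolen yesterday"}, 0.86),
]

for record, score in keep_confident(similar_records):
    print(f"{score:.2f}  {record['text']}")
```

A threshold like this is a judgment call that depends on the encoder and the data, so it is worth inspecting a few results before fixing a value.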

You can also combine filters and semantic search like this:

# `ds` is the FeedbackDataset loaded above; `encoder` is the SentenceTransformer created earlier
similar_records = (ds
    .filter_by(metadata=[rg.TermsMetadataFilter(values=["Positive"])])
    .find_similar_records(vector_name="vector", value=encoder.encode("Another text").tolist())
)

Older datasets

The rg.load method includes a vector parameter, which can be used to retrieve records similar to a given vector, and a limit parameter to indicate the number of records to be retrieved. The vector parameter accepts a tuple with the key of the target vector (this should match one of the keys of the vectors dictionary) and the query vector itself.

In addition, the vector param can be combined with the query param to combine vector search with traditional search.

ds = rg.load(
    name="banking77-openai",
    vector=("my-vector-name", vector),
    limit=20,
    query="annotated_as:card_arrival"
)
