๐ซ Work with vectors#
Feedback Dataset#
Note
The dataset class covered in this section is the FeedbackDataset
. This fully configurable dataset will replace the DatasetForTextClassification
, DatasetForTokenClassification
, and DatasetForText2Text
in Argilla 2.0. Not sure which dataset to use? Check out our section on choosing a dataset.
Define vectors_settings
#
To use the similarity search in the UI and the Python SDK, you will need to configure vector settings. These are defined using the SDK as a list of up to 5 vectors when creating a FeedbackDataset or adding them to an already existing FeedbackDataset. They have the following arguments:
name
: The name of the vector, as it will appear in the records.dimensions
: The dimensions of the vectors used in this setting.title
(optional): A name for the vector to display in the UI for better readability.
vectors_settings = [
rg.VectorSettings(
name="my_vector",
dimensions=768
),
rg.VectorSettings(
name="my_other_vector",
title="Another Vector", # optional
dimensions=768
)
]
Add vectors_settings
#
If you want to add vector settings when creating a dataset, you can pass them as a list of VectorSettings
instances to the vector_settings
argument of the FeedbackDataset
constructor as shown here.
For an end-to-end example, check our tutorial on adding vectors.
vector_settings = rg.VectorSettings(
name="sentence_embeddings",
title="Sentence Embeddings",
dimensions=384
)
dataset.add_vector_settings(vector_settings)
Once the vector settings are added, you can check their definition using vector_settings_property_by_name
.
dataset.vector_settings_property_by_name("sentence_embeddings")
# rg.VectorSettings(
# name="sentence_embeddings",
# title="Sentence Embeddings",
# dimensions=768
# )
Update vectors_settings
#
You can update the vector settings for a FeedbackDataset
, via assignment. If the dataset was already pushed to Argilla and you are working with a RemoteFeedbackDataset
, you can update them using the update_vector_settings
method.
Note
The dataset not yet pushed to Argilla or pulled from HuggingFace Hub is an instance of FeedbackDataset
whereas the dataset pulled from Argilla is an instance of RemoteFeedbackDataset
.
vector_config = dataset.vector_settings_by_name("sentence_embeddings")
vector_config.title = "Embeddings"
dataset.update_vectors_settings(vector_config)
Delete vectors_settings
#
If you need to delete vector settings from an already configured FeedbackDataset
, you can use the delete_vector_settings
method.
dataset.delete_vectors_settings("sentence_embeddings")
Format vectors
#
You can associate vectors, like text embeddings, to your records. This will enable the semantic search in the UI and the Python SDK. These are saved as a dictionary, where the keys correspond to the name
s of the vector settings that were configured for your dataset and the value is a list of floats. Make sure that the length of the list corresponds to the dimensions set in the vector settings.
Hint
Vectors should have the following format List[float]
. If you are using numpy arrays, simply convert them using the method .tolist()
.
record = rg.FeedbackRecord(
fields={...},
vectors={"my_vector": [...], "my_other_vector": [...]}
)
Add vectors
#
Once the vector_settings
were defined, to add vectors to the records, it slightly depends on whether you are using a FeedbackDataset
or a RemoteFeedbackDataset
. For an end-to-end example, check our tutorial on adding vectors.
Note
The dataset not yet pushed to Argilla or pulled from HuggingFace Hub is an instance of FeedbackDataset
whereas the dataset pulled from Argilla is an instance of RemoteFeedbackDataset
. The difference between the two is that the former is a local one and the changes made on it stay locally. On the other hand, the latter is a remote one and the changes made on it are directly reflected on the dataset on the Argilla server, which can make your process faster.
for record in dataset.records:
record.vectors["my_vectors"] = [0.1, 0.2, 0.3, 0.4]
modified_records = []
for record in dataset.records:
record.vectors["my_vectors"] = [0.1, 0.2, 0.3, 0.4]
modified_records.append(record)
dataset.update_records(modified_records)
Note
You can also follow the same strategy to modify existing vectors.
Add Sentence Transformers vectors
#
You can easily add semantic embeddings to your records or datasets using the SentenceTransformersExtractor
based on the sentence-transformers library. This extractor is available in the Python SDK and can be used to configure settings for a dataset and extract embeddings from a list of records. The SentenceTransformersExtractor
has the following arguments:
model_name
: The name of the model to use for extracting embeddings. You can find a list of available models here.show_progress
(optional): Whether to show a progress bar when extracting metrics. Defaults toTrue
.
For a practical example, check our tutorial on adding sentence transformer embeddings as vectors.
This can be used to update the dataset and configuration with VectorSettings
for Fields
in a FeedbackDataset
or a RemoteFeedbackDataset
.
from argilla.client.feedback.integrations.sentencetransformers import SentenceTransformersExtractor
dataset = ... # FeedbackDataset or RemoteFeedbackDataset
tde = SentenceTransformersExtractor(
model="TaylorAI/bge-micro-v2",
show_progress=True,
)
dataset = tde.update_dataset(
dataset=dataset,
fields=None, # None means using all fields
update_records=True, # Also, update the records in the dataset
overwrite=False, # Whether to overwrite existing vectors
)
This can be used to update the records with vector
values for Fields
in a list of FeedbackRecords
.
from argilla.client.feedback.integrations.textdescrisentencetransformersptives import SentenceTransformersExtractor
records = [...] # FeedbackRecords or RemoteFeedbackRecords
tde = SentenceTransformersExtractor(
model="TaylorAI/bge-micro-v2",
show_progress=True,
)
records = tde.update_records(
records=records,
fields=None # None means using all fields
overwrite=False # Whether to overwrite existing vectors
)
Other datasets#
Note
The records classes covered in this section correspond to three datasets: DatasetForTextClassification
, DatasetForTokenClassification
, and DatasetForText2Text
. These will be deprecated in Argilla 2.0 and replaced by the fully configurable FeedbackDataset
class. Not sure which dataset to use? Check out our section on choosing a dataset.
Add vectors
#
You can add vectors to a TextClassificationRecord
, TokenClassificationRecord
or Text2TextRecord
. The vectors is a dictionary with the name as the key and the vectors as the value.
record = rg.TokenClassificationRecord(
text = "Michael is a professor at Harvard",
tokens = ["Michael", "is", "a", "professor", "at", "Harvard"],
vectors = {
"bert_base_uncased": [3.2, 4.5, 5.6, 8.9]
}
)