🦾 Fine-tune LLMs and other language models#
Feedback Dataset#
Note
The dataset class covered in this section is the FeedbackDataset. This fully configurable dataset will replace the DatasetForTextClassification, DatasetForTokenClassification, and DatasetForText2Text in Argilla 2.0. Not sure which dataset to use? Check out our section on choosing a dataset.
After collecting the responses from our FeedbackDataset, we can start fine-tuning our LLMs and other models. Due to the customizability of the task, this might require setting up a custom post-processing workflow, but we will provide some good toy examples for the LLM approaches: supervised fine-tuning and reinforcement learning from human feedback (RLHF). However, we also provide support for other NLP tasks, like text classification.
The ArgillaTrainer#
The ArgillaTrainer is a wrapper around many of our favorite NLP libraries. It provides an intuitive abstraction that facilitates simple training workflows with sensible default configurations, without having to worry about any data transformations from Argilla.
Using the ArgillaTrainer is straightforward, but it differs slightly per task.
First, we define a TrainingTask. This is done using a custom formatting_func. However, tasks like Text Classification can also be defined using default definitions based on the FeedbackDataset fields and questions. These tasks are then used for retrieving data from a dataset and initializing the training. We also offer some ideas for unifying data out of the box.
Next, we initialize the ArgillaTrainer and forward the task and training framework. Internally, this uses the FeedbackDataset.prepare_for_training-method to format the data according to the expectations of the framework. Some other interesting methods are:
ArgillaTrainer.update_config to change framework-specific training parameters.
ArgillaTrainer.train to start training.
ArgillaTrainer.predict to run inference.
Underneath, you can see the happy flow for using the ArgillaTrainer.
from argilla.feedback import ArgillaTrainer, FeedbackDataset, TrainingTask
dataset = FeedbackDataset.from_huggingface(
repo_id="argilla/emotion"
)
task = TrainingTask.for_text_classification(
text=dataset.field_by_name("text"),
label=dataset.question_by_name("label"),
)
trainer = ArgillaTrainer(
dataset=dataset,
task=task,
framework="setfit"
)
trainer.update_config(num_iterations=1)
trainer.train(output_dir="my_setfit_model")
trainer.predict("This is awesome!")
Supported Frameworks#
We plan on adding more support for other tasks and frameworks, so feel free to reach out on our Discord channel or GitHub to help us prioritize each task.
Task/Framework | TRL | OpenAI | SetFit | spaCy | Transformers | PEFT | SentenceTransformers
---|---|---|---|---|---|---|---
Text Classification | | | ✔️ | ✔️ | ✔️ | ✔️ |
Question Answering | | | | | ✔️ | |
Sentence Similarity | | | | | | | ✔️
Supervised Fine-tuning | ✔️ | | | | | |
Reward Modeling | ✔️ | | | | | |
Proximal Policy Optimization | ✔️ | | | | | |
Direct Preference Optimization | ✔️ | | | | | |
Chat Completion | | ✔️ | | | | |
Training Configs#
The trainer also has an ArgillaTrainer.update_config()
method, which maps a dict with **kwargs
to the respective framework. So, these can be derived from the underlying framework that was used to initialize the trainer. Underneath, you can find an overview of these variables for the supported frameworks.
Note
Note that you don't need to pass all of them directly and that the values below are their default configurations.
# `OpenAI.FineTune`
trainer.update_config(
training_file = None,
validation_file = None,
model = "gpt-3.5-turbo-0613",
hyperparameters = {"n_epochs": 1},
suffix = None
)
# `OpenAI.FineTune` (legacy)
trainer.update_config(
training_file = None,
validation_file = None,
model = "curie",
n_epochs = 2,
batch_size = None,
learning_rate_multiplier = 0.1,
prompt_loss_weight = 0.1,
compute_classification_metrics = False,
classification_n_classes = None,
classification_positive_class = None,
classification_betas = None,
suffix = None
)
# `AutoTrain.autotrain_advanced`
trainer.update_config(
model = "autotrain", # hub models like roberta-base
autotrain = [{
"source_language": "en",
"num_models": 5
}],
hub_model = [{
"learning_rate": 0.001,
"optimizer": "adam",
"scheduler": "linear",
"train_batch_size": 8,
"epochs": 10,
"percentage_warmup": 0.1,
"gradient_accumulation_steps": 1,
"weight_decay": 0.1,
"tasks": "text_binary_classification", # this is inferred from the dataset
}]
)
# `setfit.SetFitModel`
trainer.update_config(
pretrained_model_name_or_path = "all-MiniLM-L6-v2",
force_download = False,
resume_download = False,
proxies = None,
token = None,
cache_dir = None,
local_files_only = False
)
# `setfit.SetFitTrainer`
trainer.update_config(
metric = "accuracy",
num_iterations = 20,
num_epochs = 1,
learning_rate = 2e-5,
batch_size = 16,
seed = 42,
use_amp = True,
warmup_proportion = 0.1,
distance_metric = "BatchHardTripletLossDistanceFunction.cosine_distance",
margin = 0.25,
samples_per_label = 2
)
# `spacy.training`
trainer.update_config(
dev_corpus = "corpora.dev",
train_corpus = "corpora.train",
seed = 42,
gpu_allocator = 0,
accumulate_gradient = 1,
patience = 1600,
max_epochs = 0,
max_steps = 20000,
eval_frequency = 200,
frozen_components = [],
annotating_components = [],
before_to_disk = None,
before_update = None
)
# `transformers.AutoModelForTextClassification`
trainer.update_config(
pretrained_model_name_or_path = "distilbert-base-uncased",
force_download = False,
resume_download = False,
proxies = None,
token = None,
cache_dir = None,
local_files_only = False
)
# `transformers.TrainingArguments`
trainer.update_config(
per_device_train_batch_size = 8,
per_device_eval_batch_size = 8,
gradient_accumulation_steps = 1,
learning_rate = 5e-5,
weight_decay = 0,
adam_beta1 = 0.9,
adam_beta2 = 0.9,
adam_epsilon = 1e-8,
max_grad_norm = 1,
num_train_epochs = 3,
max_steps = 0,
log_level = "passive",
logging_strategy = "steps",
save_strategy = "steps",
save_steps = 500,
seed = 42,
push_to_hub = False,
hub_model_id = "user_name/output_dir_name",
hub_strategy = "every_save",
hub_token = "1234",
hub_private_repo = False
)
# `peft.LoraConfig`
trainer.update_config(
r=8,
target_modules=None,
lora_alpha=16,
lora_dropout=0.1,
fan_in_fan_out=False,
bias="none",
inference_mode=False,
modules_to_save=None,
init_lora_weights=True,
)
# `transformers.AutoModelForTextClassification`
trainer.update_config(
pretrained_model_name_or_path = "distilbert-base-uncased",
force_download = False,
resume_download = False,
proxies = None,
token = None,
cache_dir = None,
local_files_only = False
)
# `transformers.TrainingArguments`
trainer.update_config(
per_device_train_batch_size = 8,
per_device_eval_batch_size = 8,
gradient_accumulation_steps = 1,
learning_rate = 5e-5,
weight_decay = 0,
adam_beta1 = 0.9,
adam_beta2 = 0.9,
adam_epsilon = 1e-8,
max_grad_norm = 1,
num_train_epochs = 3,
max_steps = 0,
log_level = "passive",
logging_strategy = "steps",
save_strategy = "steps",
save_steps = 500,
seed = 42,
push_to_hub = False,
hub_model_id = "user_name/output_dir_name",
hub_strategy = "every_save",
hub_token = "1234",
hub_private_repo = False
)
# `SpanMarkerConfig`
trainer.update_config(
pretrained_model_name_or_path = "distilbert-base-cased",
model_max_length = 256,
marker_max_length = 128,
entity_max_length = 8,
)
# `transformers.TrainingArguments`
trainer.update_config(
per_device_train_batch_size = 8,
per_device_eval_batch_size = 8,
gradient_accumulation_steps = 1,
learning_rate = 5e-5,
weight_decay = 0,
adam_beta1 = 0.9,
adam_beta2 = 0.9,
adam_epsilon = 1e-8,
max_grad_norm = 1,
num_train_epochs = 3,
max_steps = 0,
log_level = "passive",
logging_strategy = "steps",
save_strategy = "steps",
save_steps = 500,
seed = 42,
push_to_hub = False,
hub_model_id = "user_name/output_dir_name",
hub_strategy = "every_save",
hub_token = "1234",
hub_private_repo = False
)
# Parameters from `trl.RewardTrainer`, `trl.SFTTrainer`, `trl.PPOTrainer` or `trl.DPOTrainer`.
# `transformers.TrainingArguments`
trainer.update_config(
per_device_train_batch_size = 8,
per_device_eval_batch_size = 8,
gradient_accumulation_steps = 1,
learning_rate = 5e-5,
weight_decay = 0,
adam_beta1 = 0.9,
adam_beta2 = 0.9,
adam_epsilon = 1e-8,
max_grad_norm = 1,
num_train_epochs = 3,
max_steps = 0,
log_level = "passive",
logging_strategy = "steps",
save_strategy = "steps",
save_steps = 500,
seed = 42,
push_to_hub = False,
hub_model_id = "user_name/output_dir_name",
hub_strategy = "every_save",
hub_token = "1234",
hub_private_repo = False
)
# Parameters related to the model initialization from `sentence_transformers.SentenceTransformer`
trainer.update_config(
model="sentence-transformers/all-MiniLM-L6-v2",
modules = False,
device="cuda",
cache_folder="dir/folder",
use_auth_token=True
)
# and from `sentence_transformers.CrossEncoder`
trainer.update_config(
model="cross-encoder/ms-marco-MiniLM-L-6-v2",
num_labels=2,
max_length=128,
device="cpu",
tokenizer_args={},
automodel_args={},
default_activation_function=None
)
# Related to the training procedure from `sentence_transformers.SentenceTransformer`
trainer.update_config(
steps_per_epoch = 2,
checkpoint_path = None,
checkpoint_save_steps = 500,
checkpoint_save_total_limit = 0
)
# and from `sentence_transformers.CrossEncoder`
trainer.update_config(
loss_fct = None,
activation_fct = nn.Identity(),
)
# The remaining arguments are common for both procedures
trainer.update_config(
evaluator = evaluation.EmbeddingSimilarityEvaluator,
epochs = 1,
scheduler = 'WarmupLinear',
warmup_steps = 10000,
optimizer_class = torch.optim.AdamW,
optimizer_params = {'lr': 2e-5},
weight_decay = 0.01,
evaluation_steps = 0,
output_path = None,
save_best_model = True,
max_grad_norm = 1,
use_amp = False,
callback = None,
show_progress_bar = True,
)
# Other parameters that don't correspond to the initialization or the trainer, but
# can be set externally.
trainer.update_config(
batch_size=8, # It will be passed to the DataLoader to generate batches during training.
loss_cls=losses.BatchAllTripletLoss
)
The TrainingTask#
A TrainingTask is used to define how the data should be processed and formatted according to the associated task and framework. Each task has its own TrainingTask.for_*-classmethod and the data formatting can always be defined using a custom formatting_func. However, simpler tasks like Text Classification can also be defined using default definitions. These directly use the fields and questions from the FeedbackDataset configuration to infer how to prepare the data. Underneath you can find an overview of the TrainingTask requirements.
Method | Content | Default
---|---|---
for_text_classification | text-label | ✔️
for_question_answering | question-context-answer | ✔️
for_sentence_similarity | sentence-pairs or sentence-triplets (optionally with a label) | ✔️
for_supervised_fine_tuning | text | ✗
for_reward_modeling | chosen-rejected | ✗
for_proximal_policy_optimization | prompt | ✗
for_direct_preference_optimization | prompt-chosen-rejected | ✗
for_chat_completion | | ✗
Filter and Sort datasets for training#
Say you want to filter a part of your dataset, keep only the submitted records, or maybe sort by date to train on the latest additions to your dataset only. You can do it easily from the ArgillaTrainer
by using the filter_by
, sort_by
, and max_records
arguments:
from argilla import SortBy
trainer = ArgillaTrainer(
dataset=dataset,
task=task,
framework="setfit",
filter_by={"response_status": ["submitted"]},
sort_by=[SortBy(field="metadata.my-metadata", order="asc")],
max_records=1000
)
Note
You can take a look at the filter and query datasets page in the docs to learn more about how to filter and sort datasets.
Hugging Face Hub Integration#
This section presents some integrations with the Hugging Face 🤗 Model Hub, the easiest way to share your Argilla models, as well as the possibility of generating an automated model card.
Note
Take a look at the following sample model in the 🤗 Hugging Face Hub with the autogenerated model card, and check https://huggingface.co/models?other=argilla for shared Argilla models to come.
Model card generation#
The ArgillaTrainer
automatically generates a model card when saving the model. After calling trainer.train(output_dir="my_model")
, you should see the model card under the same output directory you passed through the train method: ./my_model/README.md
. Most of the fields in the card are automatically generated when possible, but the following fields can be (optionally) updated via the framework_kwargs
variable of the ArgillaTrainer
like so:
model_card_kwargs = {
"language": ["en", "es"],
"license": "Apache-2.0",
"dataset_name": "argilla/emotion",
"tags": ["nlp", "few-shot-learning", "argilla", "setfit"],
"model_summary": "Small summary of what the model does",
"model_description": "An extended explanation of the model",
"model_type": "A 1.3B parameter embedding model fine-tuned on an awesome dataset",
"finetuned_from": "all-MiniLM-L6-v2",
"repo": "https://github.com/..."
"developers": "",
"shared_by": "",
}
trainer = ArgillaTrainer(
dataset=dataset,
task=task,
framework="setfit",
framework_kwargs={"model_card_kwargs": model_card_kwargs}
)
trainer.train(output_dir="my_model")
Even though it's generated internally, you can also get the card by calling the generate_model_card method:
argilla_model_card = trainer.generate_model_card("my_model")
Upload your models to the Hugging Face Hub#
If you don't have huggingface_hub installed yet, you can do it with the following command:
pip install huggingface_hub
Note
If your chosen framework is spacy or spacy-transformers, you should also install the following dependency:
pip install spacy-huggingface-hub
And then select the environment, depending on whether you are working with a script or from a Jupyter notebook:
Run the following command from a console window and insert your 🤗 Hugging Face Hub token:
huggingface-cli login
Run the following command from a notebook cell and insert your 🤗 Hugging Face Hub token:
from huggingface_hub import notebook_login
notebook_login()
Internally, the token will be used when calling the push_to_huggingface method.
Be sure to take a look at the Hugging Face Hub requirements in case you need more help publishing your models.
After your model is trained, you just need to call push_to_huggingface and wait for your model to be pushed to the hub (by default, a model card will be generated; set the argument to False if you don't want it):
# spaCy based models:
repo_id = output_dir
# Every other framework:
repo_id = "organization/model-name" # for example: argilla/newest-model
trainer.push_to_huggingface(repo_id, generate_card=True)
Due to the way spaCy pushes models, the repo_id is generated automatically internally; you only need to pass the path where the model was saved (the same output_dir variable you pass to the train method), and it will work just the same.
Tasks#
Text Classification#
Background#
Text classification is a widely used NLP task where labels are assigned to text. Major companies rely on it for various applications. Sentiment analysis, a popular form of text classification, assigns labels like positive, negative, or neutral to text. Additionally, we distinguish between single- and multi-label text classification.
Single-label text classification refers to the task of assigning a single category or label to a given text sample. Each text is associated with only one predefined class or category. For example, in sentiment analysis, a single-label text classification task would involve assigning labels such as "positive", "negative", or "neutral" to texts based on their sentiment.
"The help for my application of a new card and mortgage was great", "positive"
Multi-label text classification is generally more complex than single-label classification due to the challenge of determining and predicting multiple relevant labels for each text. It finds applications in various domains, including document tagging, topic labeling, and content recommendation systems. For example, in customer care, a multi-label text classification task would involve assigning topics such as "new_card", "mortgage", or "opening_hours" to texts based on their content.
Tip
For a multi-label scenario, it is recommended to add some examples without any labels to improve model performance.
"The help for my application of a new card and mortgage was great", ["new_card", "mortgage"]
We then use either of these text-label-pairs to further fine-tune the model.
Training#
Text classification is one of the most widely supported training tasks within NLP. As an illustration, we will use our emotion demo dataset.
Data Preparation
from argilla.feedback import FeedbackDataset
dataset = FeedbackDataset.from_huggingface(
repo_id="argilla/emotion"
)
For this task, we assume we need a text-label
-pair or a formatting_func
for defining the TrainingTask.for_text_classification
.
We offer the option to use default unification strategies and formatting based on a text-label
-pair. Here we infer formatting information based on a TextField
and a LabelQuestion
, MultiLabelQuestion
, RatingQuestion
or RankingQuestion
from the dataset. This is the easiest way to define a TrainingTask
for text classification but if you need a custom workflow, you can use formatting_func
.
Note
An overview of the unification measures can be found here. The RatingQuestion and RankingQuestion can be unified using a "majority"-, "min"-, "max"- or "disagreement"-strategy. Both the LabelQuestion and MultiLabelQuestion can be resolved using a "majority"- or "disagreement"-strategy.
from argilla.feedback import FeedbackDataset, TrainingTask
dataset = FeedbackDataset.from_huggingface(
repo_id="argilla/emotion"
)
task = TrainingTask.for_text_classification(
text=dataset.field_by_name("text"),
label=dataset.question_by_name("label"),
label_strategy=None # use the default unification strategy
)
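If the label comes from a RatingQuestion or RankingQuestion, one of the unification strategies from the note above could also be passed explicitly. A minimal sketch, assuming the dataset has a rating question named "label-rating" (the question name is illustrative):
task = TrainingTask.for_text_classification(
    text=dataset.field_by_name("text"),
    label=dataset.question_by_name("label-rating"),
    label_strategy="majority",  # or "min", "max", "disagreement" for Rating/RankingQuestion
)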
We offer the option to provide a formatting_func
to the TrainingTask.for_text_classification
. This function is applied to each sample in the dataset and can be used for more advanced preprocessing and data formatting. The function should return a tuple of (text, label)
as Tuple[str, str]
or Tuple[str, List[str]]
.
import random
from collections import Counter

from argilla.feedback import FeedbackDataset, TrainingTask
dataset = FeedbackDataset.from_huggingface(
repo_id="argilla/emotion"
)
def formatting_func(sample):
text = sample["text"]
# Choose the most common label
values = [resp["value"] for resp in sample["label"]]
counter = Counter(values)
if counter:
most_common = counter.most_common()
max_frequency = most_common[0][1]
most_common_elements = [
element for element, frequency in most_common if frequency == max_frequency
]
label = random.choice(most_common_elements)
return (text, label)
else:
return None
task = TrainingTask.for_text_classification(formatting_func=formatting_func)
We can then define our ArgillaTrainer
for any of the supported frameworks and customize the training config using ArgillaTrainer.update_config
.
from argilla.feedback import ArgillaTrainer
trainer = ArgillaTrainer(
dataset=dataset,
task=task,
framework="spacy",
train_size=0.8,
model="en_core_web_sm",
)
trainer.train(output_dir="textcat_model")
Question Answering#
Background#
The extractive Question Answering (QnA) task involves answering questions posed by users based on a given context. It is a challenging task that requires the model to understand the context of the question and provide an accurate answer. The model must be able to comprehend the question and the context in which it is asked, as well as the relationship between the two. Additionally, it must be able to extract the relevant information from the context and provide an answer that is both accurate and relevant to the question.
You can find a sample of an extractive QnA dataset underneath:
{
'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
'answers': 'Saint Bernadette Soubirous',
}
Note
Officially, answers need to be passed as a list of {'answer_start': int, 'text': str}
-dicts. However, we only support a string, where the answer_start
is inferred from the context
and text
-field.
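For illustration, here is a minimal sketch of the difference between the two formats, using the sample record above (the dictionary layout for the official format follows the SQuAD convention):
context = (
    "It is a replica of the grotto at Lourdes, France where the Virgin Mary "
    "reputedly appeared to Saint Bernadette Soubirous in 1858."
)
answer = "Saint Bernadette Soubirous"

# Official SQuAD-style format: the character offset of the answer is given explicitly.
official_format = {"answers": [{"answer_start": context.index(answer), "text": answer}]}

# Argilla only expects the answer string; the offset is inferred from the context and text fields.
argilla_format = {"answers": answer}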
We then use either a question-context-answer
-set or a formatting_func
to further fine-tune the model.
Training#
Data Preparation
import argilla as rg
from datasets import Dataset
feedback_dataset = rg.FeedbackDataset.from_huggingface("argilla/squad")
We can use a default configuration where we initialize the TrainingTask.for_question_answering
using the question-context-answer
-set from the dataset. We also offer the option to provide a formatting_func
to the TrainingTask.for_question_answering
. This function is applied to each sample in the dataset and can be used for advanced preprocessing and data formatting. The function should return a question-context-answer
-set as str-str-str
.
from argilla.feedback import TrainingTask
task = TrainingTask.for_question_answering(
question=feedback_dataset.field_by_name("question"),
context=feedback_dataset.field_by_name("context"),
answer=feedback_dataset.question_by_name("answer"),
)
from argilla.feedback import TrainingTask
def formatting_func(sample):
question = sample["question"]
context = sample["context"]
for answer in sample["answer"]:
if not all([question, context, answer["value"]]):
continue
yield question, context, answer["value"]
task = TrainingTask.for_question_answering(formatting_func=formatting_func)
ArgillaTrainer
Next, we can define our ArgillaTrainer
for any of the supported frameworks and customize the training config using ArgillaTrainer.update_config
.
from argilla.feedback import ArgillaTrainer
trainer = ArgillaTrainer(
dataset=feedback_dataset,
task=task,
framework="transformers",
train_size=0.8,
)
trainer.train(output_dir="qna_model")
Inference
Lastly, this model can be used for inference using the pipeline
-method from the Transformers library. We can use the question-answering
-pipeline for this task.
from transformers import pipeline
qa_model = pipeline("question-answering", model="qna_model")
question = "Where do I live?"
context = "My name is Merve and I live in İstanbul."
qa_model(question = question, context = context)
## {'answer': 'İstanbul', 'end': 39, 'score': 0.953, 'start': 31}
Sentence Similarity#
Background#
Sentence Similarity is the task of determining how similar two texts are. By transforming the text into embeddings (vectors representing the semantic information) we can compute the similarity between these texts by computing the distance between their vectors. The Sentence-Transformers library makes it easy to compute these sentence embeddings and use them for information retrieval and clustering. Besides these tasks, it is also commonly used to optimize Retrieval Augmented Generation (RAG) and re-ranking tasks. Generally, two types of models can be fine-tuned.
A bi-encoder consists of two separate neural network models, each responsible for encoding a single sentence or text. These encoders work independently and do not share weights. The primary objective of a bi-encoder is to encode individual sentences or texts into fixed-length vectors in a way that preserves the semantic meaning of the input. These fixed-length vectors can later be used for various tasks, such as retrieval or classification. Bi-encoders are often used in tasks where you need to encode a large set of texts into vectors (e.g., creating embeddings for documents in a corpus). These embeddings can then be used for tasks like information retrieval, clustering, and classification.
A cross-encoder consists of a single neural network model that takes multiple input sentences or texts simultaneously. It processes pairs of sentences or texts in one forward pass. The main objective of a cross-encoder is to provide a single scalar score or similarity measure for a pair of input sentences or texts. This score represents the similarity or relevance between the two input texts. Cross encoders are commonly used in applications like text matching, question-answering, document retrieval, and recommendation systems where you need to compare two pieces of text and assess their similarity or relevance.
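To make the distinction concrete, here is a small sketch that uses the sentence-transformers library directly (the model names and sentences are only illustrative):
from sentence_transformers import CrossEncoder, SentenceTransformer, util

sentences = ["Machine learning is so easy.", "Deep learning is so straightforward."]

# Bi-encoder: encode each sentence independently into a fixed-length vector,
# then compare the vectors, e.g. with cosine similarity.
bi_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = bi_encoder.encode(sentences)
print(util.cos_sim(embeddings[0], embeddings[1]))

# Cross-encoder: score the pair jointly in a single forward pass.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
print(cross_encoder.predict([(sentences[0], sentences[1])]))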
In this blog article from Hugging Face you can see the different types of datasets that can be used for training sentence-transformers models.
Training#
Note
We can easily switch between Bi-Encoder and Cross Encoder based models using framework_kwargs={"cross_encoder": True}. Additionally, data can be provided in several different formats, as described below and illustrated in the sketch after this list. Keep in mind that Cross Encoder based models don't allow training with sentence triplets.
The example is a pair of positive (similar) sentences without a label. For example, pairs of paraphrases, pairs of full texts and their summaries, pairs of duplicate questions, pairs of (query, response), or pairs of (source_language, target_language). Natural Language Inference datasets can also be formatted this way by pairing entailing sentences.
The example is a pair of sentences and a label indicating how similar they are. The label can be either an integer or a float. This case applies to datasets originally prepared for Natural Language Inference (NLI) since they contain pairs of sentences with a label indicating whether they infer each other or not.
The example is a triplet (anchor, positive, negative) without classes or labels for the sentences. This format only works with Bi-Encoders.
The example is a sentence with an integer label. This data format is easily converted by loss functions into three sentences (triplets) where the first is an "anchor", the second a "positive" of the same class as the anchor, and the third a "negative" of a different class. Each sentence has an integer label indicating the class to which it belongs. This format only works with Bi-Encoders.
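As a rough sketch, and reusing the sentence-1/sentence-2/sentence-3/label keys used by the formatting function later in this section, these formats could look as follows (the sentences and labels are purely illustrative):
# 1. Pair of similar sentences without a label.
pair = {"sentence-1": "A man is playing a guitar.", "sentence-2": "Someone plays an instrument."}

# 2. Pair of sentences with a label indicating how similar they are (int or float).
labeled_pair = {"sentence-1": "A man is playing a guitar.", "sentence-2": "A man plays music.", "label": 1}

# 3. Triplet (anchor, positive, negative) without labels -- Bi-Encoders only.
triplet = {
    "sentence-1": "A man is playing a guitar.",    # anchor
    "sentence-2": "Someone plays an instrument.",  # positive
    "sentence-3": "A cat sleeps on the sofa.",     # negative
}

# 4. Single sentence with an integer class label -- Bi-Encoders only;
# the loss function turns these into triplets internally.
labeled_sentence = {"sentence-1": "A man is playing a guitar.", "label": 0}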
Data Preparation
Let's use a small version of the snli dataset for this example, ready to work with Argilla: snli-small.
import argilla as rg
dataset = rg.FeedbackDataset.from_huggingface("plaguss/snli-small")
We offer the option to use default unification strategies and formatting based on sentence-pairs and sentence-triplets, with or without a label. Here we infer formatting information based on two TextField
and a LabelQuestion
or RankingQuestion
. This is the easiest way to define a TrainingTask
for sentence similarity but if you need a custom workflow, you can use formatting_func
.
Note
An overview of the unification measures can be found here. For this type of task, only LabelQuestion
or RankingQuestion
applies.
from argilla.feedback import TrainingTask
task = TrainingTask.for_sentence_similarity(
texts=[dataset.field_by_name("premise"), dataset.field_by_name("hypothesis")],
label=dataset.question_by_name("label")
)
For datasets that were annotated with numerical values we could also pass the label strategy we want to use (let's assume we have another question in the dataset named "other-question" that contains values that come from rated answers):
task = TrainingTask.for_sentence_similarity(
texts=[dataset.field_by_name("premise"), dataset.field_by_name("hypothesis")],
label=dataset.question_by_name("other-question"),
label_strategy="majority" # or "mean" for RankingQuestion
)
We offer the option to provide a formatting_func
to the TrainingTask.for_sentence_similarity
. This function is applied to each sample in the dataset and can be used for more advanced preprocessing and data formatting. The function can return a dict with sentence-1
, sentence-2
and optionally sentence-3
and the corresponding sentences, and it can also include a label
, which can be either an int
(to represent the class) or a float
, as well as lists of these elements.
import random
from collections import Counter

def formatting_func(sample):
record = {"sentence-1": sample["premise"], "sentence-2": sample["hypothesis"]}
# Choose the most common label
values = [resp["value"] for resp in sample["label"]]
counter = Counter(values)
if counter:
most_common = counter.most_common()
max_frequency = most_common[0][1]
most_common_elements = [
element for element, frequency in most_common if frequency == max_frequency
]
label = random.choice(most_common_elements)
record["label"] = label
return record
else:
return None
task = TrainingTask.for_sentence_similarity(formatting_func=formatting_func)
ArgillaTrainer
We'll use the task directly with our FeedbackDataset in the ArgillaTrainer. For this case we are using the default SentenceTransformer model; to fine-tune a Cross Encoder based model instead, pass framework_kwargs={"cross_encoder": True}.
from argilla.feedback import ArgillaTrainer
trainer = ArgillaTrainer(
dataset=dataset,
task=task,
framework="sentence-transformers",
framework_kwargs={"cross_encoder": False}
)
trainer.train(output_dir="my_sentence_transformer_model")
Inference
These models can be loaded using sentence-transformers (or transformers); the reader can take a look at the documentation of each type of model for more details.
However, the ArgillaTrainer offers the possibility to predict the sentence similarity directly from its API. Let's check how it works using the same sample sentences from the sentence similarity task in Hugging Face:
from argilla.feedback import ArgillaTrainer, FeedbackDataset, TrainingTask
trainer.predict(
[
"Machine learning is so easy.",
["Deep learning is so straightforward.", "This is so difficult, like rocket science.", "I can't believe how much I struggled with this."]
]
)
# [0.77857256, 0.4587626, 0.29062212]
Just to see the other format that can be passed to get the sentence similarity (a list with pairs of sentences), let's look at the following example (the pairs don't need to share the first sentence; it's just an example to check that the same values are returned with both options).
trainer.predict(
[
["Machine learning is so easy.", "Deep learning is so straightforward."],
["Machine learning is so easy.", "This is so difficult, like rocket science."],
["Machine learning is so easy.", "I can't believe how much I struggled with this."]
]
)
# [0.77857256, 0.4587626, 0.29062212]
The previous results were obtained assuming the model trained was a SentenceTransformer. If, instead of a SentenceTransformer model (a Bi-Encoder based model), we had chosen a Cross-Encoder, we would obtain a different result, but with the same interpretation.
trainer = ArgillaTrainer(
dataset=dataset,
task=task,
framework="sentence-transformers",
framework_kwargs={"cross_encoder": True}
)
trainer.predict(
[
"Machine learning is so easy.",
["Deep learning is so straightforward.", "This is so difficult, like rocket science.", "I can't believe how much I struggled with this."]
]
)
# [2.2006402, -6.2634926, -10.251489]
Supervised finetuning#
Background#
The goal of Supervised Fine Tuning (SFT) is to optimize a pre-trained model to generate the responses that users are looking for. A causal language model can generate plausible human text, but it will not be able to give proper answers to questions posed by the user in a conversational or instruction setting. Therefore, we need to collect and curate data tailored to this use case to teach the model to mimic this data. We have a section in our docs about collecting data for this task and there are many good pre-trained causal language models available on Hugging Face.
Data for the training phase is generally divided into two different types: generic, for domain-like fine-tuning, or chat, for fine-tuning an instruction set.
Generic
In a generic fine-tuning setting, the aim is to make the model more proficient in generating coherent and contextually appropriate text within a particular domain. For example, if we want the model to generate text related to medical research, we would fine-tune it using a dataset consisting of medical literature, research papers, or related documents. By exposing the model to domain-specific data during training, it becomes more knowledgeable about the terminology, concepts, and writing style prevalent in that domain. This enables the model to generate more accurate and contextually appropriate responses when prompted with queries or tasks related to the specific domain. An example of this format is the PubMed data, but it might be smart to add some nuance by generic instruction phrases that indicate the scope of the data, like Generate a medical paper abstract: ...
.
# Five distinct ester hydrolases (EC 3-1) have been characterized in guinea-pig epidermis. These are carboxylic esterase, acid phosphatase, pyrophosphatase, and arylsulphatase A and B. Their properties are consistent with those of lysosomal enzymes.
Chat
On the other hand, instruction-based fine-tuning involves training the model to understand and respond to specific instructions or prompts given by the user. This approach allows for greater control and specificity in the generated output. For example, if we want the model to summarize a given text, we can fine-tune it using a dataset that consists of pairs of text passages and their corresponding summaries. The model can then be instructed to generate a summary based on a given input text. By fine-tuning the model in this manner, it becomes more adept at following instructions and producing output that aligns with the desired task or objective. An example of this format used is our curated Dolly dataset with instruction
, context
and response
fields. However, we can also have simpler datasets with only the question
and answer
fields.
### Instruction
{instruction}
### Context
{context}
### Response:
{response}
### Instruction
When did Virgin Australia start operating?
### Context
Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.
### Response:
Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.
Ultimately, the choice between these two approaches to be used as text
-field depends on the specific requirements of the application and the desired level of control over the model's output. By employing the appropriate fine-tuning strategy, we can enhance the model's performance and make it more suitable for a wide range of applications and use cases.
Training#
There are many good libraries to help with this step; we are fans of the Transformer Reinforcement Learning (TRL) package, Transformer Reinforcement Learning X (TRLX), and the no-code Hugging Face AutoTrain for fine-tuning. In all cases, we need a backbone model and, for example purposes, we will use our curated Dolly dataset.
Note
This dataset only contains a single annotator response per record. We gave some suggestions on dealing with responses from multiple annotators.
The Transformer Reinforcement Learning (TRL) package provides a flexible and customizable framework for fine-tuning models. It allows users to have fine-grained control over the training process, enabling them to define their functions and to further specify the desired behavior of the model. This approach requires a deeper understanding of reinforcement learning concepts and techniques, as well as more careful experimentation. It is best suited for users who have experience in reinforcement learning and want fine-grained control over the training process. Additionally, it directly integrates with Parameter-Efficient Fine-Tuning (PEFT) decreasing the computational complexity of this step of training an LLM.
Data Preparation
import argilla as rg
from datasets import Dataset
feedback_dataset = rg.FeedbackDataset.from_huggingface("argilla/databricks-dolly-15k-curated-en")
We offer the option to provide a formatting_func
to the TrainingTask.for_supervised_fine_tuning
. This function is applied to each sample in the dataset and can be used for advanced preprocessing and data formatting. The function should return a text
as str
.
from argilla.feedback import TrainingTask
from typing import Dict, Any
template = """\
### Instruction: {instruction}\n
### Context: {context}\n
### Response: {response}"""
def formatting_func(sample: Dict[str, Any]) -> str:
# What `sample` looks like depends a lot on your FeedbackDataset fields and questions
return template.format(
instruction=sample["new-instruction"][0]["value"],
context=sample["new-context"][0]["value"],
response=sample["new-response"][0]["value"],
)
task = TrainingTask.for_supervised_fine_tuning(formatting_func=formatting_func)
You can observe the resulting dataset by calling FeedbackDataset.prepare_for_training
. We can use "trl"
as the framework for example:
dataset = feedback_dataset.prepare_for_training(
framework="trl",
task=task
)
"""
>>> dataset
Dataset({
features: ['id', 'text'],
num_rows: 15015
})
>>> dataset[0]["text"]
### Instruction: When did Virgin Australia start operating?
### Context: Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.
### Response: Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.
"""
ArgillaTrainer
from argilla.feedback import ArgillaTrainer
trainer = ArgillaTrainer(
dataset=feedback_dataset,
task=task,
framework="trl",
train_size=0.8,
model="gpt2",
)
# e.g. using LoRA:
# from peft import LoraConfig
# trainer.update_config(peft_config=LoraConfig())
trainer.train(output_dir="sft_model")
Note
You can also initialize the ArgillaTrainer
with an already initialized model
and tokenizer
for additional fine-grained control. This might be useful if you wish to ensure that the tokenizer adds the EOS token. A lack of this token might result in the model generating endlessly.
If the trained model still generates endlessly, then it is recommended to 1) pass a tokenizer
that certainly adds an EOS token and 2) pass a custom Data Collator that does not set the label for the EOS token to -100.
Inference
Let's observe if it worked to train the model to respond within our template. We'll create a quick helper method for this.
from transformers import GenerationConfig, AutoTokenizer, GPT2LMHeadModel
def generate(model_id: str, instruction: str, context: str = "") -> str:
model = GPT2LMHeadModel.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = template.format(
instruction=instruction,
context=context,
response="",
).strip()
encoding = tokenizer([inputs], return_tensors="pt")
outputs = model.generate(
**encoding,
generation_config=GenerationConfig(
max_new_tokens=32,
min_new_tokens=12,
pad_token_id=tokenizer.pad_token_id,
eos_token_id=tokenizer.eos_token_id,
),
)
return tokenizer.decode(outputs[0])
>>> generate("sft_model", "Is a toad a frog?")
### Instruction: Is a toad a frog?
### Context:
### Response: A frog is a small, round, black-eyed, frog with a long, black-winged head. It is a member of the family Pter
Much better! This model follows the template as we want.
Reward Modeling#
Background#
A Reward Model (RM) is used to rate responses in alignment with human preferences and afterwards, using this RM, to fine-tune the LLM with the associated scores. Fine-tuning using a Reward Model can be done in different ways: we can either get the annotator to rate outputs completely manually, use a simple heuristic, or use a stochastic preference model. Both TRL and TRLX provide decent options for incorporating rewards. The DeepSpeed library of Microsoft is a worthy mention too but will not be covered in our docs.
The data required for these steps need to be used as comparison data to showcase the preference for the generated prompts. A good example is our curated Dolly dataset, where we assumed that updated responses get preference over the older ones. Another good example is the Anthropic RLHF dataset.
Note
The Dolly original dataset contained a lot of reference indicators such as "[1]", which causes the model to hallucinate and incorrectly create references.
### Instruction
When did Virgin Australia start operating?
### Context
Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline.
It is the largest airline by fleet size to use the Virgin brand. [2]
It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.[3]
It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001.
The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.[4]
### Response:
Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.
### Instruction
When did Virgin Australia start operating?
### Context
Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline.
It is the largest airline by fleet size to use the Virgin brand.
It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.
It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001.
The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.
### Response:
Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.
In the case of training an RM, we then use the chosen-rejected
-pairs and train a classifier to distinguish between them.
Training#
TRL implements reward modeling, which can be used via the ArgillaTrainer
class. We offer the option to provide a formatting_func
to the TrainingTask.for_reward_modeling
. This function is applied to each sample in the dataset and can be used for preprocessing and data formatting. The function should return a tuple of chosen-rejected
-pairs as Tuple[str, str]
. To determine which response from the FeedbackDataset is superior, we can use the user annotations.
Note
The formatting function can also return None
or a list of tuples. The None
may be used if the annotations indicate that the text is low quality or harmful, and the latter could be used if multiple annotators provide additional written responses, resulting in multiple good chosen-rejected
pairs.
Data Preparation
What the parameter to formatting_func
looks like depends a lot on your FeedbackDataset fields and questions.
However, fields (i.e. the left side of the Argilla annotation view) are provided as their values, e.g.
>>> sample
{
...
'original-response': 'Virgin Australia commenced services on 31 August 2000 '
'as Virgin Blue, with two aircraft on a single route.',
...
}
And, all questions (i.e. the right side of the Argilla annotation view) are provided like so:
>>> sample
{
...
'new-response': [{'status': 'submitted',
'value': 'Virgin Australia commenced services on 31 August '
'2000 as Virgin Blue, with two aircraft on a '
'single route.',
'user-id': ...}],
'new-response-suggestion': None,
'new-response-suggestion-metadata': {'agent': None,
'score': None,
'type': None},
...
}
We can now define our formatting function, which should return chosen-rejected
-pairs as tuple.
from typing import Any, Dict, Iterator, Tuple
from argilla.feedback import TrainingTask
template = """\
### Instruction: {instruction}\n
### Context: {context}\n
### Response: {response}"""
def formatting_func(sample: Dict[str, Any]) -> Iterator[Tuple[str, str]]:
# Our annotators were asked to provide new responses, which we assume are better than the originals
og_instruction = sample["original-instruction"]
og_context = sample["original-context"]
og_response = sample["original-response"]
rejected = template.format(instruction=og_instruction, context=og_context, response=og_response)
for instruction, context, response in zip(sample["new-instruction"], sample["new-context"], sample["new-response"]):
if response["status"] == "submitted":
chosen = template.format(
instruction=instruction["value"],
context=context["value"],
response=response["value"],
)
if chosen != rejected:
yield chosen, rejected
task = TrainingTask.for_reward_modeling(formatting_func=formatting_func)
You can observe the dataset created using this task by using FeedbackDataset.prepare_for_training
, for example using the "trl" framework:
dataset = feedback_dataset.prepare_for_training(framework="trl", task=task)
"""
>>> dataset
Dataset({
features: ['chosen', 'rejected'],
num_rows: 2872
})
>>> dataset[2772]
{
'chosen': '### Instruction: Answer based on the text: Is Leucascidae a sponge\n\n'
'### Context: Leucascidae is a family of calcareous sponges in the order Clathrinida.\n\n'
'### Response: Yes',
'rejected': '### Instruction: Is Leucascidae a sponge\n\n'
'### Context: Leucascidae is a family of calcareous sponges in the order Clathrinida.[1]\n\n'
'### Response: Leucascidae is a family of calcareous sponges in the order Clathrinida.'}
"""
Looks great!
ArgillaTrainer
Now let's use the ArgillaTrainer
to train a reward model with this task.
from argilla.feedback import ArgillaTrainer
trainer = ArgillaTrainer(
dataset=feedback_dataset,
task=task,
framework="trl",
model="distilroberta-base",
)
trainer.train(output_dir="reward_model")
Inference
Let's try out the trained model in practice.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
model = AutoModelForSequenceClassification.from_pretrained("reward_model")
tokenizer = AutoTokenizer.from_pretrained("reward_model")
def get_score(model, tokenizer, text):
# Tokenize the input sequences
inputs = tokenizer(text, truncation=True, padding="max_length", max_length=512, return_tensors="pt")
# Perform forward pass
with torch.no_grad():
outputs = model(**inputs)
# Extract the logits
return outputs.logits[0, 0].item()
# Example usage
prompt = "Is a toad a frog?"
context = "Both frogs and toads are amphibians in the order Anura, which means \"without a tail.\" Toads are a sub-classification of frogs, meaning that all toads are frogs, but not all frogs are toads."
good_response = "Yes"
bad_response = "Both frogs and toads are amphibians in the order Anura, which means \"without a tail.\""
example_good = template.format(instruction=prompt, context=context, response=good_response)
example_bad = template.format(instruction=prompt, context=context, response=bad_response)
score = get_score(model, tokenizer, example_good)
print(score)
# >> 5.478324890136719
score = get_score(model, tokenizer, example_bad)
print(score)
# >> 2.2948970794677734
As expected, the good response has a higher score than the worse response.
Proximal Policy Optimization#
Background#
The TRL library implements the last step of RLHF: Proximal Policy Optimization (PPO). It requires prompts, which are then fed through the model being finetuned. Its results are passed through a reward model. Lastly, the prompts, responses and rewards are used to update the model through reinforcement learning.
Note
PPO requires a trained supervised fine-tuned model and reward model to work. Take a look at the task outlines above to train your own models.
The data required for these steps need to be used as comparison data to showcase the preference for the generated prompts. A good example is our curated Dolly dataset, where we assumed that updated responses get preference over the older ones. Another good example is the Anthropic RLHF dataset.
Note
The Dolly original dataset contained a lot of reference indicators such as "[1]", which causes the model to hallucinate and incorrectly create references.
### Instruction
When did Virgin Australia start operating?
### Context
Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline.
It is the largest airline by fleet size to use the Virgin brand. [2]
It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.[3]
It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001.
The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.[4]
### Response:
Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.
### Instruction
When did Virgin Australia start operating?
### Context
Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline.
It is the largest airline by fleet size to use the Virgin brand.
It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.
It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001.
The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.
### Response:
Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.
In the case of training with PPO, we then use the prompt and context data, and correct the generated response from the SFT model by using the reward model. Hence, we will need to format the following text.
### Instruction
When did Virgin Australia start operating?
### Context
Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline.
It is the largest airline by fleet size to use the Virgin brand.
It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.
It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001.
The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.
### Response:
{to be generated by SFT model}
Training#
We will use our curated Dolly dataset, as introduced in the background-section above.
import argilla as rg
feedback_dataset = rg.FeedbackDataset.from_huggingface("argilla/databricks-dolly-15k-curated-en")
Data Preparation
As usual, we start with a task with a formatting function. For PPO, the formatting function only returns prompts as text
, which are formatted according to a template.
from argilla.feedback import TrainingTask
from typing import Dict, Any, Iterator
template = """\
### Instruction: {instruction}\n
### Context: {context}\n
### Response: {response}"""
def formatting_func(sample: Dict[str, Any]) -> Iterator[str]:
for instruction, context in zip(sample["new-instruction"], sample["new-context"]):
if instruction["status"] == "submitted":
yield template.format(
instruction=instruction["value"],
context=context["value"][:500],
response=""
).strip()
task = TrainingTask.for_proximal_policy_optimization(formatting_func=formatting_func)
Like before, we can observe the resulting dataset:
dataset = feedback_dataset.prepare_for_training(framework="trl", task=task)
"""
>>> dataset
Dataset({
features: ['id', 'query'],
num_rows: 15015
})
>>> dataset[922]
{'id': 922, 'query': '### Instruction: Is beauty objective or subjective?\n\n### Context: \n\n### Response:'}
"""
ArgillaTrainer
Instead of using this dataset, we'll use the task directly with our FeedbackDataset in the ArgillaTrainer. PPO requires us to specify the reward_model, and allows us to specify some other useful values as well:
reward_model: A sentiment analysis pipeline with the reward model. This produces a reward for a prompt + response.
length_sampler_kwargs: A dictionary with min_value and max_value keys, indicating the lower and upper bound on the number of tokens the finetuning model should generate while finetuning.
generation_kwargs: The keyword arguments passed to the generate method of the finetuning model.
config: A trl.PPOConfig instance with many useful parameters such as learning_rate and batch_size.
from argilla.feedback import ArgillaTrainer
from transformers import pipeline
from trl import PPOConfig
trainer = ArgillaTrainer(
dataset=feedback_dataset,
task=task,
framework="trl",
model="gpt2",
)
reward_model = pipeline("sentiment-analysis", model="reward_model")
trainer.update_config(
reward_model=reward_model,
length_sampler_kwargs={"min_value": 32, "max_value": 256},
generation_kwargs={
"min_length": -1,
"top_k": 0.0,
"top_p": 1.0,
"do_sample": True,
},
config=PPOConfig(batch_size=16)
)
trainer.train(output_dir="ppo_model")
Inference
After training, we can load this model and generate with it!
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("ppo_model")
tokenizer = AutoTokenizer.from_pretrained("ppo_model")
tokenizer.pad_token = tokenizer.eos_token
inputs = template.format(
instruction="Is a toad a frog?",
context="Both frogs and toads are amphibians in the order Anura, which means \"without a tail.\" Toads are a sub-classification of frogs, meaning that all toads are frogs, but not all frogs are toads.",
response=""
).strip()
encoding = tokenizer([inputs], return_tensors="pt")
outputs = model.generate(**encoding, max_new_tokens=30)
output_text = tokenizer.decode(outputs[0])
print(output_text)
# Yes it is, toads are a sub-classification of frogs.
Direct Preference Optimization#
Background#
The TRL library implements an alternative way to incorporate human feedback into an LLM which is called Direct Preference Optimization (DPO). This approach skips the step of training a separate reward model and directly uses the preference data during training as a measure for optimization of human feedback.
Note
DPO requires a trained supervised fine-tuned model to function. Take a look at the task outline above to train your own model.
The data required for this step needs to serve as comparison data, showcasing a preference between generated responses. A good example is our curated Dolly dataset, where we assumed that the updated responses are preferred over the older ones. Another good example is the Anthropic RLHF dataset.
Note
The original Dolly dataset contained a lot of reference indicators such as "[1]", which cause the model to hallucinate and incorrectly create references.
### Instruction
When did Virgin Australia start operating?
### Context
Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline.
It is the largest airline by fleet size to use the Virgin brand. [2]
It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.[3]
It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001.
The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.[4]
### Response:
Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.
### Instruction
When did Virgin Australia start operating?
### Context
Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline.
It is the largest airline by fleet size to use the Virgin brand.
It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.
It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001.
The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.
### Response:
Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.
In the case of training with PPO, we use the prompt and context data, and correct the response generated by the SFT model by using the reward model. Hence, we need to format the following text:
### Instruction
When did Virgin Australia start operating?
### Context
Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline.
It is the largest airline by fleet size to use the Virgin brand.
It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.
It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001.
The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.
### Response:
{to be generated by SFT model}
Within the DPO approach, we infer the reward from the formatted prompt and the provided preference data as prompt-chosen-rejected pairs.
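For intuition, each training item produced by the formatting function in the next section is one such triple. A purely illustrative example, based on the curated Dolly record shown above:
# Illustrative (prompt, chosen, rejected) triple consumed by DPO.
# The chosen response is the curated one; the rejected response is the
# original, which still contains reference artifacts such as "[3]".
example = (
    "### Instruction: When did Virgin Australia start operating?\n\n"
    "### Context: Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, ...\n\n"
    "### Response: ",
    "Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.",
    "Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.[3]",
)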
Training#
We will use our curated Dolly dataset, as introduced in the background section above.
import argilla as rg
feedback_dataset = rg.FeedbackDataset.from_huggingface("argilla/databricks-dolly-15k-curated-en")
Data Preparation
We will start with our basic example of a formatting function. For DPO it should return prompt-chosen-rejected pairs, where the prompt is formatted according to a template.
from argilla.feedback import TrainingTask
from typing import Any, Dict, Iterator, Tuple
template = """\
### Instruction: {instruction}\n
### Context: {context}\n
### Response: {response}"""
def formatting_func(sample: Dict[str, Any]) -> Iterator[Tuple[str, str, str]]:
# Our annotators were asked to provide new responses, which we assume are better than the originals
og_instruction = sample["original-instruction"]
og_context = sample["original-context"]
rejected = sample["original-response"]
prompt = template.format(instruction=og_instruction, context=og_context, response="")
for instruction, context, response in zip(sample["new-instruction"], sample["new-context"], sample["new-response"]):
if response["status"] == "submitted":
chosen = response["value"]
if chosen != rejected:
yield prompt, chosen, rejected
task = TrainingTask.for_direct_preference_optimization(formatting_func=formatting_func)
ArgillaTrainer
We'll use the task directly with our FeedbackDataset in the ArgillaTrainer. In contrast to PPO, we do not need to specify a reward model, because the preference modeling is handled internally by the DPO algorithm.
from argilla.feedback import ArgillaTrainer
trainer = ArgillaTrainer(
dataset=feedback_dataset,
task=task,
framework="trl",
model="gpt2",
)
trainer.train(output_dir="dpo_model")
Inference
After training, we can load this model and generate with it!
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("dpo_model")
tokenizer = AutoTokenizer.from_pretrained("dpo_model")
tokenizer.pad_token = tokenizer.eos_token
inputs = template.format(
instruction="Is a toad a frog?",
context="Both frogs and toads are amphibians in the order Anura, which means \"without a tail.\" Toads are a sub-classification of frogs, meaning that all toads are frogs, but not all frogs are toads.",
response=""
).strip()
encoding = tokenizer([inputs], return_tensors="pt")
outputs = model.generate(**encoding, max_new_tokens=30)
output_text = tokenizer.decode(outputs[0])
print(output_text)
# Yes it is, toads are a sub-classification of frogs.
Chat Completion#
Background#
With the rise of chat-oriented models such as OpenAI's ChatGPT, we have seen a lot of interest in the use of LLMs for chat-oriented tasks. The main difference between chat-oriented models and other LLMs is that they are trained on a differently formatted dataset: instead of a dataset of prompts and responses, they are trained on a dataset of conversations. This allows them to generate responses that are more conversational. OpenAI also supports fine-tuning LLMs for chat-completion use cases; more information at https://openai.com/blog/gpt-3-5-turbo-fine-tuning-and-api-updates.
User: Hello, how are you?
Agent: I am doing great!
User: When did Virgin Australia start operating?
Agent: Virgin Australia commenced services on 31 August 2000 as Virgin Blue.
User: That is incorrect. I believe it was 2001.
Agent: You are right, it was 2001.
Compare this conversational format to the prompt-response format used for supervised fine-tuning:
### Instruction
When did Virgin Australia start operating?
### Context
Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline.
It is the largest airline by fleet size to use the Virgin brand.
It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.
It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001.
The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney.
### Response:
{to be generated by SFT model}
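For OpenAI fine-tuning, each conversation like the one above is ultimately serialized as a list of role/content messages. A minimal sketch of a single training example (the field names follow OpenAI's chat format; the content is illustrative):
chat_example = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "When did Virgin Australia start operating?"},
        {"role": "assistant", "content": "Virgin Australia commenced services on 31 August 2000 as Virgin Blue."},
    ]
}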
Training#
We will use the customer assistant dataset from this tutorial.
import argilla as rg
feedback_dataset = rg.FeedbackDataset.from_huggingface("argilla/customer_assistant")
Data Preparation
We will start with our basic example of a formatting function. For chat completion it should return chat-turn-role-text tuples, where the prompt is formatted according to a template. We require this split because each conversational chain needs to be reconstructed in the correct order, along with the role of each speaker.
Note
We infer a so-called message format because OpenAI expects this output format; this might differ for other scenarios.
from argilla.feedback import TrainingTask
from typing import Any, Dict, Iterator, List, Tuple
# Adaptation from LlamaIndex's TEXT_QA_PROMPT_TMPL_MSGS[1].content
user_message_prompt = """Context information is below.
---------------------
{context_str}
---------------------
Given the context information and not prior knowledge but keeping your Argilla Cloud assistant style, answer the query.
Query: {query_str}
Answer:
"""
# Adaptation from LlamaIndex's TEXT_QA_SYSTEM_PROMPT
system_prompt = """You are an expert customer service assistant for the Argilla Cloud product that is trusted around the world.
Always answer the query using the provided context information, and not prior knowledge.
Some rules to follow:
1. Never directly reference the given context in your answer.
2. Avoid statements like 'Based on the context, ...' or 'The context information ...' or anything along those lines.
"""
def formatting_func(sample: dict) -> Iterator[List[Tuple[str, str, str, str]]]:
from uuid import uuid4
if sample["response"]:
chat = str(uuid4())
user_message = user_message_prompt.format(context_str=sample["context"], query_str=sample["user-message"])
yield [
(chat, "0", "system", system_prompt),
(chat, "1", "user", user_message),
(chat, "2", "assistant", sample["response"][0]["value"])
]
task = TrainingTask.for_chat_completion(formatting_func=formatting_func)
ArgillaTrainer
We'll use the task directly with our FeedbackDataset in the ArgillaTrainer. The only configurable parameter is n_epochs, but this is also optimized internally.
from argilla.feedback import ArgillaTrainer
trainer = ArgillaTrainer(
dataset=feedback_dataset,
task=task,
framework="openai",
)
trainer.train(output_dir="chat-completion")
Inference
After training, we can use the model directly, but to do so we need to use the openai framework. Therefore, we suggest taking a look at their docs.
import openai
completion = openai.ChatCompletion.create(
model="ft:gpt-3.5-turbo:my-org:custom_suffix:id",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
]
)
Other datasets#
Note
The records classes covered in this section correspond to three datasets: DatasetForTextClassification, DatasetForTokenClassification, and DatasetForText2Text. These will be deprecated in Argilla 2.0 and replaced by the fully configurable FeedbackDataset class. Not sure which dataset to use? Check out our section on choosing a dataset.
The ArgillaTrainer#
The ArgillaTrainer is a wrapper around many of our favorite NLP libraries. It provides a very intuitive abstract representation to facilitate simple training workflows using decent default pre-set configurations without having to worry about any data transformations from Argilla.
Supported frameworks#
| Framework/Task | TextClassification | TokenClassification | Text2Text |
|---|---|---|---|
| OpenAI | ✔️ | | ✔️ |
| SetFit | ✔️ | | |
| spaCy | ✔️ | ✔️ | |
| Transformers | ✔️ | ✔️ | |
| PEFT | ✔️ | ✔️ | |
| SpanMarker | | ✔️ | |
Training configs#
The trainer also has an ArgillaTrainer.update_config() method, which maps a dict with **kwargs to the respective framework. So, these can be derived from the underlying framework that was used to initialize the trainer. Underneath, you can find an overview of these variables for the supported frameworks.
Note
Note that you donโt need to pass all of them directly and that the values below are their default configurations.
# `OpenAI.FineTune`
trainer.update_config(
training_file = None,
validation_file = None,
model = "gpt-3.5-turbo-0613",
hyperparameters = {"n_epochs": 1},
suffix = None
)
# `OpenAI.FineTune` (legacy)
trainer.update_config(
training_file = None,
validation_file = None,
model = "curie",
n_epochs = 2,
batch_size = None,
learning_rate_multiplier = 0.1,
prompt_loss_weight = 0.1,
compute_classification_metrics = False,
classification_n_classes = None,
classification_positive_class = None,
classification_betas = None,
suffix = None
)
# `setfit.SetFitModel`
trainer.update_config(
pretrained_model_name_or_path = "all-MiniLM-L6-v2",
force_download = False,
resume_download = False,
proxies = None,
token = None,
cache_dir = None,
local_files_only = False
)
# `setfit.SetFitTrainer`
trainer.update_config(
metric = "accuracy",
num_iterations = 20,
num_epochs = 1,
learning_rate = 2e-5,
batch_size = 16,
seed = 42,
use_amp = True,
warmup_proportion = 0.1,
distance_metric = "BatchHardTripletLossDistanceFunction.cosine_distance",
margin = 0.25,
samples_per_label = 2
)
# `spacy.training`
trainer.update_config(
dev_corpus = "corpora.dev",
train_corpus = "corpora.train",
seed = 42,
gpu_allocator = 0,
accumulate_gradient = 1,
patience = 1600,
max_epochs = 0,
max_steps = 20000,
eval_frequency = 200,
frozen_components = [],
annotating_components = [],
before_to_disk = None,
before_update = None
)
# `transformers.AutoModelForTextClassification`
trainer.update_config(
pretrained_model_name_or_path = "distilbert-base-uncased",
force_download = False,
resume_download = False,
proxies = None,
token = None,
cache_dir = None,
local_files_only = False
)
# `transformers.TrainingArguments`
trainer.update_config(
per_device_train_batch_size = 8,
per_device_eval_batch_size = 8,
gradient_accumulation_steps = 1,
learning_rate = 5e-5,
weight_decay = 0,
adam_beta1 = 0.9,
adam_beta2 = 0.9,
adam_epsilon = 1e-8,
max_grad_norm = 1,
num_train_epochs = 3,
max_steps = 0,
log_level = "passive",
logging_strategy = "steps",
save_strategy = "steps",
save_steps = 500,
seed = 42,
push_to_hub = False,
hub_model_id = "user_name/output_dir_name",
hub_strategy = "every_save",
hub_token = "1234",
hub_private_repo = False
)
# `peft.LoraConfig`
trainer.update_config(
r=8,
target_modules=None,
lora_alpha=16,
lora_dropout=0.1,
fan_in_fan_out=False,
bias="none",
inference_mode=False,
modules_to_save=None,
init_lora_weights=True,
)
# `transformers.AutoModelForTextClassification`
trainer.update_config(
pretrained_model_name_or_path = "distilbert-base-uncased",
force_download = False,
resume_download = False,
proxies = None,
token = None,
cache_dir = None,
local_files_only = False
)
# `transformers.TrainingArguments`
trainer.update_config(
per_device_train_batch_size = 8,
per_device_eval_batch_size = 8,
gradient_accumulation_steps = 1,
learning_rate = 5e-5,
weight_decay = 0,
adam_beta1 = 0.9,
adam_beta2 = 0.9,
adam_epsilon = 1e-8,
max_grad_norm = 1,
num_train_epochs = 3,
max_steps = 0,
log_level = "passive",
logging_strategy = "steps",
save_strategy = "steps",
save_steps = 500,
seed = 42,
push_to_hub = False,
hub_model_id = "user_name/output_dir_name",
hub_strategy = "every_save",
hub_token = "1234",
hub_private_repo = False
)
# `SpanMarkerConfig`
trainer.update_config(
pretrained_model_name_or_path = "distilbert-base-cased"
model_max_length = 256,
marker_max_length = 128,
entity_max_length = 8,
)
# `transformers.TrainingArguments`
trainer.update_config(
per_device_train_batch_size = 8,
per_device_eval_batch_size = 8,
gradient_accumulation_steps = 1,
learning_rate = 5e-5,
weight_decay = 0,
adam_beta1 = 0.9,
adam_beta2 = 0.9,
adam_epsilon = 1e-8,
max_grad_norm = 1,
num_train_epochs = 3,
max_steps = 0,
log_level = "passive",
logging_strategy = "steps",
save_strategy = "steps",
save_steps = 500,
seed = 42,
push_to_hub = False,
hub_model_id = "user_name/output_dir_name",
hub_strategy = "every_save",
hub_token = "1234",
hub_private_repo = False
)
Tasks#
In this part, weโll explore Text Classification, Token Classification, and Text2Text tasks. Weโll provide concise descriptions of what each task entails and the steps involved in training and making predictions.
Text Classification#
Background#
Text classification is a widely used NLP task where labels are assigned to text. Major companies rely on it for various applications. Sentiment analysis, a popular form of text classification, assigns labels like ๐ positive, ๐ negative, or ๐ neutral to text. Additionally, we distinguish between single- and multi-label text classification.
Single-label text classification refers to the task of assigning a single category or label to a given text sample. Each text is associated with only one predefined class or category. For example, in sentiment analysis, a single-label text classification task would involve assigning labels such as โpositive,โ โnegative,โ or โneutralโ to texts based on their sentiment.
"The help for my application of a new card and mortgage was great", "positive"
Multi-label text classification is generally more complex than single-label classification due to the challenge of determining and predicting multiple relevant labels for each text. It finds applications in various domains, including document tagging, topic labeling, and content recommendation systems. For example, in customer care, a multi-label text classification task would involve assigning topics such as โnew_card,โ โmortgage,โ or โopening_hoursโ to texts based on their content.
Tip
For a multi-label scenario, it is recommended to add some examples without any labels to improve model performance.
"The help for my application of a new card and mortgage was great", ["new_card", "mortgage"]
Training#
from argilla.feedback import ArgillaTrainer, FeedbackDataset, TrainingTask
dataset = FeedbackDataset.from_huggingface(
repo_id="argilla/emotion"
)
task = TrainingTask.for_text_classification(
text=dataset.field_by_name("text"),
label=dataset.question_by_name("label"),
)
trainer = ArgillaTrainer(
dataset=dataset,
task=task,
framework="setfit"
)
trainer.update_config(num_iterations=1)
trainer.train(output_dir="my_setfit_model")
trainer.predict("This is awesome!")
Token Classification#
Background#
Token classification is a crucial concept in the domain of NLP. It entails the act of assigning specific labels to individual words or tokens in a given text. These labels can encompass diverse linguistic or semantic attributes, such as part-of-speech annotations, named entities (including peopleโs names, organizations, or locations), or sentiment indicators (expressing positivity, negativity, or neutrality). This process serves as an indispensable foundation for numerous NLP applications, facilitating the extraction of valuable insights from textual data.
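A minimal sketch of such a record using the TokenClassificationRecord class, where entities are annotated as (label, character start, character end) spans; the sentence and labels are illustrative:
import argilla as rg
text = "Virgin Australia is based in Brisbane"
record = rg.TokenClassificationRecord(
    text=text,
    tokens=text.split(),
    # (label, char_start, char_end) spans over the original text
    annotation=[("ORG", 0, 16), ("LOC", 29, 37)],
)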
Training#
import argilla as rg
from datasets import load_dataset
from argilla.training import ArgillaTrainer
dataset_rg = rg.DatasetForTokenClassification.from_datasets(
dataset=load_dataset("conll2003", split="train[:100]"),
tags="ner_tags",
)
rg.log(dataset_rg, name="conll2003", workspace="admin")
trainer = ArgillaTrainer(
name="conll2003",
workspace="admin",
framework="spacy",
train_size=0.8
)
trainer.update_config(num_train_epochs=2)
trainer.train(output_dir="my_spacy_model")
records = trainer.predict("The ArgillaTrainer is great!", as_argilla_records=True)
rg.log(records=records, name="conll2003", workspace="admin")
Text2Text#
Background#
The Text2Text task in the realm of NLP represents a framework that takes a piece of text as input and transforms it into another. Instead of approaching different NLP challenges as isolated issues, T2T seeks to create a generalized solution by framing them as sequence-to-sequence transformations. In this approach, both the input and output are considered as sequences of text, and their lengths can vary.
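A minimal sketch of a record for this task using the Text2TextRecord class, here with an English-to-French translation pair (the values are illustrative):
import argilla as rg
record = rg.Text2TextRecord(
    text="The ArgillaTrainer is great!",
    annotation="L'ArgillaTrainer est génial !",
)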
Training#
import argilla as rg
from datasets import load_dataset
from argilla.training import ArgillaTrainer
dataset_rg = rg.DatasetForText2Text.from_datasets(
dataset=load_dataset("opus_books", "en-fr", split="train[:100]"),
tags="ner_tags",
)
rg.log(dataset_rg, name="opus_books", workspace="admin")
trainer = ArgillaTrainer(
name="opus_books",
workspace="admin",
framework="openAI",
train_size=0.8
)
trainer.update_config(n_epochs=2)
trainer.train(output_dir="my_openAI_model")
records = trainer.predict("The ArgillaTrainer is great!", as_argilla_records=True)
rg.log(records=records, name="opus_books", workspace="admin")
Other options#
Prepare for training#
If you want to train a model we provide a handy method to prepare your dataset: DatasetFor*.prepare_for_training(). It will return a Hugging Face dataset, a spaCy DocBin or a SparkNLP-formatted DataFrame, optimized for the training process with the Hugging Face Trainer, the spaCy CLI or the SparkNLP API. It is possible to directly include train-test splits in the prepare_for_training call by passing the train_size and test_size parameters.
import argilla as rg
dataset_rg = rg.load("<my_dataset>")
dataset_rg.prepare_for_training(framework="openai", train_size=1)
# [{'prompt': 'My title', 'completion': ' My content'}]
import argilla as rg
dataset_rg = rg.load("<my_dataset>")
dataset_rg.prepare_for_training(framework="autotrain", train_size=1)
# {'title': 'My title', 'content': 'My content', 'label': 0}
import argilla as rg
dataset_rg = rg.load("<my_dataset>")
dataset_rg.prepare_for_training(framework="setfit", train_size=1)
# {'title': 'My title', 'content': 'My content', 'label': 0}
import argilla as rg
import spacy
nlp = spacy.blank("en")
dataset_rg = rg.load("<my_dataset>")
dataset_rg.prepare_for_training(framework="spacy", lang=nlp, train_size=1)
# <spacy.tokens._serialize.DocBin object at 0x280613af0>
import argilla as rg
dataset_rg = rg.load("<my_dataset>")
dataset_rg.prepare_for_training(framework="transformers", train_size=1)
# {'title': 'My title', 'content': 'My content', 'label': 0}
import argilla as rg
dataset_rg = rg.load("<my_dataset>")
dataset_rg.prepare_for_training(framework="peft", train_size=1)
# {'title': 'My title', 'content': 'My content', 'label': 0}
import argilla as rg
dataset_rg = rg.load("<my_dataset>")
dataset_rg.prepare_for_training(framework="span_marker", train_size=1)
# {'title': 'My title', 'content': 'My content', 'label': 0}
import argilla as rg
dataset_rg = rg.load("<my_dataset>")
dataset_rg.prepare_for_training(framework="spark-nlp", train_size=1)
# <pd.DataFrame>
import argilla as rg
dataset_rg = rg.load("<my_dataset>")
dataset_rg.prepare_for_training(framework="trl", task=..., train_size=1)
CLI support#
We also have CLI support for the ArgillaTrainer. This can be used when, for example, executing training on an external machine. Note that --update-config-kwargs always uses the update_config() method for the corresponding class. Hence, you should take this into account when configuring training via the CLI by passing a JSON-serializable string (see the example invocation after the usage listing below).
Usage: python -m argilla train [OPTIONS] COMMAND [ARGS]...
Starts the ArgillaTrainer.
Options:
--name TEXT The name of the dataset to be used for training. [default: None]
--framework [transformers|peft|setfit|spacy| The framework to be used for training. [default: None]
spacy-transformers|span_marker|spark-nlp|
openai|trl|trlx|sentence-transformers]
--workspace TEXT The workspace to be used for training. [default: None]
--limit INTEGER The number of record to be used. [default: None]
--query TEXT The query to be used. [default: None]
--model TEXT The modelname or path to be used for training. [default: None]
--train-size FLOAT The train split to be used. [default: 1.0]
--seed INTEGER The random seed number. [default: 42]
--device INTEGER The GPU id to be used for training. [default: -1]
--output-dir TEXT Output directory for the saved model. [default: model]
--update-config-kwargs TEXT update_config() kwargs to be passed as a dictionary. [default: {}]
--help Show this message and exit.
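A hypothetical invocation, assuming a SetFit run on a dataset named my-dataset; all values are illustrative and the --update-config-kwargs payload must be JSON-serializable:
python -m argilla train \
    --name my-dataset \
    --workspace admin \
    --framework setfit \
    --train-size 0.8 \
    --output-dir my_setfit_model \
    --update-config-kwargs '{"num_iterations": 1}'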