{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# Fine-tuning a model with the 🤗 TLC Trainer API\n", "\n", "This notebook demonstrates how to use the 3LC Hugging Face Trainer API (`TLCTrainer`) to fine-tune the `bert-base-uncased` model on the GLUE MRPC dataset.\n", "\n", "![](../images/huggingface-mrcp.png)\n", "\n", "" ] },
{ "cell_type": "code", "execution_count": null, "id": "1", "metadata": { "tags": [ "parameters" ] }, "outputs": [], "source": [ "PROJECT_NAME = \"3LC Tutorials - Hugging Face BERT\"\n", "RUN_NAME = \"finetuning-run\"\n", "DESCRIPTION = \"Fine-tune BERT on MRPC\"\n", "TRAIN_DATASET_NAME = \"hugging-face-train\"\n", "VAL_DATASET_NAME = \"hugging-face-val\"\n", "CHECKPOINT = \"bert-base-uncased\"\n", "DEVICE = None\n", "TRAIN_BATCH_SIZE = 64\n", "EVAL_BATCH_SIZE = 256\n", "EPOCHS = 4\n", "NUM_WORKERS = 0\n", "OPTIMIZER = \"adamw_torch\"\n", "DOWNLOAD_PATH = \"../../transient_data\"\n", "INSTALL_DEPENDENCIES = True" ] },
{ "cell_type": "code", "execution_count": null, "id": "2", "metadata": {}, "outputs": [], "source": [ "if INSTALL_DEPENDENCIES:\n", "    %pip install accelerate\n", "    %pip install scikit-learn\n", "    %pip install 3lc[huggingface] \"transformers<=4.56.0\"" ] },
{ "cell_type": "code", "execution_count": null, "id": "3", "metadata": {}, "outputs": [], "source": [ "import os\n", "\n", "import datasets\n", "import evaluate\n", "import numpy as np\n", "import tlc\n", "import torch\n", "from transformers import AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding, TrainingArguments\n", "\n", "os.environ[\"TRANSFORMERS_NO_ADVISORY_WARNINGS\"] = \"true\"  # Suppress the BertTokenizerFast tokenizer warning\n", "\n", "datasets.utils.logging.disable_progress_bar()" ] },
{ "cell_type": "code", "execution_count": null, "id": "4", "metadata": {}, "outputs": [], "source": [ "if DEVICE is None:\n", "    if torch.cuda.is_available():\n", "        DEVICE = \"cuda\"\n", "    elif torch.backends.mps.is_available():\n", "        DEVICE = \"mps\"\n", "    else:\n", "        DEVICE = \"cpu\"" ] },
{ "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "## Initialize a 3LC Run\n", "\n", "We initialize a Run with a call to `tlc.init`, passing the project name, run name, and description defined above." ] },
{ "cell_type": "code", "execution_count": null, "id": "6", "metadata": {}, "outputs": [], "source": [ "run = tlc.init(\n", "    project_name=PROJECT_NAME,\n", "    run_name=RUN_NAME,\n", "    description=DESCRIPTION,\n", "    if_exists=\"overwrite\",\n", ")" ] },
{ "cell_type": "markdown", "id": "7", "metadata": {}, "source": [ "With the 3LC integration, you can use `tlc.Table.from_hugging_face()` as a drop-in replacement for\n", "`datasets.load_dataset()` to create a `tlc.Table`. The calls below pass `if_exists=\"overwrite\"`, so each table is recreated on every run; to work with the most recent revision of an existing table instead, call `.latest()` on the returned table." ] },
{ "cell_type": "code", "execution_count": null, "id": "8", "metadata": {}, "outputs": [], "source": [ "tlc_train_dataset = tlc.Table.from_hugging_face(\n", "    path=\"glue\",\n", "    name=\"mrpc\",\n", "    split=\"train\",\n", "    project_name=PROJECT_NAME,\n", "    dataset_name=TRAIN_DATASET_NAME,\n", "    if_exists=\"overwrite\",\n", ")\n", "\n", "tlc_val_dataset = tlc.Table.from_hugging_face(\n", "    path=\"glue\",\n", "    name=\"mrpc\",\n", "    split=\"validation\",\n", "    project_name=PROJECT_NAME,\n", "    dataset_name=VAL_DATASET_NAME,\n", "    if_exists=\"overwrite\",\n", ")" ] },
{ "cell_type": "markdown", "id": "9", "metadata": {}, "source": [ "`Table` provides a `map` method to apply both preprocessing and on-the-fly transforms to your data before it is sent to the model.\n", "\n", "Unlike Hugging Face's `Dataset.map`, which merges the function's output into each example for you, the function passed to `Table.map` is expected to return the complete sample. This is why `tokenize_function_tlc` below spreads the original `example` into its return value alongside the tokenizer output." ] },
{ "cell_type": "code", "execution_count": null, "id": "10", "metadata": {}, "outputs": [], "source": [ "tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)\n", "\n", "\n", "def tokenize_function_tlc(example):\n", "    return {**example, **tokenizer(example[\"sentence1\"], example[\"sentence2\"], truncation=True)}\n", "\n", "\n", "tlc_tokenized_dataset_train = tlc_train_dataset.map(tokenize_function_tlc)\n", "tlc_tokenized_dataset_val = tlc_val_dataset.map(tokenize_function_tlc)" ] },
{ "cell_type": "code", "execution_count": null, "id": "11", "metadata": {}, "outputs": [], "source": [ "data_collator = DataCollatorWithPadding(tokenizer=tokenizer)" ] },
{ "cell_type": "markdown", "id": "12", "metadata": {}, "source": [ "Here we define our model with two labels, matching the two MRPC classes (`not_equivalent` and `equivalent`)." ] },
{ "cell_type": "code", "execution_count": null, "id": "13", "metadata": {}, "outputs": [], "source": [ "# For demonstration purposes, we use the bert-base-uncased checkpoint, which has no fine-tuned\n", "# classification head. As a result, there will be a warning that the classifier weights are newly\n", "# initialized. This is expected and can be ignored.\n", "model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)" ] },
{ "cell_type": "markdown", "id": "14", "metadata": {}, "source": [ "## Set Up Metrics Collection\n", "\n", "Metrics are computed by implementing a function that returns the per-sample metrics you would like to see in the 3LC Dashboard.\n", "\n", "This differs from Hugging Face's original `compute_metrics`, which returns aggregate metrics rather than one value per sample. Here we want results at per-sample granularity."
] }, { "cell_type": "code", "execution_count": null, "id": "15", "metadata": {}, "outputs": [], "source": [ "def compute_tlc_metrics(logits, labels):\n", "    # Per-class probabilities for each sample\n", "    probabilities = torch.nn.functional.softmax(logits, dim=-1)\n", "\n", "    predictions = logits.argmax(dim=-1)\n", "    # reduction=\"none\" keeps one loss value per sample instead of a batch mean\n", "    loss = torch.nn.functional.cross_entropy(logits, labels, reduction=\"none\")\n", "    # Probability of the predicted class; squeeze only the last dim so a batch of one stays 1-D\n", "    confidence = probabilities.gather(dim=-1, index=predictions.unsqueeze(-1)).squeeze(-1)\n", "\n", "    return {\n", "        \"predicted\": predictions,\n", "        \"loss\": loss,\n", "        \"confidence\": confidence,\n", "    }\n", "\n", "\n", "id2label = {0: \"not_equivalent\", 1: \"equivalent\"}\n", "schemas = {\n", "    \"predicted\": tlc.CategoricalLabelSchema(display_name=\"Predicted Label\", classes=id2label, display_importance=4005),\n", "    \"loss\": tlc.Schema(display_name=\"Loss\", writable=False, value=tlc.Float32Value()),\n", "    \"confidence\": tlc.Schema(display_name=\"Confidence\", writable=False, value=tlc.Float32Value()),\n", "}\n", "compute_tlc_metrics.column_schemas = schemas" ] },
{ "cell_type": "code", "execution_count": null, "id": "16", "metadata": {}, "outputs": [], "source": [ "# Add references to the input datasets used by the Run.\n", "run.add_input_table(tlc_train_dataset)\n", "run.add_input_table(tlc_val_dataset)" ] },
{ "cell_type": "markdown", "id": "17", "metadata": {}, "source": [ "## Train the model with TLCTrainer\n", "\n", "To perform model training, we replace the usual `Trainer` with `TLCTrainer` and provide the per-sample metrics collection function.\n", "\n", "In this example, we still compute the aggregate GLUE MRPC metrics with the *compute_hf_metrics* function (the usual *compute_metrics* argument is renamed to *compute_hf_metrics* to avoid confusion).\n", "\n", "We also compute our per-sample 3LC metrics with the *compute_tlc_metrics* function.\n", "\n", "For the latter, we can choose when to start collecting metrics, here at epoch 2 (indexed from 0, set with *tlc_metrics_collection_start*), and how often, here every epoch (set with *tlc_metrics_collection_epoch_frequency*).\n", "\n", "You can also switch the evaluation strategy to \"steps\" with *eval_strategy* and specify the frequency with *eval_steps*. In that case, if you use *tlc_metrics_collection_start*, it should be a multiple of *eval_steps*. Note that *tlc_metrics_collection_epoch_frequency* is disabled in this case, because the original *eval_steps* variable is used instead.\n", "\n", "We also specify that we would like to collect metrics prior to training with *compute_tlc_metrics_on_train_begin*."
] }, { "cell_type": "code", "execution_count": null, "id": "18", "metadata": {}, "outputs": [], "source": [ "from tlc.integration.hugging_face import TLCTrainer\n", "\n", "\n", "def compute_metrics(eval_preds):\n", "    metric = evaluate.load(\"glue\", \"mrpc\")\n", "    logits, labels = eval_preds\n", "    predictions = np.argmax(logits, axis=-1)\n", "    return metric.compute(predictions=predictions, references=labels)\n", "\n", "\n", "training_args = TrainingArguments(\n", "    output_dir=DOWNLOAD_PATH,\n", "    per_device_train_batch_size=TRAIN_BATCH_SIZE,\n", "    per_device_eval_batch_size=EVAL_BATCH_SIZE,\n", "    optim=OPTIMIZER,\n", "    num_train_epochs=EPOCHS,\n", "    report_to=\"none\",  # Disable wandb logging\n", "    use_cpu=DEVICE == \"cpu\",\n", "    eval_strategy=\"epoch\",\n", "    disable_tqdm=True,\n", "    dataloader_num_workers=NUM_WORKERS,\n", "    # eval_strategy=\"steps\",  # For running metrics on steps\n", "    # eval_steps=20,  # For running metrics on steps\n", ")\n", "\n", "trainer = TLCTrainer(\n", "    model=model,\n", "    args=training_args,\n", "    train_dataset=tlc_tokenized_dataset_train,\n", "    eval_dataset=tlc_tokenized_dataset_val,\n", "    tokenizer=tokenizer,\n", "    data_collator=data_collator,\n", "    compute_hf_metrics=compute_metrics,\n", "    compute_tlc_metrics=compute_tlc_metrics,\n", "    compute_tlc_metrics_on_train_begin=True,\n", "    compute_tlc_metrics_on_train_end=False,\n", "    tlc_metrics_collection_start=2,\n", "    tlc_metrics_collection_epoch_frequency=1,\n", ")" ] },
{ "cell_type": "code", "execution_count": null, "id": "19", "metadata": {}, "outputs": [], "source": [ "trainer.train()" ] },
{ "cell_type": "code", "execution_count": null, "id": "20", "metadata": {}, "outputs": [], "source": [ "run.set_status_completed()" ] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python",
"nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.9" } }, "nbformat": 4, "nbformat_minor": 5 }