{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0",
   "metadata": {},
   "source": [
    "# Fine-tuning a model with the 🤗 TLC Trainer API\n",
    "\n",
    "This notebook demonstrates how to use the 3LC Hugging Face `TLCTrainer` API to fine-tune the `bert-base-uncased` model on the GLUE MRPC dataset.\n",
    "\n",
    "![](../images/huggingface-mrcp.png)\n",
    "\n",
    "<!-- Tags: [\"text\", \"hugging-face\", \"training\", \"metrics\", \"bert\"] -->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1",
   "metadata": {
    "tags": [
     "parameters"
    ]
   },
   "outputs": [],
   "source": [
    "PROJECT_NAME = \"3LC Tutorials - Hugging Face BERT\"\n",
    "RUN_NAME = \"finetuning-run\"\n",
    "DESCRIPTION = \"Fine-tune BERT on MRPC\"\n",
    "TRAIN_DATASET_NAME = \"hugging-face-train\"\n",
    "VAL_DATASET_NAME = \"hugging-face-val\"\n",
    "CHECKPOINT = \"bert-base-uncased\"\n",
    "DEVICE = None\n",
    "TRAIN_BATCH_SIZE = 64\n",
    "EVAL_BATCH_SIZE = 256\n",
    "EPOCHS = 4\n",
    "NUM_WORKERS = 0\n",
    "OPTIMIZER = \"adamw_torch\"\n",
    "TMP_PATH = \"../../transient_data\"\n",
    "INSTALL_DEPENDENCIES = True"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2",
   "metadata": {},
   "outputs": [],
   "source": [
    "if INSTALL_DEPENDENCIES:\n",
    "    %pip install -q accelerate\n",
    "    %pip install -q scikit-learn\n",
    "    %pip install -q 3lc[huggingface] \"transformers<=4.56.0\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3",
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "import datasets\n",
    "import evaluate\n",
    "import numpy as np\n",
    "import tlc\n",
    "import torch\n",
    "from transformers import AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding, TrainingArguments\n",
    "\n",
    "os.environ[\"TRANSFORMERS_NO_ADVISORY_WARNINGS\"] = \"true\"  # Removing BertTokenizerFast tokenizer warning\n",
    "\n",
    "datasets.utils.logging.disable_progress_bar()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4",
   "metadata": {},
   "outputs": [],
   "source": [
    "if DEVICE is None:\n",
    "    if torch.cuda.is_available():\n",
    "        DEVICE = \"cuda\"\n",
    "    elif torch.backends.mps.is_available():\n",
    "        DEVICE = \"mps\"\n",
    "    else:\n",
    "        DEVICE = \"cpu\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5",
   "metadata": {},
   "source": [
    "## Initialize a 3LC Run\n",
    "\n",
    "We initialize a Run with a call to `tlc.init`, and add the configuration to the Run object."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6",
   "metadata": {},
   "outputs": [],
   "source": [
    "run = tlc.init(\n",
    "    project_name=PROJECT_NAME,\n",
    "    run_name=RUN_NAME,\n",
    "    description=DESCRIPTION,\n",
    "    if_exists=\"overwrite\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7",
   "metadata": {},
   "source": [
    "With the 3LC integration, you can use `tlc.Table.from_hugging_face()` as a drop-in replacement for\n",
    "`datasets.load_dataset()` to create a `tlc.Table`. The cell below does not call `.latest()`; chaining it onto the returned table would fetch the latest version of the 3LC dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8",
   "metadata": {},
   "outputs": [],
   "source": [
    "tlc_train_dataset = tlc.Table.from_hugging_face(\n",
    "    path=\"glue\",\n",
    "    name=\"mrpc\",\n",
    "    split=\"train\",\n",
    "    project_name=PROJECT_NAME,\n",
    "    dataset_name=TRAIN_DATASET_NAME,\n",
    "    if_exists=\"overwrite\",\n",
    ")\n",
    "\n",
    "tlc_val_dataset = tlc.Table.from_hugging_face(\n",
    "    path=\"glue\",\n",
    "    name=\"mrpc\",\n",
    "    split=\"validation\",\n",
    "    project_name=PROJECT_NAME,\n",
    "    dataset_name=VAL_DATASET_NAME,\n",
    "    if_exists=\"overwrite\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9",
   "metadata": {},
   "source": [
    "`Table` provides a `map` method to apply both preprocessing and on-the-fly transforms to your data before it is sent to the model.\n",
    "\n",
    "This differs from Hugging Face's `Dataset.map`, which merges the returned fields into each example for you. Here the mapped function returns the complete sample itself, which is why `tokenize_function_tlc` merges the original example with the tokenizer output."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "10",
   "metadata": {},
   "outputs": [],
   "source": [
    "tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)\n",
    "\n",
    "\n",
    "def tokenize_function_tlc(example):\n",
    "    return {**example, **tokenizer(example[\"sentence1\"], example[\"sentence2\"], truncation=True)}\n",
    "\n",
    "\n",
    "tlc_tokenized_dataset_train = tlc_train_dataset.map(tokenize_function_tlc)\n",
    "tlc_tokenized_dataset_val = tlc_val_dataset.map(tokenize_function_tlc)"
   ]
  },
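  {
   "cell_type": "markdown",
   "id": "sketch-map-merge",
   "metadata": {},
   "source": [
    "As a minimal sketch of the merge pattern used in `tokenize_function_tlc`, the snippet below uses a made-up `fake_tokenizer` (a hypothetical stand-in, not part of `transformers` or `tlc`) to show that unpacking both dictionaries keeps the original columns next to the new token fields:\n",
    "\n",
    "```python\n",
    "def fake_tokenizer(sentence1, sentence2):\n",
    "    # Hypothetical stand-in for a tokenizer call; a real tokenizer returns real token ids.\n",
    "    tokens = sentence1.split() + ['[SEP]'] + sentence2.split()\n",
    "    return {'input_ids': list(range(len(tokens))), 'attention_mask': [1] * len(tokens)}\n",
    "\n",
    "\n",
    "def tokenize_example(example):\n",
    "    # Merge the original columns with the tokenizer output, as tokenize_function_tlc does.\n",
    "    return {**example, **fake_tokenizer(example['sentence1'], example['sentence2'])}\n",
    "\n",
    "\n",
    "example = {'sentence1': 'A cat sat .', 'sentence2': 'A cat was sitting .', 'label': 1}\n",
    "mapped = tokenize_example(example)\n",
    "print(sorted(mapped))  # original columns and token fields side by side\n",
    "```\n",
    "\n",
    "If the function returned only the tokenizer output, `label` and the sentence columns would be missing from the returned dictionary."
   ]
  },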
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "11",
   "metadata": {},
   "outputs": [],
   "source": [
    "data_collator = DataCollatorWithPadding(tokenizer=tokenizer)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "12",
   "metadata": {},
   "source": [
    "Here we define our model with two labels, one for each MRPC class (`not_equivalent` and `equivalent`)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "13",
   "metadata": {},
   "outputs": [],
   "source": [
    "# bert-base-uncased is a pretrained checkpoint without a fine-tuned sequence-classification head, so\n",
    "# loading it here produces a warning that the classifier weights are newly initialized.\n",
    "# This is expected and can be ignored.\n",
    "model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "14",
   "metadata": {},
   "source": [
    "## Setup Metrics Collection\n",
    "\n",
    "Metrics are computed by implementing a function that returns the per-sample metrics you would like to see in the 3LC Dashboard.\n",
    "\n",
    "This differs from the usual Hugging Face `compute_metrics`, which returns aggregate metrics rather than one value per sample. Here we want results at per-sample granularity."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "15",
   "metadata": {},
   "outputs": [],
   "source": [
    "def compute_tlc_metrics(logits, labels):\n",
    "    probabilities = torch.nn.functional.softmax(logits, dim=-1)\n",
    "\n",
    "    predictions = logits.argmax(dim=-1)\n",
    "    loss = torch.nn.functional.cross_entropy(logits, labels, reduction=\"none\")\n",
    "    # Confidence is the probability of the predicted class. Squeeze only the gathered dimension\n",
    "    # so that a batch of size 1 keeps its batch dimension.\n",
    "    confidence = probabilities.gather(dim=-1, index=predictions.unsqueeze(-1)).squeeze(-1)\n",
    "\n",
    "    return {\n",
    "        \"predicted\": predictions,\n",
    "        \"loss\": loss,\n",
    "        \"confidence\": confidence,\n",
    "    }\n",
    "\n",
    "\n",
    "id2label = {0: \"not_equivalent\", 1: \"equivalent\"}\n",
    "schemas = {\n",
    "    \"predicted\": tlc.CategoricalLabelSchema(display_name=\"Predicted Label\", classes=id2label, display_importance=4005),\n",
    "    \"loss\": tlc.Schema(display_name=\"Loss\", writable=False, value=tlc.Float32Value()),\n",
    "    \"confidence\": tlc.Schema(display_name=\"Confidence\", writable=False, value=tlc.Float32Value()),\n",
    "}\n",
    "compute_tlc_metrics.column_schemas = schemas"
   ]
  },
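  {
   "cell_type": "markdown",
   "id": "sketch-per-sample-metrics",
   "metadata": {},
   "source": [
    "To make the per-sample granularity concrete, here is a pure-Python sketch of the same three quantities for a tiny batch. The logits and labels are made-up values, and `per_sample_metrics` is a hypothetical helper written for illustration, not part of `tlc`:\n",
    "\n",
    "```python\n",
    "import math\n",
    "\n",
    "\n",
    "def per_sample_metrics(logits, labels):\n",
    "    # logits: one list of class scores per sample; labels: one integer class id per sample.\n",
    "    rows = []\n",
    "    for scores, label in zip(logits, labels):\n",
    "        shift = max(scores)  # subtract the max for numerical stability\n",
    "        exps = [math.exp(s - shift) for s in scores]\n",
    "        probs = [e / sum(exps) for e in exps]\n",
    "        predicted = probs.index(max(probs))\n",
    "        rows.append({\n",
    "            'predicted': predicted,\n",
    "            'loss': -math.log(probs[label]),  # cross-entropy of this sample only\n",
    "            'confidence': probs[predicted],\n",
    "        })\n",
    "    return rows\n",
    "\n",
    "\n",
    "logits = [[2.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # made-up values\n",
    "labels = [0, 0, 1]\n",
    "rows = per_sample_metrics(logits, labels)\n",
    "\n",
    "# A single aggregate number, for contrast with the per-sample rows.\n",
    "accuracy = sum(r['predicted'] == y for r, y in zip(rows, labels)) / len(labels)\n",
    "for row in rows:\n",
    "    print(row)\n",
    "print(accuracy)\n",
    "```\n",
    "\n",
    "The aggregate accuracy summarizes the batch in one number, while the per-sample rows show which samples were misclassified and how large each loss is."
   ]
  },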
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "16",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Add references to the input datasets used by the Run.\n",
    "run.add_input_table(tlc_train_dataset)\n",
    "run.add_input_table(tlc_val_dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "17",
   "metadata": {},
   "source": [
    "## Train the model with TLCTrainer\n",
    "\n",
    "To perform model training, we replace the usual `Trainer` with `TLCTrainer` and provide the per-sample metrics collection function.\n",
    "\n",
    "In this example, we still compute the standard GLUE MRPC metrics with the function passed as *compute_hf_metrics* (the usual *compute_metrics* argument is renamed to *compute_hf_metrics* to avoid confusion).\n",
    "\n",
    "We also compute our per-sample 3LC metrics with the *compute_tlc_metrics* function.\n",
    "\n",
    "For the latter, we can choose when to start collecting metrics, here at epoch 2 (indexed from 0, set with *tlc_metrics_collection_start*), and how often, here every epoch (set with *tlc_metrics_collection_epoch_frequency*).\n",
    "\n",
    "You can also switch the metrics strategy to \"steps\" with *eval_strategy* and specify the frequency with *eval_steps*. In that case, if you use *tlc_metrics_collection_start*, it should be a multiple of *eval_steps*. Note that *tlc_metrics_collection_epoch_frequency* is disabled in this case because the original *eval_steps* variable is used instead.\n",
    "\n",
    "We also specify that we would like to collect metrics prior to training with *compute_tlc_metrics_on_train_begin*."
   ]
  },
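  {
   "cell_type": "markdown",
   "id": "sketch-collection-schedule",
   "metadata": {},
   "source": [
    "One way to read the two collection parameters is sketched below. `collection_epochs` is a hypothetical helper, not part of `tlc`, and it assumes that collection happens at the 0-indexed epoch `start` and then every `frequency` epochs; the exact behavior of `TLCTrainer` may differ:\n",
    "\n",
    "```python\n",
    "def collection_epochs(num_epochs, start, frequency):\n",
    "    # Hypothetical helper: 0-indexed epochs at which per-sample metrics would be collected,\n",
    "    # assuming collection begins at `start` and repeats every `frequency` epochs.\n",
    "    return list(range(start, num_epochs, frequency))\n",
    "\n",
    "\n",
    "# Values used in this notebook: EPOCHS = 4, start = 2, frequency = 1.\n",
    "print(collection_epochs(num_epochs=4, start=2, frequency=1))\n",
    "```\n",
    "\n",
    "Under this assumption, the settings in this notebook would collect per-sample metrics in the last two of the four epochs, in addition to the collection before training starts."
   ]
  },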
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "18",
   "metadata": {},
   "outputs": [],
   "source": [
    "from tlc.integration.hugging_face import TLCTrainer\n",
    "\n",
    "\n",
    "def compute_metrics(eval_preds):\n",
    "    metric = evaluate.load(\"glue\", \"mrpc\")\n",
    "    logits, labels = eval_preds\n",
    "    predictions = np.argmax(logits, axis=-1)\n",
    "    return metric.compute(predictions=predictions, references=labels)\n",
    "\n",
    "\n",
    "training_args = TrainingArguments(\n",
    "    output_dir=TMP_PATH,\n",
    "    per_device_train_batch_size=TRAIN_BATCH_SIZE,\n",
    "    per_device_eval_batch_size=EVAL_BATCH_SIZE,\n",
    "    optim=OPTIMIZER,\n",
    "    num_train_epochs=EPOCHS,\n",
    "    report_to=\"none\",  # Disable wandb logging\n",
    "    use_cpu=DEVICE == \"cpu\",\n",
    "    eval_strategy=\"epoch\",\n",
    "    disable_tqdm=True,\n",
    "    dataloader_num_workers=NUM_WORKERS,\n",
    "    # eval_strategy=\"steps\",  # For running metrics on steps\n",
    "    # eval_steps=20,  # For running metrics on steps\n",
    ")\n",
    "\n",
    "trainer = TLCTrainer(\n",
    "    model=model,\n",
    "    args=training_args,\n",
    "    train_dataset=tlc_tokenized_dataset_train,\n",
    "    eval_dataset=tlc_tokenized_dataset_val,\n",
    "    tokenizer=tokenizer,\n",
    "    data_collator=data_collator,\n",
    "    compute_hf_metrics=compute_metrics,\n",
    "    compute_tlc_metrics=compute_tlc_metrics,\n",
    "    compute_tlc_metrics_on_train_begin=True,\n",
    "    compute_tlc_metrics_on_train_end=False,\n",
    "    tlc_metrics_collection_start=2,\n",
    "    tlc_metrics_collection_epoch_frequency=1,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "19",
   "metadata": {},
   "outputs": [],
   "source": [
    "trainer.train()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "20",
   "metadata": {},
   "outputs": [],
   "source": [
    "run.set_status_completed()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}