{"metadata":{"kernelspec":{"language":"python","display_name":"Python 3","name":"python3"},"language_info":{"pygments_lexer":"ipython3","nbconvert_exporter":"python","version":"3.6.4","file_extension":".py","codemirror_mode":{"name":"ipython","version":3},"name":"python","mimetype":"text/x-python"}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"# Fine-tuned RoBERTa for multi-label text classification in Spanish (CPV codes)\n\nThis notebook follows the steps in the tutorial by Niels Rogge https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/BERT/Fine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb, originally written for BERT and adapted for the CPV multilabel classification problem in the Spanish language (using RoBERTa trained in the BNE corpus) by María Navas-Loro.\n\n\n## Set-up environment\n\nFirst, we install the libraries which we'll use: HuggingFace Transformers and Datasets.","metadata":{"id":"kLB3I4FKZ5Lr"}},{"cell_type":"code","source":"!pip install -q transformers datasets\n!pip install -y ipywidgets\n!pip uninstall -y jupyterlab_widgets\n!pip install -y jupyterlab_widgets\n!pip install huggingface_hub","metadata":{"id":"4wxY3x-ZZz8h","outputId":"d40e1cee-354e-4c9c-99e1-0dfbdf8da690","execution":{"iopub.status.busy":"2022-04-28T10:36:31.223722Z","iopub.execute_input":"2022-04-28T10:36:31.224229Z","iopub.status.idle":"2022-04-28T10:36:51.653532Z","shell.execute_reply.started":"2022-04-28T10:36:31.22414Z","shell.execute_reply":"2022-04-28T10:36:51.652701Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"!sudo apt-get install git-lfs","metadata":{"execution":{"iopub.status.busy":"2022-04-28T10:36:51.656075Z","iopub.execute_input":"2022-04-28T10:36:51.656335Z","iopub.status.idle":"2022-04-28T10:36:57.82088Z","shell.execute_reply.started":"2022-04-28T10:36:51.656302Z","shell.execute_reply":"2022-04-28T10:36:57.81974Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"from huggingface_hub import notebook_login\nnotebook_login()","metadata":{"execution":{"iopub.status.busy":"2022-04-28T10:36:57.825153Z","iopub.execute_input":"2022-04-28T10:36:57.825391Z","iopub.status.idle":"2022-04-28T10:36:59.52126Z","shell.execute_reply.started":"2022-04-28T10:36:57.825352Z","shell.execute_reply":"2022-04-28T10:36:59.520328Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Load dataset\n\nNext, let's download a multi-label text classification dataset from the [hub](https://huggingface.co/).\n\nAt the time of writing, I picked a random one as follows: \n\n* first, go to the \"datasets\" tab on huggingface.co\n* next, select the \"multi-label-classification\" tag on the left as well as the the \"1k<10k\" tag (fo find a relatively small dataset).\n\nNote that you can also easily load your local data (i.e. csv files, txt files, Parquet files, JSON, ...) as explained [here](https://huggingface.co/docs/datasets/loading.html#local-and-remote-files).\n\n","metadata":{"id":"bIH9NP0MZ6-O"}},{"cell_type":"markdown","source":"# Data preparation\n\nWe prepare our datasets, splitting the CPV column in 45 different columns with binary values.","metadata":{"id":"_V7ars6WxrbQ"}},{"cell_type":"code","source":"import pandas as pd","metadata":{"id":"LcLL3LnWEDIC","execution":{"iopub.status.busy":"2022-04-28T10:36:59.523763Z","iopub.execute_input":"2022-04-28T10:36:59.524111Z","iopub.status.idle":"2022-04-28T10:36:59.529292Z","shell.execute_reply.started":"2022-04-28T10:36:59.524072Z","shell.execute_reply":"2022-04-28T10:36:59.528527Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"# We read the input data\ndf = pd.read_csv('../input/dataset/train.csv')\ndftest = pd.read_csv('../input/dataset/test.csv')\n\n#df = pd.read_csv('../input/dataset10/train10.csv')\n#dftest = pd.read_csv('../input/dataset10/test10.csv')\n\n#dftest['descripcion'] = dftest['descripcion'].apply(lambda x: x.strip('\"'))\n#df['descripcion'] = df['descripcion'].apply(lambda x: x.strip('\"'))\n\ndf.pop('Unnamed: 0')\ndftest.pop('Unnamed: 0')\ndf.head()\n","metadata":{"id":"4WYl2oz8zYEM","outputId":"d63feee1-971c-40ef-c4b0-e7b41960eb4c","execution":{"iopub.status.busy":"2022-04-28T10:36:59.530604Z","iopub.execute_input":"2022-04-28T10:36:59.530993Z","iopub.status.idle":"2022-04-28T10:37:00.713662Z","shell.execute_reply.started":"2022-04-28T10:36:59.530955Z","shell.execute_reply":"2022-04-28T10:37:00.712963Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"labels = df.columns[1:]\n\nid2label = {idx:label for idx, label in enumerate(labels)}\nlabel2id = {label:idx for idx, label in enumerate(labels)}","metadata":{"execution":{"iopub.status.busy":"2022-04-28T10:37:00.717272Z","iopub.execute_input":"2022-04-28T10:37:00.719187Z","iopub.status.idle":"2022-04-28T10:37:00.725568Z","shell.execute_reply.started":"2022-04-28T10:37:00.719148Z","shell.execute_reply":"2022-04-28T10:37:00.724962Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Preprocess data\n\nNow we have the data ready, we have to tokenize it so the model can read it.","metadata":{"id":"nJ3Teyjmank2"}},{"cell_type":"code","source":"from transformers import AutoTokenizer\nimport numpy as np\n\ntokenizer = AutoTokenizer.from_pretrained(\"PlanTL-GOB-ES/roberta-base-bne\")\n\nimport pyarrow as pa\nimport pyarrow.dataset as ds\nimport pandas as pd\nfrom datasets import Dataset\n\ndatasettrain = Dataset(pa.Table.from_pandas(df))\n\ndatasettest = Dataset(pa.Table.from_pandas(dftest))\n\ndef preprocess_data(df):\n # take a batch of texts\n text = df[\"descripcion\"]\n # encode them\n encoding = tokenizer(text, padding=\"max_length\", truncation=True, max_length=128)\n # add labels\n labels_batch = {k: df[k] for k in labels}\n # create numpy array of shape (batch_size, num_labels)\n labels_matrix = np.zeros((len(text), len(labels)))\n # fill numpy array\n for idx, label in enumerate(labels):\n labels_matrix[:, idx] = labels_batch[label]\n\n encoding[\"labels\"] = labels_matrix.tolist()\n \n return encoding","metadata":{"id":"AFWlSsbZaRLc","outputId":"6302326f-3af0-4964-dbb6-1bd5e9aa5c5c","execution":{"iopub.status.busy":"2022-04-28T10:37:00.729782Z","iopub.execute_input":"2022-04-28T10:37:00.731996Z","iopub.status.idle":"2022-04-28T10:37:08.69608Z","shell.execute_reply.started":"2022-04-28T10:37:00.73196Z","shell.execute_reply":"2022-04-28T10:37:08.695285Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"encoded_dataset = datasettrain.map(preprocess_data, batched=True, remove_columns=datasettrain.column_names)","metadata":{"id":"i4ENBTdulBEI","outputId":"e13f9211-194d-4e2e-fa7b-07a1c727015d","execution":{"iopub.status.busy":"2022-04-28T10:37:08.700518Z","iopub.execute_input":"2022-04-28T10:37:08.702736Z","iopub.status.idle":"2022-04-28T10:37:21.379465Z","shell.execute_reply.started":"2022-04-28T10:37:08.702696Z","shell.execute_reply":"2022-04-28T10:37:21.378769Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"encoded_dataset_test = datasettest.map(preprocess_data, batched=True, remove_columns=datasettest.column_names)","metadata":{"id":"cb0j1aHS6JdC","outputId":"ae1ab3cf-2f9b-4bce-9168-71629c009ddd","execution":{"iopub.status.busy":"2022-04-28T10:37:21.380665Z","iopub.execute_input":"2022-04-28T10:37:21.380998Z","iopub.status.idle":"2022-04-28T10:37:26.844134Z","shell.execute_reply.started":"2022-04-28T10:37:21.380953Z","shell.execute_reply":"2022-04-28T10:37:26.843442Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"# To check an example\n#example = encoded_dataset[0]\n#print(example.keys())","metadata":{"id":"0enAb0W9o25W","execution":{"iopub.status.busy":"2022-04-28T10:37:26.847069Z","iopub.execute_input":"2022-04-28T10:37:26.847781Z","iopub.status.idle":"2022-04-28T10:37:26.851113Z","shell.execute_reply.started":"2022-04-28T10:37:26.84774Z","shell.execute_reply":"2022-04-28T10:37:26.850196Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"#tokenizer.decode(example['input_ids'])","metadata":{"id":"D0McCtJ8HRJY","execution":{"iopub.status.busy":"2022-04-28T10:37:26.852283Z","iopub.execute_input":"2022-04-28T10:37:26.853028Z","iopub.status.idle":"2022-04-28T10:37:26.863288Z","shell.execute_reply.started":"2022-04-28T10:37:26.852991Z","shell.execute_reply":"2022-04-28T10:37:26.862626Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"# Labels of this example\n#example['labels']","metadata":{"id":"VdIvj6WjHeZQ","execution":{"iopub.status.busy":"2022-04-28T10:37:26.864783Z","iopub.execute_input":"2022-04-28T10:37:26.865309Z","iopub.status.idle":"2022-04-28T10:37:26.872081Z","shell.execute_reply.started":"2022-04-28T10:37:26.865273Z","shell.execute_reply":"2022-04-28T10:37:26.871368Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"# To translate this output to the actual positive labels\n# [id2label[idx] for idx, label in enumerate(example['labels']) if label == 1.0]","metadata":{"id":"q4Dx95t2o6N9","execution":{"iopub.status.busy":"2022-04-28T10:37:26.87355Z","iopub.execute_input":"2022-04-28T10:37:26.874149Z","iopub.status.idle":"2022-04-28T10:37:26.879485Z","shell.execute_reply.started":"2022-04-28T10:37:26.874087Z","shell.execute_reply":"2022-04-28T10:37:26.878765Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"encoded_dataset.set_format(\"torch\")","metadata":{"id":"Lk6Cq9duKBkA","execution":{"iopub.status.busy":"2022-04-28T10:37:26.880471Z","iopub.execute_input":"2022-04-28T10:37:26.880715Z","iopub.status.idle":"2022-04-28T10:37:26.888442Z","shell.execute_reply.started":"2022-04-28T10:37:26.880682Z","shell.execute_reply":"2022-04-28T10:37:26.887727Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Define model\n\nWe will now finetune the RoBERTa model for Spanish trained on the BNE corpus, adding a random initialized classification head (linear layer) on top. One should fine-tune this head, together with the pre-trained base on a labeled dataset.\n","metadata":{"id":"w5qSmCgWefWs"}},{"cell_type":"code","source":"from transformers import AutoModelForSequenceClassification\n\nmodel = AutoModelForSequenceClassification.from_pretrained(\"PlanTL-GOB-ES/roberta-base-bne\", \n problem_type=\"multi_label_classification\", \n num_labels=len(labels),\n id2label=id2label,\n label2id=label2id)","metadata":{"id":"6XPL1Z_RegBF","outputId":"863b07ca-4320-47cc-840c-01b2302fdd62","execution":{"iopub.status.busy":"2022-04-28T10:37:26.889851Z","iopub.execute_input":"2022-04-28T10:37:26.890342Z","iopub.status.idle":"2022-04-28T10:37:41.196814Z","shell.execute_reply.started":"2022-04-28T10:37:26.890307Z","shell.execute_reply":"2022-04-28T10:37:41.196086Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Train the model!\n\nWe are going to train the model using HuggingFace's Trainer API. This requires us to define 2 things: \n\n* `TrainingArguments`, which specify training hyperparameters. All options can be found in the [docs](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments). Below, we for example specify that we want to evaluate after every epoch of training, we would like to save the model every epoch, we set the learning rate, the batch size to use for training/evaluation, how many epochs to train for, and so on.\n* a `Trainer` object (docs can be found [here](https://huggingface.co/transformers/main_classes/trainer.html#id1)).","metadata":{"id":"mjJGEXShp7te"}},{"cell_type":"code","source":"batch_size = 8\nmetric_name = \"f1\"","metadata":{"id":"K5a8_vIKqr7P","execution":{"iopub.status.busy":"2022-04-28T10:37:41.198287Z","iopub.execute_input":"2022-04-28T10:37:41.198586Z","iopub.status.idle":"2022-04-28T10:37:41.204182Z","shell.execute_reply.started":"2022-04-28T10:37:41.198549Z","shell.execute_reply":"2022-04-28T10:37:41.203494Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"from transformers import TrainingArguments, Trainer\n\nargs = TrainingArguments(\n f\"roberta-finetuned-CPV_Spanish\",\n evaluation_strategy = \"epoch\",\n save_strategy = \"epoch\",\n learning_rate=2e-5,\n per_device_train_batch_size=batch_size,\n per_device_eval_batch_size=batch_size,\n num_train_epochs=10,\n weight_decay=0.01,\n load_best_model_at_end=True,\n metric_for_best_model=metric_name,\n push_to_hub=True,\n)","metadata":{"id":"dR2GmpvDqbuZ","outputId":"764d0131-f42b-476b-c403-1b40a2a57751","execution":{"iopub.status.busy":"2022-04-28T10:37:41.205234Z","iopub.execute_input":"2022-04-28T10:37:41.205971Z","iopub.status.idle":"2022-04-28T10:37:46.094857Z","shell.execute_reply.started":"2022-04-28T10:37:41.205923Z","shell.execute_reply":"2022-04-28T10:37:46.094091Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"We are also going to compute metrics while training. For this, we need to define a `compute_metrics` function, that returns a dictionary with the desired metric values.","metadata":{"id":"1_v2fPFFJ3-v"}},{"cell_type":"code","source":"from sklearn.metrics import f1_score, roc_auc_score, accuracy_score\nfrom sklearn.metrics import coverage_error\nfrom sklearn.metrics import label_ranking_average_precision_score\nfrom transformers import EvalPrediction\nimport torch\n \n# source: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/\ndef multi_label_metrics(predictions, labels, threshold=0.5):\n # first, apply sigmoid on predictions which are of shape (batch_size, num_labels)\n sigmoid = torch.nn.Sigmoid()\n probs = sigmoid(torch.Tensor(predictions))\n # next, use threshold to turn them into integer predictions\n y_pred = np.zeros(probs.shape)\n y_pred[np.where(probs >= threshold)] = 1\n # finally, compute metrics\n y_true = labels\n f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')\n roc_auc = roc_auc_score(y_true, y_pred, average = 'micro')\n accuracy = accuracy_score(y_true, y_pred)\n coverage_err = coverage_error(y_true, y_pred)\n label_ranking_average_precision = label_ranking_average_precision_score(y_true, y_pred)\n # return as dictionary\n metrics = {'f1': f1_micro_average,\n 'roc_auc': roc_auc,\n 'accuracy': accuracy,\n 'coverage_error': coverage_err,\n 'label_ranking_average_precision_score': label_ranking_average_precision}\n return metrics\n\ndef compute_metrics(p: EvalPrediction):\n preds = p.predictions[0] if isinstance(p.predictions, \n tuple) else p.predictions\n result = multi_label_metrics(\n predictions=preds, \n labels=p.label_ids)\n return result","metadata":{"id":"797b2WHJqUgZ","execution":{"iopub.status.busy":"2022-04-28T10:37:46.098061Z","iopub.execute_input":"2022-04-28T10:37:46.098269Z","iopub.status.idle":"2022-04-28T10:37:46.106965Z","shell.execute_reply.started":"2022-04-28T10:37:46.098243Z","shell.execute_reply":"2022-04-28T10:37:46.106312Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"Here is where the real training begins.","metadata":{"id":"f-X2brZcv0X6"}},{"cell_type":"code","source":"trainer = Trainer(\n model,\n args,\n train_dataset=encoded_dataset,\n eval_dataset=encoded_dataset_test,\n tokenizer=tokenizer,\n compute_metrics=compute_metrics\n)","metadata":{"id":"chq_3nUz73ib","execution":{"iopub.status.busy":"2022-04-28T10:38:11.868918Z","iopub.execute_input":"2022-04-28T10:38:11.869229Z","iopub.status.idle":"2022-04-28T10:40:07.861855Z","shell.execute_reply.started":"2022-04-28T10:38:11.869191Z","shell.execute_reply":"2022-04-28T10:40:07.860924Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"import os\nimport wandb\n\nwandb.init()\n#os.environ[\"WANDB_DISABLED\"] = \"true\"\n\ntrainer.train()","metadata":{"id":"KXmFds8js6P8","outputId":"9ebfead3-8d8e-4e68-cd54-84b9e41cc0e9","execution":{"iopub.status.busy":"2022-04-28T10:40:07.864108Z","iopub.execute_input":"2022-04-28T10:40:07.864337Z","iopub.status.idle":"2022-04-28T14:38:59.795736Z","shell.execute_reply.started":"2022-04-28T10:40:07.864311Z","shell.execute_reply":"2022-04-28T14:38:59.794768Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Evaluate\n\nAfter training, we evaluate our model on the validation set.","metadata":{"id":"hiloh9eMK91o"}},{"cell_type":"code","source":"trainer.evaluate()","metadata":{"id":"cMlebJ83LRYG","execution":{"iopub.status.busy":"2022-04-28T14:39:23.649963Z","iopub.execute_input":"2022-04-28T14:39:23.650222Z","iopub.status.idle":"2022-04-28T14:41:57.787252Z","shell.execute_reply.started":"2022-04-28T14:39:23.650194Z","shell.execute_reply":"2022-04-28T14:41:57.784216Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"## Inference\n\nLet's test the model on a new sentence:","metadata":{"id":"3nmvJp0pLq-3"}},{"cell_type":"code","source":"text = \"Ejecución de obras de almacén en el CEIP San Miguel Arcángel\"\n\nencoding = tokenizer(text, return_tensors=\"pt\")\nencoding = {k: v.to(trainer.model.device) for k,v in encoding.items()}\n\noutputs = trainer.model(**encoding)","metadata":{"id":"3fxjfr8PLD42","execution":{"iopub.status.busy":"2022-04-28T10:37:51.957321Z","iopub.status.idle":"2022-04-28T10:37:51.95774Z","shell.execute_reply.started":"2022-04-28T10:37:51.957526Z","shell.execute_reply":"2022-04-28T10:37:51.957548Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"The logits that come out of the model are of shape (batch_size, num_labels). As we are only forwarding a single sentence through the model, the `batch_size` equals 1. The logits is a tensor that contains the (unnormalized) scores for every individual label.","metadata":{"id":"8THm5-XgNHPm"}},{"cell_type":"code","source":"logits = outputs.logits\nlogits.shape","metadata":{"id":"KOBosj4UL2tU","execution":{"iopub.status.busy":"2022-04-28T10:37:51.959026Z","iopub.status.idle":"2022-04-28T10:37:51.95975Z","shell.execute_reply.started":"2022-04-28T10:37:51.959504Z","shell.execute_reply":"2022-04-28T10:37:51.959534Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"To turn them into actual predicted labels, we first apply a sigmoid function independently to every score, such that every score is turned into a number between 0 and 1, that can be interpreted as a \"probability\" for how certain the model is that a given class belongs to the input text.\n\nNext, we use a threshold (typically, 0.5) to turn every probability into either a 1 (which means, we predict the label for the given example) or a 0 (which means, we don't predict the label for the given example).","metadata":{"id":"DC4XdDaHNVcd"}},{"cell_type":"code","source":"# apply sigmoid + threshold\nsigmoid = torch.nn.Sigmoid()\nprobs = sigmoid(logits.squeeze().cpu())\npredictions = np.zeros(probs.shape)\npredictions[np.where(probs >= 0.5)] = 1\n# turn predicted id's into actual label names\npredicted_labels = [id2label[idx] for idx, label in enumerate(predictions) if label == 1.0]\nprint(predicted_labels)","metadata":{"id":"mEkAQleMMT0k","execution":{"iopub.status.busy":"2022-04-28T10:37:51.96116Z","iopub.status.idle":"2022-04-28T10:37:51.961577Z","shell.execute_reply.started":"2022-04-28T10:37:51.961341Z","shell.execute_reply":"2022-04-28T10:37:51.961363Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"#trainer.save_model(\"./trainer\")","metadata":{"id":"I8PC8Gnwoug1","execution":{"iopub.status.busy":"2022-04-28T10:37:51.962787Z","iopub.status.idle":"2022-04-28T10:37:51.963423Z","shell.execute_reply.started":"2022-04-28T10:37:51.963172Z","shell.execute_reply":"2022-04-28T10:37:51.963196Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":" Download model
\n Download config
\n Download tokenizer_config
\n Download special_tokens_map \n\n\n Download model 10
\n Download config 10
\n Download tokenizer_config 10
\n Download special_tokens_map 10","metadata":{}},{"cell_type":"code","source":"trainer.push_to_hub()","metadata":{"execution":{"iopub.status.busy":"2022-04-28T14:43:06.262365Z","iopub.execute_input":"2022-04-28T14:43:06.262668Z","iopub.status.idle":"2022-04-28T14:43:11.853451Z","shell.execute_reply.started":"2022-04-28T14:43:06.262631Z","shell.execute_reply":"2022-04-28T14:43:11.851935Z"},"trusted":true},"execution_count":null,"outputs":[]}]}