{ "cells": [ { "cell_type": "markdown", "source": [ "# Building a classifier with DistilBert\n", "This notebook contains the code used to create the movie_overview_classification model. The model accepts an overview of a movie and returns a prediction of whether the movie will pass the Bechdel test. On its own it achieves an F1 score of only 0.77, but it will be used as one component of a larger ensemble algorithm." ], "metadata": { "collapsed": false }, "id": "2cd645ab5faa1775" }, { "cell_type": "markdown", "source": [ "## Imports and Data" ], "metadata": { "collapsed": false }, "id": "93a1ebe96b0c37cb" }, { "cell_type": "code", "execution_count": 33, "outputs": [], "source": [ "import pandas as pd\n", "import warnings\n", "from sklearn.model_selection import train_test_split\n", "import numpy as np\n", "warnings.filterwarnings(\"ignore\")\n", "import datasets\n", "from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, DataCollatorWithPadding\n", "from datasets import load_metric\n" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-07-23T18:26:24.518079800Z", "start_time": "2024-07-23T18:26:24.514056300Z" } }, "id": "61b223d9fc28303d" }, { "cell_type": "code", "execution_count": 2, "outputs": [], "source": [ "import BechdelDataImporter as data\n", "df = data.NoScripts()" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-07-23T16:20:18.413548400Z", "start_time": "2024-07-23T16:20:18.183977700Z" } }, "id": "f39f162e2864f877" }, { "cell_type": "markdown", "source": [ "## Text Cleaning\n", "First, instantiate a tokenizer, data collator, and model:" ], "metadata": { "collapsed": false }, "id": "d4e49ecdb60e42c0" }, { "cell_type": "code", "execution_count": 3, "id": "initial_id", "metadata": { "collapsed": true, "ExecuteTime": { "end_time": "2024-07-23T16:20:20.917109Z", "start_time": "2024-07-23T16:20:19.948053500Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']\n", "You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.\n" ] } ], "source": [ "tokenizer = DistilBertTokenizer.from_pretrained(\"distilbert-base-uncased\")\n", "model = DistilBertForSequenceClassification.from_pretrained(\"distilbert-base-uncased\", num_labels=2)\n", "data_collator = DataCollatorWithPadding(tokenizer=tokenizer)" ] }, { "cell_type": "markdown", "source": [ "- Drop duplicate and NA rows from the data\n", "- Tokenize the overviews\n", "- Convert the labels from the four-category Bechdel rating to a binary pass/fail label" ], "metadata": { "collapsed": false }, "id": "4758f34ceb7509f9" }, { "cell_type": "code", "execution_count": 4, "outputs": [], "source": [ "df['overview_tokenized'] = pd.Series(dtype='object')\n", "df['label'] = pd.Series(dtype='object')\n", "df = df.drop_duplicates(subset=['overview']).dropna(subset=['overview'])\n", "for i in df.index:\n", "    # .at avoids the chained-assignment pitfalls of df['col'][i] = ...\n", "    df.at[i, 'overview_tokenized'] = tokenizer(df.loc[i, 'overview'], return_tensors=\"pt\")\n", "    # a movie passes the Bechdel test only when its rating is 3\n", "    df.at[i, 'label'] = 1 if df.loc[i, 'bechdel_rating'] == 3 else 0" ], "metadata": { "collapsed": false,
"ExecuteTime": { "end_time": "2024-07-23T16:20:32.787565800Z", "start_time": "2024-07-23T16:20:21.777045300Z" } }, "id": "9557e1ea9bd67821" }, { "cell_type": "markdown", "source": [ "Split off a test set:" ], "metadata": { "collapsed": false }, "id": "682080881db3f13" }, { "cell_type": "code", "execution_count": 6, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(df[['overview', 'overview_tokenized']], df['label'], test_size=0.2, random_state=42)" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-07-23T16:20:40.513637Z", "start_time": "2024-07-23T16:20:40.487424500Z" } }, "id": "eacbec3d3fbde8c6" }, { "cell_type": "markdown", "source": [ "## Connecting to Hugging Face" ], "metadata": { "collapsed": false }, "id": "ee02108df36daa41" },
{
"cell_type": "code",
"execution_count": 16,
"outputs": [
{
"data": {
"text/plain": "| Step | Training Loss |\n|------|---------------|\n"
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"text/plain": "TrainOutput(global_step=1010, training_loss=0.5062790998137824, metrics={'train_runtime': 5485.4763, 'train_samples_per_second': 2.944, 'train_steps_per_second': 0.184, 'total_flos': 545236330318872.0, 'train_loss': 0.5062790998137824, 'epoch': 2.0})"
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"trainer.train()"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"end_time": "2024-07-23T18:01:30.248929900Z",
"start_time": "2024-07-23T16:23:44.701626400Z"
}
},
"id": "9361a3c143040d03"
},
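{
"cell_type": "markdown",
"source": [
"The next cell previews the held-out examples with a `preds` column whose entries look like `transformers` text-classification pipeline outputs. The notebook's own evaluation code is not reproduced in this section, so the cell below is a minimal illustrative sketch of how such predictions and the F1 score of roughly 0.77 quoted in the introduction could be computed from the fine-tuned `model`, the `tokenizer`, and the `X_test`/`y_test` split above; `LABEL_1` corresponds to a Bechdel pass, matching the labelling loop earlier."
],
"metadata": { "collapsed": false },
"id": "evaluation-sketch-md"
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Illustrative sketch: predict on the held-out overviews and compute the F1 score.\n",
"from transformers import pipeline\n",
"from sklearn.metrics import f1_score\n",
"\n",
"clf = pipeline('text-classification', model=model, tokenizer=tokenizer)\n",
"raw_preds = clf(X_test['overview'].tolist(), truncation=True)\n",
"# each prediction looks like {'label': 'LABEL_1', 'score': 0.89}; map LABEL_0/LABEL_1 to 0/1\n",
"pred_labels = [int(p['label'].split('_')[-1]) for p in raw_preds]\n",
"print('F1:', f1_score(y_test.astype(int).tolist(), pred_labels))"
],
"metadata": { "collapsed": false },
"id": "evaluation-sketch-code"
},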
{
"cell_type": "code",
"execution_count": 17,
"outputs": [
{
"data": {
"text/plain": "      text                                            label  preds\n0     A young and devoted morning television produce...      1  [{'label': 'LABEL_1', 'score': 0.896...}]\n1     Don Birnam, a long-time alcoholic, has been so...      0  [{'label': 'LABEL_0', 'score': 0.767...}]\n2     One peaceful day on Earth, two remnants of Fri...      0  [{'label': 'LABEL_0', 'score': 0.658...}]\n3     Dominic Toretto and his crew battle the most s...      1  [{'label': 'LABEL_0', 'score': 0.807...}]\n4     The Martins family are optimistic dreamers, qu...      1  [{'label': 'LABEL_1', 'score': 0.891...}]\n...                                                      ...    ...\n2014  Seven short films - each one focused on the pl...      1  [{'label': 'LABEL_1', 'score': 0.628...}]\n2015  After an unprecedented series of natural disas...      1  [{'label': 'LABEL_1', 'score': 0.602...}]\n2016  Girl Lost tackles the issue of underage prosti...      1  [{'label': 'LABEL_1', 'score': 0.967...}]\n2017  Loosely based on the true-life tale of Ron Woo...      1  [{'label': 'LABEL_1', 'score': 0.921...}]\n2018  A young black pianist becomes embroiled in the...      0  [{'label': 'LABEL_1', 'score': 0.947...}]\n\n[2019 rows x 6 columns: text, input_ids, attention_mask, label, __index_level_0__, preds; input_ids, attention_mask and __index_level_0__ omitted above]