{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# IMDb - Predicting actors' gender\n",
    "\n",
    "In this tutorial, we demonstrate how getML can be applied to text fields. In relational databases, text fields are less structured and less standardized than categorical data, making it more difficult to extract useful information from them. Therefore, they are ignored in most data science projects on relational data. However, when using a relational learning tool such as getML, we can easily generate simple features from text fields and leverage the information contained therein.\n",
    "\n",
    "The point of this exercise is not to compete with modern deep-learning-based NLP approaches. The point is to develop an approach by which we can leverage fields in relational databases that would otherwise be ignored.\n",
    "\n",
    "As an example data set, we use the Internet Movie Database, which has been used by previous studies in the relational learning literature. This allows us to benchmark our approach to state-of-the-art algorithms in the relational learning literature. We demonstrate that getML outperforms these state-of-the-art algorithms.\n",
    "\n",
    "Summary:\n",
    "\n",
    "- Prediction type: __Classification model__\n",
    "- Domain: __Entertainment__\n",
    "- Prediction target: __The gender of an actor__ \n",
    "- Population size: __817718__"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Background\n",
    "\n",
    "The data set contains about 800,000 actors. The goal is to predict the gender of said actors based on other information we have about them, such as the movies they have participated in and the roles they have played in these movies.\n",
    "\n",
    "It has been downloaded from the [CTU Prague relational learning repository](https://relational.fit.cvut.cz/dataset/IMDb) (Motl and Schulte, 2015) (Now residing at [relational-data.org](https://relational-data.org/dataset/IMDb).)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's get started with the analysis and set up your session:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install -q \"getml==1.5.0\" \"pyspark==3.5.2\" \"ipywidgets==8.1.5\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "getML API version: 1.5.0\n",
      "\n"
     ]
    }
   ],
   "source": [
    "import os\n",
    "\n",
    "from pyspark.sql import SparkSession\n",
    "import getml\n",
    "\n",
    "os.environ[\"PYARROW_IGNORE_TIMEZONE\"] = \"1\"\n",
    "\n",
    "print(f\"getML API version: {getml.__version__}\\n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Launching ./getML --allow-push-notifications=true --allow-remote-ips=true --home-directory=/home/user --in-memory=true --install=false --launch-browser=true --log=false --token=token in /home/user/.getML/getml-1.5.0-x64-linux...\n",
      "Launched the getML Engine. The log output will be stored in /home/user/.getML/logs/20240912144137.log.\n",
      "\u001b[2K  Loading pipelines... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n",
      "\u001b[?25h"
     ]
    },
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">Connected to project <span style=\"color: #008000; text-decoration-color: #008000\">'imdb'</span>.\n",
       "</pre>\n"
      ],
      "text/plain": [
       "Connected to project \u001b[32m'imdb'\u001b[0m.\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "getml.engine.launch(allow_remote_ips=True, token='token')\n",
    "getml.engine.set_project('imdb')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<span id='flags'></span>\n",
    "In the following, we set some flags that affect execution of the notebook:\n",
    "- We don't let the algorithms utilize the information on actors' first names (see [below](#first-names) for an explanation)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "USE_FIRST_NAMES = False\n",
    "RUN_SPARK = False"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1. Loading data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.1 Download from source\n",
    "\n",
    "We begin by downloading the data from the source file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Connection(dbname='imdb_ijs',\n",
       "           dialect='mysql',\n",
       "           host='relational.fel.cvut.cz',\n",
       "           port=3306)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "conn = getml.database.connect_mysql(\n",
    "    host=\"relational.fel.cvut.cz\",\n",
    "    dbname=\"imdb_ijs\",\n",
    "    port=3306,\n",
    "    user=\"guest\",\n",
    "    password=\"ctu-relational\"\n",
    ")\n",
    "\n",
    "conn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def load_if_needed(name):\n",
    "    \"\"\"\n",
    "    Loads the data from the relational learning\n",
    "    repository, if the data frame has not already\n",
    "    been loaded.\n",
    "    \"\"\"\n",
    "    if getml.data.exists(name):\n",
    "        return getml.data.load_data_frame(name)\n",
    "    data_frame = getml.data.DataFrame.from_db(\n",
    "        name=name,\n",
    "        table_name=name,\n",
    "        conn=conn\n",
    "    )\n",
    "    data_frame.save()\n",
    "    return data_frame"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "actors = load_if_needed(\"actors\")\n",
    "roles = load_if_needed(\"roles\")\n",
    "movies = load_if_needed(\"movies\")\n",
    "movies_genres = load_if_needed(\"movies_genres\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>\n",
       "  table {\n",
       "    font-family: Helvetica, sans-serif;\n",
       "  }\n",
       "\n",
       "  th {\n",
       "    text-align: left !important;\n",
       "  }\n",
       "  td {\n",
       "    text-align: left !important;\n",
       "  }\n",
       "  th:nth-child(1) {\n",
       "    text-align: right !important;\n",
       "    border-right: 1px solid LightGray;\n",
       "  }\n",
       "  th.sub-header {\n",
       "    font-weight: normal;\n",
       "    font-style: italic;\n",
       "  }\n",
       "  .join_key,\n",
       "  .numerical,\n",
       "  .target,\n",
       "  .unused_float {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "\n",
       "  .char-align {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "  span.left {\n",
       "    text-align: right;\n",
       "    width: 3em;\n",
       "  }\n",
       "  span.right {\n",
       "    float: right;\n",
       "    text-align: left;\n",
       "  }\n",
       "</style>\n",
       "\n",
       "<table class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      \n",
       "        \n",
       "          <th>  name</th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"unused_float\">          id</th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"unused_string\">first_name   </th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"unused_string\">last_name         </th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"unused_string\">gender       </th>\n",
       "        \n",
       "      \n",
       "    </tr>\n",
       "    \n",
       "      <tr>\n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header\">  role</th>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header unused_float\">unused_float</th>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header unused_string\">unused_string</th>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header unused_string\">unused_string     </th>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header unused_string\">unused_string</th>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "  </thead>\n",
       "  <tbody>\n",
       "    \n",
       "      <tr>\n",
       "        <th>0</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">2</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Michael</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">&#x27;babeepower&#x27; Viera</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">M</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>1</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">3</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Eloy</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">&#x27;Chincheta&#x27;</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">M</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>2</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">4</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Dieguito</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">&#x27;El Cigala&#x27;</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">M</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>3</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">5</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Antonio</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">&#x27;El de Chipiona&#x27;</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">M</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>4</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">6</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">José</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">&#x27;El Francés&#x27;</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">M</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th></th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">...</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">...</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">...</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">...</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>817713</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">845461</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Herdís</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Þorvaldsdóttir</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">F</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>817714</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">845462</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Katla Margrét</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Þorvaldsdóttir</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">F</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>817715</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">845463</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Lilja Nótt</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Þórarinsdóttir</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">F</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>817716</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">845464</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Hólmfríður</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Þórhallsdóttir</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">F</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>817717</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">845465</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Theódóra</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Þórðardóttir</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">F</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "  </tbody>\n",
       "</table>\n",
       "\n",
       "  <p>\n",
       "    817718 rows x 4 columns<br />\n",
       "    memory usage: 40.22 MB<br />\n",
       "    name: actors<br />\n",
       "    type: getml.DataFrame<br />\n",
       "    \n",
       "  </p>\n"
      ],
      "text/plain": [
       "  name             id   first_name      last_name            gender       \n",
       "  role   unused_float   unused_string   unused_string        unused_string\n",
       "     0              2   Michael         'babeepower' Viera   M            \n",
       "     1              3   Eloy            'Chincheta'          M            \n",
       "     2              4   Dieguito        'El Cigala'          M            \n",
       "     3              5   Antonio         'El de Chipiona'     M            \n",
       "     4              6   José            'El Francés'         M            \n",
       "                  ...   ...             ...                  ...          \n",
       "817713         845461   Herdís          Þorvaldsdóttir       F            \n",
       "817714         845462   Katla Margrét   Þorvaldsdóttir       F            \n",
       "817715         845463   Lilja Nótt      Þórarinsdóttir       F            \n",
       "817716         845464   Hólmfríður      Þórhallsdóttir       F            \n",
       "817717         845465   Theódóra        Þórðardóttir         F            \n",
       "\n",
       "\n",
       "817718 rows x 4 columns\n",
       "memory usage: 40.22 MB\n",
       "type: getml.DataFrame"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "actors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>\n",
       "  table {\n",
       "    font-family: Helvetica, sans-serif;\n",
       "  }\n",
       "\n",
       "  th {\n",
       "    text-align: left !important;\n",
       "  }\n",
       "  td {\n",
       "    text-align: left !important;\n",
       "  }\n",
       "  th:nth-child(1) {\n",
       "    text-align: right !important;\n",
       "    border-right: 1px solid LightGray;\n",
       "  }\n",
       "  th.sub-header {\n",
       "    font-weight: normal;\n",
       "    font-style: italic;\n",
       "  }\n",
       "  .join_key,\n",
       "  .numerical,\n",
       "  .target,\n",
       "  .unused_float {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "\n",
       "  .char-align {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "  span.left {\n",
       "    text-align: right;\n",
       "    width: 3em;\n",
       "  }\n",
       "  span.right {\n",
       "    float: right;\n",
       "    text-align: left;\n",
       "  }\n",
       "</style>\n",
       "\n",
       "<table class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      \n",
       "        \n",
       "          <th>   name</th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"unused_float\">    actor_id</th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"unused_float\">    movie_id</th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"unused_string\">role            </th>\n",
       "        \n",
       "      \n",
       "    </tr>\n",
       "    \n",
       "      <tr>\n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header\">   role</th>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header unused_float\">unused_float</th>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header unused_float\">unused_float</th>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header unused_string\">unused_string   </th>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "  </thead>\n",
       "  <tbody>\n",
       "    \n",
       "      <tr>\n",
       "        <th>0</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">2</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">280088</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Stevie</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>1</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">2</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">396232</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Various/lyricist</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>2</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">3</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">376687</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Gitano 1</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>3</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">4</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">336265</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">El Cigala</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>4</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">5</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">135644</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Himself</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th></th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">...</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">...</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">...</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>3431961</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">845461</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">137097</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Kata</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>3431962</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">845462</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">208838</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Magga</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>3431963</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">845463</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">870</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Gunna</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>3431964</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">845464</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">378123</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Gudrun</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>3431965</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">845465</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">378123</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">NULL</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "  </tbody>\n",
       "</table>\n",
       "\n",
       "  <p>\n",
       "    3431966 rows x 3 columns<br />\n",
       "    memory usage: 115.41 MB<br />\n",
       "    name: roles<br />\n",
       "    type: getml.DataFrame<br />\n",
       "    \n",
       "  </p>\n"
      ],
      "text/plain": [
       "   name       actor_id       movie_id   role            \n",
       "   role   unused_float   unused_float   unused_string   \n",
       "      0              2         280088   Stevie          \n",
       "      1              2         396232   Various/lyricist\n",
       "      2              3         376687   Gitano 1        \n",
       "      3              4         336265   El Cigala       \n",
       "      4              5         135644   Himself         \n",
       "                   ...            ...   ...             \n",
       "3431961         845461         137097   Kata            \n",
       "3431962         845462         208838   Magga           \n",
       "3431963         845463            870   Gunna           \n",
       "3431964         845464         378123   Gudrun          \n",
       "3431965         845465         378123   NULL            \n",
       "\n",
       "\n",
       "3431966 rows x 3 columns\n",
       "memory usage: 115.41 MB\n",
       "type: getml.DataFrame"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "roles"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>\n",
       "  table {\n",
       "    font-family: Helvetica, sans-serif;\n",
       "  }\n",
       "\n",
       "  th {\n",
       "    text-align: left !important;\n",
       "  }\n",
       "  td {\n",
       "    text-align: left !important;\n",
       "  }\n",
       "  th:nth-child(1) {\n",
       "    text-align: right !important;\n",
       "    border-right: 1px solid LightGray;\n",
       "  }\n",
       "  th.sub-header {\n",
       "    font-weight: normal;\n",
       "    font-style: italic;\n",
       "  }\n",
       "  .join_key,\n",
       "  .numerical,\n",
       "  .target,\n",
       "  .unused_float {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "\n",
       "  .char-align {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "  span.left {\n",
       "    text-align: right;\n",
       "    width: 3em;\n",
       "  }\n",
       "  span.right {\n",
       "    float: right;\n",
       "    text-align: left;\n",
       "  }\n",
       "</style>\n",
       "\n",
       "<table class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      \n",
       "        \n",
       "          <th>  name</th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"unused_float\">          id</th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"unused_float\">        year</th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"unused_float\">        rank</th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"unused_string\">name                            </th>\n",
       "        \n",
       "      \n",
       "    </tr>\n",
       "    \n",
       "      <tr>\n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header\">  role</th>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header unused_float\">unused_float</th>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header unused_float\">unused_float</th>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header unused_float\">unused_float</th>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header unused_string\">unused_string                   </th>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "  </thead>\n",
       "  <tbody>\n",
       "    \n",
       "      <tr>\n",
       "        <th>0</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">0</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">2002</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">nan</span\n",
       "              ><span class=\"right\" style=\"width: 1ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">#28</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>1</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">1</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">2000</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">nan</span\n",
       "              ><span class=\"right\" style=\"width: 1ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">#7 Train: An Immigrant Journey, ...</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>2</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">2</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">1971</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "            \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">6</span\n",
       "              ><span class=\"right\" style=\"width: 1ch\"\n",
       "                >.4</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">$</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>3</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">3</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">1913</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">nan</span\n",
       "              ><span class=\"right\" style=\"width: 1ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">$1,000 Reward</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>4</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">4</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">1915</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">nan</span\n",
       "              ><span class=\"right\" style=\"width: 1ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">$1,000 Reward</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th></th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">...</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">...</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">...</span\n",
       "              ><span class=\"right\" style=\"width: 1ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">...</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>388264</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">412316</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">1991</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">nan</span\n",
       "              ><span class=\"right\" style=\"width: 1ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">&quot;zem blch krlu&quot;</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>388265</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">412317</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">1995</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">nan</span\n",
       "              ><span class=\"right\" style=\"width: 1ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">&quot;rgammk&quot;</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>388266</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">412318</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">2002</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">nan</span\n",
       "              ><span class=\"right\" style=\"width: 1ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">&quot;zgnm Leyla&quot;</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>388267</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">412319</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">1983</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">nan</span\n",
       "              ><span class=\"right\" style=\"width: 1ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">&quot; Istanbul&quot;</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>388268</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">412320</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">1958</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">nan</span\n",
       "              ><span class=\"right\" style=\"width: 1ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">&quot;sterreich&quot;</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "  </tbody>\n",
       "</table>\n",
       "\n",
       "  <p>\n",
       "    388269 rows x 4 columns<br />\n",
       "    memory usage: 19.92 MB<br />\n",
       "    name: movies<br />\n",
       "    type: getml.DataFrame<br />\n",
       "    \n",
       "  </p>\n"
      ],
      "text/plain": [
       "  name             id           year           rank   name                            \n",
       "  role   unused_float   unused_float   unused_float   unused_string                   \n",
       "     0              0           2002          nan     #28                             \n",
       "     1              1           2000          nan     #7 Train: An Immigrant Journey, ...\n",
       "     2              2           1971            6.4   $                               \n",
       "     3              3           1913          nan     $1,000 Reward                   \n",
       "     4              4           1915          nan     $1,000 Reward                   \n",
       "                  ...            ...           ...    ...                             \n",
       "388264         412316           1991          nan     \"zem blch krlu\"                 \n",
       "388265         412317           1995          nan     \"rgammk\"                        \n",
       "388266         412318           2002          nan     \"zgnm Leyla\"                    \n",
       "388267         412319           1983          nan     \" Istanbul\"                     \n",
       "388268         412320           1958          nan     \"sterreich\"                     \n",
       "\n",
       "\n",
       "388269 rows x 4 columns\n",
       "memory usage: 19.92 MB\n",
       "type: getml.DataFrame"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "movies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>\n",
       "  table {\n",
       "    font-family: Helvetica, sans-serif;\n",
       "  }\n",
       "\n",
       "  th {\n",
       "    text-align: left !important;\n",
       "  }\n",
       "  td {\n",
       "    text-align: left !important;\n",
       "  }\n",
       "  th:nth-child(1) {\n",
       "    text-align: right !important;\n",
       "    border-right: 1px solid LightGray;\n",
       "  }\n",
       "  th.sub-header {\n",
       "    font-weight: normal;\n",
       "    font-style: italic;\n",
       "  }\n",
       "  .join_key,\n",
       "  .numerical,\n",
       "  .target,\n",
       "  .unused_float {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "\n",
       "  .char-align {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "  span.left {\n",
       "    text-align: right;\n",
       "    width: 3em;\n",
       "  }\n",
       "  span.right {\n",
       "    float: right;\n",
       "    text-align: left;\n",
       "  }\n",
       "</style>\n",
       "\n",
       "<table class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      \n",
       "        \n",
       "          <th>  name</th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"unused_float\">    movie_id</th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"unused_string\">genre        </th>\n",
       "        \n",
       "      \n",
       "    </tr>\n",
       "    \n",
       "      <tr>\n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header\">  role</th>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header unused_float\">unused_float</th>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header unused_string\">unused_string</th>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "  </thead>\n",
       "  <tbody>\n",
       "    \n",
       "      <tr>\n",
       "        <th>0</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">1</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Documentary</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>1</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">1</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Short</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>2</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">2</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Comedy</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>3</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">2</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Crime</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>4</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">5</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Western</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th></th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">...</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">...</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>395114</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">378612</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Adventure</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>395115</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">378612</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Drama</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>395116</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">378613</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Comedy</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>395117</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">378613</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Drama</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>395118</th>\n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align unused_float\">\n",
       "              <span class=\"left\">378614</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Comedy</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "  </tbody>\n",
       "</table>\n",
       "\n",
       "  <p>\n",
       "    395119 rows x 2 columns<br />\n",
       "    memory usage: 9.24 MB<br />\n",
       "    name: movies_genres<br />\n",
       "    type: getml.DataFrame<br />\n",
       "    \n",
       "  </p>\n"
      ],
      "text/plain": [
       "  name       movie_id   genre        \n",
       "  role   unused_float   unused_string\n",
       "     0              1   Documentary  \n",
       "     1              1   Short        \n",
       "     2              2   Comedy       \n",
       "     3              2   Crime        \n",
       "     4              5   Western      \n",
       "                  ...   ...          \n",
       "395114         378612   Adventure    \n",
       "395115         378612   Drama        \n",
       "395116         378613   Comedy       \n",
       "395117         378613   Drama        \n",
       "395118         378614   Comedy       \n",
       "\n",
       "\n",
       "395119 rows x 2 columns\n",
       "memory usage: 9.24 MB\n",
       "type: getml.DataFrame"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "movies_genres"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 1.2 Prepare data for getML"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
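    "getML requires that we assign a *role* to each column we want to use. Here, the actor `id` becomes a join key, and the derived `target` column (1 if the actor is female, 0 otherwise) becomes the prediction target."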
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Binary target: 1 if the actor is female, 0 otherwise.\n",
    "actors[\"target\"] = (actors.gender == \"F\")\n",
    "actors.set_role(\"id\", getml.data.roles.join_key)\n",
    "actors.set_role(\"target\", getml.data.roles.target)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<span id='first-names'></span>\n",
    "The benchmark studies do not state clearly whether it is fair game to use the actors' first names. Using the first names, we can easily increase the predictive accuracy to above 90%. However, doing so essentially turns the task into a first-name identification problem rather than a relational learning problem. This would undermine the point of this notebook, which is to showcase relational learning. Therefore, we assume that using the first names is not allowed. Feel free to set the `USE_FIRST_NAMES` flag [above](#flags) to see how well getML incorporates such straightforward information into its feature logic."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>\n",
       "  table {\n",
       "    font-family: Helvetica, sans-serif;\n",
       "  }\n",
       "\n",
       "  th {\n",
       "    text-align: left !important;\n",
       "  }\n",
       "  td {\n",
       "    text-align: left !important;\n",
       "  }\n",
       "  th:nth-child(1) {\n",
       "    text-align: right !important;\n",
       "    border-right: 1px solid LightGray;\n",
       "  }\n",
       "  th.sub-header {\n",
       "    font-weight: normal;\n",
       "    font-style: italic;\n",
       "  }\n",
       "  .join_key,\n",
       "  .numerical,\n",
       "  .target,\n",
       "  .unused_float {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "\n",
       "  .char-align {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "  span.left {\n",
       "    text-align: right;\n",
       "    width: 3em;\n",
       "  }\n",
       "  span.right {\n",
       "    float: right;\n",
       "    text-align: left;\n",
       "  }\n",
       "</style>\n",
       "\n",
       "<table class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      \n",
       "        \n",
       "          <th>  name</th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"join_key\">      id</th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"target\">target</th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"unused_string\">first_name   </th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"unused_string\">last_name         </th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"unused_string\">gender       </th>\n",
       "        \n",
       "      \n",
       "    </tr>\n",
       "    \n",
       "      <tr>\n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header\">  role</th>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header join_key\">join_key</th>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header target\">target</th>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header unused_string\">unused_string</th>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header unused_string\">unused_string     </th>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header unused_string\">unused_string</th>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "  </thead>\n",
       "  <tbody>\n",
       "    \n",
       "      <tr>\n",
       "        <th>0</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">2</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align target\">\n",
       "              <span class=\"left\">0</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Michael</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">&#x27;babeepower&#x27; Viera</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">M</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>1</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">3</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align target\">\n",
       "              <span class=\"left\">0</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Eloy</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">&#x27;Chincheta&#x27;</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">M</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>2</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">4</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align target\">\n",
       "              <span class=\"left\">0</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Dieguito</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">&#x27;El Cigala&#x27;</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">M</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>3</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">5</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align target\">\n",
       "              <span class=\"left\">0</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Antonio</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">&#x27;El de Chipiona&#x27;</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">M</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>4</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">6</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align target\">\n",
       "              <span class=\"left\">0</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">José</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">&#x27;El Francés&#x27;</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">M</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th></th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">...</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align target\">\n",
       "              <span class=\"left\">...</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">...</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">...</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">...</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>817713</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">845461</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align target\">\n",
       "              <span class=\"left\">1</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Herdís</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Þorvaldsdóttir</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">F</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>817714</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">845462</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align target\">\n",
       "              <span class=\"left\">1</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Katla Margrét</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Þorvaldsdóttir</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">F</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>817715</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">845463</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align target\">\n",
       "              <span class=\"left\">1</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Lilja Nótt</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Þórarinsdóttir</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">F</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>817716</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">845464</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align target\">\n",
       "              <span class=\"left\">1</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Hólmfríður</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Þórhallsdóttir</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">F</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>817717</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">845465</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align target\">\n",
       "              <span class=\"left\">1</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Theódóra</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">Þórðardóttir</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">F</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "  </tbody>\n",
       "</table>\n",
       "\n",
       "  <p>\n",
       "    817718 rows x 5 columns<br />\n",
       "    memory usage: 43.49 MB<br />\n",
       "    name: actors<br />\n",
       "    type: getml.DataFrame<br />\n",
       "    \n",
       "  </p>\n"
      ],
      "text/plain": [
       "  name         id   target   first_name      last_name            gender       \n",
       "  role   join_key   target   unused_string   unused_string        unused_string\n",
       "     0          2        0   Michael         'babeepower' Viera   M            \n",
       "     1          3        0   Eloy            'Chincheta'          M            \n",
       "     2          4        0   Dieguito        'El Cigala'          M            \n",
       "     3          5        0   Antonio         'El de Chipiona'     M            \n",
       "     4          6        0   José            'El Francés'         M            \n",
       "              ...      ...   ...             ...                  ...          \n",
       "817713     845461        1   Herdís          Þorvaldsdóttir       F            \n",
       "817714     845462        1   Katla Margrét   Þorvaldsdóttir       F            \n",
       "817715     845463        1   Lilja Nótt      Þórarinsdóttir       F            \n",
       "817716     845464        1   Hólmfríður      Þórhallsdóttir       F            \n",
       "817717     845465        1   Theódóra        Þórðardóttir         F            \n",
       "\n",
       "\n",
       "817718 rows x 5 columns\n",
       "memory usage: 43.49 MB\n",
       "type: getml.DataFrame"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Expose first names to the model only if the flag is set.\n",
    "if USE_FIRST_NAMES:\n",
    "    actors.set_role(\"first_name\", getml.data.roles.text)\n",
    "actors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>\n",
       "  table {\n",
       "    font-family: Helvetica, sans-serif;\n",
       "  }\n",
       "\n",
       "  th {\n",
       "    text-align: left !important;\n",
       "  }\n",
       "  td {\n",
       "    text-align: left !important;\n",
       "  }\n",
       "  th:nth-child(1) {\n",
       "    text-align: right !important;\n",
       "    border-right: 1px solid LightGray;\n",
       "  }\n",
       "  th.sub-header {\n",
       "    font-weight: normal;\n",
       "    font-style: italic;\n",
       "  }\n",
       "  .join_key,\n",
       "  .numerical,\n",
       "  .target,\n",
       "  .unused_float {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "\n",
       "  .char-align {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "  span.left {\n",
       "    text-align: right;\n",
       "    width: 3em;\n",
       "  }\n",
       "  span.right {\n",
       "    float: right;\n",
       "    text-align: left;\n",
       "  }\n",
       "</style>\n",
       "\n",
       "<table class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      \n",
       "        \n",
       "          <th>   name</th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"join_key\">actor_id</th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"join_key\">movie_id</th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"text\">role            </th>\n",
       "        \n",
       "      \n",
       "    </tr>\n",
       "    \n",
       "      <tr>\n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header\">   role</th>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header join_key\">join_key</th>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header join_key\">join_key</th>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header text\">text            </th>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "  </thead>\n",
       "  <tbody>\n",
       "    \n",
       "      <tr>\n",
       "        <th>0</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">2</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">280088</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"text\">Stevie</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>1</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">2</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">396232</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"text\">Various/lyricist</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>2</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">3</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">376687</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"text\">Gitano 1</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>3</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">4</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">336265</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"text\">El Cigala</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>4</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">5</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">135644</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"text\">Himself</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th></th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">...</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">...</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"text\">...</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>3431961</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">845461</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">137097</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"text\">Kata</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>3431962</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">845462</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">208838</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"text\">Magga</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>3431963</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">845463</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">870</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"text\">Gunna</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>3431964</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">845464</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">378123</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"text\">Gudrun</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>3431965</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">845465</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">378123</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"text\">NULL</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "  </tbody>\n",
       "</table>\n",
       "\n",
       "  <p>\n",
       "    3431966 rows x 3 columns<br />\n",
       "    memory usage: 87.96 MB<br />\n",
       "    name: roles<br />\n",
       "    type: getml.DataFrame<br />\n",
       "    \n",
       "  </p>\n"
      ],
      "text/plain": [
       "   name   actor_id   movie_id   role            \n",
       "   role   join_key   join_key   text            \n",
       "      0          2     280088   Stevie          \n",
       "      1          2     396232   Various/lyricist\n",
       "      2          3     376687   Gitano 1        \n",
       "      3          4     336265   El Cigala       \n",
       "      4          5     135644   Himself         \n",
       "               ...        ...   ...             \n",
       "3431961     845461     137097   Kata            \n",
       "3431962     845462     208838   Magga           \n",
       "3431963     845463        870   Gunna           \n",
       "3431964     845464     378123   Gudrun          \n",
       "3431965     845465     378123   NULL            \n",
       "\n",
       "\n",
       "3431966 rows x 3 columns\n",
       "memory usage: 87.96 MB\n",
       "type: getml.DataFrame"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Mark both ID columns as join keys and the free-text role column as text.\n",
    "roles.set_role([\"actor_id\", \"movie_id\"], getml.data.roles.join_key)\n",
    "roles.set_role(\"role\", getml.data.roles.text)\n",
    "roles"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>\n",
       "  table {\n",
       "    font-family: Helvetica, sans-serif;\n",
       "  }\n",
       "\n",
       "  th {\n",
       "    text-align: left !important;\n",
       "  }\n",
       "  td {\n",
       "    text-align: left !important;\n",
       "  }\n",
       "  th:nth-child(1) {\n",
       "    text-align: right !important;\n",
       "    border-right: 1px solid LightGray;\n",
       "  }\n",
       "  th.sub-header {\n",
       "    font-weight: normal;\n",
       "    font-style: italic;\n",
       "  }\n",
       "  .join_key,\n",
       "  .numerical,\n",
       "  .target,\n",
       "  .unused_float {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "\n",
       "  .char-align {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "  span.left {\n",
       "    text-align: right;\n",
       "    width: 3em;\n",
       "  }\n",
       "  span.right {\n",
       "    float: right;\n",
       "    text-align: left;\n",
       "  }\n",
       "</style>\n",
       "\n",
       "<table class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      \n",
       "        \n",
       "          <th>  name</th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"join_key\">      id</th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"numerical\">     year</th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"numerical\">     rank</th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"unused_string\">name                            </th>\n",
       "        \n",
       "      \n",
       "    </tr>\n",
       "    \n",
       "      <tr>\n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header\">  role</th>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header join_key\">join_key</th>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header numerical\">numerical</th>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header numerical\">numerical</th>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header unused_string\">unused_string                   </th>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "  </thead>\n",
       "  <tbody>\n",
       "    \n",
       "      <tr>\n",
       "        <th>0</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">0</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align numerical\">\n",
       "              <span class=\"left\">2002</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align numerical\">\n",
       "              <span class=\"left\">nan</span\n",
       "              ><span class=\"right\" style=\"width: 1ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">#28</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>1</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">1</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align numerical\">\n",
       "              <span class=\"left\">2000</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align numerical\">\n",
       "              <span class=\"left\">nan</span\n",
       "              ><span class=\"right\" style=\"width: 1ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">#7 Train: An Immigrant Journey, ...</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>2</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">2</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align numerical\">\n",
       "              <span class=\"left\">1971</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "            \n",
       "            <td class=\"char-align numerical\">\n",
       "              <span class=\"left\">6</span\n",
       "              ><span class=\"right\" style=\"width: 1ch\"\n",
       "                >.4</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">$</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>3</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">3</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align numerical\">\n",
       "              <span class=\"left\">1913</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align numerical\">\n",
       "              <span class=\"left\">nan</span\n",
       "              ><span class=\"right\" style=\"width: 1ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">$1,000 Reward</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>4</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">4</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align numerical\">\n",
       "              <span class=\"left\">1915</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align numerical\">\n",
       "              <span class=\"left\">nan</span\n",
       "              ><span class=\"right\" style=\"width: 1ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">$1,000 Reward</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th></th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">...</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align numerical\">\n",
       "              <span class=\"left\">...</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align numerical\">\n",
       "              <span class=\"left\">...</span\n",
       "              ><span class=\"right\" style=\"width: 1ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">...</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>388264</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">412316</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align numerical\">\n",
       "              <span class=\"left\">1991</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align numerical\">\n",
       "              <span class=\"left\">nan</span\n",
       "              ><span class=\"right\" style=\"width: 1ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">&quot;zem blch krlu&quot;</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>388265</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">412317</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align numerical\">\n",
       "              <span class=\"left\">1995</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align numerical\">\n",
       "              <span class=\"left\">nan</span\n",
       "              ><span class=\"right\" style=\"width: 1ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">&quot;rgammk&quot;</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>388266</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">412318</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align numerical\">\n",
       "              <span class=\"left\">2002</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align numerical\">\n",
       "              <span class=\"left\">nan</span\n",
       "              ><span class=\"right\" style=\"width: 1ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">&quot;zgnm Leyla&quot;</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>388267</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">412319</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align numerical\">\n",
       "              <span class=\"left\">1983</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align numerical\">\n",
       "              <span class=\"left\">nan</span\n",
       "              ><span class=\"right\" style=\"width: 1ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">&quot; Istanbul&quot;</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>388268</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">412320</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align numerical\">\n",
       "              <span class=\"left\">1958</span\n",
       "              ><span class=\"right\" style=\"width: 0ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            \n",
       "              \n",
       "              \n",
       "          \n",
       "            <td class=\"char-align numerical\">\n",
       "              <span class=\"left\">nan</span\n",
       "              ><span class=\"right\" style=\"width: 1ch\"\n",
       "                >&nbsp;</span\n",
       "              >\n",
       "            </td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"unused_string\">&quot;sterreich&quot;</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "  </tbody>\n",
       "</table>\n",
       "\n",
       "  <p>\n",
       "    388269 rows x 4 columns<br />\n",
       "    memory usage: 18.37 MB<br />\n",
       "    name: movies<br />\n",
       "    type: getml.DataFrame<br />\n",
       "    \n",
       "  </p>\n"
      ],
      "text/plain": [
       "  name         id        year        rank   name                            \n",
       "  role   join_key   numerical   numerical   unused_string                   \n",
       "     0          0        2002       nan     #28                             \n",
       "     1          1        2000       nan     #7 Train: An Immigrant Journey, ...\n",
       "     2          2        1971         6.4   $                               \n",
       "     3          3        1913       nan     $1,000 Reward                   \n",
       "     4          4        1915       nan     $1,000 Reward                   \n",
       "              ...         ...        ...    ...                             \n",
       "388264     412316        1991       nan     \"zem blch krlu\"                 \n",
       "388265     412317        1995       nan     \"rgammk\"                        \n",
       "388266     412318        2002       nan     \"zgnm Leyla\"                    \n",
       "388267     412319        1983       nan     \" Istanbul\"                     \n",
       "388268     412320        1958       nan     \"sterreich\"                     \n",
       "\n",
       "\n",
       "388269 rows x 4 columns\n",
       "memory usage: 18.37 MB\n",
       "type: getml.DataFrame"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Use id as the join key and treat year and rank as numerical; name is left unused.\n",
    "movies.set_role(\"id\", getml.data.roles.join_key)\n",
    "movies.set_role([\"year\", \"rank\"], getml.data.roles.numerical)\n",
    "movies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>\n",
       "  table {\n",
       "    font-family: Helvetica, sans-serif;\n",
       "  }\n",
       "\n",
       "  th {\n",
       "    text-align: left !important;\n",
       "  }\n",
       "  td {\n",
       "    text-align: left !important;\n",
       "  }\n",
       "  th:nth-child(1) {\n",
       "    text-align: right !important;\n",
       "    border-right: 1px solid LightGray;\n",
       "  }\n",
       "  th.sub-header {\n",
       "    font-weight: normal;\n",
       "    font-style: italic;\n",
       "  }\n",
       "  .join_key,\n",
       "  .numerical,\n",
       "  .target,\n",
       "  .unused_float {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "\n",
       "  .char-align {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "  span.left {\n",
       "    text-align: right;\n",
       "    width: 3em;\n",
       "  }\n",
       "  span.right {\n",
       "    float: right;\n",
       "    text-align: left;\n",
       "  }\n",
       "</style>\n",
       "\n",
       "<table class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      \n",
       "        \n",
       "          <th>  name</th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"join_key\">movie_id</th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"categorical\">genre      </th>\n",
       "        \n",
       "      \n",
       "    </tr>\n",
       "    \n",
       "      <tr>\n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header\">  role</th>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header join_key\">join_key</th>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <th class=\"sub-header categorical\">categorical</th>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "  </thead>\n",
       "  <tbody>\n",
       "    \n",
       "      <tr>\n",
       "        <th>0</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">1</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"categorical\">Documentary</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>1</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">1</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"categorical\">Short</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>2</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">2</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"categorical\">Comedy</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>3</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">2</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"categorical\">Crime</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>4</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">5</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"categorical\">Western</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th></th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">...</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"categorical\">...</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>395114</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">378612</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"categorical\">Adventure</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>395115</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">378612</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"categorical\">Drama</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>395116</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">378613</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"categorical\">Comedy</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>395117</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">378613</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"categorical\">Drama</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>395118</th>\n",
       "        \n",
       "          \n",
       "            <td class=\"join_key\">378614</td>\n",
       "          \n",
       "        \n",
       "          \n",
       "            <td class=\"categorical\">Comedy</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "  </tbody>\n",
       "</table>\n",
       "\n",
       "  <p>\n",
       "    395119 rows x 2 columns<br />\n",
       "    memory usage: 3.16 MB<br />\n",
       "    name: movies_genres<br />\n",
       "    type: getml.DataFrame<br />\n",
       "    \n",
       "  </p>\n"
      ],
      "text/plain": [
       "  name   movie_id   genre      \n",
       "  role   join_key   categorical\n",
       "     0          1   Documentary\n",
       "     1          1   Short      \n",
       "     2          2   Comedy     \n",
       "     3          2   Crime      \n",
       "     4          5   Western    \n",
       "              ...   ...        \n",
       "395114     378612   Adventure  \n",
       "395115     378612   Drama      \n",
       "395116     378613   Comedy     \n",
       "395117     378613   Drama      \n",
       "395118     378614   Comedy     \n",
       "\n",
       "\n",
       "395119 rows x 2 columns\n",
       "memory usage: 3.16 MB\n",
       "type: getml.DataFrame"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Link genres to movies via movie_id and treat genre as a categorical column.\n",
    "movies_genres.set_role(\"movie_id\", getml.data.roles.join_key)\n",
    "movies_genres.set_role(\"genre\", getml.data.roles.categorical)\n",
    "movies_genres"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We need to split our data set into training, validation, and testing sets:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<style>\n",
       "  th {\n",
       "    text-align: left !important;\n",
       "  }\n",
       "  th.sub-header {\n",
       "    font-weight: normal;\n",
       "    font-style: italic;\n",
       "    text-align: left !important;\n",
       "  }\n",
       "  td {\n",
       "    text-align: left !important;\n",
       "  }\n",
       "  th:nth-child(1) {\n",
       "    text-align: right !important;\n",
       "    border-right: 1px solid LightGray;\n",
       "  }\n",
       "  th.numerical {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "  td.numerical {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "\n",
       "  td.char-align {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "  span.left {\n",
       "    text-align: right;\n",
       "    width: 3em;\n",
       "  }\n",
       "  span.right {\n",
       "    float: right;\n",
       "    text-align: left;\n",
       "  }\n",
       "</style>\n",
       "\n",
       "<table class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      \n",
       "        \n",
       "          <th>  </th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th>          </th>\n",
       "        \n",
       "      \n",
       "    </tr>\n",
       "    \n",
       "  </thead>\n",
       "  <tbody>\n",
       "    \n",
       "      <tr>\n",
       "        <th>0</th>\n",
       "        \n",
       "          \n",
       "            <td>train</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>1</th>\n",
       "        \n",
       "          \n",
       "            <td>validation</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>2</th>\n",
       "        \n",
       "          \n",
       "            <td>train</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>3</th>\n",
       "        \n",
       "          \n",
       "            <td>validation</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>4</th>\n",
       "        \n",
       "          \n",
       "            <td>validation</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th></th>\n",
       "        \n",
       "          \n",
       "            <td>...</td>\n",
       "          \n",
       "        \n",
       "      </tr>\n",
       "    \n",
       "  </tbody>\n",
       "</table>\n",
       "\n",
       "  <p>\n",
       "    infinite number of  rows<br />\n",
       "    \n",
       "    type: StringColumnView<br />\n",
       "    \n",
       "  </p>\n"
      ],
      "text/plain": [
       "               \n",
       " 0   train     \n",
       " 1   validation\n",
       " 2   train     \n",
       " 3   validation\n",
       " 4   validation\n",
       "     ...       \n",
       "\n",
       "\n",
       "infinite number of  rows\n",
       "type: StringColumnView"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "split = getml.data.split.random(train=0.7, validation=0.15, test=0.15)\n",
    "split"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div style='margin-top: 15px;'>\n",
       "<div style='float: left; margin-right: 50px;'>\n",
       "<div style='margin-bottom: 10px; font-size: 1rem;'>population</div>\n",
       "    <style>\n",
       "  th {\n",
       "    text-align: left !important;\n",
       "  }\n",
       "  td {\n",
       "    text-align: left !important;\n",
       "  }\n",
       "  th:nth-child(1) {\n",
       "    text-align: right;\n",
       "    border-right: 1px solid LightGray;\n",
       "  }\n",
       "  th.float {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "  td.float {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "  th.int {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "  td.int {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "</style>\n",
       "\n",
       "<table class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      \n",
       "        \n",
       "          <th class=\"int\"> </th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"str\">subset    </th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"str\">name  </th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"int\">  rows</th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"str\">type</th>\n",
       "        \n",
       "      \n",
       "    </tr>\n",
       "    \n",
       "  </thead>\n",
       "  <tbody>\n",
       "    \n",
       "      <tr>\n",
       "        <th>0</th>\n",
       "          \n",
       "            \n",
       "              <td class=\"str\">test</td>\n",
       "            \n",
       "          \n",
       "            \n",
       "              <td class=\"str\">actors</td>\n",
       "            \n",
       "          \n",
       "            \n",
       "              <td class=\"int\">122794</td>\n",
       "            \n",
       "          \n",
       "            \n",
       "              <td class=\"str\">View</td>\n",
       "            \n",
       "          \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>1</th>\n",
       "          \n",
       "            \n",
       "              <td class=\"str\">train</td>\n",
       "            \n",
       "          \n",
       "            \n",
       "              <td class=\"str\">actors</td>\n",
       "            \n",
       "          \n",
       "            \n",
       "              <td class=\"int\">571807</td>\n",
       "            \n",
       "          \n",
       "            \n",
       "              <td class=\"str\">View</td>\n",
       "            \n",
       "          \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>2</th>\n",
       "          \n",
       "            \n",
       "              <td class=\"str\">validation</td>\n",
       "            \n",
       "          \n",
       "            \n",
       "              <td class=\"str\">actors</td>\n",
       "            \n",
       "          \n",
       "            \n",
       "              <td class=\"int\">123117</td>\n",
       "            \n",
       "          \n",
       "            \n",
       "              <td class=\"str\">View</td>\n",
       "            \n",
       "          \n",
       "      </tr>\n",
       "    \n",
       "  </tbody>\n",
       "</table>\n",
       "</div>\n",
       "<div style='float: left;'>\n",
       "<div style='margin-bottom: 10px; font-size: 1rem;'>peripheral</div>\n",
       "    <style>\n",
       "  th {\n",
       "    text-align: left !important;\n",
       "  }\n",
       "  td {\n",
       "    text-align: left !important;\n",
       "  }\n",
       "  th:nth-child(1) {\n",
       "    text-align: right;\n",
       "    border-right: 1px solid LightGray;\n",
       "  }\n",
       "  th.float {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "  td.float {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "  th.int {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "  td.int {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "</style>\n",
       "\n",
       "<table class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      \n",
       "        \n",
       "          <th class=\"int\"> </th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"str\">name         </th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"int\">   rows</th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"str\">type     </th>\n",
       "        \n",
       "      \n",
       "    </tr>\n",
       "    \n",
       "  </thead>\n",
       "  <tbody>\n",
       "    \n",
       "      <tr>\n",
       "        <th>0</th>\n",
       "          \n",
       "            \n",
       "              <td class=\"str\">roles</td>\n",
       "            \n",
       "          \n",
       "            \n",
       "              <td class=\"int\">3431966</td>\n",
       "            \n",
       "          \n",
       "            \n",
       "              <td class=\"str\">DataFrame</td>\n",
       "            \n",
       "          \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>1</th>\n",
       "          \n",
       "            \n",
       "              <td class=\"str\">movies</td>\n",
       "            \n",
       "          \n",
       "            \n",
       "              <td class=\"int\">388269</td>\n",
       "            \n",
       "          \n",
       "            \n",
       "              <td class=\"str\">DataFrame</td>\n",
       "            \n",
       "          \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>2</th>\n",
       "          \n",
       "            \n",
       "              <td class=\"str\">movies_genres</td>\n",
       "            \n",
       "          \n",
       "            \n",
       "              <td class=\"int\">395119</td>\n",
       "            \n",
       "          \n",
       "            \n",
       "              <td class=\"str\">DataFrame</td>\n",
       "            \n",
       "          \n",
       "      </tr>\n",
       "    \n",
       "  </tbody>\n",
       "</table>\n",
       "</div>\n",
       "</div>"
      ],
      "text/plain": [
       "population\n",
       "    subset       name       rows   type\n",
       "0   test         actors   122794   View\n",
       "1   train        actors   571807   View\n",
       "2   validation   actors   123117   View\n",
       "\n",
       "peripheral\n",
       "    name               rows   type     \n",
       "0   roles           3431966   DataFrame\n",
       "1   movies           388269   DataFrame\n",
       "2   movies_genres    395119   DataFrame"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
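   ],
   "source": [
    "# The split applies to the population table (actors) only\n",
    "container = getml.data.Container(population=actors, split=split)\n",
    "\n",
    "# Peripheral tables are shared by the train, validation and test subsets\n",
    "container.add(\n",
    "    roles=roles,\n",
    "    movies=movies,\n",
    "    movies_genres=movies_genres,\n",
    ")\n",
    "\n",
    "container"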
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2. Predictive modelling\n",
    "\n",
    "Having loaded the data and defined the roles and units, we now create a getML pipeline for relational learning."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.1 Define relational model\n",
    "\n",
    "To get started with relational learning, we need to specify the data model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "            <div style='margin-top: 15px; margin-bottom: 5px;'>\n",
       "            <div style='margin-bottom: 10px; font-size: 1rem;'>diagram</div>\n",
       "            <div style=\"height:100px;width:1660px;position:relative;\"><svg height=\"90\" width=\"1650\"><rect y=\"0\" x=\"0\" rx=\"10\" ry=\"10\" width=\"150\" height=\"90\" style=\"fill:#6829c2;stroke-width:0;\" /><text y=\"73.8\" x=\"75.0\" dominant-baseline=\"middle\" text-anchor=\"middle\" fill=\"white\">movies_genres</text><rect x=\"51\" y=\"10\" rx=\"4\" ry=\"4\" width=\"48\" height=\"48\" style=\" fill:#6829c2;stroke:#ffffff;stroke-width:3;\" /><line x1=\"67.0\" y1=\"10\" x2=\"67.0\" y2=\"58\" style=\"stroke:white;stroke-width:3\" /><line x1=\"83.0\" y1=\"10\" x2=\"83.0\" y2=\"58\" style=\"stroke:white;stroke-width:3\" /><line x1=\"51\" y1=\"26.0\" x2=\"99\" y2=\"26.0\" style=\"stroke:white;stroke-width:3\" /><line x1=\"51\" y1=\"42.0\" x2=\"99\" y2=\"42.0\" style=\"stroke:white;stroke-width:3\" /><rect y=\"0\" x=\"500\" rx=\"10\" ry=\"10\" width=\"150\" height=\"90\" style=\"fill:#6829c2;stroke-width:0;\" /><text y=\"73.8\" x=\"575.0\" dominant-baseline=\"middle\" text-anchor=\"middle\" fill=\"white\">movies</text><rect x=\"551\" y=\"10\" rx=\"4\" ry=\"4\" width=\"48\" height=\"48\" style=\" fill:#6829c2;stroke:#ffffff;stroke-width:3;\" /><line x1=\"567.0\" y1=\"10\" x2=\"567.0\" y2=\"58\" style=\"stroke:white;stroke-width:3\" /><line x1=\"583.0\" y1=\"10\" x2=\"583.0\" y2=\"58\" style=\"stroke:white;stroke-width:3\" /><line x1=\"551\" y1=\"26.0\" x2=\"599\" y2=\"26.0\" style=\"stroke:white;stroke-width:3\" /><line x1=\"551\" y1=\"42.0\" x2=\"599\" y2=\"42.0\" style=\"stroke:white;stroke-width:3\" /><rect y=\"0\" x=\"1000\" rx=\"10\" ry=\"10\" width=\"150\" height=\"90\" style=\"fill:#6829c2;stroke-width:0;\" /><text y=\"73.8\" x=\"1075.0\" dominant-baseline=\"middle\" text-anchor=\"middle\" fill=\"white\">roles</text><rect x=\"1051\" y=\"10\" rx=\"4\" ry=\"4\" width=\"48\" height=\"48\" style=\" fill:#6829c2;stroke:#ffffff;stroke-width:3;\" /><line x1=\"1067.0\" y1=\"10\" x2=\"1067.0\" y2=\"58\" style=\"stroke:white;stroke-width:3\" /><line 
x1=\"1083.0\" y1=\"10\" x2=\"1083.0\" y2=\"58\" style=\"stroke:white;stroke-width:3\" /><line x1=\"1051\" y1=\"26.0\" x2=\"1099\" y2=\"26.0\" style=\"stroke:white;stroke-width:3\" /><line x1=\"1051\" y1=\"42.0\" x2=\"1099\" y2=\"42.0\" style=\"stroke:white;stroke-width:3\" /><rect y=\"0\" x=\"1500\" rx=\"10\" ry=\"10\" width=\"150\" height=\"90\" style=\"fill:#6829c2;stroke-width:0;\" /><text y=\"73.8\" x=\"1575.0\" dominant-baseline=\"middle\" text-anchor=\"middle\" fill=\"white\">actors</text><rect x=\"1551\" y=\"10\" rx=\"4\" ry=\"4\" width=\"48\" height=\"48\" style=\" fill:#6829c2;stroke:#ffffff;stroke-width:3;\" /><line x1=\"1567.0\" y1=\"10\" x2=\"1567.0\" y2=\"58\" style=\"stroke:white;stroke-width:3\" /><line x1=\"1583.0\" y1=\"10\" x2=\"1583.0\" y2=\"58\" style=\"stroke:white;stroke-width:3\" /><line x1=\"1551\" y1=\"26.0\" x2=\"1599\" y2=\"26.0\" style=\"stroke:white;stroke-width:3\" /><line x1=\"1551\" y1=\"42.0\" x2=\"1599\" y2=\"42.0\" style=\"stroke:white;stroke-width:3\" /><line x1=\"150\" y1=\"43.0\" x2=\"490\" y2=\"43.0\" style=\"stroke:#808080;;stroke-width:4\" /><polygon points=\"500, 43.0 490, 37.0 490, 49.0 \" style=\"fill:#808080;;stroke-width:0;\" /><rect y=\"10.0\" x=\"249.0\" rx=\"10\" ry=\"10\" width=\"150\" height=\"70\" style=\"fill:#6829c2;stroke-width:0;\" /><text dominant-baseline=\"middle\" text-anchor=\"middle\" fill=\"white\"><tspan y=\"45.0\" x=\"324.0\" font-size=\"7pt\" >movie_id = id</tspan></text><line x1=\"650\" y1=\"43.0\" x2=\"990\" y2=\"43.0\" style=\"stroke:#808080;;stroke-width:4\" /><polygon points=\"1000, 43.0 990, 37.0 990, 49.0 \" style=\"fill:#808080;;stroke-width:0;\" /><rect y=\"10.0\" x=\"749.0\" rx=\"10\" ry=\"10\" width=\"150\" height=\"70\" style=\"fill:#6829c2;stroke-width:0;\" /><text dominant-baseline=\"middle\" text-anchor=\"middle\" fill=\"white\"><tspan y=\"40.0\" x=\"824.0\" font-size=\"7pt\" >id = movie_id</tspan><tspan y=\"50.0\" x=\"824.0\" font-size=\"7pt\" >Relationship: 
many-to-one</tspan></text><line x1=\"1150\" y1=\"43.0\" x2=\"1490\" y2=\"43.0\" style=\"stroke:#808080;;stroke-width:4\" /><polygon points=\"1500, 43.0 1490, 37.0 1490, 49.0 \" style=\"fill:#808080;;stroke-width:0;\" /><rect y=\"10.0\" x=\"1249.0\" rx=\"10\" ry=\"10\" width=\"150\" height=\"70\" style=\"fill:#6829c2;stroke-width:0;\" /><text dominant-baseline=\"middle\" text-anchor=\"middle\" fill=\"white\"><tspan y=\"45.0\" x=\"1324.0\" font-size=\"7pt\" >actor_id = id</tspan></text></svg></div>\n",
       "            </div>\n",
       "\n",
       "            <div style='margin-top: 15px;'>\n",
       "            <div style='margin-bottom: 10px; font-size: 1rem;'>staging</div>\n",
       "            <style>\n",
       "  th {\n",
       "    text-align: left !important;\n",
       "  }\n",
       "  td {\n",
       "    text-align: left !important;\n",
       "  }\n",
       "  th:nth-child(1) {\n",
       "    text-align: right;\n",
       "    border-right: 1px solid LightGray;\n",
       "  }\n",
       "  th.float {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "  td.float {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "  th.int {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "  td.int {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "</style>\n",
       "\n",
       "<table class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      \n",
       "        \n",
       "          <th class=\"int\"> </th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"str\">data frames  </th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"str\">staging table                 </th>\n",
       "        \n",
       "      \n",
       "    </tr>\n",
       "    \n",
       "  </thead>\n",
       "  <tbody>\n",
       "    \n",
       "      <tr>\n",
       "        <th>0</th>\n",
       "          \n",
       "            \n",
       "              <td class=\"str\">actors</td>\n",
       "            \n",
       "          \n",
       "            \n",
       "              <td class=\"str\">ACTORS__STAGING_TABLE_1</td>\n",
       "            \n",
       "          \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>1</th>\n",
       "          \n",
       "            \n",
       "              <td class=\"str\">movies_genres</td>\n",
       "            \n",
       "          \n",
       "            \n",
       "              <td class=\"str\">MOVIES_GENRES__STAGING_TABLE_2</td>\n",
       "            \n",
       "          \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>2</th>\n",
       "          \n",
       "            \n",
       "              <td class=\"str\">roles, movies</td>\n",
       "            \n",
       "          \n",
       "            \n",
       "              <td class=\"str\">ROLES__STAGING_TABLE_3</td>\n",
       "            \n",
       "          \n",
       "      </tr>\n",
       "    \n",
       "  </tbody>\n",
       "</table>\n",
       "            </div>\n",
       "            "
      ],
      "text/plain": [
       "actors:\n",
       "  columns:\n",
       "\n",
       "\n",
       "  joins:\n",
       "  - right: 'roles'\n",
       "    on: (actors.id, roles.actor_id)\n",
       "    relationship: 'many-to-many'\n",
       "    lagged_targets: False\n",
       "\n",
       "roles:\n",
       "  columns:\n",
       "  - actor_id: join_key\n",
       "  - movie_id: join_key\n",
       "  - role: text\n",
       "\n",
       "  joins:\n",
       "  - right: 'movies'\n",
       "    on: (roles.movie_id, movies.id)\n",
       "    relationship: 'many-to-one'\n",
       "    lagged_targets: False\n",
       "\n",
       "movies:\n",
       "  columns:\n",
       "  - id: join_key\n",
       "  - year: numerical\n",
       "  - rank: numerical\n",
       "  - name: unused_string\n",
       "\n",
       "  joins:\n",
       "  - right: 'movies_genres'\n",
       "    on: (movies.id, movies_genres.movie_id)\n",
       "    relationship: 'many-to-many'\n",
       "    lagged_targets: False\n",
       "\n",
       "movies_genres:\n",
       "  columns:\n",
       "  - genre: categorical\n",
       "  - movie_id: join_key"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# actors is the population table; all other tables are peripheral\n",
    "dm = getml.data.DataModel(\"actors\")\n",
    "\n",
    "dm.add(getml.data.to_placeholder(\n",
    "    roles=roles,\n",
    "    movies=movies,\n",
    "    movies_genres=movies_genres,\n",
    "))\n",
    "\n",
    "# An actor can have many roles (default relationship: many-to-many)\n",
    "dm.population.join(\n",
    "    dm.roles,\n",
    "    on=(\"id\", \"actor_id\"),\n",
    ")\n",
    "\n",
    "# Each role belongs to exactly one movie\n",
    "dm.roles.join(\n",
    "    dm.movies,\n",
    "    on=(\"movie_id\", \"id\"),\n",
    "    relationship=getml.data.relationship.many_to_one,\n",
    ")\n",
    "\n",
    "# A movie can have several genres\n",
    "dm.movies.join(\n",
    "    dm.movies_genres,\n",
    "    on=(\"id\", \"movie_id\"),\n",
    ")\n",
    "\n",
    "dm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.2 getML pipeline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<!-- #### 2.1.1  -->\n",
    "__Set up the feature learner & predictor__"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can use either default parameters or more fine-tuned ones. Fine-tuning can increase the predictive accuracy to 85%, but the training time then increases to over 4 hours. We therefore use the default parameters here."
   ]
  },
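  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Split text columns (here: roles.role) into individual words\n",
    "text_field_splitter = getml.preprocessors.TextFieldSplitter()\n",
    "\n",
    "# Map categorical values and words to numerical values based on the target\n",
    "mapping = getml.preprocessors.Mapping()\n",
    "\n",
    "# Feature learner; cross-entropy loss because this is a classification problem\n",
    "fast_prop = getml.feature_learning.FastProp(\n",
    "    loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,\n",
    ")\n",
    "\n",
    "# XGBoost with default parameters: one instance ranks the features, one predicts\n",
    "feature_selector = getml.predictors.XGBoostClassifier()\n",
    "\n",
    "predictor = getml.predictors.XGBoostClassifier()"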
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "__Build the pipeline__"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "pipe = getml.pipeline.Pipeline(\n",
    "    tags=[\"fast_prop\"],\n",
    "    data_model=dm,\n",
    "    preprocessors=[text_field_splitter, mapping],\n",
    "    feature_learners=[fast_prop],\n",
    "    feature_selectors=[feature_selector],\n",
    "    predictors=[predictor],\n",
    "    # Keep only the top 10% of features, as ranked by the feature selector\n",
    "    share_selected_features=0.1,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.3 Model training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">Checking data model<span style=\"color: #808000; text-decoration-color: #808000\">...</span>\n",
       "</pre>\n"
      ],
      "text/plain": [
       "Checking data model\u001b[33m...\u001b[0m\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[2K  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n",
      "\u001b[2K  Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:28\n",
      "\u001b[2K  Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n",
      "\u001b[?25h"
     ]
    },
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">The pipeline check generated <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span> issues labeled INFO and <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0</span> issues labeled WARNING.\n",
       "</pre>\n"
      ],
      "text/plain": [
       "The pipeline check generated \u001b[1;36m1\u001b[0m issues labeled INFO and \u001b[1;36m0\u001b[0m issues labeled WARNING.\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<style>\n",
       "  th {\n",
       "    text-align: left !important;\n",
       "  }\n",
       "  td {\n",
       "    text-align: left !important;\n",
       "  }\n",
       "  th:nth-child(1) {\n",
       "    text-align: right;\n",
       "    border-right: 1px solid LightGray;\n",
       "  }\n",
       "  th.float {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "  td.float {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "  th.int {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "  td.int {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "</style>\n",
       "\n",
       "<table class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      \n",
       "        \n",
       "          <th class=\"int\"> </th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"str\">type</th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"str\">label                 </th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"str\">message                         </th>\n",
       "        \n",
       "      \n",
       "    </tr>\n",
       "    \n",
       "  </thead>\n",
       "  <tbody>\n",
       "    \n",
       "      <tr>\n",
       "        <th>0</th>\n",
       "          \n",
       "            \n",
       "              <td class=\"str\">INFO</td>\n",
       "            \n",
       "          \n",
       "            \n",
       "              <td class=\"str\">FOREIGN KEYS NOT FOUND</td>\n",
       "            \n",
       "          \n",
       "            \n",
       "              <td class=\"str\">When joining ROLES__STAGING_TABLE_3 and MOVIES_GENRES__STAGING_TABLE_2 over &#x27;id&#x27; and &#x27;movie_id&#x27;, there are no corresponding entries for 26.899421% of entries in &#x27;id&#x27; in &#x27;ROLES__STAGING_TABLE_3&#x27;. You might want to double-check your join keys.</td>\n",
       "            \n",
       "          \n",
       "      </tr>\n",
       "    \n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "    type   label                    message                         \n",
       "0   INFO   FOREIGN KEYS NOT FOUND   When joining ROLES__STAGING_TABL..."
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pipe.check(container.train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">Checking data model<span style=\"color: #808000; text-decoration-color: #808000\">...</span>\n",
       "</pre>\n"
      ],
      "text/plain": [
       "Checking data model\u001b[33m...\u001b[0m\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[2K  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n",
      "\u001b[2K  Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:09\n",
      "\u001b[?25h"
     ]
    },
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">The pipeline check generated <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">1</span> issues labeled INFO and <span style=\"color: #008080; text-decoration-color: #008080; font-weight: bold\">0</span> issues labeled WARNING.\n",
       "</pre>\n"
      ],
      "text/plain": [
       "The pipeline check generated \u001b[1;36m1\u001b[0m issues labeled INFO and \u001b[1;36m0\u001b[0m issues labeled WARNING.\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">To see the issues in full, run <span style=\"color: #800080; text-decoration-color: #800080; font-weight: bold\">.check</span><span style=\"font-weight: bold\">()</span> on the pipeline.\n",
       "</pre>\n"
      ],
      "text/plain": [
       "To see the issues in full, run \u001b[1;35m.check\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m on the pipeline.\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[2K  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n",
      "\u001b[2K  Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:09\n",
      "\u001b[2K  Indexing text fields... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:06\n",
      "\u001b[2K  FastProp: Trying 226 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:52\n",
      "\u001b[2K  FastProp: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:05\n",
      "\u001b[2K  FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 01:03\n",
      "\u001b[2K  XGBoost: Training as feature selector... ━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 04:04\n",
      "\u001b[2K  XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:29\n",
      "\u001b[?25h"
     ]
    },
    {
     "data": {
      "text/html": [
       "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">Trained pipeline.\n",
       "</pre>\n"
      ],
      "text/plain": [
       "Trained pipeline.\n"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Time taken: 0:06:52.549544.\n",
      "\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<pre>Pipeline(data_model='actors',\n",
       "         feature_learners=['FastProp'],\n",
       "         feature_selectors=['XGBoostClassifier'],\n",
       "         include_categorical=False,\n",
       "         loss_function='CrossEntropyLoss',\n",
       "         peripheral=['movies', 'movies_genres', 'roles'],\n",
       "         predictors=['XGBoostClassifier'],\n",
       "         preprocessors=['TextFieldSplitter', 'Mapping'],\n",
       "         share_selected_features=0.1,\n",
       "         tags=['fast_prop', 'container-i1xBdh'])</pre>"
      ],
      "text/plain": [
       "Pipeline(data_model='actors',\n",
       "         feature_learners=['FastProp'],\n",
       "         feature_selectors=['XGBoostClassifier'],\n",
       "         include_categorical=False,\n",
       "         loss_function='CrossEntropyLoss',\n",
       "         peripheral=['movies', 'movies_genres', 'roles'],\n",
       "         predictors=['XGBoostClassifier'],\n",
       "         preprocessors=['TextFieldSplitter', 'Mapping'],\n",
       "         share_selected_features=0.1,\n",
       "         tags=['fast_prop', 'container-i1xBdh'])"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pipe.fit(container.train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.4 Model evaluation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "lines_to_next_cell": 0
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[2K  Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n",
      "\u001b[2K  Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:08\n",
      "\u001b[2K  FastProp: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:02\n",
      "\u001b[2K  FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01\n",
      "\u001b[?25h"
     ]
    },
    {
     "data": {
      "text/html": [
       "<style>\n",
       "  th {\n",
       "    text-align: left !important;\n",
       "  }\n",
       "  td {\n",
       "    text-align: left !important;\n",
       "  }\n",
       "  th:nth-child(1) {\n",
       "    text-align: right;\n",
       "    border-right: 1px solid LightGray;\n",
       "  }\n",
       "  th.float {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "  td.float {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "  th.int {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "  td.int {\n",
       "    text-align: right !important;\n",
       "  }\n",
       "</style>\n",
       "\n",
       "<table class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      \n",
       "        \n",
       "          <th class=\"int\"> </th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"datetime\">date time          </th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"str\">set used</th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"str\">target</th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"float\">accuracy</th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"float\">    auc</th>\n",
       "        \n",
       "      \n",
       "        \n",
       "          <th class=\"float\">cross entropy</th>\n",
       "        \n",
       "      \n",
       "    </tr>\n",
       "    \n",
       "  </thead>\n",
       "  <tbody>\n",
       "    \n",
       "      <tr>\n",
       "        <th>0</th>\n",
       "          \n",
       "            \n",
       "              <td class=\"datetime\">2024-09-12 13:13:37</td>\n",
       "            \n",
       "          \n",
       "            \n",
       "              <td class=\"str\">train</td>\n",
       "            \n",
       "          \n",
       "            \n",
       "              <td class=\"str\">target</td>\n",
       "            \n",
       "          \n",
       "            \n",
       "              <td class=\"float\">0.8417</td>\n",
       "            \n",
       "          \n",
       "            \n",
       "              <td class=\"float\">0.9139</td>\n",
       "            \n",
       "          \n",
       "            \n",
       "              <td class=\"float\">0.3217</td>\n",
       "            \n",
       "          \n",
       "      </tr>\n",
       "    \n",
       "      <tr>\n",
       "        <th>1</th>\n",
       "          \n",
       "            \n",
       "              <td class=\"datetime\">2024-09-12 13:13:51</td>\n",
       "            \n",
       "          \n",
       "            \n",
       "              <td class=\"str\">test</td>\n",
       "            \n",
       "          \n",
       "            \n",
       "              <td class=\"str\">target</td>\n",
       "            \n",
       "          \n",
       "            \n",
       "              <td class=\"float\">0.842</td>\n",
       "            \n",
       "          \n",
       "            \n",
       "              <td class=\"float\">0.9138</td>\n",
       "            \n",
       "          \n",
       "            \n",
       "              <td class=\"float\">0.323</td>\n",
       "            \n",
       "          \n",
       "      </tr>\n",
       "    \n",
       "  </tbody>\n",
       "</table>"
      ],
      "text/plain": [
       "    date time             set used   target   accuracy       auc   cross entropy\n",
       "0   2024-09-12 13:13:37   train      target     0.8417    0.9139          0.3217\n",
       "1   2024-09-12 13:13:51   test       target     0.842     0.9138          0.323 "
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pipe.score(container.test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.5 Features\n",
    "\n",
    "The most important feature looks as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/markdown": [
       "```sql\n",
       "DROP TABLE IF EXISTS \"FEATURE_1_164\";\n",
       "\n",
       "CREATE TABLE \"FEATURE_1_164\" AS\n",
       "SELECT AVG( COALESCE( f_1_1_18.\"feature_1_1_18\", 0.0 ) ) AS \"feature_1_164\",\n",
       "       t1.rowid AS rownum\n",
       "FROM \"ACTORS__STAGING_TABLE_1\" t1\n",
       "INNER JOIN \"ROLES__STAGING_TABLE_3\" t2\n",
       "ON t1.\"id\" = t2.\"actor_id\"\n",
       "LEFT JOIN \"FEATURE_1_1_18\" f_1_1_18\n",
       "ON t2.rowid = f_1_1_18.rownum\n",
       "GROUP BY t1.rowid;\n",
       "```"
      ],
      "text/plain": [
       "'DROP TABLE IF EXISTS \"FEATURE_1_164\";\\n\\nCREATE TABLE \"FEATURE_1_164\" AS\\nSELECT AVG( COALESCE( f_1_1_18.\"feature_1_1_18\", 0.0 ) ) AS \"feature_1_164\",\\n       t1.rowid AS rownum\\nFROM \"ACTORS__STAGING_TABLE_1\" t1\\nINNER JOIN \"ROLES__STAGING_TABLE_3\" t2\\nON t1.\"id\" = t2.\"actor_id\"\\nLEFT JOIN \"FEATURE_1_1_18\" f_1_1_18\\nON t2.rowid = f_1_1_18.rownum\\nGROUP BY t1.rowid;'"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pipe.features.to_sql()[pipe.features.sort(by=\"importances\")[0].name]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### 2.6 Productionization\n",
    "\n",
    "It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Here, we will demonstrate how the pipeline can be transpiled to Spark SQL and then executed on a Spark cluster."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pipe.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save(\"imdb_spark\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
    "if RUN_SPARK:\n",
    "    spark = SparkSession.builder.appName(\n",
    "        \"online_retail\"\n",
    "    ).config(\n",
    "        \"spark.driver.maxResultSize\",\"10g\"\n",
    "    ).config(\n",
    "        \"spark.driver.memory\", \"10g\"\n",
    "    ).config(\n",
    "        \"spark.executor.memory\", \"20g\"\n",
    "    ).config(\n",
    "        \"spark.sql.execution.arrow.pyspark.enabled\", \"true\"\n",
    "    ).config(\n",
    "        \"spark.sql.session.timeZone\", \"UTC\"\n",
    "    ).enableHiveSupport().getOrCreate()\n",
    "\n",
    "    spark.sparkContext.setLogLevel(\"ERROR\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "if RUN_SPARK:\n",
    "    population_spark = container.train.population.to_pyspark(spark, name=\"actors\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "if RUN_SPARK:\n",
    "    movies_genres_spark = container.movies_genres.to_pyspark(spark, name=\"movies_genres\")\n",
    "    roles_spark = container.roles.to_pyspark(spark, name=\"roles\")\n",
    "    movies_spark = container.movies.to_pyspark(spark, name=\"movies\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "if RUN_SPARK:\n",
    "    getml.spark.execute(spark, \"imdb_spark\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "if RUN_SPARK:\n",
    "    spark.sql(\"SELECT * FROM `FEATURES` LIMIT 20\").toPandas()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Conclusion\n",
    "\n",
    "In this notebook we have demonstrated how getML can be applied to text fields. We have demonstrated the our  approach outperforms state-of-the-art relational learning algorithms on the IMDb dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "getml.engine.shutdown()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References\n",
    "\n",
    "Motl, Jan, and Oliver Schulte. \"The CTU prague relational learning repository.\" arXiv preprint arXiv:1511.03086 (2015).\n",
    "    \n",
    "Neville, Jennifer, and David Jensen. \"Relational dependency networks.\" Journal of Machine Learning Research 8.Mar (2007): 653-692.\n",
    "    \n",
    "Neville, Jennifer, and David Jensen. \"Collective classification with relational dependency networks.\" Workshop on Multi-Relational Data Mining (MRDM-2003). 2003.\n",
    "    \n",
    "Neville, Jennifer, et al. \"Learning relational probability trees.\" Proceedings of the Ninth ACM SIGKDD international conference on Knowledge discovery and data mining. 2003.\n",
    "    \n",
    "Perovšek, Matic, et al. \"Wordification: Propositionalization by unfolding relational data into bags of words.\" Expert Systems with Applications 42.17-18 (2015): 6442-6456."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "cell_metadata_filter": "-all",
   "encoding": "# -*- coding: utf-8 -*-",
   "notebook_metadata_filter": "-all"
  },
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}