{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# IMDb - Predicting actors' gender using getML\n", "\n", "In this tutorial, we demonstrate how getML can be applied to text fields. In relational databases, text fields are less structured and less standardized than categorical data, making it more difficult to extract useful information from them. Therefore, they are ignored in most data science projects on relational data. However, when using a relational learning tool such as getML, we can easily generate simple features from text fields and leverage the information contained therein.\n", "\n", "The point of this exercise is not to compete with modern deep-learning-based NLP approaches. The point is to develop an approach by which we can leverage fields in relational databases that would otherwise be ignored.\n", "\n", "As an example data set, we use the Internet Movie Database, which has been used by previous studies in the relational learning literature. This allows us to benchmark our approach against state-of-the-art algorithms from that literature. We demonstrate that getML outperforms these algorithms.\n", "\n", "Summary:\n", "\n", "- Prediction type: __Classification model__\n", "- Domain: __Entertainment__\n", "- Prediction target: __The gender of an actor__ \n", "- Population size: __817718__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Background\n", "\n", "The data set contains about 800,000 actors. The goal is to predict the gender of these actors based on other information we have about them, such as the movies they have participated in and the roles they have played in these movies.\n", "\n", "The data set has been downloaded from the [CTU Prague relational learning repository](https://relational.fit.cvut.cz/dataset/IMDb) (Motl and Schulte, 2015), which now resides at [relational-data.org](https://relational-data.org/dataset/IMDb)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's get started with the analysis and set up your session:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "getML engine is already running.\n", "\n", "Connected to project 'imdb'\n" ] } ], "source": [ "import copy\n", "import os\n", "os.environ[\"PYARROW_IGNORE_TIMEZONE\"] = \"1\"\n", "from pathlib import Path\n", "\n", "from urllib import request\n", "\n", "import numpy as np\n", "import pandas as pd\n", "from IPython.display import Image\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "%matplotlib inline \n", "\n", "import getml\n", "from pyspark.sql import SparkSession\n", "\n", "getml.engine.launch(home_directory=Path.home(), allow_remote_ips=True, token='token')\n", "getml.engine.set_project('imdb')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "In the following, we set some flags that affect execution of the notebook:\n", "- We don't let the algorithms utilize the information on actors' first names (see [below](#first-names) for an explanation)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "USE_FIRST_NAMES = False\n", "RUN_SPARK = False" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. 
Loading data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1 Download from source\n", "\n", "We begin by downloading the data from the source file:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Connection(dbname='imdb_ijs',\n", " dialect='mysql',\n", " host='db.relational-data.org',\n", " port=3306)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "conn = getml.database.connect_mysql(\n", " host=\"db.relational-data.org\",\n", " dbname=\"imdb_ijs\",\n", " port=3306,\n", " user=\"guest\",\n", " password=\"relational\"\n", ")\n", "\n", "conn" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def load_if_needed(name):\n", " \"\"\"\n", " Loads the data from the relational learning\n", " repository, if the data frame has not already\n", " been loaded.\n", " \"\"\"\n", " if getml.data.exists(name):\n", " return getml.data.load_data_frame(name)\n", " data_frame = getml.data.DataFrame.from_db(\n", " name=name,\n", " table_name=name,\n", " conn=conn\n", " )\n", " data_frame.save()\n", " return data_frame" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "actors = load_if_needed(\"actors\")\n", "roles = load_if_needed(\"roles\")\n", "movies = load_if_needed(\"movies\")\n", "movies_genres = load_if_needed(\"movies_genres\")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n" ], "text/plain": [ " name id first_name last_name gender \n", " role unused_float unused_string unused_string unused_string\n", " 0 2 Michael 'babeepower' Viera M \n", " 1 3 Eloy 'Chincheta' M \n", " 2 4 Dieguito 'El Cigala' M \n", " 3 5 Antonio 'El de Chipiona' M \n", " 4 6 José 'El Francés' M \n", " ... ... ... ... \n", "817713 845461 Herdís Þorvaldsdóttir F \n", "817714 845462 Katla Margrét Þorvaldsdóttir F \n", "817715 845463 Lilja Nótt Þórarinsdóttir F \n", "817716 845464 Hólmfríður Þórhallsdóttir F \n", "817717 845465 Theódóra Þórðardóttir F \n", "\n", "\n", "817718 rows x 4 columns\n", "memory usage: 40.22 MB\n", "type: getml.DataFrame" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "actors" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n" ], "text/plain": [ " name actor_id movie_id role \n", " role unused_float unused_float unused_string \n", " 0 2 280088 Stevie \n", " 1 2 396232 Various/lyricist\n", " 2 3 376687 Gitano 1 \n", " 3 4 336265 El Cigala \n", " 4 5 135644 Himself \n", " ... ... ... \n", "3431961 845461 137097 Kata \n", "3431962 845462 208838 Magga \n", "3431963 845463 870 Gunna \n", "3431964 845464 378123 Gudrun \n", "3431965 845465 378123 NULL \n", "\n", "\n", "3431966 rows x 3 columns\n", "memory usage: 115.41 MB\n", "type: getml.DataFrame" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "roles" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n" ], "text/plain": [ " name id year rank name \n", " role unused_float unused_float unused_float unused_string \n", " 0 0 2002 nan #28 \n", " 1 1 2000 nan #7 Train: An Immigrant Journey, ...\n", " 2 2 1971 6.4 $ \n", " 3 3 1913 nan $1,000 Reward \n", " 4 4 1915 nan $1,000 Reward \n", " ... ... ... ... \n", "388264 412316 1991 nan \"zem blch krlu\" \n", "388265 412317 1995 nan \"rgammk\" \n", "388266 412318 2002 nan \"zgnm Leyla\" \n", "388267 412319 1983 nan \" Istanbul\" \n", "388268 412320 1958 nan \"sterreich\" \n", "\n", "\n", "388269 rows x 4 columns\n", "memory usage: 19.92 MB\n", "type: getml.DataFrame" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movies" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n" ], "text/plain": [ " name movie_id genre \n", " role unused_float unused_string\n", " 0 1 Documentary \n", " 1 1 Short \n", " 2 2 Comedy \n", " 3 2 Crime \n", " 4 5 Western \n", " ... ... \n", "395114 378612 Adventure \n", "395115 378612 Drama \n", "395116 378613 Comedy \n", "395117 378613 Drama \n", "395118 378614 Comedy \n", "\n", "\n", "395119 rows x 2 columns\n", "memory usage: 9.24 MB\n", "type: getml.DataFrame" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movies_genres" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2 Prepare data for getML" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "getML requires that we define *roles* for each of the columns." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "actors[\"target\"] = (actors.gender == 'F')\n", "actors.set_role(\"id\", getml.data.roles.join_key)\n", "actors.set_role(\"target\", getml.data.roles.target)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The benchmark studies do not state clearly whether it is fair game to use the first names of the actors. Using the first names, we can easily increase the predictive accuracy to above 90%. However, doing so basically turns the problem into a first-name identification problem rather than a relational learning problem. This would undermine the point of this notebook, which is to showcase relational learning. Therefore, we assume that using the first names is not allowed. Feel free to set this flag [above](#flags) to see how well getML incorporates such straightforward information into its feature logic." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n" ], "text/plain": [ " name id target first_name last_name gender \n", " role join_key target unused_string unused_string unused_string\n", " 0 2 0 Michael 'babeepower' Viera M \n", " 1 3 0 Eloy 'Chincheta' M \n", " 2 4 0 Dieguito 'El Cigala' M \n", " 3 5 0 Antonio 'El de Chipiona' M \n", " 4 6 0 José 'El Francés' M \n", " ... ... ... ... ... \n", "817713 845461 1 Herdís Þorvaldsdóttir F \n", "817714 845462 1 Katla Margrét Þorvaldsdóttir F \n", "817715 845463 1 Lilja Nótt Þórarinsdóttir F \n", "817716 845464 1 Hólmfríður Þórhallsdóttir F \n", "817717 845465 1 Theódóra Þórðardóttir F \n", "\n", "\n", "817718 rows x 5 columns\n", "memory usage: 43.49 MB\n", "type: getml.DataFrame" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "if USE_FIRST_NAMES:\n", " actors.set_role(\"first_name\", getml.data.roles.text)\n", "actors" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n" ], "text/plain": [ " name actor_id movie_id role \n", " role join_key join_key text \n", " 0 2 280088 Stevie \n", " 1 2 396232 Various/lyricist\n", " 2 3 376687 Gitano 1 \n", " 3 4 336265 El Cigala \n", " 4 5 135644 Himself \n", " ... ... ... \n", "3431961 845461 137097 Kata \n", "3431962 845462 208838 Magga \n", "3431963 845463 870 Gunna \n", "3431964 845464 378123 Gudrun \n", "3431965 845465 378123 NULL \n", "\n", "\n", "3431966 rows x 3 columns\n", "memory usage: 87.96 MB\n", "type: getml.DataFrame" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "roles.set_role([\"actor_id\", \"movie_id\"], getml.data.roles.join_key)\n", "roles.set_role(\"role\", getml.data.roles.text)\n", "roles" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n" ], "text/plain": [ " name id year rank name \n", " role join_key numerical numerical unused_string \n", " 0 0 2002 nan #28 \n", " 1 1 2000 nan #7 Train: An Immigrant Journey, ...\n", " 2 2 1971 6.4 $ \n", " 3 3 1913 nan $1,000 Reward \n", " 4 4 1915 nan $1,000 Reward \n", " ... ... ... ... \n", "388264 412316 1991 nan \"zem blch krlu\" \n", "388265 412317 1995 nan \"rgammk\" \n", "388266 412318 2002 nan \"zgnm Leyla\" \n", "388267 412319 1983 nan \" Istanbul\" \n", "388268 412320 1958 nan \"sterreich\" \n", "\n", "\n", "388269 rows x 4 columns\n", "memory usage: 18.37 MB\n", "type: getml.DataFrame" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movies.set_role(\"id\", getml.data.roles.join_key)\n", "movies.set_role([\"year\", \"rank\"], getml.data.roles.numerical)\n", "movies" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
name    movie_id  genre
role    join_key  categorical
0       1         Documentary
1       1         Short
2       2         Comedy
3       2         Crime
4       5         Western
        ...       ...
395114  378612    Adventure
395115  378612    Drama
395116  378613    Comedy
395117  378613    Drama
395118  378614    Comedy
\n", "\n", "

\n", " 395119 rows x 2 columns
\n", " memory usage: 3.16 MB
\n", " name: movies_genres
\n", " type: getml.DataFrame
\n", " \n", "

\n" ], "text/plain": [ " name movie_id genre \n", " role join_key categorical\n", " 0 1 Documentary\n", " 1 1 Short \n", " 2 2 Comedy \n", " 3 2 Crime \n", " 4 5 Western \n", " ... ... \n", "395114 378612 Adventure \n", "395115 378612 Drama \n", "395116 378613 Comedy \n", "395117 378613 Drama \n", "395118 378614 Comedy \n", "\n", "\n", "395119 rows x 2 columns\n", "memory usage: 3.16 MB\n", "type: getml.DataFrame" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "movies_genres.set_role(\"movie_id\", getml.data.roles.join_key)\n", "movies_genres.set_role(\"genre\", getml.data.roles.categorical)\n", "movies_genres" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We need to separate our data set into a training, testing and validation set:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
0  train
1  validation
2  train
3  validation
4  validation
   ...
\n", "\n", "

\n", " infinite number of rows
\n", " \n", " type: StringColumnView
\n", " \n", "

\n" ], "text/plain": [ " \n", " 0 train \n", " 1 validation\n", " 2 train \n", " 3 validation\n", " 4 validation\n", " ... \n", "\n", "\n", "infinite number of rows\n", "type: StringColumnView" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "split = getml.data.split.random(train=0.7, validation=0.15, test=0.15)\n", "split" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
\n", "
population
\n", " \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
   subset      name    rows    type
0  test        actors  122794  View
1  train       actors  571807  View
2  validation  actors  123117  View
\n", "
\n", "
\n", "
peripheral
\n", " \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
   name           rows     type
0  roles          3431966  DataFrame
1  movies         388269   DataFrame
2  movies_genres  395119   DataFrame
\n", "
\n", "
" ], "text/plain": [ "population\n", " subset name rows type\n", "0 test actors 122794 View\n", "1 train actors 571807 View\n", "2 validation actors 123117 View\n", "\n", "peripheral\n", " name rows type \n", "0 roles 3431966 DataFrame\n", "1 movies 388269 DataFrame\n", "2 movies_genres 395119 DataFrame" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "container = getml.data.Container(population=actors, split=split)\n", "\n", "container.add(\n", " roles=roles,\n", " movies=movies,\n", " movies_genres=movies_genres,\n", ")\n", "\n", "container" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Predictive modelling\n", "\n", "We loaded the data and defined the roles and units. Next, we create a getML pipeline for relational learning." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1 Define relational model\n", "\n", "To get started with relational learning, we need to specify the data model." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
diagram
\n", "
Data model diagram: movies_genres -> movies (movie_id = id); movies -> roles (id = movie_id, relationship: many-to-one); roles -> actors (actor_id = id)
\n", "
\n", "\n", "
\n", "
staging
\n", " \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
   data frames    staging table
0  actors         ACTORS__STAGING_TABLE_1
1  movies_genres  MOVIES_GENRES__STAGING_TABLE_2
2  roles, movies  ROLES__STAGING_TABLE_3
\n", "
\n", " " ], "text/plain": [ "actors:\n", " columns:\n", "\n", "\n", " joins:\n", " - right: 'roles'\n", " on: (actors.id, roles.actor_id)\n", " relationship: 'many-to-many'\n", " lagged_targets: False\n", "\n", "roles:\n", " columns:\n", " - actor_id: join_key\n", " - movie_id: join_key\n", " - role: text\n", "\n", " joins:\n", " - right: 'movies'\n", " on: (roles.movie_id, movies.id)\n", " relationship: 'many-to-one'\n", " lagged_targets: False\n", "\n", "movies:\n", " columns:\n", " - id: join_key\n", " - year: numerical\n", " - rank: numerical\n", " - name: unused_string\n", "\n", " joins:\n", " - right: 'movies_genres'\n", " on: (movies.id, movies_genres.movie_id)\n", " relationship: 'many-to-many'\n", " lagged_targets: False\n", "\n", "movies_genres:\n", " columns:\n", " - genre: categorical\n", " - movie_id: join_key" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm = getml.data.DataModel(\"actors\")\n", "\n", "dm.add(getml.data.to_placeholder(\n", " roles=roles,\n", " movies=movies,\n", " movies_genres=movies_genres,\n", "))\n", "\n", "dm.population.join(\n", " dm.roles,\n", " on=(\"id\", \"actor_id\"),\n", ")\n", "\n", "dm.roles.join(\n", " dm.movies,\n", " on=(\"movie_id\", \"id\"),\n", " relationship=getml.data.relationship.many_to_one,\n", ")\n", "\n", "dm.movies.join(\n", " dm.movies_genres,\n", " on=(\"id\", \"movie_id\"),\n", ")\n", "\n", "dm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2 getML pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "__Set up the feature learner & predictor__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can either use default parameters or more fine-tuned ones. Fine-tuning can increase our predictive accuracy to 85%, but the training time then increases to over 4 hours. We therefore use the default parameters here." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "text_field_splitter = getml.preprocessors.TextFieldSplitter()\n", "\n", "mapping = getml.preprocessors.Mapping()\n", "\n", "fast_prop = getml.feature_learning.FastProp(\n", " loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,\n", ")\n", "\n", "feature_selector = getml.predictors.XGBoostClassifier()\n", "\n", "predictor = getml.predictors.XGBoostClassifier()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Build the pipeline__" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "pipe = getml.pipeline.Pipeline(\n", " tags=['fast_prop'],\n", " data_model=dm,\n", " preprocessors=[text_field_splitter, mapping],\n", " feature_learners=[fast_prop],\n", " feature_selectors=[feature_selector],\n", " predictors=[predictor],\n", " share_selected_features=0.1,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3 Model training" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Checking data model...\n", "Staging... 100% |██████████| [elapsed: 00:01, remaining: 00:00] \n", "Preprocessing... 100% |██████████| [elapsed: 00:22, remaining: 00:00] \n", "Checking... 100% |██████████| [elapsed: 00:01, remaining: 00:00] \n", "\n", "The pipeline check generated 1 issues labeled INFO and 0 issues labeled WARNING.\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
   type  label                   message
0  INFO  FOREIGN KEYS NOT FOUND  When joining ROLES__STAGING_TABLE_3 and MOVIES_GENRES__STAGING_TABLE_2 over 'id' and 'movie_id', there are no corresponding entries for 26.899421% of entries in 'id' in 'ROLES__STAGING_TABLE_3'. You might want to double-check your join keys.
" ], "text/plain": [ " type label message \n", "0 INFO FOREIGN KEYS NOT FOUND When joining ROLES__STAGING_TABL..." ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe.check(container.train)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Checking data model...\n", "Staging... 100% |██████████| [elapsed: 00:01, remaining: 00:00] \n", "Preprocessing... 100% |██████████| [elapsed: 00:07, remaining: 00:00] \n", "\n", "The pipeline check generated 1 issues labeled INFO and 0 issues labeled WARNING.\n", "To see the issues in full, run .check() on the pipeline.\n", "\n", "Staging... 100% |██████████| [elapsed: 00:01, remaining: 00:00] \n", "Preprocessing... 100% |██████████| [elapsed: 00:06, remaining: 00:00] \n", "Indexing text fields... 100% |██████████| [elapsed: 00:05, remaining: 00:00] \n", "FastProp: Trying 226 features... 100% |██████████| [elapsed: 00:20, remaining: 00:00] \n", "FastProp: Building subfeatures... 100% |██████████| [elapsed: 00:03, remaining: 00:00] \n", "FastProp: Building features... 100% |██████████| [elapsed: 00:20, remaining: 00:00] \n", "XGBoost: Training as feature selector... 100% |██████████| [elapsed: 04:60, remaining: 00:00] \n", "XGBoost: Training as predictor... 100% |██████████| [elapsed: 00:43, remaining: 00:00] \n", "\n", "Trained pipeline.\n", "Time taken: 0h:6m:42.850666\n", "\n" ] }, { "data": { "text/html": [ "
Pipeline(data_model='actors',\n",
       "         feature_learners=['FastProp'],\n",
       "         feature_selectors=['XGBoostClassifier'],\n",
       "         include_categorical=False,\n",
       "         loss_function='CrossEntropyLoss',\n",
       "         peripheral=['movies', 'movies_genres', 'roles'],\n",
       "         predictors=['XGBoostClassifier'],\n",
       "         preprocessors=['TextFieldSplitter', 'Mapping'],\n",
       "         share_selected_features=0.1,\n",
       "         tags=['fast_prop', 'container-kiwgkg'])
" ], "text/plain": [ "Pipeline(data_model='actors',\n", " feature_learners=['FastProp'],\n", " feature_selectors=['XGBoostClassifier'],\n", " include_categorical=False,\n", " loss_function='CrossEntropyLoss',\n", " peripheral=['movies', 'movies_genres', 'roles'],\n", " predictors=['XGBoostClassifier'],\n", " preprocessors=['TextFieldSplitter', 'Mapping'],\n", " share_selected_features=0.1,\n", " tags=['fast_prop', 'container-kiwgkg'])" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe.fit(container.train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.4 Model evaluation" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "lines_to_next_cell": 0 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Staging... 100% |██████████| [elapsed: 00:01, remaining: 00:00] \n", "Preprocessing... 100% |██████████| [elapsed: 00:07, remaining: 00:00] \n", "FastProp: Building subfeatures... 100% |██████████| [elapsed: 00:02, remaining: 00:00] \n", "FastProp: Building features... 113% |███████████| [elapsed: 00:00, remaining: 00:00] \n", "\n" ] }, { "data": { "text/html": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
   date time            set used  target  accuracy  auc     cross entropy
0  2024-02-21 15:07:05  train     target  0.8417    0.9139  0.3213
1  2024-02-21 15:07:19  test      target  0.842     0.9139  0.3225
" ], "text/plain": [ " date time set used target accuracy auc cross entropy\n", "0 2024-02-21 15:07:05 train target 0.8417 0.9139 0.3213\n", "1 2024-02-21 15:07:19 test target 0.842 0.9139 0.3225" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe.score(container.test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.5 Features\n", "\n", "The most important feature looks as follows:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "```sql\n", "DROP TABLE IF EXISTS \"FEATURE_1_114\";\n", "\n", "CREATE TABLE \"FEATURE_1_114\" AS\n", "SELECT AVG( COALESCE( f_1_1_13.\"feature_1_1_13\", 0.0 ) ) AS \"feature_1_114\",\n", " t1.rowid AS rownum\n", "FROM \"ACTORS__STAGING_TABLE_1\" t1\n", "INNER JOIN \"ROLES__STAGING_TABLE_3\" t2\n", "ON t1.\"id\" = t2.\"actor_id\"\n", "LEFT JOIN \"FEATURE_1_1_13\" f_1_1_13\n", "ON t2.rowid = f_1_1_13.rownum\n", "GROUP BY t1.rowid;\n", "```" ], "text/plain": [ "'DROP TABLE IF EXISTS \"FEATURE_1_114\";\\n\\nCREATE TABLE \"FEATURE_1_114\" AS\\nSELECT AVG( COALESCE( f_1_1_13.\"feature_1_1_13\", 0.0 ) ) AS \"feature_1_114\",\\n t1.rowid AS rownum\\nFROM \"ACTORS__STAGING_TABLE_1\" t1\\nINNER JOIN \"ROLES__STAGING_TABLE_3\" t2\\nON t1.\"id\" = t2.\"actor_id\"\\nLEFT JOIN \"FEATURE_1_1_13\" f_1_1_13\\nON t2.rowid = f_1_1_13.rownum\\nGROUP BY t1.rowid;'" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe.features.to_sql()[pipe.features.sort(by=\"importances\")[0].name]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " ### 2.6 Productionization\n", "\n", "It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Here, we will demonstrate how the pipeline can be transpiled to Spark SQL and then executed on a Spark cluster." 
] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "pipe.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save(\"imdb_spark\")" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [], "source": [ "if RUN_SPARK:\n", "    spark = SparkSession.builder.appName(\n", "        \"imdb\"\n", "    ).config(\n", "        \"spark.driver.maxResultSize\", \"10g\"\n", "    ).config(\n", "        \"spark.driver.memory\", \"10g\"\n", "    ).config(\n", "        \"spark.executor.memory\", \"20g\"\n", "    ).config(\n", "        \"spark.sql.execution.arrow.pyspark.enabled\", \"true\"\n", "    ).config(\n", "        \"spark.sql.session.timeZone\", \"UTC\"\n", "    ).enableHiveSupport().getOrCreate()\n", "\n", "    spark.sparkContext.setLogLevel(\"ERROR\")" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "if RUN_SPARK:\n", "    population_spark = container.train.population.to_pyspark(spark, name=\"actors\")" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [], "source": [ "if RUN_SPARK:\n", "    movies_genres_spark = container.movies_genres.to_pyspark(spark, name=\"movies_genres\")\n", "    roles_spark = container.roles.to_pyspark(spark, name=\"roles\")\n", "    movies_spark = container.movies.to_pyspark(spark, name=\"movies\")" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [], "source": [ "if RUN_SPARK:\n", "    getml.spark.execute(spark, \"imdb_spark\")" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [], "source": [ "if RUN_SPARK:\n", "    spark.sql(\"SELECT * FROM `FEATURES` LIMIT 20\").toPandas()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Conclusion\n", "\n", "In this notebook, we have demonstrated how getML can be applied to text fields. We have also shown that our approach outperforms state-of-the-art relational learning algorithms on the IMDb dataset." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References\n", "\n", "Motl, Jan, and Oliver Schulte. \"The CTU Prague relational learning repository.\" arXiv preprint arXiv:1511.03086 (2015).\n", " \n", "Neville, Jennifer, and David Jensen. \"Relational dependency networks.\" Journal of Machine Learning Research 8.Mar (2007): 653-692.\n", " \n", "Neville, Jennifer, and David Jensen. \"Collective classification with relational dependency networks.\" Workshop on Multi-Relational Data Mining (MRDM-2003). 2003.\n", " \n", "Neville, Jennifer, et al. \"Learning relational probability trees.\" Proceedings of the Ninth ACM SIGKDD international conference on Knowledge discovery and data mining. 2003.\n", " \n", "Perovšek, Matic, et al. \"Wordification: Propositionalization by unfolding relational data into bags of words.\" Expert Systems with Applications 42.17-18 (2015): 6442-6456." ] } ], "metadata": { "jupytext": { "cell_metadata_filter": "-all", "encoding": "# -*- coding: utf-8 -*-", "notebook_metadata_filter": "-all" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.18" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 4 }