{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# CORA - Categorizing academic publications\n", "\n", "In this notebook, we compare getML against extant approaches in the relational learning literature on the CORA data set, which is often used for benchmarking. We demonstrate that getML outperforms the state of the art in the relational learning literature on this data set. Beyond the benchmarking aspects, this notebook showcases getML's excellent capabilities in dealing with categorical data.\n", "\n", "Summary:\n", "\n", "- Prediction type: __Classification model__\n", "- Domain: __Academia__\n", "- Prediction target: __The category of a paper__ \n", "- Population size: __2708__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Background\n", "\n", "CORA is a well-known benchmarking dataset in the academic literature on relational learning. The dataset contains 2708 scientific publications on machine learning. The papers are divided into 7 categories. The challenge is to predict the category of a paper based on the papers it cites, the papers it is cited by and the keywords contained in the paper.\n", "\n", "It has been downloaded from the [CTU Prague relational learning repository](https://relational.fit.cvut.cz/dataset/CORA) (Motl and Schulte, 2015), which now resides at [relational-data.org](https://relational-data.org/dataset/CORA)." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's get started with the analysis and set up your session:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%pip install -q \"getml==1.5.0\" \"matplotlib==3.9.2\" \"ipywidgets==8.1.5\"" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "getML API version: 1.5.0\n", "\n" ] } ], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "import getml\n", "\n", "%matplotlib inline \n", "\n", "print(f\"getML API version: {getml.__version__}\\n\")" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Launching ./getML --allow-push-notifications=true --allow-remote-ips=true --home-directory=/home/user --in-memory=true --install=false --launch-browser=true --log=false --token=token in /home/user/.getML/getml-1.5.0-x64-linux...\n", "Launched the getML Engine. The log output will be stored in /home/user/.getML/logs/20240912134108.log.\n", "\u001b[2K Loading pipelines... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[?25h" ] }, { "data": { "text/html": [ "
Connected to project 'cora'.\n",
       "
\n" ], "text/plain": [ "Connected to project \u001b[32m'cora'\u001b[0m.\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "getml.engine.launch(allow_remote_ips=True, token='token')\n", "getml.engine.set_project('cora')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1. Loading data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.1 Download from source\n", "\n", "We begin by downloading the data from the source database:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Connection(dbname='CORA', dialect='mysql', host='relational.fel.cvut.cz', port=3306)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "conn = getml.database.connect_mysql(\n", "    host=\"relational.fel.cvut.cz\",\n", "    dbname=\"CORA\",\n", "    port=3306,\n", "    user=\"guest\",\n", "    password=\"ctu-relational\"\n", ")\n", "\n", "conn" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def load_if_needed(name):\n", "    \"\"\"\n", "    Loads the data from the relational learning\n", "    repository, if the data frame has not already\n", "    been loaded.\n", "    \"\"\"\n", "    if not getml.data.exists(name):\n", "        data_frame = getml.data.DataFrame.from_db(\n", "            name=name,\n", "            table_name=name,\n", "            conn=conn\n", "        )\n", "        data_frame.save()\n", "    else:\n", "        data_frame = getml.data.load_data_frame(name)\n", "    return data_frame" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "paper = load_if_needed(\"paper\")\n", "cites = load_if_needed(\"cites\")\n", "content = load_if_needed(\"content\")" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " 
\n" ], "text/plain": [ "name paper_id class_label \n", "role unused_float unused_string \n", " 0 35 Genetic_Algorithms \n", " 1 40 Genetic_Algorithms \n", " 2 114 Reinforcement_Learning\n", " 3 117 Reinforcement_Learning\n", " 4 128 Reinforcement_Learning\n", " ... ... \n", "2703 1154500 Case_Based \n", "2704 1154520 Neural_Networks \n", "2705 1154524 Rule_Learning \n", "2706 1154525 Rule_Learning \n", "2707 1155073 Rule_Learning \n", "\n", "\n", "2708 rows x 2 columns\n", "memory usage: 0.09 MB\n", "type: getml.DataFrame" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "paper" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n" ], "text/plain": [ "name cited_paper_id citing_paper_id\n", "role unused_float unused_float\n", " 0 35 887\n", " 1 35 1033\n", " 2 35 1688\n", " 3 35 1956\n", " 4 35 8865\n", " ... ...\n", "5424 853116 19621\n", "5425 853116 853155\n", "5426 853118 1140289\n", "5427 853155 853118\n", "5428 954315 1155073\n", "\n", "\n", "5429 rows x 2 columns\n", "memory usage: 0.09 MB\n", "type: getml.DataFrame" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cites" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n" ], "text/plain": [ " name paper_id word_cited_id\n", " role unused_float unused_string\n", " 0 35 word100 \n", " 1 35 word1152 \n", " 2 35 word1175 \n", " 3 35 word1228 \n", " 4 35 word1248 \n", " ... ... \n", "49211 1155073 word75 \n", "49212 1155073 word759 \n", "49213 1155073 word789 \n", "49214 1155073 word815 \n", "49215 1155073 word979 \n", "\n", "\n", "49216 rows x 2 columns\n", "memory usage: 1.20 MB\n", "type: getml.DataFrame" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "content" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 1.2 Prepare data for getML" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "getML requires that we define *roles* for each of the columns." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n" ], "text/plain": [ "name paper_id class_label \n", "role join_key categorical \n", " 0 35 Genetic_Algorithms \n", " 1 40 Genetic_Algorithms \n", " 2 114 Reinforcement_Learning\n", " 3 117 Reinforcement_Learning\n", " 4 128 Reinforcement_Learning\n", " ... ... \n", "2703 1154500 Case_Based \n", "2704 1154520 Neural_Networks \n", "2705 1154524 Rule_Learning \n", "2706 1154525 Rule_Learning \n", "2707 1155073 Rule_Learning \n", "\n", "\n", "2708 rows x 2 columns\n", "memory usage: 0.02 MB\n", "type: getml.DataFrame" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "paper.set_role(\"paper_id\", getml.data.roles.join_key)\n", "paper.set_role(\"class_label\", getml.data.roles.categorical)\n", "paper" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n" ], "text/plain": [ "name cited_paper_id citing_paper_id\n", "role join_key join_key\n", " 0 35 887\n", " 1 35 1033\n", " 2 35 1688\n", " 3 35 1956\n", " 4 35 8865\n", " ... ...\n", "5424 853116 19621\n", "5425 853116 853155\n", "5426 853118 1140289\n", "5427 853155 853118\n", "5428 954315 1155073\n", "\n", "\n", "5429 rows x 2 columns\n", "memory usage: 0.04 MB\n", "type: getml.DataFrame" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cites.set_role([\"cited_paper_id\", \"citing_paper_id\"], getml.data.roles.join_key)\n", "cites" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We assign roles to the columns of `content` in the same way:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n" ], "text/plain": [ " name paper_id word_cited_id\n", " role join_key categorical \n", " 0 35 word100 \n", " 1 35 word1152 \n", " 2 35 word1175 \n", " 3 35 word1228 \n", " 4 35 word1248 \n", " ... ... \n", "49211 1155073 word75 \n", "49212 1155073 word759 \n", "49213 1155073 word789 \n", "49214 1155073 word815 \n", "49215 1155073 word979 \n", "\n", "\n", "49216 rows x 2 columns\n", "memory usage: 0.39 MB\n", "type: getml.DataFrame" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "content.set_role(\"paper_id\", getml.data.roles.join_key)\n", "content.set_role(\"word_cited_id\", getml.data.roles.categorical)\n", "content" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The goal is to predict seven different labels. We generate a target column for each of those labels. We also have to separate the data set into a training and testing set." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n" ], "text/plain": [ "name paper_id class_label=Case_Based class_label=Genetic_Algorithms class_label=Neural_Networks\n", "role join_key target target target\n", " 0 35 0 1 0\n", " 1 40 0 1 0\n", " 2 114 0 0 0\n", " 3 117 0 0 0\n", " 4 128 0 0 0\n", " ... ... ... ... ...\n", "\n", "name class_label=Probabilistic_Methods class_label=Reinforcement_Learning class_label=Rule_Learning class_label=Theory\n", "role target target target target\n", " 0 0 0 0 0\n", " 1 0 0 0 0\n", " 2 0 1 0 0\n", " 3 0 1 0 0\n", " 4 0 1 0 0\n", " ... ... ... ... ...\n", "\n", "\n", "2708 rows x 8 columns\n", "type: getml.data.View" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_full = getml.data.make_target_columns(paper, \"class_label\")\n", "data_full" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n" ], "text/plain": [ " \n", " 0 train\n", " 1 test \n", " 2 train\n", " 3 test \n", " 4 test \n", " ... \n", "\n", "\n", "infinite number of rows\n", "type: StringColumnView" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "split = getml.data.split.random(train=0.7, test=0.3, validation=0.0)\n", "split" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "population\n", " subset name rows type\n", "0 test paper 821 View\n", "1 train paper 1887 View\n", "\n", "peripheral\n", " name rows type \n", "0 cites 5429 DataFrame\n", "1 content 49216 DataFrame\n", "2 paper 2708 DataFrame" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "container = getml.data.Container(population=data_full, split=split)\n", "container.add(cites=cites, content=content, paper=paper)\n", "container.freeze()\n", "container" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2. Predictive modeling\n", "\n", "We loaded the data and defined the roles and units. Next, we create a getML pipeline for relational learning." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.1 Define relational model\n", "\n", "To get started with relational learning, we need to specify the data model. Even though the data set itself is quite simple with only three tables and six columns in total, the resulting data model is actually quite complicated.\n", "\n", "That is because the class label can be predicting using three different pieces of information:\n", "\n", "- The keywords used by the paper\n", "- The keywords used by papers it cites and by papers that cite the paper\n", "- The class label of papers it cites and by papers that cite the paper\n", "\n", "The main challenge here is that `cites` is used twice, once to connect the _cited_ papers and then to connect the _citing_ papers. To resolve this, we need two placeholders on `cites`." ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
diagram
\n", "
contentpapercitescontentpapercitescontentpopulationpaper_id = citing_paper_idpaper_id = citing_paper_idRelationship: many-to-onepaper_id = cited_paper_idpaper_id = cited_paper_idRelationship: many-to-onecited_paper_id = paper_idciting_paper_id = paper_idpaper_id = paper_id
\n", "
\n", "\n", "
\n", "
staging
\n", " \n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
data frames staging table
0populationPOPULATION__STAGING_TABLE_1
1cites, paperCITES__STAGING_TABLE_2
2cites, paperCITES__STAGING_TABLE_3
3contentCONTENT__STAGING_TABLE_4
\n", "
\n", " " ], "text/plain": [ "population:\n", " columns:\n", " - class_label: categorical\n", " - paper_id: join_key\n", "\n", " joins:\n", " - right: 'cites'\n", " on: (population.paper_id, cites.cited_paper_id)\n", " relationship: 'many-to-many'\n", " lagged_targets: False\n", " - right: 'cites'\n", " on: (population.paper_id, cites.citing_paper_id)\n", " relationship: 'many-to-many'\n", " lagged_targets: False\n", " - right: 'content'\n", " on: (population.paper_id, content.paper_id)\n", " relationship: 'many-to-many'\n", " lagged_targets: False\n", "\n", "cites:\n", " columns:\n", " - cited_paper_id: join_key\n", " - citing_paper_id: join_key\n", "\n", " joins:\n", " - right: 'content'\n", " on: (cites.citing_paper_id, content.paper_id)\n", " relationship: 'many-to-many'\n", " lagged_targets: False\n", " - right: 'paper'\n", " on: (cites.citing_paper_id, paper.paper_id)\n", " relationship: 'many-to-one'\n", " lagged_targets: False\n", "\n", "content:\n", " columns:\n", " - word_cited_id: categorical\n", " - paper_id: join_key\n", "\n", "paper:\n", " columns:\n", " - class_label: categorical\n", " - paper_id: join_key\n", "\n", "cites:\n", " columns:\n", " - cited_paper_id: join_key\n", " - citing_paper_id: join_key\n", "\n", " joins:\n", " - right: 'content'\n", " on: (cites.cited_paper_id, content.paper_id)\n", " relationship: 'many-to-many'\n", " lagged_targets: False\n", " - right: 'paper'\n", " on: (cites.cited_paper_id, paper.paper_id)\n", " relationship: 'many-to-one'\n", " lagged_targets: False\n", "\n", "content:\n", " columns:\n", " - word_cited_id: categorical\n", " - paper_id: join_key\n", "\n", "paper:\n", " columns:\n", " - class_label: categorical\n", " - paper_id: join_key\n", "\n", "content:\n", " columns:\n", " - word_cited_id: categorical\n", " - paper_id: join_key" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dm = getml.data.DataModel(paper.to_placeholder(\"population\"))\n", "\n", "# We need 
two different placeholders for cites.\n", "dm.add(getml.data.to_placeholder(cites=[cites]*2, content=content, paper=paper))\n", "\n", "dm.population.join(\n", " dm.cites[0],\n", " on=('paper_id', 'cited_paper_id')\n", ")\n", "\n", "dm.cites[0].join(\n", " dm.content,\n", " on=('citing_paper_id', 'paper_id')\n", ")\n", "\n", "dm.cites[0].join(\n", " dm.paper,\n", " on=('citing_paper_id', 'paper_id'),\n", " relationship=getml.data.relationship.many_to_one\n", ")\n", "\n", "dm.population.join(\n", " dm.cites[1],\n", " on=('paper_id', 'citing_paper_id')\n", ")\n", "\n", "dm.cites[1].join(\n", " dm.content,\n", " on=('cited_paper_id', 'paper_id')\n", ")\n", "\n", "dm.cites[1].join(\n", " dm.paper,\n", " on=('cited_paper_id', 'paper_id'),\n", " relationship=getml.data.relationship.many_to_one\n", ")\n", "\n", "dm.population.join(\n", " dm.content,\n", " on='paper_id'\n", ")\n", "\n", "dm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.2 getML pipeline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "__Set up the feature learners & predictor__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use the FastProp and Relboost feature learners for this problem. Because of the large number of keywords, we regularize the Relboost model a bit by requiring a minimum support for the keywords (`min_num_samples`)." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "mapping = getml.preprocessors.Mapping()\n", "\n", "fast_prop = getml.feature_learning.FastProp(\n", " loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,\n", " num_threads=1\n", ")\n", "\n", "relboost = getml.feature_learning.Relboost(\n", " num_features=10,\n", " num_subfeatures=10,\n", " loss_function=getml.feature_learning.loss_functions.CrossEntropyLoss,\n", " seed=4367,\n", " num_threads=1,\n", " min_num_samples=30\n", ")\n", "\n", "predictor = getml.predictors.XGBoostClassifier()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Build the pipeline__" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Pipeline(data_model='population',\n",
       "         feature_learners=['FastProp'],\n",
       "         feature_selectors=[],\n",
       "         include_categorical=False,\n",
       "         loss_function='CrossEntropyLoss',\n",
       "         peripheral=['cites', 'content', 'paper'],\n",
       "         predictors=['XGBoostClassifier'],\n",
       "         preprocessors=['Mapping'],\n",
       "         share_selected_features=0.5,\n",
       "         tags=['fast_prop'])
" ], "text/plain": [ "Pipeline(data_model='population',\n", " feature_learners=['FastProp'],\n", " feature_selectors=[],\n", " include_categorical=False,\n", " loss_function='CrossEntropyLoss',\n", " peripheral=['cites', 'content', 'paper'],\n", " predictors=['XGBoostClassifier'],\n", " preprocessors=['Mapping'],\n", " share_selected_features=0.5,\n", " tags=['fast_prop'])" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe1 = getml.pipeline.Pipeline(\n", " tags=['fast_prop'],\n", " data_model=dm,\n", " preprocessors=[mapping],\n", " feature_learners=[fast_prop],\n", " predictors=[predictor]\n", ")\n", "\n", "pipe1" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Pipeline(data_model='population',\n",
       "         feature_learners=['Relboost'],\n",
       "         feature_selectors=[],\n",
       "         include_categorical=False,\n",
       "         loss_function='CrossEntropyLoss',\n",
       "         peripheral=['cites', 'content', 'paper'],\n",
       "         predictors=['XGBoostClassifier'],\n",
       "         preprocessors=[],\n",
       "         share_selected_features=0.5,\n",
       "         tags=['relboost'])
" ], "text/plain": [ "Pipeline(data_model='population',\n", " feature_learners=['Relboost'],\n", " feature_selectors=[],\n", " include_categorical=False,\n", " loss_function='CrossEntropyLoss',\n", " peripheral=['cites', 'content', 'paper'],\n", " predictors=['XGBoostClassifier'],\n", " preprocessors=[],\n", " share_selected_features=0.5,\n", " tags=['relboost'])" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe2 = getml.pipeline.Pipeline(\n", " tags=['relboost'],\n", " data_model=dm,\n", " feature_learners=[relboost],\n", " predictors=[predictor]\n", ")\n", "\n", "pipe2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.3 Model training" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Checking data model...\n",
       "
\n" ], "text/plain": [ "Checking data model\u001b[33m...\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2K Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K⠏ Preprocessing... 0% • 00:00" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2K Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[?25h" ] }, { "data": { "text/html": [ "
The pipeline check generated 3 issues labeled INFO and 0 issues labeled WARNING.\n",
       "
\n" ], "text/plain": [ "The pipeline check generated \u001b[1;36m3\u001b[0m issues labeled INFO and \u001b[1;36m0\u001b[0m issues labeled WARNING.\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
typelabel message
0INFOMIGHT TAKE LONGThe number of unique entries in column 'word_cited_id' in CONTENT__STAGING_TABLE_4 is 1432. This might take a long time to fit. You should consider setting its role to unused_string or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only').
1INFOFOREIGN KEYS NOT FOUNDWhen joining POPULATION__STAGING_TABLE_1 and CITES__STAGING_TABLE_2 over 'paper_id' and 'cited_paper_id', there are no corresponding entries for 41.759406% of entries in 'paper_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.
2INFOFOREIGN KEYS NOT FOUNDWhen joining POPULATION__STAGING_TABLE_1 and CITES__STAGING_TABLE_3 over 'paper_id' and 'citing_paper_id', there are no corresponding entries for 17.700053% of entries in 'paper_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.
" ], "text/plain": [ " type label message \n", "0 INFO MIGHT TAKE LONG The number of unique entries in ...\n", "1 INFO FOREIGN KEYS NOT FOUND When joining POPULATION__STAGING...\n", "2 INFO FOREIGN KEYS NOT FOUND When joining POPULATION__STAGING..." ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe1.check(container.train)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Checking data model...\n",
       "
\n" ], "text/plain": [ "Checking data model\u001b[33m...\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2K Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[?25h" ] }, { "data": { "text/html": [ "
The pipeline check generated 3 issues labeled INFO and 0 issues labeled WARNING.\n",
       "
\n" ], "text/plain": [ "The pipeline check generated \u001b[1;36m3\u001b[0m issues labeled INFO and \u001b[1;36m0\u001b[0m issues labeled WARNING.\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
To see the issues in full, run .check() on the pipeline.\n",
       "
\n" ], "text/plain": [ "To see the issues in full, run \u001b[1;35m.check\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m on the pipeline.\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2K Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K FastProp: Trying 3780 features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:06\n", "\u001b[2K FastProp: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K FastProp: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01\n", "\u001b[2K XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01\n", "\u001b[2K XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01\n", "\u001b[2K XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01\n", "\u001b[2K XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01\n", "\u001b[2K XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01\n", "\u001b[2K XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01\n", "\u001b[?25h" ] }, { "data": { "text/html": [ "
Trained pipeline.\n",
       "
\n" ], "text/plain": [ "Trained pipeline.\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Time taken: 0:00:16.670446.\n", "\n" ] }, { "data": { "text/html": [ "
Pipeline(data_model='population',\n",
       "         feature_learners=['FastProp'],\n",
       "         feature_selectors=[],\n",
       "         include_categorical=False,\n",
       "         loss_function='CrossEntropyLoss',\n",
       "         peripheral=['cites', 'content', 'paper'],\n",
       "         predictors=['XGBoostClassifier'],\n",
       "         preprocessors=['Mapping'],\n",
       "         share_selected_features=0.5,\n",
       "         tags=['fast_prop', 'container-BRPpU2'])
" ], "text/plain": [ "Pipeline(data_model='population',\n", " feature_learners=['FastProp'],\n", " feature_selectors=[],\n", " include_categorical=False,\n", " loss_function='CrossEntropyLoss',\n", " peripheral=['cites', 'content', 'paper'],\n", " predictors=['XGBoostClassifier'],\n", " preprocessors=['Mapping'],\n", " share_selected_features=0.5,\n", " tags=['fast_prop', 'container-BRPpU2'])" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe1.fit(container.train)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Checking data model...\n",
       "
\n" ], "text/plain": [ "Checking data model\u001b[33m...\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2K Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K⠧ Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 90% • 00:01" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2K Checking... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[?25h" ] }, { "data": { "text/html": [ "
The pipeline check generated 3 issues labeled INFO and 0 issues labeled WARNING.\n",
       "
\n" ], "text/plain": [ "The pipeline check generated \u001b[1;36m3\u001b[0m issues labeled INFO and \u001b[1;36m0\u001b[0m issues labeled WARNING.\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
typelabel message
0INFOMIGHT TAKE LONGThe number of unique entries in column 'word_cited_id' in CONTENT__STAGING_TABLE_4 is 1432. This might take a long time to fit. You should consider setting its role to unused_string or using it for comparison only (you can do the latter by setting a unit that contains 'comparison only').
1INFOFOREIGN KEYS NOT FOUNDWhen joining POPULATION__STAGING_TABLE_1 and CITES__STAGING_TABLE_2 over 'paper_id' and 'cited_paper_id', there are no corresponding entries for 41.759406% of entries in 'paper_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.
2INFOFOREIGN KEYS NOT FOUNDWhen joining POPULATION__STAGING_TABLE_1 and CITES__STAGING_TABLE_3 over 'paper_id' and 'citing_paper_id', there are no corresponding entries for 17.700053% of entries in 'paper_id' in 'POPULATION__STAGING_TABLE_1'. You might want to double-check your join keys.
" ], "text/plain": [ " type label message \n", "0 INFO MIGHT TAKE LONG The number of unique entries in ...\n", "1 INFO FOREIGN KEYS NOT FOUND When joining POPULATION__STAGING...\n", "2 INFO FOREIGN KEYS NOT FOUND When joining POPULATION__STAGING..." ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe2.check(container.train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The training process seems a bit intimidating. That is because the relboost algorithms needs to train separate models for each class label. This is due to the nature of the generated features." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Checking data model...\n",
       "
\n" ], "text/plain": [ "Checking data model\u001b[33m...\u001b[0m\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2K Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[?25h" ] }, { "data": { "text/html": [ "
The pipeline check generated 3 issues labeled INFO and 0 issues labeled WARNING.\n",
       "
\n" ], "text/plain": [ "The pipeline check generated \u001b[1;36m3\u001b[0m issues labeled INFO and \u001b[1;36m0\u001b[0m issues labeled WARNING.\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/html": [ "
To see the issues in full, run .check() on the pipeline.\n",
       "
\n" ], "text/plain": [ "To see the issues in full, run \u001b[1;35m.check\u001b[0m\u001b[1m(\u001b[0m\u001b[1m)\u001b[0m on the pipeline.\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2K Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01\n", "\u001b[2K Relboost: Training subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training features... 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training subfeatures... 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:01\n", "\u001b[2K Relboost: Training subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Training features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K XGBoost: Training as predictor... 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K XGBoost: Training as predictor... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[?25h" ] }, { "data": { "text/html": [ "
Trained pipeline.\n",
       "
\n" ], "text/plain": [ "Trained pipeline.\n" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Time taken: 0:00:39.645936.\n", "\n" ] }, { "data": { "text/html": [ "
Pipeline(data_model='population',\n",
       "         feature_learners=['Relboost'],\n",
       "         feature_selectors=[],\n",
       "         include_categorical=False,\n",
       "         loss_function='CrossEntropyLoss',\n",
       "         peripheral=['cites', 'content', 'paper'],\n",
       "         predictors=['XGBoostClassifier'],\n",
       "         preprocessors=[],\n",
       "         share_selected_features=0.5,\n",
       "         tags=['relboost', 'container-BRPpU2'])
" ], "text/plain": [ "Pipeline(data_model='population',\n", " feature_learners=['Relboost'],\n", " feature_selectors=[],\n", " include_categorical=False,\n", " loss_function='CrossEntropyLoss',\n", " peripheral=['cites', 'content', 'paper'],\n", " predictors=['XGBoostClassifier'],\n", " preprocessors=[],\n", " share_selected_features=0.5,\n", " tags=['relboost', 'container-BRPpU2'])" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe2.fit(container.train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.4 Model evaluation" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "lines_to_next_cell": 0 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2K Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K FastProp: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K FastProp: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K FastProp: Building features... 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[?25h" ] }, { "data": { "text/html": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
date time set usedtarget accuracy auccross entropy
02024-09-12 13:05:07trainclass_label=Case_Based0.99790.99990.02323
12024-09-12 13:05:07trainclass_label=Genetic_Algorithms1.01.0.004862
22024-09-12 13:05:07trainclass_label=Neural_Networks0.98460.99830.065852
32024-09-12 13:05:07trainclass_label=Probabilistic_Methods0.99580.99980.027649
42024-09-12 13:05:07trainclass_label=Reinforcement_Learning0.99951.0.008878
..................
92024-09-12 13:05:48testclass_label=Neural_Networks0.95130.97870.163577
102024-09-12 13:05:48testclass_label=Probabilistic_Methods0.97440.98730.082802
112024-09-12 13:05:48testclass_label=Reinforcement_Learning0.98050.97360.073926
122024-09-12 13:05:48testclass_label=Rule_Learning0.98420.99370.052303
132024-09-12 13:05:48testclass_label=Theory0.95620.9770.128617
" ], "text/plain": [ " date time set used target accuracy auc cross entropy\n", " 0 2024-09-12 13:05:07 train class_label=Case_Based 0.9979 0.9999 0.02323 \n", " 1 2024-09-12 13:05:07 train class_label=Genetic_Algorithms 1.0 1. 0.004862\n", " 2 2024-09-12 13:05:07 train class_label=Neural_Networks 0.9846 0.9983 0.065852\n", " 3 2024-09-12 13:05:07 train class_label=Probabilistic_Method... 0.9958 0.9998 0.027649\n", " 4 2024-09-12 13:05:07 train class_label=Reinforcement_Learni... 0.9995 1. 0.008878\n", " ... ... ... ... ... ...\n", " 9 2024-09-12 13:05:48 test class_label=Neural_Networks 0.9513 0.9787 0.163577\n", "10 2024-09-12 13:05:48 test class_label=Probabilistic_Method... 0.9744 0.9873 0.082802\n", "11 2024-09-12 13:05:48 test class_label=Reinforcement_Learni... 0.9805 0.9736 0.073926\n", "12 2024-09-12 13:05:48 test class_label=Rule_Learning 0.9842 0.9937 0.052303\n", "13 2024-09-12 13:05:48 test class_label=Theory 0.9562 0.977 0.128617" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe1.score(container.test)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2K Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[?25h" ] }, { "data": { "text/html": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", "
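<table>\n",
"<thead><tr><th></th><th>date time</th><th>set used</th><th>target</th><th>accuracy</th><th>auc</th><th>cross entropy</th></tr></thead>\n",
"<tbody>\n",
"<tr><td>0</td><td>2024-09-12 13:05:47</td><td>train</td><td>class_label=Case_Based</td><td>1.0</td><td>1.</td><td>0.009385</td></tr>\n",
"<tr><td>1</td><td>2024-09-12 13:05:47</td><td>train</td><td>class_label=Genetic_Algorithms</td><td>1.0</td><td>1.</td><td>0.004222</td></tr>\n",
"<tr><td>2</td><td>2024-09-12 13:05:47</td><td>train</td><td>class_label=Neural_Networks</td><td>0.991</td><td>0.9996</td><td>0.03766</td></tr>\n",
"<tr><td>3</td><td>2024-09-12 13:05:47</td><td>train</td><td>class_label=Probabilistic_Methods</td><td>0.9989</td><td>1.</td><td>0.013846</td></tr>\n",
"<tr><td>4</td><td>2024-09-12 13:05:47</td><td>train</td><td>class_label=Reinforcement_Learning</td><td>1.0</td><td>1.</td><td>0.004409</td></tr>\n",
"<tr><td></td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"<tr><td>9</td><td>2024-09-12 13:05:51</td><td>test</td><td>class_label=Neural_Networks</td><td>0.9391</td><td>0.9757</td><td>0.193588</td></tr>\n",
"<tr><td>10</td><td>2024-09-12 13:05:51</td><td>test</td><td>class_label=Probabilistic_Methods</td><td>0.9769</td><td>0.9892</td><td>0.072601</td></tr>\n",
"<tr><td>11</td><td>2024-09-12 13:05:51</td><td>test</td><td>class_label=Reinforcement_Learning</td><td>0.9769</td><td>0.9773</td><td>0.094757</td></tr>\n",
"<tr><td>12</td><td>2024-09-12 13:05:51</td><td>test</td><td>class_label=Rule_Learning</td><td>0.9842</td><td>0.9912</td><td>0.060603</td></tr>\n",
"<tr><td>13</td><td>2024-09-12 13:05:51</td><td>test</td><td>class_label=Theory</td><td>0.9488</td><td>0.9745</td><td>0.142024</td></tr>\n",
"</tbody>\n",
"</table>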
" ], "text/plain": [ " date time set used target accuracy auc cross entropy\n", " 0 2024-09-12 13:05:47 train class_label=Case_Based 1.0 1. 0.009385\n", " 1 2024-09-12 13:05:47 train class_label=Genetic_Algorithms 1.0 1. 0.004222\n", " 2 2024-09-12 13:05:47 train class_label=Neural_Networks 0.991 0.9996 0.03766 \n", " 3 2024-09-12 13:05:47 train class_label=Probabilistic_Method... 0.9989 1. 0.013846\n", " 4 2024-09-12 13:05:47 train class_label=Reinforcement_Learni... 1.0 1. 0.004409\n", " ... ... ... ... ... ...\n", " 9 2024-09-12 13:05:51 test class_label=Neural_Networks 0.9391 0.9757 0.193588\n", "10 2024-09-12 13:05:51 test class_label=Probabilistic_Method... 0.9769 0.9892 0.072601\n", "11 2024-09-12 13:05:51 test class_label=Reinforcement_Learni... 0.9769 0.9773 0.094757\n", "12 2024-09-12 13:05:51 test class_label=Rule_Learning 0.9842 0.9912 0.060603\n", "13 2024-09-12 13:05:51 test class_label=Theory 0.9488 0.9745 0.142024" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe2.score(container.test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make things a bit easier, we just look at our test results." 
] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
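<table>\n",
"<thead><tr><th></th><th>date time</th><th>set used</th><th>target</th><th>accuracy</th><th>auc</th><th>cross entropy</th></tr></thead>\n",
"<tbody>\n",
"<tr><td>0</td><td>2024-09-12 13:05:48</td><td>test</td><td>class_label=Case_Based</td><td>0.9708</td><td>0.9861</td><td>0.08689</td></tr>\n",
"<tr><td>1</td><td>2024-09-12 13:05:48</td><td>test</td><td>class_label=Genetic_Algorithms</td><td>0.9842</td><td>0.9981</td><td>0.04915</td></tr>\n",
"<tr><td>2</td><td>2024-09-12 13:05:48</td><td>test</td><td>class_label=Neural_Networks</td><td>0.9513</td><td>0.9787</td><td>0.16358</td></tr>\n",
"<tr><td>3</td><td>2024-09-12 13:05:48</td><td>test</td><td>class_label=Probabilistic_Methods</td><td>0.9744</td><td>0.9873</td><td>0.0828</td></tr>\n",
"<tr><td>4</td><td>2024-09-12 13:05:48</td><td>test</td><td>class_label=Reinforcement_Learning</td><td>0.9805</td><td>0.9736</td><td>0.07393</td></tr>\n",
"<tr><td>5</td><td>2024-09-12 13:05:48</td><td>test</td><td>class_label=Rule_Learning</td><td>0.9842</td><td>0.9937</td><td>0.0523</td></tr>\n",
"<tr><td>6</td><td>2024-09-12 13:05:48</td><td>test</td><td>class_label=Theory</td><td>0.9562</td><td>0.977</td><td>0.12862</td></tr>\n",
"</tbody>\n",
"</table>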
" ], "text/plain": [ " date time set used target accuracy auc cross entropy\n", "0 2024-09-12 13:05:48 test class_label=Case_Based 0.9708 0.9861 0.08689\n", "1 2024-09-12 13:05:48 test class_label=Genetic_Algorithms 0.9842 0.9981 0.04915\n", "2 2024-09-12 13:05:48 test class_label=Neural_Networks 0.9513 0.9787 0.16358\n", "3 2024-09-12 13:05:48 test class_label=Probabilistic_Method... 0.9744 0.9873 0.0828 \n", "4 2024-09-12 13:05:48 test class_label=Reinforcement_Learni... 0.9805 0.9736 0.07393\n", "5 2024-09-12 13:05:48 test class_label=Rule_Learning 0.9842 0.9937 0.0523 \n", "6 2024-09-12 13:05:48 test class_label=Theory 0.9562 0.977 0.12862" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe1.scores.filter(lambda score: score.set_used == \"test\")" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", 
" \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
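<table>\n",
"<thead><tr><th></th><th>date time</th><th>set used</th><th>target</th><th>accuracy</th><th>auc</th><th>cross entropy</th></tr></thead>\n",
"<tbody>\n",
"<tr><td>0</td><td>2024-09-12 13:05:51</td><td>test</td><td>class_label=Case_Based</td><td>0.9744</td><td>0.9895</td><td>0.08319</td></tr>\n",
"<tr><td>1</td><td>2024-09-12 13:05:51</td><td>test</td><td>class_label=Genetic_Algorithms</td><td>0.9903</td><td>0.9988</td><td>0.03866</td></tr>\n",
"<tr><td>2</td><td>2024-09-12 13:05:51</td><td>test</td><td>class_label=Neural_Networks</td><td>0.9391</td><td>0.9757</td><td>0.19359</td></tr>\n",
"<tr><td>3</td><td>2024-09-12 13:05:51</td><td>test</td><td>class_label=Probabilistic_Methods</td><td>0.9769</td><td>0.9892</td><td>0.0726</td></tr>\n",
"<tr><td>4</td><td>2024-09-12 13:05:51</td><td>test</td><td>class_label=Reinforcement_Learning</td><td>0.9769</td><td>0.9773</td><td>0.09476</td></tr>\n",
"<tr><td>5</td><td>2024-09-12 13:05:51</td><td>test</td><td>class_label=Rule_Learning</td><td>0.9842</td><td>0.9912</td><td>0.0606</td></tr>\n",
"<tr><td>6</td><td>2024-09-12 13:05:51</td><td>test</td><td>class_label=Theory</td><td>0.9488</td><td>0.9745</td><td>0.14202</td></tr>\n",
"</tbody>\n",
"</table>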
" ], "text/plain": [ " date time set used target accuracy auc cross entropy\n", "0 2024-09-12 13:05:51 test class_label=Case_Based 0.9744 0.9895 0.08319\n", "1 2024-09-12 13:05:51 test class_label=Genetic_Algorithms 0.9903 0.9988 0.03866\n", "2 2024-09-12 13:05:51 test class_label=Neural_Networks 0.9391 0.9757 0.19359\n", "3 2024-09-12 13:05:51 test class_label=Probabilistic_Method... 0.9769 0.9892 0.0726 \n", "4 2024-09-12 13:05:51 test class_label=Reinforcement_Learni... 0.9769 0.9773 0.09476\n", "5 2024-09-12 13:05:51 test class_label=Rule_Learning 0.9842 0.9912 0.0606 \n", "6 2024-09-12 13:05:51 test class_label=Theory 0.9488 0.9745 0.14202" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe2.scores.filter(lambda score: score.set_used == \"test\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We take the average of the AUC values over the seven class labels, which is also the value that appears in the getML monitor (http://localhost:1709/#/listpipelines/cora)." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.9849300280909873\n", "0.9851644641057435\n" ] } ], "source": [ "fastprop_auc = np.mean(pipe1.auc)\n", "relboost_auc = np.mean(pipe2.auc)\n", "print(fastprop_auc)\n", "print(relboost_auc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The accuracy for multiple targets can be calculated using one of two methods. The first method is to simply take the average of the per-target (one-vs-rest) accuracy values, which is also the value that appears in the getML monitor (http://localhost:1709/#/listpipelines/cora). Because each binary target also counts every paper that is correctly not assigned to that class, this average tends to be higher than the share of papers whose class is predicted correctly." 
] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.9716373760222724\n", "0.9700713415695145\n" ] } ], "source": [ "print(np.mean(pipe1.accuracy))\n", "print(np.mean(pipe2.accuracy))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "However, the benchmarking papers actually use a different approach: \n", "\n", "- They first generate probabilities for each of the labels:" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[2K Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K FastProp: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K FastProp: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K FastProp: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Staging... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Preprocessing... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... 
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building subfeatures... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[2K Relboost: Building features... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% • 00:00\n", "\u001b[?25h" ] } ], "source": [ "probabilities1 = pipe1.predict(container.test)\n", "probabilities2 = pipe2.predict(container.test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- They then find the class label with the highest probability:" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "class_label = paper.class_label.unique()\n", "\n", "ix_max = np.argmax(probabilities1, axis=1)\n", "predicted_labels1 = np.asarray([class_label[ix] for ix in ix_max])\n", "\n", "ix_max = np.argmax(probabilities2, axis=1)\n", "predicted_labels2 = np.asarray([class_label[ix] for ix in ix_max])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- They then compare that value to the actual class label:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Share of accurately predicted class labels (pipe1):\n", "0.9001218026796589\n", "\n", "Share of accurately predicted class labels (pipe2):\n", "0.8964677222898904\n", "\n" ] } ], "source": [ "actual_labels = paper[split == \"test\"].class_label.to_numpy()\n", "fastprop_accuracy = (actual_labels == predicted_labels1).sum() / len(actual_labels)\n", "relboost_accuracy = (actual_labels == predicted_labels2).sum() / len(actual_labels)\n", "\n", "print(\"Share of accurately predicted 
class labels (pipe1):\")\n", "print(fastprop_accuracy)\n", "print()\n", "print(\"Share of accurately predicted class labels (pipe2):\")\n", "print(relboost_accuracy)\n", "print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since this is the method the benchmark papers use, this is the accuracy score we will report as well." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.5 Studying features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Feature correlations__\n", "\n", "We want to analyze how the features are correlated with the target variables." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "TARGET_NUM = 0" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "names, correlations = pipe2.features.correlations(target_num=TARGET_NUM)\n", "\n", "plt.subplots(figsize=(20, 10))\n", "\n", "plt.bar(names, correlations)\n", "\n", "plt.title('Feature correlations with class label ' + class_label[TARGET_NUM])\n", "plt.xlabel('Features')\n", "plt.ylabel('Correlations')\n", "plt.xticks(rotation='vertical')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Feature importances__\n", " \n", "Feature importances are calculated by analyzing the improvement in predictive accuracy on each node of the trees in the XGBoost predictor. They are then normalized, so that all importances add up to 100%." ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "lines_to_next_cell": 0 }, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "names, importances = pipe2.features.importances(target_num=TARGET_NUM)\n", "\n", "plt.subplots(figsize=(20, 10))\n", "\n", "plt.bar(names, importances)\n", "\n", "plt.title('Feature importances for class label ' + class_label[TARGET_NUM])\n", "plt.xlabel('Features')\n", "plt.ylabel('Importances')\n", "plt.xticks(rotation='vertical')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "__Column importances__\n", "\n", "Because getML uses relational learning, we can apply the principles we used to calculate the feature importances to individual columns as well." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "names, importances = pipe2.columns.importances(target_num=TARGET_NUM)\n", "\n", "plt.subplots(figsize=(20, 10))\n", "\n", "plt.bar(names, importances)\n", "\n", "plt.title('Columns importances for class label ' + class_label[TARGET_NUM])\n", "plt.xlabel('Columns')\n", "plt.ylabel('Importances')\n", "plt.xticks(rotation='vertical')\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most important features look as follows:" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "```sql\n", "DROP TABLE IF EXISTS \"FEATURE_1_51\";\n", "\n", "CREATE TABLE \"FEATURE_1_51\" AS\n", "SELECT AVG( t2.\"t4__class_label__mapping_2_target_3_avg\" ) AS \"feature_1_51\",\n", " t1.rowid AS rownum\n", "FROM \"POPULATION__STAGING_TABLE_1\" t1\n", "INNER JOIN \"CITES__STAGING_TABLE_3\" t2\n", "ON t1.\"paper_id\" = t2.\"citing_paper_id\"\n", "GROUP BY t1.rowid;\n", "```" ], "text/plain": [ "'DROP TABLE IF EXISTS \"FEATURE_1_51\";\\n\\nCREATE TABLE \"FEATURE_1_51\" AS\\nSELECT AVG( t2.\"t4__class_label__mapping_2_target_3_avg\" ) AS \"feature_1_51\",\\n t1.rowid AS rownum\\nFROM \"POPULATION__STAGING_TABLE_1\" t1\\nINNER JOIN \"CITES__STAGING_TABLE_3\" t2\\nON t1.\"paper_id\" = t2.\"citing_paper_id\"\\nGROUP BY t1.rowid;'" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe1.features.to_sql()[pipe1.features.sort(by=\"importances\")[0].name]" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/markdown": [ "```sql\n", "DROP TABLE IF EXISTS \"FEATURE_6_1\";\n", "\n", "CREATE TABLE \"FEATURE_6_1\" AS\n", "SELECT AVG( \n", " CASE\n", " WHEN ( f_6_2.\"feature_6_2_2\" > 3.142350 ) AND ( f_6_2.\"feature_6_2_11\" > 5.516436 ) AND ( f_6_2.\"feature_6_2_1\" > 15.824859 ) THEN 19.11484833926196\n", " WHEN ( f_6_2.\"feature_6_2_2\" > 3.142350 ) AND ( 
f_6_2.\"feature_6_2_11\" > 5.516436 ) AND ( f_6_2.\"feature_6_2_1\" <= 15.824859 ) THEN 16.25336464210706\n", " WHEN ( f_6_2.\"feature_6_2_2\" > 3.142350 ) AND ( f_6_2.\"feature_6_2_11\" <= 5.516436 ) AND ( f_6_2.\"feature_6_2_20\" > 0.747603 ) THEN 13.8749941754607\n", " WHEN ( f_6_2.\"feature_6_2_2\" > 3.142350 ) AND ( f_6_2.\"feature_6_2_11\" <= 5.516436 ) AND ( f_6_2.\"feature_6_2_20\" <= 0.747603 ) THEN 8.209072454235654\n", " WHEN ( f_6_2.\"feature_6_2_2\" <= 3.142350 ) AND ( f_6_2.\"feature_6_2_13\" > 0.575234 ) THEN 5.856092769106291\n", " WHEN ( f_6_2.\"feature_6_2_2\" <= 3.142350 ) AND ( f_6_2.\"feature_6_2_13\" <= 0.575234 ) AND ( f_6_2.\"feature_6_2_4\" > 1.058131 ) THEN -2.241272133429655\n", " WHEN ( f_6_2.\"feature_6_2_2\" <= 3.142350 ) AND ( f_6_2.\"feature_6_2_13\" <= 0.575234 ) AND ( f_6_2.\"feature_6_2_4\" <= 1.058131 ) THEN -0.6025668375656026\n", " ELSE NULL\n", " END\n", ") AS \"feature_6_1\",\n", " t1.rowid AS rownum\n", "FROM \"POPULATION__STAGING_TABLE_1\" t1\n", "INNER JOIN \"CITES__STAGING_TABLE_3\" t2\n", "ON t1.\"paper_id\" = t2.\"citing_paper_id\"\n", "LEFT JOIN \"FEATURES_6_2\" f_6_2\n", "ON t2.rowid = f_6_2.\"rownum\"\n", "GROUP BY t1.rowid;\n", "```" ], "text/plain": [ "'DROP TABLE IF EXISTS \"FEATURE_6_1\";\\n\\nCREATE TABLE \"FEATURE_6_1\" AS\\nSELECT AVG( \\n CASE\\n WHEN ( f_6_2.\"feature_6_2_2\" > 3.142350 ) AND ( f_6_2.\"feature_6_2_11\" > 5.516436 ) AND ( f_6_2.\"feature_6_2_1\" > 15.824859 ) THEN 19.11484833926196\\n WHEN ( f_6_2.\"feature_6_2_2\" > 3.142350 ) AND ( f_6_2.\"feature_6_2_11\" > 5.516436 ) AND ( f_6_2.\"feature_6_2_1\" <= 15.824859 ) THEN 16.25336464210706\\n WHEN ( f_6_2.\"feature_6_2_2\" > 3.142350 ) AND ( f_6_2.\"feature_6_2_11\" <= 5.516436 ) AND ( f_6_2.\"feature_6_2_20\" > 0.747603 ) THEN 13.8749941754607\\n WHEN ( f_6_2.\"feature_6_2_2\" > 3.142350 ) AND ( f_6_2.\"feature_6_2_11\" <= 5.516436 ) AND ( f_6_2.\"feature_6_2_20\" <= 0.747603 ) THEN 8.209072454235654\\n WHEN ( f_6_2.\"feature_6_2_2\" <= 
3.142350 ) AND ( f_6_2.\"feature_6_2_13\" > 0.575234 ) THEN 5.856092769106291\\n WHEN ( f_6_2.\"feature_6_2_2\" <= 3.142350 ) AND ( f_6_2.\"feature_6_2_13\" <= 0.575234 ) AND ( f_6_2.\"feature_6_2_4\" > 1.058131 ) THEN -2.241272133429655\\n WHEN ( f_6_2.\"feature_6_2_2\" <= 3.142350 ) AND ( f_6_2.\"feature_6_2_13\" <= 0.575234 ) AND ( f_6_2.\"feature_6_2_4\" <= 1.058131 ) THEN -0.6025668375656026\\n ELSE NULL\\n END\\n) AS \"feature_6_1\",\\n t1.rowid AS rownum\\nFROM \"POPULATION__STAGING_TABLE_1\" t1\\nINNER JOIN \"CITES__STAGING_TABLE_3\" t2\\nON t1.\"paper_id\" = t2.\"citing_paper_id\"\\nLEFT JOIN \"FEATURES_6_2\" f_6_2\\nON t2.rowid = f_6_2.\"rownum\"\\nGROUP BY t1.rowid;'" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe2.features.to_sql()[pipe2.features.sort(by=\"importances\")[0].name]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.6 Productionization\n", "\n", "It is possible to productionize the pipeline by transpiling the features into production-ready SQL code. Please also refer to getML's `sqlite3` and `spark` modules." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Creates a folder containing the SQL code.\n", "pipe1.features.to_sql().save(\"cora_pipeline\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Creates a folder containing the same features transpiled to Spark SQL.\n", "pipe1.features.to_sql(dialect=getml.pipeline.dialect.spark_sql).save(\"cora_spark\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### 2.7 Benchmarks" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "State-of-the-art approaches on this data set perform as follows:\n", "\n", "| Approach | Study | Accuracy | AUC |\n", "| :-------------------------- | :------------------------ | -----------: | ------: |\n", "| RelF | Dinh et al. (2012) | 85.7% | -- |\n", "| LBP | Dinh et al. (2012) | 85.0% | -- |\n", "| EPRN | Preisach and Schmidt-Thieme (2006) | 84.0% | -- |\n", "| PRN | Preisach and Schmidt-Thieme (2006) | 81.0% | -- |\n", "| ACORA | Perlich and Provost (2006) | -- | 97.0% |\n", "\n", "\n", "As we can see, both FastProp and Relboost, as used in this notebook, compare favorably to these benchmarks: they reach test accuracies of 90.0% and 89.6%, respectively, and an average AUC of 98.5% each." ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
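<table>\n",
"<thead><tr><th></th><th>Approach</th><th>Accuracy</th><th>AUC</th></tr></thead>\n",
"<tbody>\n",
"<tr><td>0</td><td>FastProp</td><td>90.0%</td><td>98.5%</td></tr>\n",
"<tr><td>1</td><td>Relboost</td><td>89.6%</td><td>98.5%</td></tr>\n",
"</tbody>\n",
"</table>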
\n", "
" ], "text/plain": [ " Approach Accuracy AUC\n", "0 FastProp 90.0% 98.5%\n", "1 Relboost 89.6% 98.5%" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(data={\n", " 'Approach': ['FastProp', 'Relboost'],\n", " 'Accuracy': [f'{score:.1%}' for score in [fastprop_accuracy, relboost_accuracy]],\n", " 'AUC': [f'{score:.1%}' for score in [fastprop_auc, relboost_auc]]\n", "})" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "getml.engine.shutdown()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 3. Conclusion\n", "\n", "In this notebook we have demonstrated that getML outperforms state-of-the-art relational learning algorithms on the CORA dataset: FastProp and Relboost reach test accuracies of 90.0% and 89.6%, compared with 85.7% for the best published benchmark (RelF), and an average AUC of 98.5%, compared with 97.0% for ACORA." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## References\n", "\n", "Dinh, Quang-Thang, Christel Vrain, and Matthieu Exbrayat. \"A Link-Based Method for Propositionalization.\" ILP (Late Breaking Papers). 2012.\n", "\n", "Motl, Jan, and Oliver Schulte. \"The CTU Prague relational learning repository.\" arXiv preprint arXiv:1511.03086 (2015).\n", "\n", "Perlich, Claudia, and Foster Provost. \"Distribution-based aggregation for relational learning with identifier attributes.\" Machine Learning 62.1-2 (2006): 65-105.\n", "\n", "Preisach, Christine, and Lars Schmidt-Thieme. \"Relational ensemble classification.\" Sixth International Conference on Data Mining (ICDM'06). IEEE, 2006." 
] } ], "metadata": { "jupytext": { "encoding": "# -*- coding: utf-8 -*-", "formats": "ipynb,py:percent,md" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.4" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": false, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": true }, "vscode": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 4 }