{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>\n",
    "<img src=\"../../img/ods_stickers.jpg\">\n",
    "    \n",
    "## [mlcourse.ai](https://mlcourse.ai) - Open Machine Learning Course\n",
    "\n",
    "Authors: [Maria Sumarokova](https://www.linkedin.com/in/mariya-sumarokova-230b4054/), and [Yury Kashnitsky](https://www.linkedin.com/in/festline/). Translated and edited by Gleb Filatov, Aleksey Kiselev, [Anastasia Manokhina](https://www.linkedin.com/in/anastasiamanokhina/), [Egor Polusmak](https://www.linkedin.com/in/egor-polusmak/), and [Yuanyuan Pao](https://www.linkedin.com/in/yuanyuanpao/). All content is distributed under the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "# <center> Assignment #3 (demo). Solution\n",
    "## <center>  Decision trees with a toy task and the UCI Adult dataset \n",
    "\n",
    "Same assignment as a [Kaggle Kernel](https://www.kaggle.com/kashnitsky/a3-demo-decision-trees) + [solution](https://www.kaggle.com/kashnitsky/a3-demo-decision-trees-solution). Fill in the answers in the [web-form](https://docs.google.com/forms/d/1wfWYYoqXTkZNOPy1wpewACXaj2MZjBdLOL58htGWYBA/edit)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's start by loading all necessary libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "from matplotlib import pyplot as plt\n",
    "\n",
    "plt.rcParams[\"figure.figsize\"] = (10, 8)\n",
    "\n",
    "import collections\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.metrics import accuracy_score\n",
    "from sklearn.model_selection import GridSearchCV, cross_val_score\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "from sklearn.tree import DecisionTreeClassifier, plot_tree"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part 1. Toy dataset \"Will They? Won't They?\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Your goal is to figure out how decision trees work by walking through a toy problem. While a single decision tree does not yield outstanding results, other performant algorithms like gradient boosting and random forests are based on the same idea. That is why knowing how decision trees work might be useful."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll go through a toy example of binary classification - Person A is deciding whether they will go on a second date with Person B. It will depend on their looks, eloquence, alcohol consumption (only for example), and how much money was spent on the first date."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Creating the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create dataframe with dummy variables\n",
    "def create_df(dic, feature_list):\n",
    "    out = pd.DataFrame(dic)\n",
    "    out = pd.concat([out, pd.get_dummies(out[feature_list])], axis=1)\n",
    "    out.drop(feature_list, axis=1, inplace=True)\n",
    "    return out\n",
    "\n",
    "\n",
    "# Some feature values are present in train and absent in test and vice-versa.\n",
    "def intersect_features(train, test):\n",
    "    common_feat = list(set(train.keys()) & set(test.keys()))\n",
    "    return train[common_feat], test[common_feat]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "features = [\"Looks\", \"Alcoholic_beverage\", \"Eloquence\", \"Money_spent\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Training data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_train = {}\n",
    "df_train[\"Looks\"] = [\n",
    "    \"handsome\",\n",
    "    \"handsome\",\n",
    "    \"handsome\",\n",
    "    \"repulsive\",\n",
    "    \"repulsive\",\n",
    "    \"repulsive\",\n",
    "    \"handsome\",\n",
    "]\n",
    "df_train[\"Alcoholic_beverage\"] = [\"yes\", \"yes\", \"no\", \"no\", \"yes\", \"yes\", \"yes\"]\n",
    "df_train[\"Eloquence\"] = [\"high\", \"low\", \"average\", \"average\", \"low\", \"high\", \"average\"]\n",
    "df_train[\"Money_spent\"] = [\"lots\", \"little\", \"lots\", \"little\", \"lots\", \"lots\", \"lots\"]\n",
    "df_train[\"Will_go\"] = LabelEncoder().fit_transform([\"+\", \"-\", \"+\", \"-\", \"-\", \"+\", \"+\"])\n",
    "\n",
    "df_train = create_df(df_train, features)\n",
    "df_train"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Test data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_test = {}\n",
    "df_test[\"Looks\"] = [\"handsome\", \"handsome\", \"repulsive\"]\n",
    "df_test[\"Alcoholic_beverage\"] = [\"no\", \"yes\", \"yes\"]\n",
    "df_test[\"Eloquence\"] = [\"average\", \"high\", \"average\"]\n",
    "df_test[\"Money_spent\"] = [\"lots\", \"little\", \"lots\"]\n",
    "df_test = create_df(df_test, features)\n",
    "df_test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Some feature values are present in train and absent in test and vice-versa.\n",
    "y = df_train[\"Will_go\"]\n",
    "df_train, df_test = intersect_features(train=df_train, test=df_test)\n",
    "df_train"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_test"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Draw a decision tree (by hand or in any graphics editor) for this dataset. Optionally you can also implement tree construction and draw it here."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1\\. What is the entropy $S_0$ of the initial system? By system states, we mean values of the binary feature \"Will_go\" - 0 or 1 - two states in total."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font color='red'>Answer: </font>  $S_0 = -\\frac{3}{7}\\log_2{\\frac{3}{7}}-\\frac{4}{7}\\log_2{\\frac{4}{7}} = 0.985$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "2\\. Let's split the data by the feature \"Looks_handsome\". What is the entropy $S_1$ of the left group - the one with \"Looks_handsome\". What is the entropy $S_2$ in the opposite group? What is the information gain (IG) if we consider such a split?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font color='red'>Answer: </font> $S_1 = -\\frac{1}{4}\\log_2{\\frac{1}{4}}-\\frac{3}{4}\\log_2{\\frac{3}{4}} = 0.811$, $S_2 = -\\frac{2}{3}\\log_2{\\frac{2}{3}}-\\frac{1}{3}\\log_2{\\frac{1}{3}} = 0.918$, $IG = S_0-\\frac{4}{7}S_1-\\frac{3}{7}S_2 = 0.128$."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Train a decision tree using sklearn on the training data. You may choose any depth for the tree."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dt = DecisionTreeClassifier(criterion=\"entropy\", random_state=17)\n",
    "dt.fit(df_train, y);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Additional: display the resulting tree using graphviz."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plot_tree(\n",
    "    dt, feature_names=df_train.columns, filled=True, class_names=[\"Won't go\", \"Will go\"]\n",
    ");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part 2. Functions for calculating entropy and information gain."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Consider the following warm-up example: we have 9 blue balls and 11 yellow balls. Let ball have label **1** if it is blue, **0** otherwise."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "balls = [1 for i in range(9)] + [0 for i in range(11)]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src = '../../img/decision_tree3.png'>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next split the balls into two groups:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src = '../../img/decision_tree4.png'>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# two groups\n",
    "balls_left = [1 for i in range(8)] + [0 for i in range(5)]  # 8 blue and 5 yellow\n",
    "balls_right = [1 for i in range(1)] + [0 for i in range(6)]  # 1 blue and 6 yellow"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Implement a function to calculate the Shannon Entropy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from math import log\n",
    "\n",
    "\n",
    "def entropy(a_list):\n",
    "    lst = list(a_list)\n",
    "    size = len(lst)\n",
    "    entropy = 0\n",
    "    set_elements = len(set(lst))\n",
    "    if set_elements in [0, 1]:\n",
    "        return 0\n",
    "    for i in set(lst):\n",
    "        occ = lst.count(i)\n",
    "        entropy -= occ / size * log(occ / size, 2)\n",
    "    return entropy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Tests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(entropy(balls))  # 9 blue and 11 yellow ones\n",
    "print(entropy(balls_left))  # 8 blue and 5 yellow ones\n",
    "print(entropy(balls_right))  # 1 blue and 6 yellow ones\n",
    "print(entropy([1, 2, 3, 4, 5, 6]))  # entropy of a fair 6-sided die"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "3\\. What is the entropy of the state given by the list **balls_left**?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font color='red'>Answer:</font> 0.961"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "4\\. What is the entropy of a fair dice? (where we look at a dice as a system with 6 equally probable states)?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font color='red'>Answer:</font> 2.585"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# information gain calculation\n",
    "def information_gain(root, left, right):\n",
    "    \"\"\" root - initial data, left and right - two partitions of initial data\"\"\"\n",
    "\n",
    "    return (\n",
    "        entropy(root)\n",
    "        - 1.0 * len(left) / len(root) * entropy(left)\n",
    "        - 1.0 * len(right) / len(root) * entropy(right)\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(information_gain(balls, balls_left, balls_right))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "5\\. What is the information gain from splitting the initial dataset into **balls_left** and **balls_right** ?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font color='red'>Answer:</font> 0.161"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def information_gains(X, y):\n",
    "    \"\"\"Outputs information gain when splitting with each feature\"\"\"\n",
    "    out = []\n",
    "    for i in X.columns:\n",
    "        out.append(information_gain(y, y[X[i] == 0], y[X[i] == 1]))\n",
    "    return out"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Optional:\n",
    "- Implement a decision tree building algorithm by calling `information_gains` recursively\n",
    "- Plot the resulting tree"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "information_gains(df_train, y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def btree(X, y, feature_names):\n",
    "    clf = information_gains(X, y)\n",
    "    best_feat_id = clf.index(max(clf))\n",
    "    best_feature = feature_names[best_feat_id]\n",
    "    print(f\"Best feature to split: {best_feature}\")\n",
    "\n",
    "    x_left = X[X.iloc[:, best_feat_id] == 0]\n",
    "    x_right = X[X.iloc[:, best_feat_id] == 1]\n",
    "    print(f\"Samples: {len(x_left)} (left) and {len(x_right)} (right)\")\n",
    "\n",
    "    y_left = y[X.iloc[:, best_feat_id] == 0]\n",
    "    y_right = y[X.iloc[:, best_feat_id] == 1]\n",
    "    entropy_left = entropy(y_left)\n",
    "    entropy_right = entropy(y_right)\n",
    "    print(f\"Entropy: {entropy_left} (left) and {entropy_right} (right)\")\n",
    "    print(\"_\" * 30 + \"\\n\")\n",
    "    if entropy_left != 0:\n",
    "        print(f\"Splitting the left group with {len(x_left)} samples:\")\n",
    "        btree(x_left, y_left, feature_names)\n",
    "    if entropy_right != 0:\n",
    "        print(f\"Splitting the right group with {len(x_right)} samples:\")\n",
    "        btree(x_right, y_right, feature_names)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "btree(df_train, y, df_train.columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This visualization is far from perfect, but it's easy to grasp if you compare it to the normal tree visualization (by sklearn) done above."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Part 3. The \"Adult\" dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Dataset description:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Dataset](http://archive.ics.uci.edu/ml/machine-learning-databases/adult) UCI Adult (no need to download it, we have a copy in the course repository): classify people using demographical data - whether they earn more than \\$50,000 per year or not."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Feature descriptions:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- **Age** – continuous feature\n",
    "- **Workclass** –  continuous feature\n",
    "- **fnlwgt** – final weight of object, continuous feature\n",
    "- **Education** –  categorical feature\n",
    "- **Education_Num** – number of years of education, continuous feature\n",
    "- **Martial_Status** –  categorical feature\n",
    "- **Occupation** –  categorical feature\n",
    "- **Relationship** – categorical feature\n",
    "- **Race** – categorical feature\n",
    "- **Sex** – categorical feature\n",
    "- **Capital_Gain** – continuous feature\n",
    "- **Capital_Loss** – continuous feature\n",
    "- **Hours_per_week** – continuous feature\n",
    "- **Country** – categorical feature"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Target** – earnings level, categorical (binary) feature."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Reading train and test data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_train = pd.read_csv(\"../../data/adult_train.csv\", sep=\";\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_train.tail()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_test = pd.read_csv(\"../../data/adult_test.csv\", sep=\";\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_test.tail()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# necessary to remove rows with incorrect labels in test dataset\n",
    "data_test = data_test[\n",
    "    (data_test[\"Target\"] == \" >50K.\") | (data_test[\"Target\"] == \" <=50K.\")\n",
    "]\n",
    "\n",
    "# encode target variable as integer\n",
    "data_train.loc[data_train[\"Target\"] == \" <=50K\", \"Target\"] = 0\n",
    "data_train.loc[data_train[\"Target\"] == \" >50K\", \"Target\"] = 1\n",
    "\n",
    "data_test.loc[data_test[\"Target\"] == \" <=50K.\", \"Target\"] = 0\n",
    "data_test.loc[data_test[\"Target\"] == \" >50K.\", \"Target\"] = 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Primary data analysis"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_test.describe(include=\"all\").T"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_train[\"Target\"].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = plt.figure(figsize=(25, 15))\n",
    "cols = 5\n",
    "rows = np.ceil(float(data_train.shape[1]) / cols)\n",
    "for i, column in enumerate(data_train.columns):\n",
    "    ax = fig.add_subplot(rows, cols, i + 1)\n",
    "    ax.set_title(column)\n",
    "    if data_train.dtypes[column] == np.object:\n",
    "        data_train[column].value_counts().plot(kind=\"bar\", axes=ax)\n",
    "    else:\n",
    "        data_train[column].hist(axes=ax)\n",
    "        plt.xticks(rotation=\"vertical\")\n",
    "plt.subplots_adjust(hspace=0.7, wspace=0.2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Checking data types"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_train.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_test.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we see, in the test data, age is treated as type **object**. We need to fix this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_test[\"Age\"] = data_test[\"Age\"].astype(int)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Also we'll cast all **float** features to **int** type to keep types consistent between our train and test data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_test[\"fnlwgt\"] = data_test[\"fnlwgt\"].astype(int)\n",
    "data_test[\"Education_Num\"] = data_test[\"Education_Num\"].astype(int)\n",
    "data_test[\"Capital_Gain\"] = data_test[\"Capital_Gain\"].astype(int)\n",
    "data_test[\"Capital_Loss\"] = data_test[\"Capital_Loss\"].astype(int)\n",
    "data_test[\"Hours_per_week\"] = data_test[\"Hours_per_week\"].astype(int)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Fill in missing data for continuous features with their median values, for categorical features with their mode."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# we see some missing values\n",
    "data_train.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# choose categorical and continuous features from data\n",
    "\n",
    "categorical_columns = [\n",
    "    c for c in data_train.columns if data_train[c].dtype.name == \"object\"\n",
    "]\n",
    "numerical_columns = [\n",
    "    c for c in data_train.columns if data_train[c].dtype.name != \"object\"\n",
    "]\n",
    "\n",
    "print(\"categorical_columns:\", categorical_columns)\n",
    "print(\"numerical_columns:\", numerical_columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# fill missing data\n",
    "\n",
    "for c in categorical_columns:\n",
    "    data_train[c].fillna(data_train[c].mode()[0], inplace=True)\n",
    "    data_test[c].fillna(data_train[c].mode()[0], inplace=True)\n",
    "\n",
    "for c in numerical_columns:\n",
    "    data_train[c].fillna(data_train[c].median(), inplace=True)\n",
    "    data_test[c].fillna(data_train[c].median(), inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# no more missing values\n",
    "data_train.info()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We'll dummy code some categorical features: **Workclass**, **Education**, **Martial_Status**, **Occupation**, **Relationship**, **Race**, **Sex**, **Country**. It can be done via pandas method **get_dummies**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_train = pd.concat(\n",
    "    [data_train[numerical_columns], pd.get_dummies(data_train[categorical_columns])],\n",
    "    axis=1,\n",
    ")\n",
    "\n",
    "data_test = pd.concat(\n",
    "    [data_test[numerical_columns], pd.get_dummies(data_test[categorical_columns])],\n",
    "    axis=1,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "set(data_train.columns) - set(data_test.columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_train.shape, data_test.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### There is no Holland in the test data. Create new zero-valued feature."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_test[\"Country_ Holand-Netherlands\"] = 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "set(data_train.columns) - set(data_test.columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_train.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_test.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train = data_train.drop([\"Target\"], axis=1)\n",
    "y_train = data_train[\"Target\"]\n",
    "\n",
    "X_test = data_test.drop([\"Target\"], axis=1)\n",
    "y_test = data_test[\"Target\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.1 Decision tree without parameter tuning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Train a decision tree **(DecisionTreeClassifier)** with a maximum depth of 3, and evaluate the accuracy metric on the test data. Use parameter **random_state = 17** for results reproducibility."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tree = DecisionTreeClassifier(max_depth=3, random_state=17)\n",
    "tree.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Make a prediction with the trained model on the test data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_test = X_test[X_train.columns] # The feature names should match those that were passed during fit\n",
    "tree_predictions = tree.predict(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "accuracy_score(y_test, tree_predictions)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "6\\. What is the test set accuracy of a decision tree with maximum tree depth of 3 and **random_state = 17**?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2 Decision tree with parameter tuning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Train a decision tree **(DecisionTreeClassifier, random_state = 17).** Find the optimal maximum depth using 5-fold cross-validation **(GridSearchCV)**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "tree_params = {\"max_depth\": range(2, 11)}\n",
    "\n",
    "locally_best_tree = GridSearchCV(\n",
    "    DecisionTreeClassifier(random_state=17), tree_params, cv=5\n",
    ")\n",
    "\n",
    "locally_best_tree.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Best params:\", locally_best_tree.best_params_)\n",
    "print(\"Best cross validaton score\", locally_best_tree.best_score_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Train a decision tree with maximum depth of 9 (it is the best **max_depth** in my case), and compute the test set accuracy. Use parameter **random_state = 17** for reproducibility."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tuned_tree = DecisionTreeClassifier(max_depth=9, random_state=17)\n",
    "tuned_tree.fit(X_train, y_train)\n",
    "tuned_tree_predictions = tuned_tree.predict(X_test)\n",
    "accuracy_score(y_test, tuned_tree_predictions)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "7\\. What is the test set accuracy of a decision tree with maximum depth of 9 and **random_state = 17**?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font color='red'>Answer:</font> 0.848"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.3 (Optional) Random forest without parameter tuning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's take a sneak peek of upcoming lectures and try to use a random forest for our task. For now, you can imagine a random forest as a bunch of decision trees, trained on slightly different subsets of the training data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Train a random forest **(RandomForestClassifier)**. Set the number of trees to 100 and use **random_state = 17**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rf = RandomForestClassifier(n_estimators=100, random_state=17)\n",
    "rf.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Perfrom cross-validation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "cv_scores = cross_val_score(rf, X_train, y_train, cv=3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "cv_scores, cv_scores.mean()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Make predictions for the test data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "forest_predictions = rf.predict(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "accuracy_score(y_test, forest_predictions)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.4 (Optional) Random forest with parameter tuning"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Train a random forest **(RandomForestClassifier)** of 10 trees. Tune the maximum depth and maximum number of features for each tree using **GridSearchCV**. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "forest_params = {\"max_depth\": range(10, 16), \"max_features\": range(5, 105, 20)}\n",
    "\n",
    "locally_best_forest = GridSearchCV(\n",
    "    RandomForestClassifier(n_estimators=10, random_state=17, n_jobs=-1),\n",
    "    forest_params,\n",
    "    cv=3,\n",
    "    verbose=1,\n",
    ")\n",
    "\n",
    "locally_best_forest.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(\"Best params:\", locally_best_forest.best_params_)\n",
    "print(\"Best cross validaton score\", locally_best_forest.best_score_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Make predictions for the test data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tuned_forest_predictions = locally_best_forest.predict(X_test)\n",
    "accuracy_score(y_test, tuned_forest_predictions)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Wow! Looks that with some tuning we made a forest of 10 trees perform better than a forest of 100 trees with default hyperparameter values. "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.0"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}