{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Lab 2 Classification and Regression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this lab, we will work with classification and regression. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To complete this lab:\n", "1. Follow the instructions running the code when asked.\n", "2. Discuss each question in your group.\n", "3. Some questions are self-grading by running some code (the cells with `ok.grade()` functions) where asked. However, you should keep notes for your answers to the more reflective questions in a separate MS Word document (you can use [this template for Lab 2](Lab2_answers_template.docx)).\n", "4. When completed, submit your reflective answers in your Word document to Studium, so that your teachers can provide feedback." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linear Regression\n", "\n", "If we have a data set with two variables that depend on each other, then with the help of linear regression we can make a predictive model. We try to find a causal relationship between two variables, one of which depends on a number of independent variables. We will use a dataset that describes heights and weights of men and women.\n", "\n", "Let's first set up the notebook by importing the components we will use in our code:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "import pandas as pd\n", "from sklearn import linear_model\n", "import matplotlib\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.tree import DecisionTreeClassifier\n", "from graphviz import Source\n", "from sklearn import tree\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "from client.api.notebook import Notebook\n", "ok = Notebook('lab2.ok')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To read the dataset, run the following code cell:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "body_stats = pd.read_csv(\"weight-height.csv\")\n", "body_stats.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This dataset is actually in Imperial units with height measured in inches and weight measured in pounds. Let's first changes this to metric values! Run the following cell that will do just that:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "body_stats = pd.read_csv(\"weight-height.csv\")\n", "body_stats.Height = body_stats.Height.apply(lambda x: x * 2.54)\n", "body_stats.Weight = body_stats.Weight.apply(lambda x: x / 2.2046)\n", "body_stats" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q1.** How many rows and columns are in the dataset?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "body_stats_number_of_rows = ...\n", "body_stats_number_of_columns = ..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "_ = ok.grade('q21')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can see the dataset shape and what the data points look like, let's get a feel of how data looks by visualizing it. 
Run the next code cell to create a scatterplot of the data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "_ = body_stats.plot.scatter(x='Weight', y='Height')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q2.** Can you distinguish between the male and female groups in the dataset? Explain your observation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One way that we can distinguish between distinct groups that we know about is to colour them differently in our plots. Run the next cell to create the same scatterplot but with the male and female data distinguished by colour." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "cmap = {'Male': 'blue', 'Female': 'pink'}\n", "_ = body_stats.plot.scatter(x='Weight', y='Height', c=[cmap.get(c) for c in body_stats.Gender])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's look at each category of data in isolation:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "male_stats = body_stats[body_stats.Gender == 'Male']\n", "_ = male_stats.plot.scatter(x='Weight', y='Height', c='blue')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "female_stats = body_stats[body_stats.Gender == 'Female']\n", "_ = female_stats.plot.scatter(x='Weight', y='Height', c='pink')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The reason we use linear regression is to find the line that lies as close to all the points as possible, so that we can enter, for example, a weight and then infer the corresponding height (make a prediction). As you can see in the scatterplots, the data is quite spread out, and it is actually quite difficult to manually fit the right line $y = ax + b$ to represent the linear relationship.\n", "\n", "Luckily, we can use the `LinearRegression` model from the `sklearn` Python package to build a linear regression model. First we fit the data. 
This trains the model based on the two variables, in this case `male_stats.Weight` and `male_stats.Height`:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "lm_male = linear_model.LinearRegression()\n", "lm_male.fit([[x] for x in male_stats.Weight], male_stats.Height)\n", "m = lm_male.coef_[0]\n", "b = lm_male.intercept_\n", "print(\"slope=\", m, \"intercept=\", b)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To see how well the linear model we trained fits the male weight and height data, we can plot the fitted linear relationship over the scatterplot we created earlier, using the slope and intercept we extracted from our linear model `lm_male`: " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "plt.scatter(x=male_stats.Weight, y=male_stats.Height, c='blue')\n", "predicted_values_m = [lm_male.coef_ * i + lm_male.intercept_ for i in male_stats.Weight]\n", "plt.plot(male_stats.Weight, predicted_values_m, 'black')\n", "plt.xlabel(\"Weight\")\n", "plt.ylabel(\"Height\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the next two cells to do the same for the female weight and height data:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "lm_female = linear_model.LinearRegression()\n", "lm_female.fit([[x] for x in female_stats.Weight], female_stats.Height)\n", "m = lm_female.coef_[0]\n", "b = lm_female.intercept_\n", "print(\"slope=\", m, \"intercept=\", b)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "plt.scatter(female_stats.Weight, female_stats.Height, c='pink')\n", "predicted_values_f = [lm_female.coef_ * i + lm_female.intercept_ for i in female_stats.Weight]\n", "plt.plot(female_stats.Weight, predicted_values_f, 'purple')\n", "plt.ylabel(\"Height\")\n", "plt.xlabel(\"Weight\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, let's visualize both together:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "plt.scatter(body_stats.Weight, body_stats.Height, c=[cmap.get(c) for c in body_stats.Gender])\n", "plt.plot(male_stats.Weight, predicted_values_m, 'black')\n", "plt.plot(female_stats.Weight, predicted_values_f, 'purple')\n", "plt.xlabel(\"Weight\")\n", "plt.ylabel(\"Height\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Apart from plotting the fitted linear relationship over the scatterplot, our linear regression model also provides us with a function `predict()` that can take an input value and output a prediction. For example, assuming a fitted linear model named `my_linear_model`, we might use the code:\n", "\n", "    my_linear_model.predict([[64.7]])\n", "\n", "to predict, in cm, how tall a person weighing 64.7 kg is. (The double brackets are needed because `predict()` expects a list of samples, where each sample is itself a list of feature values.)\n", "\n", "**Q3.** Based on your linear models for the male and female data, what are the predicted heights for a male weighing 80.6 kg (the average weight of a man in Sweden) and a female weighing 64.7 kg (the average weight of a woman in Sweden)? Provide your answers to at least 2 decimal places.\n", "\n", "Write your own code in each of the two following cells to calculate this using the linear models." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "... 
# use lm_male" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "... # use lm_female" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "predicted_male_height = ...\n", "predicted_female_height = ..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "_ = ok.grade('q23')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Classification\n", "\n", "Next, we will explore classification by looking at a dataset relating to the demographics of the survivors of the Titantic disaster. If you are not familiar with the Titanic, you could watch the film but it is more than 3 hours long, so do not watch it during this lab. In this lab we will use a decision tree as a predictive model to predict if a person with particular features might have lived or died on the Titanic.\n", "\n", "First, we load the dataset:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "titantic_passengers = pd.read_csv(\"titanic.csv\")\n", "titantic_passengers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data dictionary\n", "\n", "- `survival` - Whether the person survived or not (0 = No, 1 = Yes)\n", "- `pclass` - Passenger class\n", "- `sex` - Gender of the person (male or female)\n", "- `age`- Age in years\n", "- `sibsp` - # of siblings/spouses aboard the Titanic\n", "- `parch` - # of parents/children aboard the Titanic\n", "- `ticket` - Ticket ID number\n", "- `fare` - Passenger fare\n", "- `cabin` - Cabin number\n", "- `embarked` - Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q4.** According to this data set, how many passengers were on board the Titanic?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "num_passengers = ..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "_ = ok.grade('q24')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Preliminary exploration\n", "\n", "Before we build our classifier, let us explore the invidual features that might affect the survival outcome. We can extract the number of survivors according to this data by looking at the `survived` columns. Note that for each record a survivor is represented with a 1 and a non-survivor represented with a 0 (zero). This means we can simply retrive the sum each category from the `survived` column as follows using the `value_counts()` function:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "titantic_passengers.survived.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can additionally plot this quite easliy:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "_ = titantic_passengers.survived.value_counts().plot.bar()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q5.** According to this dataset, what percentage of passengers perished when the Titanic sank? Give your answer to the nearest whole percentage." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "percentage_perished = ..." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "_ = ok.grade('q25')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we can see from this figure, we can hypothesize that if we had gone on the Titanic we may have been more than likely than not to have died, since more than half of the total number of passengers perished. This is not that exciting analysis and does not reveal who might have had a better chance of survival.\n", "\n", "### Analysis based on variables\n", "\n", "In this section we will do a little more specific predictions but with only one column that the outcome may be due to.\n", "\n", "#### Gender\n", "\n", "Let's look at the distribution based on gender:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "titantic_passengers.sex.value_counts()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "_ = titantic_passengers.sex.value_counts().plot.bar()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that in general there were almost twice as many men as women. Let's check the distribution of survival rates based on gender:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "survived_by_gender = titantic_passengers.groupby('sex').survived.mean()\n", "_ = survived_by_gender.plot.bar()\n", "survived_by_gender" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q6.** What do you observe in the data and why do you think the distribution is as seen? Is there any possibility to reason why this is the case from only the data?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Age\n", "\n", "Let's look at the distribution based on age:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "survived_by_age = titantic_passengers.groupby('age').survived.sum()\n", "_ = survived_by_age.plot.bar()\n", "survived_by_age.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q7.** As you can see, there are quite a few values that make the plot impossible to read. How many unique values occur in our age distribution (see the shape output above)?" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "num_unique_ages_in_distribution = ..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "_ = ok.grade('q27')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To be able to do a better analysis, we can apply some age categorisations. For example, we can add a column in the dataset that categorises a passenger as a child or not. 
The next code cell adds this column to the original dataset, where we define a child as a person aged 15 or under:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "titantic_passengers['age_range'] = pd.cut(titantic_passengers.age, [0, 15, 80], labels=['child', 'adult'])\n", "titantic_passengers.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can plot the proportion of adults and children who survived:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "survived_by_age = titantic_passengers.groupby('age_range').survived.mean()\n", "_ = survived_by_age.plot.bar()\n", "survived_by_age" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q8.** What does the plot showing the proportion of adults and children who survived indicate?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We could continue exploring all of the different features in the dataset like this, but instead we will do something a bit more interesting by building a classifier using decision trees." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Decision Trees\n", "\n", "So far we have tried to identify individual features that might have been crucial for survival. However, there are other ways to identify these features. What we are going to do is build a decision tree model, which we can then use to make predictions.\n", "\n", "First of all, we need to do some data cleaning in order to build our model correctly. By running the `info()` function we can see which columns have missing (null) cells:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "titantic_passengers.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the output, we can see that there is actually quite a lot missing. However, we will ignore some of the incomplete columns in our analysis, and fix the ones we need to keep.\n", "\n", "**Q9.** Which columns are identified as having missing (null) values? Provide your answer by adding the incomplete columns to the list of strings in the code cell below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "incomplete_columns = [\"age\", \"fare\", ]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "_ = ok.grade('q29')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, we need to fix our `age` and `fare` columns, and we can do that by filling the null values with the mean of each column. This is called **mean imputation**.\n", "\n", "> \"*Mean imputation* is the replacement of a missing observation with the mean of the non-missing observations for that variable.\"\n", "\n", "Imputing the mean preserves the mean of the original data. If the data is missing completely at random, this ensures that the estimate of the mean remains unbiased. Also, by imputing the mean, you preserve the full sample size; otherwise one may have to drop the rows with missing data. 
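As a toy illustration of the idea (the numbers are made up, not taken from this dataset), mean imputation on a small series with one missing value looks like this:\n", "\n", "    s = pd.Series([160.0, 170.0, None, 180.0])\n", "    s.fillna(s.mean())  # the missing entry becomes 170.0, the mean of the observed values\n", "\n", "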
To understand this, let's compare the `age` column in the original data before and after imputation:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pd.DataFrame({\n", " \"original\": titantic_passengers.age.describe(),\n", " \"imputed\": titantic_passengers.age.fillna(titantic_passengers.age.mean()).describe()\n", "})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should observe that the total count of the data is higher in the imputed column, because we fill the `NaN` values with the mean of the column. At the same time, we can see that most of the summary statistics of the column remain unchanged (the mean, min, and max are all preserved).\n", "\n", "*Mean imputation is not without its issues, but for the purposes of this exercise we will not worry about that.*\n", "\n", "Run the following cell to impute the `age` and `fare` columns and modify the `titantic_passengers` DataFrame:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "titantic_passengers.age = titantic_passengers.age.fillna(titantic_passengers.age.mean())\n", "titantic_passengers.fare = titantic_passengers.fare.fillna(titantic_passengers.fare.mean())\n", "titantic_passengers.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our decision tree model from `sklearn` needs all of the input data to be encoded numerically. This means that any categorical information that is not already encoded as numbers needs to be converted as well. The following line of code converts the `sex` column into a numerical column `sex_male` (1 for male, 0 for female):" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "titantic_passengers = pd.get_dummies(titantic_passengers, columns=['sex'], drop_first=True)\n", "titantic_passengers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You should notice that we now have a new column called `sex_male` with a numerical representation in it. Now we can drop the parts of the table we are no longer interested in:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "survived_data = titantic_passengers.survived # save this for training later\n", "titantic_passengers = titantic_passengers[['sex_male', 'fare', 'age', 'sibsp']]\n", "titantic_passengers.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In order to train our model we need to split the data into a training dataset and a test dataset. This means that we can train our classifier on the training dataset and then validate it after training using the test dataset, which was kept separate during the training process.\n", "\n", "We use components from the `sklearn` package to help us. We split two datasets: (1) the full set of features and (2) the target labels (survived or not). We split the data 75% / 25%, since we wish to use as much data as possible to train the classifier but preserve enough data to validate the classifier after training."
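, "\n", "\n", "Note that `train_test_split` shuffles the rows randomly before splitting, so the exact rows in your training and test sets (and therefore some of the numbers you see below) will differ between runs of this notebook. If you would like a reproducible split, you can pass a fixed seed, for example (a sketch; the value 42 is an arbitrary choice):\n", "\n", "    X_train, X_test, y_train, y_test = train_test_split(\n", "        titantic_passengers, survived_data, test_size=0.25, random_state=42)\n"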
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(titantic_passengers, survived_data, test_size=0.25)\n", "print(\"Our training data has {} rows\".format(len(X_train)))\n", "print(\"Our test data has {} rows\".format(len(X_test)))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we train our `DecisionTreeClassifier` on the training data by asking `sklearn` to fit the features found in `X_train` (the 75% sample from the `titanic_passengers` data) against `y_train` (the 25% sample)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "classifier = DecisionTreeClassifier(max_depth=3)\n", "classifier.fit(X_train.values, y_train.values)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now our decision tree has been created. We can test this by taking some test data that we split before into `X_test`. Note that the records in `X_test` were **not** used during the training process, which is why we can use it to validate our trained model.\n", "\n", "The next cell takes the first 10 records (for convenience) as a sample table, then we run the `classifier.predict()` function on the sample and add it to the sample table for us to view the results:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "sample = X_test.head(10)\n", "sample['survived'] = classifier.predict(sample)\n", "sample" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q2.10.** Inspect the table created by our prediction on the `sample` records. How many men survived and how many women survived in our prediction?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "number_of_predicted_male_survivors = ...\n", "number_of_predicted_female_survivors = ..." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "_ = ok.grade('q210')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q10.** How well do you think the model makes it's predicted classifications?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even though we can see the effects by running some data through the classifer to view the predicted outcomes, we cannot see how the decision tree model makes its decisions. We can visualize the model we trained.\n", "\n", "Run the following cell to visualize our classifier:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "tree_plot = Source(tree.export_graphviz(classifier, out_file=None, \n", " feature_names=X_train.columns, class_names=['Dead', 'Alive'], \n", " filled=True, rounded=True, special_characters=True))\n", "tree_plot" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Q11.** Take a look at the decision tree that was created (ignore the variables that we have not discussed such as `gini`, `samples`, `value`). What do you think about how the decision tree has logically rationalised the selection of survivors?" 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "**When you are finished, let a teacher know you are finished and to check if you have any questions about the lab.**\n", "\n", "If you wish to save your work in this notebook, choose **Save and Checkpoint** from the **File** menu, then choose **Download as Notebook** from the **File** menu and save it to your computer or USB stick." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.10" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": true, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "nbTranslate": { "displayLangs": [], "hotkey": "alt-t", "langInMainMenu": true, "sourceLang": "sv", "targetLang": "en", "useGoogleTranslate": true } }, "nbformat": 4, "nbformat_minor": 4 }