{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Basic Modeling in scikit-learn\n", "> Before we can validate models, we need an understanding of how to create and work with them. This chapter provides an introduction to running regression and classification models in scikit-learn. We will use this model building foundation throughout the remaining chapters. This is the Summary of lecture \"Model Validation in Python\", via datacamp.\n", "\n", "- toc: true \n", "- badges: true\n", "- comments: true\n", "- author: Chanseok Kang\n", "- categories: [Python, Datacamp, Machine_Learning]\n", "- image: " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction to model validation\n", "- Model validation\n", " - Ensuring your model performs as expected on new data\n", " - Testing model performance on holdout datasets\n", " - Selecting the best model, parameters, and accuracy metrics\n", " - Achieving the best accuracy for the given data\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Seen vs. unseen data\n", "Model's tend to have higher accuracy on observations they have seen before. In the candy dataset, predicting the popularity of Skittles will likely have higher accuracy than predicting the popularity of Andes Mints; Skittles is in the dataset, and Andes Mints is not.\n", "\n", "You've built a model based on 50 candies using the dataset `X_train` and need to report how accurate the model is at predicting the popularity of the 50 candies the model was built on, and the 35 candies (`X_test`) it has never seen. You will use the mean absolute error, `mae()`, as the accuracy metric." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
competitornamechocolatefruitycaramelpeanutyalmondynougatcrispedricewaferhardbarpluribussugarpercentpricepercentwinpercent
0100 Grand1010010100.7320.86066.971725
13 Musketeers1000100100.6040.51167.602936
2One dime0000000000.0110.11632.261086
3One quarter0000000000.0110.51146.116505
4Air Heads0100000000.9060.51152.341465
\n", "
" ], "text/plain": [ " competitorname chocolate fruity caramel peanutyalmondy nougat \\\n", "0 100 Grand 1 0 1 0 0 \n", "1 3 Musketeers 1 0 0 0 1 \n", "2 One dime 0 0 0 0 0 \n", "3 One quarter 0 0 0 0 0 \n", "4 Air Heads 0 1 0 0 0 \n", "\n", " crispedricewafer hard bar pluribus sugarpercent pricepercent \\\n", "0 1 0 1 0 0.732 0.860 \n", "1 0 0 1 0 0.604 0.511 \n", "2 0 0 0 0 0.011 0.116 \n", "3 0 0 0 0 0.011 0.511 \n", "4 0 0 0 0 0.906 0.511 \n", "\n", " winpercent \n", "0 66.971725 \n", "1 67.602936 \n", "2 32.261086 \n", "3 46.116505 \n", "4 52.341465 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "candy = pd.read_csv('./dataset/candy-data.csv')\n", "candy.head()" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "X = candy.drop(['competitorname', 'winpercent'], axis=1)\n", "y = candy['winpercent']" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
chocolatefruitycaramelpeanutyalmondynougatcrispedricewaferhardbarpluribussugarpercentpricepercent
01010010100.7320.860
11000100100.6040.511
20000000000.0110.116
30000000000.0110.511
40100000000.9060.511
....................................
800100000000.2200.116
810100001000.0930.116
820100000010.3130.313
830010001000.1860.267
841000010010.8720.848
\n", "

85 rows × 11 columns

\n", "
" ], "text/plain": [ " chocolate fruity caramel peanutyalmondy nougat crispedricewafer \\\n", "0 1 0 1 0 0 1 \n", "1 1 0 0 0 1 0 \n", "2 0 0 0 0 0 0 \n", "3 0 0 0 0 0 0 \n", "4 0 1 0 0 0 0 \n", ".. ... ... ... ... ... ... \n", "80 0 1 0 0 0 0 \n", "81 0 1 0 0 0 0 \n", "82 0 1 0 0 0 0 \n", "83 0 0 1 0 0 0 \n", "84 1 0 0 0 0 1 \n", "\n", " hard bar pluribus sugarpercent pricepercent \n", "0 0 1 0 0.732 0.860 \n", "1 0 1 0 0.604 0.511 \n", "2 0 0 0 0.011 0.116 \n", "3 0 0 0 0.011 0.511 \n", "4 0 0 0 0.906 0.511 \n", ".. ... ... ... ... ... \n", "80 0 0 0 0.220 0.116 \n", "81 1 0 0 0.093 0.116 \n", "82 0 0 1 0.313 0.313 \n", "83 1 0 0 0.186 0.267 \n", "84 0 0 1 0.872 0.848 \n", "\n", "[85 rows x 11 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import mean_absolute_error as mae\n", "from sklearn.ensemble import RandomForestRegressor\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)\n", "\n", "model = RandomForestRegressor(n_estimators=50)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Model error on seen data: 3.71.\n", "Model error on unseen data: 8.67.\n" ] } ], "source": [ "# The model is fit using X_train and y_train\n", "model.fit(X_train, y_train)\n", "\n", "# Create vectors of predictions\n", "train_predictions = model.predict(X_train)\n", "test_predictions = model.predict(X_test)\n", "\n", "# Train/Test Errors\n", "train_error = mae(y_true=y_train, y_pred=train_predictions)\n", "test_error = mae(y_true=y_test, y_pred=test_predictions)\n", "\n", "# Print the accuracy for seen and unseen data\n", "print(\"Model error on seen data: {0:.2f}.\".format(train_error))\n", "print(\"Model error on unseen data: {0:.2f}.\".format(test_error))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When models perform differently on training and testing data, you should look to model validation to ensure you have the best performing model. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Regression models\n", "- Random forest parameters\n", " - `n_estimators`: the number of trees in the forest\n", " - `max_depth`: the maximum depth of the trees\n", " - `random_state`: random seed" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set parameters and fit a model\n", "Predictive tasks fall into one of two categories: regression or classification. In the candy dataset, the outcome is a continuous variable describing how often the candy was chosen over another candy in a series of 1-on-1 match-ups. 
To predict this value (the win-percentage), you will use a **regression** model.\n", "\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestRegressor\n", "\n", "rfr = RandomForestRegressor()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',\n", "                      max_depth=6, max_features='auto', max_leaf_nodes=None,\n", "                      max_samples=None, min_impurity_decrease=0.0,\n", "                      min_impurity_split=None, min_samples_leaf=1,\n", "                      min_samples_split=2, min_weight_fraction_leaf=0.0,\n", "                      n_estimators=100, n_jobs=None, oob_score=False,\n", "                      random_state=1111, verbose=0, warm_start=False)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Set the number of trees\n", "rfr.n_estimators = 100\n", "\n", "# Add a maximum depth\n", "rfr.max_depth = 6\n", "\n", "# Set the random state\n", "rfr.random_state = 1111\n", "\n", "# Fit the model\n", "rfr.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You have updated parameters after the model was initialized. This approach is helpful when you need to adjust parameters on an existing model without re-creating it. Before making predictions, let's see which candy characteristics were most important to the model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Feature importances\n", "Although some candy attributes, such as chocolate, may be extremely popular, it doesn't mean they will be important to model prediction. After a random forest model has been fit, you can review the model's attribute, `.feature_importances_`, to see which variables had the biggest impact. You can check how important each variable was in the model by looping over the feature importance array using `enumerate()`.\n", "\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "chocolate: 0.23\n", "fruity: 0.04\n", "caramel: 0.04\n", "peanutyalmondy: 0.06\n", "nougat: 0.01\n", "crispedricewafer: 0.01\n", "hard: 0.02\n", "bar: 0.02\n", "pluribus: 0.03\n", "sugarpercent: 0.24\n", "pricepercent: 0.30\n" ] } ], "source": [ "# Print how important each column is to the model\n", "for i, item in enumerate(rfr.feature_importances_):\n", "    # Use i and item to print out the feature importance of each column\n", "    print(\"{0:s}: {1:.2f}\".format(X_train.columns[i], item))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this run, pricepercent (0.30) and sugarpercent (0.24) narrowly edge out chocolate (0.23) as the most important variables. `.feature_importances_` is a great way to see which variables were important to your random forest model." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Classification models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Classification predictions\n", "In model validation, it is often important to know more about the predictions than just the final classification. When predicting who will win a game, most people are also interested in how likely it is a team will win.\n", "\n", "| Probability | Prediction | Meaning |\n", "| ----------- | ----------- | ---------- |\n", "| 0 to .5 | 0 | Team Loses |\n", "| .5 to 1 | 1 | Team Wins |\n", "\n", "In this exercise, you look at the methods `.predict()` and `.predict_proba()` using the `tic_tac_toe` dataset. 
The first method will give a prediction of whether Player One will win the game, and the second method will provide the probability of Player One winning." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Top-LeftTop-MiddleTop-RightMiddle-LeftMiddle-MiddleMiddle-RightBottom-LeftBottom-MiddleBottom-RightClass
0xxxxooxoopositive
1xxxxoooxopositive
2xxxxooooxpositive
3xxxxooobbpositive
4xxxxoobobpositive
\n", "
" ], "text/plain": [ " Top-Left Top-Middle Top-Right Middle-Left Middle-Middle Middle-Right \\\n", "0 x x x x o o \n", "1 x x x x o o \n", "2 x x x x o o \n", "3 x x x x o o \n", "4 x x x x o o \n", "\n", " Bottom-Left Bottom-Middle Bottom-Right Class \n", "0 x o o positive \n", "1 o x o positive \n", "2 o o x positive \n", "3 o b b positive \n", "4 b o b positive " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tic_tac_toe = pd.read_csv('./dataset/tic-tac-toe.csv')\n", "tic_tac_toe.head()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "y = tic_tac_toe['Class'].apply(lambda x: 1 if x == 'positive' else 0)\n", "X = tic_tac_toe.drop('Class', axis=1)\n", "X = pd.get_dummies(X)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "from sklearn.ensemble import RandomForestClassifier\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8)\n", "rfc = RandomForestClassifier()" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1 568\n", "0 199\n", "dtype: int64\n", "The first predicted probabilities are: [0.76 0.24]\n" ] } ], "source": [ "# Fit the rfc model\n", "rfc.fit(X_train, y_train)\n", "\n", "# Create arrays of predictions\n", "classification_predictions = rfc.predict(X_test)\n", "probability_predictions = rfc.predict_proba(X_test)\n", "\n", "# Print out count of binary predictions\n", "print(pd.Series(classification_predictions).value_counts())\n", "\n", "# Print the first value from probability_predictions\n", "print('The first predicted probabilities are: {}'.format(probability_predictions[0]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can see there were 563 observations where Player One was predicted to win the Tic-Tac-Toe game. Also, note that the `predicted_probabilities` array contains lists with only two values because you only have two possible responses (win or lose). Remember these two methods, as you will use them a lot throughout this course." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reusing model parameters\n", "Replicating model performance is vital in model validation. Replication is also important when sharing models with co-workers, reusing models on new data or asking questions on a website such as [Stack Overflow](https://stackoverflow.com/). You might use such a site to ask other coders about model errors, output, or performance. 
The best way to do this is to replicate your work by reusing model parameters.\n", "\n", "In this exercise, you use various methods to recall which parameters were used in a model.\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,\n", "                       criterion='gini', max_depth=6, max_features='auto',\n", "                       max_leaf_nodes=None, max_samples=None,\n", "                       min_impurity_decrease=0.0, min_impurity_split=None,\n", "                       min_samples_leaf=1, min_samples_split=2,\n", "                       min_weight_fraction_leaf=0.0, n_estimators=50,\n", "                       n_jobs=None, oob_score=False, random_state=1111,\n", "                       verbose=0, warm_start=False)\n", "The random state is: 1111\n", "Printing the parameters dictionary: {'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': 6, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 50, 'n_jobs': None, 'oob_score': False, 'random_state': 1111, 'verbose': 0, 'warm_start': False}\n" ] } ], "source": [ "rfc = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)\n", "\n", "# Print the classification model\n", "print(rfc)\n", "\n", "# Print the classification model's random state parameter\n", "print('The random state is: {}'.format(rfc.random_state))\n", "\n", "# Print all parameters\n", "print('Printing the parameters dictionary: {}'.format(rfc.get_params()))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recalling which parameters were used will be helpful going forward. Model validation and performance rely heavily on those parameters, and there is no way to replicate a model without keeping track of them!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Random forest classifier\n", "This exercise reviews the four modeling steps discussed throughout this chapter using a random forest classification model. You will:\n", "\n", "1. Create a random forest classification model.\n", "2. Fit the model using the tic_tac_toe dataset.\n", "3. Make predictions on whether Player One will win (1) or lose (0) the current game.\n", "4. Evaluate the overall accuracy of the model." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0 1 1 0 1]\n", "0.8213820078226858\n" ] } ], "source": [ "# Create a random forest classifier\n", "rfc = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)\n", "\n", "# Fit rfc using X_train and y_train\n", "rfc.fit(X_train, y_train)\n", "\n", "# Create predictions on X_test\n", "predictions = rfc.predict(X_test)\n", "print(predictions[0:5])\n", "\n", "# Print model accuracy using score() and the testing data\n", "print(rfc.score(X_test, y_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's all the steps! Notice that the first five predictions were [0 1 1 0 1], indicating that Player One is predicted to win three of those five games. You also see the model accuracy was only 82%."
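] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A quick sanity-check sketch to close the chapter, assuming the `rfc`, `predictions`, `X_train`, `X_test`, `y_train`, and `y_test` objects from the cells above are still in memory: it confirms that `.score()` reports plain accuracy by comparing it with `accuracy_score`, prints the confusion matrix behind that 82%, and rebuilds an identical classifier from `rfc.get_params()` to show why keeping track of parameters makes a model reproducible. The `rfc_rebuilt` name below is used only for illustration; rebuilding from `get_params()` is the same pattern you would use to share an exact model configuration with a co-worker." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import accuracy_score, confusion_matrix\n", "\n", "# .score() on a classifier reports accuracy; confirm it with accuracy_score\n", "print(accuracy_score(y_test, predictions))\n", "\n", "# The confusion matrix breaks that accuracy down by class (loss/win)\n", "print(confusion_matrix(y_test, predictions))\n", "\n", "# Rebuild an identical model from the saved parameters and refit it\n", "rfc_rebuilt = RandomForestClassifier(**rfc.get_params())\n", "rfc_rebuilt.fit(X_train, y_train)\n", "\n", "# Same parameters, same data, same random_state: the predictions match\n", "print((rfc_rebuilt.predict(X_test) == predictions).all())"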
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 4 }