{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"tags": [
"s1",
"content",
"l1"
]
},
"source": [
"# Logistic Regression: Model Building and Implementation\n",
"\n",
"\n",
"## Titanic Survivors - Data Selection & Preparation\n",
"\n",
"Before fitting a logistic regression model to classify who is likely to survive, we need to examine the dataset, drawing on the EDA as well as other statistical methods. Logistic regression is a supervised learning technique: it learns from labeled examples (here, the Survived column).\n",
"\n",
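"As a reminder of what the model will eventually compute, logistic regression passes a linear combination of the features through the sigmoid function to produce a probability. A minimal sketch (the intercept, weight and feature value below are made up purely for illustration):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def sigmoid(z):\n",
"    # Squash any real number into the (0, 1) interval.\n",
"    return 1.0 / (1.0 + np.exp(-z))\n",
"\n",
"# Hypothetical intercept and weight for a single feature.\n",
"b0, b1 = -1.0, 2.5\n",
"x = 0.8  # an example feature value\n",
"p_survived = sigmoid(b0 + b1 * x)  # probability of the positive class\n",
"```\n",
"\n",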
"The training and test datasets cannot be used directly, due to several issues including, but not limited to:\n",
"\n",
"- Sparse columns, such as Cabin, where most entries are missing.\n",
"- NaN entries in otherwise usable columns.\n",
"- Categorical variables stored as strings.\n",
"- Selecting the right columns for modeling.\n",
"\n",
"\n",
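"Each of these issues can be checked programmatically before modeling. A minimal sketch on a toy DataFrame (the column names mirror the Titanic data; the values are made up):\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Toy stand-in for train_data, exhibiting the same kinds of problems.\n",
"df = pd.DataFrame({\n",
"    'Age': [22.0, np.nan, 26.0, np.nan],        # NaN entries\n",
"    'Cabin': [np.nan, 'C85', np.nan, np.nan],   # sparse column\n",
"    'Sex': ['male', 'female', 'female', 'male'],  # categorical strings\n",
"})\n",
"\n",
"print(df.isnull().mean())  # fraction of NaNs per column (sparsity)\n",
"print(df.dtypes)           # 'object' dtype flags string categoricals\n",
"```\n",
"\n",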
"### Sparsity\n",
"\n",
"Let us examine sparse columns by computing the ratio of NaNs to the total number of rows. The DataFrame describe() method reports the count of non-NaN values, along with the mean, standard deviation and quartiles, but only for float/integer columns.\n",
"\n",
"```python\n",
"train_data.describe()\n",
"```\n",
"```\n",
"       PassengerId    Survived      Pclass         Age       SibSp  \\\n",
"count   891.000000  891.000000  891.000000  714.000000  891.000000   \n",
"mean    446.000000    0.383838    2.308642   29.699118    0.523008   \n",
"std     257.353842    0.486592    0.836071   14.526497    1.102743   \n",
"min       1.000000    0.000000    1.000000    0.420000    0.000000   \n",
"25%     223.500000    0.000000    2.000000   20.125000    0.000000   \n",
"50%     446.000000    0.000000    3.000000   28.000000    0.000000   \n",
"75%     668.500000    1.000000    3.000000   38.000000    1.000000   \n",
"max     891.000000    1.000000    3.000000   80.000000    8.000000   \n",
"\n",
"            Parch        Fare  \n",
"count  891.000000  891.000000  \n",
"mean     0.381594   32.204208  \n",
"std      0.806057   49.693429  \n",
"min      0.000000    0.000000  \n",
"25%      0.000000    7.910400  \n",
"50%      0.000000   14.454200  \n",
"75%      0.000000   31.000000  \n",
"max      6.000000  512.329200\n",
"```\n",
"\n",
"The count row tells us how many non-NaN values each column has; the remaining entries are NaNs.\n",
"\n",
"### Age\n",
"\n",
"The Age column has 714 entries, so 891 - 714 = 177 values are missing. That is 177/891 ≈ 0.20, or roughly 20% missing. If this percentage were small, we could simply drop those rows before fitting the logistic regression model. There are various methods to fill in the missing values, but before discussing how to fix the sparsity of the Age column, let us examine the other columns.\n",
"\n",
"### PassengerId, Name, Ticket\n",
"\n",
"PassengerId, Name and Ticket are unique to each person and hence will not serve as modeling columns. Logistic regression, like any supervised or unsupervised learning method, needs to find patterns in the dataset; this is a necessary condition for an algorithm to capture the data mathematically. Unique identifiers such as IDs and names carry no such patterns, so they are usually poor candidates for modeling. They are still needed to identify the person after a recommendation, prediction or classification. 
They are also going to be useful later for filling in other columns (for example, the titles in Name can help impute Age), thereby improving the overall dataset.\n",
"\n",
"### Cabin\n",
"\n",
"The Cabin column is very sparse, with only about 16% of the data available (148/891 ≈ 0.16). You can use len(train_data.Cabin.unique()) to determine this count. When a column is this sparse, we can ignore it for modeling in the first iteration; later, to improve the fit, it can be investigated more deeply to extract additional information.\n",
"\n",
"### Embarked\n",
"\n",
"This column records the port where each passenger embarked. It has very little sparsity: len(train_data[train_data.Embarked.notnull()]) = 889, which is nearly all of the 891 rows. Hence, it can be useful for modeling.\n",
"\n",
"### Person\n",
"\n",
"This is a column we created ourselves by splitting Age into bands of Child, Adult, Senior and Unknown. We have to determine how many Unknown people there are so that we can build better models. Since this variable depends directly on Age, fixing the sparsity of Age fixes it as well.\n",
"\n",
"
You don't need to handle NaNs here, as the Fare column is complete (count = 891).
" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "tags": [ "s1", "l1", "ans" ] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\Kshitij\\Anaconda3\\lib\\site-packages\\statsmodels\\nonparametric\\kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future\n", " y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j\n" ] } ], "source": [ "ind_var = train_data['Fare']\n", "fare_plot = sns.distplot(ind_var)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "tags": [ "s1", "hid", "l1" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "continue\n" ] } ], "source": [ "ref_tmp_var = False\n", "\n", "\n", "try:\n", " ref_assert_var = False\n", " ind_var_ = train_data['Fare']\n", " \n", " if np.all(ind_var == ind_var_):\n", " ref_assert_var = True\n", " out = fare_plot.get_figure()\n", " else:\n", " ref_assert_var = False\n", " \n", "except Exception:\n", " print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "else:\n", " if ref_assert_var:\n", " ref_tmp_var = True\n", " else:\n", " print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "\n", "\n", "assert ref_tmp_var" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "l2", "content", "s2" ] }, "source": [ "\n", "\n", "\n", "Male | \n", "\t\t\tFemale | \n", "\t\t
0 | \n", "\t\t\t1 | \n", "\t\t
1 | \n", "\t\t\t0 | \n", "\t\t
Female | \n", "\t\t
0 | \n", "\t\t
1 | \n", "\t\t
\n",
"```python\n",
"girl_child_est = train_data[train_data['Name'].str.contains('Miss. ') & (train_data['Parch'] == 1)].Age.mean()\n",
"# 3.696\n",
"\n",
"boy_child_est = train_data[train_data['Name'].str.contains('Master. ') & (train_data['Parch'] == 1)].Age.mean()\n",
"# 12.0\n",
"\n",
"woman_adult_est = train_data[train_data['Name'].str.contains('Miss. ') & (train_data['Parch'] == 0)].Age.mean()\n",
"# 27.763\n",
"\n",
"man_adult_est = train_data[train_data['Name'].str.contains('Master. ') & (train_data['Parch'] == 1)].Age.mean()\n",
"# 12.0\n",
"\n",
"woman_married_est = train_data[train_data['Name'].str.contains('Mrs. ')].Age.mean()\n",
"# 35.898\n",
"\n",
"man_married_est = train_data[train_data['Name'].str.contains('Mr. ')].Age.mean()\n",
"# 32.332089552238806\n",
"```\n",
"\n",
"We shall use these estimates in an imputation function built on the same rules. The math module is imported because math.isnan is needed to check for NaNs outside of DataFrame methods.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"def impute_age(row):\n",
"    # row[3] is Name, row[5] is Age, row[7] is Parch in train_data.\n",
"    if math.isnan(row[5]):\n",
"        if ('Miss. ' in row[3]) and (row[7] == 1):\n",
"            return girl_child_est\n",
"        elif ('Master. ' in row[3]) and (row[7] == 1):\n",
"            return boy_child_est\n",
"        elif ('Miss. ' in row[3]) and (row[7] == 0):\n",
"            return woman_adult_est\n",
"        elif 'Mrs. ' in row[3]:\n",
"            return woman_married_est\n",
"        else:\n",
"            return man_married_est\n",
"    else:\n",
"        return row[5]\n",
"```\n",
"\n",
"DataFrame has an apply method that can apply a function either to each column (axis=0) or to each row (axis=1). Since impute_age needs whole rows, pass axis=1:\n",
"\n",
"```python\n",
"train_data['Imputed_Age'] = train_data.apply(impute_age, axis=1)\n",
"test_data['Imputed_Age'] = test_data.apply(impute_age, axis=1)\n",
"```\n",
"
Use train_data.head() or train_data.columns to determine the positional index of Age in each row, then use that index with math.isnan().
" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "tags": [ "l2", "s2", "ans" ] }, "outputs": [ { "data": { "text/html": [ "\n", " | PassengerId | \n", "Survived | \n", "Pclass | \n", "Name | \n", "Sex | \n", "Age | \n", "SibSp | \n", "Parch | \n", "Ticket | \n", "Fare | \n", "Cabin | \n", "Embarked | \n", "Imputed_Age | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "1 | \n", "0 | \n", "3 | \n", "Braund, Mr. Owen Harris | \n", "male | \n", "22.0 | \n", "1 | \n", "0 | \n", "A/5 21171 | \n", "7.2500 | \n", "NaN | \n", "S | \n", "22.0 | \n", "
1 | \n", "2 | \n", "1 | \n", "1 | \n", "Cumings, Mrs. John Bradley (Florence Briggs Th... | \n", "female | \n", "38.0 | \n", "1 | \n", "0 | \n", "PC 17599 | \n", "71.2833 | \n", "C85 | \n", "C | \n", "38.0 | \n", "
2 | \n", "3 | \n", "1 | \n", "3 | \n", "Heikkinen, Miss. Laina | \n", "female | \n", "26.0 | \n", "0 | \n", "0 | \n", "STON/O2. 3101282 | \n", "7.9250 | \n", "NaN | \n", "S | \n", "26.0 | \n", "
3 | \n", "4 | \n", "1 | \n", "1 | \n", "Futrelle, Mrs. Jacques Heath (Lily May Peel) | \n", "female | \n", "35.0 | \n", "1 | \n", "0 | \n", "113803 | \n", "53.1000 | \n", "C123 | \n", "S | \n", "35.0 | \n", "
4 | \n", "5 | \n", "0 | \n", "3 | \n", "Allen, Mr. William Henry | \n", "male | \n", "35.0 | \n", "0 | \n", "0 | \n", "373450 | \n", "8.0500 | \n", "NaN | \n", "S | \n", "35.0 | \n", "
\n", "\t\t\t | Class A | \n", "\t\t\tNot Class A | \n", "\t\t
---|---|---|
\"Class A\" Prediction | \n", "\t\t\tTrue Positive | \n", "\t\t\tFalse Positive | \n", "\t\t
\"Not Class A\" Prediction | \n", "\t\t\tFalse Negative | \n", "\t\t\tTrue Negative | \n", "\t\t
Include all features and generate the ROC curve.
" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "tags": [ "l3", "s3", "ans" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Optimization terminated successfully.\n", " Current function value: 0.450489\n", " Iterations 6\n" ] }, { "data": { "text/plain": [ "[change features list
" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "tags": [ "s4", "l4", "ans" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.796857463524\n" ] } ], "source": [ "features = ['Pclass', 'Imputed_Age', 'SibSp', 'Parch', 'Fare', 'C', 'Q', 'female']\n", "\n", "log_sci_model = LogisticRegression()\n", "log_sci_model = log_sci_model.fit(train_data[features], train_data['Survived'])\n", "log_score = log_sci_model.score(train_data[features], train_data['Survived'])\n", "print(log_score)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "tags": [ "s4", "hid", "l4" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "continue\n" ] } ], "source": [ "ref_tmp_var = False\n", "\n", "\n", "try:\n", " ref_assert_var = False\n", " features_ = ['Pclass', 'Imputed_Age', 'SibSp', 'Parch', 'Fare', 'C', 'Q', 'female']\n", " \n", " if np.all(features == features_):\n", " ref_assert_var = True\n", " else:\n", " ref_assert_var = False\n", " \n", "except Exception:\n", " print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "else:\n", " if ref_assert_var:\n", " ref_tmp_var = True\n", " else:\n", " print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "\n", "\n", "assert ref_tmp_var" ] } ], "metadata": { "executed_sections": [], "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" } }, "nbformat": 4, "nbformat_minor": 2 }