{ "cells": [ { "cell_type": "markdown", "metadata": { "tags": [ "s1", "content", "l1" ] }, "source": [ "# Logistic Regression\n", "\n", "## Logistic Regression\n", "\n", "### Introduction to Classification\n", "\n", "Classification refers to associating a target variable uniquely into one of the known classes. When number of classes are more than 2, then it is called multiclass classification. When the number of classes are two, then it is called binary classification. Logistic Regression is one of the popular type of classification techniques used. \n", "\n", "Consider the following type of scenario where the output y, varies with x as shown in the figure below where linear regression is used for fitting the model. \n", "\n", "\n", "\n", "As you can see above the line does not pass through the data points. If they were to pass through the data points and 'fit' the data approximately well, then a Linear Regression model would be useful. Also, there are a good number of overlaps of 0s and 1s for the same value of X. This means that for a given age, it is difficult to tell if the person aboard the Titanic survived or dead easily with the above graph. This problem can be clearly identified as a classification problem as the output is a finite number, either a 0 or a 1. This is also an example of how Exploratory Data Analysis (EDA) can be useful by plotting the output variable, 'Survived' against one of the independent variables, 'Age'.\n", "\n", "\n", "We cannot use a linear regression to model a binary response, because the predicted values of $y$ will not be limited to 0 or 1. Instead of $y$, we could model probability of $y$, but even then, the predicted values cannot be limited to values between 0 and 1. Also, the relationship is not linear, but sigmoidal or S-shaped. In such cases, a function of the probability is used. The most frequently used is the logit function which is the natural log of the odds of success. The logistic regression model is expressed as shown below.\n", "\n", "$Ln \\big(\\frac{p(y)}{1-p(y)}\\big) = \\beta_0 + \\sum\\limits_{i=0}^{n}X_i\\beta_i \\quad$ where $p(y)$ is the probability of success.\n", "\n", "\n", "The sigmoid function that is used in logistic regression is shown below:\n", "\n", "\n", "\n", "\n", "\n", "
\n", "## Exercise:\n", "\n", "Given the data from Kaggle related to survivors of the Titanic, identify if the problem is a classification problem or regression problem without visualization.\n", "\n", "- What command would you use to identify the problem? \n", "- assign the result to the variable titanic_stats" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [ "s1", "ce", "l1" ] }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PassengerIdSurvivedPclassNameSexAgeSibSpParchTicketFareCabinEmbarked
0103Braund, Mr. Owen Harrismale22.010A/5 211717.2500NaNS
1211Cumings, Mrs. John Bradley (Florence Briggs Th...female38.010PC 1759971.2833C85C
2313Heikkinen, Miss. Lainafemale26.000STON/O2. 31012827.9250NaNS
3411Futrelle, Mrs. Jacques Heath (Lily May Peel)female35.01011380353.1000C123S
4503Allen, Mr. William Henrymale35.0003734508.0500NaNS
\n", "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22.0 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 \n", "2 Heikkinen, Miss. Laina female 26.0 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 \n", "4 Allen, Mr. William Henry male 35.0 0 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "import numpy as np\n", "import seaborn as sns\n", "\n", "train_data = pd.read_csv(\"https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/train.csv\")\n", "test_data = pd.read_csv(\"https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/test.csv\")\n", "train_data.head()" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "s1", "l1", "hint" ] }, "source": [ "

Think of a command that can be used to print the unique values in a column. 

" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [ "s1", "l1", "ans" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0 1]\n" ] } ], "source": [ "titanic_stats = train_data['Survived'].unique()\n", "print(titanic_stats)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [ "s1", "hid", "l1" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "continue\n" ] } ], "source": [ "ref_tmp_var = False\n", "\n", "\n", "try:\n", " ref_assert_var = False\n", " titanic_stats_ = train_data['Survived'].unique()\n", " \n", " if np.all(titanic_stats == titanic_stats_):\n", " ref_assert_var = True\n", " else:\n", " ref_assert_var = False\n", " \n", "except Exception:\n", " print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "else:\n", " if ref_assert_var:\n", " ref_tmp_var = True\n", " else:\n", " print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "\n", "\n", "assert ref_tmp_var" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "l2", "content", "s2" ] }, "source": [ "\n", "\n", "\n", "


\n", "## Titanic Survivors\n", "\n", "Kaggle posted a famous dataset from Titanic. It was about the survivors amongst the people aboard. Here is the description from Kaggle:\n", "\n", "### Competition Description\n", "\n", "The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.\n", "\n", "One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.\n", "\n", "In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.\n", "\n", "source: https://www.kaggle.com/c/titanic\n", "\n", "We shall use this dataset for analysis and study of logistic regression. \n", "\n", "### Loading the Data\n", "\n", "The data has been split into train_data, test_data and can be loaded from kaggle with read_csv command:\n", "```python\n", "import pandas as pd\n", "train_data = pd.read_csv(\"https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/train.csv\")\n", "test_data = pd.read_csv(\"https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/test.csv\")\n", "```\n", "\n", "### Analysis of the Dataset\n", "\n", "Let us perform EDA on the dataset to understand it better, prior to fitting the logistic regression model. Let us use distplot in seaborn to generate the distribution plot of those who survived and those who didn't:\n", "```python\n", "sns.distplot(train_data['Survived'])\n", "```\n", "\n", "\n", "From the above distribution, it can be observed that those who survived are approximately 2/3rd in the training set.We can observe that there were children and mostly adults followed by lesser aged people with max of 80 years approximately. This can be observed by two peaks in the fitted distribution.\n", "\n", "
\n", "## Exercise:\n", "\n", "#### Visualizing the Age of People Aboard\n", "\n", "Let us look at the distribution of age of people aboard. This is the histogram plot which shows the number of people (y-axis) in the age band.\n", "\n", " - Remove rows that have unknown Age using .notnull() over the Age column to detect the non-null entries.\n", " - Using the non-null entries plot the distribution using the sns.distplot() and assign the plot to variable, dist_plot." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true, "tags": [ "l2", "ce", "s2" ] }, "outputs": [], "source": [ "# Modify the plot below and assign to the variable g\n" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "l2", "s2", "hint" ] }, "source": [ "

train_data[train_data['Age'].notnull() shows which rows have Age columns with values and not NaNs. Now reference the age over this dataframe and pass this into the function.

" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [ "l2", "s2", "ans" ] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\Kshitij\\Anaconda3\\lib\\site-packages\\statsmodels\\nonparametric\\kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future\n", " y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j\n" ] } ], "source": [ "sns.plt.ylabel('Probability')\n", "dist_plot = sns.distplot(train_data[train_data['Age'].notnull()].Age)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "tags": [ "l2", "hid", "s2" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "continue\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\Kshitij\\Anaconda3\\lib\\site-packages\\statsmodels\\nonparametric\\kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future\n", " y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j\n" ] } ], "source": [ "ref_tmp_var = False\n", "\n", "\n", "try:\n", " ref_assert_var = False\n", " dist_plot_ = sns.distplot(train_data[train_data['Age'].notnull()].Age)\n", " \n", " if len(dist_plot_.get_lines()[0].get_data()) == len(dist_plot.get_lines()[0].get_data()):\n", " ref_assert_var = True\n", " out = dist_plot.get_figure()\n", " else:\n", " ref_assert_var = False\n", " \n", "except Exception:\n", " print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "else:\n", " if ref_assert_var:\n", " ref_tmp_var = True\n", " else:\n", " print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "\n", "\n", "assert ref_tmp_var" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "l3", "s3", "content" ] }, "source": [ "\n", "\n", "\n", "


\n", "## Exploratory Data Analysis (EDA)\n", "\n", "### Understanding Age\n", "\n", "We do not have any prior information about how the age for child was defined in those years. The above information is helpful as we can categorize the people as child, adults and seniors by looking the peaks.\n", "\n", "Let us define functions to categorize the age into these three categories with age of 16 and 48 as the defining separators from the graph.\n", "\n", "```python\n", "def person_type(x):\n", " if x <=16:\n", " return 'C'\n", " elif x <= 48:\n", " return 'A'\n", " elif x <= 90:\n", " return 'S'\n", " else:\n", " return 'U'\n", "```\n", "The above function categorizes the continous variable 'Age' into a categorized variable. We can use apply function to transform each entry in the Age column and assign it to Person.\n", "```python\n", "train_data['Person'] = train_data['Age'].apply(person_type)\n", "test_data['Person'] = test_data['Age'].apply(person_type)\n", "```\n", "We can now look at who is likely to survive depending on the type of Person with a factor plot. A factor plot can consider another factor such as Sex along with the Age. \n", "```python\n", "g = sns.factorplot(x=\"Person\", y=\"Survived\", hue=\"Sex\", data=train_data,\n", " size=5, kind=\"bar\", palette=\"muted\")\n", "```\n", "\n", "\n", "From the above plot you can see that senior women were most likely to survive with highest probability than anyone else. Now we can ignore the Sex type and visualize who was likely to survive.\n", "\n", "
\n", "## Exercise:\n", "\n", "- Using the information from the above graph, visualize the number of people who survived vs Person ignoring the Sex attribute. This will provide us information on who mostly survived in the dataset." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": true, "tags": [ "l3", "s3", "ce" ] }, "outputs": [], "source": [ "def person_type(x):\n", " if x <=16:\n", " return 'C'\n", " elif x <= 48:\n", " return 'A'\n", " elif x <= 90:\n", " return 'S'\n", " else:\n", " return 'U'\n", "\n", "train_data['Person'] = train_data['Age'].apply(person_type)\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "l3", "s3", "hint" ] }, "source": [ "

Remove hue column as it is optional.

" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true, "tags": [ "l3", "s3", "ans" ] }, "outputs": [], "source": [ "g = sns.factorplot(x='Person', y='Survived', data=train_data, size=5, kind='bar', palette='muted')\n", " " ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "tags": [ "l3", "s3", "hid" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "continue\n" ] } ], "source": [ "ref_tmp_var = False\n", "\n", "\n", "try:\n", " ref_assert_var = False\n", " g_ = sns.factorplot(x='Person', y='Survived', data=train_data, size=5, kind='bar', palette='muted')\n", " \n", " if np.all(g.data.Person == g_.data.Person) and np.all(g.data.Survived == g_.data.Survived):\n", " ref_assert_var = True\n", " out = g\n", " else:\n", " ref_assert_var = False\n", " \n", "except Exception:\n", " print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "else:\n", " if ref_assert_var:\n", " ref_tmp_var = True\n", " else:\n", " print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "\n", "\n", "assert ref_tmp_var" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "l4", "s4", "content" ] }, "source": [ "\n", "\n", "\n", "


\n", "## Chances of Survival\n", "\n", "Let us now look at the class from which people had better chances of survival:\n", "```python\n", "g = sns.factorplot(x=\"Pclass\", y=\"Survived\", data=train_data, size=5, kind=\"bar\", palette=\"muted\")\n", "```\n", "\n", "\n", "We can see that 1st class passengers had much better chances of survival. We can also visualize the distribution of Male and Female using a violin plot. This is very useful to measure the mean and variance of passengers as to their likelihood of survival in each class grouped by Male and Female. The distribution graphs are shown on either side of the line for each class.\n", "```python\n", "sns.set(style=\"whitegrid\", palette=\"pastel\", color_codes=True)\n", "sns.violinplot(x=\"Pclass\", y=\"Survived\", hue=\"Sex\", data=train_data, split=True,\n", "... inner=\"quart\", palette={\"male\": \"b\", \"female\": \"y\"})\n", "```\n", "\n", "\n", "From the above plot you can see that in 1st class, females were more likely to survive and in the 3rd class, most men were unlikely to survive. \n", "\n", "
\n", "## Exercise:\n", "\n", "- Generate a violin plot of type of person vs Survived for both men and women\n", "- What do you interpret from the visualization?" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true, "tags": [ "l4", "s4", "ce" ] }, "outputs": [], "source": [ "# Modify the code below\n", "g = sns.violinplot(x='Pclass', y='Survived', hue='Sex', data=train_data, split=True, inner='quart', palette={'male': 'b', 'female': 'y'})\n" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "s4", "l4", "hint" ] }, "source": [ "

replace Pclass with Person.

\n", "\n", "

train_data['Person'] = train_data['Age'].apply(person_type)
\n", " 

" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true, "tags": [ "s4", "l4", "ans" ] }, "outputs": [], "source": [ "train_data['Person'] = train_data['Age'].apply(person_type)\n", "g = sns.violinplot(x='Person', y='Survived', hue='Sex', data=train_data, split=True, inner='quart', palette={'male': 'b', 'female': 'y'})\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "tags": [ "s4", "hid", "l4" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "continue\n" ] } ], "source": [ "ref_tmp_var = False\n", "\n", "\n", "try:\n", " ref_assert_var = False\n", " g_ = sns.violinplot(x='Person', y='Survived', hue='Sex', data=train_data, split=True, \n", " inner='quart', palette={'male': 'b', 'female': 'y'})\n", " \n", " if len(g.__dict__) == len(g_.__dict__):\n", " ref_assert_var = True\n", " out = g.get_figure()\n", " else:\n", " ref_assert_var = False\n", " \n", "except Exception:\n", " print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "else:\n", " if ref_assert_var:\n", " ref_tmp_var = True\n", " else:\n", " print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "\n", "\n", "assert ref_tmp_var" ] } ], "metadata": { "executed_sections": [], "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" }, "rf_version": 1 }, "nbformat": 4, "nbformat_minor": 2 }