{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"tags": [
"s1",
"content",
"l1"
]
},
"source": [
"# Logistic Regression\n",
"\n",
"## Logistic Regression\n",
"\n",
"### Introduction to Classification\n",
"\n",
"Classification refers to associating a target variable uniquely into one of the known classes. When number of classes are more than 2, then it is called multiclass classification. When the number of classes are two, then it is called binary classification. Logistic Regression is one of the popular type of classification techniques used. \n",
"\n",
"Consider the following type of scenario where the output y, varies with x as shown in the figure below where linear regression is used for fitting the model. \n",
"\n",
"\n",
"\n",
"As you can see above the line does not pass through the data points. If they were to pass through the data points and 'fit' the data approximately well, then a Linear Regression model would be useful. Also, there are a good number of overlaps of 0s and 1s for the same value of X. This means that for a given age, it is difficult to tell if the person aboard the Titanic survived or dead easily with the above graph. This problem can be clearly identified as a classification problem as the output is a finite number, either a 0 or a 1. This is also an example of how Exploratory Data Analysis (EDA) can be useful by plotting the output variable, 'Survived' against one of the independent variables, 'Age'.\n",
"\n",
"\n",
"We cannot use a linear regression to model a binary response, because the predicted values of $y$ will not be limited to 0 or 1. Instead of $y$, we could model probability of $y$, but even then, the predicted values cannot be limited to values between 0 and 1. Also, the relationship is not linear, but sigmoidal or S-shaped. In such cases, a function of the probability is used. The most frequently used is the logit function which is the natural log of the odds of success. The logistic regression model is expressed as shown below.\n",
"\n",
"$Ln \\big(\\frac{p(y)}{1-p(y)}\\big) = \\beta_0 + \\sum\\limits_{i=0}^{n}X_i\\beta_i \\quad$ where $p(y)$ is the probability of success.\n",
"\n",
"\n",
"The sigmoid function that is used in logistic regression is shown below:\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"
\n",
"## Exercise:\n",
"\n",
"Given the data from Kaggle related to survivors of the Titanic, identify if the problem is a classification problem or regression problem without visualization.\n",
"\n",
"- What command would you use to identify the problem? \n",
"- assign the result to the variable titanic_stats"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"tags": [
"s1",
"ce",
"l1"
]
},
"outputs": [
{
"data": {
"text/html": [
"
\n", " | PassengerId | \n", "Survived | \n", "Pclass | \n", "Name | \n", "Sex | \n", "Age | \n", "SibSp | \n", "Parch | \n", "Ticket | \n", "Fare | \n", "Cabin | \n", "Embarked | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "1 | \n", "0 | \n", "3 | \n", "Braund, Mr. Owen Harris | \n", "male | \n", "22.0 | \n", "1 | \n", "0 | \n", "A/5 21171 | \n", "7.2500 | \n", "NaN | \n", "S | \n", "
1 | \n", "2 | \n", "1 | \n", "1 | \n", "Cumings, Mrs. John Bradley (Florence Briggs Th... | \n", "female | \n", "38.0 | \n", "1 | \n", "0 | \n", "PC 17599 | \n", "71.2833 | \n", "C85 | \n", "C | \n", "
2 | \n", "3 | \n", "1 | \n", "3 | \n", "Heikkinen, Miss. Laina | \n", "female | \n", "26.0 | \n", "0 | \n", "0 | \n", "STON/O2. 3101282 | \n", "7.9250 | \n", "NaN | \n", "S | \n", "
3 | \n", "4 | \n", "1 | \n", "1 | \n", "Futrelle, Mrs. Jacques Heath (Lily May Peel) | \n", "female | \n", "35.0 | \n", "1 | \n", "0 | \n", "113803 | \n", "53.1000 | \n", "C123 | \n", "S | \n", "
4 | \n", "5 | \n", "0 | \n", "3 | \n", "Allen, Mr. William Henry | \n", "male | \n", "35.0 | \n", "0 | \n", "0 | \n", "373450 | \n", "8.0500 | \n", "NaN | \n", "S | \n", "
Think of a command that can be used to print the unique values in a column.
" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [ "s1", "l1", "ans" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[0 1]\n" ] } ], "source": [ "titanic_stats = train_data['Survived'].unique()\n", "print(titanic_stats)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "tags": [ "s1", "hid", "l1" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "continue\n" ] } ], "source": [ "ref_tmp_var = False\n", "\n", "\n", "try:\n", " ref_assert_var = False\n", " titanic_stats_ = train_data['Survived'].unique()\n", " \n", " if np.all(titanic_stats == titanic_stats_):\n", " ref_assert_var = True\n", " else:\n", " ref_assert_var = False\n", " \n", "except Exception:\n", " print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "else:\n", " if ref_assert_var:\n", " ref_tmp_var = True\n", " else:\n", " print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "\n", "\n", "assert ref_tmp_var" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "l2", "content", "s2" ] }, "source": [ "\n", "\n", "\n", "train_data[train_data['Age'].notnull() shows which rows have Age columns with values and not NaNs. Now reference the age over this dataframe and pass this into the function.
" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "tags": [ "l2", "s2", "ans" ] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\Kshitij\\Anaconda3\\lib\\site-packages\\statsmodels\\nonparametric\\kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future\n", " y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j\n" ] } ], "source": [ "sns.plt.ylabel('Probability')\n", "dist_plot = sns.distplot(train_data[train_data['Age'].notnull()].Age)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "tags": [ "l2", "hid", "s2" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "continue\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\Kshitij\\Anaconda3\\lib\\site-packages\\statsmodels\\nonparametric\\kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future\n", " y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j\n" ] } ], "source": [ "ref_tmp_var = False\n", "\n", "\n", "try:\n", " ref_assert_var = False\n", " dist_plot_ = sns.distplot(train_data[train_data['Age'].notnull()].Age)\n", " \n", " if len(dist_plot_.get_lines()[0].get_data()) == len(dist_plot.get_lines()[0].get_data()):\n", " ref_assert_var = True\n", " out = dist_plot.get_figure()\n", " else:\n", " ref_assert_var = False\n", " \n", "except Exception:\n", " print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "else:\n", " if ref_assert_var:\n", " ref_tmp_var = True\n", " else:\n", " print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "\n", "\n", "assert ref_tmp_var" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "l3", "s3", "content" ] }, "source": [ "\n", "\n", "\n", "Remove hue column as it is optional.
" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true, "tags": [ "l3", "s3", "ans" ] }, "outputs": [], "source": [ "g = sns.factorplot(x='Person', y='Survived', data=train_data, size=5, kind='bar', palette='muted')\n", " " ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "tags": [ "l3", "s3", "hid" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "continue\n" ] } ], "source": [ "ref_tmp_var = False\n", "\n", "\n", "try:\n", " ref_assert_var = False\n", " g_ = sns.factorplot(x='Person', y='Survived', data=train_data, size=5, kind='bar', palette='muted')\n", " \n", " if np.all(g.data.Person == g_.data.Person) and np.all(g.data.Survived == g_.data.Survived):\n", " ref_assert_var = True\n", " out = g\n", " else:\n", " ref_assert_var = False\n", " \n", "except Exception:\n", " print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "else:\n", " if ref_assert_var:\n", " ref_tmp_var = True\n", " else:\n", " print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "\n", "\n", "assert ref_tmp_var" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "l4", "s4", "content" ] }, "source": [ "\n", "\n", "\n", "replace Pclass with Person.
\n", "\n", "train_data['Person'] = train_data['Age'].apply(person_type)
\n",
"