{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"tags": [
"s1",
"content",
"l1"
]
},
"source": [
"# Logistic Regression: Model Building and Implementation\n",
"\n",
"\n",
"## Titanic Survivors - Data Selection & Preparation\n",
"\n",
"Before fitting a logistic regression model to classify who is likely to survive, we need to examine the dataset, drawing on the EDA as well as other statistical methods. Logistic regression is a supervised learning technique: it learns from labeled examples (here, the Survived column).\n",
"\n",
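"As a reminder of what the model will eventually compute, logistic regression passes a linear combination of the features through the sigmoid function to produce a probability. A minimal sketch (the intercept, weight and feature value below are made up purely for illustration):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def sigmoid(z):\n",
"    # Squash any real number into the (0, 1) interval.\n",
"    return 1.0 / (1.0 + np.exp(-z))\n",
"\n",
"# Hypothetical intercept and weight for a single feature.\n",
"b0, b1 = -1.0, 2.5\n",
"x = 0.8  # an example feature value\n",
"p_survived = sigmoid(b0 + b1 * x)  # probability of the positive class\n",
"```\n",
"\n",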
"The training and test datasets cannot be used directly, due to several issues including, but not limited to:\n",
"\n",
"- Sparse columns, such as Cabin, where most entries are missing.\n",
"- NaN entries in otherwise usable columns.\n",
"- Categorical variables stored as strings.\n",
"- Selecting the right columns for modeling.\n",
"\n",
"\n",
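"Each of these issues can be checked programmatically before modeling. A minimal sketch on a toy DataFrame (the column names mirror the Titanic data; the values are made up):\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Toy stand-in for train_data, exhibiting the same kinds of problems.\n",
"df = pd.DataFrame({\n",
"    'Age': [22.0, np.nan, 26.0, np.nan],        # NaN entries\n",
"    'Cabin': [np.nan, 'C85', np.nan, np.nan],   # sparse column\n",
"    'Sex': ['male', 'female', 'female', 'male'],  # categorical strings\n",
"})\n",
"\n",
"print(df.isnull().mean())  # fraction of NaNs per column (sparsity)\n",
"print(df.dtypes)           # 'object' dtype flags string categoricals\n",
"```\n",
"\n",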
"### Sparsity\n",
"\n",
"Let us examine sparse columns by computing the ratio of NaNs to the total number of rows. The DataFrame describe() method reports the count of non-NaN values, along with the mean, standard deviation and quartiles, but only for float/integer columns.\n",
"\n",
"```python\n",
"train_data.describe()\n",
"```\n",
"```\n",
"       PassengerId    Survived      Pclass         Age       SibSp  \\\n",
"count   891.000000  891.000000  891.000000  714.000000  891.000000   \n",
"mean    446.000000    0.383838    2.308642   29.699118    0.523008   \n",
"std     257.353842    0.486592    0.836071   14.526497    1.102743   \n",
"min       1.000000    0.000000    1.000000    0.420000    0.000000   \n",
"25%     223.500000    0.000000    2.000000   20.125000    0.000000   \n",
"50%     446.000000    0.000000    3.000000   28.000000    0.000000   \n",
"75%     668.500000    1.000000    3.000000   38.000000    1.000000   \n",
"max     891.000000    1.000000    3.000000   80.000000    8.000000   \n",
"\n",
"            Parch        Fare  \n",
"count  891.000000  891.000000  \n",
"mean     0.381594   32.204208  \n",
"std      0.806057   49.693429  \n",
"min      0.000000    0.000000  \n",
"25%      0.000000    7.910400  \n",
"50%      0.000000   14.454200  \n",
"75%      0.000000   31.000000  \n",
"max      6.000000  512.329200\n",
"```\n",
"\n",
"The count row tells us how many non-NaN values each column has; the remaining entries are NaNs.\n",
"\n",
"### Age\n",
"\n",
"The Age column has 714 entries, so 891 - 714 = 177 values are missing. That is 177/891 ≈ 0.20, or roughly 20% missing. If this percentage were small, we could simply drop those rows before fitting the logistic regression model. There are various methods to fill in the missing values, but before discussing how to fix the sparsity of the Age column, let us examine the other columns.\n",
"\n",
"### PassengerId, Name, Ticket\n",
"\n",
"PassengerId, Name and Ticket are unique to each person and hence will not serve as modeling columns. Logistic regression, like any supervised or unsupervised learning method, needs to find patterns in the dataset; this is a necessary condition for an algorithm to capture the data mathematically. Unique identifiers such as IDs and names carry no such patterns, so they are usually poor candidates for modeling. They are still needed to identify the person after a recommendation, prediction or classification. 
They are also going to be useful later for filling in other columns (for example, the titles in Name can help impute Age), thereby improving the overall dataset.\n",
"\n",
"### Cabin\n",
"\n",
"The Cabin column is very sparse, with only about 16% of the data available (148/891 ≈ 0.16). You can use len(train_data.Cabin.unique()) to determine this count. When a column is this sparse, we can ignore it for modeling in the first iteration; later, to improve the fit, it can be investigated more deeply to extract additional information.\n",
"\n",
"### Embarked\n",
"\n",
"This column records the port where each passenger embarked. It has very little sparsity: len(train_data[train_data.Embarked.notnull()]) = 889, which is nearly all of the 891 rows. Hence, it can be useful for modeling.\n",
"\n",
"### Person\n",
"\n",
"This is a column we created ourselves by splitting Age into bands of Child, Adult, Senior and Unknown. We have to determine how many Unknown people there are so that we can build better models. Since this variable depends directly on Age, fixing the sparsity of Age fixes it as well.\n",
"\n",
"
You don't need to handle NaNs here, as the Fare column is complete (count = 891).
" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "tags": [ "s1", "l1", "ans" ] }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "C:\\Users\\Kshitij\\Anaconda3\\lib\\site-packages\\statsmodels\\nonparametric\\kdetools.py:20: VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future\n", " y = X[:m/2+1] + np.r_[0,X[m/2+1:],0]*1j\n" ] } ], "source": [ "ind_var = train_data['Fare']\n", "fare_plot = sns.distplot(ind_var)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "tags": [ "s1", "hid", "l1" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "continue\n" ] } ], "source": [ "ref_tmp_var = False\n", "\n", "\n", "try:\n", " ref_assert_var = False\n", " ind_var_ = train_data['Fare']\n", " \n", " if np.all(ind_var == ind_var_):\n", " ref_assert_var = True\n", " out = fare_plot.get_figure()\n", " else:\n", " ref_assert_var = False\n", " \n", "except Exception:\n", " print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "else:\n", " if ref_assert_var:\n", " ref_tmp_var = True\n", " else:\n", " print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "\n", "\n", "assert ref_tmp_var" ] }, { "cell_type": "markdown", "metadata": { "tags": [ "l2", "content", "s2" ] }, "source": [ "\n", "\n", "\n", "Male | \n", "\t\t\tFemale | \n", "\t\t
0 | \n", "\t\t\t1 | \n", "\t\t
1 | \n", "\t\t\t0 | \n", "\t\t
Female | \n", "\t\t
0 | \n", "\t\t
1 | \n", "\t\t
\n",
"```python\n",
"girl_child_est = train_data[train_data['Name'].str.contains('Miss. ') & (train_data['Parch'] == 1)].Age.mean()\n",
"# 3.696\n",
"\n",
"boy_child_est = train_data[train_data['Name'].str.contains('Master. ') & (train_data['Parch'] == 1)].Age.mean()\n",
"# 12.0\n",
"\n",
"woman_adult_est = train_data[train_data['Name'].str.contains('Miss. ') & (train_data['Parch'] == 0)].Age.mean()\n",
"# 27.763\n",
"\n",
"man_adult_est = train_data[train_data['Name'].str.contains('Master. ') & (train_data['Parch'] == 1)].Age.mean()\n",
"# 12.0\n",
"\n",
"woman_married_est = train_data[train_data['Name'].str.contains('Mrs. ')].Age.mean()\n",
"# 35.898\n",
"\n",
"man_married_est = train_data[train_data['Name'].str.contains('Mr. ')].Age.mean()\n",
"# 32.332089552238806\n",
"```\n",
"\n",
"We shall use these estimates in an imputation function built on the same rules. The math module is imported because math.isnan is needed to check for NaNs outside of DataFrame methods.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"def impute_age(row):\n",
"    # row[3] is Name, row[5] is Age, row[7] is Parch in train_data.\n",
"    if math.isnan(row[5]):\n",
"        if ('Miss. ' in row[3]) and (row[7] == 1):\n",
"            return girl_child_est\n",
"        elif ('Master. ' in row[3]) and (row[7] == 1):\n",
"            return boy_child_est\n",
"        elif ('Miss. ' in row[3]) and (row[7] == 0):\n",
"            return woman_adult_est\n",
"        elif 'Mrs. ' in row[3]:\n",
"            return woman_married_est\n",
"        else:\n",
"            return man_married_est\n",
"    else:\n",
"        return row[5]\n",
"```\n",
"\n",
"DataFrame has an apply method that can apply a function either to each column (axis=0) or to each row (axis=1). Since impute_age needs whole rows, pass axis=1:\n",
"\n",
"```python\n",
"train_data['Imputed_Age'] = train_data.apply(impute_age, axis=1)\n",
"test_data['Imputed_Age'] = test_data.apply(impute_age, axis=1)\n",
"```\n",
"
Use train_data.head() or train_data.columns to determine the positional index of Age in each row, then use that index with math.isnan().
" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "tags": [ "l2", "s2", "ans" ] }, "outputs": [ { "data": { "text/html": [ "\n", " | PassengerId | \n", "Survived | \n", "Pclass | \n", "Name | \n", "Sex | \n", "Age | \n", "SibSp | \n", "Parch | \n", "Ticket | \n", "Fare | \n", "Cabin | \n", "Embarked | \n", "Imputed_Age | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "1 | \n", "0 | \n", "3 | \n", "Braund, Mr. Owen Harris | \n", "male | \n", "22.0 | \n", "1 | \n", "0 | \n", "A/5 21171 | \n", "7.2500 | \n", "NaN | \n", "S | \n", "22.0 | \n", "
1 | \n", "2 | \n", "1 | \n", "1 | \n", "Cumings, Mrs. John Bradley (Florence Briggs Th... | \n", "female | \n", "38.0 | \n", "1 | \n", "0 | \n", "PC 17599 | \n", "71.2833 | \n", "C85 | \n", "C | \n", "38.0 | \n", "
2 | \n", "3 | \n", "1 | \n", "3 | \n", "Heikkinen, Miss. Laina | \n", "female | \n", "26.0 | \n", "0 | \n", "0 | \n", "STON/O2. 3101282 | \n", "7.9250 | \n", "NaN | \n", "S | \n", "26.0 | \n", "
3 | \n", "4 | \n", "1 | \n", "1 | \n", "Futrelle, Mrs. Jacques Heath (Lily May Peel) | \n", "female | \n", "35.0 | \n", "1 | \n", "0 | \n", "113803 | \n", "53.1000 | \n", "C123 | \n", "S | \n", "35.0 | \n", "
4 | \n", "5 | \n", "0 | \n", "3 | \n", "Allen, Mr. William Henry | \n", "male | \n", "35.0 | \n", "0 | \n", "0 | \n", "373450 | \n", "8.0500 | \n", "NaN | \n", "S | \n", "35.0 | \n", "
\n", "\t\t\t | Class A | \n", "\t\t\tNot Class A | \n", "\t\t
---|---|---|
\"Class A\" Prediction | \n", "\t\t\tTrue Positive | \n", "\t\t\tFalse Positive | \n", "\t\t
\"Not Class A\" Prediction | \n", "\t\t\tFalse Negative | \n", "\t\t\tTrue Negative | \n", "\t\t
Include all features and generate the ROC curve.
" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "tags": [ "l3", "s3", "ans" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Optimization terminated successfully.\n", " Current function value: 0.450489\n", " Iterations 6\n" ] }, { "data": { "text/plain": [ "[change features list
" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "tags": [ "s4", "l4", "ans" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.796857463524\n" ] } ], "source": [ "features = ['Pclass', 'Imputed_Age', 'SibSp', 'Parch', 'Fare', 'C', 'Q', 'female']\n", "\n", "log_sci_model = LogisticRegression()\n", "log_sci_model = log_sci_model.fit(train_data[features], train_data['Survived'])\n", "log_score = log_sci_model.score(train_data[features], train_data['Survived'])\n", "print(log_score)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "tags": [ "s4", "hid", "l4" ] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "continue\n" ] } ], "source": [ "ref_tmp_var = False\n", "\n", "\n", "try:\n", " ref_assert_var = False\n", " features_ = ['Pclass', 'Imputed_Age', 'SibSp', 'Parch', 'Fare', 'C', 'Q', 'female']\n", " \n", " if np.all(features == features_):\n", " ref_assert_var = True\n", " else:\n", " ref_assert_var = False\n", " \n", "except Exception:\n", " print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "else:\n", " if ref_assert_var:\n", " ref_tmp_var = True\n", " else:\n", " print('Please follow the instructions given and use the same variables provided in the instructions.')\n", "\n", "\n", "assert ref_tmp_var" ] } ], "metadata": { "executed_sections": [], "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" } }, "nbformat": 4, "nbformat_minor": 2 }