{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Pandas and Scikit-learn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas is a Python library that contains high-level data structures and manipulation tools designed for data analysis. Think of Pandas as a Python version of Excel. Scikit-learn, on the other hand, is an open-source machine learning library for Python." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While Scikit-learn does a lot of the heavy lifting, what's equally important is ensuring that raw data is processed in such a way that we are able to 'feed' it to Scikit-learn. Hence, the ability to manipulate raw data with Pandas makes it an indispensable part of our toolkit." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Kaggle" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Kaggle is the leading platform for data science competitions. Participants compete for cash prizes by submitting the best predictive model to problems posted on the competition website.\n", "\n", "https://www.kaggle.com/competitions\n", "\n", "We will be reviewing the data from the Kaggle Titanic competition. Our aim is to predict whether or not specific passengers on the Titanic survived, based on characteristics such as age, sex and class." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Section 1-0 - First Cut" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We will start by splitting the data into a training set and a test set. Next, we process the training data and use it to 'train' (or 'fit') our model. We then apply the trained model to the test data to make predictions. Finally, we compare our predictions against the 'ground truth' to see how well our model performed." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is very common to encounter missing values in a data set. 
In this section, we will take the simplest (or perhaps simplistic) approach of ignoring the whole row if any part of it contains a NaN value. We will build on this approach in later sections." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pandas - Extracting data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, we load the training data from a .csv file. This is similar to the data found on the Kaggle website: \n", "\n", "https://www.kaggle.com/c/titanic/data" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "\n", "df = pd.read_csv('data.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We review the size of the data." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "(891, 12)" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now split the data into an 80% training set (the first 712 rows) and a 20% test set (the remaining 179 rows)." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df_train = df.iloc[:712, :]\n", "df_test = df.iloc[712:, :]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pandas - Cleaning data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We review a selection of the data." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "0 1 0 3 \n", "1 2 1 1 \n", "2 3 1 3 \n", "3 4 1 1 \n", "4 5 0 3 \n", "5 6 0 3 \n", "6 7 0 1 \n", "7 8 0 3 \n", "8 9 1 3 \n", "9 10 1 2 \n", "\n", " Name Sex Age SibSp \\\n", "0 Braund, Mr. Owen Harris male 22 1 \n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38 1 \n", "2 Heikkinen, Miss. Laina female 26 0 \n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 \n", "4 Allen, Mr. William Henry male 35 0 \n", "5 Moran, Mr. James male NaN 0 \n", "6 McCarthy, Mr. Timothy J male 54 0 \n", "7 Palsson, Master. Gosta Leonard male 2 3 \n", "8 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27 0 \n", "9 Nasser, Mrs. Nicholas (Adele Achem) female 14 1 \n", "\n", " Parch Ticket Fare Cabin Embarked \n", "0 0 A/5 21171 7.2500 NaN S \n", "1 0 PC 17599 71.2833 C85 C \n", "2 0 STON/O2. 3101282 7.9250 NaN S \n", "3 0 113803 53.1000 C123 S \n", "4 0 373450 8.0500 NaN S \n", "5 0 330877 8.4583 NaN Q \n", "6 0 17463 51.8625 E46 S \n", "7 1 349909 21.0750 NaN S \n", "8 2 347742 11.1333 NaN S \n", "9 0 237736 30.0708 NaN C " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_train.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We notice that the columns describe features of the Titanic passengers, such as age, sex, and class. Of particular interest is the column Survived, which describes whether or not the passenger survived. When training our model, what we are essentially doing is assessing how each feature impacts whether or not the passenger survived (or if the feature makes an impact at all)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Exercise**:\n", "- Write the code to review the tail-end section of the data. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We observe that the columns Name, Ticket and Cabin are, for our current purposes, irrelevant. 
We proceed to remove them from our data set." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df_train = df_train.drop(['Name', 'Ticket', 'Cabin'], axis=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we review the type of data in the columns, and their respective counts." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "Int64Index: 712 entries, 0 to 711\n", "Data columns (total 9 columns):\n", "PassengerId    712 non-null int64\n", "Survived       712 non-null int64\n", "Pclass         712 non-null int64\n", "Sex            712 non-null object\n", "Age            565 non-null float64\n", "SibSp          712 non-null int64\n", "Parch          712 non-null int64\n", "Fare           712 non-null float64\n", "Embarked       711 non-null object\n", "dtypes: float64(2), int64(5), object(2)\n", "memory usage: 55.6+ KB\n" ] } ], "source": [ "df_train.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We notice that the columns Age and Embarked have NaNs or missing values. As previously discussed, we take the approach of simply removing the rows with missing values." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df_train = df_train.dropna()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question**\n", "\n", "- If you were to fill in the missing values, what values would you fill them with? Why?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scikit-learn only takes numerical arrays as inputs. As such, we need to convert the categorical columns Sex and Embarked into numerical ones. We first review the range of values for the column Sex, and map the string values to numbers."
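As a quick aside, here is a toy sketch (the values are hypothetical, not from the Titanic data) of how `Series.map` behaves: it replaces each value via a dictionary lookup, and any value absent from the dictionary silently becomes NaN — which is why reviewing `.unique()` first is a good habit.

```python
import pandas as pd

# Hypothetical toy Series, for illustration only.
s = pd.Series(['male', 'female', 'female', 'unknown'])

# map() looks each value up in the dict; 'unknown' has no entry, so it becomes NaN.
encoded = s.map({'female': 0, 'male': 1})

print(encoded.tolist())             # [1.0, 0.0, 0.0, nan]
print(int(encoded.isnull().sum()))  # 1
```

Checking `encoded.isnull().sum()` after a map is a cheap guard against categories the mapping dictionary missed.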
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array(['male', 'female'], dtype=object)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_train['Sex'].unique()" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df_train['Sex'] = df_train['Sex'].map({'female':0, 'male':1})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Similarly for Embarked, we review the range of values and map the string values to a numerical value that represents where the passenger embarked from." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array(['S', 'C', 'Q'], dtype=object)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_train['Embarked'].unique()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df_train['Embarked'] = df_train['Embarked'].map({'C':1, 'S':2, 'Q':3})" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Question**\n", "- What problems might we encounter by mapping C, S, and Q in the column Embarked to the values 1, 2, and 3? In other words, what does the ordering imply? Does the same problem exist for the column Sex?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In our final review of our training data, we check that (1) there are no NaN values, and (2) all the values are in numerical form." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "    PassengerId  Survived  Pclass  Sex  Age  SibSp  Parch     Fare  Embarked\n", "0             1         0       3    1   22      1      0   7.2500         2\n", "1             2         1       1    0   38      1      0  71.2833         1\n", "2             3         1       3    0   26      0      0   7.9250         2\n", "3             4         1       1    0   35      1      0  53.1000         2\n", "4             5         0       3    1   35      0      0   8.0500         2\n", "6             7         0       1    1   54      0      0  51.8625         2\n", "7             8         0       3    1    2      3      1  21.0750         2\n", "8             9         1       3    0   27      0      2  11.1333         2\n", "9            10         1       2    0   14      1      0  30.0708         1\n", "10           11         1       3    0    4      1      1  16.7000         2" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_train.head(10)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "Int64Index: 564 entries, 0 to 710\n", "Data columns (total 9 columns):\n", "PassengerId    564 non-null int64\n", "Survived       564 non-null int64\n", "Pclass         564 non-null int64\n", "Sex            564 non-null int64\n", "Age            564 non-null float64\n", "SibSp          564 non-null int64\n", "Parch          564 non-null int64\n", "Fare           564 non-null float64\n", "Embarked       564 non-null int64\n", "dtypes: float64(2), int64(7)\n", "memory usage: 44.1 KB\n" ] } ], "source": [ "df_train.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, we convert the feature columns of the processed training data from a Pandas DataFrame into a numerical (NumPy) array, and extract the outcome column Survived as a separate variable." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X_train = df_train.iloc[:, 2:].values\n", "y_train = df_train['Survived']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scikit-learn - Training the model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section, we'll simply use the model as a black box. We'll review more sophisticated techniques in later sections." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In particular, we'll be using the Random Forest model. 
The intuition is as follows: a decision tree repeatedly splits the data on the feature that best separates the outcomes; each split forms a 'branch', and the full set of branches makes up a 'tree'. The Random Forest model, broadly speaking, grows a 'forest' of such trees, each trained on a random sample of the rows and features, and aggregates their predictions by majority vote.\n", "\n", "http://en.wikipedia.org/wiki/Random_forest" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestClassifier\n", "\n", "model = RandomForestClassifier(n_estimators=100, random_state=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use the processed training data to 'train' (or 'fit') our model." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [], "source": [ "model = model.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Scikit-learn - Making predictions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now review a selection of the test data." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " PassengerId Survived Pclass \\\n", "712 713 1 1 \n", "713 714 0 3 \n", "714 715 0 2 \n", "715 716 0 3 \n", "716 717 1 1 \n", "717 718 1 2 \n", "718 719 0 3 \n", "719 720 0 3 \n", "720 721 1 2 \n", "721 722 0 3 \n", "\n", " Name Sex Age SibSp Parch \\\n", "712 Taylor, Mr. Elmer Zebley male 48 1 0 \n", "713 Larsson, Mr. August Viktor male 29 0 0 \n", "714 Greenberg, Mr. Samuel male 52 0 0 \n", "715 Soholt, Mr. Peter Andreas Lauritz Andersen male 19 0 0 \n", "716 Endres, Miss. Caroline Louise female 38 0 0 \n", "717 Troutt, Miss. Edwina Celia \"Winnie\" female 27 0 0 \n", "718 McEvoy, Mr. Michael male NaN 0 0 \n", "719 Johnson, Mr. Malkolm Joackim male 33 0 0 \n", "720 Harper, Miss. Annie Jessie \"Nina\" female 6 0 1 \n", "721 Jensen, Mr. Svend Lauritz male 17 1 0 \n", "\n", " Ticket Fare Cabin Embarked \n", "712 19996 52.0000 C126 S \n", "713 7545 9.4833 NaN S \n", "714 250647 13.0000 NaN S \n", "715 348124 7.6500 F G73 S \n", "716 PC 17757 227.5250 C45 C \n", "717 34218 10.5000 E101 S \n", "718 36568 15.5000 NaN Q \n", "719 347062 7.7750 NaN S \n", "720 248727 33.0000 NaN S \n", "721 350048 7.0542 NaN S " ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_test.head(10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As before, we process the test data in a similar fashion to what we did to the training data." 
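The cell below repeats the training-set cleaning steps by hand. As a sketch (the `preprocess` name and the toy frame are hypothetical, not part of this notebook), the same steps could be bundled into one function so that the training and test sets are guaranteed identical treatment:

```python
import pandas as pd

def preprocess(df):
    """Apply the same cleaning steps used on the training data."""
    df = df.drop(['Name', 'Ticket', 'Cabin'], axis=1)  # drop irrelevant columns
    df = df.dropna().copy()                            # drop rows with missing values
    df['Sex'] = df['Sex'].map({'female': 0, 'male': 1})
    df['Embarked'] = df['Embarked'].map({'C': 1, 'S': 2, 'Q': 3})
    return df

# Tiny synthetic frame with the same columns, for illustration only.
toy = pd.DataFrame({
    'PassengerId': [1, 2], 'Survived': [0, 1], 'Pclass': [3, 1],
    'Name': ['A', 'B'], 'Sex': ['male', 'female'], 'Age': [22.0, 38.0],
    'SibSp': [1, 1], 'Parch': [0, 0], 'Ticket': ['x', 'y'],
    'Fare': [7.25, 71.28], 'Cabin': [None, 'C85'], 'Embarked': ['S', 'C'],
})

clean = preprocess(toy)
print(clean.shape)  # (2, 9)
```

A single function also makes it harder to accidentally apply the steps in a different order, or forget one, when processing a new data set.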
] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [], "source": [ "df_test = df_test.drop(['Name', 'Ticket', 'Cabin'], axis=1)\n", "\n", "df_test = df_test.dropna()\n", "\n", "df_test['Sex'] = df_test['Sex'].map({'female': 0, 'male': 1})\n", "df_test['Embarked'] = df_test['Embarked'].map({'C': 1, 'S': 2, 'Q': 3})\n", "\n", "X_test = df_test.iloc[:, 2:]\n", "y_test = df_test['Survived']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now apply the trained model to the test data (omitting the columns PassengerId and Survived) to produce an output of predictions." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [], "source": [ "y_prediction = model.predict(X_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluation" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Comparing our predictions against the actual values gives us an array of True/False values, and adding up its elements (with True counting as 1) gives us the number of correct predictions." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "123" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.sum(y_prediction == y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To get a sense of how good our prediction is, we calculate the model's accuracy by dividing the number of correct predictions by the length of the array of actual values." ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.83108108108108103" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.sum(y_prediction == y_test) / float(len(y_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hence our predictions are 83% accurate. 
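The accuracy computation above can be sketched on a toy example (the arrays here are hypothetical, not the notebook's predictions):

```python
import numpy as np

# Element-wise comparison yields a boolean array; True counts as 1 when summed,
# so the mean of correct comparisons is the accuracy.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

accuracy = np.sum(y_pred == y_true) / float(len(y_true))
print(accuracy)  # 0.8
```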
We now compare this against our best naive guess, by looking at the proportion of 0s and 1s in the test set." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "0.39189189189189189" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.sum(y_test) / float(len(y_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hence 39% of the passengers in the test set survived (with value 1) and 61% did not survive. If we were to guess that none of the passengers survived, we would achieve a 61% accuracy. Hence our model gives an improvement of around 22 percentage points!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this section, we took the simplest approach of ignoring missing values. We look to build on this approach in Section 1-1." ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.10" } }, "nbformat": 4, "nbformat_minor": 0 }