# Pandas and Scikit-learn

Pandas is a Python library that contains high-level data structures and manipulation tools designed for data analysis. Think of Pandas as a Python version of Excel. Scikit-learn, on the other hand, is an open-source machine learning library for Python.

While Scikit-learn does a lot of the heavy lifting, it is equally important to process the raw data into a form that we can 'feed' to Scikit-learn. Hence, the ability to manipulate raw data with Pandas makes it an indispensable part of our toolkit.

# Kaggle

Kaggle is the leading platform for data science competitions. Participants compete for cash prizes by submitting the best predictive model to problems posted on the competition website.

https://www.kaggle.com/competitions

We will be reviewing the data from the Kaggle Titanic competition. Our aim is to make predictions on whether or not specific passengers on the Titanic survived, based on characteristics such as age, sex and class.

# Section 1-0 - First Cut

We will start by splitting the data into a training set and a test set. Next, we process the training data and use it to 'train' (or 'fit') our model. We then apply the trained model to the test data to make predictions. Finally, we compare our predictions against the 'ground truth' to see how well our model performed.

It is very common to encounter missing values in a data set. In this section, we will take the simplest (or perhaps, simplistic) approach of ignoring the whole row if any part of it contains an NaN value. We will build on this approach in later sections.

## Pandas - Extracting data

First, we load the training data from a .csv file. This is similar to the data found on the Kaggle website:

https://www.kaggle.com/c/titanic/data

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('data.csv')

We review the size of the data.

In [2]:
df.shape

(891, 12)

We now split the data into an 80% training set and 20% test set.

In [3]:
df_train = df.iloc[:712, :]
df_test = df.iloc[712:, :]
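
The positional split above keeps the rows in their original order. As a hedged alternative, Scikit-learn's `train_test_split` shuffles the rows before splitting. The sketch below uses a small made-up DataFrame (`toy`) rather than the Titanic data; the column values and the `random_state` value are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the Titanic DataFrame (values are made up).
toy = pd.DataFrame({
    'PassengerId': range(1, 11),
    'Survived': [0, 1, 1, 1, 0, 0, 0, 0, 1, 1],
})

# Shuffle the rows, then keep 80% for training and 20% for testing.
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=0)

print(toy_train.shape, toy_test.shape)
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in each set on every run.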

## Pandas - Cleaning data

We review a selection of the data. 

In [4]:
df_train.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


We notice that the columns describe features of the Titanic passengers, such as age, sex, and class. Of particular interest is the column Survived, which describes whether or not the passenger survived. When training our model, what we are essentially doing is assessing how each feature impacts whether or not the passenger survived (or if the feature makes an impact at all).

**Exercise**:
- Write the code to review the tail-end section of the data. 

We observe that the columns Name, Ticket and Cabin are, for our current purposes, irrelevant. We proceed to remove them from our data set.

In [5]:
df_train = df_train.drop(['Name', 'Ticket', 'Cabin'], axis=1)

Next, we review the type of data in the columns, and their respective counts.

In [6]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 712 entries, 0 to 711
Data columns (total 9 columns):
PassengerId    712 non-null int64
Survived       712 non-null int64
Pclass         712 non-null int64
Sex            712 non-null object
Age            565 non-null float64
SibSp          712 non-null int64
Parch          712 non-null int64
Fare           712 non-null float64
Embarked       711 non-null object
dtypes: float64(2), int64(5), object(2)
memory usage: 55.6+ KB


We notice that the columns Age and Embarked have NaNs or missing values. As previously discussed, we take the approach of simply removing the rows with missing values.

In [7]:
df_train = df_train.dropna()
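
To see what `dropna` does on a smaller scale, here is a minimal sketch on a made-up DataFrame (`toy`; the values are assumptions for illustration). It first counts the missing values per column with `isna().sum()`, then drops every row containing at least one of them.

```python
import numpy as np
import pandas as pd

# Toy data with gaps (values are made up for illustration).
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Embarked': ['S', 'C', None, 'S'],
})

# Count the missing values in each column.
missing_per_column = toy.isna().sum()

# Drop every row that has at least one missing value.
toy_clean = toy.dropna()

print(missing_per_column)
print(len(toy), '->', len(toy_clean))
```

Note that a row is discarded even if only one of its columns is missing, so the remaining data can shrink quickly.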

**Question**

- If you were to fill in the missing values, what values would you fill them with? Why?

Scikit-learn only takes numerical arrays as inputs. As such, we would need to convert the categorical columns Sex and Embarked into numerical ones. We first review the range of values for the column Sex, and map the string values to numbers.

In [8]:
df_train['Sex'].unique()

array(['male', 'female'], dtype=object)

In [9]:
df_train['Sex'] = df_train['Sex'].map({'female':0, 'male':1})

Similarly for Embarked, we review the range of values and map the string values to a numerical value that represents where the passenger embarked from.

In [10]:
df_train['Embarked'].unique()

array(['S', 'C', 'Q'], dtype=object)

In [11]:
df_train['Embarked'] = df_train['Embarked'].map({'C':1, 'S':2, 'Q':3})
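
One thing to watch with `map`: any value that is not a key in the dictionary becomes NaN without raising an error. The sketch below illustrates this on a made-up Series (`ports`) containing a hypothetical unexpected value `'X'`; checking for NaNs after mapping is a cheap safeguard.

```python
import pandas as pd

# Toy column containing a value ('X') that the mapping does not cover.
ports = pd.Series(['S', 'C', 'Q', 'X'])

# Same mapping as above; Series.map returns NaN for keys missing from the dict.
codes = ports.map({'C': 1, 'S': 2, 'Q': 3})

# Count how many entries failed to map.
n_unmapped = codes.isna().sum()
print(codes.tolist(), n_unmapped)
```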

**Question**
- What problems might we encounter by mapping C, S, and Q in the column Embarked to the values 1, 2, and 3? In other words, what does the ordering imply? Does the same problem exist for the column Sex?

In our final review of our training data, we check that (1) there are no NaN values, and (2) all the values are in numerical form.

In [12]:
df_train.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,1,0,3,1,22,1,0,7.25,2
1,2,1,1,0,38,1,0,71.2833,1
2,3,1,3,0,26,0,0,7.925,2
3,4,1,1,0,35,1,0,53.1,2
4,5,0,3,1,35,0,0,8.05,2
6,7,0,1,1,54,0,0,51.8625,2
7,8,0,3,1,2,3,1,21.075,2
8,9,1,3,0,27,0,2,11.1333,2
9,10,1,2,0,14,1,0,30.0708,1
10,11,1,3,0,4,1,1,16.7,2


In [13]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 564 entries, 0 to 710
Data columns (total 9 columns):
PassengerId    564 non-null int64
Survived       564 non-null int64
Pclass         564 non-null int64
Sex            564 non-null int64
Age            564 non-null float64
SibSp          564 non-null int64
Parch          564 non-null int64
Fare           564 non-null float64
Embarked       564 non-null int64
dtypes: float64(2), int64(7)
memory usage: 44.1 KB


Finally, we convert the processed training data from a Pandas dataframe into a numerical (Numpy) array of features, and extract the outcomes (the column Survived) as the target.

In [14]:
X_train = df_train.iloc[:, 2:].values
y_train = df_train['Survived']
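
Selecting columns by position works only as long as the column order stays fixed. As a hedged alternative, the sketch below selects the features by name on a made-up DataFrame (`toy`; the values are assumptions), which gives the same result without depending on column positions.

```python
import pandas as pd

# Toy processed data (values are made up); all columns are already numerical.
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Survived': [0, 1, 1],
    'Pclass': [3, 1, 3],
    'Sex': [1, 0, 0],
})

# Features: everything except the identifier and the outcome.
X_toy = toy.drop(['PassengerId', 'Survived'], axis=1).values
# Target: the outcome column.
y_toy = toy['Survived'].values

print(X_toy.shape, y_toy.shape)
```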

## Scikit-learn - Training the model

In this section, we'll simply use the model as a black box. We'll review more sophisticated techniques in later sections.

In particular, we'll be using the Random Forest model. The intuition is as follows: each feature is assessed to see how much impact it has on the outcome. The data is split on the most informative feature to form a 'branch', and a collection of branches forms a 'tree'. The Random Forest model, broadly speaking, creates a 'forest' of such trees, each trained on a random sample of the data, and aggregates their results.

http://en.wikipedia.org/wiki/Random_forest

In [15]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100, random_state=0)

We use the processed training data to 'train' (or 'fit') our model.

In [16]:
model = model.fit(X_train, y_train)
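
To make the intuition about feature impact a little more concrete, a fitted `RandomForestClassifier` exposes a `feature_importances_` attribute with one score per feature. The sketch below uses synthetic data (an assumption for illustration, not the Titanic data) in which the label depends only on the first feature, so we would expect that feature to receive the largest score.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: the label depends only on the first feature.
rng = np.random.RandomState(0)
X_toy = rng.rand(200, 3)
y_toy = (X_toy[:, 0] > 0.5).astype(int)

toy_model = RandomForestClassifier(n_estimators=100, random_state=0)
toy_model.fit(X_toy, y_toy)

# One importance score per feature; the scores sum to 1.
importances = toy_model.feature_importances_
print(importances)
```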

## Scikit-learn - Making predictions

We now review a selection of the test data.

In [17]:
df_test.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
712,713,1,1,"Taylor, Mr. Elmer Zebley",male,48.0,1,0,19996,52.0,C126,S
713,714,0,3,"Larsson, Mr. August Viktor",male,29.0,0,0,7545,9.4833,,S
714,715,0,2,"Greenberg, Mr. Samuel",male,52.0,0,0,250647,13.0,,S
715,716,0,3,"Soholt, Mr. Peter Andreas Lauritz Andersen",male,19.0,0,0,348124,7.65,F G73,S
716,717,1,1,"Endres, Miss. Caroline Louise",female,38.0,0,0,PC 17757,227.525,C45,C
717,718,1,2,"Troutt, Miss. Edwina Celia ""Winnie""",female,27.0,0,0,34218,10.5,E101,S
718,719,0,3,"McEvoy, Mr. Michael",male,,0,0,36568,15.5,,Q
719,720,0,3,"Johnson, Mr. Malkolm Joackim",male,33.0,0,0,347062,7.775,,S
720,721,1,2,"Harper, Miss. Annie Jessie ""Nina""",female,6.0,0,1,248727,33.0,,S
721,722,0,3,"Jensen, Mr. Svend Lauritz",male,17.0,1,0,350048,7.0542,,S


We process the test data in the same way as the training data.

In [18]:
df_test = df_test.drop(['Name', 'Ticket', 'Cabin'], axis=1)

df_test = df_test.dropna()

df_test['Sex'] = df_test['Sex'].map({'female': 0, 'male':1})
df_test['Embarked'] = df_test['Embarked'].map({'C':1, 'S':2, 'Q':3})

X_test = df_test.iloc[:, 2:].values
y_test = df_test['Survived']

We now apply the trained model to the test data (omitting the columns PassengerId and Survived) to produce an output of predictions.

In [19]:
y_prediction = model.predict(X_test)

## Evaluation

Comparing our predictions against the actual values gives us a series of True and False values, which count as 1s and 0s when summed. Adding them up therefore gives us the number of correct predictions.

In [20]:
np.sum(y_prediction == y_test)

123

To get a sense of how good our prediction is, we calculate the model's accuracy by dividing the number of correct predictions by the length of the array of actual values.

In [21]:
np.sum(y_prediction == y_test) / float(len(y_test))

0.83108108108108103

Hence our predictions are 83% accurate. We now compare this against a naive best guess, by looking at the proportion of 0s and 1s in the test set.

In [22]:
np.sum(y_test) / float(len(y_test))

0.39189189189189189

Hence 39% of the passengers in the test set survived (with value 1) and 61% did not survive. If we were to guess that none of the passengers survived, we would have a 61% accuracy. Hence our model gives an improvement of about 22 percentage points over this baseline!
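
The same two numbers can also be computed with Scikit-learn's `accuracy_score`. The sketch below uses made-up labels and predictions (`y_true` and `y_pred` are illustrative assumptions, not the Titanic results) to compute the model's accuracy alongside the accuracy of always guessing the most common class.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels and predictions (made up for illustration).
y_true = np.array([0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1])

# Fraction of predictions that match the true labels.
model_accuracy = accuracy_score(y_true, y_pred)

# Baseline: always predict the most common class in y_true.
majority_class = np.bincount(y_true).argmax()
baseline_accuracy = np.mean(y_true == majority_class)

print(model_accuracy, baseline_accuracy)
```

Here the model is right on 4 of 5 examples (0.8), while the majority-class guess is right on 3 of 5 (0.6).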

In this section, we took the simplest approach of ignoring missing values. We look to build on this approach in Section 1-1.