## Let's learn to train a machine learning algorithm and test it

This notebook will teach you to use scikit-learn, a popular machine learning package, to train a simple machine learning algorithm for a classification task.

Take a look at this website and different examples to explore further.

I encourage you to play around with the code and see what happens!


We will start by loading the necessary libraries to the workspace. 

In [None]:
# You don't need to change anything here for now

import numpy as np # For some numerical stuff
import matplotlib.pyplot as plt # For making beautiful plots
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier # A simple machine learning model known as KNN
from sklearn.model_selection import train_test_split # A utility to split data (formerly in sklearn.cross_validation)
from sklearn.metrics import precision_score
%matplotlib inline

# You may see some messages below this cell; don't worry about them

#### Now let's load the data into our workspace

In [None]:
dataset = load_iris() # Load the complete iris data structure into this variable

# Now let's get the features
features = dataset['data']

# Let's also get the names of the features
feature_names = dataset['feature_names']

# The class labels
labels = dataset['target']



In [None]:
# Let's look at the feature names, the dimensions (shape) of the feature array, and the number of classes.
# Check that the number of feature names equals the number of columns.

print('Feature names are:', feature_names)

print('\nThe feature array has %d rows and %d columns' % (features.shape[0], features.shape[1]))

print('\nThere are %d classes of objects in the dataset' % len(np.unique(labels)))

#### Let's plot the data in a two-dimensional space, with the first feature on the x-axis and the second on the y-axis

In [None]:
index_1 = 0 # Modify this to change the x-axis. For now it takes the first column. [Python indexing starts at 0, so the first column has index 0]
index_2 = 1 # Modify this to change the y-axis

plt.scatter(features[:,index_1],features[:,index_2],c=labels) # Make the scatter plot
plt.xlabel(feature_names[index_1])
plt.ylabel(feature_names[index_2])

#### Split the data into training and test samples

When training a machine learning algorithm, we have to validate its accuracy against a set of test data whose labels are known. This test helps us evaluate how well the algorithm has learned. As a general practice, we split our data into training and test samples. A common choice is to use 70% of the data for training and the remaining 30% for testing.

The following piece of code splits the data into training and test sets.
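
To make the idea of a split concrete, here is a minimal sketch of what a random 70/30 split does, written with NumPy only. The helper `manual_split` is a hypothetical name introduced for illustration; it is not how scikit-learn implements its own splitting utility, and it will not reproduce the exact samples that utility selects.

```python
import numpy as np

def manual_split(n_samples, test_fraction=0.3, seed=0):
    """Return shuffled, non-overlapping index arrays (train, test)."""
    rng = np.random.RandomState(seed)        # fixed seed so the split is repeatable
    shuffled = rng.permutation(n_samples)    # random ordering of 0 .. n_samples-1
    n_test = int(round(test_fraction * n_samples))
    return shuffled[n_test:], shuffled[:n_test]

# The iris data has 150 samples, so a 70/30 split gives 105 and 45 samples.
train_idx, test_idx = manual_split(150, test_fraction=0.3)
print(len(train_idx), len(test_idx))
```

Shuffling before cutting matters because the iris samples are stored class by class; taking the last 30% without shuffling would put only one class in the test set.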

In [None]:
# train_data --> feature samples for training
# test_data --> feature samples to evaluate / test
# train_labels --> class labels for the training data
# test_labels --> class labels for the test data

train_data,test_data,train_labels,test_labels = train_test_split(features,labels,test_size=0.3,random_state=0)

In [None]:
# Let's have a look at the size of the train and test data

print('Train data has %d samples' % train_data.shape[0])
print('Test data has %d samples' % test_data.shape[0])

#### Training the machine 

In this example we will train a simple machine learning algorithm called K-nearest neighbors to assign each sample to one of the 3 classes in the data we have loaded.
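
For intuition, the idea behind K-nearest neighbors can be sketched in a few lines of NumPy: find the k training samples closest to a new point and let them vote on its label. The function `knn_predict` and the toy data below are made up for illustration. This is a simplified sketch using Euclidean distance and a plain majority vote, not scikit-learn's actual implementation.

```python
import numpy as np

def knn_predict(train_X, train_y, query, k=5):
    """Predict the label of one query point by majority vote of its k nearest neighbors."""
    # Euclidean distance from the query to every training sample
    distances = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]     # indices of the k closest samples
    votes = np.bincount(train_y[nearest])   # how often each label occurs among them
    return int(np.argmax(votes))            # most common label (lowest label on a tie)

# Tiny made-up data set: two well-separated groups labeled 0 and 1
toy_X = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                  [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(toy_X, toy_y, np.array([1.1, 1.0]), k=3))  # close to the first group
print(knn_predict(toy_X, toy_y, np.array([5.0, 5.1]), k=3))  # close to the second group
```

Note that nothing is really "learned" in the usual sense: the training data is simply stored, and all the work happens at prediction time.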

In [None]:
mymodel = KNeighborsClassifier(n_neighbors=5) # Create the classifier object and assign it to the variable 'mymodel'

mymodel = mymodel.fit(train_data,train_labels) # Train the algorithm; fit() returns the trained model

That's it! We have trained our first machine learning algorithm. Now let's test it.


#### Testing the algorithm 

Testing the algorithm is as simple as training it. To evaluate the performance we will use an evaluation metric 
called the 'precision score'. With micro-averaging over all classes, which is what we use below, the precision score is defined as

$\mathrm{precision = \frac{Number \ of \ correctly \ classified \ samples}{Number \ of \ correctly \ classified \ samples \ + \ Number \ of \ incorrectly \ classified \ samples}}$

The higher this number, the better the performance of the machine learning algorithm on the test data. A high score means the algorithm has learned the pattern well.
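
As a worked example of the formula, the score can be computed by hand by counting correct and incorrect predictions. The label arrays below are made up purely for illustration and are not results from the iris data.

```python
import numpy as np

# Made-up true and predicted labels for ten test samples (illustration only)
true_labels = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])
predicted = np.array([0, 0, 1, 2, 1, 2, 2, 1, 2, 0])

n_correct = int(np.sum(predicted == true_labels))    # samples classified correctly
n_incorrect = int(np.sum(predicted != true_labels))  # samples classified incorrectly

score = n_correct / float(n_correct + n_incorrect)
print('Correct: %d, incorrect: %d, score: %.1f' % (n_correct, n_incorrect, score))
```

Here 8 of the 10 made-up samples are classified correctly, so the score is 0.8.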

In [None]:
# Test the performance of the algorithm on the test data which was generated through the splitting before.

predictions = mymodel.predict(test_data)

# Now the variable 'predictions' holds the class label predicted by the algorithm for each test sample



In [None]:
# Time to check the precision score (true labels come first, predicted labels second)

score = precision_score(test_labels,predictions,average='micro')

print('The precision score is %f%%' % (score*100))

As an exercise, change the values of the following parameters in the above code and check how they affect 
the precision score.

* test_size=0.3 in train_test_split [Change it to values like 0.5, 0.2 etc.]

* n_neighbors=5 in mymodel = KNeighborsClassifier(n_neighbors=5) [Change the value between 1 and 25]
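
One possible way to try several neighbor counts in a single run is sketched below. It reloads the data and uses its own variable names (`X_train`, `scores`, `preds_k` and so on, all chosen for this sketch) so that it does not overwrite the notebook's variables. The list of values is an arbitrary selection between 1 and 25.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import precision_score

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris['data'], iris['target'], test_size=0.3, random_state=0)

scores = {}  # maps n_neighbors -> micro-averaged precision on the test set
for k in [1, 5, 10, 15, 20, 25]:
    model_k = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    preds_k = model_k.predict(X_test)
    scores[k] = precision_score(y_test, preds_k, average='micro')
    print('n_neighbors=%2d  precision=%.1f%%' % (k, scores[k] * 100))
```

The same loop can be adapted to the first exercise by fixing n_neighbors and looping over different test_size values instead.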
