{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Andreas Mueller, Kyle Kastner, Sebastian Raschka \n",
"last updated: 2016-06-23 \n",
"\n",
"CPython 3.5.1\n",
"IPython 4.2.0\n",
"\n",
"numpy 1.11.0\n",
"scipy 0.17.1\n",
"matplotlib 1.5.1\n",
"pillow 3.2.0\n",
"scikit-learn 0.17.1\n"
]
}
],
"source": [
"%load_ext watermark\n",
"%watermark -d -u -a 'Andreas Mueller, Kyle Kastner, Sebastian Raschka' -v -p numpy,scipy,matplotlib,pillow,scikit-learn"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# SciPy 2016 Scikit-learn Tutorial"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Supervised Learning Part 1 -- Classification"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To visualize the workings of machine learning algorithms, it is often helpful to study one-dimensional or two-dimensional data, that is, data with only one or two features. While datasets in practice usually have many more features, it is hard to plot high-dimensional data on two-dimensional screens.\n",
"\n",
"We will illustrate some very simple examples before we move on to more \"real world\" data sets."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"First, we will look at a two-class classification problem in two dimensions. We use the synthetic data generated by the ``make_blobs`` function."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.datasets import make_blobs\n",
"\n",
"X, y = make_blobs(centers=2, random_state=0)\n",
"\n",
"print('X ~ n_samples x n_features:', X.shape)\n",
"print('y ~ n_samples:', y.shape)\n",
"\n",
"print('\\nFirst 5 samples:\\n', X[:5, :])\n",
"print('\\nFirst 5 labels:', y[:5])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As the data is two-dimensional, we can plot each sample as a point in a two-dimensional coordinate system, with the first feature being the x-axis and the second feature being the y-axis."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"plt.scatter(X[y == 0, 0], X[y == 0, 1], \n",
" c='blue', s=40, label='0')\n",
"plt.scatter(X[y == 1, 0], X[y == 1, 1], \n",
" c='red', s=40, label='1', marker='s')\n",
"\n",
"plt.xlabel('first feature')\n",
"plt.ylabel('second feature')\n",
"plt.legend(loc='upper right');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Classification is a supervised task, and since we are interested in the model's performance on unseen data, we split our data into two parts:\n",
"\n",
"1. a training set that the learning algorithm uses to fit the model\n",
"2. a test set to evaluate the generalization performance of the model\n",
"\n",
"The ``train_test_split`` function from the ``cross_validation`` module does that for us -- we will use it to split a dataset into 75% training data and 25% test data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.cross_validation import train_test_split\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y,\n",
" test_size=0.25,\n",
" random_state=1234,\n",
" stratify=y)"
]
},
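{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick check on the split, the following sketch regenerates the blob data and counts the samples per class in each subset. Because ``stratify=y`` is passed, the class proportions in the training and test sets should closely match those of the full dataset. The ``try``/``except`` import is an assumption made for portability: it falls back to the ``cross_validation`` module used in this tutorial when ``sklearn.model_selection`` (the location in newer scikit-learn releases) is not available."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.datasets import make_blobs\n",
"\n",
"try:  # newer scikit-learn releases\n",
"    from sklearn.model_selection import train_test_split\n",
"except ImportError:  # scikit-learn 0.17, as used in this tutorial\n",
"    from sklearn.cross_validation import train_test_split\n",
"\n",
"X, y = make_blobs(centers=2, random_state=0)\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
"    X, y, test_size=0.25, random_state=1234, stratify=y)\n",
"\n",
"# np.bincount counts how often each class label (0, 1) occurs\n",
"print('all data:', np.bincount(y))\n",
"print('training:', np.bincount(y_train))\n",
"print('test:    ', np.bincount(y_test))"
]
},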
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The scikit-learn estimator API"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Every algorithm in scikit-learn is exposed via an *Estimator* object, and all estimators share a very consistent interface. For instance, we first import the logistic regression class."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.linear_model import LogisticRegression"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we instantiate the estimator object."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"classifier = LogisticRegression()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"X_train.shape"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"y_train.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To build the model from our data, that is, to learn how to classify new points, we call the ``fit`` function with the training data and the corresponding training labels (the desired output for each training data point):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"classifier.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"(Some estimator methods such as `fit` return `self` by default. Thus, after executing the code snippet above, you will see the default parameters of this particular instance of `LogisticRegression`. Another way of retrieving the estimator's initialization parameters is to execute `classifier.get_params()`, which returns a parameter dictionary.)"
]
},
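{
"cell_type": "markdown",
"metadata": {},
"source": [
"Both points can be illustrated with a minimal sketch that creates a fresh estimator so that the cell runs on its own. ``set_params`` is the counterpart of ``get_params``; the value ``C=10.0`` is an arbitrary choice for illustration, and the variable name ``clf`` is used only in this example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.datasets import make_blobs\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"X, y = make_blobs(centers=2, random_state=0)\n",
"\n",
"clf = LogisticRegression()\n",
"# fit returns the estimator itself, which allows chaining calls\n",
"fitted = clf.fit(X, y)\n",
"print(fitted is clf)\n",
"\n",
"# get_params returns the constructor parameters as a dictionary\n",
"params = clf.get_params()\n",
"print(params['C'])\n",
"\n",
"# set_params changes constructor parameters on an existing estimator\n",
"clf.set_params(C=10.0)\n",
"print(clf.get_params()['C'])"
]
},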
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can then apply the model to unseen data and predict the class labels using the ``predict`` method:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"prediction = classifier.predict(X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can compare these against the true labels:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(prediction)\n",
"print(y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can evaluate our classifier quantitatively by measuring what fraction of predictions is correct. This is called **accuracy**:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"np.mean(prediction == y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There is also a convenience method, ``score``, that all scikit-learn classifiers provide to compute this directly from the test data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"classifier.score(X_test, y_test)"
]
},
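{
"cell_type": "markdown",
"metadata": {},
"source": [
"The different ways of computing accuracy agree, as the following self-contained sketch illustrates. It also uses ``accuracy_score`` from ``sklearn.metrics``, which is not used elsewhere in this notebook, as a third route to the same number. The import fallback is an assumption made so that the cell runs on both older and newer scikit-learn releases."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.datasets import make_blobs\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.metrics import accuracy_score\n",
"\n",
"try:  # newer scikit-learn releases\n",
"    from sklearn.model_selection import train_test_split\n",
"except ImportError:  # scikit-learn 0.17, as used in this tutorial\n",
"    from sklearn.cross_validation import train_test_split\n",
"\n",
"X, y = make_blobs(centers=2, random_state=0)\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
"    X, y, test_size=0.25, random_state=1234, stratify=y)\n",
"\n",
"classifier = LogisticRegression().fit(X_train, y_train)\n",
"prediction = classifier.predict(X_test)\n",
"\n",
"# three equivalent ways to compute the fraction of correct predictions\n",
"manual = np.mean(prediction == y_test)\n",
"via_metric = accuracy_score(y_test, prediction)\n",
"via_score = classifier.score(X_test, y_test)\n",
"print(manual, via_metric, via_score)"
]
},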
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is often helpful to compare the generalization performance (on the test set) to the performance on the training set:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"classifier.score(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"``LogisticRegression`` is a so-called linear model, which means it creates a decision boundary that is linear in the input space. In 2D, this simply means it finds a line to separate the blue points from the red ones:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from figures import plot_2d_separator\n",
"\n",
"plt.scatter(X[y == 0, 0], X[y == 0, 1], \n",
" c='blue', s=40, label='0')\n",
"plt.scatter(X[y == 1, 0], X[y == 1, 1], \n",
" c='red', s=40, label='1', marker='s')\n",
"\n",
"plt.xlabel(\"first feature\")\n",
"plt.ylabel(\"second feature\")\n",
"plot_2d_separator(classifier, X)\n",
"plt.legend(loc='upper right');"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Estimated parameters**: All the estimated model parameters are attributes of the estimator object whose names end with an underscore. Here, these are the coefficients and the offset of the line:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"print(classifier.coef_)\n",
"print(classifier.intercept_)"
]
},
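{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see how these parameters define the separating line, the sketch below recomputes the classifier's decision values by hand as ``w[0] * x0 + w[1] * x1 + b`` and checks them against ``decision_function``. Samples with a positive value are assigned to class 1, the others to class 0. The model is refit on the full blob data here only to keep the cell self-contained; the names ``clf``, ``w`` and ``b`` are chosen for this example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.datasets import make_blobs\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"X, y = make_blobs(centers=2, random_state=0)\n",
"clf = LogisticRegression().fit(X, y)\n",
"\n",
"w = clf.coef_[0]       # one weight per feature\n",
"b = clf.intercept_[0]  # offset of the line\n",
"\n",
"# signed score of each sample relative to the decision boundary\n",
"manual_scores = X[:, 0] * w[0] + X[:, 1] * w[1] + b\n",
"print(np.allclose(manual_scores, clf.decision_function(X)))\n",
"\n",
"# a positive score means class 1, otherwise class 0\n",
"manual_labels = (manual_scores > 0).astype(int)\n",
"print(np.array_equal(manual_labels, clf.predict(X)))"
]
},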
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another classifier: K Nearest Neighbors\n",
"------------------------------------------------\n",
"Another popular and easy-to-understand classifier is K nearest neighbors (kNN). It has one of the simplest learning strategies: given a new, unknown observation, look up the samples in your reference database (the training set) whose features are closest to it and assign the predominant class among them.\n",
"\n",
"The interface is exactly the same as for ``LogisticRegression`` above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.neighbors import KNeighborsClassifier"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This time we set a parameter of the KNeighborsClassifier to tell it we only want to look at one nearest neighbor:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"knn = KNeighborsClassifier(n_neighbors=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We fit the model with our training data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"knn.fit(X_train, y_train)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"plt.scatter(X[y == 0, 0], X[y == 0, 1], \n",
" c='blue', s=40, label='0')\n",
"plt.scatter(X[y == 1, 0], X[y == 1, 1], \n",
" c='red', s=40, label='1', marker='s')\n",
"\n",
"plt.xlabel(\"first feature\")\n",
"plt.ylabel(\"second feature\")\n",
"plot_2d_separator(knn, X)\n",
"plt.legend(loc='upper right');"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"knn.score(X_test, y_test)"
]
},
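{
"cell_type": "markdown",
"metadata": {},
"source": [
"The nearest-neighbor rule itself fits in a few lines of NumPy. The following sketch is an illustrative reimplementation for ``n_neighbors=1`` with Euclidean distance, not scikit-learn's actual implementation; the helper name ``predict_1nn`` is made up for this example, and the data is split by simple slicing to keep the cell short. On this dataset its predictions should agree with those of ``KNeighborsClassifier``."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.datasets import make_blobs\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"\n",
"X, y = make_blobs(centers=2, random_state=0)\n",
"# simple split for illustration: the first 75 samples are used for training\n",
"X_train, y_train = X[:75], y[:75]\n",
"X_test = X[75:]\n",
"\n",
"\n",
"def predict_1nn(X_train, y_train, X_new):\n",
"    # assign each new point the label of its closest training point\n",
"    # pairwise squared Euclidean distances, shape (n_new, n_train)\n",
"    diff = X_new[:, np.newaxis, :] - X_train[np.newaxis, :, :]\n",
"    dist = (diff ** 2).sum(axis=2)\n",
"    # index of the nearest training sample for every new point\n",
"    nearest = dist.argmin(axis=1)\n",
"    return y_train[nearest]\n",
"\n",
"\n",
"knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)\n",
"print(np.array_equal(predict_1nn(X_train, y_train, X_test),\n",
"                     knn.predict(X_test)))"
]
},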
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exercise\n",
"=========\n",
"Apply the ``KNeighborsClassifier`` to the ``iris`` dataset. Play with different values of ``n_neighbors`` and observe how the training and test scores change."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 0
}