{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Training a machine learning model with scikit-learn\n",
"*From the video series: [Introduction to machine learning with scikit-learn](https://github.com/justmarkham/scikit-learn-videos)*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Agenda\n",
"\n",
"- What is the **K-nearest neighbors** classification model?\n",
"- What are the four steps for **model training and prediction** in scikit-learn?\n",
"- How can I apply this pattern to **other machine learning models**?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reviewing the iris dataset"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from IPython.display import IFrame\n",
"IFrame('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', width=300, height=200)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- 150 **observations**\n",
"- 4 **features** (sepal length, sepal width, petal length, petal width)\n",
"- **Response** variable is the iris species\n",
"- **Classification** problem since response is categorical\n",
"- More information in the [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml/datasets/Iris)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## K-nearest neighbors (KNN) classification"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Pick a value for K.\n",
"2. Search for the K observations in the training data that are \"nearest\" to the measurements of the unknown iris.\n",
"3. Use the most popular response value from the K nearest neighbors as the predicted response value for the unknown iris."
]
},
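{
"cell_type": "markdown",
"metadata": {},
"source": [
"The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not how scikit-learn implements KNN: the function name `knn_predict` and the tiny training set are made up for this example, distance is assumed to be Euclidean, and ties are broken by whichever class `Counter` lists first.\n",
"\n",
"```python\n",
"from collections import Counter\n",
"\n",
"import numpy as np\n",
"\n",
"\n",
"def knn_predict(X_train, y_train, x_new, k=1):\n",
"    # hypothetical helper: Euclidean distance from x_new to every training observation\n",
"    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))\n",
"    # indices of the k smallest distances (the k nearest neighbors)\n",
"    nearest = np.argsort(distances)[:k]\n",
"    # most popular response value among those neighbors\n",
"    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]\n",
"\n",
"\n",
"# made-up training data: two features, two classes\n",
"X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [4.1, 3.9]])\n",
"y_train = np.array([0, 0, 1, 1])\n",
"\n",
"print(knn_predict(X_train, y_train, np.array([1.1, 1.0]), k=3))\n",
"```\n",
"\n",
"With K=3, two of the three nearest neighbors of the new point belong to class 0, so class 0 is the prediction."
]
},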
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Example training data\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### KNN classification map (K=1)\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### KNN classification map (K=5)\n",
"\n",
""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*Image Credits: [Data3classes](http://commons.wikimedia.org/wiki/File:Data3classes.png#/media/File:Data3classes.png), [Map1NN](http://commons.wikimedia.org/wiki/File:Map1NN.png#/media/File:Map1NN.png), [Map5NN](http://commons.wikimedia.org/wiki/File:Map5NN.png#/media/File:Map5NN.png) by Agor153. Licensed under CC BY-SA 3.0*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Loading the data"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# import load_iris function from datasets module\n",
"from sklearn.datasets import load_iris\n",
"\n",
"# save \"bunch\" object containing iris dataset and its attributes\n",
"iris = load_iris()\n",
"\n",
"# store feature matrix in \"X\"\n",
"X = iris.data\n",
"\n",
"# store response vector in \"y\"\n",
"y = iris.target"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(150L, 4L)\n",
"(150L,)\n"
]
}
],
"source": [
"# print the shapes of X and y\n",
"print(X.shape)\n",
"print(y.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## scikit-learn 4-step modeling pattern"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Step 1:** Import the class you plan to use"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.neighbors import KNeighborsClassifier"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Step 2:** \"Instantiate\" the \"estimator\"\n",
"\n",
"- \"Estimator\" is scikit-learn's term for model\n",
"- \"Instantiate\" means \"make an instance of\""
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"knn = KNeighborsClassifier(n_neighbors=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Name of the object does not matter\n",
"- Can specify tuning parameters (aka \"hyperparameters\") during this step\n",
"- All parameters not specified are set to their defaults"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n",
" metric_params=None, n_jobs=1, n_neighbors=1, p=2,\n",
" weights='uniform')\n"
]
}
],
"source": [
"print(knn)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Step 3:** Fit the model with data (aka \"model training\")\n",
"\n",
"- Model is learning the relationship between X and y\n",
"- Occurs in-place"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',\n",
" metric_params=None, n_jobs=1, n_neighbors=1, p=2,\n",
" weights='uniform')"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"knn.fit(X, y)"
]
},
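{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to see that fitting happens in-place is to check what `fit` returns. The sketch below reloads the data so it can run on its own; the variable name `returned` is only for illustration.\n",
"\n",
"```python\n",
"from sklearn.datasets import load_iris\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"\n",
"iris = load_iris()\n",
"X, y = iris.data, iris.target\n",
"\n",
"knn = KNeighborsClassifier(n_neighbors=1)\n",
"returned = knn.fit(X, y)\n",
"\n",
"# fit() hands back the same object it was called on\n",
"print(returned is knn)\n",
"\n",
"# fitted attributes end with an underscore; classes_ holds the labels seen in y\n",
"print(knn.classes_)\n",
"```\n",
"\n",
"Because `fit` returns the estimator itself, the notebook displays the estimator as the cell output when `knn.fit(X, y)` is the last line of a cell."
]
},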
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Step 4:** Predict the response for a new observation\n",
"\n",
"- New observations are called \"out-of-sample\" data\n",
"- Uses the information it learned during the model training process"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([2])"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"knn.predict([[3, 5, 4, 2]])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- Returns a NumPy array\n",
"- Can predict for multiple observations at once"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([2, 1])"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]\n",
"knn.predict(X_new)"
]
},
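{
"cell_type": "markdown",
"metadata": {},
"source": [
"The predictions are integer codes for the species. As a small sketch, the codes can be translated back to species names by indexing the `target_names` attribute of the iris \"bunch\" object. The variable name `predicted` is an assumption made for this example.\n",
"\n",
"```python\n",
"from sklearn.datasets import load_iris\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"\n",
"iris = load_iris()\n",
"knn = KNeighborsClassifier(n_neighbors=1)\n",
"knn.fit(iris.data, iris.target)\n",
"\n",
"X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]\n",
"predicted = knn.predict(X_new)\n",
"\n",
"# target_names is a NumPy array, so it can be indexed with the predicted codes\n",
"print(iris.target_names[predicted])\n",
"```"
]
},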
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using a different value for K"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 1])"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# instantiate the model (using the value K=5)\n",
"knn = KNeighborsClassifier(n_neighbors=5)\n",
"\n",
"# fit the model with data\n",
"knn.fit(X, y)\n",
"\n",
"# predict the response for new observations\n",
"knn.predict(X_new)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using a different classification model"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"array([2, 0])"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# import the class\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"# instantiate the model (using the default parameters)\n",
"logreg = LogisticRegression()\n",
"\n",
"# fit the model with data\n",
"logreg.fit(X, y)\n",
"\n",
"# predict the response for new observations\n",
"logreg.predict(X_new)"
]
},
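{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since every scikit-learn estimator exposes the same `fit` and `predict` methods, the pattern can be applied to several models in a loop. The sketch below is one possible way to do that. The dictionary names are made up for this example, and `max_iter=1000` is an assumption added to give the logistic regression solver room to converge in recent scikit-learn versions, so its predictions may differ from the output shown above.\n",
"\n",
"```python\n",
"from sklearn.datasets import load_iris\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"\n",
"iris = load_iris()\n",
"X, y = iris.data, iris.target\n",
"X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]\n",
"\n",
"# step 2 for each model: instantiate the estimators\n",
"models = {\n",
"    'knn (K=1)': KNeighborsClassifier(n_neighbors=1),\n",
"    'knn (K=5)': KNeighborsClassifier(n_neighbors=5),\n",
"    'logreg': LogisticRegression(max_iter=1000),  # max_iter is an assumption\n",
"}\n",
"\n",
"predictions = {}\n",
"for name, model in models.items():\n",
"    # step 3: fit the model with data\n",
"    model.fit(X, y)\n",
"    # step 4: predict the response for new observations\n",
"    predictions[name] = model.predict(X_new)\n",
"    print(name, predictions[name])\n",
"```\n",
"\n",
"The models may disagree on the same observations, which is why a procedure for comparing models is needed before choosing one."
]
},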
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Resources\n",
"\n",
"- [Nearest Neighbors](http://scikit-learn.org/stable/modules/neighbors.html) (user guide), [KNeighborsClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) (class documentation)\n",
"- [Logistic Regression](http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) (user guide), [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) (class documentation)\n",
"- [Videos from An Introduction to Statistical Learning](http://www.dataschool.io/15-hours-of-expert-machine-learning-videos/)\n",
" - Classification Problems and K-Nearest Neighbors (Chapter 2)\n",
" - Introduction to Classification (Chapter 4)\n",
" - Logistic Regression and Maximum Likelihood (Chapter 4)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Comments or Questions?\n",
"\n",
"- Email: \n",
"- Website: http://dataschool.io\n",
"- Twitter: [@justmarkham](https://twitter.com/justmarkham)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/html": [
"\n",
""
],
"text/plain": [
""
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from IPython.core.display import HTML\n",
"def css_styling():\n",
" styles = open(\"styles/custom.css\", \"r\").read()\n",
" return HTML(styles)\n",
"css_styling()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.11"
}
},
"nbformat": 4,
"nbformat_minor": 0
}