{ "metadata": { "name": "", "signature": "sha256:9d7cd4701939f9b715a1e75dfd7abc7f3b44a621bf511bcd7c18056b39122193" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "Preface" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Welcome! Allow me to be the first to offer my congratulations on your decision to take an interest in Applied Predictive Modeling with Python! This is a collection of IPython Notebooks that provides an interactive way to reproduce this awesome [book](http://www.amazon.com/Applied-Predictive-Modeling-Max-Kuhn/dp/1461468485) by Kuhn and Johnson. \n", "\n", "If you experience any problems along the way or have any feedback at all, please reach out to me.\n", "\n", "Best Regards,
\n", "Lei Gong
\n", "Email: LeiG.inbox@gmail.com
\n", "Twitter: @_LeiG" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Setup" ] }, { "cell_type": "code", "collapsed": false, "input": [ "import numpy\n", "import scipy\n", "import pandas\n", "import sklearn\n", "import matplotlib\n", "import rpy2\n", "import pyearth\n", "import statsmodels" ], "language": "python", "metadata": {}, "outputs": [], "prompt_number": 1 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Prepare Datasets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Thanks to the authors, all the datasets needed to reproduce the examples in the book are available in *.RData* format from their R packages $\\texttt{caret}$ and $\\texttt{AppliedPredictiveModeling}$. To prepare them for our purposes, I put together a small script, \"[fetch_data.py](https://github.com/LeiG/Applied-Predictive-Modeling-with-Python/blob/master/fetch_data.py)\", that downloads all the datasets and converts them from *.RData* to *.csv* when you run it." ] }, { "cell_type": "code", "collapsed": false, "input": [ "%run ../fetch_data.py" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "Using existing datasets folder:/Users/leigong/Documents/Research/DataScience/Applied-Predictive-Modeling/datasets\n", "Downloading AppliedPredictiveModeling from http://cran.r-project.org/src/contrib/AppliedPredictiveModeling_1.1-6.tar.gz (2 MB)\n", "Decomposing /Users/leigong/Documents/Research/DataScience/Applied-Predictive-Modeling/datasets/AppliedPredictiveModeling_1.1-6.tar.gz" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "Checking that the AppliedPredictiveModeling file exists..." 
] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "=> Success!\n", "Downloading Caret from http://cran.r-project.org/src/contrib/caret_6.0-37.tar.gz (2 MB)\n", "Decomposing /Users/leigong/Documents/Research/DataScience/Applied-Predictive-Modeling/datasets/caret_6.0-37.tar.gz" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "Checking that the Caret file exists..." ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n", "=> Success!\n", "Extract .RData files from the package...\n", "Convert .RData to .csv and clean up .RData files...\n", "=> Success!" ] }, { "output_type": "stream", "stream": "stdout", "text": [ "\n" ] } ], "prompt_number": 2 }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "1. Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Predictive modeling**: the process of developing a mathematical tool or model that generates an accurate prediction." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are a number of common reasons why predictive models fail, e.g.,\n", "\n", "- inadequate pre-processing of the data\n", "- inadequate model validation\n", "- unjustified extrapolation\n", "- over-fitting the model to the existing data\n", "- exploring relatively few models when searching for relationships" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "1.1 Prediction Versus Interpretation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The trade-off between prediction and interpretation depends on the primary goal of the task. The unfortunate reality is that as we push towards higher accuracy, models become more complex and harder to interpret." 
] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "1.2 Key Ingredients of Predictive Models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The foundation of an effective predictive model is laid with *intuition* and *deep knowledge of the problem context*, both of which are vital for driving decisions about model development. The process begins with *relevant* data." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "1.3 Terminology" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- The terms *sample*, *data point*, *observation*, and *instance* refer to a single, independent unit of data.\n", "- The *training* set consists of the data used to develop models, while the *test* or *validation* set is used solely for evaluating the performance of a final set of candidate models. **NOTE**: in common usage, the *validation* set is the one used to evaluate candidate models, and the *training* set is split by cross-validation into several sub-*training* and *test* sets to tune parameters during model development.\n", "- The *predictors*, *independent variables*, *attributes*, or *descriptors* are the data used as input for the prediction equation.\n", "- The terms *outcome*, *dependent variable*, *target*, *class*, and *response* refer to the outcome event or quantity that is being predicted." 
] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "1.4 Example Data Sets and Typical Data Scenarios" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "1.5 Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **Part I General Strategies**\n", " - Ch.2 A short tour of the predictive modeling process\n", " - Ch.3 Data pre-processing\n", " - Ch.4 Over-fitting and model tuning\n", "- **Part II Regression Models**\n", " - Ch.5 Measuring performance in regression models\n", " - Ch.6 Linear regression and its cousins\n", " - Ch.7 Nonlinear regression models\n", " - Ch.8 Regression trees and rule-based models\n", " - Ch.9 A summary of solubility models\n", " - Ch.10 Case study: compressive strength of concrete\n", "- **Part III Classification Models**\n", " - Ch.11 Measuring performance in classification models\n", " - Ch.12 Discriminant analysis and other linear classification models\n", " - Ch.13 Nonlinear classification models\n", " - Ch.14 Classification trees and rule-based models\n", " - Ch.15 A summary of grant application models\n", " - Ch.16 Remedies for severe class imbalance\n", " - Ch.17 Case study: job scheduling\n", "- **Part IV Other Considerations**\n", " - Ch.18 Measuring predictor importance\n", " - Ch.19 An introduction to feature selection\n", " - Ch.20 Factors that can affect model performance" ] } ], "metadata": {} } ] }