{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "%autosave 10" ], "language": "python", "metadata": {}, "outputs": [ { "javascript": [ "IPython.notebook.set_autosave_interval(10000)" ], "metadata": {}, "output_type": "display_data" }, { "output_type": "stream", "stream": "stdout", "text": [ "Autosaving every 10 seconds\n" ] } ], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Intro\n", "\n", "- Used by search engines, Google, and Yandex.\n", "- Many machine learning competition winners use it.\n", "\n", "## Basics 101\n", "\n", "- Examples with samples.\n", "- Features, n of them.\n", "- Response is real (regression) or -1/1 (classification)\n", "- Goal\n", " - Find some function that minimises error on unseen data.\n", "\n", "##\u00a0Decision trees\n", "\n", "- Classification and Regression Trees (CART), Breiman et al 1984\n", "- Binary, splits features on thresholds, output real (regression).\n", "- `sklearn.tree.DecisionTreeClassifier/Regressor`\n", "- leaves contain constant predictions\n", "- decision trees are very interpretable\n", " - can plot or see\n", "- but they have very poor predicive performance\n", " - seldom used alone.\n", " - usually use ensembles (random forests, bagging, boosting)\n", " - `sklearn.ensemble`\n", "\n", "## GBRTs\n", "\n", "### Advantages\n", "\n", "- can work on features with different scales\n", " - e.g. face detection or text classification have similar scales\n", "- can change loss function\n", " - robust loss functions like `huber`\n", "- non-linear feature interactions\n", " - don't need prior knowledge in kernel\n", "- not a black box (like SVM or neural networks)\n", "\n", "## disadvantage\n", "\n", "- lots of turning\n", "- slow to train (fast to use)\n", "- like other tree-based methods, can't extrapolate\n", "\n", "## Boosting\n", "\n", "- AdaBoost\n", " - Each member of ensemble is expert on the *errors* of its *predecessor*.\n", " - Iterative: reweight based on errors.\n", " - `sklearn.ensemble.AdaBoostClassifer/Regressor`\n", "- Viola-Jones Face Detector (2001) used it very successfully, seminal usage\n", "\n", "## J Friedman, 1999\n", "\n", "- Generalise to arbitrary loss functions\n", "- `sklearn.ensemble.GradientBoostingClassifier/Regressor`\n", "- Written in pure Python/Numpy, easy to extend.\n", "- Very shallow trees using custom node splitter and pre-sorting." 
] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.ensemble import GradientBoostingClassifier\n", "from sklearn.datasets import make_hastie_10_2\n", "\n", "X, y = make_hastie_10_2(n_samples=10000)\n", "est = GradientBoostingClassifier(n_estimators=200, max_depth=3)\n", "est.fit(X, y)\n", "\n", "pred = est.predict(X)\n", "est.predict_proba(X)[0] #\u00a0class probabilities" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 2, "text": [ "array([ 0.0325271, 0.9674729])" ] } ], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "for pred in est.staged_predict(X):\n", " plt.plot(X[:, 0], pred, color='r', alpha=0.1)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stderr", "text": [ "\n", "KeyboardInterrupt\n" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "- As you add more levels, you reduce variance, overfitting trickles in" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#\u00a0X_test/Y_test, held back data\n", "\n", "test_score = np.empty(len(est.estimators_))\n", "for i, pred in enumerate(est.staged_predict(X_test)):\n", " test_score[i] = est.loss_(y_test, pred)\n", "plt.plot(np.arange(n_estimators) + 1, test_score, label='Test')\n", "plt.plot(np.arange(n_estimators) + 1, est.train_score_, label='Train')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tree structure\n", "\n", "- `max_depth` controls degree of feature interactions\n", " - e.g. for geo data need at least 2 to capture long/lat interactions.\n", "- Friedman suggests max depth of 3-5, presenter uses 3-6.\n", "- `min_samples` requires sufficient samples, adds more bias, adds contraint, more general.\n", "\n", "##\u00a0Shrinkage\n", "\n", "- Slow learning using `learning_rate`, but needs higher `n_estimators`\n", "- Takes longer to train but lowers test error and difference between train and test error.\n", "\n", "## Stochastic Gradient Boosting\n", "\n", "- `subsample`: random subset of training set\n", "- `max_features`: random subset of features\n", " - Presenter recommends starting with just this.\n", "- increased accuracy.\n", "\n", "## How to tune hyperparameters (best practices)\n", "\n", "1. Set `n_estimators` high as possible, e.g. 3000\n", "2. Tune via grid search.\n", " - `param_grid`\n", " - `gs_cv = GridSearchCV(est, param_grid).fit(X, y)\n", " - `gs_cv.best_params_`\n", " - Can also use `joblib`.\n", "3. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Case study\n", "\n", "- GBRT can directly minimise Mean Absolute Error (MAE).\n", "- Some methods, like Random Forests, optimise Sum of Squares as a proxy for MAE\n", "    - squared error emphasises outliers, which increases MAE.\n", "- GBRT can capture the interaction between lat/long geo coordinates.\n", "- `est.feature_importances_` lets you peek into the black box; plot it to see the most relevant features, great for the exploratory phase.\n", "    - But it doesn't say how features interact with each other.\n", "    - Use `partial_dependence` for PD plots (a runnable sketch appears at the end of the notebook).\n", "        - `sklearn.ensemble.partial_dependence`\n", "        - very convenient, and the computation is cheap\n", "        - automatically detects spatial effects\n", "\n", "```python\n", "from sklearn.ensemble import partial_dependence as pd\n", "features = ['foo', 'bar']\n", "fig, axs = pd.plot_partial_dependence(est, X_train, features, feature_names=names)\n", "```\n", "\n", "- Very flexible, general, non-parametric.\n", "- Solid, battle tested.\n", "\n", "## Questions\n", "\n", "- Reference for heuristics?\n", "    - The R package `gbm` is great, heavily referenced, and a source of heuristics.\n", "- Why this house pricing case study?\n", "    - Just general interest.\n", "- Prediction of time-series?\n", "    - A general problem for tree-based methods.\n", "    - Try to de-trend the data beforehand.\n", "    - To be honest, I haven't been exposed to time-series prediction problems.\n", "    - Maybe transform the data into the prediction of each tree, then run a linear model (!!AI ?)\n", "- What if you want to do classification?\n", "    - Maybe use spline regression to transform the problem into a regression problem.\n" ] },
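{ "cell_type": "markdown", "metadata": {}, "source": [ "A runnable sketch of the `plot_partial_dependence` call from the case study, on a synthetic dataset (`make_friedman1`) since the talk's housing data isn't included here; the feature indices are illustrative." ] }, { "cell_type": "code", "collapsed": false, "input": [ "# Sketch: partial dependence plots on synthetic data (not the talk's dataset).\n", "from sklearn.datasets import make_friedman1\n", "from sklearn.ensemble import GradientBoostingRegressor\n", "from sklearn.ensemble.partial_dependence import plot_partial_dependence\n", "\n", "X_pd, y_pd = make_friedman1(n_samples=1000)\n", "est_pd = GradientBoostingRegressor(n_estimators=100, max_depth=3).fit(X_pd, y_pd)\n", "\n", "# One-way PD plots for features 0 and 1, plus their two-way interaction.\n", "fig, axs = plot_partial_dependence(est_pd, X_pd, features=[0, 1, (0, 1)])" ], "language": "python", "metadata": {}, "outputs": [] } ], "metadata": {} } ] }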