{ "metadata": { "name": "" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "code", "collapsed": false, "input": [ "%autosave 10" ], "language": "python", "metadata": {}, "outputs": [ { "javascript": [ "IPython.notebook.set_autosave_interval(10000)" ], "metadata": {}, "output_type": "display_data" }, { "output_type": "stream", "stream": "stdout", "text": [ "Autosaving every 10 seconds\n" ] } ], "prompt_number": 1 }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Intro\n", "\n", "- Used by search engines, Google, and Yandex.\n", "- Many machine learning competition winners use it.\n", "\n", "## Basics 101\n", "\n", "- Examples with samples.\n", "- Features, n of them.\n", "- Response is real (regression) or -1/1 (classification)\n", "- Goal\n", " - Find some function that minimises error on unseen data.\n", "\n", "##\u00a0Decision trees\n", "\n", "- Classification and Regression Trees (CART), Breiman et al 1984\n", "- Binary, splits features on thresholds, output real (regression).\n", "- `sklearn.tree.DecisionTreeClassifier/Regressor`\n", "- leaves contain constant predictions\n", "- decision trees are very interpretable\n", " - can plot or see\n", "- but they have very poor predicive performance\n", " - seldom used alone.\n", " - usually use ensembles (random forests, bagging, boosting)\n", " - `sklearn.ensemble`\n", "\n", "## GBRTs\n", "\n", "### Advantages\n", "\n", "- can work on features with different scales\n", " - e.g. face detection or text classification have similar scales\n", "- can change loss function\n", " - robust loss functions like `huber`\n", "- non-linear feature interactions\n", " - don't need prior knowledge in kernel\n", "- not a black box (like SVM or neural networks)\n", "\n", "## disadvantage\n", "\n", "- lots of turning\n", "- slow to train (fast to use)\n", "- like other tree-based methods, can't extrapolate\n", "\n", "## Boosting\n", "\n", "- AdaBoost\n", " - Each member of ensemble is expert on the *errors* of its *predecessor*.\n", " - Iterative: reweight based on errors.\n", " - `sklearn.ensemble.AdaBoostClassifer/Regressor`\n", "- Viola-Jones Face Detector (2001) used it very successfully, seminal usage\n", "\n", "## J Friedman, 1999\n", "\n", "- Generalise to arbitrary loss functions\n", "- `sklearn.ensemble.GradientBoostingClassifier/Regressor`\n", "- Written in pure Python/Numpy, easy to extend.\n", "- Very shallow trees using custom node splitter and pre-sorting." ] }, { "cell_type": "code", "collapsed": false, "input": [ "from sklearn.ensemble import GradientBoostingClassifier\n", "from sklearn.datasets import make_hastie_10_2\n", "\n", "X, y = make_hastie_10_2(n_samples=10000)\n", "est = GradientBoostingClassifier(n_estimators=200, max_depth=3)\n", "est.fit(X, y)\n", "\n", "pred = est.predict(X)\n", "est.predict_proba(X)[0] #\u00a0class probabilities" ], "language": "python", "metadata": {}, "outputs": [ { "metadata": {}, "output_type": "pyout", "prompt_number": 2, "text": [ "array([ 0.0325271, 0.9674729])" ] } ], "prompt_number": 2 }, { "cell_type": "code", "collapsed": false, "input": [ "for pred in est.staged_predict(X):\n", " plt.plot(X[:, 0], pred, color='r', alpha=0.1)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stderr", "text": [ "\n", "KeyboardInterrupt\n" ] } ], "prompt_number": 4 }, { "cell_type": "markdown", "metadata": {}, "source": [ "- As you add more levels, you reduce variance, overfitting trickles in" ] }, { "cell_type": "code", "collapsed": false, "input": [ "#\u00a0X_test/Y_test, held back data\n", "\n", "test_score = np.empty(len(est.estimators_))\n", "for i, pred in enumerate(est.staged_predict(X_test)):\n", " test_score[i] = est.loss_(y_test, pred)\n", "plt.plot(np.arange(n_estimators) + 1, test_score, label='Test')\n", "plt.plot(np.arange(n_estimators) + 1, est.train_score_, label='Train')" ], "language": "python", "metadata": {}, "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tree structure\n", "\n", "- `max_depth` controls degree of feature interactions\n", " - e.g. for geo data need at least 2 to capture long/lat interactions.\n", "- Friedman suggests max depth of 3-5, presenter uses 3-6.\n", "- `min_samples` requires sufficient samples, adds more bias, adds contraint, more general.\n", "\n", "##\u00a0Shrinkage\n", "\n", "- Slow learning using `learning_rate`, but needs higher `n_estimators`\n", "- Takes longer to train but lowers test error and difference between train and test error.\n", "\n", "## Stochastic Gradient Boosting\n", "\n", "- `subsample`: random subset of training set\n", "- `max_features`: random subset of features\n", " - Presenter recommends starting with just this.\n", "- increased accuracy.\n", "\n", "## How to tune hyperparameters (best practices)\n", "\n", "1. Set `n_estimators` high as possible, e.g. 3000\n", "2. Tune via grid search.\n", " - `param_grid`\n", " - `gs_cv = GridSearchCV(est, param_grid).fit(X, y)\n", " - `gs_cv.best_params_`\n", " - Can also use `joblib`.\n", "3. Set `n_estimators` even higher and tune `learning_rate`\n", "\n", "## Case study\n", "\n", "- GBRT can directly minimise Mean Absolute Error (MAE).\n", "- Some methods like Random Forests act on MAE through Sum of Squares as a proxy\n", " - This directly emphasises outliers, which increases MAE\n", "- GBRT can capture lat/long of geo coordinate interaction\n", "- `est.features_importances_` lets you peek in the black box, plot to see what are the most relevant features, great for epxloratory phase.\n", " - But don't say how features interact with each other.\n", " - use `partial_dependence` for PD plots\n", " - `sklearn.ensemble.partial_dependence`\n", " - very convenient, and the computation is cheap\n", " - automatically detect spatial effects\n", " \n", "``\n", "from sklearn.ensemble import partial_dependence as pd\n", "features = ['foo', 'bar']\n", "fig, axs = pd.plot_partial_dependence(est, X_train, features, feature_names=names]\n", "``\n", "\n", "- Very flexible, general, non-parametric\n", "- Solid, battle tested\n", "\n", "## Questions\n", "\n", "- Reference for heuristics?\n", " - R package `gdm` is great, and heavily referenced and source of heuristics.\n", "- Why this house pricing case study?\n", " - Just general interest.\n", "- Prediction of time-series?\n", " - General problem for tree-based data.\n", " - Try to de-trend data before.\n", " - To be honest haven't been exposed to time-series prediction problems.\n", " - Maybe transform data into prediction of each tree, then run linear model (!!AI ?)\n", "- What if you want to do classification?\n", " - Maybe use spline regression to transform the problem into a regression problem.\n", " " ] } ], "metadata": {} } ] }