{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Machine learning: \n",
    "\n",
    "### model fitting where we care about the output, not parameters"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### In a lot of scientific model fitting, parameters are meaningful\n",
    "\n",
    "#### If you fit a line to enzyme rate vs substrate data, you can extract things like $k_{cat}$ and $K_{M}$.\n",
    "\n",
    "<div style=\"margin:auto\" align=\"center\">\n",
    "    <img src=\"https://upload.wikimedia.org/wikipedia/commons/thumb/7/70/Lineweaver-Burke_plot.svg/500px-Lineweaver-Burke_plot.svg.png\" />\n",
    "</div>\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### In some fits, we don't care about parameters at all\n",
    "\n",
    "#### With a standard curve, we relate an observable to an unobservable quantity of interest\n",
    "\n",
    "<img align=\"center\" style=\"margin:auto\" src=\"http://www.news-medical.net/image.axd?picture=2016%2f2%2fEpp3.jpg\" />\n",
    "\n",
    "$$\\hat{Conc} = \\frac{abs - b}{m}$$\n"
   ]
  },
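  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### A minimal sketch of a standard curve\n",
    "\n",
    "The concentrations and absorbances below are made-up illustration values, and the variable names are our own.  We fit a line to the standards with `numpy.polyfit`, then invert it to estimate an unknown concentration.  We only use `m` and `b` to predict; we never interpret them."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Hypothetical standards: known concentrations and their measured absorbances\n",
    "conc = np.array([0.0, 1.0, 2.0, 4.0, 8.0])\n",
    "absorbance = np.array([0.05, 0.15, 0.25, 0.45, 0.85])\n",
    "\n",
    "# Fit absorbance = m*conc + b (a degree-1 polynomial)\n",
    "m, b = np.polyfit(conc, absorbance, 1)\n",
    "\n",
    "# Invert the fitted line to estimate the concentration of an unknown sample\n",
    "abs_unknown = 0.55\n",
    "conc_hat = (abs_unknown - b) / m\n",
    "print(conc_hat)"
   ]
  },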
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### To this point, we've talked about the first type of regression. \n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Do\n",
    "+ Choose the model that fits the data best with the fewest parameters\n",
    "+ Check your residuals for randomness\n",
    "+ Use a biologically/physically informed model with independently testable/interpretable parameters  \n",
    "\n",
    "### Don't\n",
    "+ Transform your data (e.g. take a log before fitting).  Most regression approaches assume measurement uncertainty is normally distributed. \n",
    "+ Try only one set of parameter guesses\n",
    "+ Overfit your data (fit a model with more parameters than observations)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Machine learning is much more *pragmatic*: find a model that lets you predict a quantity given some set of observations.  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "+ Useful for: \n",
    "  + Building models that predict from macromolecular sequence (transcription factor binding sites, protein secondary structure, etc.  \n",
    "  + Ojectively separating groups of bacteria into different classes based on growth\n",
    "  + Image analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Rules: \n",
    " + Don't throw *any* data away\n",
    " + Choose a model that gives you power to observe what you want\n",
    " + Ignore the fit parameters\n",
    " + Check your model behavior using cross-validation"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Two main types:\n",
    "\n",
    "+ **Supervised:** Give a dataset with $X$ and $Y$ and figure out a relationship between them.  \n",
    "  + Classification\n",
    "  + Regression (to predict quantiative values -- think standard curve)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "+ **Unsupervised:** Figure out structure in the data given only the data. \n",
    "  + Clustering \n",
    "  + Dimensionality reduction (e.g. principle component analysis)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "+ **Supervised:** \n",
    "  + Classification - [Random Forests](http://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees)\n",
    "  + Regression - [Support Vector Machines](http://scikit-learn.org/stable/modules/svm.html)\n",
    "+ **Unsupervised:**\n",
    "  + Clustering - [K-means](http://scikit-learn.org/stable/modules/clustering.html#clustering)\n",
    "  + Dimensionality reduction - [Principle Component Analysis](http://scikit-learn.org/stable/modules/decomposition.html#pca)\n",
    "  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "We'll be using [sklearn](http://scikit-learn.org/) for this analysis.  \n",
    "\n",
    "**Check out their docs.  They're *amazing*.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "<img style=\"margin:auto\" src=\"http://scikit-learn.org/stable/_images/sphx_glr_plot_cluster_comparison_0011.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Random Forests"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Built from binary decision trees.\n",
    "\n",
    "<img style=\"margin:auto\" src=\"https://upload.wikimedia.org/wikipedia/commons/f/f3/CART_tree_titanic_survivors.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Random forests are \"bootstrap\" methods\n",
    " \n",
    "+ Sample (with replacement) over observations to include in the analysis\n",
    "+ Sample (with replacement) over parameters to include in the analysis\n",
    "\n",
    "This removes bias in the analysis. "
   ]
  },
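  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### A sketch of a random forest in sklearn\n",
    "\n",
    "This assumes the iris dataset that ships with sklearn as example data; the hyperparameter values are arbitrary choices, not recommendations.  `bootstrap=True` resamples observations for each tree and `max_features` limits the features considered at each split."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_iris\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "\n",
    "# Example data (an assumption; any labeled dataset works the same way)\n",
    "X, y = load_iris(return_X_y=True)\n",
    "\n",
    "# bootstrap=True: each tree is fit to observations resampled with replacement\n",
    "# max_features='sqrt': each split considers only a random subset of features\n",
    "# oob_score=True: score each observation using trees that did not see it\n",
    "forest = RandomForestClassifier(n_estimators=100, bootstrap=True,\n",
    "                                max_features='sqrt', oob_score=True,\n",
    "                                random_state=0)\n",
    "forest.fit(X, y)\n",
    "\n",
    "# Out-of-bag accuracy: a built-in estimate of predictive performance\n",
    "print(forest.oob_score_)"
   ]
  },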
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Only as effective as the categories you choose.\n",
    "\n",
    "```\n",
    "              is space alien? \n",
    "                     |\n",
    "                     |\n",
    "              Yes----------No\n",
    "               |            |\n",
    "              0.0         100.0\n",
    "                            |\n",
    "                      died-------survived\n",
    "                        |           |\n",
    "                       62%         38%\n",
    "```\n",
    "                       \n",
    "                          \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Decision of what classifiers to use is pragmatic\n",
    "\n",
    "<img align=\"center\" style=\"margin:auto\" src=\"http://scikit-learn.org/stable/_images/sphx_glr_plot_forest_iris_001.png\" />"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## How you decide if you're classifier is working well?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "### Cross Validation\n",
    "\n",
    "+ Take a subset of your data out before fitting your model. \n",
    "+ Fit the model.\n",
    "+ See how well you do at predicting the data you did not include in your model. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "<h3><span style=\"color:red\">Training set</span><br/>\n",
    "<span style=\"color:blue\">Test set</span></h3>\n",
    "\n",
    "![img](https://raw.githubusercontent.com/harmsm/pythonic-science/master/chapters/04_machine-learning/img/bad-cross-validation.png)\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "<h3><span style=\"color:red\">Training set</span><br/>\n",
    "<span style=\"color:blue\">Test set</span></h3>\n",
    "\n",
    "![img](https://raw.githubusercontent.com/harmsm/pythonic-science/master/chapters/04_machine-learning/img/better-cross-validation.png)\n",
    "\n",
    "\n",
    "\n"
   ]
  },
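  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### A sketch of cross-validation in sklearn\n",
    "\n",
    "This assumes the iris dataset and a random forest as stand-ins for your own data and model.  `train_test_split` holds out a random subset before fitting; `cross_val_score` repeats the hold-out over several different subsets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_iris\n",
    "from sklearn.ensemble import RandomForestClassifier\n",
    "from sklearn.model_selection import cross_val_score, train_test_split\n",
    "\n",
    "# Example data (an assumption; substitute your own X and y)\n",
    "X, y = load_iris(return_X_y=True)\n",
    "\n",
    "# Take a random 25% of the data out before fitting the model\n",
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    X, y, test_size=0.25, random_state=0)\n",
    "\n",
    "# Fit the model on the training set only\n",
    "model = RandomForestClassifier(n_estimators=100, random_state=0)\n",
    "model.fit(X_train, y_train)\n",
    "\n",
    "# See how well we predict the data that was not included in the fit\n",
    "print(model.score(X_test, y_test))\n",
    "\n",
    "# 5-fold version: every observation is held out exactly once\n",
    "scores = cross_val_score(\n",
    "    RandomForestClassifier(n_estimators=100, random_state=0), X, y, cv=5)\n",
    "print(scores.mean())"
   ]
  },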
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### What about unsupervised methods?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Really common one: *principle component analysis* (PCA)\n",
    "\n",
    "<img style=\"margin:auto\" align=\"center\" src=\"http://scikit-learn.org/stable/_images/sphx_glr_plot_pca_vs_lda_0011.png\" />"
   ]
  },
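  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### A sketch of PCA in sklearn\n",
    "\n",
    "This assumes the iris dataset as example data.  The labels are loaded but ignored, because PCA only uses $X$.  `explained_variance_ratio_` reports the fraction of the variance captured by each component."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "from sklearn.datasets import load_iris\n",
    "from sklearn.decomposition import PCA\n",
    "\n",
    "# Example data (an assumption); the labels are discarded on purpose\n",
    "X, _ = load_iris(return_X_y=True)\n",
    "\n",
    "# Project the four measured features onto the two directions of largest variance\n",
    "pca = PCA(n_components=2)\n",
    "X_2d = pca.fit_transform(X)\n",
    "\n",
    "print(X_2d.shape)\n",
    "print(pca.explained_variance_ratio_)"
   ]
  },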
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Summary:\n",
    "\n",
    "+ In machine learning, the goal is to classify/fit models that act as \"black boxes\" -- not to interpret parameters.\n",
    "+ In supervised learning, you give categories to the observables\n",
    "+ In unsupervised learning, you let the data speak for itself.\n",
    "+ Practically: keep all your data, check your model with cross-validation"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}