{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Machine learning: \n", "\n", "### model fitting where we care about the output, not parameters" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### In a lot of scientific model fitting, parameters are meaningful\n", "\n", "#### If you fit a line to enzyme rate vs substrate data, you can extract things like $k_{cat}$ and $K_{M}$.\n", "\n", "
\n", " \n", "
\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### In some fits, we don't care about parameters at all\n", "\n", "#### With a standard curve, we relate an observable to an unobservable quantity of interest\n", "\n", "\n", "\n", "$$\\hat{Conc} = \\frac{abs - b}{m}$$\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### To this point, we've talked about the first type of regression. \n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Do\n", "+ Choose the model that fits the data best with the fewest parameters\n", "+ Check your residuals for randomness\n", "+ Use a biologically/physically informed model with independently testable/interpretable parameters \n", "\n", "### Don't\n", "+ Transform your data (e.g. take a log before fitting). Most regression approaches assume measurement uncertainty is normally distributed. \n", "+ Try only one set of parameter guesses\n", "+ Overfit your data (fit a model with more parameters than observations)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Machine learning is much more *pragmatic*: find a model that lets you predict a quantity given some set of observations. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "+ Useful for: \n", " + Building models that predict from macromolecular sequence (transcription factor binding sites, protein secondary structure, etc. \n", " + Ojectively separating groups of bacteria into different classes based on growth\n", " + Image analysis" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Rules: \n", " + Don't throw *any* data away\n", " + Choose a model that gives you power to observe what you want\n", " + Ignore the fit parameters\n", " + Check your model behavior using cross-validation" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Two main types:\n", "\n", "+ **Supervised:** Give a dataset with $X$ and $Y$ and figure out a relationship between them. \n", " + Classification\n", " + Regression (to predict quantiative values -- think standard curve)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "+ **Unsupervised:** Figure out structure in the data given only the data. \n", " + Clustering \n", " + Dimensionality reduction (e.g. principle component analysis)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "+ **Supervised:** \n", " + Classification - [Random Forests](http://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees)\n", " + Regression - [Support Vector Machines](http://scikit-learn.org/stable/modules/svm.html)\n", "+ **Unsupervised:**\n", " + Clustering - [K-means](http://scikit-learn.org/stable/modules/clustering.html#clustering)\n", " + Dimensionality reduction - [Principle Component Analysis](http://scikit-learn.org/stable/modules/decomposition.html#pca)\n", " " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "We'll be using [sklearn](http://scikit-learn.org/) for this analysis. \n", "\n", "**Check out their docs. They're *amazing*.**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Random Forests" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Built from binary decision trees.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Random forests are \"bootstrap\" methods\n", " \n", "+ Sample (with replacement) over observations to include in the analysis\n", "+ Sample (with replacement) over parameters to include in the analysis\n", "\n", "This removes bias in the analysis. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Only as effective as the categories you choose.\n", "\n", "```\n", " is space alien? \n", " |\n", " |\n", " Yes----------No\n", " | |\n", " 0.0 100.0\n", " |\n", " died-------survived\n", " | |\n", " 62% 38%\n", "```\n", " \n", " \n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Decision of what classifiers to use is pragmatic\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## How you decide if you're classifier is working well?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "### Cross Validation\n", "\n", "+ Take a subset of your data out before fitting your model. \n", "+ Fit the model.\n", "+ See how well you do at predicting the data you did not include in your model. " ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "

Training set
\n", "Test set

\n", "\n", "![img](https://raw.githubusercontent.com/harmsm/pythonic-science/master/chapters/04_machine-learning/img/bad-cross-validation.png)\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "

Training set
\n", "Test set

\n", "\n", "![img](https://raw.githubusercontent.com/harmsm/pythonic-science/master/chapters/04_machine-learning/img/better-cross-validation.png)\n", "\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### What about unsupervised methods?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "### Really common one: *principle component analysis* (PCA)\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Summary:\n", "\n", "+ In machine learning, the goal is to classify/fit models that act as \"black boxes\" -- not to interpret parameters.\n", "+ In supervised learning, you give categories to the observables\n", "+ In unsupervised learning, you let the data speak for itself.\n", "+ Practically: keep all your data, check your model with cross-validation" ] } ], "metadata": { "celltoolbar": "Slideshow", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.3" } }, "nbformat": 4, "nbformat_minor": 1 }