{ "metadata": { "name": "", "signature": "sha256:a156c8244671d5e167da919dd6d411d1f08ed05955fddb3bb6b062db1cd886d4" }, "nbformat": 3, "nbformat_minor": 0, "worksheets": [ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook was put together by [Jake Vanderplas](http://www.vanderplas.com) for PyCon 2015. Source and license info is on [GitHub](https://github.com/jakevdp/sklearn_pycon2015/)." ] }, { "cell_type": "heading", "level": 1, "metadata": {}, "source": [ "An Introduction to scikit-learn: Machine Learning in Python" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Goals of this Tutorial" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **Introduce the basics of Machine Learning**, and some skills useful in practice.\n", "- **Introduce the syntax of scikit-learn**, so that you can make use of the rich toolset available." ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Schedule" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Outline:\n", "\n", "**9:00 - 9:15** Preliminaries: Setup & introduction\n", "* Making sure your computer is set up\n", "\n", "**9:15 - 10:00** Basic Principles of Machine Learning and the Scikit-learn Interface\n", "* What is Machine Learning?\n", "* Machine learning data layout\n", "* Supervised Learning\n", "  - Classification\n", "  - Regression\n", "  - Measuring performance\n", "* Unsupervised Learning\n", "  - Clustering\n", "  - Dimensionality Reduction\n", "  - Density Estimation\n", "* Evaluation of Learning Models\n", "* Choosing the right algorithm for your dataset\n", "\n", "**10:00 - 10:45** Supervised learning in-depth\n", "* Support Vector Machines\n", "* Decision Trees and Random Forests\n", "\n", "**10:45 - 11:00** Break\n", "\n", "**11:00 - 11:45** Unsupervised learning in-depth\n", "* Dimensionality Reduction: Principal Component Analysis\n", "* Clustering: K-Means\n", "* Density Estimation: Gaussian Mixture Models\n", "* Application: image color compression\n", "\n", "**11:45 - 12:20** Validation and Model Selection\n", "* Overfitting, underfitting, bias, and variance\n", "* Improving your fit: validation curves and learning curves\n", "* Application: facial recognition" ] }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Preliminaries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This tutorial requires the following packages:\n", "\n", "- Python version 2.6-2.7 or 3.3-3.4\n", "- `numpy` version 1.5 or later: http://www.numpy.org/\n", "- `scipy` version 0.10 or later: http://www.scipy.org/\n", "- `matplotlib` version 1.3 or later: http://matplotlib.org/\n", "- `scikit-learn` version 0.14 or later: http://scikit-learn.org\n", "- `ipython` version 2.0 or later, with notebook support: http://ipython.org\n", "- `seaborn` version 0.5 or later, used mainly for plot styling\n", "\n", "The easiest way to get these is to use the [conda](http://store.continuum.io/) environment manager.\n", "I suggest downloading and installing [miniconda](http://conda.pydata.org/miniconda.html).\n", "\n", "The following command will install all required packages:\n", "```\n", "$ conda install numpy scipy matplotlib scikit-learn ipython-notebook seaborn\n", "```\n", "\n", "Alternatively, you can download and install the (very large) Anaconda software distribution, found at https://store.continuum.io/."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Checking your installation\n", "\n", "You can run the following code to check the versions of the packages on your system:\n", "\n", "(In the IPython notebook, press `shift` and `return` together to execute the contents of a cell.)" ] }, { "cell_type": "code", "collapsed": false, "input": [ "from __future__ import print_function\n", "\n", "import IPython\n", "print('IPython:', IPython.__version__)\n", "\n", "import numpy\n", "print('numpy:', numpy.__version__)\n", "\n", "import scipy\n", "print('scipy:', scipy.__version__)\n", "\n", "import matplotlib\n", "print('matplotlib:', matplotlib.__version__)\n", "\n", "import sklearn\n", "print('scikit-learn:', sklearn.__version__)\n", "\n", "import seaborn\n", "print('seaborn:', seaborn.__version__)" ], "language": "python", "metadata": {}, "outputs": [ { "output_type": "stream", "stream": "stdout", "text": [ "IPython: 2.4.1\n", "numpy:" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 1.9.2\n", "scipy: 0.15.1\n", "matplotlib: 1.4.3\n", "scikit-learn:" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 0.15.2\n", "seaborn:" ] }, { "output_type": "stream", "stream": "stdout", "text": [ " 0.5.1\n" ] } ], "prompt_number": 1 }, { "cell_type": "heading", "level": 2, "metadata": {}, "source": [ "Useful Resources" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- **scikit-learn:** http://scikit-learn.org (see especially the narrative documentation)\n", "- **matplotlib:** http://matplotlib.org (see especially the gallery section)\n", "- **IPython:** http://ipython.org (also check out http://nbviewer.ipython.org)" ] } ], "metadata": {} } ] }