{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introducing Scikit-Learn" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "There are several Python libraries that provide solid implementations of a range of machine learning algorithms.\n", "One of the best known is [Scikit-Learn](http://scikit-learn.org), a package that provides efficient versions of a large number of common algorithms.\n", "Scikit-Learn is characterized by a clean, uniform, and streamlined API, as well as by very useful and complete online documentation.\n", "A benefit of this uniformity is that once you understand the basic use and syntax of Scikit-Learn for one type of model, switching to a new model or algorithm is straightforward.\n", "\n", "This chapter provides an overview of the Scikit-Learn API. A solid understanding of these API elements will form the foundation for understanding the deeper practical discussion of machine learning algorithms and approaches in the following chapters.\n", "\n", "We will start by covering data representation in Scikit-Learn, then delve into the Estimator API, and finally go through a more interesting example of using these tools for exploring a set of images of handwritten digits." ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "## Data Representation in Scikit-Learn" ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "Machine learning is about creating models from data: for that reason, we'll start by discussing how data can be represented.\n", "The best way to think about data within Scikit-Learn is in terms of *tables*." ] }, { "cell_type": "markdown", "metadata": { "deletable": true, "editable": true }, "source": [ "A basic table is a two-dimensional grid of data, in which the rows represent individual elements of the dataset, and the columns represent quantities related to each of these elements.\n", "For example, consider the [Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set), famously analyzed by Ronald Fisher in 1936.\n", "We can download this dataset in the form of a Pandas `DataFrame` using the [Seaborn](http://seaborn.pydata.org/) library, and take a look at the first few items:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "deletable": true, "editable": true, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/html": [ "
\n", " | sepal_length | \n", "sepal_width | \n", "petal_length | \n", "petal_width | \n", "species | \n", "
---|---|---|---|---|---|
0 | \n", "5.1 | \n", "3.5 | \n", "1.4 | \n", "0.2 | \n", "setosa | \n", "
1 | \n", "4.9 | \n", "3.0 | \n", "1.4 | \n", "0.2 | \n", "setosa | \n", "
2 | \n", "4.7 | \n", "3.2 | \n", "1.3 | \n", "0.2 | \n", "setosa | \n", "
3 | \n", "4.6 | \n", "3.1 | \n", "1.5 | \n", "0.2 | \n", "setosa | \n", "
4 | \n", "5.0 | \n", "3.6 | \n", "1.4 | \n", "0.2 | \n", "setosa | \n", "