{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Iris Classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As a first foray into the world of data science and machine learning, we will explore a simple classification problem: identify flowers (irises, to be exact) by their physical characteristics. The dataset can be found [here](https://archive.ics.uci.edu/ml/datasets/Iris).\n", "\n", "So, the very first thing we're going to want to do is download the dataset." ] }, { "cell_type": "code", "execution_count": 279, "metadata": {}, "outputs": [], "source": [ "%%bash\n", "for i in iris.data iris.names;\n", "do\n", " wget -nv -nc https://archive.ics.uci.edu/ml/machine-learning-databases/iris/$i\n", "done" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's take a peek to see what it looks like:" ] }, { "cell_type": "code", "execution_count": 280, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "=== iris.data ===\n", "5.1,3.5,1.4,0.2,Iris-setosa\n", "4.9,3.0,1.4,0.2,Iris-setosa\n", "4.7,3.2,1.3,0.2,Iris-setosa\n", "4.6,3.1,1.5,0.2,Iris-setosa\n", "5.0,3.6,1.4,0.2,Iris-setosa\n", "5.4,3.9,1.7,0.4,Iris-setosa\n", "4.6,3.4,1.4,0.3,Iris-setosa\n", "5.0,3.4,1.5,0.2,Iris-setosa\n", "4.4,2.9,1.4,0.2,Iris-setosa\n", "4.9,3.1,1.5,0.1,Iris-setosa\n", "\n", "=== iris.names ===\n", "1. Title: Iris Plants Database\n", "\tUpdated Sept 21 by C.Blake - Added discrepency information\n", "\n", "2. Sources:\n", " (a) Creator: R.A. Fisher\n", " (b) Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)\n", " (c) Date: July, 1988\n", "\n", "3. Past Usage:\n", " - Publications: too many to mention!!! 
Here are a few.\n", "\n" ] } ], "source": [ "%%bash\n", "for i in iris.data iris.names;\n", "do\n", " echo === $i ===\n", " head $i\n", " echo\n", "done" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So the `iris.names` file describes the dataset, while `iris.data` is the dataset itself. I won't reiterate everything in the file, but here are the important points:\n", "* Each row has 5 attributes: 1) sepal length in cm, 2) sepal width in cm, 3) petal length in cm, 4) petal width in cm, 5) class (one of Iris Setosa, Iris Versicolour, Iris Virginica). The first four are the features; the class is the label we want to predict.\n", "* There are 150 samples, 50 from each class (so the data is balanced)\n", "\n", "Let's load the data into pandas so that we can manipulate it." ] }, { "cell_type": "code", "execution_count": 281, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
"|   | SepLen | SepWid | PetLen | PetWid | Class |\n",
"|---|---|---|---|---|---|\n",
"| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |\n",
"| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |\n",
"| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |\n",
"| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |\n",
"| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |\n",
"