{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Gathering\n", "\n", "This recipe shows how to access the datasets needed for the rest of the book.\n", "\n", "We start by loading the necessary libraries and resetting the computational graph." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import tensorflow as tf\n", "from tensorflow.python.framework import ops\n", "ops.reset_default_graph()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The Iris Dataset (R. Fisher / Scikit-Learn)\n", "\n", "One of the most frequently used ML datasets is the iris flower dataset. We will use scikit-learn's built-in `datasets` module to import it. You can read more about it here: http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "150\n", "150\n", "[ 5.1 3.5 1.4 0.2]\n", "{0, 1, 2}\n" ] } ], "source": [ "from sklearn.datasets import load_iris\n", "\n", "iris = load_iris()\n", "print(len(iris.data))\n", "print(len(iris.target))\n", "print(iris.data[0])\n", "print(set(iris.target))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Low Birthweight Dataset (Hosted on GitHub)\n", "\n", "The 'Low Birthweight Dataset' comes from a well-known 1989 study by Hosmer and Lemeshow called the \"Low Infant Birth Weight Risk Factor Study\". It is a commonly used academic dataset, mostly for logistic regression. 
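The parsing done in the download cell below can be sanity-checked on a tiny inline sample that mimics the file's tab-separated, `'\r\n'`-terminated layout (the column names in this sketch are illustrative placeholders, not the dataset's actual header):

```python
import numpy as np

# Inline sample mimicking birthweight.dat: a tab-separated header row,
# then numeric rows, with '\r\n' line endings as in the hosted file.
# Column names 'C1'..'C3' are illustrative placeholders only.
sample = 'C1\tC2\tC3\r\n1\t28\t113\r\n0\t29\t130\r\n'
rows = sample.split('\r\n')
header = rows[0].split('\t')
# The len(...) >= 1 guards drop empty tokens and the trailing empty line.
data = [[float(x) for x in r.split('\t') if len(x) >= 1] for r in rows[1:] if len(r) >= 1]
data = np.array(data)
print(header)      # ['C1', 'C2', 'C3']
print(data.shape)  # (2, 3)
```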
We host this dataset publicly on GitHub here:\n", "https://github.com/nfmcclure/tensorflow_cookbook/raw/master/01_Introduction/07_Working_with_Data_Sources/birthweight_data/birthweight.dat" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "189\n", "9\n" ] } ], "source": [ "import requests\n", "\n", "birthdata_url = 'https://github.com/nfmcclure/tensorflow_cookbook/raw/master/01_Introduction/07_Working_with_Data_Sources/birthweight_data/birthweight.dat'\n", "birth_file = requests.get(birthdata_url)\n", "birth_data = birth_file.text.split('\\r\\n')\n", "birth_header = birth_data[0].split('\\t')\n", "birth_data = [[float(x) for x in y.split('\\t') if len(x)>=1] for y in birth_data[1:] if len(y)>=1]\n", "print(len(birth_data))\n", "print(len(birth_data[0]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Housing Price Dataset (UCI)\n", "\n", "We will also use a housing price dataset from the University of California at Irvine (UCI) Machine Learning Repository. It is a great dataset for regression. 
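The raw file separates its 14 columns with runs of spaces, which is why the parsing cell below filters empty tokens after `split(' ')`; `str.split()` with no argument collapses whitespace runs and gives the same result. A quick sketch on one line written in the file's layout (the values here are illustrative):

```python
# One line in the style of housing.data: 14 numeric columns separated by
# runs of spaces (values are illustrative of the layout, not quoted from the file).
line = ' 0.00632  18.00   2.310  0  0.5380  6.5750  65.20  4.0900   1  296.0  15.30 396.90   4.98  24.00'

# Filtering empty tokens after split(' '), as in the cell below...
vals_filtered = [float(x) for x in line.split(' ') if len(x) >= 1]
# ...is equivalent to split() with no argument, which collapses whitespace runs.
vals_plain = [float(x) for x in line.split()]

assert vals_filtered == vals_plain
print(len(vals_plain))  # 14, one value per column header
```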
You can read more about it here:\n", "https://archive.ics.uci.edu/ml/datasets/Housing" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "506\n", "14\n" ] } ], "source": [ "import requests\n", "\n", "housing_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data'\n", "housing_header = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']\n", "housing_file = requests.get(housing_url)\n", "housing_data = [[float(x) for x in y.split(' ') if len(x)>=1] for y in housing_file.text.split('\\n') if len(y)>=1]\n", "print(len(housing_data))\n", "print(len(housing_data[0]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## MNIST Handwriting Dataset (Yann LeCun)\n", "\n", "The MNIST handwritten digit dataset is the `Hello World` of image recognition. Yann LeCun, the famous scientist and researcher, hosts it on his webpage here: http://yann.lecun.com/exdb/mnist/ . Because it is so commonly used, many libraries, including TensorFlow, provide built-in access to it. We will use TensorFlow to access this data as follows.\n", "\n", "If you haven't downloaded it before, please wait a bit while it downloads." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.\n", "Extracting MNIST_data/train-images-idx3-ubyte.gz\n", "Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.\n", "Extracting MNIST_data/train-labels-idx1-ubyte.gz\n", "Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.\n", "Extracting MNIST_data/t10k-images-idx3-ubyte.gz\n", "Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.\n", "Extracting MNIST_data/t10k-labels-idx1-ubyte.gz\n", "55000\n", "10000\n", "5000\n", "[ 0. 0. 0. 
1. 0. 0. 0. 0. 0. 0.]\n" ] } ], "source": [ "from tensorflow.examples.tutorials.mnist import input_data\n", "\n", "mnist = input_data.read_data_sets(\"MNIST_data/\", one_hot=True)\n", "print(len(mnist.train.images))\n", "print(len(mnist.test.images))\n", "print(len(mnist.validation.images))\n", "print(mnist.train.labels[1,:])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## CIFAR-10 Data\n", "\n", "The CIFAR-10 data ( https://www.cs.toronto.edu/~kriz/cifar.html ) contains 60,000 32x32 color images in 10 classes, collected by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. Alex Krizhevsky maintains the page referenced here. This is such a common dataset that TensorFlow has built-in functions to access it (via the Keras wrapper). Note that the Keras wrapper automatically splits the images into a 50,000-image training set and a 10,000-image test set." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Downloading data from http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz\n" ] } ], "source": [ "from PIL import Image\n", "# Running this command requires an internet connection, and it takes a few minutes to download all the images.\n", "(X_train, y_train), (X_test, y_test) = tf.contrib.keras.datasets.cifar10.load_data()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The ten categories are (in order):\n", "\n", "