{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Sebastian Raschka 12/29/2015 \n", "\n", "CPython 3.5.1\n", "IPython 4.0.1\n" ] } ], "source": [ "%load_ext watermark\n", "%watermark -a 'Sebastian Raschka' -v -d" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Reading MNIST into NumPy arrays" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, I provide some instructions for reading the MNIST dataset of handwritten digits into NumPy arrays.\n", "The dataset consists of the following files:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- Training set images: train-images-idx3-ubyte.gz (9.9 MB, 47 MB unzipped, 60,000 samples)\n", "- Training set labels: train-labels-idx1-ubyte.gz (29 KB, 60 KB unzipped, 60,000 labels)\n", "- Test set images: t10k-images-idx3-ubyte.gz (1.6 MB, 7.8 MB unzipped, 10,000 samples)\n", "- Test set labels: t10k-labels-idx1-ubyte.gz (5 KB, 10 KB unzipped, 10,000 labels)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dataset source: [http://yann.lecun.com/exdb/mnist/](http://yann.lecun.com/exdb/mnist/)\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "After downloading the files, I recommend unzipping them with the Unix/Linux gzip tool from the terminal for efficiency, e.g., using the command \n", " `gzip -d *ubyte.gz` \n", "in your local MNIST download directory." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we define a simple function to read in the training or test images and the corresponding labels."
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import os\n", "import struct\n", "import numpy as np\n", "\n", "def load_mnist(path, which='train'):\n", "\n", "    if which == 'train':\n", "        labels_path = os.path.join(path, 'train-labels-idx1-ubyte')\n", "        images_path = os.path.join(path, 'train-images-idx3-ubyte')\n", "    elif which == 'test':\n", "        labels_path = os.path.join(path, 't10k-labels-idx1-ubyte')\n", "        images_path = os.path.join(path, 't10k-images-idx3-ubyte')\n", "    else:\n", "        raise ValueError('`which` must be \"train\" or \"test\"')\n", "\n", "    # Label file: 8-byte header (magic number, item count), then one byte per label\n", "    with open(labels_path, 'rb') as lbpath:\n", "        magic, n = struct.unpack('>II', lbpath.read(8))\n", "        labels = np.fromfile(lbpath, dtype=np.uint8)\n", "\n", "    # Image file: 16-byte header (magic number, item count, rows, columns), then one byte per pixel\n", "    with open(images_path, 'rb') as imgpath:\n", "        magic, n, rows, cols = struct.unpack('>IIII', imgpath.read(16))\n", "        images = np.fromfile(imgpath, dtype=np.uint8).reshape(len(labels), 784)\n", "\n", "    return images, labels" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The returned `images` NumPy array will have the shape $n \\times m$, where $n$ is the number of samples, and $m$ is the number of features. The images in the MNIST dataset consist of $28 \\times 28$ pixels, and each pixel is represented by a grayscale intensity value. Here, we unroll the $28 \\times 28$ images into 1D row vectors, which represent the rows in our matrix; thus $m=784$.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You may wonder why we read in the labels in such a strange way:\n", "\n", "    magic, n = struct.unpack('>II', lbpath.read(8))\n", "    labels = np.fromfile(lbpath, dtype=np.uint8)\n", "\n", "This is to accommodate the way the labels were stored, which is described in the following excerpt from the MNIST website:\n", "\n", "
| offset | type | value | description |\n", "| --- | --- | --- | --- |\n", "| 0000 | 32 bit integer | 0x00000801(2049) | magic number (MSB first) |\n", "| 0004 | 32 bit integer | 60000 | number of items |\n", "| 0008 | unsigned byte | ?? | label |\n", "| 0009 | unsigned byte | ?? | label |\n", "| ........ | | | |\n", "| xxxx | unsigned byte | ?? | label |\n", "\n", "So, we first read the \"magic number\" (a constant that identifies the file format) and the \"number of items\" from the file buffer before we read the remaining bytes into a NumPy array using the `fromfile` function.\n", "\n", "The `fmt` parameter value `'>II'` that we passed as an argument to `struct.unpack` can be decomposed into:\n", "\n", "- `'>'`: big-endian (defines the order in which a sequence of bytes is stored)\n", "- `'I'`: unsigned int (4 bytes each, so `'II'` reads the two 32-bit header fields)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If everything executed correctly, we should now have a label vector of $60,000$ instances, and a $60,000 \\times 784$ image feature matrix." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Labels: 60000\n", "Rows: 60000, columns: 784\n" ] } ], "source": [ "X, y = load_mnist(path='./', which='train')\n", "print('Labels: %d' % y.shape[0])\n", "print('Rows: %d, columns: %d' % (X.shape[0], X.shape[1]))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To check whether the pixels were retrieved correctly, let us plot one of the images:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "image/png":
"iVBORw0KGgoAAAANSUhEUgAAAP4AAAEKCAYAAAAy4ujqAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAEYRJREFUeJzt3X2wXHV9x/H3J4SAEMSYkpsplKTagq0MTWGgw4TQNWBE\nagUNpUIdUVt1xEe0VgytuRc6DDKFEqrOFAwYrI4IJA3pOC1Q3bYZSWFaCGKI0akJiSSXUEIwGQgP\n+faPPUn3rvf+du/dp8P9fV4zOzl3v+fs+d7N/ew5Z8+TIgIzy8uUfjdgZr3n4JtlyME3y5CDb5Yh\nB98sQw6+WYYc/MxJ+n1JW1sc91JJ/zHB+Ux4Wus8B79kJP1M0sIez3Y8B3O0c+BHy9NK2i/pF8Xj\nOUk3tzFfazC13w3Y+Eg6JCJe6XcfPRDAyRHxs343Mhl5iV8ikm4HjgfWFEu5P5c0p1j6fVDSFuBf\nR1s9r19TUM0Vkn4qaaekb0t6XYs9fL6Y7jlJj0m6oGGUKZL+TtKzkjbUr51Ieq2kr0l6UtJWSVdL\n0kTfDvz32TV+Y0skIt4HPAG8IyJeGxF/U1c+C3gT8LYDoyde6pPAO4EFwK8Cu4CvttjGT4H5EfFa\nYAj4B0kDdfXfA34CzAQGgZV1HyorgBeBNwC/C7wV+LPRZiJpjaS/aNLLvxUfIndJmtNi/9YCB7+c\nGpeSASyNiOcjYl8L038EuDIitkfES8BVwIWSmv5/R8TdETFcDN9JLeSn140yHBE3RcQrEfEd4MfA\nH0iaBbwduDwiXoiIp4EbgYvHmM8fRsR1iVbOAuZS+7DbDvxTK/1ba7yN/+qxbRzjzgFWSdpf/Czg\nJWCAWojGJOl9wOXUQgdwJPArdaP8vGGSLdTWKuYAhwLbi7V7FY8nxtH3QRGxthh8TtKngN3AbwE/\nmsjr2UgOfvmMtQpf//xe4IgDP0g6BDimrv4E8MGIeGA8M5Z0PHAz8JYD00p6mJFrIMc2THY8sBrY\nCrwAzIzOn/Kphn+tTV51Kp8d1LaR6zX+wW8CDpf0dklTgb8EptXV/x64pggyko6R9M4W5n0ksB94\nWtIUSR8ATmoYZ0DSJyRNlfRH1FbFvxsRO4B7gb+VdFTxBeMbJJ3VwnxH/rLSb0v6naKH6cAN1NZ4\nHh/va9noHPzyuRb4K0nPSPpM8dyIJWhEPAdcBiynFohfMHJTYBm1pfC9knYDP2DkdvqoIuJx4Hpg\nHbUPoDcDaxtGWwf8JvA0cDWwOCJ2FbX3UfsA2gA8A9wJzB5tXpK+K+mKMVoZAO6gtnr/U+DXqH3h\nmcNuzJ6QL8Rhlh8v8c0y5OCbZcjBN8tQW8GXdK6kjZI2Sfp8p5oys+6a8Jd7xVFUm4CzgSeBh4D3\nRMTGhvH87aFZn0TEqMc+tLPEPx34SURsKQ4L/TZw/hgzP/hYunTpiJ/L9nB/k7e/MvfWjf5S2gn+\nsdSO1jpgG798VJeZlZC/3DPLUDvH6v+c2nHaBxzHL5/AAcDg4ODB4de9rqXTwvumUqn0u4Uk9zdx\nZe4N2u+vWq1SrVZbGredL/cOoXZK5tnUzvh6ELg4aod91o8XE52HmU2cJGKML/cmvMSPiFckfZza\niRlTgOWNoTezcur6sfpe4pv1R2qJ7y/3zDLk4JtlyME3y5CDb5YhB98sQw6+WYYcfLMMOfhmGXLw\nzTLk4JtlyME3y5CDb5YhB98sQw6+WYYcfLMMOfhmGXLwzTLk4JtlyME3y5CDb5YhB98sQw6+WYYc\nfLMMOfhmGXLwzTLk4JtlyME3y5CDb5YhB98sQw6+WYYcfLMMTW1nYkmbgd3AfuCliDi9E01Z5+zf\nvz9Z37dvX1fnv2LFimR97969yfqGDRuS9RtvvDFZX7JkSbL+5S9/OVl/zWtek6xff/31yfpHP/rR\nZL1f2go+tcBXImJXJ5oxs95od1VfHXgNM+uxdkMbwH2SHpL0o
U40ZGbd1+6q/vyI2C7pGGofAI9H\nxNrGkQYHBw8OVyoVKpVKm7M1s0bVapVqtdrSuG0FPyK2F//ulLQKOB1IBt/MuqNxoTo0NDTmuBNe\n1Zd0hKTpxfCRwCLgsYm+npn1TjtL/AFglaQoXuebEXFvZ9oys26acPAj4mfAvA72Mint3r07WX/l\nlVeS9fXr1yfr996b/qx99tlnk/Wbb745We+3uXPnJuuf/exnk/Xly5cn60cffXSyvmDBgmR94cKF\nyXpZeVecWYYcfLMMOfhmGXLwzTLk4JtlyME3y5CDb5YhRUR3ZyBFt+fRT9u2bUvW581LH+qwa1fe\nZzRPmZJe9tx3333JerPz5ZuZNWtWsj59+vRk/Zhjjmlr/t0kiYjQaDUv8c0y5OCbZcjBN8uQg2+W\nIQffLEMOvlmGHHyzDLV7zb3szZw5M1kfGBhI1su+H3/RokXJerPff+XKlcn6YYcdlqz7+ozd4SW+\nWYYcfLMMOfhmGXLwzTLk4JtlyME3y5CDb5Yh78dvU7Pzwb/+9a8n63fddVeyfsYZZyTrixcvTtab\nOfPMM5P11atXJ+vTpk1L1nfs2JGsL1u2LFm37vAS3yxDDr5Zhhx8sww5+GYZcvDNMuTgm2XIwTfL\nUNPr6ktaDrwDGI6Ik4vnZgB3AHOAzcBFETHqjeAn+3X127Vv375kvdl+8iVLliTr1113XbL+/e9/\nP1k/66yzknUrr3avq38b8LaG564A7o+IE4HvAV9or0Uz66WmwY+ItUDjZWLOB1YUwyuACzrcl5l1\n0US38WdFxDBAROwA0vchMrNS6dSx+smN+MHBwYPDlUrF11Ez64JqtUq1Wm1p3IkGf1jSQEQMS5oN\nPJUauT74ZtYdjQvVoaGhMcdtdVVfxeOAe4D3F8OXAulTuMysVJoGX9K3gB8AJ0h6QtIHgGuBt0r6\nMXB28bOZvUo0XdWPiEvGKJ3T4V6y1Oy68s3MmDGjrelvuummZH3BggXJujTqbmIrOR+5Z5YhB98s\nQw6+WYYcfLMMOfhmGXLwzTLk4JtlqOn5+G3PwOfjd9WLL76YrF9yyViHYdSsWrUqWV+/fn2yftJJ\nJyXr1j/tno9vZpOMg2+WIQffLEMOvlmGHHyzDDn4Zhly8M0y5P34k9wzzzyTrL/xjW9M1l//+tcn\n6xdckL7A8vz585P1d73rXcm6z/efOO/HN7MRHHyzDDn4Zhly8M0y5OCbZcjBN8uQg2+WIe/Hz9yD\nDz6YrJ977rnJ+u7du9ua/6233pqsL168OFmfPn16W/OfzLwf38xGcPDNMuTgm2XIwTfLkINvliEH\n3yxDDr5Zhprux5e0HHgHMBwRJxfPLQU+BDxVjLYkIv55jOm9H/9VbPv27cn65Zdfnqzfeeedbc3/\nyiuvTNY/97nPJetHHXVUW/N/NWt3P/5twNtGef6GiDileIwaejMrp6bBj4i1wK5RSr40itmrVDvb\n+B+X9Iikr0k6umMdmVnXTZ3gdF8FroqIkPTXwA3An4418uDg4MHhSqVCpVKZ4GzNbCzVapVqtdrS\nuBMKfkTsrPvxFmBNavz64JtZdzQuVIeGhsYct9VVfVG3TS9pdl3t3cBj4+rQzPqq6RJf0reACjBT\n0hPAUuAtkuYB+4HNwEe62KOZdZjPx7e2vPDCC8n6unXrkvVzzjknWW/2t3PhhRcm63fccUeyPpn5\nfHwzG8HBN8uQg2+WIQffLEMOvlmGHHyzDDn4Zhnyfnzrq8MOOyxZf/nll5P1qVPTx6A9+uijyfqJ\nJ56YrL+aeT++mY3g4JtlyME3y5CDb5YhB98sQw6+WYYcfLMMTfSae5aJJ598MllfuXJlsv7AAw8k\n68320zdz2mmnJesnnHBCW68/WXmJb5YhB98sQw6+WYYcfLMMOfhmGXLwzTLk4JtlyPvxJ7mdO3cm\n61/5yleS9dtuuy1Z37Zt2
7h7Go9DDjkkWZ87d26yLvmmzqPxEt8sQw6+WYYcfLMMOfhmGXLwzTLk\n4JtlyME3y1DT/fiSjgNuBwaA/cAtEXGTpBnAHcAcYDNwUUTs7mKvWdqzZ0+yvmbNmmT9qquuStY3\nbdo07p46aeHChcn6tddem6yfeuqpnWwnG60s8V8GPhMRbwbOAD4m6U3AFcD9EXEi8D3gC91r08w6\nqWnwI2JHRDxSDO8BHgeOA84HVhSjrQAu6FaTZtZZ49rGlzQXmAesAwYiYhhqHw7ArE43Z2bd0fKx\n+pKmA3cBn4qIPZIab4g35g3yBgcHDw5XKhUqlcr4ujSzpqrVKtVqtaVxWwq+pKnUQv+NiFhdPD0s\naSAihiXNBp4aa/r64JtZdzQuVIeGhsYct9VV/VuBDRGxrO65e4D3F8OXAqsbJzKzcmpld9584E+A\nH0p6mNoq/RLgS8B3JH0Q2AJc1M1Gzaxz1O1710uKbs+jzPbu3Zusb926NVl/73vfm6w//PDD4+6p\nkxYtWpSsp1Y3ofl18X0+/cRJIiJGfQN95J5Zhhx8sww5+GYZcvDNMuTgm2XIwTfLkINvliFfV7+J\n559/Pln/9Kc/nayvXbs2Wd+4ceO4e+qk8847L1n/4he/mKzPmzcvWT/00EPH3ZN1n5f4Zhly8M0y\n5OCbZcjBN8uQg2+WIQffLEMOvlmGJv1+/M2bNyfr11xzTbJ+//33J+tbtmwZb0sddcQRRyTrV199\ndbJ+2WWXJevTpk0bd09Wfl7im2XIwTfLkINvliEH3yxDDr5Zhhx8sww5+GYZmvT78e++++5kffny\n5V2d/ymnnJKsX3zxxcn61Knp/6IPf/jDyfrhhx+erFuevMQ3y5CDb5YhB98sQw6+WYYcfLMMOfhm\nGWoafEnHSfqepB9J+qGkTxTPL5W0TdJ/F49zu9+umXWCmt27XtJsYHZEPCJpOvBfwPnAHwO/iIgb\nmkwfzeZhZp0niYjQaLWmB/BExA5gRzG8R9LjwLEHXrtjXZpZz4xrG1/SXGAe8J/FUx+X9Iikr0k6\nusO9mVmXtBz8YjX/LuBTEbEH+CrwhoiYR22NILnKb2bl0dKx+pKmUgv9NyJiNUBE7Kwb5RZgzVjT\nDw4OHhyuVCpUKpUJtGpmKdVqlWq12tK4Tb/cA5B0O/B0RHym7rnZxfY/ki4HTouIS0aZ1l/umfVB\n6su9Vr7Vnw/8O/BDIIrHEuASatv7+4HNwEciYniU6R18sz5oK/gdmLmDb9YHqeD7yD2zDDn4Zhly\n8M0y5OCbZcjBN8uQg2+WIQffLEMOvlmGHHyzDDn4Zhly8M0y5OCbZajnwW/1fOF+cX/tKXN/Ze4N\netufg9/A/bWnzP2VuTeY5ME3s/5z8M0y1JMLcXR1BmY2pr5dgcfMyser+mYZcvDNMtSz4Es6V9JG\nSZskfb5X822VpM2S1kt6WNKDJehnuaRhSY/WPTdD0r2SfizpX/p596Ix+ivNjVRHudnrJ4vnS/Ee\n9vtmtD3Zxpc0BdgEnA08CTwEvCciNnZ95i2S9D/AqRGxq9+9AEg6E9gD3B4RJxfPfQn434i4rvjw\nnBERV5Sov6W0cCPVXkjc7PUDlOA9bPdmtO3q1RL/dOAnEbElIl4Cvk3tlywTUaJNn4hYCzR+CJ0P\nrCiGVwAX9LSpOmP0ByW5kWpE7IiIR4rhPcDjwHGU5D0co7+e3Yy2V3/oxwJb637exv//kmURwH2S\nHpL0oX43M4ZZB25aUtzFaFaf+xlN6W6kWnez13XAQNnew37cjLY0S7gSmB8RpwDnAR8rVmXLrmz7\nYkt3I9VRbvba+J719T3s181oexX8nwPH1/18XPFcaUTE9uLfncAqapsnZTMsaQAObiM+1ed+RoiI\nnXW3TboFOK2f/Yx2s1dK9B6OdTPaXryHvQr+Q8BvSJojaRrwHuCeHs27KUlHFJ+8SDoSWAQ
81t+u\ngNq2Xv323j3A+4vhS4HVjRP02Ij+iiAd8G76/x7eCmyIiGV1z5XpPfyl/nr1HvbsyL1it8Qyah82\nyyPi2p7MuAWSfp3aUj6o3Tr8m/3uT9K3gAowExgGlgL/CNwJ/BqwBbgoIp4tUX9voYUbqfaov7Fu\n9vog8B36/B62ezPatufvQ3bN8uMv98wy5OCbZcjBN8uQg2+WIQffLEMOvlmGHHyzDDn4Zhn6PyWk\n17mut7UnAAAAAElFTkSuQmCC\n", "text/plain": [ "