{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Loading Data\n", "\n", "This tutorial we focus on how to feeding data into a training and inference program. We can manually copy data into a binded symbol as shown in the [mixed programming](./mixed.ipynb). Most training and inference modules in MXNet accepts data iterators, which simplifies this procedure, especially when reading large datasets from filesystems. Here we discuss the API conventions and several provided iterators. \n", "\n", "## Basic Data Iterator\n", "\n", "Data iterators in MXNet is similar to the iterator in Python. In Python, we can use the built-in function `iter` with an iterable object (such as list) to return an iterator. For example, in `x = iter([1, 2, 3])` we obtain an iterator on the list `[1,2,3]`. If we repeatedly call `x.next()` (`__next__()` for Python 3), then we will get elements from the list one by one, and end with a `StopIteration` exception. \n", "\n", "MXNet's data iterator returns a batch of data in each `next` call. We first introduce what a data batch looks like and then how to write a basic data iterator.\n", "\n", "### Data Batch\n", "\n", "A data batch often contains *n* examples and the according labels. Here *n* is often called as the batch size. \n", "\n", "The following codes defines a valid data batch is able to be read by most training/inference modules. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "class SimpleBatch(object):\n", " def __init__(self, data, label, pad=0):\n", " self.data = data\n", " self.label = label\n", " self.pad = pad" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We explain what each attribute means:\n", "\n", "- **`data`** is a list of NDArray, each of them has $n$ length first dimension. For example, if an example is an image with size $224 \\times 224$ and RGB channels, then the array shape should be `(n, 3, 224, 244)`. Note that the image batch format used by MXNet is \n", "\n", " $$\\textrm{batch_size} \\times \\textrm{num_channel} \\times \\textrm{height} \\times \\textrm{width}$$\n", " The channels are often in RGB order.\n", "\n", " Each array will be copied into a free variable of the Symbol later. The mapping from arrays to free variables should be given by the `provide_data` attribute of the iterator, which will be discussed shortly. \n", " \n", "- **`label`** is also a list of NDArray. Often each NDArray is a 1-dimensional array with shape `(n,)`. For classification, each class is represented by an integer starting from 0.\n", "\n", "- **`pad`** is an integer shows how many examples are for merely used for padding, which should be ignored in the results. A nonzero padding is often used when we reach the end of the data and the total number of examples cannot be divided by the batch size. \n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Symbol and Data Variables \n", "\n", "Before moving the iterator, we first look at how to find which variables in a Symbol are for input data. In MXNet, an operator (`mx.sym.*`) has one or more input variables and output variables; some operators may have additional auxiliary variables for internal states. For an input variable of an operator, if do not assign it with an output of another operator during creating this operator, then this input variable is free. We need to assign it with external data before running. 
\n", "\n", "The following codes define a simple multilayer perceptron (MLP) and then print all free variables." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['data', 'fc1_weight', 'fc1_bias', 'fc2_weight', 'fc2_bias', 'softmax_label']\n", "['softmax_output']\n" ] } ], "source": [ "import mxnet as mx\n", "num_classes = 10\n", "net = mx.sym.Variable('data')\n", "net = mx.sym.FullyConnected(data=net, name='fc1', num_hidden=64)\n", "net = mx.sym.Activation(data=net, name='relu1', act_type=\"relu\")\n", "net = mx.sym.FullyConnected(data=net, name='fc2', num_hidden=num_classes)\n", "net = mx.sym.SoftmaxOutput(data=net, name='softmax')\n", "print(net.list_arguments())\n", "print(net.list_outputs())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As can be seen, we name a variable either by its operator's name if it is atomic (e.g. `sym.Variable`) or by the `opname_varname` convention. The `varname` often means what this variable is for:\n", "- `weight` : the weight parameters\n", "- `bias` : the bias parameters\n", "- `output` : the output\n", "- `label` : input label\n", "\n", "On the above example, now we know that there are 4 variables for parameters, and two for input data: `data` for examples and `softmax_label` for the according labels. \n", "\n", "The following example define a matrix factorization object function with rank 10 for recommendation systems. It has three input variables, `user` for user IDs, `item` for item IDs, and `score` is the rating `user` gives to `item`. " ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "num_users = 1000\n", "num_items = 1000\n", "k = 10 \n", "user = mx.symbol.Variable('user')\n", "item = mx.symbol.Variable('item')\n", "score = mx.symbol.Variable('score')\n", "# user feature lookup\n", "user = mx.symbol.Embedding(data = user, input_dim = num_users, output_dim = k) \n", "# item feature lookup\n", "item = mx.symbol.Embedding(data = item, input_dim = num_items, output_dim = k)\n", "# predict by the inner product, which is elementwise product and then sum\n", "pred = user * item\n", "pred = mx.symbol.sum_axis(data = pred, axis = 1)\n", "pred = mx.symbol.Flatten(data = pred)\n", "# loss layer\n", "pred = mx.symbol.LinearRegressionOutput(data = pred, label = score)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data Iterators\n", "\n", "Now we are ready to show how to create a valid MXNet data iterator. An iterator should \n", "1. return a data batch or raise a `StopIteration` exception if reaching the end when call `next()` in python 2 or `__next()__` in python 3\n", "2. has `reset()` method to restart reading from the beginning\n", "3. has `provide_data` and `provide_label` attributes, the former returns a list of `(str, tuple)` pairs, each pair stores an input data variable name and its shape. It is similar for `provide_label`, which provides information about input labels.\n", "\n", "The following codes define a simple iterator that return some random data each time. 
" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "class SimpleIter:\n", " def __init__(self, data_names, data_shapes, data_gen,\n", " label_names, label_shapes, label_gen, num_batches=10):\n", " self._provide_data = zip(data_names, data_shapes)\n", " self._provide_label = zip(label_names, label_shapes)\n", " self.num_batches = num_batches\n", " self.data_gen = data_gen\n", " self.label_gen = label_gen\n", " self.cur_batch = 0\n", "\n", " def __iter__(self):\n", " return self\n", "\n", " def reset(self):\n", " self.cur_batch = 0 \n", "\n", " def __next__(self):\n", " return self.next()\n", "\n", " @property\n", " def provide_data(self):\n", " return self._provide_data\n", "\n", " @property\n", " def provide_label(self):\n", " return self._provide_label\n", "\n", " def next(self):\n", " if self.cur_batch < self.num_batches:\n", " self.cur_batch += 1\n", " data = [mx.nd.array(g(d[1])) for d,g in zip(self._provide_data, self.data_gen)]\n", " assert len(data) > 0, \"Empty batch data.\"\n", " label = [mx.nd.array(g(d[1])) for d,g in zip(self._provide_label, self.label_gen)]\n", " assert len(label) > 0, \"Empty batch label.\"\n", " return SimpleBatch(data, label)\n", " else:\n", " raise StopIteration" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "Now we can feed the data iterator into a training problem. Here we used the `Module` class, more details about this class is discussed in [module.ipynb](./module.ipynb)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:root:Epoch[0] Train-accuracy=0.090625\n", "INFO:root:Epoch[0] Time cost=0.011\n", "INFO:root:Epoch[1] Train-accuracy=0.078125\n", "INFO:root:Epoch[1] Time cost=0.014\n", "INFO:root:Epoch[2] Train-accuracy=0.087500\n", "INFO:root:Epoch[2] Time cost=0.014\n", "INFO:root:Epoch[3] Train-accuracy=0.100000\n", "INFO:root:Epoch[3] Time cost=0.014\n", "INFO:root:Epoch[4] Train-accuracy=0.115625\n", "INFO:root:Epoch[4] Time cost=0.014\n" ] } ], "source": [ "# @@@ AUTOTEST_OUTPUT_IGNORED_CELL\n", "import logging\n", "logging.basicConfig(level=logging.INFO)\n", "\n", "n = 32\n", "data = SimpleIter(['data'], [(n, 100)], \n", " [lambda s: np.random.uniform(-1, 1, s)],\n", " ['softmax_label'], [(n,)], \n", " [lambda s: np.random.randint(0, num_classes, s)])\n", "\n", "mod = mx.mod.Module(symbol=net)\n", "mod.fit(data, num_epoch=5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While for Symbol `pred`, we need to provide three inputs, two for examples and one for label." 
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "INFO:root:Epoch[0] Train-accuracy=0.190625\n", "INFO:root:Epoch[0] Time cost=0.009\n", "INFO:root:Epoch[1] Train-accuracy=0.209375\n", "INFO:root:Epoch[1] Time cost=0.009\n", "INFO:root:Epoch[2] Train-accuracy=0.187500\n", "INFO:root:Epoch[2] Time cost=0.011\n", "INFO:root:Epoch[3] Train-accuracy=0.246875\n", "INFO:root:Epoch[3] Time cost=0.009\n", "INFO:root:Epoch[4] Train-accuracy=0.175000\n", "INFO:root:Epoch[4] Time cost=0.009\n" ] } ], "source": [ "# @@@ AUTOTEST_OUTPUT_IGNORED_CELL\n", "data = SimpleIter(['user', 'item'],\n", " [(n,), (n,)],\n", " [lambda s: np.random.randint(0, num_users, s),\n", " lambda s: np.random.randint(0, num_items, s)],\n", " ['score'], [(n,)],\n", " [lambda s: np.random.randint(0, 5, s)])\n", "\n", "mod = mx.mod.Module(symbol=pred, data_names=['user', 'item'], label_names=['score'])\n", "mod.fit(data, num_epoch=5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## More Iterators \n", "\n", "MXNet provides multiple efficient data iterators.\n", "\n", "TODO. Explain more here.\n", "\n", "## Implementation\n", "\n", "Iterators can be implemented in either C++ or front-end languages such as Python. The C++ definition is at [include/mxnet/io.h](https://github.com/dmlc/mxnet/blob/master/include/mxnet/io.h), all C++ implementations are located in [src/io](https://github.com/dmlc/mxnet/tree/master/src/io). These implementations heavily rely on [dmlc-core](https://github.com/dmlc/dmlc-core), which supports reading data from various data format and filesystems. \n", "\n", "\n", "## Further Readings\n", "\n", "- [Data loading API](http://mxnet.io/packages/python/io.html)\n", "- [Design of efficient data format](http://mxnet.io/system/note_data_loading.html)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [], "source": [ "train = mx.io.NDArrayIter(data=np.zeros((1200, 3, 224, 224), dtype='float32'), \n", " label={'annot': np.zeros((1200, 80), dtype='int8'),\n", " 'label': np.zeros((1200, 1), dtype='int8')}, \n", " batch_size=10)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "a\n" ] } ], "source": [ "print \"a\"" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 1 }