{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data generators\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In Python, a generator is a function that behaves like an iterator. It will return the next item. Here is a [link](https://wiki.python.org/moin/Generators) to review python generators. In many AI applications, it is advantageous to have a data generator to handle loading and transforming data for different applications. \n", "\n", "You will now implement a custom data generator, using a common pattern that you will use during all assignments of this course.\n", "In the following example, we use a set of samples `a`, to derive a new set of samples, with more elements than the original set.\n", "\n", "**Note: Pay attention to the use of list `lines_index` and variable `index` to traverse the original list.**" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1, 2, 3, 4, 1, 2, 3, 4, 1, 2]\n" ] } ], "source": [ "import random \n", "import numpy as np\n", "\n", "# Example of traversing a list of indexes to create a circular list\n", "a = [1, 2, 3, 4]\n", "b = [0] * 10\n", "\n", "a_size = len(a)\n", "b_size = len(b)\n", "lines_index = [*range(a_size)] # is equivalent to [i for i in range(0,a_size)], the difference being the advantage of using * to pass values of range iterator to list directly\n", "index = 0 # similar to index in data_generator below\n", "for i in range(b_size): # `b` is longer than `a` forcing a wrap\n", " # We wrap by resetting index to 0 so the sequences circle back at the end to point to the first index\n", " if index >= a_size:\n", " index = 0\n", " \n", " b[i] = a[lines_index[index]] # `indexes_list[index]` point to a index of a. Store the result in b\n", " index += 1\n", " \n", "print(b)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Shuffling the data order\n", "\n", "In the next example, we will do the same as before, but shuffling the order of the elements in the output list. Note that here, our strategy of traversing using `lines_index` and `index` becomes very important, because we can simulate a shuffle in the input data, without doing that in reality.\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Original order of index: [0, 1, 2, 3]\n", "Shuffled order of index: [2, 3, 1, 0]\n", "New value order for first batch: [3, 4, 2, 1]\n", "\n", "Shuffled Indexes for Batch No.2 :[3, 1, 0, 2]\n", "Values for Batch No.2 :[4, 2, 1, 3]\n", "\n", "Shuffled Indexes for Batch No.3 :[3, 2, 1, 0]\n", "Values for Batch No.3 :[4, 3, 2, 1]\n", "\n", "Final value of b: [3, 4, 2, 1, 4, 2, 1, 3, 4, 3]\n" ] } ], "source": [ "# Example of traversing a list of indexes to create a circular list\n", "a = [1, 2, 3, 4]\n", "b = []\n", "\n", "a_size = len(a)\n", "b_size = 10\n", "lines_index = [*range(a_size)]\n", "print(\"Original order of index:\",lines_index)\n", "\n", "# if we shuffle the index_list we can change the order of our circular list\n", "# without modifying the order or our original data\n", "random.shuffle(lines_index) # Shuffle the order\n", "print(\"Shuffled order of index:\",lines_index)\n", "\n", "print(\"New value order for first batch:\",[a[index] for index in lines_index])\n", "batch_counter = 1\n", "index = 0 # similar to index in data_generator below\n", "for i in range(b_size): # `b` is longer than `a` forcing a wrap\n", " # We wrap by resetting index to 0\n", " if index >= a_size:\n", " index = 0\n", " batch_counter += 1\n", " random.shuffle(lines_index) # Re-shuffle the order\n", " print(\"\\nShuffled Indexes for Batch No.{} :{}\".format(batch_counter,lines_index))\n", " print(\"Values for Batch No.{} :{}\".format(batch_counter,[a[index] for index in lines_index]))\n", " \n", " b.append(a[lines_index[index]]) # `indexes_list[index]` point to a index of a. Store the result in b\n", " index += 1\n", "print() \n", "print(\"Final value of b:\",b)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Note: We call an epoch each time that an algorithm passes over all the training examples. Shuffling the examples for each epoch is known to reduce variance, making the models more general and overfit less.**\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "### Exercise\n", "\n", "**Instructions:** Implement a data generator function that takes in `batch_size, x, y shuffle` where x could be a large list of samples, and y is a list of the tags associated with those samples. Return a subset of those inputs in a tuple of two arrays `(X,Y)`. Each is an array of dimension (`batch_size`). If `shuffle=True`, the data will be traversed in a random form.\n", "\n", "**Details:**\n", "\n", "This code as an outer loop \n", "```\n", "while True: \n", "... \n", "yield((X,Y)) \n", "```\n", "\n", "Which runs continuously in the fashion of generators, pausing when yielding the next values. We will generate a batch_size output on each pass of this loop. \n", "\n", "It has an inner loop that stores in temporal lists (X, Y) the data samples to be included in the next batch.\n", "\n", "There are three slightly out of the ordinary features. \n", "\n", "1. The first is the use of a list of a predefined size to store the data for each batch. Using a predefined size list reduces the computation time if the elements in the array are of a fixed size, like numbers. If the elements are of different sizes, it is better to use an empty array and append one element at a time during the loop.\n", "\n", "2. The second is tracking the current location in the incoming lists of samples. Generators variables hold their values between invocations, so we create an `index` variable, initialize to zero, and increment by one for each sample included in a batch. However, we do not use the `index` to access the positions of the list of sentences directly. Instead, we use it to select one index from a list of indexes. In this way, we can change the order in which we traverse our original list, keeping untouched our original list. \n", "\n", "3. The third also relates to wrapping. Because `batch_size` and the length of the input lists are not aligned, gathering a batch_size group of inputs may involve wrapping back to the beginning of the input loop. In our approach, it is just enough to reset the `index` to 0. We can re-shuffle the list of indexes to produce different batches each time." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def data_generator(batch_size, data_x, data_y, shuffle=True):\n", " '''\n", " Input: \n", " batch_size - integer describing the batch size\n", " data_x - list containing samples\n", " data_y - list containing labels\n", " shuffle - Shuffle the data order\n", " Output:\n", " a tuple containing 2 elements:\n", " X - list of dim (batch_size) of samples\n", " Y - list of dim (batch_size) of labels\n", " '''\n", " \n", " data_lng = len(data_x) # len(data_x) must be equal to len(data_y)\n", " index_list = [*range(data_lng)] # Create a list with the ordered indexes of sample data\n", " \n", " \n", " # If shuffle is set to true, we traverse the list in a random way\n", " if shuffle:\n", " random.shuffle(index_list) # Inplace shuffle of the list\n", " \n", " index = 0 # Start with the first element\n", " # START CODE HERE \n", " # Fill all the None values with code taking reference of what you learned so far\n", " while True:\n", " X = [0] * batch_size # We can create a list with batch_size elements. \n", " Y = [0] * batch_size # We can create a list with batch_size elements. \n", " \n", " for i in range(batch_size):\n", " \n", " # Wrap the index each time that we reach the end of the list\n", " if index >= data_lng:\n", " index = 0\n", " # Shuffle the index_list if shuffle is true\n", " if shuffle:\n", " rnd.shuffle(index_list) # re-shuffle the order\n", " \n", " X[i] = data_x[index_list[index]] # We set the corresponding element in x\n", " Y[i] = data_y[index_list[index]] # We set the corresponding element in x\n", " # END CODE HERE \n", " index += 1\n", " \n", " yield((X, Y))\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If your function is correct, all the tests must pass. " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\u001b[92mAll tests passed!\n" ] } ], "source": [ "def test_data_generator():\n", " x = [1, 2, 3, 4]\n", " y = [xi ** 2 for xi in x]\n", " \n", " generator = data_generator(3, x, y, shuffle=False)\n", "\n", " assert np.allclose(next(generator), ([1, 2, 3], [1, 4, 9])), \"First batch does not match\"\n", " assert np.allclose(next(generator), ([4, 1, 2], [16, 1, 4])), \"Second batch does not match\"\n", " assert np.allclose(next(generator), ([3, 4, 1], [9, 16, 1])), \"Third batch does not match\"\n", " assert np.allclose(next(generator), ([2, 3, 4], [4, 9, 16])), \"Fourth batch does not match\"\n", "\n", " print(\"\\033[92mAll tests passed!\")\n", "\n", "test_data_generator()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you could not solve the exercise, just run the next code to see the answer." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "def data_generator(batch_size, data_x, data_y, shuffle=True):\n", "\n", " data_lng = len(data_x) # len(data_x) must be equal to len(data_y)\n", " index_list = [*range(data_lng)] # Create a list with the ordered indexes of sample data\n", " \n", " # If shuffle is set to true, we traverse the list in a random way\n", " if shuffle:\n", " rnd.shuffle(index_list) # Inplace shuffle of the list\n", " \n", " index = 0 # Start with the first element\n", " while True:\n", " X = [0] * batch_size # We can create a list with batch_size elements. \n", " Y = [0] * batch_size # We can create a list with batch_size elements. \n", " \n", " for i in range(batch_size):\n", " \n", " # Wrap the index each time that we reach the end of the list\n", " if index >= data_lng:\n", " index = 0\n", " # Shuffle the index_list if shuffle is true\n", " if shuffle:\n", " rnd.shuffle(index_list) # re-shuffle the order\n", " \n", " X[i] = data_x[index_list[index]] \n", " Y[i] = data_y[index_list[index]] \n", " \n", " index += 1\n", " \n", " yield((X, Y))\n" ] } ], "source": [ "import base64\n", "\n", "solution = \"ZGVmIGRhdGFfZ2VuZXJhdG9yKGJhdGNoX3NpemUsIGRhdGFfeCwgZGF0YV95LCBzaHVmZmxlPVRydWUpOgoKICAgIGRhdGFfbG5nID0gbGVuKGRhdGFfeCkgIyBsZW4oZGF0YV94KSBtdXN0IGJlIGVxdWFsIHRvIGxlbihkYXRhX3kpCiAgICBpbmRleF9saXN0ID0gWypyYW5nZShkYXRhX2xuZyldICMgQ3JlYXRlIGEgbGlzdCB3aXRoIHRoZSBvcmRlcmVkIGluZGV4ZXMgb2Ygc2FtcGxlIGRhdGEKICAgIAogICAgIyBJZiBzaHVmZmxlIGlzIHNldCB0byB0cnVlLCB3ZSB0cmF2ZXJzZSB0aGUgbGlzdCBpbiBhIHJhbmRvbSB3YXkKICAgIGlmIHNodWZmbGU6CiAgICAgICAgcm5kLnNodWZmbGUoaW5kZXhfbGlzdCkgIyBJbnBsYWNlIHNodWZmbGUgb2YgdGhlIGxpc3QKICAgIAogICAgaW5kZXggPSAwICMgU3RhcnQgd2l0aCB0aGUgZmlyc3QgZWxlbWVudAogICAgd2hpbGUgVHJ1ZToKICAgICAgICBYID0gWzBdICogYmF0Y2hfc2l6ZSAjIFdlIGNhbiBjcmVhdGUgYSBsaXN0IHdpdGggYmF0Y2hfc2l6ZSBlbGVtZW50cy4gCiAgICAgICAgWSA9IFswXSAqIGJhdGNoX3NpemUgIyBXZSBjYW4gY3JlYXRlIGEgbGlzdCB3aXRoIGJhdGNoX3NpemUgZWxlbWVudHMuIAogICAgICAgIAogICAgICAgIGZvciBpIGluIHJhbmdlKGJhdGNoX3NpemUpOgogICAgICAgICAgICAKICAgICAgICAgICAgIyBXcmFwIHRoZSBpbmRleCBlYWNoIHRpbWUgdGhhdCB3ZSByZWFjaCB0aGUgZW5kIG9mIHRoZSBsaXN0CiAgICAgICAgICAgIGlmIGluZGV4ID49IGRhdGFfbG5nOgogICAgICAgICAgICAgICAgaW5kZXggPSAwCiAgICAgICAgICAgICAgICAjIFNodWZmbGUgdGhlIGluZGV4X2xpc3QgaWYgc2h1ZmZsZSBpcyB0cnVlCiAgICAgICAgICAgICAgICBpZiBzaHVmZmxlOgogICAgICAgICAgICAgICAgICAgIHJuZC5zaHVmZmxlKGluZGV4X2xpc3QpICMgcmUtc2h1ZmZsZSB0aGUgb3JkZXIKICAgICAgICAgICAgCiAgICAgICAgICAgIFhbaV0gPSBkYXRhX3hbaW5kZXhfbGlzdFtpbmRleF1dIAogICAgICAgICAgICBZW2ldID0gZGF0YV95W2luZGV4X2xpc3RbaW5kZXhdXSAKICAgICAgICAgICAgCiAgICAgICAgICAgIGluZGV4ICs9IDEKICAgICAgICAKICAgICAgICB5aWVsZCgoWCwgWSkp\"\n", "\n", "# Print the solution to the given assignment\n", "print(base64.b64decode(solution).decode(\"utf-8\"))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Hope you enjoyed this tutorial on data generators which will help you with the assignments in this course." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 4 }