{ "cells": [ { "cell_type": "markdown", "metadata": { "toc": true }, "source": [ "

Table of Contents

\n", "
" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# code for loading the format for the notebook\n", "import os\n", "\n", "# path : store the current path to convert back to it later\n", "path = os.getcwd()\n", "os.chdir(os.path.join('..', '..', 'notebook_format'))\n", "\n", "from formats import load_style\n", "load_style(plot_style=False)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [] }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Author: Ethen\n", "\n", "Last updated: 2023-11-14\n", "\n", "Python implementation: CPython\n", "Python version : 3.9.12\n", "IPython version : 8.2.0\n", "\n" ] } ], "source": [ "os.chdir(path)\n", "\n", "# magic to print version\n", "%load_ext watermark\n", "%watermark -a 'Ethen' -d -u -v" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Understanding Iterables, Iterators and Generators" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Iterables, Iterators" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's begin by looking at a simple for loop." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\n", "2\n", "3\n" ] } ], "source": [ "x = [1, 2, 3]\n", "for element in x:\n", " print(element)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It functions as expected. Our code prints the number 1, 2, 3 and then terminates. In this document, we'll dive deeper into what's happening behind the scenes. i.e. How does the loop construct fetch individual elements from object it's looping over and determine when to stop?\n", "\n", "The short answer to this question is Python's iterator protocol:\n", "\n", "> Objects that support __iter__ and __next__ dunder methods automatically work with for-in loops.\n", "\n", "Let's first introduce some terminologies.\n", "\n", "- An **iterator** is:\n", " - Object which defines a `__next__` method and will produce the next value when we call `next()` on it. If there are no more items, it raises `StopIteration` exception.\n", " - Object that is self-iterable (meaning that it has an __iter__ method that returns self).\n", "- An **iterable** is anything that can be looped over. It either:\n", " - Has an `__iter__` method which returns an iterator for that object when you call iter() on it, or implicitly in a for loop.\n", " - Defines a `__getitem__` method that can take sequential indexes starting from zero (and raises an IndexError when the indexes are no longer valid)\n", "\n", "Don't worry if it's a bit abstract at first glance. It will becomes much clearer as we work through a couple of examples. When Python sees a statement like `for obj in object` it will first call `iter(object)` to make it a iterator." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "\n" ] } ], "source": [ "# the built-in function iter takes an\n", "# iterable object and returns an iterator\n", "iterator = iter(x)\n", "\n", "# here x is the iterable\n", "print(type(x))\n", "print(type(iterator))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we can loop through all available elements using `next` built-in function." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\n", "2\n", "3\n" ] } ], "source": [ "# each time we call the next method on the iterator,\n", "# it will give us the next element\n", "print(next(iterator))\n", "print(next(iterator))\n", "print(next(iterator))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But notice what happens if we call `next` on the iterator again." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "StopIteration raised\n" ] } ], "source": [ "# if there are no more elements in the iterator,\n", "# it raises a StopIteration exception\n", "try:\n", " print(next(iterator))\n", "except StopIteration as e:\n", " print('StopIteration raised')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It raises a StopIteration exception to signal we've exhausted all available values in the iterator. Based on this experiment, we now know iterators use exceptions to structure control flow. To signal the end of iteration, a Python iterator raises the built-in StopIteration exception. To sum it up, when we write:\n", "\n", "```python\n", "x = [1, 2, 3]\n", "for element in x:\n", " ...\n", "\n", "```\n", "\n", "This is what's actually happening under the hood:\n", "\n", "\n", "\n", "Now, let's dive deeper. We'll implement a class that enables us to prints a value for some maximum number of times and use it in a `for obj in object` loop.\n", "\n", "- When we invoke `iter` function on our object, it effectively translates to calling `.__iter__()` dunder method which returns the iterator object.\n", "- Subsequent loop will iteratively call the iterator object's `__next__` method to fetch values from it.\n", "- For practical reasons, iterable class can implement both `__iter__()` and `__next__()` in the same class, and have `__iter__()` return self. This makes the class both an iterable and its own iterator. However, it is perfectly valid to return a different object as the iterator." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "class Repeat:\n", " def __init__(self, value, max_repeats):\n", " self.value = value\n", " self.max_repeats = max_repeats\n", " self._count = 0\n", "\n", " def __iter__(self):\n", " # simply return the iterator object, since\n", " # all that matters is that __iter__ returns a\n", " # object with a __next__ method, which we\n", " # will implement below\n", " return self\n", "\n", " def __next__(self):\n", " # implement the stopping criterion\n", " if self._count >= self.max_repeats:\n", " raise StopIteration\n", " \n", " self._count += 1\n", " \n", " # simply returns the same value after iteration\n", " return self.value" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hello\n", "Hello\n", "Hello\n" ] } ], "source": [ "repeater = Repeat(value='Hello', max_repeats=3)\n", "for item in repeater:\n", " print(item)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This implementation gives us the desired result. Iteration stops after max_repeats is met. This mental model will seem familiar to readers that have worked with database cursors: We first initialize the cursor and prepare it for reading, then we can fetch our data into local variables as needed from it, one element at a time. Because there will never be more than one element in memory, this approach is highly memory-efficient.\n", "\n", "Note that we can also take this class and implement it using the `iter` and `next` function way." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hello\n", "Hello\n", "Hello\n" ] } ], "source": [ "repeater = Repeat(value='Hello', max_repeats=3)\n", "iterator = iter(repeater)\n", "while True:\n", " try:\n", " item = next(iterator)\n", " except StopIteration:\n", " break\n", " \n", " print(item)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But as we can see being able to write a three-line for-in loop instead of an eight lines long while loop is quite a nice improvement. It makes the code easier to read and more maintainable. And this is the reason why iterators in Python are such a powerful tool." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generators" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A generator is an object that lazily produces values, i.e. generates values on demand. It can be either:\n", "\n", "1. A function that incorporates the `yield` keyword (yield expression).\n", " - When called, it does not execute immediately, but returns a generator object.\n", " - Upon involing `next()` method on the function, it starts the actual execution. Once this function encounters the `yield` keyword, it pauses execution at that point, saves its context and returns its value to the caller\n", " - Subsequent calls to `next()` resume execution until another yield is encountered or end of function is reached.\n", "2. A generator expression, which is syntactic construct for creating an anonymous generator object. These are like list comprehensions but enclosed in `()` instead of `[]`." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "1\n", "4\n", "0\n", "1\n", "4\n" ] } ], "source": [ "# generator expression:\n", "# note that unlike list comprehension;\n", "# the elements are lazily evaluated, i.e. they\n", "# take up less memory since they are not created\n", "# all at once, but instead return one element at\n", "# a time whenever needed\n", "nloop = 3\n", "\n", "# The generator object is now created, ready to\n", "# be iterated over\n", "generator = (x ** 2 for x in range(nloop))\n", "\n", "# the real processing happens during the iteration\n", "for value in generator:\n", " print(value)\n", " \n", "\n", "# generator function:\n", "def gen(nloop):\n", " for x in range(nloop):\n", " yield x ** 2\n", "\n", "generator = gen(nloop)\n", "for value in generator:\n", " print(value)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One important thing to note about generator is that it only produces the result a single time. In other words, once we're done iterating through our generator for the first time, we won't get any results the second time around when we iterate over an already-exhausted generator." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "# no results\n", "for value in generator:\n", " print(value)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is something that's extremely important to keep in mind when working with generators. If we wish to iterate over its content for more than once, we can always make a copy by converting it to a list." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "1\n", "4\n", "0\n", "1\n", "4\n" ] } ], "source": [ "# we can now loop through it for more than once\n", "iterable = list(gen(nloop))\n", "for value in iterable:\n", " print(value)\n", " \n", "for value in iterable:\n", " print(value)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This approach solves the problem, but to understand why this is not always ideal. We need to step back and understand the rationale behind using generators.\n", "\n", "The real advantage or true power of using generator is it gives us the ability to iterate over sequence lazily, which in turn reduces memory usage. For example, imagine a simulator producing gigabytes of data per second. Clearly we can't put everything neatly into a Python list first and then start munching, since this copy could cause our program to run out of memory and crash. Ideally, we must process the information as it comes in. The recommended way to deal with this, is to have a class that implements the `__iter__` dunder method." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "class Gen:\n", " \n", " def __init__(self, nloop):\n", " self.nloop = nloop\n", "\n", " def __iter__(self):\n", " # the method will create a iterator object\n", " # every time it is looped over, or technically\n", " # every time this __iter__ method is called, such\n", " # as when the object hits a for loop\n", " for x in range(self.nloop):\n", " yield x ** 2 " ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "1\n", "4\n", "0\n", "1\n", "4\n" ] } ], "source": [ "iterable = Gen(nloop = nloop)\n", "for value in iterable:\n", " print(value)\n", " \n", "for value in iterable:\n", " print(value)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This type of streaming approach is used a lot in the `Gensim` library. e.g. with helper class such as [LineSetence](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.LineSentence) we can feed chunks/batches of documents into the memory to train the model instead of having to load the entire corpus into memory.\n", "\n", "To wrap up, the following diagram summarizes the relationship between iterable, iterator and generator.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Reference" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- [[1]](http://www.effectivepython.com/2015/01/03/be-defensive-when-iterating-over-arguments/) Blog: Be Defensive When Iterating Over Arguments\n", "- [[2]](https://dbader.org/blog/python-iterators) Blog: Python Iterators: A Step-By-Step Introduction\n", "- [[3]](https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/) Blog: Data streaming in Python: generators, iterators, iterables\n", "- [[4]](http://nbviewer.jupyter.org/github/akittas/presentations/blob/master/pythess/func_py/func_py.ipynb) Notebook: Using functional programming in Python like a boss: Generators, Iterators and Decorators" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12" }, "toc": { "nav_menu": { "height": "84px", "width": "252px" }, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": true, "toc_position": {}, "toc_section_display": "block", "toc_window_display": true }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }