{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Revisit nested generators from\n", "[2013-06-07 道場 Scribbles](https://mail.python.org/pipermail/centraloh/2013-June/001718.html)\n", "which was used to solve\n", "[Problem #2](https://projecteuler.net/problem=2)\n", "of [Project Euler](https://en.wikipedia.org/wiki/Project_Euler).\n", "\n", "[XY](http://www.meetup.com/Central-Ohio-Python-Users-Group/members/50349262/)\n", "mentioned that I could break up the complicated ugly even_fibonacci() generators\n", "into three simple generators." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def even_fibonacci(maximum):\n", " a, b = 0, 1\n", " while True:\n", " if a > maximum:\n", " break\n", " if a % 2 == 0:\n", " yield a\n", " a, b = b, a + b" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "maximum = 4000000 # per Problem #2 of Project Euler" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "even_fibonacci(maximum)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[0, 2, 8, 34, 144, 610, 2584, 10946, 46368, 196418, 832040, 3524578]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(even_fibonacci(maximum))" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "4613732" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sum(even_fibonacci(maximum))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "even_fibonacci() works. It is correct. It is fast. But it is complicated, hard to read, and ugly. So it is broken up into three little generator functions below." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def fibonacci():\n", " a, b = 0, 1\n", " while True:\n", " yield a\n", " a, b = b, a + b\n", "\n", "def even(numbers):\n", " for i in numbers:\n", " if i % 2 == 0:\n", " yield i\n", "\n", "def lte(numbers, maximum):\n", " for i in numbers:\n", " if i > maximum:\n", " break\n", " yield i" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lte(even(fibonacci()), maximum)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[0, 2, 8, 34, 144, 610, 2584, 10946, 46368, 196418, 832040, 3524578]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(lte(even(fibonacci()), maximum))" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "4613732" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sum(lte(even(fibonacci()), maximum))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The individual generators are simple and easy to understand.\n", "It is easy to create many simple generators.\n", "Each little generator does just one thing (and does it well).\n", "By combining these little generators,\n", "one can do powerful things.\n", "The magic is in how one chooses to combine the little generators.\n", "\n", "# This is the [UNIX Philosophy](https://en.wikipedia.org/wiki/UNIX_philosophy)!!!\n", "\n", "However the Python syntax for chaining (nesting) generators gets ugly.\n", "One normally reads from left to right, but in\n", "\n", " sum(lte(even(fibonacci()), maximum))\n", " \n", "One starts in the middle:\n", "\n", " fibonacci()\n", " \n", "reads left, adding the even() generator:\n", "\n", " even(fibonacci())\n", " \n", "then the lte() generator:\n", "\n", " lte(even(fibonacci()), maximum)\n", " \n", "then sum():\n", "\n", " sum(lte(even(fibonacci()), maximum))\n", " \n", "Even worse, is that one has to think about which generator the *maximum* argument is for.\n", "\n", "So although chaining generators in Python can be amazingly powerful,\n", "the syntax can be hard to read." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "UNIX solved this over forty years ago with\n", "[pipes](https://en.wikipedia.org/wiki/Unix_pipe).\n", "In UNIX, an obvious solution would look like:\n", "\n", " fibonacci | even | lte $maximum | sum\n", " \n", "It reads from left to right, and the argument for lte is right after it.\n", "This is easy to read.\n", "So I implemented the pipe for generators in Python.\n", "Passing arguments requires parentheses in Python,\n", "so the Python syntax is:\n", "\n", " fibonacci() | even() | lte(maximum) | fsum()\n", "\n", "Below is my code to implement that.\n", "It is very green and naïve.\n", "There is plenty that it can not do,\n", "but this crude naïve code is a start." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [], "source": [ "class Filter():\n", " def __init__(self, g, *args, **kwargs):\n", " # print('Filter: g %r, args %r, kwargs %r' % (g, args, kwargs))\n", " # print(' self %r' % (self))\n", " self.g = g\n", " self.args = args\n", " self.kwargs = kwargs\n", " # self.gen = g(*args, **kwargs)\n", " \n", " def not__or__(self, other):\n", " print('__or__ self %s, other %s' % (self.g, other.g))\n", " return other.g(self.g(*self.args, **self.kwargs), *other.args, **other.kwargs) \n", " \n", " def __ror__(self, left):\n", " # print('__ror__ self %s, left %s' % (self.g, left))\n", " # print('__ror__ left %s, right (self) %s' % (left, self.g))\n", " g = self.g(left, *self.args, **self.kwargs)\n", " # print(' -> %r' % g)\n", " return g" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def piperize(f):\n", " def g(*args, **kwargs):\n", " return Filter(f, *args, **kwargs)\n", " return g" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def fibonacci():\n", " a, b = 0, 1\n", " while True:\n", " yield a\n", " a, b = b, a + b\n", "\n", "@piperize\n", "def even(numbers):\n", " for i in numbers:\n", " if i % 2 == 0:\n", " yield i\n", "\n", "@piperize\n", "def lte(numbers, maximum):\n", " for i in numbers:\n", " if i > maximum:\n", " break\n", " yield i" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fibonacci() | even() | lte(maximum)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "[0, 2, 8, 34, 144, 610, 2584, 10946, 46368, 196418, 832040, 3524578]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(fibonacci() | even() | lte(maximum))" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "4613732" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sum(fibonacci() | even() | lte(maximum))" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [], "source": [ "@piperize\n", "def fsum(iterable):\n", " return sum(iterable) # I am surprised I got away with a return statement." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false, "scrolled": false }, "outputs": [ { "data": { "text/plain": [ "4613732" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fibonacci() | even() | lte(maximum) | fsum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compare the readability of the above cell with the original invocation repeated below:\n", "\n", " sum(lte(even(fibonacci()), maximum))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "Let's play a little bit more with sum()." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": true }, "outputs": [], "source": [ "F = Filter" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "4613732" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fibonacci() | even() | lte(maximum) | F(sum)" ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": true }, "outputs": [], "source": [ "sum = F(sum)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "4613732" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fibonacci() | even() | lte(maximum) | sum" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hmmm. That's curious. Are no parentheses a good thing or a bad thing?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "\n", "Now to revisit a previous month's word counting challenge.\n", "In UNIX shell, it can be done in one long line as follows.\n", "\n", " cat pg84.txt | tr 'A-Z' 'a-z' | tr -cs 'a-z' '\\n' | sort | uniq -c | sort -n -r | head -n 5\n", " \n", "Now let's implement that in python. First we'll do it with regular nested generators." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from itertools import islice\n", "\n", "def tolower(text):\n", " for s in text:\n", " yield s.lower()\n", "\n", "def chop_into_words(text):\n", " for s in text:\n", " for word in s.split():\n", " yield word.strip()\n", "\n", "def sort(words, key=None, reverse=False):\n", " yield from sorted(words, key=key, reverse=reverse)\n", "\n", "def uniq_count(words):\n", " counts = {}\n", " old_word = None\n", " for word in words:\n", " if word != old_word:\n", " counts[word] = 0\n", " old_word = word\n", " counts[word] += 1\n", " for word, count in counts.items():\n", " yield count, word\n", "\n", "def head(iterable, n=10):\n", " yield from islice(iterable, n)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[(4327, 'the'), (3004, 'and'), (2754, 'of'), (2720, 'i'), (2160, 'to')]" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text_filename = 'pg84.txt'\n", "\n", "list(head(sort(\n", " uniq_count(sort(chop_into_words(tolower(open(text_filename))))),\n", " key=lambda x: x[0],\n", " reverse=True\n", "), 5))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Yuch!** It works correctly, but it sure is hard to read. It is ugly!!!\n", "\n", "Let's try it with the pipe syntax." ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from itertools import islice\n", "\n", "@piperize\n", "def tolower(text):\n", " for s in text:\n", " yield s.lower()\n", "\n", "@piperize\n", "def chop_into_words(text):\n", " for s in text:\n", " for word in s.split():\n", " yield word.strip()\n", "\n", "@piperize\n", "def sort(words, key=None, reverse=False):\n", " yield from sorted(words, key=key, reverse=reverse)\n", "\n", "@piperize\n", "def uniq_count(words):\n", " counts = {}\n", " old_word = None\n", " for word in words:\n", " if word != old_word:\n", " counts[word] = 0\n", " old_word = word\n", " counts[word] += 1\n", " for word, count in counts.items():\n", " yield count, word\n", "\n", "@piperize\n", "def head(iterable, n=10):\n", " yield from islice(iterable, n)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[(4327, 'the'), (3004, 'and'), (2754, 'of'), (2720, 'i'), (2160, 'to')]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "text_filename = 'pg84.txt'\n", "\n", "list(\n", " open(text_filename) |\n", " tolower() |\n", " chop_into_words() |\n", " sort() |\n", " uniq_count() |\n", " sort(key=lambda x: x[0], reverse=True) |\n", " head(5)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Compare the readability of the pipe syntax with that of the nested parentheses." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---\n", "The last four filters can be replaced with a Counter object and its most_common() method." ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from collections import Counter\n", "\n", "@piperize\n", "def count_words(words, n=10):\n", " yield from Counter(words).most_common(n)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('the', 4327), ('and', 3004), ('of', 2754), ('i', 2720), ('to', 2160)]" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(\n", " open(text_filename) |\n", " tolower() |\n", " chop_into_words() |\n", " count_words(5)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's short enough to fit on one line, so:" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "[('the', 4327), ('and', 3004), ('of', 2754), ('i', 2720), ('to', 2160)]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(open(text_filename) | tolower() | chop_into_words() | count_words(5))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Thoughts\n", "\n", "Using syntax similar to UNIX pipe is so obvious to me\n", "that I wonder why I have not seen other people do this.\n", "Why have other people not already done this?\n", "Have other people already done this and I just don't know about it?\n", "Have other people not done this because it is a bad idea?\n", "What am I missing?\n", "\n", "Could generalize the even() generator by passing arbitrary modulus and remainder,\n", "or even more broadly by passing a function (which could be a lambda function).\n", "That reinvents filter() with the arguments swapped.\n", "\n", "Could generalize lte() to be generic quitter generator\n", "by passing a (lambda) function.\n", "That reinvents itertools.takewhile() with the arguments swapped." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# TODO\n", "\n", "- Need to be able to mix with standard generators like islice,\n", " without needing to create wrapper functions\n", " like I did above with head().\n", " Perhaps create function or class for in-line wrapping.\n", " \n", "- Consider using the extended generator function protocol: send versus next\n", " (and not pass iterable as first argument).\n", " \n", "- How about implementing something for\n", " [Hartmann pipelines](https://en.wikipedia.org/wiki/Hartmann_pipelines)\n", " (later)? (pull not push?)\n", " \n", "- Pay attention to iter() and functools.partial().\n", "\n", "- Ponder whether parentheses are good or bad." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Scraps (ignore stuff below)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\r\n", "1\r\n", "1\r\n", "2\r\n", "3\r\n", "5\r\n", "8\r\n", "13\r\n", "21\r\n", "34\r\n" ] } ], "source": [ "!fibonacci | head" ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\r\n", "2\r\n", "3\r\n", "4\r\n", "5\r\n", "6\r\n", "7\r\n", "8\r\n", "9\r\n", "10\r\n" ] } ], "source": [ "!seq 10" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2\r\n", "4\r\n", "6\r\n", "8\r\n", "10\r\n" ] } ], "source": [ "!seq 10 | even" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1\r\n", "2\r\n", "3\r\n", "4\r\n" ] } ], "source": [ "!seq 10 | lte 4" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10\r\n" ] } ], "source": [ "!seq 10 | lte 4 | sum" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4613732\r\n" ] } ], "source": [ "!fibonacci | even | lte 4000000 | sum" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.4.3" } }, "nbformat": 4, "nbformat_minor": 0 }