{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Building a Unicode `.format` Nanny with Python's AST\n", "\n", "*Detecting problematic `.format` calls quickly and painlessly using Python's powerful `ast` module. Follow along by [downloading the Jupyter notebook of this post!](https://raw.githubusercontent.com/drocco007/blog/master/ast_format/ast_format.ipynb)* \n", "\n", "The issue: where I work, we are still a Python 2 shop owing to technical debt and legacy dependencies. Since Python 2's default string type is a byte sequence, painful encoding-related problems surface in our applications from time to time. A fairly typical example runs something like this: a developer writes some code to format data into a string, using the convenient and powerful `.format` method supported by all string objects:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def pretty_format(some_data):\n", " return 'Hello, {}!'.format(some_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Through the course of our exhaustive testing, we prove that this function is correct over a wide range of inputs:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hello, world!\n" ] } ], "source": [ "print pretty_format('world')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The code ships. Months pass without incident, our `pretty_format` routine prettily formatting every bit of data thrown its way. Lulled into complacency through enjoyment of our apparent success, we move on to other tasks. One day, everything comes to a screeching halt as, completely unprepared, we receive one of the most dreaded error messages in all of software development:\n", "\n", " UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 8: ordinal not in range(128)\n", "\n", "What happened? 
Why did code that worked flawlessly for months suddenly detonate, without warning, without mercy?\n", "\n", "Much of the data that flows through this format template and others like it is simple ASCII-valued information: dates, addresses, phone numbers, and the like. Having used Python 2 for many years, we are habituated to spell strings, including our template formatting strings, using the simple single quote\n", "\n", "```python\n", "'a typical string'\n", "```\n", "\n", "What happens, though, when our user data contains an accented character or other non-ASCII symbol?" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "ename": "UnicodeEncodeError", "evalue": "'ascii' codec can't encode character u'\\xc9' in position 8: ordinal not in range(128)", "output_type": "error", "traceback": [ "\u001b[1;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[1;31mUnicodeEncodeError\u001b[0m Traceback (most recent call last)", "\u001b[1;32m\u001b[0m in \u001b[0;36m\u001b[1;34m()\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[0mfull_name\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;34mu'Ariadne Éowyn'\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[1;32mprint\u001b[0m \u001b[0mpretty_format\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mfull_name\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[1;32m\u001b[0m in \u001b[0;36mpretty_format\u001b[1;34m(some_data)\u001b[0m\n\u001b[0;32m 1\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0mpretty_format\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0msome_data\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m----> 2\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[1;34m'Hello, {}!'\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mformat\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0msome_data\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m", 
"\u001b[1;31mUnicodeEncodeError\u001b[0m: 'ascii' codec can't encode character u'\\xc9' in position 8: ordinal not in range(128)" ] } ], "source": [ "full_name = u'Ariadne Éowyn'\n", "print pretty_format(full_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Boom! Python detects the mismatch between the byte string template and the Unicode data, which contains characters that simply cannot be represented directly in the target encoding, ASCII. In other words, Python is refusing to guess what we want: do we prefer a binary expansion, and, if so, in what encoding? Should the accented characters simply be dropped? Do we want unexpected symbols to be translated into ASCII error characters? Python has no way of knowing which of these options is appropriate to the present situation, so it takes the only reasonable course and raises an exception.\n", "\n", "Many encoding issues can be quite challenging to reconcile, but this case is rather simple: if the format string were spelled as a Unicode literal — rather than as a plain byte string — this entire class of problem would be avoided:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hello, Ariadne Éowyn!\n" ] } ], "source": [ "def pretty_format(some_data):\n", "    # Unicode template prepares this routine to handle non-ASCII symbols\n", "    return u'Hello, {}!'.format(some_data)\n", "\n", "print pretty_format(full_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But how do we know where the problematic calls to `.format` are lurking in our code base without waiting for the next error to occur? Is there a way we could find these calls proactively, eliminating them from the system before they wreak havoc on our application?"
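, "\n", "\n", "As an aside: when a byte string truly is required, each of the choices Python refused to guess above has an explicit spelling. A quick sketch, reusing `full_name` from the example:\n", "\n", "```python\n", "# -*- coding: utf-8 -*-\n", "full_name = u'Ariadne Éowyn'\n", "\n", "# A binary expansion in a chosen encoding:\n", "full_name.encode('utf-8')            # 'Ariadne \\xc3\\x89owyn'\n", "# Drop the symbols ASCII cannot represent:\n", "full_name.encode('ascii', 'ignore')  # 'Ariadne owyn'\n", "# Substitute ASCII error characters:\n", "full_name.encode('ascii', 'replace') # 'Ariadne ?owyn'\n", "```"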
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A brief diversion, or Now you have two problems\n", "\n", "When I first mentioned the idea of automatically detecting problematic `.format` calls to a coworker, he immediately quipped back “have fun writing *that* regex!” Before we look at a more powerful and reliable alternative, let's take a moment to examine his intuition and understand why we might not want to use regular expressions for this task.\n", "\n", "At first glance, regular expressions seem reasonably well suited to this job: what we're looking for are substrings of a given string — that is, parts of a Python source file — containing a certain pattern. Indeed it is fairly easy to construct a first cut at a regular expression for this. Here is mine, including a bunch of explanatory comments:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import re\n", "\n", "# a first draft regular expression for detecting \n", "# problematic .format calls\n", "pattern = re.compile(r'''\n", " (?x) # turn on verbose regex mode\n", " \n", " (?, 'attr': 'format', 'value': <_ast.Str object at 0x7ff15bffe910>, 'lineno': 3}\n", "{'s': 'Hello, {}!', 'lineno': 3, 'col_offset': 0}\n", "\n", "{'col_offset': 5, 'ctx': <_ast.Load object at 0x7ff178b71a10>, 'attr': 'pi', 'value': <_ast.Name object at 0x7ff15bffec10>, 'lineno': 5}\n", "{'ctx': <_ast.Load object at 0x7ff178b71a10>, 'id': 'math', 'col_offset': 5, 'lineno': 5}\n", "\n" ] } ], "source": [ "class AttributeInspector(ast.NodeVisitor):\n", " def visit_Attribute(self, node):\n", " print node.__dict__\n", " print node.value.__dict__\n", " print\n", "\n", "AttributeInspector().visit(tree)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This little class simply prints the instance dictionaries for each `Attribute` node in the tree along with that of its `value` attribute. 
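", "\n\nAnother handy exploration tool is `ast.dump`, which renders an entire subtree as a single readable string; for example:\n", "\n", "```python\n", "import ast\n", "\n", "source = \"'Hello, {}!'.format('world')\"\n", "\n", "# Render every node, field, and literal of the parse tree at once.\n", "print(ast.dump(ast.parse(source)))\n", "```\n", "\n", "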
I often use this trick (`print obj.__dict__`) when exploring a problem to help me understand the attributes and capabilities of unknown objects. For this particular problem, our inspector has revealed that each node type has a different set of attributes defined on it and has given us clues toward a solution.\n", "\n", "As expected, there are two `Attribute` nodes in our little example program, corresponding to the `.format` and `.pi` references in the source. We saw earlier that we are specifically looking for nodes with an `attr` of `'format'`, and our little test class revealed another helpful detail: the line number on which the attribute access appears in the source, which we can report back to the user.\n", "\n", "Our tiny Visitor implementation has already solved the first part of our problem: finding `Attribute` nodes in the source. All that remains is to filter down to `.format` nodes and check the `value` attribute of each. Here is the rest of the implementation, with a guard so that `.format` calls on non-literals (a name, say, or the result of another call) are skipped rather than crashing the visitor:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "collapsed": true }, "outputs": [], "source": [ "class FormatVisitor(ast.NodeVisitor):\n", "    def visit_Attribute(self, node):\n", "        # Only string literals carry an .s attribute; skip receivers\n", "        # such as names or call results.\n", "        if node.attr == 'format' and isinstance(node.value, ast.Str):\n", "            _str = repr(node.value.s)\n", "\n", "            if _str[0] != 'u':\n", "                print u'{}: {}'.format(node.lineno, _str)\n", "\n", "        self.generic_visit(node)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With just a handful of lines of Python, we have created a program that meets the goals we set out above. This solution harnesses the full power of Python's exposed parser machinery, which lets our code express a very high-level solution with a minimum of syntactic overhead. 
Run on our example above, it points out that line 3 of the source contains a problematic spelling:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3: 'Hello, {}!'\n" ] } ], "source": [ "FormatVisitor().visit(tree)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "while source with an acceptable spelling yields no warnings:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [], "source": [ "tree = ast.parse(\"\"\"u'Hello, {}!'.format('world')\"\"\")\n", "FormatVisitor().visit(tree)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here is a complex example that demonstrates several of our tricky cases from earlier, including a split string, continuation line, and even a parenthesized expression!" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3: 'Hello, {}!'\n" ] } ], "source": [ "source = \"\"\"\n", "\n", "('Hello, '\n", " '{}!') \\\n", ".format('world')\n", "\n", "\"\"\"\n", "\n", "tree = ast.parse(source)\n", "FormatVisitor().visit(tree)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The same source but with corrected strings eliminates the warning:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "collapsed": false }, "outputs": [], "source": [ "source = source.replace(\"('\", \"(u'\")\n", "tree = ast.parse(source)\n", "FormatVisitor().visit(tree)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Even the `__future__` import case is handled correctly with no extra effort on our part!" 
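, "\n", "\n", "This works because `ast.parse`, like `compile`, honors any future statements it finds in the source: once `unicode_literals` is in effect, the parser builds unicode objects for un-prefixed literals, so `repr(node.value.s)` already yields a `u'...'` spelling. We can confirm this at the parser level (the `getattr` below is only there to keep the sketch runnable on Python 3, where the literal node exposes `value` instead of `s`):\n", "\n", "```python\n", "import ast\n", "\n", "tree = ast.parse(\n", "    \"from __future__ import unicode_literals\\n\"\n", "    \"'Hello, {}!'.format('world')\")\n", "\n", "# body[1] is the Expr statement; its value is the Call whose func is\n", "# the .format Attribute; that Attribute's value is the literal node.\n", "literal = tree.body[1].value.func.value\n", "print(repr(getattr(literal, 's', getattr(literal, 'value', None))))\n", "```\n", "\n", "Under Python 2 this prints `u'Hello, {}!'`, exactly the spelling our visitor accepts."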
] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": true }, "outputs": [], "source": [ "tree = ast.parse(\"\"\"\n", "\n", "from __future__ import unicode_literals\n", "'Hello, {}!'.format('world')\n", "\n", "\"\"\")\n", "FormatVisitor().visit(tree)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Looking back, looking ahead\n", "\n", "Thanks for joining me in this tour of a lesser-used corner of the Python standard library. In my shop, we now use a Git commit hook based on these ideas to help detect these sorts of encoding problems before they even make it into the source tree. Figuring out how to build the hook was a fun exploration with a pragmatic result, and I hope that you have enjoyed walking through it with me and that you learned something along the way!\n", "\n", "If you'd like to learn more, here are a few resources to continue your exploration:\n", "\n", "* [Green Tree Snakes](https://greentreesnakes.readthedocs.org/en/latest/) is Thomas Kluyver's introduction to the AST, billed as “the missing Python AST docs”\n", "* The [`ast` module documentation](https://docs.python.org/2/library/ast.html) contains a grammar specification and module reference\n", "* Andreas Dewes presented [a static analysis tool](https://us.pycon.org/2015/schedule/presentation/341/) at PyCon 2015 based on the AST\n", "* [Hy](http://docs.hylang.org/en/latest/) implements a Lisp interpreter in Python and makes extensive use of the AST" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.8" } }, "nbformat": 4, "nbformat_minor": 0 }