{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "_Sound_ program analyses present conservative overapproximations of program behavior. Consider a _may-alias analysis_, which determines if two reference variables may refer to the same location in memory. Precise may-alias analysis is undecidable, but certain kinds of imprecision are acceptable. Often we're interested in sound analyses to support verification or semantics-preserving program transformations, so false positives are acceptable but false negatives are not. Put another way, the worst that can come of spuriously identifying a pair of variables as potentially-aliasing is that we'd miss an opportunity to optimize our program; the worst that can come of not identifying a pair of potentially-aliasing variables as such is a program tranformation that introduces a behavior change.\n", "\n", "By contrast, _unsound_ analyses are imprecise but not conservative: both false positives and false negatives are possible. These analyses are still useful for program understanding (e.g., in linters or static bug detectors) even if they are not sufficient to support program transformations. In this post (which is also available as an interactive notebook), we'll develop an unsound analysis for Python. The goal of the analysis we'll develop is to identify which modules a function depends on; this post will show one way to build up such a technique. \n", "\n", "\n", "# Background\n", "\n", "Consider the following problem: you'd like to enable users to automatically extract a function from a Jupyter notebook and publish it as a service. Actually serializing a closure from a function in a given environment is not difficult, given the [`cloudpickle`](https://github.com/cloudpipe/cloudpickle) module. But merely constructing a serialized closure isn't enough, since in general this function may require other modules to be available to run in another context. \n", "\n", "Therefore, we need some way to identify the modules required by a function (and, ideally, the packages that provide these modules). Since engineering time is limited, it's probably better to have an optimistic-and-imperfect estimate and allow users to override it (by supplying additional dependencies when they publish a function) than it is to have a sound and conservative module list.\n", "\n", "The [`modulefinder`](https://docs.python.org/3.6/library/modulefinder.html) module in the Python standard library might initially seem like an attractive option, but it is unsuitable because it operates at the level of scripts. In order to use `modulefinder` on a single function from a notebook, we'd either have an imprecise module list (due to running the whole notebook) or we'd need to essentially duplicate a lot of its effort in order to [slice backwards](https://en.wikipedia.org/wiki/Program_slicing) from the function invocation so we could extract a suitably pruned script.\n", "\n", "Fortunately, you can interrogate nearly any aspect of any object in a Python program, and functions are no exception. If we could inspect the captured variables in a closure, we could identify the ones that are functions and figure out which modules they were declared in. That would look something like this:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import inspect\n", "\n", "def module_frontier(f):\n", " worklist = [f]\n", " seen = set()\n", " mods = set()\n", " for fn in worklist:\n", " cvs = inspect.getclosurevars(fn)\n", " gvars = cvs.globals\n", " for k, v in gvars.items():\n", " if inspect.ismodule(v):\n", " mods.add(v.__name__)\n", " elif inspect.isfunction(v) and id(v) not in seen:\n", " seen.add(id(v))\n", " mods.add(v.__module__)\n", " worklist.append(v)\n", " elif hasattr(v, \"__module__\"):\n", " mods.add(v.__module__)\n", " return list(mods)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The [`inspect`](https://docs.python.org/3/library/inspect.html#inspect.Signature.bind) module provides a friendly interface to inspecting object metadata. In the above function, we're constructing a worklist of all of the captured variables in a given function's closure. We're then constructing a set of all of the modules directly or transitively referred to by those captured variables, whether these are modules referred to directly, modules declaring functions referred to by captured variables, or modules declaring other values referred to by captured variables (e.g., native functions). Note that we add any functions we find to the worklist (although we don't handle `eval` or other techniques), so we'll capture at least some of the transitive call graph in this case.\n", "\n", "This approach seems to work pretty sensibly on simple examples:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numpy as np\n", "from numpy import dot\n", "\n", "def f(a, b):\n", " return np.dot(a, b)\n", "\n", "def g(a, b):\n", " return dot(a, b)\n", "\n", "def h(a, b):\n", " return f(a, b)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'f': ['numpy.core.multiarray', 'numpy'],\n", " 'g': ['numpy.core.multiarray'],\n", " 'h': ['numpy.core.multiarray', 'numpy', '__main__']}" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "{k.__name__ : module_frontier(k) for k in [f,g,h]}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It also works on itself, which is a relief:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['inspect']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "module_frontier(module_frontier)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Problem cases\n", "\n", "While these initial experiments are promising, we shouldn't expect that a simple approach will cover everything we might want to do. Let's look at a (slightly) more involved example to see if it breaks down.\n", "\n", "We'll use the k-means clustering implementation from scikit-learn to optimize some cluster centers in a model object. We'll then capture that model object in a closure and analyze it to see what we might need to import to run it elsewhere." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "from sklearn.cluster import KMeans\n", "import numpy as np\n", "\n", "data = np.random.rand(1000, 2)\n", "\n", "model = KMeans(random_state=0).fit(data)\n", "\n", "def km_predict_one(sample):\n", " sample = np.array(sample).reshape(1,-1)\n", " return model.predict(sample)[0]" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "7" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "km_predict_one([0.5, 0.5])" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['sklearn.cluster.k_means_', 'numpy']" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "module_frontier(km_predict_one)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## List comprehensions\n", "\n", "So far, so good. Let's say we want to publish this simple model as a lighter-weight service (without a scikit-learn dependency). We can get that by reimplementing the `predict` method from the k-means model:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": true }, "outputs": [], "source": [ "centers = model.cluster_centers_\n", "from numpy.linalg import norm\n", "\n", "def km_predict_two(sample):\n", " _, idx = min([(norm(sample - center), idx) for idx, center in enumerate(centers)])\n", " return idx\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "7" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "km_predict_two([0.5, 0.5])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What do we get if we analyze the second method?" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "module_frontier(km_predict_two)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "_This is a problem!_ We'd expect that `norm` would be a captured variable in the body of `km_predict_two` (and thus that `numpy.linalg` would be listed in its module frontier), but that isn't the case. We can inspect the closure variables:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ClosureVars(nonlocals={}, globals={'centers': array([[ 0.15441674, 0.15065163],\n", " [ 0.47375581, 0.78146907],\n", " [ 0.83512659, 0.19018115],\n", " [ 0.16262154, 0.86710792],\n", " [ 0.83007508, 0.83832402],\n", " [ 0.16133578, 0.49974156],\n", " [ 0.49490377, 0.22475294],\n", " [ 0.75499895, 0.51576093]])}, builtins={'min': , 'enumerate': }, unbound=set())" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inspect.getclosurevars(km_predict_two)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see the cluster centers as well as the `min` function and the `enumerate` type. But `norm` isn't in the list. Let's dive deeper. We can use the [`dis`](https://docs.python.org/3/library/dis.html) module (and some functionality that was introduced in Python 3.4) to inspect the Python bytecode for a given function:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0: LOAD_GLOBAL(min)\n", "2: LOAD_CLOSURE(sample)\n", "4: BUILD_TUPLE()\n", "6: LOAD_CONST( at 0x116dffd20, file \"\", line 5>)\n", "8: LOAD_CONST('km_predict_two..')\n", "10: MAKE_FUNCTION()\n", "12: LOAD_GLOBAL(enumerate)\n", "14: LOAD_GLOBAL(centers)\n", "16: CALL_FUNCTION()\n", "18: GET_ITER()\n", "20: CALL_FUNCTION()\n", "22: CALL_FUNCTION()\n", "24: UNPACK_SEQUENCE()\n", "26: STORE_FAST(_)\n", "28: STORE_FAST(idx)\n", "30: LOAD_FAST(idx)\n", "32: RETURN_VALUE()\n" ] } ], "source": [ "from dis import Bytecode\n", "for inst in Bytecode(km_predict_two):\n", " print(\"%d: %s(%s)\" % (inst.offset, inst.opname, inst.argrepr))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ah ha! The body of our list comprehension, which contains the call to `norm`, is a separate code object that has been stored in a constant. Let's look at the constants for our function:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "(None,\n", " at 0x116dffd20, file \"\", line 5>,\n", " 'km_predict_two..')" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "km_predict_two.__code__.co_consts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see the code object in the constant list and use `dis` to disassemble it as well:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0: BUILD_LIST()\n", "2: LOAD_FAST(.0)\n", "4: FOR_ITER(to 30)\n", "6: UNPACK_SEQUENCE()\n", "8: STORE_FAST(idx)\n", "10: STORE_FAST(center)\n", "12: LOAD_GLOBAL(norm)\n", "14: LOAD_DEREF(sample)\n", "16: LOAD_FAST(center)\n", "18: BINARY_SUBTRACT()\n", "20: CALL_FUNCTION()\n", "22: LOAD_FAST(idx)\n", "24: BUILD_TUPLE()\n", "26: LIST_APPEND()\n", "28: JUMP_ABSOLUTE()\n", "30: RETURN_VALUE()\n" ] } ], "source": [ "for inst in Bytecode(km_predict_two.__code__.co_consts[1]):\n", " print(\"%d: %s(%s)\" % (inst.offset, inst.opname, inst.argrepr))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we've done so, we can see that the list comprehension has loaded `norm` from a global, which we can then resolve and inspect:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "km_predict_two.__globals__[\"norm\"]" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'numpy.linalg.linalg'" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "_.__module__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Nested functions and lambda expressions\n", "\n", "We can see a similar problem if we look at a function with local definitions (note that there is no need for the nesting in this example other than to expose a limitation of our technique):" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "7" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import sys\n", "\n", "def km_predict_three(sample):\n", " # unnecessary nested function\n", " def find_best(sample):\n", " (n, i) = (sys.float_info.max, -1)\n", " for idx, center in enumerate(centers):\n", " (n, i) = min((n, i), (norm(sample - center), idx))\n", " return i\n", " return find_best(sample)\n", "\n", "km_predict_three([0.5, 0.5])" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "module_frontier(km_predict_three)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case, we can see that Python compiles these nested functions in essentially the same way it compiles the bodies of list comprehensions:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0: LOAD_CONST(\", line 5>)\n", "2: LOAD_CONST('km_predict_three..find_best')\n", "4: MAKE_FUNCTION()\n", "6: STORE_FAST(find_best)\n", "8: LOAD_FAST(find_best)\n", "10: LOAD_FAST(sample)\n", "12: CALL_FUNCTION()\n", "14: RETURN_VALUE()\n" ] } ], "source": [ "from dis import Bytecode\n", "for inst in Bytecode(km_predict_three):\n", " print(\"%d: %s(%s)\" % (inst.offset, inst.opname, inst.argrepr))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "But we can inspect the nested function just as we did the list comprehension. Let's do that but just look at the load instructions; we'll see that we've loaded `sys` and `norm` as we'd expect." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0: LOAD_GLOBAL(sys)\n", "2: LOAD_ATTR(float_info)\n", "4: LOAD_ATTR(max)\n", "6: LOAD_CONST(-1)\n", "16: LOAD_GLOBAL(enumerate)\n", "18: LOAD_GLOBAL(centers)\n", "32: LOAD_GLOBAL(min)\n", "34: LOAD_FAST(n)\n", "36: LOAD_FAST(i)\n", "40: LOAD_GLOBAL(norm)\n", "42: LOAD_FAST(sample)\n", "44: LOAD_FAST(center)\n", "50: LOAD_FAST(idx)\n", "66: LOAD_FAST(i)\n" ] } ], "source": [ "from dis import Bytecode\n", "for inst in [op for op in Bytecode(km_predict_three.__code__.co_consts[1]) if \"LOAD_\" in op.opname]:\n", " print(\"%d: %s(%s)\" % (inst.offset, inst.opname, inst.argrepr))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Predictably, we can also see a similar problem if we analyze a function with lambda expressions:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "7" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def km_predict_four(sample):\n", " _, idx = min(map(lambda tup: (norm(sample - tup[1]), tup[0]), enumerate(centers)))\n", " return idx\n", "\n", "km_predict_four([0.5, 0.5])" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "module_frontier(km_predict_four)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Explicit imports\n", "\n", "Let's look at what happens when we import modules inside the function we're analyzing. Because of the semantics of `import` in Python, the module dependency list for this function will depend on whether or not we've already imported `numpy` and `sys` in the global namespace. If we have, we'll get a reasonable module list; if we haven't, we'll get an empty module list. (If you're running this code in a notebook, you can try it out by restarting the kernel, re-executing the cell with the definition of `module_frontier`, and then executing this cell.)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['numpy.core.numeric',\n", " 'builtins',\n", " 'numpy.core.umath',\n", " 'numpy.linalg.linalg',\n", " 'numpy.core.multiarray',\n", " 'numpy.core.numerictypes',\n", " 'numpy',\n", " 'sys',\n", " 'numpy.core._methods',\n", " 'numpy.core.fromnumeric',\n", " 'numpy.lib.type_check',\n", " 'numpy.linalg._umath_linalg']" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def km_predict_five(sample):\n", " import numpy\n", " import sys\n", " from numpy.linalg import norm\n", " from sys.float_info import max as MAX_FLOAT\n", " \n", " (n, i) = (MAX_FLOAT, -1)\n", " for idx, center in enumerate(centers):\n", " (n, i) = min((n, i), (norm(sample - center), idx))\n", " return i\n", "\n", "module_frontier(km_predict_five)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0: LOAD_CONST(0)\n", "2: LOAD_CONST(None)\n", "4: IMPORT_NAME('numpy')\n", "6: STORE_FAST('numpy')\n", "8: LOAD_CONST(0)\n", "10: LOAD_CONST(None)\n", "12: IMPORT_NAME('sys')\n", "14: STORE_FAST('sys')\n", "16: LOAD_CONST(0)\n", "18: LOAD_CONST(('norm',))\n", "20: IMPORT_NAME('numpy.linalg')\n", "22: IMPORT_FROM('norm')\n", "24: STORE_FAST('norm')\n", "26: POP_TOP(None)\n", "28: LOAD_CONST(0)\n", "30: LOAD_CONST(('max',))\n", "32: IMPORT_NAME('sys.float_info')\n", "34: IMPORT_FROM('max')\n", "36: STORE_FAST('MAX_FLOAT')\n", "38: POP_TOP(None)\n", "40: LOAD_FAST('MAX_FLOAT')\n", "42: LOAD_CONST(-1)\n", "44: ROT_TWO(None)\n", "46: STORE_FAST('n')\n", "48: STORE_FAST('i')\n", "50: SETUP_LOOP(102)\n", "52: LOAD_GLOBAL('enumerate')\n", "54: LOAD_GLOBAL('centers')\n", "56: CALL_FUNCTION(1)\n", "58: GET_ITER(None)\n", "60: FOR_ITER(100)\n", "62: UNPACK_SEQUENCE(2)\n", "64: STORE_FAST('idx')\n", "66: STORE_FAST('center')\n", "68: LOAD_GLOBAL('min')\n", "70: LOAD_FAST('n')\n", "72: LOAD_FAST('i')\n", "74: BUILD_TUPLE(2)\n", "76: LOAD_FAST('norm')\n", "78: LOAD_FAST('sample')\n", "80: LOAD_FAST('center')\n", "82: BINARY_SUBTRACT(None)\n", "84: CALL_FUNCTION(1)\n", "86: LOAD_FAST('idx')\n", "88: BUILD_TUPLE(2)\n", "90: CALL_FUNCTION(2)\n", "92: UNPACK_SEQUENCE(2)\n", "94: STORE_FAST('n')\n", "96: STORE_FAST('i')\n", "98: JUMP_ABSOLUTE(60)\n", "100: POP_BLOCK(None)\n", "102: LOAD_FAST('i')\n", "104: RETURN_VALUE(None)\n" ] } ], "source": [ "from dis import Bytecode\n", "for inst in Bytecode(km_predict_five):\n", " print(\"%d: %s(%r)\" % (inst.offset, inst.opname, inst.argval))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see this more clearly by importing a module in a function's scope that we haven't imported into the global namespace:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def example_six():\n", " import json\n", " return json.loads(\"{'this-sure-is': 'confusing'}\")\n", "\n", "module_frontier(example_six)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ClosureVars(nonlocals={}, globals={}, builtins={}, unbound={'loads', 'json'})" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inspect.getclosurevars(example_six)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`json` is an unbound variable (since it isn't bound in the enclosing environment of the closure). If it were bound in the global namespace, however, the `json` we're referring to in `example_six` would be captured as a global variable:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['json']" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import json\n", "\n", "module_frontier(example_six)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ClosureVars(nonlocals={}, globals={'json': }, builtins={}, unbound={'loads'})" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inspect.getclosurevars(example_six)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Obviously, we'd like to return the same module-dependency results for functions that import modules locally independently of whether those modules have been imported into the global namespace. We can look at the bytecode for this function to see what instructions might be relevant:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0: LOAD_CONST(0)\n", "2: LOAD_CONST(None)\n", "4: IMPORT_NAME('json')\n", "6: STORE_FAST('json')\n", "8: LOAD_FAST('json')\n", "10: LOAD_ATTR('loads')\n", "12: LOAD_CONST(\"{'this-sure-is': 'confusing'}\")\n", "14: CALL_FUNCTION(1)\n", "16: RETURN_VALUE(None)\n" ] } ], "source": [ "from dis import Bytecode\n", "for inst in Bytecode(example_six):\n", " print(\"%d: %s(%r)\" % (inst.offset, inst.opname, inst.argval))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Solving problems\n", "\n", "To address the cases that the closure-inspecting approach misses, we can inspect the bytecode of each function. (We could also inspect abstract syntax trees, using the [`ast`](https://docs.python.org/3.6/library/ast.html) module, but in general it's easier to do this sort of work with a lower-level, more regular representation. ASTs have more cases to treat than bytecode.)\n", "\n", "Python bytecode is stack-based, meaning that each instruction may take one or more arguments from the stack (in addition to explicit arguments encoded in the instruction). For a more involved analysis, we'd probably want to convert Python bytecode to a representation with explicit operands (like three-address code; see [section 2 of Vallée-Rai et al.](https://pdfs.semanticscholar.org/22df/3c0d055b7f65871068dfcd83d10f0a4fe2e4.pdf) for a reasonable approach), but let's see how far we can get by just operating on bytecode.\n", "\n", "## Identifying interesting bytecodes\n", "\n", "We know from the problem cases we examined earlier that we need to worry about a few different kinds of bytecode instructions to find some of the modules that our `inspect`-based approach missed:\n", "\n", "- `LOAD_CONST` instructions that load code objects (e.g., list comprehension bodies, lambda expressions, or nested functions);\n", "- other `LOAD_` instructions that might do the same; and\n", "- `IMPORT_NAME` instructions that import a module into a function's namespace.\n", "\n", "Let's extend our technique to also inspect relevant bytecodes. We'll see how far we can get by just looking at bytecodes in isolation (without modeling the stack or value flow). First, we'll identify the \"interesting\" bytecodes and return the modules, functions, or code blocks that they implicate: " ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "def interesting(inst):\n", " from types import CodeType, FunctionType, ModuleType\n", " from importlib import import_module\n", " from functools import reduce\n", " \n", " if inst.opname == \"IMPORT_NAME\":\n", " path = inst.argval.split(\".\")\n", " path[0] = [import_module(path[0])]\n", " result = reduce(lambda x, a: x + [getattr(x[-1], a)], path)\n", " return (\"modules\", result)\n", " if inst.opname == \"LOAD_GLOBAL\":\n", " if inst.argval in globals() and type(globals()[inst.argval]) in [CodeType, FunctionType]:\n", " return (\"code\", globals()[inst.argval])\n", " if inst.argval in globals() and type(globals()[inst.argval]) == ModuleType:\n", " return (\"modules\", [globals()[inst.argval]])\n", " else:\n", " return None\n", " if \"LOAD_\" in inst.opname and type(inst.argval) in [CodeType, FunctionType]:\n", " return (\"code\", inst.argval)\n", " return None" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can make a revised version of our `module_frontier` function. This starts with the same basic approach as the initial function but it also:\n", "\n", "- processes the bytecode for each code block transitively referred to in each function,\n", "- processes any modules explicitly imported in code." ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def mf_revised(f):\n", " worklist = [f]\n", " seen = set()\n", " mods = set()\n", " for fn in worklist:\n", " codeworklist = [fn]\n", " cvs = inspect.getclosurevars(fn)\n", " gvars = cvs.globals\n", " for k, v in gvars.items():\n", " if inspect.ismodule(v):\n", " mods.add(v.__name__)\n", " elif inspect.isfunction(v) and id(v) not in seen:\n", " seen.add(id(v))\n", " mods.add(v.__module__)\n", " worklist.append(v)\n", " elif hasattr(v, \"__module__\"):\n", " mods.add(v.__module__)\n", " for block in codeworklist:\n", " for (k, v) in [interesting(inst) for inst in Bytecode(block) if interesting(inst)]:\n", " if k == \"modules\":\n", " newmods = [mod.__name__ for mod in v if hasattr(mod, \"__name__\")]\n", " mods.update(set(newmods))\n", " elif k == \"code\" and id(v) not in seen:\n", " seen.add(id(v))\n", " if hasattr(v, \"__module__\"):\n", " mods.add(v.__module__)\n", " if(inspect.isfunction(v)):\n", " worklist.append(v)\n", " elif(inspect.iscode(v)):\n", " codeworklist.append(v)\n", " \n", " result = list(mods)\n", " result.sort()\n", " return result" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, this new approach produces sensible results for all of our examples, including hte ones that had confounded the closure-variable approach." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['numpy', 'sklearn.cluster.k_means_']" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mf_revised(km_predict_one)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['builtins',\n", " 'numpy',\n", " 'numpy.core._methods',\n", " 'numpy.core.fromnumeric',\n", " 'numpy.core.multiarray',\n", " 'numpy.core.numeric',\n", " 'numpy.core.numerictypes',\n", " 'numpy.core.umath',\n", " 'numpy.lib.type_check',\n", " 'numpy.linalg._umath_linalg',\n", " 'numpy.linalg.linalg']" ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mf_revised(km_predict_two)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['builtins',\n", " 'numpy',\n", " 'numpy.core._methods',\n", " 'numpy.core.fromnumeric',\n", " 'numpy.core.multiarray',\n", " 'numpy.core.numeric',\n", " 'numpy.core.numerictypes',\n", " 'numpy.core.umath',\n", " 'numpy.lib.type_check',\n", " 'numpy.linalg._umath_linalg',\n", " 'numpy.linalg.linalg',\n", " 'sys']" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mf_revised(km_predict_three)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['builtins',\n", " 'numpy',\n", " 'numpy.core._methods',\n", " 'numpy.core.fromnumeric',\n", " 'numpy.core.multiarray',\n", " 'numpy.core.numeric',\n", " 'numpy.core.numerictypes',\n", " 'numpy.core.umath',\n", " 'numpy.lib.type_check',\n", " 'numpy.linalg._umath_linalg',\n", " 'numpy.linalg.linalg']" ] }, "execution_count": 35, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mf_revised(km_predict_four)" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['builtins',\n", " 'numpy',\n", " 'numpy.core._methods',\n", " 'numpy.core.fromnumeric',\n", " 'numpy.core.multiarray',\n", " 'numpy.core.numeric',\n", " 'numpy.core.numerictypes',\n", " 'numpy.core.umath',\n", " 'numpy.lib.type_check',\n", " 'numpy.linalg',\n", " 'numpy.linalg._umath_linalg',\n", " 'numpy.linalg.linalg',\n", " 'sys']" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mf_revised(km_predict_five)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is ongoing work and future notebooks will cover refinements to this technique. It has been fun to learn about the power (and limitations) of the `inspect` module -- as well as a little more about how Python compiles code blocks and nested functions." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" } }, "nbformat": 4, "nbformat_minor": 2 }