{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Mining Input Grammars\n", "\n", "So far, the grammars we have seen have been mostly specified manually – that is, you (or the person knowing the input format) had to design and write a grammar in the first place. While the grammars we have seen so far have been rather simple, creating a grammar for complex inputs can involve quite some effort. In this chapter, we therefore introduce techniques that _automatically mine grammars from programs_ – by executing the programs and observing how they process which parts of the input. In conjunction with a grammar fuzzer, this allows us to \n", "1. take a program, \n", "2. extract its input grammar, and \n", "3. fuzz it with high efficiency and effectiveness, using the concepts in this book." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:52.730324Z", "iopub.status.busy": "2024-01-18T17:19:52.729866Z", "iopub.status.idle": "2024-01-18T17:19:52.797017Z", "shell.execute_reply": "2024-01-18T17:19:52.796609Z" }, "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from bookutils import YouTubeVideo\n", "YouTubeVideo(\"ddM1oL2LYDI\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Prerequisites**\n", "\n", "* You should have read the [chapter on grammars](Grammars.ipynb).\n", "* The [chapter on configuration fuzzing](ConfigurationFuzzer.ipynb) introduces grammar mining for configuration options, as well as observing variables and values during execution.\n", "* We use the tracer from the [chapter on coverage](Coverage.ipynb).\n", "* The concept of parsing from the [chapter on parsers](Parser.ipynb) is also useful." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## Synopsis\n", "\n", "\n", "To [use the code provided in this chapter](Importing.ipynb), write\n", "\n", "```python\n", ">>> from fuzzingbook.GrammarMiner import <identifier>\n", "```\n", "\n", "and then make use of the following features.\n", "\n", "\n", "This chapter provides a number of classes to mine input grammars from existing programs. The function `recover_grammar()` could be the easiest to use. It takes a function and a set of inputs, and returns a grammar that describes its input language.\n", "\n", "We apply `recover_grammar()` on a `url_parse()` function that takes and decomposes URLs:\n", "\n", "```python\n", ">>> url_parse('https://www.fuzzingbook.org/')\n", ">>> URLS\n", "['http://user:pass@www.google.com:80/?q=path#ref',\n", " 'https://www.cispa.saarland:80/',\n", " 'http://www.fuzzingbook.org/#News']\n", "```\n", "We extract the input grammar for `url_parse()` using `recover_grammar()`:\n", "\n", "```python\n", ">>> grammar = recover_grammar(url_parse, URLS, files=['urllib/parse.py'])\n", ">>> grammar\n", "{'<start>': ['<urlsplit@437:url>'],\n", " '<urlsplit@437:url>': ['<urlparse@394:scheme>:<_splitnetloc@411:url>'],\n", " '<urlparse@394:scheme>': ['https', 'http'],\n", " '<_splitnetloc@411:url>': ['//<urlparse@394:netloc>/',\n", " '//<urlparse@394:netloc><urlsplit@481:url>'],\n", " '<urlparse@394:netloc>': ['user:pass@www.google.com:80',\n", " 'www.fuzzingbook.org',\n", " 'www.cispa.saarland:80'],\n", " '<urlsplit@481:url>': ['<urlsplit@486:url>#<urlparse@394:fragment>',\n", " '/#<urlparse@394:fragment>'],\n", " '<urlsplit@486:url>': ['/?<urlparse@394:query>'],\n", " '<urlparse@394:query>': ['q=path'],\n", " '<urlparse@394:fragment>': ['ref', 'News']}\n", "```\n", "The names of nonterminals are a bit technical; but the grammar nicely represents the structure of the input; for instance, the different schemes (`\"http\"`, `\"https\"`) are all identified:\n", "\n", "```python\n", ">>> syntax_diagram(grammar)\n", "start\n", "\n", "```\n", "![](PICS/GrammarMiner-synopsis-1.svg)\n", "```\n", "urlsplit@437:url\n", "\n", "```\n", "![](PICS/GrammarMiner-synopsis-2.svg)\n", "```\n", "urlparse@394:scheme\n", "\n", "```\n", "![](PICS/GrammarMiner-synopsis-3.svg)\n", "```\n", 
"_splitnetloc@411:url\n", "\n", "```\n", "![](PICS/GrammarMiner-synopsis-4.svg)\n", "```\n", "urlparse@394:netloc\n", "\n", "```\n", "![](PICS/GrammarMiner-synopsis-5.svg)\n", "```\n", "urlsplit@481:url\n", "\n", "```\n", "![](PICS/GrammarMiner-synopsis-6.svg)\n", "```\n", "urlsplit@486:url\n", "\n", "```\n", "![](PICS/GrammarMiner-synopsis-7.svg)\n", "```\n", "urlparse@394:query\n", "\n", "```\n", "![](PICS/GrammarMiner-synopsis-8.svg)\n", "```\n", "urlparse@394:fragment\n", "\n", "```\n", "![](PICS/GrammarMiner-synopsis-9.svg)\n", "\n", "The grammar can be immediately used for fuzzing, producing arbitrary combinations of input elements, which are all syntactically valid.\n", "\n", "```python\n", ">>> from GrammarCoverageFuzzer import GrammarCoverageFuzzer\n", ">>> fuzzer = GrammarCoverageFuzzer(grammar)\n", ">>> [fuzzer.fuzz() for i in range(5)]\n", "['https://www.fuzzingbook.org/',\n", " 'http://user:pass@www.google.com:80/#News',\n", " 'http://www.cispa.saarland:80/?q=path#ref',\n", " 'https://user:pass@www.google.com:80/#ref',\n", " 'https://www.cispa.saarland:80/']\n", "```\n", "Being able to automatically extract a grammar and to use this grammar for fuzzing makes for very effective test generation with a minimum of manual work.\n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## A Grammar Challenge\n", "\n", "Consider the `process_inventory()` method from the [chapter on parsers](Parser.ipynb):" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:52.820727Z", "iopub.status.busy": "2024-01-18T17:19:52.820505Z", "iopub.status.idle": "2024-01-18T17:19:52.823026Z", "shell.execute_reply": "2024-01-18T17:19:52.822713Z" }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "import bookutils.setup" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": 
"2024-01-18T17:19:52.824965Z", "iopub.status.busy": "2024-01-18T17:19:52.824826Z", "iopub.status.idle": "2024-01-18T17:19:52.826647Z", "shell.execute_reply": "2024-01-18T17:19:52.826358Z" }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from typing import List, Tuple, Callable, Any\n", "from collections.abc import Iterable" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:52.828309Z", "iopub.status.busy": "2024-01-18T17:19:52.828171Z", "iopub.status.idle": "2024-01-18T17:19:53.379184Z", "shell.execute_reply": "2024-01-18T17:19:53.378754Z" }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from Parser import process_inventory, process_vehicle, process_car, process_van, lr_graph # minor dependency" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "It takes inputs of the following form." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.381966Z", "iopub.status.busy": "2024-01-18T17:19:53.381582Z", "iopub.status.idle": "2024-01-18T17:19:53.383611Z", "shell.execute_reply": "2024-01-18T17:19:53.383321Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "INVENTORY = \"\"\"\\\n", "1997,van,Ford,E350\n", "2000,car,Mercury,Cougar\n", "1999,car,Chevy,Venture\\\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.385287Z", "iopub.status.busy": "2024-01-18T17:19:53.385163Z", "iopub.status.idle": "2024-01-18T17:19:53.387090Z", "shell.execute_reply": "2024-01-18T17:19:53.386810Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "We have a Ford E350 van from 1997 vintage.\n", "It is an old but reliable model!\n", "We have a Mercury Cougar car from 2000 vintage.\n", "It is 
an old but reliable model!\n", "We have a Chevy Venture car from 1999 vintage.\n", "It is an old but reliable model!\n" ] } ], "source": [ "print(process_inventory(INVENTORY))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We found from the [chapter on parsers](Parser.ipynb) that coarse grammars do not work well for fuzzing when the input format includes details expressed only in code. That is, even though we have the formal specification of CSV files ([RFC 4180](https://tools.ietf.org/html/rfc4180)), the inventory system includes further rules as to what is expected at each index of the CSV file. The solution of simply recombining existing inputs, while practical, is incomplete. In particular, it relies on a formal input specification being available in the first place. However, we have no assurance that the program obeys the input specification given." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "One of the ways out of this predicament is to interrogate the program under test as to what its input specification is. That is, if the program under test is written in a style such that specific methods are responsible for handling specific parts of the input, one can recover the parse tree by observing the process of parsing. Further, one can recover a reasonable approximation of the grammar by abstraction from multiple input trees." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ " _We start with the assumption (1) that the program is written in such a fashion that specific methods are responsible for parsing specific fragments of the input. This includes almost all ad hoc parsers._\n", "\n", "The idea is as follows:\n", "\n", "* Hook into the Python execution and observe the fragments of the input string as they are produced and named in different methods.\n", "* Stitch the input fragments together in a tree structure to retrieve the **Parse Tree**.\n", "* Abstract common elements from multiple parse trees to produce the **Context Free Grammar** of the input." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## A Simple Grammar Miner" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Say we want to obtain the input grammar for the function `process_vehicle()`. We first collect the sample inputs for this function." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.388897Z", "iopub.status.busy": "2024-01-18T17:19:53.388784Z", "iopub.status.idle": "2024-01-18T17:19:53.390463Z", "shell.execute_reply": "2024-01-18T17:19:53.390235Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "VEHICLES = INVENTORY.split('\\n')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The methods responsible for processing the inventory are the following." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.392018Z", "iopub.status.busy": "2024-01-18T17:19:53.391906Z", "iopub.status.idle": "2024-01-18T17:19:53.393543Z", "shell.execute_reply": "2024-01-18T17:19:53.393228Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "INVENTORY_METHODS = {\n", " 'process_inventory',\n", " 'process_vehicle',\n", " 'process_van',\n", " 'process_car'}" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We have seen from the chapter on [configuration fuzzing](ConfigurationFuzzer.ipynb) that one can hook into the Python runtime to observe the arguments to a function and any local variables created. We have also seen that one can obtain the context of execution by inspecting the `frame` argument. Here is a simple tracer that can return the local variables and other contextual information in a traced function. We reuse the `Coverage` tracing class." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Tracer" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.395228Z", "iopub.status.busy": "2024-01-18T17:19:53.395112Z", "iopub.status.idle": "2024-01-18T17:19:53.396981Z", "shell.execute_reply": "2024-01-18T17:19:53.396554Z" }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from Coverage import Coverage" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.398958Z", "iopub.status.busy": "2024-01-18T17:19:53.398795Z", "iopub.status.idle": "2024-01-18T17:19:53.400549Z", "shell.execute_reply": "2024-01-18T17:19:53.400257Z" }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "import inspect" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.402038Z", "iopub.status.busy": "2024-01-18T17:19:53.401930Z", "iopub.status.idle": "2024-01-18T17:19:53.404269Z", "shell.execute_reply": "2024-01-18T17:19:53.403977Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class Tracer(Coverage):\n", "    def traceit(self, frame, event, arg):\n", "        method_name = inspect.getframeinfo(frame).function\n", "        if method_name not in INVENTORY_METHODS:\n", "            return\n", "        file_name = inspect.getframeinfo(frame).filename\n", "\n", "        param_names = inspect.getargvalues(frame).args\n", "        lineno = inspect.getframeinfo(frame).lineno\n", "        local_vars = inspect.getargvalues(frame).locals\n", "        print(event, file_name, lineno, method_name, param_names, local_vars)\n", "        return self.traceit" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We run the code under trace context." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.405907Z", "iopub.status.busy": "2024-01-18T17:19:53.405804Z", "iopub.status.idle": "2024-01-18T17:19:53.456454Z", "shell.execute_reply": "2024-01-18T17:19:53.456165Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "call /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 29 process_vehicle ['vehicle'] {'vehicle': '1997,van,Ford,E350'}\n", "line /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 30 process_vehicle ['vehicle'] {'vehicle': '1997,van,Ford,E350'}\n", "line /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 31 process_vehicle ['vehicle'] {'vehicle': '1997,van,Ford,E350', 'year': '1997', 'kind': 'van', 'company': 'Ford', 'model': 'E350', '_': []}\n", "line /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 32 process_vehicle ['vehicle'] {'vehicle': '1997,van,Ford,E350', 'year': '1997', 'kind': 'van', 'company': 'Ford', 'model': 'E350', '_': []}\n", "call /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 40 process_van ['year', 'company', 'model'] {'year': '1997', 'company': 'Ford', 'model': 'E350'}\n", "line /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 41 process_van ['year', 'company', 'model'] {'year': '1997', 'company': 'Ford', 'model': 'E350'}\n", "line /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 42 process_van ['year', 'company', 'model'] {'year': '1997', 'company': 'Ford', 'model': 'E350', 'res': ['We have a Ford E350 van from 1997 vintage.']}\n", "line /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 43 process_van ['year', 'company', 'model'] {'year': '1997', 'company': 'Ford', 'model': 'E350', 'res': ['We have a Ford E350 van from 1997 vintage.'], 'iyear': 1997}\n", "line /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 46 process_van ['year', 'company', 'model'] {'year': 
'1997', 'company': 'Ford', 'model': 'E350', 'res': ['We have a Ford E350 van from 1997 vintage.'], 'iyear': 1997}\n", "line /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 47 process_van ['year', 'company', 'model'] {'year': '1997', 'company': 'Ford', 'model': 'E350', 'res': ['We have a Ford E350 van from 1997 vintage.', 'It is an old but reliable model!'], 'iyear': 1997}\n", "return /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 47 process_van ['year', 'company', 'model'] {'year': '1997', 'company': 'Ford', 'model': 'E350', 'res': ['We have a Ford E350 van from 1997 vintage.', 'It is an old but reliable model!'], 'iyear': 1997}\n", "return /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb 32 process_vehicle ['vehicle'] {'vehicle': '1997,van,Ford,E350', 'year': '1997', 'kind': 'van', 'company': 'Ford', 'model': 'E350', '_': []}\n" ] } ], "source": [ "with Tracer() as tracer:\n", "    process_vehicle(VEHICLES[0])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The main thing that we want out of tracing is a list of assignments of input fragments to different variables. We can use the tracing facility `settrace()` to get that as we showed above.\n", "\n", "However, the `settrace()` function hooks into the Python debugging facility. When it is in operation, no debugger can hook into the program. That is, if there is a problem with our grammar miner, we will not be able to attach a debugger to it to understand what is happening. This is not ideal. Hence, we limit the tracer to the simplest implementation possible, and implement the core of grammar mining in later stages." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The `traceit()` function relies on information from the `frame` variable which exposes Python internals. We define a `Context` class that encapsulates the information that we need from the `frame`." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Context\n", "\n", "The `Context` class provides easy access to information such as the current module and the parameter names." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.458508Z", "iopub.status.busy": "2024-01-18T17:19:53.458395Z", "iopub.status.idle": "2024-01-18T17:19:53.461016Z", "shell.execute_reply": "2024-01-18T17:19:53.460709Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class Context:\n", "    def __init__(self, frame, track_caller=True):\n", "        self.method = inspect.getframeinfo(frame).function\n", "        self.parameter_names = inspect.getargvalues(frame).args\n", "        self.file_name = inspect.getframeinfo(frame).filename\n", "        self.line_no = inspect.getframeinfo(frame).lineno\n", "\n", "    def _t(self):\n", "        return (self.file_name, self.line_no, self.method,\n", "                ','.join(self.parameter_names))\n", "\n", "    def __repr__(self):\n", "        return \"%s:%d:%s(%s)\" % self._t()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Here we add a few convenience methods that operate on the `frame` to `Context`." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.463101Z", "iopub.status.busy": "2024-01-18T17:19:53.462932Z", "iopub.status.idle": "2024-01-18T17:19:53.465678Z", "shell.execute_reply": "2024-01-18T17:19:53.465331Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class Context(Context):\n", "    def extract_vars(self, frame):\n", "        return inspect.getargvalues(frame).locals\n", "\n", "    def parameters(self, all_vars):\n", "        return {k: v for k, v in all_vars.items() if k in self.parameter_names}\n", "\n", "    def qualified(self, all_vars):\n", "        return {\"%s:%s\" % (self.method, k): v for k, v in all_vars.items()}" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We hook printing the context to our `traceit()` to see it in action. First we define a `log_event()` for displaying events." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.467730Z", "iopub.status.busy": "2024-01-18T17:19:53.467528Z", "iopub.status.idle": "2024-01-18T17:19:53.469528Z", "shell.execute_reply": "2024-01-18T17:19:53.469238Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def log_event(event, var):\n", "    print({'call': '->', 'return': '<-'}.get(event, ' '), var)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "And use the `log_event()` in the `traceit()` function." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.471544Z", "iopub.status.busy": "2024-01-18T17:19:53.471369Z", "iopub.status.idle": "2024-01-18T17:19:53.473421Z", "shell.execute_reply": "2024-01-18T17:19:53.473152Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class Tracer(Tracer):\n", "    def traceit(self, frame, event, arg):\n", "        log_event(event, Context(frame))\n", "        return self.traceit" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Running `process_vehicle()` under trace prints the contexts encountered." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.475038Z", "iopub.status.busy": "2024-01-18T17:19:53.474906Z", "iopub.status.idle": "2024-01-18T17:19:53.478637Z", "shell.execute_reply": "2024-01-18T17:19:53.478264Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-> /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:29:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:30:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:31:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:32:process_vehicle(vehicle)\n", "-> /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:40:process_van(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:41:process_van(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:42:process_van(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:43:process_van(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:46:process_van(year,company,model)\n", " 
/Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:47:process_van(year,company,model)\n", "<- /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:47:process_van(year,company,model)\n", "<- /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:32:process_vehicle(vehicle)\n", "-> /Users/zeller/Projects/fuzzingbook/notebooks/Coverage.ipynb:102:__exit__(self,exc_type,exc_value,tb)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Coverage.ipynb:105:__exit__(self,exc_type,exc_value,tb)\n" ] } ], "source": [ "with Tracer() as tracer:\n", " process_vehicle(VEHICLES[0])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The trace produced by executing any function can get overwhelmingly large. Hence, we need to restrict our attention to specific modules. Further, we also restrict our attention exclusively to `str` variables since these variables are more likely to contain input fragments. (We will show how to deal with complex objects later in exercises.)\n", "\n", "The `Context` class we developed earlier is used to decide which modules to monitor, and which variables to trace.\n", "\n", "We store the current *input string* so that it can be used to determine if any particular string fragments came from the current input string. Any optional arguments are processed separately." 
] }, { "cell_type": "code", "execution_count": 18, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.481084Z", "iopub.status.busy": "2024-01-18T17:19:53.480976Z", "iopub.status.idle": "2024-01-18T17:19:53.483029Z", "shell.execute_reply": "2024-01-18T17:19:53.482777Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class Tracer(Tracer):\n", "    def __init__(self, my_input, **kwargs):\n", "        self.options(kwargs)\n", "        self.my_input, self.trace = my_input, []" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We use an optional argument `files` to indicate the specific source files we are interested in, and `methods` to indicate which specific methods are of interest. Further, we also use `log` to specify whether verbose logging should be enabled during trace. We use the `log_event()` function we defined earlier for logging." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The options processing is as below." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.484605Z", "iopub.status.busy": "2024-01-18T17:19:53.484521Z", "iopub.status.idle": "2024-01-18T17:19:53.486425Z", "shell.execute_reply": "2024-01-18T17:19:53.486202Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class Tracer(Tracer):\n", "    def options(self, kwargs):\n", "        self.files = kwargs.get('files', [])\n", "        self.methods = kwargs.get('methods', [])\n", "        self.log = log_event if kwargs.get('log') else lambda _evt, _var: None" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The `files` and `methods` are checked to determine whether a particular event should be traced." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.488025Z", "iopub.status.busy": "2024-01-18T17:19:53.487930Z", "iopub.status.idle": "2024-01-18T17:19:53.490006Z", "shell.execute_reply": "2024-01-18T17:19:53.489772Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class Tracer(Tracer):\n", "    def tracing_context(self, cxt, event, arg):\n", "        fres = not self.files or any(\n", "            cxt.file_name.endswith(f) for f in self.files)\n", "        mres = not self.methods or any(cxt.method == m for m in self.methods)\n", "        return fres and mres" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Similar to the context of events, we also want to restrict our attention to specific variables. For now, we want to focus only on strings. (See the Exercises at the end of the chapter on how to extend it to other kinds of objects)." 
] }, { "cell_type": "code", "execution_count": 21, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.491503Z", "iopub.status.busy": "2024-01-18T17:19:53.491423Z", "iopub.status.idle": "2024-01-18T17:19:53.493259Z", "shell.execute_reply": "2024-01-18T17:19:53.492951Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class Tracer(Tracer):\n", "    def tracing_var(self, k, v):\n", "        return isinstance(v, str)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We modify the `traceit()` to call an `on_event()` function with the context information only on the specific events we are interested in." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.494975Z", "iopub.status.busy": "2024-01-18T17:19:53.494863Z", "iopub.status.idle": "2024-01-18T17:19:53.497896Z", "shell.execute_reply": "2024-01-18T17:19:53.497584Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class Tracer(Tracer):\n", "    def on_event(self, event, arg, cxt, my_vars):\n", "        self.trace.append((event, arg, cxt, my_vars))\n", "\n", "    def create_context(self, frame):\n", "        return Context(frame)\n", "\n", "    def traceit(self, frame, event, arg):\n", "        cxt = self.create_context(frame)\n", "        if not self.tracing_context(cxt, event, arg):\n", "            return self.traceit\n", "        self.log(event, cxt)\n", "\n", "        my_vars = {\n", "            k: v\n", "            for k, v in cxt.extract_vars(frame).items()\n", "            if self.tracing_var(k, v)\n", "        }\n", "        self.on_event(event, arg, cxt, my_vars)\n", "        return self.traceit" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The `Tracer` class can now focus on specific kinds of events on specific files. Further, it provides a first level filter for variables that we find interesting. 
For example, we want to focus specifically on variables from `process_*` methods that contain input fragments. Here is how our updated `Tracer` can be used." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.499521Z", "iopub.status.busy": "2024-01-18T17:19:53.499408Z", "iopub.status.idle": "2024-01-18T17:19:53.502685Z", "shell.execute_reply": "2024-01-18T17:19:53.502380Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-> /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:29:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:30:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:31:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:32:process_vehicle(vehicle)\n", "-> /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:40:process_van(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:41:process_van(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:42:process_van(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:43:process_van(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:46:process_van(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:47:process_van(year,company,model)\n", "<- /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:47:process_van(year,company,model)\n", "<- /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:32:process_vehicle(vehicle)\n" ] } ], "source": [ "with Tracer(VEHICLES[0], methods=INVENTORY_METHODS, log=True) as tracer:\n", " process_vehicle(VEHICLES[0])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The execution produced the following 
trace." ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.504267Z", "iopub.status.busy": "2024-01-18T17:19:53.504190Z", "iopub.status.idle": "2024-01-18T17:19:53.506541Z", "shell.execute_reply": "2024-01-18T17:19:53.506295Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "call process_vehicle {'vehicle': '1997,van,Ford,E350'}\n", "line process_vehicle {'vehicle': '1997,van,Ford,E350'}\n", "line process_vehicle {'vehicle': '1997,van,Ford,E350', 'year': '1997', 'kind': 'van', 'company': 'Ford', 'model': 'E350'}\n", "line process_vehicle {'vehicle': '1997,van,Ford,E350', 'year': '1997', 'kind': 'van', 'company': 'Ford', 'model': 'E350'}\n", "call process_van {'year': '1997', 'company': 'Ford', 'model': 'E350'}\n", "line process_van {'year': '1997', 'company': 'Ford', 'model': 'E350'}\n", "line process_van {'year': '1997', 'company': 'Ford', 'model': 'E350'}\n", "line process_van {'year': '1997', 'company': 'Ford', 'model': 'E350'}\n", "line process_van {'year': '1997', 'company': 'Ford', 'model': 'E350'}\n", "line process_van {'year': '1997', 'company': 'Ford', 'model': 'E350'}\n", "return process_van {'year': '1997', 'company': 'Ford', 'model': 'E350'}\n", "return process_vehicle {'vehicle': '1997,van,Ford,E350', 'year': '1997', 'kind': 'van', 'company': 'Ford', 'model': 'E350'}\n" ] } ], "source": [ "for t in tracer.trace:\n", " print(t[0], t[2].method, dict(t[3]))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Since we are saving the input already in `Tracer`, it is redundant to specify it separately again as an argument." 
] }, { "cell_type": "code", "execution_count": 25, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.508116Z", "iopub.status.busy": "2024-01-18T17:19:53.508036Z", "iopub.status.idle": "2024-01-18T17:19:53.511524Z", "shell.execute_reply": "2024-01-18T17:19:53.511241Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-> /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:29:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:30:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:31:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:32:process_vehicle(vehicle)\n", "-> /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:40:process_van(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:41:process_van(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:42:process_van(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:43:process_van(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:46:process_van(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:47:process_van(year,company,model)\n", "<- /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:47:process_van(year,company,model)\n", "<- /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:32:process_vehicle(vehicle)\n" ] } ], "source": [ "with Tracer(VEHICLES[0], methods=INVENTORY_METHODS, log=True) as tracer:\n", " process_vehicle(tracer.my_input)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### DefineTracker" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We define a `DefineTracker` class that processes the 
trace from the `Tracer`. The idea is to store those variable definitions that are input fragments.\n", "\n", "The tracker identifies string fragments that are part of the input string, and stores them in a dictionary `my_assignments`. It saves the trace, and the corresponding input for processing. Finally, it calls `process()` to process the `trace` it was given. We will start with a simple tracker that relies on certain assumptions, and later see how these assumptions can be relaxed." ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.513781Z", "iopub.status.busy": "2024-01-18T17:19:53.513502Z", "iopub.status.idle": "2024-01-18T17:19:53.515584Z", "shell.execute_reply": "2024-01-18T17:19:53.515355Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class DefineTracker:\n", "    def __init__(self, my_input, trace, **kwargs):\n", "        self.options(kwargs)\n", "        self.my_input = my_input\n", "        self.trace = trace\n", "        self.my_assignments = {}\n", "        self.process()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "One problem with substring search is that short strings tend to occur inside other strings even when they did not come from the original input. For instance, an input fragment `v` could equally have come from `van` or `chevy`. We rely on being able to predict the exact place in the input where a given fragment occurred. Hence, we define a constant `FRAGMENT_LEN` and ignore strings shorter than that length. We also incorporate a logging facility as before." 
] }, { "cell_type": "code", "execution_count": 27, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.517382Z", "iopub.status.busy": "2024-01-18T17:19:53.517057Z", "iopub.status.idle": "2024-01-18T17:19:53.518851Z", "shell.execute_reply": "2024-01-18T17:19:53.518616Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "FRAGMENT_LEN = 3" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.520531Z", "iopub.status.busy": "2024-01-18T17:19:53.520348Z", "iopub.status.idle": "2024-01-18T17:19:53.522130Z", "shell.execute_reply": "2024-01-18T17:19:53.521902Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class DefineTracker(DefineTracker):\n", "    def options(self, kwargs):\n", "        self.log = log_event if kwargs.get('log') else lambda _evt, _var: None\n", "        self.fragment_len = kwargs.get('fragment_len', FRAGMENT_LEN)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Our tracer simply records the variable values as they occur. We next need to check if the variables contain values from the **input string**. Common ways to do this are symbolic execution or at least dynamic tainting, which are powerful but also complex. However, one can obtain a reasonable approximation by simply relying on substring search. That is, we consider any value produced that is a substring of the original input string to have come from the original input." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We define an `is_input_fragment()` method that relies on string inclusion to detect if the string came from the input." 
] }, { "cell_type": "code", "execution_count": 29, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.523651Z", "iopub.status.busy": "2024-01-18T17:19:53.523492Z", "iopub.status.idle": "2024-01-18T17:19:53.525342Z", "shell.execute_reply": "2024-01-18T17:19:53.525111Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class DefineTracker(DefineTracker):\n", " def is_input_fragment(self, var, value):\n", " return len(value) >= self.fragment_len and value in self.my_input" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We can use `is_input_fragment()` to select only a subset of variables defined, as implemented below in `fragments()`." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.526915Z", "iopub.status.busy": "2024-01-18T17:19:53.526790Z", "iopub.status.idle": "2024-01-18T17:19:53.528726Z", "shell.execute_reply": "2024-01-18T17:19:53.528444Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class DefineTracker(DefineTracker):\n", " def fragments(self, variables):\n", " return {k: v for k, v in variables.items(\n", " ) if self.is_input_fragment(k, v)}" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The tracker processes each event, and at each event, it updates the dictionary `my_assignments` with the current local variables that contain strings that are part of the input. Note that there is a choice here with respect to what happens during reassignment. We can either discard all the reassignments, or keep only the last assignment. Here, we choose the latter. If you want the former behavior, check whether the value exists in `my_assignments` before storing a fragment." 
] }, { "cell_type": "code", "execution_count": 31, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.530332Z", "iopub.status.busy": "2024-01-18T17:19:53.530226Z", "iopub.status.idle": "2024-01-18T17:19:53.532384Z", "shell.execute_reply": "2024-01-18T17:19:53.532148Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class DefineTracker(DefineTracker):\n", "    def track_event(self, event, arg, cxt, my_vars):\n", "        self.log(event, (cxt.method, my_vars))\n", "        self.my_assignments.update(self.fragments(my_vars))\n", "\n", "    def process(self):\n", "        for event, arg, cxt, my_vars in self.trace:\n", "            self.track_event(event, arg, cxt, my_vars)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Using the tracker, we can obtain the input fragments. For example, say we are only interested in strings that are at least `5` characters long." ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.533870Z", "iopub.status.busy": "2024-01-18T17:19:53.533742Z", "iopub.status.idle": "2024-01-18T17:19:53.535725Z", "shell.execute_reply": "2024-01-18T17:19:53.535437Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "vehicle = '1997,van,Ford,E350'\n" ] } ], "source": [ "tracker = DefineTracker(tracer.my_input, tracer.trace, fragment_len=5)\n", "for k, v in tracker.my_assignments.items():\n", "    print(k, '=', repr(v))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Or strings that are at least `3` characters long (the default, `FRAGMENT_LEN`)." 
] }, { "cell_type": "code", "execution_count": 33, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.537282Z", "iopub.status.busy": "2024-01-18T17:19:53.537179Z", "iopub.status.idle": "2024-01-18T17:19:53.539147Z", "shell.execute_reply": "2024-01-18T17:19:53.538874Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "vehicle = '1997,van,Ford,E350'\n", "year = '1997'\n", "kind = 'van'\n", "company = 'Ford'\n", "model = 'E350'\n" ] } ], "source": [ "tracker = DefineTracker(tracer.my_input, tracer.trace)\n", "for k, v in tracker.my_assignments.items():\n", " print(k, '=', repr(v))" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.540654Z", "iopub.status.busy": "2024-01-18T17:19:53.540554Z", "iopub.status.idle": "2024-01-18T17:19:53.542109Z", "shell.execute_reply": "2024-01-18T17:19:53.541879Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class DefineTracker(DefineTracker):\n", " def assignments(self):\n", " return self.my_assignments.items()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Assembling a Derivation Tree" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.543626Z", "iopub.status.busy": "2024-01-18T17:19:53.543530Z", "iopub.status.idle": "2024-01-18T17:19:53.545303Z", "shell.execute_reply": "2024-01-18T17:19:53.545019Z" }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from Grammars import START_SYMBOL, syntax_diagram, \\\n", " is_nonterminal, Grammar" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.546798Z", "iopub.status.busy": "2024-01-18T17:19:53.546693Z", "iopub.status.idle": "2024-01-18T17:19:53.548176Z", "shell.execute_reply": 
"2024-01-18T17:19:53.547968Z" }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from GrammarFuzzer import GrammarFuzzer, display_tree, \\\n", " DerivationTree" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The input fragments from the `DefineTracker` only tell half the story. The fragments may be created at different stages of parsing. Hence, we need to assemble the fragments to a derivation tree of the input. The basic idea is as follows:\n", "\n", "Our input from the previous step was:\n", "\n", "```python\n", "\"1997,van,Ford,E350\"\n", "```\n", "\n", "We start a derivation tree, and associate it with the start symbol in the grammar." ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.549513Z", "iopub.status.busy": "2024-01-18T17:19:53.549431Z", "iopub.status.idle": "2024-01-18T17:19:53.551278Z", "shell.execute_reply": "2024-01-18T17:19:53.551043Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "derivation_tree: DerivationTree = (START_SYMBOL, [(\"1997,van,Ford,E350\", [])])" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.552672Z", "iopub.status.busy": "2024-01-18T17:19:53.552592Z", "iopub.status.idle": "2024-01-18T17:19:53.941286Z", "shell.execute_reply": "2024-01-18T17:19:53.940918Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "0\n", "<start>\n", "\n", "\n", "\n", "1\n", "1997,van,Ford,E350\n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display_tree(derivation_tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The next 
input was:\n", "```python\n", "vehicle = \"1997,van,Ford,E350\"\n", "```\n", "Since `vehicle` covers the `<start>` node's value completely, we replace the value with the `<vehicle>` node." ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.943292Z", "iopub.status.busy": "2024-01-18T17:19:53.943133Z", "iopub.status.idle": "2024-01-18T17:19:53.945280Z", "shell.execute_reply": "2024-01-18T17:19:53.945017Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "derivation_tree: DerivationTree = (START_SYMBOL, \n", "                   [('<vehicle>', [(\"1997,van,Ford,E350\", [])],\n", "                     [])])" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:53.946729Z", "iopub.status.busy": "2024-01-18T17:19:53.946595Z", "iopub.status.idle": "2024-01-18T17:19:54.324708Z", "shell.execute_reply": "2024-01-18T17:19:54.324347Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "0\n", "<start>\n", "\n", "\n", "\n", "1\n", "<vehicle>\n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "\n", "\n", "\n", "2\n", "1997,van,Ford,E350\n", "\n", "\n", "\n", "1->2\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display_tree(derivation_tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The next input was:\n", "```python\n", "year = '1997'\n", "```\n", "Traversing the derivation tree from `<start>`, we see that it replaces a portion of the `<vehicle>` node's value. Hence we split the `<vehicle>` node's value into two children, where one corresponds to the value `\"1997\"` and the other to `\",van,Ford,E350\"`, and replace the first one with the node `<year>`." 
] }, { "cell_type": "code", "execution_count": 41, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:54.326610Z", "iopub.status.busy": "2024-01-18T17:19:54.326450Z", "iopub.status.idle": "2024-01-18T17:19:54.328785Z", "shell.execute_reply": "2024-01-18T17:19:54.328486Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "derivation_tree: DerivationTree = (START_SYMBOL, \n", "                   [('<vehicle>', [('<year>', [('1997', [])]),\n", "                                   (\",van,Ford,E350\", [])], [])])" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:54.330321Z", "iopub.status.busy": "2024-01-18T17:19:54.330188Z", "iopub.status.idle": "2024-01-18T17:19:54.705251Z", "shell.execute_reply": "2024-01-18T17:19:54.704909Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "0\n", "<start>\n", "\n", "\n", "\n", "1\n", "<vehicle>\n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "\n", "\n", "\n", "2\n", "<year>\n", "\n", "\n", "\n", "1->2\n", "\n", "\n", "\n", "\n", "\n", "4\n", ",van,Ford,E350\n", "\n", "\n", "\n", "1->4\n", "\n", "\n", "\n", "\n", "\n", "3\n", "1997\n", "\n", "\n", "\n", "2->3\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display_tree(derivation_tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We perform similar operations for \n", "```python\n", "company = 'Ford'\n", "```" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:54.707066Z", "iopub.status.busy": "2024-01-18T17:19:54.706942Z", "iopub.status.idle": "2024-01-18T17:19:54.709234Z", "shell.execute_reply": "2024-01-18T17:19:54.708894Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "derivation_tree: 
DerivationTree = (START_SYMBOL, \n", "                   [('<vehicle>', [('<year>', [('1997', [])]),\n", "                                   (\",van,\", []),\n", "                                   ('<company>', [('Ford', [])]),\n", "                                   (\",E350\", [])], [])])" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:54.710840Z", "iopub.status.busy": "2024-01-18T17:19:54.710714Z", "iopub.status.idle": "2024-01-18T17:19:55.110981Z", "shell.execute_reply": "2024-01-18T17:19:55.110562Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "0\n", "<start>\n", "\n", "\n", "\n", "1\n", "<vehicle>\n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "\n", "\n", "\n", "2\n", "<year>\n", "\n", "\n", "\n", "1->2\n", "\n", "\n", "\n", "\n", "\n", "4\n", ",van,\n", "\n", "\n", "\n", "1->4\n", "\n", "\n", "\n", "\n", "\n", "5\n", "<company>\n", "\n", "\n", "\n", "1->5\n", "\n", "\n", "\n", "\n", "\n", "7\n", ",E350\n", "\n", "\n", "\n", "1->7\n", "\n", "\n", "\n", "\n", "\n", "3\n", "1997\n", "\n", "\n", "\n", "2->3\n", "\n", "\n", "\n", "\n", "\n", "6\n", "Ford\n", "\n", "\n", "\n", "5->6\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display_tree(derivation_tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Similarly for\n", "```python\n", "kind = 'van'\n", "```\n", "and\n", "```python\n", "model = 'E350'\n", "```" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:55.113465Z", "iopub.status.busy": "2024-01-18T17:19:55.113132Z", "iopub.status.idle": "2024-01-18T17:19:55.115850Z", "shell.execute_reply": "2024-01-18T17:19:55.115580Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "derivation_tree: DerivationTree = (START_SYMBOL, \n", "                   [('<vehicle>', [('<year>', [('1997', [])]),\n", "                                   (\",\", []),\n", 
"                                   (\"<kind>\", [('van', [])]),\n", "                                   (\",\", []),\n", "                                   ('<company>', [('Ford', [])]),\n", "                                   (\",\", []),\n", "                                   (\"<model>\", [('E350', [])])\n", "                                   ], [])])" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:55.117413Z", "iopub.status.busy": "2024-01-18T17:19:55.117312Z", "iopub.status.idle": "2024-01-18T17:19:55.494644Z", "shell.execute_reply": "2024-01-18T17:19:55.494281Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "0\n", "<start>\n", "\n", "\n", "\n", "1\n", "<vehicle>\n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "\n", "\n", "\n", "2\n", "<year>\n", "\n", "\n", "\n", "1->2\n", "\n", "\n", "\n", "\n", "\n", "4\n", ", (44)\n", "\n", "\n", "\n", "1->4\n", "\n", "\n", "\n", "\n", "\n", "5\n", "<kind>\n", "\n", "\n", "\n", "1->5\n", "\n", "\n", "\n", "\n", "\n", "7\n", ", (44)\n", "\n", "\n", "\n", "1->7\n", "\n", "\n", "\n", "\n", "\n", "8\n", "<company>\n", "\n", "\n", "\n", "1->8\n", "\n", "\n", "\n", "\n", "\n", "10\n", ", (44)\n", "\n", "\n", "\n", "1->10\n", "\n", "\n", "\n", "\n", "\n", "11\n", "<model>\n", "\n", "\n", "\n", "1->11\n", "\n", "\n", "\n", "\n", "\n", "3\n", "1997\n", "\n", "\n", "\n", "2->3\n", "\n", "\n", "\n", "\n", "\n", "6\n", "van\n", "\n", "\n", "\n", "5->6\n", "\n", "\n", "\n", "\n", "\n", "9\n", "Ford\n", "\n", "\n", "\n", "8->9\n", "\n", "\n", "\n", "\n", "\n", "12\n", "E350\n", "\n", "\n", "\n", "11->12\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display_tree(derivation_tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We now develop the complete algorithm with the above described steps.\n", "The derivation tree `TreeMiner` is initialized with the input string, and the variable assignments, and it converts the assignments to 
the corresponding derivation tree." ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:55.496327Z", "iopub.status.busy": "2024-01-18T17:19:55.496217Z", "iopub.status.idle": "2024-01-18T17:19:55.498847Z", "shell.execute_reply": "2024-01-18T17:19:55.498489Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class TreeMiner:\n", " def __init__(self, my_input, my_assignments, **kwargs):\n", " self.options(kwargs)\n", " self.my_input = my_input\n", " self.my_assignments = my_assignments\n", " self.tree = self.get_derivation_tree()\n", "\n", " def options(self, kwargs):\n", " self.log = log_call if kwargs.get('log') else lambda _i, _v: None\n", "\n", " def get_derivation_tree(self):\n", " return (START_SYMBOL, [])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The `log_call()` is as follows." ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:55.500651Z", "iopub.status.busy": "2024-01-18T17:19:55.500531Z", "iopub.status.idle": "2024-01-18T17:19:55.502233Z", "shell.execute_reply": "2024-01-18T17:19:55.501934Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def log_call(indent, var):\n", " print('\\t' * indent, var)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The basic idea is as follows:\n", "* **For now, we assume that the value assigned to a variable is stable. That is, it is never reassigned. In particular, there are no recursive calls, or multiple calls to the same function from different parts.** (We will show how to overcome this limitation later).\n", "* For each pair _var_, _value_ found in `my_assignments`:\n", " 1. We search for occurrences of _value_ `val` in the derivation tree recursively.\n", " 2. 
If an occurrence was found as a value `V1` of a node `P1`, we partition the value of the node `P1` into three parts, with the central part matching the _value_ `val`, and the first and last part, the corresponding prefix and suffix in `V1`.\n", " 3. Reconstitute the node `P1` with three children, where prefix and suffix mentioned earlier are string values, and the matching value `val` is replaced by a node `var` with a single value `val`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "First, we define a wrapper to generate a nonterminal from a variable name." ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:55.503812Z", "iopub.status.busy": "2024-01-18T17:19:55.503690Z", "iopub.status.idle": "2024-01-18T17:19:55.505406Z", "shell.execute_reply": "2024-01-18T17:19:55.505138Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def to_nonterminal(var):\n", " return \"<\" + var.lower() + \">\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The `string_part_of_value()` method checks whether the given `part` value was part of the whole." 
] }, { "cell_type": "code", "execution_count": 50, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:55.506958Z", "iopub.status.busy": "2024-01-18T17:19:55.506850Z", "iopub.status.idle": "2024-01-18T17:19:55.508542Z", "shell.execute_reply": "2024-01-18T17:19:55.508304Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class TreeMiner(TreeMiner):\n", "    def string_part_of_value(self, part, value):\n", "        return (part in value)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The `partition_by_part()` method splits `value` at the given part if it matches, and returns a list containing the prefix, the part that was replaced, and the suffix. Empty prefixes and suffixes are omitted. This is a format that can be used as a part of the list of children." ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:55.510130Z", "iopub.status.busy": "2024-01-18T17:19:55.510026Z", "iopub.status.idle": "2024-01-18T17:19:55.511794Z", "shell.execute_reply": "2024-01-18T17:19:55.511571Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class TreeMiner(TreeMiner):\n", "    def partition(self, part, value):\n", "        return value.partition(part)" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:55.513175Z", "iopub.status.busy": "2024-01-18T17:19:55.513056Z", "iopub.status.idle": "2024-01-18T17:19:55.515093Z", "shell.execute_reply": "2024-01-18T17:19:55.514843Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class TreeMiner(TreeMiner):\n", "    def partition_by_part(self, pair, value):\n", "        k, part = pair\n", "        prefix_k_suffix = [\n", "            (k, [[part, []]]) if i == 1 else (e, [])\n", "            for i, e in enumerate(self.partition(part, value))\n", "            if e]\n", "        return prefix_k_suffix" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { 
"slide_type": "fragment" } }, "source": [ "The `insert_into_tree()` method accepts a tree `my_tree` and a `(k, v)` pair. It recursively checks whether the given pair can be applied. If the pair can be applied, it applies the pair and returns `True`; otherwise, it returns `False`." ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:55.516820Z", "iopub.status.busy": "2024-01-18T17:19:55.516708Z", "iopub.status.idle": "2024-01-18T17:19:55.520155Z", "shell.execute_reply": "2024-01-18T17:19:55.519840Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class TreeMiner(TreeMiner):\n", "    def insert_into_tree(self, my_tree, pair):\n", "        var, values = my_tree\n", "        k, v = pair\n", "        self.log(1, \"- Node: %s\\t\\t? (%s:%s)\" % (var, k, repr(v)))\n", "        applied = False\n", "        for i, value_ in enumerate(values):\n", "            value, arr = value_\n", "            self.log(2, \"-> [%d] %s\" % (i, repr(value)))\n", "            if is_nonterminal(value):\n", "                applied = self.insert_into_tree(value_, pair)\n", "                if applied:\n", "                    break\n", "            elif self.string_part_of_value(v, value):\n", "                prefix_k_suffix = self.partition_by_part(pair, value)\n", "                del values[i]\n", "                for j, rep in enumerate(prefix_k_suffix):\n", "                    values.insert(j + i, rep)\n", "                applied = True\n", "\n", "                self.log(2, \" > %s\" % (repr([i[0] for i in prefix_k_suffix])))\n", "                break\n", "            else:\n", "                continue\n", "        return applied" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Here is how `insert_into_tree()` is used." 
] }, { "cell_type": "code", "execution_count": 54, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:55.521880Z", "iopub.status.busy": "2024-01-18T17:19:55.521767Z", "iopub.status.idle": "2024-01-18T17:19:55.523694Z", "shell.execute_reply": "2024-01-18T17:19:55.523393Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "tree: DerivationTree = (START_SYMBOL, [(\"1997,van,Ford,E350\", [])])\n", "m = TreeMiner('', {}, log=True)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "First, we have our input string as the only node." ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:55.525237Z", "iopub.status.busy": "2024-01-18T17:19:55.525145Z", "iopub.status.idle": "2024-01-18T17:19:55.929214Z", "shell.execute_reply": "2024-01-18T17:19:55.928633Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "0\n", "<start>\n", "\n", "\n", "\n", "1\n", "1997,van,Ford,E350\n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display_tree(tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Inserting the `<vehicle>` node." ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:55.931407Z", "iopub.status.busy": "2024-01-18T17:19:55.931256Z", "iopub.status.idle": "2024-01-18T17:19:55.933600Z", "shell.execute_reply": "2024-01-18T17:19:55.933270Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\t - Node: <start>\t\t? 
(<vehicle>:'1997,van,Ford,E350')\n", "\t\t -> [0] '1997,van,Ford,E350'\n", "\t\t > ['<vehicle>']\n" ] } ], "source": [ "v = m.insert_into_tree(tree, ('<vehicle>', \"1997,van,Ford,E350\"))" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:55.935265Z", "iopub.status.busy": "2024-01-18T17:19:55.935138Z", "iopub.status.idle": "2024-01-18T17:19:56.321766Z", "shell.execute_reply": "2024-01-18T17:19:56.321407Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "0\n", "<start>\n", "\n", "\n", "\n", "1\n", "<vehicle>\n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "\n", "\n", "\n", "2\n", "1997,van,Ford,E350\n", "\n", "\n", "\n", "1->2\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display_tree(tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Inserting the `<model>` node." ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:56.323685Z", "iopub.status.busy": "2024-01-18T17:19:56.323562Z", "iopub.status.idle": "2024-01-18T17:19:56.325563Z", "shell.execute_reply": "2024-01-18T17:19:56.325311Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\t - Node: <start>\t\t? (<model>:'E350')\n", "\t\t -> [0] '<vehicle>'\n", "\t - Node: <vehicle>\t\t? 
(<model>:'E350')\n", "\t\t -> [0] '1997,van,Ford,E350'\n", "\t\t > ['1997,van,Ford,', '<model>']\n" ] } ], "source": [ "v = m.insert_into_tree(tree, ('<model>', 'E350'))" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:56.327475Z", "iopub.status.busy": "2024-01-18T17:19:56.327245Z", "iopub.status.idle": "2024-01-18T17:19:56.692057Z", "shell.execute_reply": "2024-01-18T17:19:56.691671Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "0\n", "<start>\n", "\n", "\n", "\n", "1\n", "<vehicle>\n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "\n", "\n", "\n", "2\n", "1997,van,Ford,\n", "\n", "\n", "\n", "1->2\n", "\n", "\n", "\n", "\n", "\n", "3\n", "<model>\n", "\n", "\n", "\n", "1->3\n", "\n", "\n", "\n", "\n", "\n", "4\n", "E350\n", "\n", "\n", "\n", "3->4\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display_tree(tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Inserting `<company>`." ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:56.694084Z", "iopub.status.busy": "2024-01-18T17:19:56.693931Z", "iopub.status.idle": "2024-01-18T17:19:56.696562Z", "shell.execute_reply": "2024-01-18T17:19:56.696205Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\t - Node: <start>\t\t? (<company>:'Ford')\n", "\t\t -> [0] '<vehicle>'\n", "\t - Node: <vehicle>\t\t? 
(:'Ford')\n", "\t\t -> [0] '1997,van,Ford,'\n", "\t\t > ['1997,van,', '', ',']\n" ] } ], "source": [ "v = m.insert_into_tree(tree, ('<company>', 'Ford'))" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:56.698920Z", "iopub.status.busy": "2024-01-18T17:19:56.698723Z", "iopub.status.idle": "2024-01-18T17:19:57.088624Z", "shell.execute_reply": "2024-01-18T17:19:57.088213Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "0\n", "<start>\n", "\n", "\n", "\n", "1\n", "<vehicle>\n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "\n", "\n", "\n", "2\n", "1997,van,\n", "\n", "\n", "\n", "1->2\n", "\n", "\n", "\n", "\n", "\n", "3\n", "<company>\n", "\n", "\n", "\n", "1->3\n", "\n", "\n", "\n", "\n", "\n", "5\n", ", (44)\n", "\n", "\n", "\n", "1->5\n", "\n", "\n", "\n", "\n", "\n", "6\n", "<model>\n", "\n", "\n", "\n", "1->6\n", "\n", "\n", "\n", "\n", "\n", "4\n", "Ford\n", "\n", "\n", "\n", "3->4\n", "\n", "\n", "\n", "\n", "\n", "7\n", "E350\n", "\n", "\n", "\n", "6->7\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display_tree(tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Inserting `<kind>` node." ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:57.090562Z", "iopub.status.busy": "2024-01-18T17:19:57.090408Z", "iopub.status.idle": "2024-01-18T17:19:57.092920Z", "shell.execute_reply": "2024-01-18T17:19:57.092553Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\t - Node: \t\t? (:'van')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? 
(:'van')\n", "\t\t -> [0] '1997,van,'\n", "\t\t > ['1997,', '', ',']\n" ] } ], "source": [ "v = m.insert_into_tree(tree, ('<kind>', 'van'))" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:57.094892Z", "iopub.status.busy": "2024-01-18T17:19:57.094753Z", "iopub.status.idle": "2024-01-18T17:19:57.511885Z", "shell.execute_reply": "2024-01-18T17:19:57.511454Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "0\n", "<start>\n", "\n", "\n", "\n", "1\n", "<vehicle>\n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "\n", "\n", "\n", "2\n", "1997,\n", "\n", "\n", "\n", "1->2\n", "\n", "\n", "\n", "\n", "\n", "3\n", "<kind>\n", "\n", "\n", "\n", "1->3\n", "\n", "\n", "\n", "\n", "\n", "5\n", ", (44)\n", "\n", "\n", "\n", "1->5\n", "\n", "\n", "\n", "\n", "\n", "6\n", "<company>\n", "\n", "\n", "\n", "1->6\n", "\n", "\n", "\n", "\n", "\n", "8\n", ", (44)\n", "\n", "\n", "\n", "1->8\n", "\n", "\n", "\n", "\n", "\n", "9\n", "<model>\n", "\n", "\n", "\n", "1->9\n", "\n", "\n", "\n", "\n", "\n", "4\n", "van\n", "\n", "\n", "\n", "3->4\n", "\n", "\n", "\n", "\n", "\n", "7\n", "Ford\n", "\n", "\n", "\n", "6->7\n", "\n", "\n", "\n", "\n", "\n", "10\n", "E350\n", "\n", "\n", "\n", "9->10\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display_tree(tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Inserting `<year>` node." 
] }, { "cell_type": "code", "execution_count": 64, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:57.513812Z", "iopub.status.busy": "2024-01-18T17:19:57.513695Z", "iopub.status.idle": "2024-01-18T17:19:57.515867Z", "shell.execute_reply": "2024-01-18T17:19:57.515632Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\t - Node: \t\t? (:'1997')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'1997')\n", "\t\t -> [0] '1997,'\n", "\t\t > ['', ',']\n" ] } ], "source": [ "v = m.insert_into_tree(tree, ('<year>', '1997'))" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:57.517530Z", "iopub.status.busy": "2024-01-18T17:19:57.517302Z", "iopub.status.idle": "2024-01-18T17:19:57.919123Z", "shell.execute_reply": "2024-01-18T17:19:57.918792Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "0\n", "<start>\n", "\n", "\n", "\n", "1\n", "<vehicle>\n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "\n", "\n", "\n", "2\n", "<year>\n", "\n", "\n", "\n", "1->2\n", "\n", "\n", "\n", "\n", "\n", "4\n", ", (44)\n", "\n", "\n", "\n", "1->4\n", "\n", "\n", "\n", "\n", "\n", "5\n", "<kind>\n", "\n", "\n", "\n", "1->5\n", "\n", "\n", "\n", "\n", "\n", "7\n", ", (44)\n", "\n", "\n", "\n", "1->7\n", "\n", "\n", "\n", "\n", "\n", "8\n", "<company>\n", "\n", "\n", "\n", "1->8\n", "\n", "\n", "\n", "\n", "\n", "10\n", ", (44)\n", "\n", "\n", "\n", "1->10\n", "\n", "\n", "\n", "\n", "\n", "11\n", "<model>\n", "\n", "\n", "\n", "1->11\n", "\n", "\n", "\n", "\n", "\n", "3\n", "1997\n", "\n", "\n", "\n", "2->3\n", "\n", "\n", "\n", "\n", "\n", "6\n", "van\n", "\n", "\n", "\n", "5->6\n", "\n", "\n", "\n", "\n", "\n", "9\n", "Ford\n", "\n", "\n", "\n", "8->9\n", "\n", "\n", "\n", "\n", "\n", "12\n", "E350\n", "\n", "\n", "\n", "11->12\n", "\n", "\n", "\n", "\n", 
"\n" ], "text/plain": [ "" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display_tree(tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "To make life simple, we define a wrapper function `nt_var()` that converts a token to its corresponding nonterminal symbol." ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:57.920905Z", "iopub.status.busy": "2024-01-18T17:19:57.920783Z", "iopub.status.idle": "2024-01-18T17:19:57.922806Z", "shell.execute_reply": "2024-01-18T17:19:57.922527Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class TreeMiner(TreeMiner):\n", "    def nt_var(self, var):\n", "        return var if is_nonterminal(var) else to_nonterminal(var)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Now, we need a way to apply a new definition (a variable and its value) to the derivation tree." ] }, { "cell_type": "code", "execution_count": 67, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:57.924261Z", "iopub.status.busy": "2024-01-18T17:19:57.924152Z", "iopub.status.idle": "2024-01-18T17:19:57.925967Z", "shell.execute_reply": "2024-01-18T17:19:57.925682Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class TreeMiner(TreeMiner):\n", "    def apply_new_definition(self, tree, var, value):\n", "        nt_var = self.nt_var(var)\n", "        return self.insert_into_tree(tree, (nt_var, value))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The complete algorithm, which applies each assignment in turn to the tree, is implemented as `get_derivation_tree()`." 
] }, { "cell_type": "code", "execution_count": 68, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:57.927558Z", "iopub.status.busy": "2024-01-18T17:19:57.927412Z", "iopub.status.idle": "2024-01-18T17:19:57.930267Z", "shell.execute_reply": "2024-01-18T17:19:57.929608Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class TreeMiner(TreeMiner):\n", "    def get_derivation_tree(self):\n", "        tree = (START_SYMBOL, [(self.my_input, [])])\n", "\n", "        for var, value in self.my_assignments:\n", "            self.log(0, \"%s=%s\" % (var, repr(value)))\n", "            self.apply_new_definition(tree, var, value)\n", "        return tree" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The `TreeMiner` is used as follows:" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:57.932420Z", "iopub.status.busy": "2024-01-18T17:19:57.932277Z", "iopub.status.idle": "2024-01-18T17:19:57.937492Z", "shell.execute_reply": "2024-01-18T17:19:57.937256Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " vehicle='1997,van,Ford,E350'\n", "\t - Node: \t\t? (:'1997,van,Ford,E350')\n", "\t\t -> [0] '1997,van,Ford,E350'\n", "\t\t > ['']\n", " year='1997'\n", "\t - Node: \t\t? (:'1997')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'1997')\n", "\t\t -> [0] '1997,van,Ford,E350'\n", "\t\t > ['', ',van,Ford,E350']\n", " kind='van'\n", "\t - Node: \t\t? (:'van')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'van')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'van')\n", "\t\t -> [0] '1997'\n", "\t\t -> [1] ',van,Ford,E350'\n", "\t\t > [',', '', ',Ford,E350']\n", " company='Ford'\n", "\t - Node: \t\t? (:'Ford')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'Ford')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? 
(:'Ford')\n", "\t\t -> [0] '1997'\n", "\t\t -> [1] ','\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? 
(:'Ford')\n", "\t\t -> [0] 'van'\n", "\t\t -> [3] ',Ford,E350'\n", "\t\t > [',', '', ',E350']\n", " model='E350'\n", "\t - Node: \t\t? (:'E350')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'E350')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'E350')\n", "\t\t -> [0] '1997'\n", "\t\t -> [1] ','\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'E350')\n", "\t\t -> [0] 'van'\n", "\t\t -> [3] ','\n", "\t\t -> [4] ''\n", "\t - Node: \t\t? (:'E350')\n", "\t\t -> [0] 'Ford'\n", "\t\t -> [5] ',E350'\n", "\t\t > [',', '']\n" ] }, { "data": { "text/plain": [ "('',\n", " [('',\n", " [('', [['1997', []]]),\n", " (',', []),\n", " ('', [['van', []]]),\n", " (',', []),\n", " ('', [['Ford', []]]),\n", " (',', []),\n", " ('', [['E350', []]])])])" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "with Tracer(VEHICLES[0]) as tracer:\n", " process_vehicle(tracer.my_input)\n", "assignments = DefineTracker(tracer.my_input, tracer.trace).assignments()\n", "dt = TreeMiner(tracer.my_input, assignments, log=True)\n", "dt.tree" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The obtained derivation tree is as below." 
] }, { "cell_type": "code", "execution_count": 70, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:57.939192Z", "iopub.status.busy": "2024-01-18T17:19:57.939041Z", "iopub.status.idle": "2024-01-18T17:19:58.357396Z", "shell.execute_reply": "2024-01-18T17:19:58.357041Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "0\n", "<start>\n", "\n", "\n", "\n", "1\n", "<vehicle>\n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "\n", "\n", "\n", "2\n", "<year>\n", "\n", "\n", "\n", "1->2\n", "\n", "\n", "\n", "\n", "\n", "4\n", ", (44)\n", "\n", "\n", "\n", "1->4\n", "\n", "\n", "\n", "\n", "\n", "5\n", "<kind>\n", "\n", "\n", "\n", "1->5\n", "\n", "\n", "\n", "\n", "\n", "7\n", ", (44)\n", "\n", "\n", "\n", "1->7\n", "\n", "\n", "\n", "\n", "\n", "8\n", "<company>\n", "\n", "\n", "\n", "1->8\n", "\n", "\n", "\n", "\n", "\n", "10\n", ", (44)\n", "\n", "\n", "\n", "1->10\n", "\n", "\n", "\n", "\n", "\n", "11\n", "<model>\n", "\n", "\n", "\n", "1->11\n", "\n", "\n", "\n", "\n", "\n", "3\n", "1997\n", "\n", "\n", "\n", "2->3\n", "\n", "\n", "\n", "\n", "\n", "6\n", "van\n", "\n", "\n", "\n", "5->6\n", "\n", "\n", "\n", "\n", "\n", "9\n", "Ford\n", "\n", "\n", "\n", "8->9\n", "\n", "\n", "\n", "\n", "\n", "12\n", "E350\n", "\n", "\n", "\n", "11->12\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display_tree(TreeMiner(tracer.my_input, assignments).tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Combining all the pieces:" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:58.359814Z", "iopub.status.busy": "2024-01-18T17:19:58.359663Z", "iopub.status.idle": "2024-01-18T17:19:58.367669Z", "shell.execute_reply": 
"2024-01-18T17:19:58.367340Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1997,van,Ford,E350\n", "vehicle = '1997,van,Ford,E350'\n", "year = '1997'\n", "kind = 'van'\n", "company = 'Ford'\n", "model = 'E350'\n", "\n", "2000,car,Mercury,Cougar\n", "vehicle = '2000,car,Mercury,Cougar'\n", "year = '2000'\n", "kind = 'car'\n", "company = 'Mercury'\n", "model = 'Cougar'\n", "\n", "1999,car,Chevy,Venture\n", "vehicle = '1999,car,Chevy,Venture'\n", "year = '1999'\n", "kind = 'car'\n", "company = 'Chevy'\n", "model = 'Venture'\n", "\n" ] } ], "source": [ "trees = []\n", "for vehicle in VEHICLES:\n", "    print(vehicle)\n", "    with Tracer(vehicle) as tracer:\n", "        process_vehicle(tracer.my_input)\n", "    assignments = DefineTracker(tracer.my_input, tracer.trace).assignments()\n", "    trees.append((tracer.my_input, assignments))\n", "    for var, val in assignments:\n", "        print(var + \" = \" + repr(val))\n", "    print()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The corresponding derivation trees are below." 
] }, { "cell_type": "code", "execution_count": 72, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:58.369516Z", "iopub.status.busy": "2024-01-18T17:19:58.369276Z", "iopub.status.idle": "2024-01-18T17:19:58.372150Z", "shell.execute_reply": "2024-01-18T17:19:58.371926Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1997,van,Ford,E350\n", "2000,car,Mercury,Cougar\n", "1999,car,Chevy,Venture\n" ] } ], "source": [ "csv_dt = []\n", "for inputstr, assignments in trees:\n", "    print(inputstr)\n", "    dt = TreeMiner(inputstr, assignments)\n", "    csv_dt.append(dt)\n", "    display_tree(dt.tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Recovering Grammars from Derivation Trees\n", "\n", "We define a class `GrammarMiner` that combines multiple derivation trees into a grammar. Its initial grammar is empty." ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:58.373740Z", "iopub.status.busy": "2024-01-18T17:19:58.373608Z", "iopub.status.idle": "2024-01-18T17:19:58.375407Z", "shell.execute_reply": "2024-01-18T17:19:58.375121Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class GrammarMiner:\n", "    def __init__(self):\n", "        self.grammar = {}" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The `tree_to_grammar()` method converts a derivation tree to a grammar by visiting one node at a time and adding it to the grammar. The node name becomes the key, and its list of children becomes one alternative for that key." 
] }, { "cell_type": "code", "execution_count": 74, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:58.377024Z", "iopub.status.busy": "2024-01-18T17:19:58.376896Z", "iopub.status.idle": "2024-01-18T17:19:58.379838Z", "shell.execute_reply": "2024-01-18T17:19:58.379474Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class GrammarMiner(GrammarMiner):\n", "    def tree_to_grammar(self, tree):\n", "        node, children = tree\n", "        one_alt = [ck for ck, gc in children]\n", "        hsh = {node: [one_alt] if one_alt else []}\n", "        for child in children:\n", "            if not is_nonterminal(child[0]):\n", "                continue\n", "            chsh = self.tree_to_grammar(child)\n", "            for k in chsh:\n", "                if k not in hsh:\n", "                    hsh[k] = chsh[k]\n", "                else:\n", "                    hsh[k].extend(chsh[k])\n", "        return hsh" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:58.381919Z", "iopub.status.busy": "2024-01-18T17:19:58.381737Z", "iopub.status.idle": "2024-01-18T17:19:58.384699Z", "shell.execute_reply": "2024-01-18T17:19:58.384364Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "{'<start>': [['<vehicle>']],\n", " '<vehicle>': [['<year>', ',', '<kind>', ',', '<company>', ',', '<model>']],\n", " '<year>': [['1997']],\n", " '<kind>': [['van']],\n", " '<company>': [['Ford']],\n", " '<model>': [['E350']]}" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gm = GrammarMiner()\n", "gm.tree_to_grammar(csv_dt[0].tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The grammar generated here is in _canonical_ form: each alternative is a list of tokens rather than a single string. 
We define a function `readable()` that takes a canonical grammar and returns it in readable form." 
] }, { "cell_type": "code", "execution_count": 76, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:58.386567Z", "iopub.status.busy": "2024-01-18T17:19:58.386432Z", "iopub.status.idle": "2024-01-18T17:19:58.388532Z", "shell.execute_reply": "2024-01-18T17:19:58.388258Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "def readable(grammar):\n", "    def readable_rule(rule):\n", "        return ''.join(rule)\n", "\n", "    return {k: list(set(readable_rule(a) for a in grammar[k]))\n", "            for k in grammar}" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:58.389987Z", "iopub.status.busy": "2024-01-18T17:19:58.389888Z", "iopub.status.idle": "2024-01-18T17:19:58.400955Z", "shell.execute_reply": "2024-01-18T17:19:58.400691Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "start\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "vehicle" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "vehicle\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "year\n", ",\n", "kind\n", ",\n", "company\n", ",\n", "model" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "year\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "1997" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "kind\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "van" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "company\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "Ford" ], 
"text/plain": [ 
"" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "model\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "E350" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "syntax_diagram(readable(gm.tree_to_grammar(csv_dt[0].tree)))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The `add_tree()` method collects the nonterminals from both the current grammar and the tree being added, and extends the definition of each nonterminal with the new alternatives." ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:58.402661Z", "iopub.status.busy": "2024-01-18T17:19:58.402507Z", "iopub.status.idle": "2024-01-18T17:19:58.404217Z", "shell.execute_reply": "2024-01-18T17:19:58.403948Z" }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "import itertools" ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:58.405650Z", "iopub.status.busy": "2024-01-18T17:19:58.405542Z", "iopub.status.idle": "2024-01-18T17:19:58.407777Z", "shell.execute_reply": "2024-01-18T17:19:58.407511Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class GrammarMiner(GrammarMiner):\n", "    def add_tree(self, t):\n", "        t_grammar = self.tree_to_grammar(t.tree)\n", "        self.grammar = {\n", "            key: self.grammar.get(key, []) + t_grammar.get(key, [])\n", "            for key in itertools.chain(self.grammar.keys(), t_grammar.keys())\n", "        }" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The `add_tree()` method is used as follows:" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:58.409505Z", 
"iopub.status.busy": "2024-01-18T17:19:58.409382Z", 
"iopub.status.idle": "2024-01-18T17:19:58.411301Z", "shell.execute_reply": "2024-01-18T17:19:58.410980Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "inventory_grammar_miner = GrammarMiner()\n", "for dt in csv_dt:\n", "    inventory_grammar_miner.add_tree(dt)" ] }, { "cell_type": "code", "execution_count": 81, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:58.413262Z", "iopub.status.busy": "2024-01-18T17:19:58.413133Z", "iopub.status.idle": "2024-01-18T17:19:58.421756Z", "shell.execute_reply": "2024-01-18T17:19:58.421511Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "start\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "vehicle" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "vehicle\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "year\n", ",\n", "kind\n", ",\n", "company\n", ",\n", "model" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "year\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "2000\n", "\n", "1997\n", "\n", "1999" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "kind\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "car\n", "\n", "van" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "company\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "Mercury\n", "\n", "Chevy\n", "\n", "Ford" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "model\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", 
"\n", "\n", "\n", "\n", "E350\n", "\n", "Cougar\n", "\n", "Venture" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "syntax_diagram(readable(inventory_grammar_miner.grammar))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Given execution traces from various inputs, we define `update_grammar()` to obtain the complete grammar from the traces." ] }, { "cell_type": "code", "execution_count": 82, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:58.423498Z", "iopub.status.busy": "2024-01-18T17:19:58.423256Z", "iopub.status.idle": "2024-01-18T17:19:58.425929Z", "shell.execute_reply": "2024-01-18T17:19:58.425619Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class GrammarMiner(GrammarMiner):\n", "    def update_grammar(self, inputstr, trace):\n", "        at = self.create_tracker(inputstr, trace)\n", "        dt = self.create_tree_miner(inputstr, at.assignments())\n", "        self.add_tree(dt)\n", "        return self.grammar\n", "\n", "    def create_tracker(self, *args):\n", "        return DefineTracker(*args)\n", "\n", "    def create_tree_miner(self, *args):\n", "        return TreeMiner(*args)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The complete grammar recovery is implemented in `recover_grammar()`." 
] }, { "cell_type": "code", "execution_count": 83, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:58.427487Z", "iopub.status.busy": "2024-01-18T17:19:58.427366Z", "iopub.status.idle": "2024-01-18T17:19:58.429935Z", "shell.execute_reply": "2024-01-18T17:19:58.429639Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "def recover_grammar(fn: Callable, inputs: Iterable[str],\n", "                    **kwargs: Any) -> Grammar:\n", "    miner = GrammarMiner()\n", "\n", "    for inputstr in inputs:\n", "        with Tracer(inputstr, **kwargs) as tracer:\n", "            fn(tracer.my_input)\n", "        miner.update_grammar(tracer.my_input, tracer.trace)\n", "\n", "    return readable(miner.grammar)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Note that the grammar could have been retrieved directly from the tracker, without the intermediate derivation tree stage. However, going through the derivation tree allows one to inspect how the inputs are fragmented and to verify that the fragmentation is correct." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Example 1. 
Recovering the Inventory Grammar" ] }, { "cell_type": "code", "execution_count": 84, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:58.431570Z", "iopub.status.busy": "2024-01-18T17:19:58.431467Z", "iopub.status.idle": "2024-01-18T17:19:58.435496Z", "shell.execute_reply": "2024-01-18T17:19:58.435197Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "inventory_grammar = recover_grammar(process_vehicle, VEHICLES)" ] }, { "cell_type": "code", "execution_count": 85, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:58.436921Z", "iopub.status.busy": "2024-01-18T17:19:58.436837Z", "iopub.status.idle": "2024-01-18T17:19:58.439032Z", "shell.execute_reply": "2024-01-18T17:19:58.438778Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "{'<start>': ['<vehicle>'],\n", " '<vehicle>': ['<year>,<kind>,<company>,<model>'],\n", " '<year>': ['2000', '1997', '1999'],\n", " '<kind>': ['car', 'van'],\n", " '<company>': ['Mercury', 'Chevy', 'Ford'],\n", " '<model>': ['E350', 'Cougar', 'Venture']}" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inventory_grammar" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Example 2. Recovering the URL Grammar" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Our algorithm is robust enough to recover grammars from real-world programs. For example, the `urlparse` function in the Python `urllib` module accepts the following sample URLs." 
] }, { "cell_type": "code", "execution_count": 86, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:58.440492Z", "iopub.status.busy": "2024-01-18T17:19:58.440415Z", "iopub.status.idle": "2024-01-18T17:19:58.442141Z", "shell.execute_reply": "2024-01-18T17:19:58.441876Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "URLS = [\n", " 'http://user:pass@www.google.com:80/?q=path#ref',\n", " 'https://www.cispa.saarland:80/',\n", " 'http://www.fuzzingbook.org/#News',\n", "]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The `urllib` module caches its intermediate results for faster access. Since a cache hit would skip the parsing steps we want to observe, we clear the cache using `clear_cache()` for every invocation." ] }, { "cell_type": "code", "execution_count": 87, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:58.443803Z", "iopub.status.busy": "2024-01-18T17:19:58.443698Z", "iopub.status.idle": "2024-01-18T17:19:58.445659Z", "shell.execute_reply": "2024-01-18T17:19:58.445195Z" }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from urllib.parse import urlparse, clear_cache # type: ignore" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We use the sample URLs to recover the grammar as follows. To keep `urlparse` from returning cached results, we define a new function `url_parse()` that clears the cache before each call." 
] }, { "cell_type": "code", "execution_count": 88, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:58.447293Z", "iopub.status.busy": "2024-01-18T17:19:58.447203Z", "iopub.status.idle": "2024-01-18T17:19:58.449050Z", "shell.execute_reply": "2024-01-18T17:19:58.448762Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "def url_parse(url):\n", "    clear_cache()\n", "    urlparse(url)" ] }, { "cell_type": "code", "execution_count": 89, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:58.450416Z", "iopub.status.busy": "2024-01-18T17:19:58.450309Z", "iopub.status.idle": "2024-01-18T17:19:58.571121Z", "shell.execute_reply": "2024-01-18T17:19:58.570852Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "http://user:pass@www.google.com:80/?q=path#ref\n", "url = 'http://user:pass@www.google.com:80/?q=path#ref'\n", "scheme = 'http'\n", "netloc = 'user:pass@www.google.com:80'\n", "fragment = 'ref'\n", "query = 'q=path'\n", "\n", "https://www.cispa.saarland:80/\n", "url = 'https://www.cispa.saarland:80/'\n", "scheme = 'https'\n", "netloc = 'www.cispa.saarland:80'\n", "\n", "http://www.fuzzingbook.org/#News\n", "url = 'http://www.fuzzingbook.org/#News'\n", "scheme = 'http'\n", "netloc = 'www.fuzzingbook.org'\n", "fragment = 'News'\n", "\n", "http://user:pass@www.google.com:80/?q=path#ref\n", "https://www.cispa.saarland:80/\n", "http://www.fuzzingbook.org/#News\n" ] } ], "source": [ "trees = []\n", "for url in URLS:\n", "    print(url)\n", "    with Tracer(url) as tracer:\n", "        url_parse(tracer.my_input)\n", "    assignments = DefineTracker(tracer.my_input, tracer.trace).assignments()\n", "    trees.append((tracer.my_input, assignments))\n", "    for var, val in assignments:\n", "        print(var + \" = \" + repr(val))\n", "    print()\n", "\n", "\n", "url_dt = []\n", "for inputstr, assignments in trees:\n", "    print(inputstr)\n", "    dt = 
TreeMiner(inputstr, assignments)\n", 
"    url_dt.append(dt)\n", "    display_tree(dt.tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Let us use `url_parse()` to recover the grammar:" ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:58.572786Z", "iopub.status.busy": "2024-01-18T17:19:58.572672Z", "iopub.status.idle": "2024-01-18T17:19:58.680929Z", "shell.execute_reply": "2024-01-18T17:19:58.680621Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "url_grammar = recover_grammar(url_parse, URLS, files=['urllib/parse.py'])" ] }, { "cell_type": "code", "execution_count": 91, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:58.682698Z", "iopub.status.busy": "2024-01-18T17:19:58.682602Z", "iopub.status.idle": "2024-01-18T17:19:58.691799Z", "shell.execute_reply": "2024-01-18T17:19:58.691473Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "start\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "url" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "url\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "scheme\n", "://\n", "netloc\n", "/?\n", "query\n", "#\n", "fragment\n", "\n", "scheme\n", "://\n", "netloc\n", "/#\n", "fragment\n", "\n", "scheme\n", "://\n", "netloc\n", "/" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "scheme\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "https\n", "\n", "http" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "netloc\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", 
"user:pass@www.google.com:80\n", "\n", "www.fuzzingbook.org\n", "\n", "www.cispa.saarland:80" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "query\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "q=path" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "fragment\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "ref\n", "\n", "News" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "syntax_diagram(url_grammar)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The recovered grammar describes the URL format reasonably well." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Fuzzing" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We can now use our recovered grammar for fuzzing as follows.\n", "\n", "First, the inventory grammar." 
] }, { "cell_type": "code", "execution_count": 92, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:58.693698Z", "iopub.status.busy": "2024-01-18T17:19:58.693484Z", "iopub.status.idle": "2024-01-18T17:19:58.696580Z", "shell.execute_reply": "2024-01-18T17:19:58.696313Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1999,car,Mercury,E350\n", "1997,car,Chevy,Cougar\n", "1999,van,Mercury,Venture\n", "2000,car,Ford,Venture\n", "1997,car,Mercury,E350\n", "1997,car,Mercury,Cougar\n", "1999,car,Chevy,E350\n", "1999,car,Chevy,E350\n", "1999,car,Mercury,Cougar\n", "2000,car,Chevy,E350\n" ] } ], "source": [ "f = GrammarFuzzer(inventory_grammar)\n", "for _ in range(10):\n", " print(f.fuzz())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Next, the URL grammar." ] }, { "cell_type": "code", "execution_count": 93, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:58.698282Z", "iopub.status.busy": "2024-01-18T17:19:58.698193Z", "iopub.status.idle": "2024-01-18T17:19:58.700866Z", "shell.execute_reply": "2024-01-18T17:19:58.700634Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "http://user:pass@www.google.com:80/#News\n", "http://user:pass@www.google.com:80/\n", "http://www.cispa.saarland:80/\n", "https://www.fuzzingbook.org/\n", "https://www.cispa.saarland:80/?q=path#ref\n", "http://www.cispa.saarland:80/\n", "http://www.cispa.saarland:80/?q=path#News\n", "https://user:pass@www.google.com:80/#ref\n", "https://user:pass@www.google.com:80/\n", "https://www.cispa.saarland:80/#News\n" ] } ], "source": [ "f = GrammarFuzzer(url_grammar)\n", "for _ in range(10):\n", " print(f.fuzz())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "What this means is that we can now take a program and a few 
samples, extract its grammar, and then use this very grammar for fuzzing. Now that's quite an opportunity!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Problems with the Simple Miner" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "One of the problems with our simple grammar miner is the assumption that the values assigned to variables are stable. Unfortunately, that may not hold true in all cases. For example, here is a URL with a slightly different format: it has a path (`/releases/5.8`) after the host." ] }, { "cell_type": "code", "execution_count": 94, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:58.702569Z", "iopub.status.busy": "2024-01-18T17:19:58.702490Z", "iopub.status.idle": "2024-01-18T17:19:58.704073Z", "shell.execute_reply": "2024-01-18T17:19:58.703829Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "URLS_X = URLS + ['ftp://freebsd.org/releases/5.8']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The grammar generated from this set of samples is not as nice as what we got earlier." ] }, { "cell_type": "code", "execution_count": 95, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:58.705512Z", "iopub.status.busy": "2024-01-18T17:19:58.705433Z", "iopub.status.idle": "2024-01-18T17:19:58.842925Z", "shell.execute_reply": "2024-01-18T17:19:58.842579Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "url_grammar = recover_grammar(url_parse, URLS_X, files=['urllib/parse.py'])" ] }, { "cell_type": "code", "execution_count": 96, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:58.845059Z", "iopub.status.busy": "2024-01-18T17:19:58.844906Z", "iopub.status.idle": "2024-01-18T17:19:58.856361Z", "shell.execute_reply": "2024-01-18T17:19:58.856067Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": 
"stdout", "output_type": "stream", "text": [ "start\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "scheme\n", "://\n", "netloc\n", "url\n", "\n", "url" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "url\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "scheme\n", "://\n", "netloc\n", "/?\n", "query\n", "#\n", "fragment\n", "\n", "/releases/5.8\n", "\n", "scheme\n", "://\n", "netloc\n", "/#\n", "fragment\n", "\n", "scheme\n", "://\n", "netloc\n", "/" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "scheme\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "https\n", "\n", "http\n", "\n", "ftp" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "netloc\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "www.fuzzingbook.org\n", "\n", "user:pass@www.google.com:80\n", "\n", "www.cispa.saarland:80\n", "\n", "freebsd.org" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "query\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "q=path" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "fragment\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "ref\n", "\n", "News" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "syntax_diagram(url_grammar)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Clearly, something has gone wrong.\n", "\n", "To investigate why the `url` definition has gone wrong, let us inspect the trace for the 
URL." ] }, { "cell_type": "code", "execution_count": 97, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:58.858126Z", "iopub.status.busy": "2024-01-18T17:19:58.858044Z", "iopub.status.idle": "2024-01-18T17:19:58.886184Z", "shell.execute_reply": "2024-01-18T17:19:58.885767Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 372 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': ''},)\n", "1 392 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': ''},)\n", "5 124 ({'arg': ''},)\n", "6 121 ({'arg': ''},)\n", "7 126 ({'arg': ''},)\n", "8 127 ({'arg': ''},)\n", "10 393 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': ''},)\n", "11 437 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': ''},)\n", "12 458 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': ''},)\n", "16 124 ({'arg': ''},)\n", "17 121 ({'arg': ''},)\n", "18 126 ({'arg': ''},)\n", "19 127 ({'arg': ''},)\n", "21 460 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': ''},)\n", "22 461 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\t'},)\n", "23 462 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\t'},)\n", "24 460 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\t'},)\n", "25 461 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\r'},)\n", "26 462 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\r'},)\n", "27 460 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\r'},)\n", "28 461 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n'},)\n", "29 462 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n'},)\n", "30 460 ({'url': 
'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n'},)\n", "31 464 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n'},)\n", "32 465 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n'},)\n", "33 466 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n'},)\n", "34 467 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n'},)\n", "35 469 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n'},)\n", "36 471 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n'},)\n", "37 472 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n', 'netloc': '', 'query': '', 'fragment': ''},)\n", "38 473 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n', 'netloc': '', 'query': '', 'fragment': ''},)\n", "39 474 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n', 'netloc': '', 'query': '', 'fragment': ''},)\n", "40 475 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 'h'},)\n", "41 474 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 'h'},)\n", "42 475 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 't'},)\n", "43 474 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 't'},)\n", "44 475 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 't'},)\n", "45 474 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n', 'netloc': '', 'query': '', 'fragment': '', 
'c': 't'},)\n", "46 475 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 'p'},)\n", "47 474 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 'p'},)\n", "48 478 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 'p'},)\n", "49 480 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'scheme': 'http', 'b': '\\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 'p'},)\n", "50 481 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'scheme': 'http', 'b': '\\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 'p'},)\n", "51 411 ({'url': '//user:pass@www.google.com:80/?q=path#ref'},)\n", "52 412 ({'url': '//user:pass@www.google.com:80/?q=path#ref'},)\n", "53 413 ({'url': '//user:pass@www.google.com:80/?q=path#ref'},)\n", "54 414 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '/'},)\n", "55 415 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '/'},)\n", "56 416 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '/'},)\n", "57 413 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '/'},)\n", "58 414 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '?'},)\n", "59 415 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '?'},)\n", "60 416 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '?'},)\n", "61 413 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '?'},)\n", "62 414 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '#'},)\n", "63 415 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '#'},)\n", "64 416 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '#'},)\n", "65 413 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '#'},)\n", "66 417 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '#'},)\n", "68 482 
({'url': '/?q=path#ref', 'scheme': 'http', 'b': '\\n', 'netloc': 'user:pass@www.google.com:80', 'query': '', 'fragment': '', 'c': 'p'},)\n", "69 483 ({'url': '/?q=path#ref', 'scheme': 'http', 'b': '\\n', 'netloc': 'user:pass@www.google.com:80', 'query': '', 'fragment': '', 'c': 'p'},)\n", "70 482 ({'url': '/?q=path#ref', 'scheme': 'http', 'b': '\\n', 'netloc': 'user:pass@www.google.com:80', 'query': '', 'fragment': '', 'c': 'p'},)\n", "71 485 ({'url': '/?q=path#ref', 'scheme': 'http', 'b': '\\n', 'netloc': 'user:pass@www.google.com:80', 'query': '', 'fragment': '', 'c': 'p'},)\n", "72 486 ({'url': '/?q=path#ref', 'scheme': 'http', 'b': '\\n', 'netloc': 'user:pass@www.google.com:80', 'query': '', 'fragment': '', 'c': 'p'},)\n", "73 487 ({'url': '/?q=path', 'scheme': 'http', 'b': '\\n', 'netloc': 'user:pass@www.google.com:80', 'query': '', 'fragment': 'ref', 'c': 'p'},)\n", "74 488 ({'url': '/?q=path', 'scheme': 'http', 'b': '\\n', 'netloc': 'user:pass@www.google.com:80', 'query': '', 'fragment': 'ref', 'c': 'p'},)\n", "75 489 ({'url': '/', 'scheme': 'http', 'b': '\\n', 'netloc': 'user:pass@www.google.com:80', 'query': 'q=path', 'fragment': 'ref', 'c': 'p'},)\n", "76 419 ({'netloc': 'user:pass@www.google.com:80'},)\n", "77 420 ({'netloc': 'user:pass@www.google.com:80'},)\n", "78 421 ({'netloc': 'user:pass@www.google.com:80'},)\n", "80 490 ({'url': '/', 'scheme': 'http', 'b': '\\n', 'netloc': 'user:pass@www.google.com:80', 'query': 'q=path', 'fragment': 'ref', 'c': 'p'},)\n", "84 491 ({'url': '/', 'scheme': 'http', 'b': '\\n', 'netloc': 'user:pass@www.google.com:80', 'query': 'q=path', 'fragment': 'ref', 'c': 'p'},)\n", "85 492 ({'url': '/', 'scheme': 'http', 'b': '\\n', 'netloc': 'user:pass@www.google.com:80', 'query': 'q=path', 'fragment': 'ref', 'c': 'p'},)\n", "90 394 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': ''},)\n", "91 395 ({'url': '/', 'scheme': 'http', 'netloc': 'user:pass@www.google.com:80', 'query': 'q=path', 'fragment': 
'ref'},)\n", "92 398 ({'url': '/', 'scheme': 'http', 'netloc': 'user:pass@www.google.com:80', 'query': 'q=path', 'fragment': 'ref'},)\n", "93 399 ({'url': '/', 'scheme': 'http', 'netloc': 'user:pass@www.google.com:80', 'query': 'q=path', 'fragment': 'ref', 'params': ''},)\n", "97 400 ({'url': '/', 'scheme': 'http', 'netloc': 'user:pass@www.google.com:80', 'query': 'q=path', 'fragment': 'ref', 'params': ''},)\n" ] } ], "source": [ "clear_cache()\n", "with Tracer(URLS_X[0]) as tracer:\n", " urlparse(tracer.my_input)\n", "for i, t in enumerate(tracer.trace):\n", " if t[0] in {'call', 'line'} and 'parse.py' in str(t[2]) and t[3]:\n", " print(i, t[2]._t()[1], t[3:])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Notice how the value of `url` changes as the parsing progresses? This violates our assumption that the value assigned to a variable is stable. We next look at how this limitation can be removed." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Grammar Miner with Reassignment" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "One way to uniquely identify different variables is to annotate them with *line numbers* both when they are defined and also when their value changes. 
Consider the code fragment below." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Tracking variable assignment locations" ] }, { "cell_type": "code", "execution_count": 98, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:58.888121Z", "iopub.status.busy": "2024-01-18T17:19:58.888031Z", "iopub.status.idle": "2024-01-18T17:19:58.889792Z", "shell.execute_reply": "2024-01-18T17:19:58.889422Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def C(cp_1):\n", "    c_2 = cp_1 + '@2'\n", "    c_3 = c_2 + '@3'\n", "    return c_3" ] }, { "cell_type": "code", "execution_count": 99, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:58.891450Z", "iopub.status.busy": "2024-01-18T17:19:58.891362Z", "iopub.status.idle": "2024-01-18T17:19:58.893276Z", "shell.execute_reply": "2024-01-18T17:19:58.892870Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def B(bp_7):\n", "    b_8 = bp_7 + '@8'\n", "    return C(b_8)" ] }, { "cell_type": "code", "execution_count": 100, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:58.895100Z", "iopub.status.busy": "2024-01-18T17:19:58.895015Z", "iopub.status.idle": "2024-01-18T17:19:58.897070Z", "shell.execute_reply": "2024-01-18T17:19:58.896671Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def A(ap_12):\n", "    a_13 = ap_12 + '@13'\n", "    a_14 = B(a_13) + '@14'\n", "    a_14 = a_14 + '@15'\n", "    a_13 = a_14 + '@16'\n", "    a_14 = B(a_13) + '@17'\n", "    a_14 = B(a_13) + '@18'" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Notice that each variable name carries the line number where the variable is first defined, and that each assignment appends its own line number to the value, so that every change is visible. The line numbers treat the three functions as one contiguous listing, with `C()` starting at line 1.\n", "\n", "Let us run this under the tracer." 
] }, { "cell_type": "code", "execution_count": 101, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:58.898636Z", "iopub.status.busy": "2024-01-18T17:19:58.898562Z", "iopub.status.idle": "2024-01-18T17:19:59.004165Z", "shell.execute_reply": "2024-01-18T17:19:59.003805Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "call 1:A {'ap_12': '____'}\n", "line 2:A {'ap_12': '____'}\n", "line 3:A {'ap_12': '____', 'a_13': '____@13'}\n", "call 1:B {'bp_7': '____@13'}\n", "line 2:B {'bp_7': '____@13'}\n", "line 3:B {'bp_7': '____@13', 'b_8': '____@13@8'}\n", "call 1:C {'cp_1': '____@13@8'}\n", "line 2:C {'cp_1': '____@13@8'}\n", "line 3:C {'cp_1': '____@13@8', 'c_2': '____@13@8@2'}\n", "line 4:C {'cp_1': '____@13@8', 'c_2': '____@13@8@2', 'c_3': '____@13@8@2@3'}\n", "return 4:C {'cp_1': '____@13@8', 'c_2': '____@13@8@2', 'c_3': '____@13@8@2@3'}\n", "return 3:B {'bp_7': '____@13', 'b_8': '____@13@8'}\n", "line 4:A {'ap_12': '____', 'a_13': '____@13', 'a_14': '____@13@8@2@3@14'}\n", "line 5:A {'ap_12': '____', 'a_13': '____@13', 'a_14': '____@13@8@2@3@14@15'}\n", "line 6:A {'ap_12': '____', 'a_13': '____@13@8@2@3@14@15@16', 'a_14': '____@13@8@2@3@14@15'}\n", "call 1:B {'bp_7': '____@13@8@2@3@14@15@16'}\n", "line 2:B {'bp_7': '____@13@8@2@3@14@15@16'}\n", "line 3:B {'bp_7': '____@13@8@2@3@14@15@16', 'b_8': '____@13@8@2@3@14@15@16@8'}\n", "call 1:C {'cp_1': '____@13@8@2@3@14@15@16@8'}\n", "line 2:C {'cp_1': '____@13@8@2@3@14@15@16@8'}\n", "line 3:C {'cp_1': '____@13@8@2@3@14@15@16@8', 'c_2': '____@13@8@2@3@14@15@16@8@2'}\n", "line 4:C {'cp_1': '____@13@8@2@3@14@15@16@8', 'c_2': '____@13@8@2@3@14@15@16@8@2', 'c_3': '____@13@8@2@3@14@15@16@8@2@3'}\n", "return 4:C {'cp_1': '____@13@8@2@3@14@15@16@8', 'c_2': '____@13@8@2@3@14@15@16@8@2', 'c_3': '____@13@8@2@3@14@15@16@8@2@3'}\n", "return 3:B {'bp_7': '____@13@8@2@3@14@15@16', 'b_8': '____@13@8@2@3@14@15@16@8'}\n", "line 7:A {'ap_12': '____', 
'a_13': '____@13@8@2@3@14@15@16', 'a_14': '____@13@8@2@3@14@15@16@8@2@3@17'}\n", "call 1:B {'bp_7': '____@13@8@2@3@14@15@16'}\n", "line 2:B {'bp_7': '____@13@8@2@3@14@15@16'}\n", "line 3:B {'bp_7': '____@13@8@2@3@14@15@16', 'b_8': '____@13@8@2@3@14@15@16@8'}\n", "call 1:C {'cp_1': '____@13@8@2@3@14@15@16@8'}\n", "line 2:C {'cp_1': '____@13@8@2@3@14@15@16@8'}\n", "line 3:C {'cp_1': '____@13@8@2@3@14@15@16@8', 'c_2': '____@13@8@2@3@14@15@16@8@2'}\n", "line 4:C {'cp_1': '____@13@8@2@3@14@15@16@8', 'c_2': '____@13@8@2@3@14@15@16@8@2', 'c_3': '____@13@8@2@3@14@15@16@8@2@3'}\n", "return 4:C {'cp_1': '____@13@8@2@3@14@15@16@8', 'c_2': '____@13@8@2@3@14@15@16@8@2', 'c_3': '____@13@8@2@3@14@15@16@8@2@3'}\n", "return 3:B {'bp_7': '____@13@8@2@3@14@15@16', 'b_8': '____@13@8@2@3@14@15@16@8'}\n", "return 7:A {'ap_12': '____', 'a_13': '____@13@8@2@3@14@15@16', 'a_14': '____@13@8@2@3@14@15@16@8@2@3@18'}\n", "call 102:__exit__ {}\n", "line 105:__exit__ {}\n" ] } ], "source": [ "with Tracer('____') as tracer:\n", " A(tracer.my_input)\n", "\n", "for t in tracer.trace:\n", " print(t[0], \"%d:%s\" % (t[2].line_no, t[2].method), t[3])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Each variable was referenced first as follows:\n", "\n", "* `cp_1` -- *call* `1:C`\n", "* `c_2` -- *line* `3:C` (but the previous event was *line* `2:C`)\n", "* `c_3` -- *line* `4:C` (but the previous event was *line* `3:C`)\n", "* `bp_7` -- *call* `7:B`\n", "* `b_8` -- *line* `9:B` (but the previous event was *line* `8:B`)\n", "* `ap_12` -- *call* `12:A`\n", "* `a_13` -- *line* `14:A` (but the previous event was *line* `13:A`)\n", "* `a_14` -- *line* `15:A` (the previous event was *return* `9:B`. 
However, the previous event in `A()` was *line* `14:A`)\n", "* reassign `a_14` at *15* -- *line* `16:A` (the previous event was *line* `15:A`)\n", "* reassign `a_13` at *16* -- *line* `17:A` (the previous event was *line* `16:A`)\n", "* reassign `a_14` at *17* -- *line* `18:A` (the previous event in `A()` was *line* `17:A`)\n", "* reassign `a_14` at *18* -- *return* `18:A` (the previous event in `A()` was *line* `18:A`)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Our observations are as follows. For a *call* event, the current location is the right one for any new variables being defined. For any other event, if a variable is referenced for the first time (or reassigned a new value), then the right location to consider is the previous location *in the same method invocation*. Next, let us see how we can incorporate this information into variable naming." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "To do so, we first need a way to track the individual method calls as they are being made. For this we define the class `CallStack`. Each method invocation gets a separate identifier, and when the method call is over, the identifier reverts to that of the caller." 
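, "\n",
"\n",
"As a rough illustration of the location rule above, independent of the classes defined in this chapter, the following sketch uses Python's `sys.settrace()` to report where each local variable received a new value. The helper `where_assigned()` and the sample function `demo()` are assumptions made for this example only; locations are given as line offsets from the `def` line of the traced function.\n",
"\n",
"```python\n",
"import sys\n",
"\n",
"\n",
"def where_assigned(fn, *args):\n",
"    # Hypothetical helper: list (variable, line offset) pairs for fn(*args).\n",
"    found = []\n",
"    previous = {}  # frame -> (line of the previous event, locals seen then)\n",
"\n",
"    def tracer(frame, event, arg):\n",
"        first = frame.f_code.co_firstlineno\n",
"        if event == 'call':\n",
"            # Parameters belong to the call location itself.\n",
"            found.extend((name, 0) for name in frame.f_locals)\n",
"        elif frame in previous:\n",
"            prev_line, prev_locals = previous[frame]\n",
"            for name, value in frame.f_locals.items():\n",
"                if name not in prev_locals or prev_locals[name] != value:\n",
"                    # The change happened at the previous location\n",
"                    # of this very invocation.\n",
"                    found.append((name, prev_line - first))\n",
"        previous[frame] = (frame.f_lineno, dict(frame.f_locals))\n",
"        return tracer\n",
"\n",
"    old_trace = sys.gettrace()\n",
"    sys.settrace(tracer)\n",
"    try:\n",
"        fn(*args)\n",
"    finally:\n",
"        sys.settrace(old_trace)\n",
"    return found\n",
"\n",
"\n",
"def demo(s):\n",
"    a = s + '!'\n",
"    a = a + '?'\n",
"    return a\n",
"```\n",
"\n",
"Calling `where_assigned(demo, 'x')` attributes `s` to the call itself and the two values of `a` to the lines that produced them, even though each change only becomes visible one event later. Keying the bookkeeping by frame is what keeps the *previous location* within the same method invocation."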
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### CallStack" ] }, { "cell_type": "code", "execution_count": 102, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.006198Z", "iopub.status.busy": "2024-01-18T17:19:59.006073Z", "iopub.status.idle": "2024-01-18T17:19:59.008910Z", "shell.execute_reply": "2024-01-18T17:19:59.008593Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class CallStack:\n", "    def __init__(self, **kwargs):\n", "        self.options(kwargs)\n", "        self.method_id = (START_SYMBOL, 0)\n", "        self.method_register = 0\n", "        self.mstack = [self.method_id]\n", "\n", "    def enter(self, method):\n", "        self.method_register += 1\n", "        self.method_id = (method, self.method_register)\n", "        self.log('call', \"%s%s\" % (self.indent(), str(self)))\n", "        self.mstack.append(self.method_id)\n", "\n", "    def leave(self):\n", "        self.mstack.pop()\n", "        self.log('return', \"%s%s\" % (self.indent(), str(self)))\n", "        self.method_id = self.mstack[-1]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "A few extra functions to make life simpler." 
] }, { "cell_type": "code", "execution_count": 103, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.010983Z", "iopub.status.busy": "2024-01-18T17:19:59.010667Z", "iopub.status.idle": "2024-01-18T17:19:59.014023Z", "shell.execute_reply": "2024-01-18T17:19:59.013489Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class CallStack(CallStack):\n", "    def options(self, kwargs):\n", "        self.log = log_event if kwargs.get('log') else lambda _evt, _var: None\n", "\n", "    def indent(self):\n", "        return len(self.mstack) * \"\\t\"\n", "\n", "    def at(self, n):\n", "        return self.mstack[n]\n", "\n", "    def __len__(self):\n", "        return len(self.mstack) - 1\n", "\n", "    def __str__(self):\n", "        return \"%s:%d\" % self.method_id\n", "\n", "    def __repr__(self):\n", "        return repr(self.method_id)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We also define a convenience function to display a given stack." ] }, { "cell_type": "code", "execution_count": 104, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.015903Z", "iopub.status.busy": "2024-01-18T17:19:59.015814Z", "iopub.status.idle": "2024-01-18T17:19:59.018205Z", "shell.execute_reply": "2024-01-18T17:19:59.017908Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def display_stack(istack):\n", "    def stack_to_tree(stack):\n", "        current, *rest = stack\n", "        if not rest:\n", "            return (repr(current), [])\n", "        return (repr(current), [stack_to_tree(rest)])\n", "    display_tree(stack_to_tree(istack.mstack), graph_attr=lr_graph)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Here is how we can use the `CallStack`." 
] }, { "cell_type": "code", "execution_count": 105, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.020128Z", "iopub.status.busy": "2024-01-18T17:19:59.019900Z", "iopub.status.idle": "2024-01-18T17:19:59.022689Z", "shell.execute_reply": "2024-01-18T17:19:59.022358Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "('', 0)" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cs = CallStack()\n", "display_stack(cs)\n", "cs" ] }, { "cell_type": "code", "execution_count": 106, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.024625Z", "iopub.status.busy": "2024-01-18T17:19:59.024468Z", "iopub.status.idle": "2024-01-18T17:19:59.027057Z", "shell.execute_reply": "2024-01-18T17:19:59.026699Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "('hello', 1)" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cs.enter('hello')\n", "display_stack(cs)\n", "cs" ] }, { "cell_type": "code", "execution_count": 107, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.029109Z", "iopub.status.busy": "2024-01-18T17:19:59.028912Z", "iopub.status.idle": "2024-01-18T17:19:59.031760Z", "shell.execute_reply": "2024-01-18T17:19:59.031451Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "('world', 2)" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cs.enter('world')\n", "display_stack(cs)\n", "cs" ] }, { "cell_type": "code", "execution_count": 108, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.033438Z", "iopub.status.busy": "2024-01-18T17:19:59.033307Z", "iopub.status.idle": "2024-01-18T17:19:59.035811Z", "shell.execute_reply": "2024-01-18T17:19:59.035468Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ 
"('hello', 1)" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cs.leave()\n", "display_stack(cs)\n", "cs" ] }, { "cell_type": "code", "execution_count": 109, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.037668Z", "iopub.status.busy": "2024-01-18T17:19:59.037525Z", "iopub.status.idle": "2024-01-18T17:19:59.040053Z", "shell.execute_reply": "2024-01-18T17:19:59.039707Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "('world', 3)" ] }, "execution_count": 109, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cs.enter('world')\n", "display_stack(cs)\n", "cs" ] }, { "cell_type": "code", "execution_count": 110, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.041766Z", "iopub.status.busy": "2024-01-18T17:19:59.041654Z", "iopub.status.idle": "2024-01-18T17:19:59.044026Z", "shell.execute_reply": "2024-01-18T17:19:59.043680Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "('hello', 1)" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cs.leave()\n", "display_stack(cs)\n", "cs" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "In order to account for variable reassignments, we need to have a more intelligent data structure than a dictionary for storing variables. We first define a simple interface `Vars`. It acts as a container for variables, and is instantiated at `my_assignments`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Vars" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The `Vars` stores references to variables as they occur during parsing in its internal dictionary `defs`. We initialize the dictionary with the original string." 
] }, { "cell_type": "code", "execution_count": 111, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.045922Z", "iopub.status.busy": "2024-01-18T17:19:59.045805Z", "iopub.status.idle": "2024-01-18T17:19:59.047812Z", "shell.execute_reply": "2024-01-18T17:19:59.047456Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class Vars:\n", "    def __init__(self, original):\n", "        self.defs = {}\n", "        self.my_input = original" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The dictionary needs two methods: `update()` that takes a set of key-value pairs to update itself, and `_set_kv()` that updates a particular key-value pair." ] }, { "cell_type": "code", "execution_count": 112, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.049564Z", "iopub.status.busy": "2024-01-18T17:19:59.049442Z", "iopub.status.idle": "2024-01-18T17:19:59.051636Z", "shell.execute_reply": "2024-01-18T17:19:59.051335Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class Vars(Vars):\n", "    def _set_kv(self, k, v):\n", "        self.defs[k] = v\n", "\n", "    def __setitem__(self, k, v):\n", "        self._set_kv(k, v)\n", "\n", "    def update(self, v):\n", "        for k, v in v.items():\n", "            self._set_kv(k, v)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The `Vars` is a proxy for the internal dictionary. For example, here is how one can use it." 
] }, { "cell_type": "code", "execution_count": 113, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.053792Z", "iopub.status.busy": "2024-01-18T17:19:59.053672Z", "iopub.status.idle": "2024-01-18T17:19:59.055951Z", "shell.execute_reply": "2024-01-18T17:19:59.055624Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "{}" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "v = Vars('')\n", "v.defs" ] }, { "cell_type": "code", "execution_count": 114, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.057629Z", "iopub.status.busy": "2024-01-18T17:19:59.057529Z", "iopub.status.idle": "2024-01-18T17:19:59.059718Z", "shell.execute_reply": "2024-01-18T17:19:59.059370Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "{'x': 'X'}" ] }, "execution_count": 114, "metadata": {}, "output_type": "execute_result" } ], "source": [ "v['x'] = 'X'\n", "v.defs" ] }, { "cell_type": "code", "execution_count": 115, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.061604Z", "iopub.status.busy": "2024-01-18T17:19:59.061457Z", "iopub.status.idle": "2024-01-18T17:19:59.063929Z", "shell.execute_reply": "2024-01-18T17:19:59.063610Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "{'x': 'x', 'y': 'y'}" ] }, "execution_count": 115, "metadata": {}, "output_type": "execute_result" } ], "source": [ "v.update({'x': 'x', 'y': 'y'})\n", "v.defs" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### AssignmentVars" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We now extend the simple `Vars` to account for variable reassignments. 
For this, we define `AssignmentVars`.\n", "\n", "The idea for detecting reassignments and renaming variables is as follows: We keep track of the previous reassignments to particular variables using `accessed_seq_var`. It contains the last rename of any particular variable as its corresponding value. The `new_vars` set holds all new variables that were added during the current event." ] }, { "cell_type": "code", "execution_count": 116, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.065781Z", "iopub.status.busy": "2024-01-18T17:19:59.065667Z", "iopub.status.idle": "2024-01-18T17:19:59.067923Z", "shell.execute_reply": "2024-01-18T17:19:59.067571Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class AssignmentVars(Vars):\n", "    def __init__(self, original):\n", "        super().__init__(original)\n", "        self.accessed_seq_var = {}\n", "        self.var_def_lines = {}\n", "        self.current_event = None\n", "        self.new_vars = set()\n", "        self.method_init()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The `method_init()` method takes care of keeping track of method invocations using records saved in the `call_stack`. `event_locations` is for keeping track of the locations accessed *within this method*. This is used for line number tracking of variable definitions."
] }, { "cell_type": "code", "execution_count": 117, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.070258Z", "iopub.status.busy": "2024-01-18T17:19:59.069903Z", "iopub.status.idle": "2024-01-18T17:19:59.072414Z", "shell.execute_reply": "2024-01-18T17:19:59.072059Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class AssignmentVars(AssignmentVars):\n", "    def method_init(self):\n", "        self.call_stack = CallStack()\n", "        self.event_locations = {self.call_stack.method_id: []}" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The `update()` method is now modified to track the changed line numbers, if any, using `var_location_register()`. We reset `new_vars` after use so that it is empty for the next event." ] }, { "cell_type": "code", "execution_count": 118, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.074006Z", "iopub.status.busy": "2024-01-18T17:19:59.073900Z", "iopub.status.idle": "2024-01-18T17:19:59.075789Z", "shell.execute_reply": "2024-01-18T17:19:59.075533Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class AssignmentVars(AssignmentVars):\n", "    def update(self, v):\n", "        for k, v in v.items():\n", "            self._set_kv(k, v)\n", "        self.var_location_register(self.new_vars)\n", "        self.new_vars = set()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The variable name now incorporates an index of how many reassignments it has gone through, effectively making each reassignment a unique variable."
] }, { "cell_type": "code", "execution_count": 119, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.077351Z", "iopub.status.busy": "2024-01-18T17:19:59.077219Z", "iopub.status.idle": "2024-01-18T17:19:59.079143Z", "shell.execute_reply": "2024-01-18T17:19:59.078834Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class AssignmentVars(AssignmentVars):\n", "    def var_name(self, var):\n", "        return (var, self.accessed_seq_var[var])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "When storing a variable, we first need to check whether it was previously known. If it was not, we need to initialize its rename count. This is accomplished by `var_access()`." ] }, { "cell_type": "code", "execution_count": 120, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.080816Z", "iopub.status.busy": "2024-01-18T17:19:59.080683Z", "iopub.status.idle": "2024-01-18T17:19:59.082639Z", "shell.execute_reply": "2024-01-18T17:19:59.082359Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class AssignmentVars(AssignmentVars):\n", "    def var_access(self, var):\n", "        if var not in self.accessed_seq_var:\n", "            self.accessed_seq_var[var] = 0\n", "        return self.var_name(var)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "During a variable reassignment, we update the `accessed_seq_var` to reflect the new count."
] }, { "cell_type": "code", "execution_count": 121, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.084238Z", "iopub.status.busy": "2024-01-18T17:19:59.084067Z", "iopub.status.idle": "2024-01-18T17:19:59.086170Z", "shell.execute_reply": "2024-01-18T17:19:59.085893Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class AssignmentVars(AssignmentVars):\n", "    def var_assign(self, var):\n", "        self.accessed_seq_var[var] += 1\n", "        self.new_vars.add(self.var_name(var))\n", "        return self.var_name(var)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "These methods can be used as follows:" ] }, { "cell_type": "code", "execution_count": 122, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.087676Z", "iopub.status.busy": "2024-01-18T17:19:59.087560Z", "iopub.status.idle": "2024-01-18T17:19:59.089556Z", "shell.execute_reply": "2024-01-18T17:19:59.089293Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "{}" ] }, "execution_count": 122, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sav = AssignmentVars('')\n", "sav.defs" ] }, { "cell_type": "code", "execution_count": 123, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.090958Z", "iopub.status.busy": "2024-01-18T17:19:59.090864Z", "iopub.status.idle": "2024-01-18T17:19:59.093004Z", "shell.execute_reply": "2024-01-18T17:19:59.092665Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "('v1', 0)" ] }, "execution_count": 123, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sav.var_access('v1')" ] }, { "cell_type": "code", "execution_count": 124, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.094865Z", "iopub.status.busy": "2024-01-18T17:19:59.094733Z", "iopub.status.idle": "2024-01-18T17:19:59.096994Z", "shell.execute_reply": 
"2024-01-18T17:19:59.096684Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "('v1', 1)" ] }, "execution_count": 124, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sav.var_assign('v1')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Assigning to it again increments the counter." ] }, { "cell_type": "code", "execution_count": 125, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.098825Z", "iopub.status.busy": "2024-01-18T17:19:59.098701Z", "iopub.status.idle": "2024-01-18T17:19:59.100624Z", "shell.execute_reply": "2024-01-18T17:19:59.100358Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "('v1', 2)" ] }, "execution_count": 125, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sav.var_assign('v1')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The core of the logic is in `_set_kv()`. When a variable is being assigned, we get the sequenced variable name `s_var`. If the sequenced variable name was previously unknown in `defs`, then we have no further concerns. We add the sequenced variable to `defs`.\n", "\n", "If the variable is previously known, then it is an indication of a possible reassignment. In this case, we look at the value the variable is holding. We check if the value changed. If it has not, then it is not.\n", "\n", "If the value has changed, it is a reassignment. We first increment the variable usage sequence using `var_assign`, retrieve the new name, update the new name in `defs`." 
] }, { "cell_type": "code", "execution_count": 126, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.102143Z", "iopub.status.busy": "2024-01-18T17:19:59.102044Z", "iopub.status.idle": "2024-01-18T17:19:59.104075Z", "shell.execute_reply": "2024-01-18T17:19:59.103761Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class AssignmentVars(AssignmentVars):\n", "    def _set_kv(self, var, val):\n", "        s_var = self.var_access(var)\n", "        if s_var in self.defs and self.defs[s_var] == val:\n", "            return\n", "        self.defs[self.var_assign(var)] = val" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Here is how it can be used. Assigning a variable the first time initializes its counter." ] }, { "cell_type": "code", "execution_count": 127, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.105627Z", "iopub.status.busy": "2024-01-18T17:19:59.105515Z", "iopub.status.idle": "2024-01-18T17:19:59.107525Z", "shell.execute_reply": "2024-01-18T17:19:59.107283Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "{('x', 1): 'X'}" ] }, "execution_count": 127, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sav = AssignmentVars('')\n", "sav['x'] = 'X'\n", "sav.defs" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "If the variable is assigned again with the same value, it is probably not a reassignment."
] }, { "cell_type": "code", "execution_count": 128, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.109051Z", "iopub.status.busy": "2024-01-18T17:19:59.108933Z", "iopub.status.idle": "2024-01-18T17:19:59.110988Z", "shell.execute_reply": "2024-01-18T17:19:59.110628Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "{('x', 1): 'X'}" ] }, "execution_count": 128, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sav['x'] = 'X'\n", "sav.defs" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "However, if the value changed, it is a reassignment." ] }, { "cell_type": "code", "execution_count": 129, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.112688Z", "iopub.status.busy": "2024-01-18T17:19:59.112558Z", "iopub.status.idle": "2024-01-18T17:19:59.115194Z", "shell.execute_reply": "2024-01-18T17:19:59.114789Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "{('x', 1): 'X', ('x', 2): 'Y'}" ] }, "execution_count": 129, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sav['x'] = 'Y'\n", "sav.defs" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "There is a subtlety here. It is possible for a child method to be called from the middle of a parent method, and for both to use the same variable name with different values. In this case, when the child returns, the parent will have the old variable with its old value in context. Our implementation treats this as a reassignment. This is acceptable, because recording an extra reassignment is harmless, whereas missing one is not. We will discuss later how this can be avoided."
] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We also define bookkeeping code for `register_event()`, `method_enter()`, and `method_exit()`, which are the methods responsible for keeping track of the method stack. The basic idea is that each `method_enter()` represents a new method invocation. Hence, it merits a new method id, which is generated from the `method_register`, and saved in the `method_id`. Since this is a new method, the method stack is extended by one element with this id. In the case of `method_exit()`, we pop the method stack, and reset the current `method_id` to what was below the current one." ] }, { "cell_type": "code", "execution_count": 130, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.116832Z", "iopub.status.busy": "2024-01-18T17:19:59.116721Z", "iopub.status.idle": "2024-01-18T17:19:59.119219Z", "shell.execute_reply": "2024-01-18T17:19:59.118988Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class AssignmentVars(AssignmentVars):\n", "    def method_enter(self, cxt, my_vars):\n", "        self.current_event = 'call'\n", "        self.call_stack.enter(cxt.method)\n", "        self.event_locations[self.call_stack.method_id] = []\n", "        self.register_event(cxt)\n", "        self.update(my_vars)\n", "\n", "    def method_exit(self, cxt, my_vars):\n", "        self.current_event = 'return'\n", "        self.register_event(cxt)\n", "        self.update(my_vars)\n", "        self.call_stack.leave()\n", "\n", "    def method_statement(self, cxt, my_vars):\n", "        self.current_event = 'line'\n", "        self.register_event(cxt)\n", "        self.update(my_vars)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "For each of the method events, we also register the event using `register_event()` which keeps track of the line numbers that were referenced in *this* method."
] }, { "cell_type": "code", "execution_count": 131, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.120875Z", "iopub.status.busy": "2024-01-18T17:19:59.120769Z", "iopub.status.idle": "2024-01-18T17:19:59.122468Z", "shell.execute_reply": "2024-01-18T17:19:59.122245Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class AssignmentVars(AssignmentVars):\n", "    def register_event(self, cxt):\n", "        self.event_locations[self.call_stack.method_id].append(cxt.line_no)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The `var_location_register()` method records the locations of newly added variables. For a `call` event, the definition location of a variable is the *current* location. For a `line` or `return` event, however, it is the location of the previous event in the current method." ] }, { "cell_type": "code", "execution_count": 132, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.123998Z", "iopub.status.busy": "2024-01-18T17:19:59.123895Z", "iopub.status.idle": "2024-01-18T17:19:59.126215Z", "shell.execute_reply": "2024-01-18T17:19:59.125957Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class AssignmentVars(AssignmentVars):\n", "    def var_location_register(self, my_vars):\n", "        def loc(mid):\n", "            if self.current_event == 'call':\n", "                return self.event_locations[mid][-1]\n", "            elif self.current_event == 'line':\n", "                return self.event_locations[mid][-2]\n", "            elif self.current_event == 'return':\n", "                return self.event_locations[mid][-2]\n", "            else:\n", "                assert False\n", "\n", "        my_loc = loc(self.call_stack.method_id)\n", "        for var in my_vars:\n", "            self.var_def_lines[var] = my_loc" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We define `defined_vars()` which returns the names of variables annotated with the line numbers as below."
] }, { "cell_type": "code", "execution_count": 133, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.127926Z", "iopub.status.busy": "2024-01-18T17:19:59.127808Z", "iopub.status.idle": "2024-01-18T17:19:59.130022Z", "shell.execute_reply": "2024-01-18T17:19:59.129760Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class AssignmentVars(AssignmentVars):\n", "    def defined_vars(self, formatted=True):\n", "        def fmt(k):\n", "            v = (k[0], self.var_def_lines[k])\n", "            return \"%s@%s\" % v if formatted else v\n", "\n", "        return [(fmt(k), v) for k, v in self.defs.items()]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Similar to `defined_vars()`, we define `seq_vars()`, which additionally annotates each variable with its reassignment sequence number." ] }, { "cell_type": "code", "execution_count": 134, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.131716Z", "iopub.status.busy": "2024-01-18T17:19:59.131582Z", "iopub.status.idle": "2024-01-18T17:19:59.133876Z", "shell.execute_reply": "2024-01-18T17:19:59.133515Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class AssignmentVars(AssignmentVars):\n", "    def seq_vars(self, formatted=True):\n", "        def fmt(k):\n", "            v = (k[0], self.var_def_lines[k], k[1])\n", "            return \"%s@%s:%s\" % v if formatted else v\n", "\n", "        return {fmt(k): v for k, v in self.defs.items()}" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### AssignmentTracker" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The `AssignmentTracker` keeps the assignment definitions using the `AssignmentVars` we defined previously."
] }, { "cell_type": "code", "execution_count": 135, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.136037Z", "iopub.status.busy": "2024-01-18T17:19:59.135858Z", "iopub.status.idle": "2024-01-18T17:19:59.138301Z", "shell.execute_reply": "2024-01-18T17:19:59.137957Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class AssignmentTracker(DefineTracker):\n", "    def __init__(self, my_input, trace, **kwargs):\n", "        self.options(kwargs)\n", "        self.my_input = my_input\n", "\n", "        self.my_assignments = self.create_assignments(my_input)\n", "\n", "        self.trace = trace\n", "        self.process()\n", "\n", "    def create_assignments(self, *args):\n", "        return AssignmentVars(*args)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "To fine-tune the process, we define an optional parameter called `track_return`. When tracing a method return, Python supplies the returned value along with the event; we treat it as a virtual variable. 
If `track_return` is set, we capture this value as a variable.\n", "\n", "* `track_return` -- if true, add a *virtual variable* to the Vars representing the return value" ] }, { "cell_type": "code", "execution_count": 136, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.139926Z", "iopub.status.busy": "2024-01-18T17:19:59.139813Z", "iopub.status.idle": "2024-01-18T17:19:59.141684Z", "shell.execute_reply": "2024-01-18T17:19:59.141437Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class AssignmentTracker(AssignmentTracker):\n", "    def options(self, kwargs):\n", "        self.track_return = kwargs.get('track_return', False)\n", "        super().options(kwargs)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "There can be different kinds of events during a trace, which include `call` when a function is entered, `return` when the function returns, `exception` when an exception is thrown, and `line` when a statement is executed.\n", "\n", "The previous `Tracker` was too simplistic in that it did not distinguish between the different events. We rectify that and define `on_call()`, `on_return()`, and `on_line()`, which get called on their corresponding events.\n", "\n", "Note that `on_line()` is also called from `on_return()`. The reason is that Python invokes the trace function *before* the corresponding line is executed. Hence, `on_return()` is effectively called with the bindings produced by the execution of the previous statement. Calling `on_line()` here is therefore appropriate, as it gives the event handler a chance to work on those bindings."
] }, { "cell_type": "code", "execution_count": 137, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.143326Z", "iopub.status.busy": "2024-01-18T17:19:59.143205Z", "iopub.status.idle": "2024-01-18T17:19:59.146632Z", "shell.execute_reply": "2024-01-18T17:19:59.146202Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class AssignmentTracker(AssignmentTracker):\n", "    def on_call(self, arg, cxt, my_vars):\n", "        my_vars = cxt.parameters(my_vars)\n", "        self.my_assignments.method_enter(cxt, self.fragments(my_vars))\n", "\n", "    def on_line(self, arg, cxt, my_vars):\n", "        self.my_assignments.method_statement(cxt, self.fragments(my_vars))\n", "\n", "    def on_return(self, arg, cxt, my_vars):\n", "        self.on_line(arg, cxt, my_vars)\n", "        my_vars = {'<-%s' % cxt.method: arg} if self.track_return else {}\n", "        self.my_assignments.method_exit(cxt, my_vars)\n", "\n", "    def on_exception(self, arg, cxt, my_vars):\n", "        return\n", "\n", "    def track_event(self, event, arg, cxt, my_vars):\n", "        self.current_event = event\n", "        dispatch = {\n", "            'call': self.on_call,\n", "            'return': self.on_return,\n", "            'line': self.on_line,\n", "            'exception': self.on_exception\n", "        }\n", "        dispatch[event](arg, cxt, my_vars)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We can now use `AssignmentTracker` to track the different variables. To verify that our variable line number inference works, we recover definitions from the functions `A()`, `B()` and `C()` (with data annotations removed so that the input fragments are correctly identified). 
" ] }, { "cell_type": "code", "execution_count": 138, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.148548Z", "iopub.status.busy": "2024-01-18T17:19:59.148448Z", "iopub.status.idle": "2024-01-18T17:19:59.150191Z", "shell.execute_reply": "2024-01-18T17:19:59.149885Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def C(cp_1): # type: ignore\n", " c_2 = cp_1\n", " c_3 = c_2\n", " return c_3" ] }, { "cell_type": "code", "execution_count": 139, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.151764Z", "iopub.status.busy": "2024-01-18T17:19:59.151666Z", "iopub.status.idle": "2024-01-18T17:19:59.153361Z", "shell.execute_reply": "2024-01-18T17:19:59.153063Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def B(bp_7): # type: ignore\n", " b_8 = bp_7\n", " return C(b_8)" ] }, { "cell_type": "code", "execution_count": 140, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.154837Z", "iopub.status.busy": "2024-01-18T17:19:59.154754Z", "iopub.status.idle": "2024-01-18T17:19:59.156577Z", "shell.execute_reply": "2024-01-18T17:19:59.156331Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "def A(ap_12): # type: ignore\n", " a_13 = ap_12\n", " a_14 = B(a_13)\n", " a_14 = a_14\n", " a_13 = a_14\n", " a_14 = B(a_13)\n", " a_14 = B(a_14)[3:]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Running `A()` with sufficient input." 
] }, { "cell_type": "code", "execution_count": 141, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.158064Z", "iopub.status.busy": "2024-01-18T17:19:59.157977Z", "iopub.status.idle": "2024-01-18T17:19:59.262249Z", "shell.execute_reply": "2024-01-18T17:19:59.261944Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ap_12@1:1 = '---xxx'\n", "a_13@2:1 = '---xxx'\n", "bp_7@1:1 = '---xxx'\n", "b_8@2:1 = '---xxx'\n", "cp_1@1:1 = '---xxx'\n", "c_2@2:1 = '---xxx'\n", "c_3@3:1 = '---xxx'\n", "a_14@3:1 = '---xxx'\n", "a_14@7:2 = 'xxx'\n", "\n", "ap_12@1 = '---xxx'\n", "a_13@2 = '---xxx'\n", "bp_7@1 = '---xxx'\n", "b_8@2 = '---xxx'\n", "cp_1@1 = '---xxx'\n", "c_2@2 = '---xxx'\n", "c_3@3 = '---xxx'\n", "a_14@3 = '---xxx'\n", "a_14@7 = 'xxx'\n" ] } ], "source": [ "with Tracer('---xxx') as tracer:\n", "    A(tracer.my_input)\n", "tracker = AssignmentTracker(tracer.my_input, tracer.trace, log=True)\n", "for k, v in tracker.my_assignments.seq_vars().items():\n", "    print(k, '=', repr(v))\n", "print()\n", "for k, v in tracker.my_assignments.defined_vars(formatted=True):\n", "    print(k, '=', repr(v))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "As can be seen, the line numbers are now correctly identified for each variable." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Let us try retrieving the assignments for a real-world example."
] }, { "cell_type": "code", "execution_count": 142, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.264366Z", "iopub.status.busy": "2024-01-18T17:19:59.264204Z", "iopub.status.idle": "2024-01-18T17:19:59.386053Z", "shell.execute_reply": "2024-01-18T17:19:59.385683Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "url@372 = 'http://user:pass@www.google.com:80/?q=path#ref'\n", "url@478 = '//user:pass@www.google.com:80/?q=path#ref'\n", "scheme@478 = 'http'\n", "url@481 = '/?q=path#ref'\n", "netloc@481 = 'user:pass@www.google.com:80'\n", "url@486 = '/?q=path'\n", "fragment@486 = 'ref'\n", "query@488 = 'q=path'\n", "url@393 = 'http://user:pass@www.google.com:80/?q=path#ref'\n", "\n", "url@372 = 'https://www.cispa.saarland:80/'\n", "url@478 = '//www.cispa.saarland:80/'\n", "scheme@478 = 'https'\n", "netloc@481 = 'www.cispa.saarland:80'\n", "url@393 = 'https://www.cispa.saarland:80/'\n", "\n", "url@372 = 'http://www.fuzzingbook.org/#News'\n", "url@478 = '//www.fuzzingbook.org/#News'\n", "scheme@478 = 'http'\n", "url@481 = '/#News'\n", "netloc@481 = 'www.fuzzingbook.org'\n", "fragment@486 = 'News'\n", "url@393 = 'http://www.fuzzingbook.org/#News'\n", "\n", "url@372 = 'ftp://freebsd.org/releases/5.8'\n", "url@478 = '//freebsd.org/releases/5.8'\n", "scheme@478 = 'ftp'\n", "url@481 = '/releases/5.8'\n", "netloc@481 = 'freebsd.org'\n", "url@393 = 'ftp://freebsd.org/releases/5.8'\n", "url@394 = '/releases/5.8'\n", "\n" ] } ], "source": [ "traces = []\n", "for inputstr in URLS_X:\n", "    clear_cache()\n", "    with Tracer(inputstr, files=['urllib/parse.py']) as tracer:\n", "        urlparse(tracer.my_input)\n", "    traces.append((tracer.my_input, tracer.trace))\n", "\n", "    tracker = AssignmentTracker(tracer.my_input, tracer.trace, log=True)\n", "    for k, v in tracker.my_assignments.defined_vars():\n", "        print(k, '=', repr(v))\n", "    print()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { 
"slide_type": "subslide" } }, "source": [ "The line numbers of variables can be verified from the source code of [urllib/parse.py](https://github.com/python/cpython/blob/3.6/Lib/urllib/parse.py)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Recovering a Derivation Tree" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Does handling variable reassignments help with our URL examples? We look at these next." ] }, { "cell_type": "code", "execution_count": 143, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.388488Z", "iopub.status.busy": "2024-01-18T17:19:59.388294Z", "iopub.status.idle": "2024-01-18T17:19:59.390501Z", "shell.execute_reply": "2024-01-18T17:19:59.390226Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class TreeMiner(TreeMiner):\n", " def get_derivation_tree(self):\n", " tree = (START_SYMBOL, [(self.my_input, [])])\n", " for var, value in self.my_assignments:\n", " self.log(0, \"%s=%s\" % (var, repr(value)))\n", " self.apply_new_definition(tree, var, value)\n", " return tree" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Example 1: Recovering URL Derivation Tree" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "First we obtain the derivation tree of the URL 1" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "##### URL 1 derivation tree" ] }, { "cell_type": "code", "execution_count": 144, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.392176Z", "iopub.status.busy": "2024-01-18T17:19:59.392067Z", "iopub.status.idle": "2024-01-18T17:19:59.818202Z", "shell.execute_reply": "2024-01-18T17:19:59.817862Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/svg+xml": [ 
"\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "0\n", "<start>\n", "\n", "\n", "\n", "1\n", "<url@372>\n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "\n", "\n", "\n", "2\n", "<scheme@478>\n", "\n", "\n", "\n", "1->2\n", "\n", "\n", "\n", "\n", "\n", "4\n", ": (58)\n", "\n", "\n", "\n", "1->4\n", "\n", "\n", "\n", "\n", "\n", "5\n", "<url@478>\n", "\n", "\n", "\n", "1->5\n", "\n", "\n", "\n", "\n", "\n", "3\n", "http\n", "\n", "\n", "\n", "2->3\n", "\n", "\n", "\n", "\n", "\n", "6\n", "//\n", "\n", "\n", "\n", "5->6\n", "\n", "\n", "\n", "\n", "\n", "7\n", "<netloc@481>\n", "\n", "\n", "\n", "5->7\n", "\n", "\n", "\n", "\n", "\n", "9\n", "<url@481>\n", "\n", "\n", "\n", "5->9\n", "\n", "\n", "\n", "\n", "\n", "8\n", "user:pass@www.google.com:80\n", "\n", "\n", "\n", "7->8\n", "\n", "\n", "\n", "\n", "\n", "10\n", "<url@486>\n", "\n", "\n", "\n", "9->10\n", "\n", "\n", "\n", "\n", "\n", "14\n", "# (35)\n", "\n", "\n", "\n", "9->14\n", "\n", "\n", "\n", "\n", "\n", "15\n", "<fragment@486>\n", "\n", "\n", "\n", "9->15\n", "\n", "\n", "\n", "\n", "\n", "11\n", "/?\n", "\n", "\n", "\n", "10->11\n", "\n", "\n", "\n", "\n", "\n", "12\n", "<query@488>\n", "\n", "\n", "\n", "10->12\n", "\n", "\n", "\n", "\n", "\n", "13\n", "q=path\n", "\n", "\n", "\n", "12->13\n", "\n", "\n", "\n", "\n", "\n", "16\n", "ref\n", "\n", "\n", "\n", "15->16\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 144, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clear_cache()\n", "with Tracer(URLS_X[0], files=['urllib/parse.py']) as tracer:\n", " urlparse(tracer.my_input)\n", "sm = AssignmentTracker(tracer.my_input, tracer.trace)\n", "dt = TreeMiner(tracer.my_input, sm.my_assignments.defined_vars())\n", "display_tree(dt.tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Next, we obtain the derivation tree of URL 4" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": 
"subslide" } }, "source": [ "##### URL 4 derivation tree" ] }, { "cell_type": "code", "execution_count": 145, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:19:59.819966Z", "iopub.status.busy": "2024-01-18T17:19:59.819850Z", "iopub.status.idle": "2024-01-18T17:20:00.242062Z", "shell.execute_reply": "2024-01-18T17:20:00.241702Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "0\n", "<start>\n", "\n", "\n", "\n", "1\n", "<url@372>\n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "\n", "\n", "\n", "2\n", "<scheme@478>\n", "\n", "\n", "\n", "1->2\n", "\n", "\n", "\n", "\n", "\n", "4\n", ": (58)\n", "\n", "\n", "\n", "1->4\n", "\n", "\n", "\n", "\n", "\n", "5\n", "<url@478>\n", "\n", "\n", "\n", "1->5\n", "\n", "\n", "\n", "\n", "\n", "3\n", "ftp\n", "\n", "\n", "\n", "2->3\n", "\n", "\n", "\n", "\n", "\n", "6\n", "//\n", "\n", "\n", "\n", "5->6\n", "\n", "\n", "\n", "\n", "\n", "7\n", "<netloc@481>\n", "\n", "\n", "\n", "5->7\n", "\n", "\n", "\n", "\n", "\n", "9\n", "<url@481>\n", "\n", "\n", "\n", "5->9\n", "\n", "\n", "\n", "\n", "\n", "8\n", "freebsd.org\n", "\n", "\n", "\n", "7->8\n", "\n", "\n", "\n", "\n", "\n", "10\n", "<url@394>\n", "\n", "\n", "\n", "9->10\n", "\n", "\n", "\n", "\n", "\n", "11\n", "/releases/5.8\n", "\n", "\n", "\n", "10->11\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 145, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clear_cache()\n", "with Tracer(URLS_X[-1], files=['urllib/parse.py']) as tracer:\n", " urlparse(tracer.my_input)\n", "sm = AssignmentTracker(tracer.my_input, tracer.trace)\n", "dt = TreeMiner(tracer.my_input, sm.my_assignments.defined_vars())\n", "display_tree(dt.tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The derivation trees seem to belong to the same grammar. 
Hence, we obtain the grammar for the complete set. First, we update `GrammarMiner` so that `recover_grammar()` uses `AssignmentTracker`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Recover Grammar" ] }, { "cell_type": "code", "execution_count": 146, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.244118Z", "iopub.status.busy": "2024-01-18T17:20:00.243980Z", "iopub.status.idle": "2024-01-18T17:20:00.246644Z", "shell.execute_reply": "2024-01-18T17:20:00.246307Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class GrammarMiner(GrammarMiner):\n", "    def update_grammar(self, inputstr, trace):\n", "        at = self.create_tracker(inputstr, trace)\n", "        dt = self.create_tree_miner(inputstr, at.my_assignments.defined_vars())\n", "        self.add_tree(dt)\n", "        return self.grammar\n", "\n", "    def create_tracker(self, *args):\n", "        return AssignmentTracker(*args)\n", "\n", "    def create_tree_miner(self, *args):\n", "        return TreeMiner(*args)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Next, we apply the modified `recover_grammar()` to the complete set of URLs." ] }, { "cell_type": "code", "execution_count": 147, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.248146Z", "iopub.status.busy": "2024-01-18T17:20:00.248037Z", "iopub.status.idle": "2024-01-18T17:20:00.389875Z", "shell.execute_reply": "2024-01-18T17:20:00.389589Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "url_grammar = recover_grammar(url_parse, URLS_X, files=['urllib/parse.py'])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The recovered grammar is below." 
] }, { "cell_type": "code", "execution_count": 148, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.391737Z", "iopub.status.busy": "2024-01-18T17:20:00.391645Z", "iopub.status.idle": "2024-01-18T17:20:00.406113Z", "shell.execute_reply": "2024-01-18T17:20:00.405812Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "start\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "url@372" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "url@372\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "scheme@478\n", ":\n", "url@478" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "scheme@478\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "https\n", "\n", "http\n", "\n", "ftp" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "url@478\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "//\n", "netloc@481\n", "/\n", "\n", "//\n", "netloc@481\n", "url@481" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "netloc@481\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "www.fuzzingbook.org\n", "\n", "user:pass@www.google.com:80\n", "\n", "www.cispa.saarland:80\n", "\n", "freebsd.org" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "url@481\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "url@486\n", "#\n", "fragment@486\n", "\n", "url@394\n", "\n", "/#\n", "fragment@486" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" 
}, { "name": "stdout", "output_type": "stream", "text": [ "url@486\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "/?\n", "query@488" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "query@488\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "q=path" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "fragment@486\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "ref\n", "\n", "News" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "url@394\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "/releases/5.8" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "syntax_diagram(url_grammar)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Let us fuzz a little to see if the produced values are sane." 
] }, { "cell_type": "code", "execution_count": 149, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.407735Z", "iopub.status.busy": "2024-01-18T17:20:00.407646Z", "iopub.status.idle": "2024-01-18T17:20:00.410896Z", "shell.execute_reply": "2024-01-18T17:20:00.410636Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "http://user:pass@www.google.com:80/#News\n", "https://freebsd.org/\n", "ftp://www.fuzzingbook.org/\n", "https://freebsd.org/\n", "https://user:pass@www.google.com:80/#ref\n", "http://user:pass@www.google.com:80/?q=path#News\n", "http://www.cispa.saarland:80/#ref\n", "https://www.fuzzingbook.org/#News\n", "ftp://www.fuzzingbook.org/\n", "ftp://user:pass@www.google.com:80/#ref\n" ] } ], "source": [ "f = GrammarFuzzer(url_grammar)\n", "for _ in range(10):\n", " print(f.fuzz())" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Our modifications do seem to help. Next, we check whether we can still retrieve the grammar for inventory." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Example 2: Recovering Inventory Grammar" ] }, { "cell_type": "code", "execution_count": 150, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.412640Z", "iopub.status.busy": "2024-01-18T17:20:00.412511Z", "iopub.status.idle": "2024-01-18T17:20:00.417833Z", "shell.execute_reply": "2024-01-18T17:20:00.417601Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "inventory_grammar = recover_grammar(process_vehicle, VEHICLES)" ] }, { "cell_type": "code", "execution_count": 151, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.419316Z", "iopub.status.busy": "2024-01-18T17:20:00.419226Z", "iopub.status.idle": "2024-01-18T17:20:00.428370Z", "shell.execute_reply": "2024-01-18T17:20:00.428092Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "start\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "vehicle@29" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "vehicle@29\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "year@30\n", ",\n", "kind@30\n", ",\n", "company@30\n", ",\n", "model@30" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "year@30\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "2000\n", "\n", "1997\n", "\n", "1999" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "kind@30\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "car\n", "\n", "van" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ 
"company@30\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "Mercury\n", "\n", "Chevy\n", "\n", "Ford" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "model@30\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "E350\n", "\n", "Cougar\n", "\n", "Venture" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "syntax_diagram(inventory_grammar)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Using fuzzing to produce values from the grammar." ] }, { "cell_type": "code", "execution_count": 152, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.430113Z", "iopub.status.busy": "2024-01-18T17:20:00.429882Z", "iopub.status.idle": "2024-01-18T17:20:00.432937Z", "shell.execute_reply": "2024-01-18T17:20:00.432700Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2000,van,Chevy,Cougar\n", "2000,van,Ford,Cougar\n", "1999,van,Chevy,E350\n", "1997,car,Ford,E350\n", "2000,van,Mercury,Venture\n", "1999,car,Ford,E350\n", "1999,car,Mercury,Venture\n", "1999,car,Mercury,E350\n", "1997,van,Mercury,Cougar\n", "1999,car,Chevy,Venture\n" ] } ], "source": [ "f = GrammarFuzzer(inventory_grammar)\n", "for _ in range(10):\n", " print(f.fuzz())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Problems with the Grammar Miner with Reassignment" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "One of the problems with our grammar miner is that it doesn't yet account for the current context. That is, when replacing, a variable can replace tokens that it does not have access to (and hence, it is not a fragment of). Consider this example." 
] }, { "cell_type": "code", "execution_count": 153, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.434672Z", "iopub.status.busy": "2024-01-18T17:20:00.434585Z", "iopub.status.idle": "2024-01-18T17:20:00.869062Z", "shell.execute_reply": "2024-01-18T17:20:00.868626Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "0\n", "<start>\n", "\n", "\n", "\n", "1\n", "<inventory@22>\n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "\n", "\n", "\n", "2\n", "<vehicle@24>\n", "\n", "\n", "\n", "1->2\n", "\n", "\n", "\n", "\n", "\n", "14\n", "\\n (10)\n", "\n", "\n", "\n", "1->14\n", "\n", "\n", "\n", "\n", "\n", "15\n", "<vehicle@24>\n", "\n", "\n", "\n", "1->15\n", "\n", "\n", "\n", "\n", "\n", "27\n", "\\n (10)\n", "\n", "\n", "\n", "1->27\n", "\n", "\n", "\n", "\n", "\n", "28\n", "<vehicle@24>\n", "\n", "\n", "\n", "1->28\n", "\n", "\n", "\n", "\n", "\n", "3\n", "<year@30>\n", "\n", "\n", "\n", "2->3\n", "\n", "\n", "\n", "\n", "\n", "5\n", ", (44)\n", "\n", "\n", "\n", "2->5\n", "\n", "\n", "\n", "\n", "\n", "6\n", "<kind@30>\n", "\n", "\n", "\n", "2->6\n", "\n", "\n", "\n", "\n", "\n", "8\n", ", (44)\n", "\n", "\n", "\n", "2->8\n", "\n", "\n", "\n", "\n", "\n", "9\n", "<company@30>\n", "\n", "\n", "\n", "2->9\n", "\n", "\n", "\n", "\n", "\n", "11\n", ", (44)\n", "\n", "\n", "\n", "2->11\n", "\n", "\n", "\n", "\n", "\n", "12\n", "<model@30>\n", "\n", "\n", "\n", "2->12\n", "\n", "\n", "\n", "\n", "\n", "4\n", "1997\n", "\n", "\n", "\n", "3->4\n", "\n", "\n", "\n", "\n", "\n", "7\n", "van\n", "\n", "\n", "\n", "6->7\n", "\n", "\n", "\n", "\n", "\n", "10\n", "Ford\n", "\n", "\n", "\n", "9->10\n", "\n", "\n", "\n", "\n", "\n", "13\n", "E350\n", "\n", "\n", "\n", "12->13\n", "\n", "\n", "\n", "\n", "\n", "16\n", "<year@30>\n", "\n", "\n", "\n", "15->16\n", "\n", "\n", "\n", "\n", "\n", "18\n", ", (44)\n", "\n", "\n", "\n", "15->18\n", "\n", "\n", "\n", "\n", 
"\n", "19\n", "<kind@30>\n", "\n", "\n", "\n", "15->19\n", "\n", "\n", "\n", "\n", "\n", "21\n", ", (44)\n", "\n", "\n", "\n", "15->21\n", "\n", "\n", "\n", "\n", "\n", "22\n", "<company@30>\n", "\n", "\n", "\n", "15->22\n", "\n", "\n", "\n", "\n", "\n", "24\n", ", (44)\n", "\n", "\n", "\n", "15->24\n", "\n", "\n", "\n", "\n", "\n", "25\n", "<model@30>\n", "\n", "\n", "\n", "15->25\n", "\n", "\n", "\n", "\n", "\n", "17\n", "2000\n", "\n", "\n", "\n", "16->17\n", "\n", "\n", "\n", "\n", "\n", "20\n", "car\n", "\n", "\n", "\n", "19->20\n", "\n", "\n", "\n", "\n", "\n", "23\n", "Mercury\n", "\n", "\n", "\n", "22->23\n", "\n", "\n", "\n", "\n", "\n", "26\n", "Cougar\n", "\n", "\n", "\n", "25->26\n", "\n", "\n", "\n", "\n", "\n", "29\n", "<year@30>\n", "\n", "\n", "\n", "28->29\n", "\n", "\n", "\n", "\n", "\n", "31\n", ",car,\n", "\n", "\n", "\n", "28->31\n", "\n", "\n", "\n", "\n", "\n", "32\n", "<company@30>\n", "\n", "\n", "\n", "28->32\n", "\n", "\n", "\n", "\n", "\n", "34\n", ", (44)\n", "\n", "\n", "\n", "28->34\n", "\n", "\n", "\n", "\n", "\n", "35\n", "<model@30>\n", "\n", "\n", "\n", "28->35\n", "\n", "\n", "\n", "\n", "\n", "30\n", "1999\n", "\n", "\n", "\n", "29->30\n", "\n", "\n", "\n", "\n", "\n", "33\n", "Chevy\n", "\n", "\n", "\n", "32->33\n", "\n", "\n", "\n", "\n", "\n", "36\n", "Venture\n", "\n", "\n", "\n", "35->36\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 153, "metadata": {}, "output_type": "execute_result" } ], "source": [ "with Tracer(INVENTORY) as tracer:\n", " process_inventory(tracer.my_input)\n", "sm = AssignmentTracker(tracer.my_input, tracer.trace)\n", "dt = TreeMiner(tracer.my_input, sm.my_assignments.defined_vars())\n", "display_tree(dt.tree, graph_attr=lr_graph)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "As can be seen, the derivation tree obtained is not quite what we expected. 
The issue is easily seen if we enable logging in the `TreeMiner`." ] }, { "cell_type": "code", "execution_count": 154, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.870958Z", "iopub.status.busy": "2024-01-18T17:20:00.870822Z", "iopub.status.idle": "2024-01-18T17:20:00.878668Z", "shell.execute_reply": "2024-01-18T17:20:00.878303Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " inventory@22='1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture'\n", "\t - Node: \t\t? (:'1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture')\n", "\t\t -> [0] '1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture'\n", "\t\t > ['']\n", " vehicle@24='1997,van,Ford,E350'\n", "\t - Node: \t\t? (:'1997,van,Ford,E350')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'1997,van,Ford,E350')\n", "\t\t -> [0] '1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture'\n", "\t\t > ['', '\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture']\n", " year@30='1997'\n", "\t - Node: \t\t? (:'1997')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'1997')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'1997')\n", "\t\t -> [0] '1997,van,Ford,E350'\n", "\t\t > ['', ',van,Ford,E350']\n", " kind@30='van'\n", "\t - Node: \t\t? (:'van')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'van')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'van')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'van')\n", "\t\t -> [0] '1997'\n", "\t\t -> [1] ',van,Ford,E350'\n", "\t\t > [',', '', ',Ford,E350']\n", " company@30='Ford'\n", "\t - Node: \t\t? (:'Ford')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'Ford')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'Ford')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'Ford')\n", "\t\t -> [0] '1997'\n", "\t\t -> [1] ','\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? 
(:'Ford')\n", "\t\t -> [0] 'van'\n", "\t\t -> [3] ',Ford,E350'\n", "\t\t > [',', '', ',E350']\n", " model@30='E350'\n", "\t - Node: \t\t? (:'E350')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'E350')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'E350')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'E350')\n", "\t\t -> [0] '1997'\n", "\t\t -> [1] ','\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'E350')\n", "\t\t -> [0] 'van'\n", "\t\t -> [3] ','\n", "\t\t -> [4] ''\n", "\t - Node: \t\t? (:'E350')\n", "\t\t -> [0] 'Ford'\n", "\t\t -> [5] ',E350'\n", "\t\t > [',', '']\n", " vehicle@24='2000,car,Mercury,Cougar'\n", "\t - Node: \t\t? (:'2000,car,Mercury,Cougar')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'2000,car,Mercury,Cougar')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'2000,car,Mercury,Cougar')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'2000,car,Mercury,Cougar')\n", "\t\t -> [0] '1997'\n", "\t\t -> [1] ','\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'2000,car,Mercury,Cougar')\n", "\t\t -> [0] 'van'\n", "\t\t -> [3] ','\n", "\t\t -> [4] ''\n", "\t - Node: \t\t? (:'2000,car,Mercury,Cougar')\n", "\t\t -> [0] 'Ford'\n", "\t\t -> [5] ','\n", "\t\t -> [6] ''\n", "\t - Node: \t\t? (:'2000,car,Mercury,Cougar')\n", "\t\t -> [0] 'E350'\n", "\t\t -> [1] '\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture'\n", "\t\t > ['\\n', '', '\\n1999,car,Chevy,Venture']\n", " year@30='2000'\n", "\t - Node: \t\t? (:'2000')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'2000')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'2000')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'2000')\n", "\t\t -> [0] '1997'\n", "\t\t -> [1] ','\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'2000')\n", "\t\t -> [0] 'van'\n", "\t\t -> [3] ','\n", "\t\t -> [4] ''\n", "\t - Node: \t\t? (:'2000')\n", "\t\t -> [0] 'Ford'\n", "\t\t -> [5] ','\n", "\t\t -> [6] ''\n", "\t - Node: \t\t? (:'2000')\n", "\t\t -> [0] 'E350'\n", "\t\t -> [1] '\\n'\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? 
(:'2000')\n", "\t\t -> [0] '2000,car,Mercury,Cougar'\n", "\t\t > ['', ',car,Mercury,Cougar']\n", " kind@30='car'\n", "\t - Node: \t\t? (:'car')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'car')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'car')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'car')\n", "\t\t -> [0] '1997'\n", "\t\t -> [1] ','\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'car')\n", "\t\t -> [0] 'van'\n", "\t\t -> [3] ','\n", "\t\t -> [4] ''\n", "\t - Node: \t\t? (:'car')\n", "\t\t -> [0] 'Ford'\n", "\t\t -> [5] ','\n", "\t\t -> [6] ''\n", "\t - Node: \t\t? (:'car')\n", "\t\t -> [0] 'E350'\n", "\t\t -> [1] '\\n'\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'car')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'car')\n", "\t\t -> [0] '2000'\n", "\t\t -> [1] ',car,Mercury,Cougar'\n", "\t\t > [',', '', ',Mercury,Cougar']\n", " company@30='Mercury'\n", "\t - Node: \t\t? (:'Mercury')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'Mercury')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'Mercury')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'Mercury')\n", "\t\t -> [0] '1997'\n", "\t\t -> [1] ','\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'Mercury')\n", "\t\t -> [0] 'van'\n", "\t\t -> [3] ','\n", "\t\t -> [4] ''\n", "\t - Node: \t\t? (:'Mercury')\n", "\t\t -> [0] 'Ford'\n", "\t\t -> [5] ','\n", "\t\t -> [6] ''\n", "\t - Node: \t\t? (:'Mercury')\n", "\t\t -> [0] 'E350'\n", "\t\t -> [1] '\\n'\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'Mercury')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'Mercury')\n", "\t\t -> [0] '2000'\n", "\t\t -> [1] ','\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'Mercury')\n", "\t\t -> [0] 'car'\n", "\t\t -> [3] ',Mercury,Cougar'\n", "\t\t > [',', '', ',Cougar']\n", " model@30='Cougar'\n", "\t - Node: \t\t? (:'Cougar')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'Cougar')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'Cougar')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? 
(:'Cougar')\n", "\t\t -> [0] '1997'\n", "\t\t -> [1] ','\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'Cougar')\n", "\t\t -> [0] 'van'\n", "\t\t -> [3] ','\n", "\t\t -> [4] ''\n", "\t - Node: \t\t? (:'Cougar')\n", "\t\t -> [0] 'Ford'\n", "\t\t -> [5] ','\n", "\t\t -> [6] ''\n", "\t - Node: \t\t? (:'Cougar')\n", "\t\t -> [0] 'E350'\n", "\t\t -> [1] '\\n'\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'Cougar')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'Cougar')\n", "\t\t -> [0] '2000'\n", "\t\t -> [1] ','\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'Cougar')\n", "\t\t -> [0] 'car'\n", "\t\t -> [3] ','\n", "\t\t -> [4] ''\n", "\t - Node: \t\t? (:'Cougar')\n", "\t\t -> [0] 'Mercury'\n", "\t\t -> [5] ',Cougar'\n", "\t\t > [',', '']\n", " vehicle@24='1999,car,Chevy,Venture'\n", "\t - Node: \t\t? (:'1999,car,Chevy,Venture')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'1999,car,Chevy,Venture')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'1999,car,Chevy,Venture')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'1999,car,Chevy,Venture')\n", "\t\t -> [0] '1997'\n", "\t\t -> [1] ','\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'1999,car,Chevy,Venture')\n", "\t\t -> [0] 'van'\n", "\t\t -> [3] ','\n", "\t\t -> [4] ''\n", "\t - Node: \t\t? (:'1999,car,Chevy,Venture')\n", "\t\t -> [0] 'Ford'\n", "\t\t -> [5] ','\n", "\t\t -> [6] ''\n", "\t - Node: \t\t? (:'1999,car,Chevy,Venture')\n", "\t\t -> [0] 'E350'\n", "\t\t -> [1] '\\n'\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'1999,car,Chevy,Venture')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'1999,car,Chevy,Venture')\n", "\t\t -> [0] '2000'\n", "\t\t -> [1] ','\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'1999,car,Chevy,Venture')\n", "\t\t -> [0] 'car'\n", "\t\t -> [3] ','\n", "\t\t -> [4] ''\n", "\t - Node: \t\t? (:'1999,car,Chevy,Venture')\n", "\t\t -> [0] 'Mercury'\n", "\t\t -> [5] ','\n", "\t\t -> [6] ''\n", "\t - Node: \t\t? 
(:'1999,car,Chevy,Venture')\n", "\t\t -> [0] 'Cougar'\n", "\t\t -> [3] '\\n1999,car,Chevy,Venture'\n", "\t\t > ['\\n', '']\n", " year@30='1999'\n", "\t - Node: \t\t? (:'1999')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'1999')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'1999')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'1999')\n", "\t\t -> [0] '1997'\n", "\t\t -> [1] ','\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'1999')\n", "\t\t -> [0] 'van'\n", "\t\t -> [3] ','\n", "\t\t -> [4] ''\n", "\t - Node: \t\t? (:'1999')\n", "\t\t -> [0] 'Ford'\n", "\t\t -> [5] ','\n", "\t\t -> [6] ''\n", "\t - Node: \t\t? (:'1999')\n", "\t\t -> [0] 'E350'\n", "\t\t -> [1] '\\n'\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'1999')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'1999')\n", "\t\t -> [0] '2000'\n", "\t\t -> [1] ','\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'1999')\n", "\t\t -> [0] 'car'\n", "\t\t -> [3] ','\n", "\t\t -> [4] ''\n", "\t - Node: \t\t? (:'1999')\n", "\t\t -> [0] 'Mercury'\n", "\t\t -> [5] ','\n", "\t\t -> [6] ''\n", "\t - Node: \t\t? (:'1999')\n", "\t\t -> [0] 'Cougar'\n", "\t\t -> [3] '\\n'\n", "\t\t -> [4] ''\n", "\t - Node: \t\t? (:'1999')\n", "\t\t -> [0] '1999,car,Chevy,Venture'\n", "\t\t > ['', ',car,Chevy,Venture']\n", " company@30='Chevy'\n", "\t - Node: \t\t? (:'Chevy')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'Chevy')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'Chevy')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'Chevy')\n", "\t\t -> [0] '1997'\n", "\t\t -> [1] ','\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'Chevy')\n", "\t\t -> [0] 'van'\n", "\t\t -> [3] ','\n", "\t\t -> [4] ''\n", "\t - Node: \t\t? (:'Chevy')\n", "\t\t -> [0] 'Ford'\n", "\t\t -> [5] ','\n", "\t\t -> [6] ''\n", "\t - Node: \t\t? (:'Chevy')\n", "\t\t -> [0] 'E350'\n", "\t\t -> [1] '\\n'\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'Chevy')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'Chevy')\n", "\t\t -> [0] '2000'\n", "\t\t -> [1] ','\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? 
(:'Chevy')\n", "\t\t -> [0] 'car'\n", "\t\t -> [3] ','\n", "\t\t -> [4] ''\n", "\t - Node: \t\t? (:'Chevy')\n", "\t\t -> [0] 'Mercury'\n", "\t\t -> [5] ','\n", "\t\t -> [6] ''\n", "\t - Node: \t\t? (:'Chevy')\n", "\t\t -> [0] 'Cougar'\n", "\t\t -> [3] '\\n'\n", "\t\t -> [4] ''\n", "\t - Node: \t\t? (:'Chevy')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'Chevy')\n", "\t\t -> [0] '1999'\n", "\t\t -> [1] ',car,Chevy,Venture'\n", "\t\t > [',car,', '', ',Venture']\n", " model@30='Venture'\n", "\t - Node: \t\t? (:'Venture')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'Venture')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'Venture')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'Venture')\n", "\t\t -> [0] '1997'\n", "\t\t -> [1] ','\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'Venture')\n", "\t\t -> [0] 'van'\n", "\t\t -> [3] ','\n", "\t\t -> [4] ''\n", "\t - Node: \t\t? (:'Venture')\n", "\t\t -> [0] 'Ford'\n", "\t\t -> [5] ','\n", "\t\t -> [6] ''\n", "\t - Node: \t\t? (:'Venture')\n", "\t\t -> [0] 'E350'\n", "\t\t -> [1] '\\n'\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'Venture')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'Venture')\n", "\t\t -> [0] '2000'\n", "\t\t -> [1] ','\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'Venture')\n", "\t\t -> [0] 'car'\n", "\t\t -> [3] ','\n", "\t\t -> [4] ''\n", "\t - Node: \t\t? (:'Venture')\n", "\t\t -> [0] 'Mercury'\n", "\t\t -> [5] ','\n", "\t\t -> [6] ''\n", "\t - Node: \t\t? (:'Venture')\n", "\t\t -> [0] 'Cougar'\n", "\t\t -> [3] '\\n'\n", "\t\t -> [4] ''\n", "\t - Node: \t\t? (:'Venture')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'Venture')\n", "\t\t -> [0] '1999'\n", "\t\t -> [1] ',car,'\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? 
(:'Venture')\n", "\t\t -> [0] 'Chevy'\n", "\t\t -> [3] ',Venture'\n", "\t\t > [',', '']\n" ] } ], "source": [ "dt = TreeMiner(tracer.my_input, sm.my_assignments.defined_vars(), log=True)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Look at the last statements of the log. In the prefix `1999,car,`, only the `year` got replaced; the fragment `,car,` remains a single terminal. There is no `kind` variable left to continue the replacement here. This happens because the value `'car'` in `'1999,car,Chevy,Venture'` is not treated as a new definition: the same value `'car'` was already assigned to the `kind` variable at the exact same location during a *different* method call (for `'2000,car,Mercury,Cougar'`)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## A Grammar Miner with Scope\n", "\n", "We need to incorporate inspection of the variables in the current context. We already have a stack of method calls so that we can obtain the current method at any point. We need to do the same for variables.\n", "\n", "For that, we extend `CallStack` to a new class `InputStack`, which holds the method invoked as well as the parameters observed. It is essentially the activation record of the method. We start with the original input at the base of the stack, and for each new method call, we push the parameters of that call onto the stack as a new record." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Input Stack" ] }, { "cell_type": "code", "execution_count": 155, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.880974Z", "iopub.status.busy": "2024-01-18T17:20:00.880757Z", "iopub.status.idle": "2024-01-18T17:20:00.883823Z", "shell.execute_reply": "2024-01-18T17:20:00.883368Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class InputStack(CallStack):\n", "    def __init__(self, i, fragment_len=FRAGMENT_LEN):\n", "        self.inputs = [{START_SYMBOL: i}]\n", "        self.fragment_len = fragment_len\n", "        super().__init__()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "To check whether a particular variable should be saved, we define `in_current_record()`, which checks for inclusion only in the variables of the current scope (rather than in the original input string)." ] }, { "cell_type": "code", "execution_count": 156, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.885942Z", "iopub.status.busy": "2024-01-18T17:20:00.885837Z", "iopub.status.idle": "2024-01-18T17:20:00.887886Z", "shell.execute_reply": "2024-01-18T17:20:00.887595Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class InputStack(InputStack):\n", "    def in_current_record(self, val):\n", "        return any(val in var for var in self.inputs[-1].values())" ] }, { "cell_type": "code", "execution_count": 157, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.889479Z", "iopub.status.busy": "2024-01-18T17:20:00.889393Z", "iopub.status.idle": "2024-01-18T17:20:00.890942Z", "shell.execute_reply": "2024-01-18T17:20:00.890668Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "my_istack = InputStack('hello my world')" ] }, { "cell_type": "code", "execution_count": 158, "metadata": { "execution": { "iopub.execute_input": 
"2024-01-18T17:20:00.892406Z", "iopub.status.busy": "2024-01-18T17:20:00.892317Z", "iopub.status.idle": "2024-01-18T17:20:00.894653Z", "shell.execute_reply": "2024-01-18T17:20:00.894320Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 158, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_istack.in_current_record('hello')" ] }, { "cell_type": "code", "execution_count": 159, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.896485Z", "iopub.status.busy": "2024-01-18T17:20:00.896384Z", "iopub.status.idle": "2024-01-18T17:20:00.899133Z", "shell.execute_reply": "2024-01-18T17:20:00.898767Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 159, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_istack.in_current_record('bye')" ] }, { "cell_type": "code", "execution_count": 160, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.901357Z", "iopub.status.busy": "2024-01-18T17:20:00.901064Z", "iopub.status.idle": "2024-01-18T17:20:00.903188Z", "shell.execute_reply": "2024-01-18T17:20:00.902934Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "my_istack.inputs.append({'greeting': 'hello', 'location': 'world'})" ] }, { "cell_type": "code", "execution_count": 161, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.904994Z", "iopub.status.busy": "2024-01-18T17:20:00.904722Z", "iopub.status.idle": "2024-01-18T17:20:00.906871Z", "shell.execute_reply": "2024-01-18T17:20:00.906547Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 161, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_istack.in_current_record('hello')" ] }, { "cell_type": "code", "execution_count": 162, "metadata": { "execution": { "iopub.execute_input": 
"2024-01-18T17:20:00.908489Z", "iopub.status.busy": "2024-01-18T17:20:00.908315Z", "iopub.status.idle": "2024-01-18T17:20:00.910685Z", "shell.execute_reply": "2024-01-18T17:20:00.910412Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 162, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_istack.in_current_record('my')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We define the method `ignored()` that returns true if either the variable is not a string, or the variable length is less than the defined `fragment_len`." ] }, { "cell_type": "code", "execution_count": 163, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.912996Z", "iopub.status.busy": "2024-01-18T17:20:00.912629Z", "iopub.status.idle": "2024-01-18T17:20:00.914964Z", "shell.execute_reply": "2024-01-18T17:20:00.914596Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class InputStack(InputStack):\n", " def ignored(self, val):\n", " return not (isinstance(val, str) and len(val) >= self.fragment_len)" ] }, { "cell_type": "code", "execution_count": 164, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.917280Z", "iopub.status.busy": "2024-01-18T17:20:00.917041Z", "iopub.status.idle": "2024-01-18T17:20:00.919591Z", "shell.execute_reply": "2024-01-18T17:20:00.919314Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 164, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_istack = InputStack('hello world')\n", "my_istack.ignored(1)" ] }, { "cell_type": "code", "execution_count": 165, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.921204Z", "iopub.status.busy": "2024-01-18T17:20:00.921062Z", "iopub.status.idle": "2024-01-18T17:20:00.923206Z", "shell.execute_reply": 
"2024-01-18T17:20:00.922952Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "True" ] }, "execution_count": 165, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_istack.ignored('a')" ] }, { "cell_type": "code", "execution_count": 166, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.924714Z", "iopub.status.busy": "2024-01-18T17:20:00.924584Z", "iopub.status.idle": "2024-01-18T17:20:00.926738Z", "shell.execute_reply": "2024-01-18T17:20:00.926475Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "False" ] }, "execution_count": 166, "metadata": {}, "output_type": "execute_result" } ], "source": [ "my_istack.ignored('help')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We can now define the `in_scope()` method, which first checks whether the value is to be ignored and, if not, whether the value is present in the current scope." ] }, { "cell_type": "code", "execution_count": 167, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.928680Z", "iopub.status.busy": "2024-01-18T17:20:00.928495Z", "iopub.status.idle": "2024-01-18T17:20:00.930874Z", "shell.execute_reply": "2024-01-18T17:20:00.930508Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class InputStack(InputStack):\n", "    def in_scope(self, k, val):\n", "        if self.ignored(val):\n", "            return False\n", "        return self.in_current_record(val)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Finally, we update `enter()` so that it pushes the relevant variables of the current context onto the stack."
] }, { "cell_type": "code", "execution_count": 168, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.933343Z", "iopub.status.busy": "2024-01-18T17:20:00.933080Z", "iopub.status.idle": "2024-01-18T17:20:00.935467Z", "shell.execute_reply": "2024-01-18T17:20:00.935174Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class InputStack(InputStack):\n", "    def enter(self, method, inputs):\n", "        my_inputs = {k: v for k, v in inputs.items() if self.in_scope(k, v)}\n", "        self.inputs.append(my_inputs)\n", "        super().enter(method)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "When a method returns, we also need a corresponding `leave()` to pop the inputs and unwind the stack." ] }, { "cell_type": "code", "execution_count": 169, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.937314Z", "iopub.status.busy": "2024-01-18T17:20:00.937143Z", "iopub.status.idle": "2024-01-18T17:20:00.939071Z", "shell.execute_reply": "2024-01-18T17:20:00.938843Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class InputStack(InputStack):\n", "    def leave(self):\n", "        self.inputs.pop()\n", "        super().leave()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### ScopedVars" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We need to update our `AssignmentVars` to include information about which scope the variable was defined in. We start by updating `method_init()`."
] }, { "cell_type": "code", "execution_count": 170, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.940814Z", "iopub.status.busy": "2024-01-18T17:20:00.940581Z", "iopub.status.idle": "2024-01-18T17:20:00.942724Z", "shell.execute_reply": "2024-01-18T17:20:00.942430Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class ScopedVars(AssignmentVars):\n", "    def method_init(self):\n", "        self.call_stack = self.create_call_stack(self.my_input)\n", "        self.event_locations = {self.call_stack.method_id: []}\n", "\n", "    def create_call_stack(self, i):\n", "        return InputStack(i)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Similarly, the `method_enter()` now initializes the `accessed_seq_var` for the current method call." ] }, { "cell_type": "code", "execution_count": 171, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.944383Z", "iopub.status.busy": "2024-01-18T17:20:00.944233Z", "iopub.status.idle": "2024-01-18T17:20:00.946889Z", "shell.execute_reply": "2024-01-18T17:20:00.946467Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class ScopedVars(ScopedVars):\n", "    def method_enter(self, cxt, my_vars):\n", "        self.current_event = 'call'\n", "        self.call_stack.enter(cxt.method, my_vars)\n", "        self.accessed_seq_var[self.call_stack.method_id] = {}\n", "        self.event_locations[self.call_stack.method_id] = []\n", "        self.register_event(cxt)\n", "        self.update(my_vars)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The `update()` method now saves the context in which the value is defined. In the case of a parameter to a function, the context should be the context in which the function was called. 
On the other hand, a value defined during a statement execution would have the current context.\n", "\n", "Further, we annotate the value rather than the key, because we do not want to duplicate variables when parameters are still in context on the next line. They will have the same value, but a different context, because they are then seen during a statement execution.\n" ] }, { "cell_type": "code", "execution_count": 172, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.949110Z", "iopub.status.busy": "2024-01-18T17:20:00.948877Z", "iopub.status.idle": "2024-01-18T17:20:00.951476Z", "shell.execute_reply": "2024-01-18T17:20:00.951202Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class ScopedVars(ScopedVars):\n", "    def update(self, v):\n", "        if self.current_event == 'call':\n", "            context = -2\n", "        elif self.current_event == 'line':\n", "            context = -1\n", "        else:\n", "            context = -1\n", "        for k, v in v.items():\n", "            self._set_kv(k, (v, self.call_stack.at(context)))\n", "        self.var_location_register(self.new_vars)\n", "        self.new_vars = set()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We also need to save the current method invocation to determine which variables are in scope. This information is now incorporated in the variable name as `accessed_seq_var[method_id][var]`."
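, "\n", "\n",
"As a rough illustration, the following sketch uses a plain dictionary as a stand-in for `accessed_seq_var`; the helper `assign()` and the sample method ids are hypothetical and not part of the chapter's classes. It shows that the same variable name yields distinct keys in distinct method invocations.\n",
"\n",
"```python\n",
"# Stand-in for accessed_seq_var: method id -> {variable -> counter}\n",
"accessed = {}\n",
"\n",
"\n",
"def assign(method_id, var):\n",
"    # Hypothetical helper: bump the per-invocation counter, build the key\n",
"    counters = accessed.setdefault(method_id, {})\n",
"    counters[var] = counters.get(var, 0) + 1\n",
"    return (var, method_id, counters[var])\n",
"\n",
"\n",
"print(assign(('process_vehicle', 2), 'year'))\n",
"print(assign(('process_vehicle', 4), 'year'))\n",
"print(assign(('process_vehicle', 4), 'year'))\n",
"```"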
] }, { "cell_type": "code", "execution_count": 173, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.953003Z", "iopub.status.busy": "2024-01-18T17:20:00.952876Z", "iopub.status.idle": "2024-01-18T17:20:00.954679Z", "shell.execute_reply": "2024-01-18T17:20:00.954427Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class ScopedVars(ScopedVars):\n", "    def var_name(self, var):\n", "        return (var, self.call_stack.method_id,\n", "                self.accessed_seq_var[self.call_stack.method_id][var])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "As before, `var_access()` simply initializes the corresponding counter, this time in the context of `method_id`." ] }, { "cell_type": "code", "execution_count": 174, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.956314Z", "iopub.status.busy": "2024-01-18T17:20:00.956177Z", "iopub.status.idle": "2024-01-18T17:20:00.958197Z", "shell.execute_reply": "2024-01-18T17:20:00.957965Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class ScopedVars(ScopedVars):\n", "    def var_access(self, var):\n", "        if var not in self.accessed_seq_var[self.call_stack.method_id]:\n", "            self.accessed_seq_var[self.call_stack.method_id][var] = 0\n", "        return self.var_name(var)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "During a variable reassignment, we update the `accessed_seq_var` to reflect the new count."
] }, { "cell_type": "code", "execution_count": 175, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.959909Z", "iopub.status.busy": "2024-01-18T17:20:00.959772Z", "iopub.status.idle": "2024-01-18T17:20:00.961847Z", "shell.execute_reply": "2024-01-18T17:20:00.961570Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class ScopedVars(ScopedVars):\n", "    def var_assign(self, var):\n", "        self.accessed_seq_var[self.call_stack.method_id][var] += 1\n", "        self.new_vars.add(self.var_name(var))\n", "        return self.var_name(var)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We now update `defined_vars()` to account for the new information." ] }, { "cell_type": "code", "execution_count": 176, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.964225Z", "iopub.status.busy": "2024-01-18T17:20:00.964055Z", "iopub.status.idle": "2024-01-18T17:20:00.966404Z", "shell.execute_reply": "2024-01-18T17:20:00.966160Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class ScopedVars(ScopedVars):\n", "    def defined_vars(self, formatted=True):\n", "        def fmt(k):\n", "            method, i = k[1]\n", "            v = (method, i, k[0], self.var_def_lines[k])\n", "            return \"%s[%d]:%s@%s\" % v if formatted else v\n", "\n", "        return [(fmt(k), v) for k, v in self.defs.items()]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Likewise, we update `seq_vars()` to account for the new information."
] }, { "cell_type": "code", "execution_count": 177, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.968105Z", "iopub.status.busy": "2024-01-18T17:20:00.967904Z", "iopub.status.idle": "2024-01-18T17:20:00.970325Z", "shell.execute_reply": "2024-01-18T17:20:00.970066Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class ScopedVars(ScopedVars):\n", "    def seq_vars(self, formatted=True):\n", "        def fmt(k):\n", "            method, i = k[1]\n", "            v = (method, i, k[0], self.var_def_lines[k], k[2])\n", "            return \"%s[%d]:%s@%s:%s\" % v if formatted else v\n", "\n", "        return {fmt(k): v for k, v in self.defs.items()}" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Scope Tracker" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "With `InputStack` and `ScopedVars` defined, we can now define the `ScopeTracker`. The `ScopeTracker` only saves variables if the value is present in the current scope." ] }, { "cell_type": "code", "execution_count": 178, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.972042Z", "iopub.status.busy": "2024-01-18T17:20:00.971906Z", "iopub.status.idle": "2024-01-18T17:20:00.973935Z", "shell.execute_reply": "2024-01-18T17:20:00.973686Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class ScopeTracker(AssignmentTracker):\n", "    def __init__(self, my_input, trace, **kwargs):\n", "        self.current_event = None\n", "        super().__init__(my_input, trace, **kwargs)\n", "\n", "    def create_assignments(self, *args):\n", "        return ScopedVars(*args)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We define a wrapper for checking whether a variable is present in the scope."
] }, { "cell_type": "code", "execution_count": 179, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.975421Z", "iopub.status.busy": "2024-01-18T17:20:00.975284Z", "iopub.status.idle": "2024-01-18T17:20:00.977249Z", "shell.execute_reply": "2024-01-18T17:20:00.976997Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class ScopeTracker(ScopeTracker):\n", "    def is_input_fragment(self, var, value):\n", "        return self.my_assignments.call_stack.in_scope(var, value)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We can use the `ScopeTracker` as follows." ] }, { "cell_type": "code", "execution_count": 180, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.979376Z", "iopub.status.busy": "2024-01-18T17:20:00.979209Z", "iopub.status.idle": "2024-01-18T17:20:00.989792Z", "shell.execute_reply": "2024-01-18T17:20:00.989457Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "process_inventory[1]:inventory@22:1 = ('1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture', ('', 0))\n", "process_inventory[1]:inventory@22:2 = ('1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture', ('process_inventory', 1))\n", "process_inventory[1]:vehicle@24:1 = ('1997,van,Ford,E350', ('process_inventory', 1))\n", "process_vehicle[2]:vehicle@29:1 = ('1997,van,Ford,E350', ('process_inventory', 1))\n", "process_vehicle[2]:vehicle@29:2 = ('1997,van,Ford,E350', ('process_vehicle', 2))\n", "process_vehicle[2]:year@30:1 = ('1997', ('process_vehicle', 2))\n", "process_vehicle[2]:kind@30:1 = ('van', ('process_vehicle', 2))\n", "process_vehicle[2]:company@30:1 = ('Ford', ('process_vehicle', 2))\n", "process_vehicle[2]:model@30:1 = ('E350', ('process_vehicle', 2))\n", "process_van[3]:year@40:1 = ('1997', ('process_vehicle', 2))\n", "process_van[3]:company@40:1 = ('Ford', 
('process_vehicle', 2))\n", "process_van[3]:model@40:1 = ('E350', ('process_vehicle', 2))\n", "process_van[3]:year@40:2 = ('1997', ('process_van', 3))\n", "process_van[3]:company@40:2 = ('Ford', ('process_van', 3))\n", "process_van[3]:model@40:2 = ('E350', ('process_van', 3))\n", "process_inventory[1]:vehicle@24:2 = ('2000,car,Mercury,Cougar', ('process_inventory', 1))\n", "process_vehicle[4]:vehicle@29:1 = ('2000,car,Mercury,Cougar', ('process_inventory', 1))\n", "process_vehicle[4]:vehicle@29:2 = ('2000,car,Mercury,Cougar', ('process_vehicle', 4))\n", "process_vehicle[4]:year@30:1 = ('2000', ('process_vehicle', 4))\n", "process_vehicle[4]:kind@30:1 = ('car', ('process_vehicle', 4))\n", "process_vehicle[4]:company@30:1 = ('Mercury', ('process_vehicle', 4))\n", "process_vehicle[4]:model@30:1 = ('Cougar', ('process_vehicle', 4))\n", "process_car[5]:year@49:1 = ('2000', ('process_vehicle', 4))\n", "process_car[5]:company@49:1 = ('Mercury', ('process_vehicle', 4))\n", "process_car[5]:model@49:1 = ('Cougar', ('process_vehicle', 4))\n", "process_car[5]:year@49:2 = ('2000', ('process_car', 5))\n", "process_car[5]:company@49:2 = ('Mercury', ('process_car', 5))\n", "process_car[5]:model@49:2 = ('Cougar', ('process_car', 5))\n", "process_inventory[1]:vehicle@24:3 = ('1999,car,Chevy,Venture', ('process_inventory', 1))\n", "process_vehicle[6]:vehicle@29:1 = ('1999,car,Chevy,Venture', ('process_inventory', 1))\n", "process_vehicle[6]:vehicle@29:2 = ('1999,car,Chevy,Venture', ('process_vehicle', 6))\n", "process_vehicle[6]:year@30:1 = ('1999', ('process_vehicle', 6))\n", "process_vehicle[6]:kind@30:1 = ('car', ('process_vehicle', 6))\n", "process_vehicle[6]:company@30:1 = ('Chevy', ('process_vehicle', 6))\n", "process_vehicle[6]:model@30:1 = ('Venture', ('process_vehicle', 6))\n", "process_car[7]:year@49:1 = ('1999', ('process_vehicle', 6))\n", "process_car[7]:company@49:1 = ('Chevy', ('process_vehicle', 6))\n", "process_car[7]:model@49:1 = ('Venture', ('process_vehicle', 
6))\n", "process_car[7]:year@49:2 = ('1999', ('process_car', 7))\n", "process_car[7]:company@49:2 = ('Chevy', ('process_car', 7))\n", "process_car[7]:model@49:2 = ('Venture', ('process_car', 7))\n" ] } ], "source": [ "vehicle_traces = []\n", "with Tracer(INVENTORY) as tracer:\n", "    process_inventory(tracer.my_input)\n", "sm = ScopeTracker(tracer.my_input, tracer.trace)\n", "vehicle_traces.append((tracer.my_input, sm))\n", "for k, v in sm.my_assignments.seq_vars().items():\n", "    print(k, '=', repr(v))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Recovering a Derivation Tree" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The main difference in `apply_new_definition()` is that we add a second condition that checks for scope. In particular, variables are only allowed to replace portions of string fragments that were in scope.\n", "The variable scope is indicated by `scope`. However, merely accounting for scope is not sufficient. For example, consider the fragment below." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "```python\n", "def my_fn(stringval):\n", "    partA, partB = stringval.split('/')\n", "    return partA, partB\n", "\n", "svalue = ...\n", "v1, v2 = my_fn(svalue)\n", "```" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Here, `v1` and `v2` get their values from a previous function call, not from their current context. We therefore have to provide an exception for cases where an internal child method call may have generated a large fragment, as shown above. To account for that, we define `mseq()`, which retrieves the method call sequence number. In the above case, the `mseq()` of the internal child method call would be larger than the current `mseq()`. If so, we allow the replacement to proceed."
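, "\n", "\n",
"As a minimal sketch of this rule, the hypothetical helper `may_replace()` below (not part of the chapter's classes) restates the check in isolation. It assumes that scopes and method sequences are plain integers, and that a fragment produced by a later (child) call carries a larger number.\n",
"\n",
"```python\n",
"def may_replace(var_seq, var_scope, fragment_scope):\n",
"    # Hypothetical helper, for illustration only.\n",
"    # Same scope: the variable may replace part of the fragment.\n",
"    if var_scope == fragment_scope:\n",
"        return True\n",
"    # Exception: the fragment stems from a later (child) call.\n",
"    return var_seq <= fragment_scope\n",
"\n",
"\n",
"# v1 lives in call 1; partA was produced inside my_fn(), call 2\n",
"print(may_replace(var_seq=1, var_scope=1, fragment_scope=2))\n",
"# a variable of call 3 must not claim a fragment of call 2\n",
"print(may_replace(var_seq=3, var_scope=3, fragment_scope=2))\n",
"```\n",
"\n",
"Under these assumptions, the first call allows the replacement and the second one rejects it."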
] }, { "cell_type": "code", "execution_count": 181, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.992115Z", "iopub.status.busy": "2024-01-18T17:20:00.991908Z", "iopub.status.idle": "2024-01-18T17:20:00.993975Z", "shell.execute_reply": "2024-01-18T17:20:00.993658Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class ScopeTreeMiner(TreeMiner):\n", "    def mseq(self, key):\n", "        method, seq, var, lno = key\n", "        return seq" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The `nt_var()` method needs to take the tuple and generate a non-terminal symbol out of it. We skip the method sequence because it is not relevant for the grammar." ] }, { "cell_type": "code", "execution_count": 182, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.995841Z", "iopub.status.busy": "2024-01-18T17:20:00.995688Z", "iopub.status.idle": "2024-01-18T17:20:00.997589Z", "shell.execute_reply": "2024-01-18T17:20:00.997337Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class ScopeTreeMiner(ScopeTreeMiner):\n", "    def nt_var(self, key):\n", "        method, seq, var, lno = key\n", "        return to_nonterminal(\"%s@%d:%s\" % (method, lno, var))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We now redefine `apply_new_definition()` to account for context and scope. In particular, a variable is allowed to replace a part of a value only if the variable is in *scope*, that is, if its scope (the method sequence number of its calling context if it is a parameter, or of the current context if it is assigned in a statement) is the same as the value's method sequence number. An exception is made when the value's method sequence number is greater than the variable's method sequence number. In that case, the value may have come from an internal call. 
We allow the replacement to proceed in that case." ] }, { "cell_type": "code", "execution_count": 183, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:00.999201Z", "iopub.status.busy": "2024-01-18T17:20:00.999083Z", "iopub.status.idle": "2024-01-18T17:20:01.003405Z", "shell.execute_reply": "2024-01-18T17:20:01.003112Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class ScopeTreeMiner(ScopeTreeMiner):\n", "    def partition(self, part, value):\n", "        return value.partition(part)\n", "    def partition_by_part(self, pair, value):\n", "        (nt_var, nt_seq), (v, v_scope) = pair\n", "        prefix_k_suffix = [\n", "            (nt_var, [(v, [], nt_seq)]) if i == 1 else (e, [])\n", "            for i, e in enumerate(self.partition(v, value))\n", "            if e]\n", "        return prefix_k_suffix\n", "\n", "    def insert_into_tree(self, my_tree, pair):\n", "        var, values, my_scope = my_tree\n", "        (nt_var, nt_seq), (v, v_scope) = pair\n", "        applied = False\n", "        for i, value_ in enumerate(values):\n", "            key, arr, scope = value_\n", "            self.log(2, \"-> [%d] %s\" % (i, repr(value_)))\n", "            if is_nonterminal(key):\n", "                applied = self.insert_into_tree(value_, pair)\n", "                if applied:\n", "                    break\n", "            else:\n", "                if v_scope != scope:\n", "                    if nt_seq > scope:\n", "                        continue\n", "                if not v or not self.string_part_of_value(v, key):\n", "                    continue\n", "                prefix_k_suffix = [(k, children, scope) for k, children\n", "                                   in self.partition_by_part(pair, key)]\n", "                del values[i]\n", "                for j, rep in enumerate(prefix_k_suffix):\n", "                    values.insert(j + i, rep)\n", "\n", "                applied = True\n", "                self.log(2, \" > %s\" % (repr([i[0] for i in prefix_k_suffix])))\n", "                break\n", "        return applied" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The `apply_new_definition()` is now modified to carry additional contextual information `mseq`."
] }, { "cell_type": "code", "execution_count": 184, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:01.005014Z", "iopub.status.busy": "2024-01-18T17:20:01.004925Z", "iopub.status.idle": "2024-01-18T17:20:01.006918Z", "shell.execute_reply": "2024-01-18T17:20:01.006701Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class ScopeTreeMiner(ScopeTreeMiner):\n", "    def apply_new_definition(self, tree, var, value_):\n", "        nt_var = self.nt_var(var)\n", "        seq = self.mseq(var)\n", "        val, (smethod, mseq) = value_\n", "        return self.insert_into_tree(tree, ((nt_var, seq), (val, mseq)))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We also modify `get_derivation_tree()` so that the initial node carries the context." ] }, { "cell_type": "code", "execution_count": 185, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:01.008460Z", "iopub.status.busy": "2024-01-18T17:20:01.008376Z", "iopub.status.idle": "2024-01-18T17:20:01.010865Z", "shell.execute_reply": "2024-01-18T17:20:01.010547Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class ScopeTreeMiner(ScopeTreeMiner):\n", "    def get_derivation_tree(self):\n", "        tree = (START_SYMBOL, [(self.my_input, [], 0)], 0)\n", "        for var, value in self.my_assignments:\n", "            self.log(0, \"%s=%s\" % (var, repr(value)))\n", "            self.apply_new_definition(tree, var, value)\n", "        return tree" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Example 1: Recovering URL Parse Tree" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We verify that our URL parse tree recovery still works as expected."
] }, { "cell_type": "code", "execution_count": 186, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:01.012798Z", "iopub.status.busy": "2024-01-18T17:20:01.012641Z", "iopub.status.idle": "2024-01-18T17:20:01.146944Z", "shell.execute_reply": "2024-01-18T17:20:01.146608Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "('urlparse', 1, 'url', 372) = ('http://user:pass@www.google.com:80/?q=path#ref', ('', 0))\n", "('urlparse', 1, 'url', 372) = ('http://user:pass@www.google.com:80/?q=path#ref', ('urlparse', 1))\n", "('urlsplit', 3, 'url', 437) = ('http://user:pass@www.google.com:80/?q=path#ref', ('urlparse', 1))\n", "('urlsplit', 3, 'url', 437) = ('http://user:pass@www.google.com:80/?q=path#ref', ('urlsplit', 3))\n", "('urlsplit', 3, 'url', 478) = ('//user:pass@www.google.com:80/?q=path#ref', ('urlsplit', 3))\n", "('urlsplit', 3, 'scheme', 478) = ('http', ('urlsplit', 3))\n", "('_splitnetloc', 5, 'url', 411) = ('//user:pass@www.google.com:80/?q=path#ref', ('urlsplit', 3))\n", "('_splitnetloc', 5, 'url', 411) = ('//user:pass@www.google.com:80/?q=path#ref', ('_splitnetloc', 5))\n", "('urlsplit', 3, 'url', 481) = ('/?q=path#ref', ('urlsplit', 3))\n", "('urlsplit', 3, 'netloc', 481) = ('user:pass@www.google.com:80', ('urlsplit', 3))\n", "('urlsplit', 3, 'url', 486) = ('/?q=path', ('urlsplit', 3))\n", "('urlsplit', 3, 'fragment', 486) = ('ref', ('urlsplit', 3))\n", "('urlsplit', 3, 'query', 488) = ('q=path', ('urlsplit', 3))\n", "('_checknetloc', 6, 'netloc', 419) = ('user:pass@www.google.com:80', ('urlsplit', 3))\n", "('_checknetloc', 6, 'netloc', 419) = ('user:pass@www.google.com:80', ('_checknetloc', 6))\n", "('urlparse', 1, 'scheme', 394) = ('http', ('urlparse', 1))\n", "('urlparse', 1, 'netloc', 394) = ('user:pass@www.google.com:80', ('urlparse', 1))\n", "('urlparse', 1, 'query', 394) = ('q=path', ('urlparse', 1))\n", "('urlparse', 1, 'fragment', 394) = ('ref', ('urlparse', 1))\n", 
"('urlparse', 1, 'url', 372) = ('https://www.cispa.saarland:80/', ('', 0))\n", "('urlparse', 1, 'url', 372) = ('https://www.cispa.saarland:80/', ('urlparse', 1))\n", "('urlsplit', 3, 'url', 437) = ('https://www.cispa.saarland:80/', ('urlparse', 1))\n", "('urlsplit', 3, 'url', 437) = ('https://www.cispa.saarland:80/', ('urlsplit', 3))\n", "('urlsplit', 3, 'url', 478) = ('//www.cispa.saarland:80/', ('urlsplit', 3))\n", "('urlsplit', 3, 'scheme', 478) = ('https', ('urlsplit', 3))\n", "('_splitnetloc', 5, 'url', 411) = ('//www.cispa.saarland:80/', ('urlsplit', 3))\n", "('_splitnetloc', 5, 'url', 411) = ('//www.cispa.saarland:80/', ('_splitnetloc', 5))\n", "('urlsplit', 3, 'netloc', 481) = ('www.cispa.saarland:80', ('urlsplit', 3))\n", "('_checknetloc', 6, 'netloc', 419) = ('www.cispa.saarland:80', ('urlsplit', 3))\n", "('_checknetloc', 6, 'netloc', 419) = ('www.cispa.saarland:80', ('_checknetloc', 6))\n", "('urlparse', 1, 'scheme', 394) = ('https', ('urlparse', 1))\n", "('urlparse', 1, 'netloc', 394) = ('www.cispa.saarland:80', ('urlparse', 1))\n", "('urlparse', 1, 'url', 372) = ('http://www.fuzzingbook.org/#News', ('', 0))\n", "('urlparse', 1, 'url', 372) = ('http://www.fuzzingbook.org/#News', ('urlparse', 1))\n", "('urlsplit', 3, 'url', 437) = ('http://www.fuzzingbook.org/#News', ('urlparse', 1))\n", "('urlsplit', 3, 'url', 437) = ('http://www.fuzzingbook.org/#News', ('urlsplit', 3))\n", "('urlsplit', 3, 'url', 478) = ('//www.fuzzingbook.org/#News', ('urlsplit', 3))\n", "('urlsplit', 3, 'scheme', 478) = ('http', ('urlsplit', 3))\n", "('_splitnetloc', 5, 'url', 411) = ('//www.fuzzingbook.org/#News', ('urlsplit', 3))\n", "('_splitnetloc', 5, 'url', 411) = ('//www.fuzzingbook.org/#News', ('_splitnetloc', 5))\n", "('urlsplit', 3, 'url', 481) = ('/#News', ('urlsplit', 3))\n", "('urlsplit', 3, 'netloc', 481) = ('www.fuzzingbook.org', ('urlsplit', 3))\n", "('urlsplit', 3, 'fragment', 486) = ('News', ('urlsplit', 3))\n", "('_checknetloc', 6, 'netloc', 419) = 
('www.fuzzingbook.org', ('urlsplit', 3))\n", "('_checknetloc', 6, 'netloc', 419) = ('www.fuzzingbook.org', ('_checknetloc', 6))\n", "('urlparse', 1, 'scheme', 394) = ('http', ('urlparse', 1))\n", "('urlparse', 1, 'netloc', 394) = ('www.fuzzingbook.org', ('urlparse', 1))\n", "('urlparse', 1, 'fragment', 394) = ('News', ('urlparse', 1))\n", "('urlparse', 1, 'url', 372) = ('ftp://freebsd.org/releases/5.8', ('', 0))\n", "('urlparse', 1, 'url', 372) = ('ftp://freebsd.org/releases/5.8', ('urlparse', 1))\n", "('urlsplit', 3, 'url', 437) = ('ftp://freebsd.org/releases/5.8', ('urlparse', 1))\n", "('urlsplit', 3, 'url', 437) = ('ftp://freebsd.org/releases/5.8', ('urlsplit', 3))\n", "('urlsplit', 3, 'url', 478) = ('//freebsd.org/releases/5.8', ('urlsplit', 3))\n", "('urlsplit', 3, 'scheme', 478) = ('ftp', ('urlsplit', 3))\n", "('_splitnetloc', 5, 'url', 411) = ('//freebsd.org/releases/5.8', ('urlsplit', 3))\n", "('_splitnetloc', 5, 'url', 411) = ('//freebsd.org/releases/5.8', ('_splitnetloc', 5))\n", "('urlsplit', 3, 'url', 481) = ('/releases/5.8', ('urlsplit', 3))\n", "('urlsplit', 3, 'netloc', 481) = ('freebsd.org', ('urlsplit', 3))\n", "('_checknetloc', 6, 'netloc', 419) = ('freebsd.org', ('urlsplit', 3))\n", "('_checknetloc', 6, 'netloc', 419) = ('freebsd.org', ('_checknetloc', 6))\n", "('urlparse', 1, 'url', 394) = ('/releases/5.8', ('urlparse', 1))\n", "('urlparse', 1, 'scheme', 394) = ('ftp', ('urlparse', 1))\n", "('urlparse', 1, 'netloc', 394) = ('freebsd.org', ('urlparse', 1))\n" ] } ], "source": [ "url_dts = []\n", "for inputstr in URLS_X:\n", "    clear_cache()\n", "    with Tracer(inputstr, files=['urllib/parse.py']) as tracer:\n", "        urlparse(tracer.my_input)\n", "    sm = ScopeTracker(tracer.my_input, tracer.trace)\n", "    for k, v in sm.my_assignments.defined_vars(formatted=False):\n", "        print(k, '=', repr(v))\n", "    dt = ScopeTreeMiner(\n", "        tracer.my_input,\n", "        sm.my_assignments.defined_vars(\n", "            formatted=False))\n", "    display_tree(dt.tree, graph_attr=lr_graph)\n", 
"    url_dts.append(dt)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Example 2: Recovering Inventory Parse Tree" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Next, we look at recovering the parse tree from `process_inventory()`, which failed last time." ] }, { "cell_type": "code", "execution_count": 187, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:01.148970Z", "iopub.status.busy": "2024-01-18T17:20:01.148828Z", "iopub.status.idle": "2024-01-18T17:20:01.573620Z", "shell.execute_reply": "2024-01-18T17:20:01.573255Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "process_inventory[1]:inventory@22 = ('1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture', ('', 0))\n", "process_inventory[1]:inventory@22 = ('1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture', ('process_inventory', 1))\n", "process_inventory[1]:vehicle@24 = ('1997,van,Ford,E350', ('process_inventory', 1))\n", "process_vehicle[2]:vehicle@29 = ('1997,van,Ford,E350', ('process_inventory', 1))\n", "process_vehicle[2]:vehicle@29 = ('1997,van,Ford,E350', ('process_vehicle', 2))\n", "process_vehicle[2]:year@30 = ('1997', ('process_vehicle', 2))\n", "process_vehicle[2]:kind@30 = ('van', ('process_vehicle', 2))\n", "process_vehicle[2]:company@30 = ('Ford', ('process_vehicle', 2))\n", "process_vehicle[2]:model@30 = ('E350', ('process_vehicle', 2))\n", "process_van[3]:year@40 = ('1997', ('process_vehicle', 2))\n", "process_van[3]:company@40 = ('Ford', ('process_vehicle', 2))\n", "process_van[3]:model@40 = ('E350', ('process_vehicle', 2))\n", "process_van[3]:year@40 = ('1997', ('process_van', 3))\n", "process_van[3]:company@40 = ('Ford', ('process_van', 3))\n", "process_van[3]:model@40 = ('E350', ('process_van', 3))\n", "process_inventory[1]:vehicle@24 = 
('2000,car,Mercury,Cougar', ('process_inventory', 1))\n", "process_vehicle[4]:vehicle@29 = ('2000,car,Mercury,Cougar', ('process_inventory', 1))\n", "process_vehicle[4]:vehicle@29 = ('2000,car,Mercury,Cougar', ('process_vehicle', 4))\n", "process_vehicle[4]:year@30 = ('2000', ('process_vehicle', 4))\n", "process_vehicle[4]:kind@30 = ('car', ('process_vehicle', 4))\n", "process_vehicle[4]:company@30 = ('Mercury', ('process_vehicle', 4))\n", "process_vehicle[4]:model@30 = ('Cougar', ('process_vehicle', 4))\n", "process_car[5]:year@49 = ('2000', ('process_vehicle', 4))\n", "process_car[5]:company@49 = ('Mercury', ('process_vehicle', 4))\n", "process_car[5]:model@49 = ('Cougar', ('process_vehicle', 4))\n", "process_car[5]:year@49 = ('2000', ('process_car', 5))\n", "process_car[5]:company@49 = ('Mercury', ('process_car', 5))\n", "process_car[5]:model@49 = ('Cougar', ('process_car', 5))\n", "process_inventory[1]:vehicle@24 = ('1999,car,Chevy,Venture', ('process_inventory', 1))\n", "process_vehicle[6]:vehicle@29 = ('1999,car,Chevy,Venture', ('process_inventory', 1))\n", "process_vehicle[6]:vehicle@29 = ('1999,car,Chevy,Venture', ('process_vehicle', 6))\n", "process_vehicle[6]:year@30 = ('1999', ('process_vehicle', 6))\n", "process_vehicle[6]:kind@30 = ('car', ('process_vehicle', 6))\n", "process_vehicle[6]:company@30 = ('Chevy', ('process_vehicle', 6))\n", "process_vehicle[6]:model@30 = ('Venture', ('process_vehicle', 6))\n", "process_car[7]:year@49 = ('1999', ('process_vehicle', 6))\n", "process_car[7]:company@49 = ('Chevy', ('process_vehicle', 6))\n", "process_car[7]:model@49 = ('Venture', ('process_vehicle', 6))\n", "process_car[7]:year@49 = ('1999', ('process_car', 7))\n", "process_car[7]:company@49 = ('Chevy', ('process_car', 7))\n", "process_car[7]:model@49 = ('Venture', ('process_car', 7))\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "0\n", "<start>\n", "\n", "\n", "\n", "1\n", "<process_inventory@22:inventory>\n", 
"\n", "\n", "\n", "0->1\n", "\n", "\n", "\n", "\n", "\n", "2\n", "<process_inventory@22:inventory>\n", "\n", "\n", "\n", "1->2\n", "\n", "\n", "\n", "\n", "\n", "3\n", "<process_inventory@24:vehicle>\n", "\n", "\n", "\n", "2->3\n", "\n", "\n", "\n", "\n", "\n", "23\n", "\\n (10)\n", "\n", "\n", "\n", "2->23\n", "\n", "\n", "\n", "\n", "\n", "24\n", "<process_inventory@24:vehicle>\n", "\n", "\n", "\n", "2->24\n", "\n", "\n", "\n", "\n", "\n", "44\n", "\\n (10)\n", "\n", "\n", "\n", "2->44\n", "\n", "\n", "\n", "\n", "\n", "45\n", "<process_inventory@24:vehicle>\n", "\n", "\n", "\n", "2->45\n", "\n", "\n", "\n", "\n", "\n", "4\n", "<process_vehicle@29:vehicle>\n", "\n", "\n", "\n", "3->4\n", "\n", "\n", "\n", "\n", "\n", "5\n", "<process_vehicle@29:vehicle>\n", "\n", "\n", "\n", "4->5\n", "\n", "\n", "\n", "\n", "\n", "6\n", "<process_vehicle@30:year>\n", "\n", "\n", "\n", "5->6\n", "\n", "\n", "\n", "\n", "\n", "10\n", ", (44)\n", "\n", "\n", "\n", "5->10\n", "\n", "\n", "\n", "\n", "\n", "11\n", "<process_vehicle@30:kind>\n", "\n", "\n", "\n", "5->11\n", "\n", "\n", "\n", "\n", "\n", "13\n", ", (44)\n", "\n", "\n", "\n", "5->13\n", "\n", "\n", "\n", "\n", "\n", "14\n", "<process_vehicle@30:company>\n", "\n", "\n", "\n", "5->14\n", "\n", "\n", "\n", "\n", "\n", "18\n", ", (44)\n", "\n", "\n", "\n", "5->18\n", "\n", "\n", "\n", "\n", "\n", "19\n", "<process_vehicle@30:model>\n", "\n", "\n", "\n", "5->19\n", "\n", "\n", "\n", "\n", "\n", "7\n", "<process_van@40:year>\n", "\n", "\n", "\n", "6->7\n", "\n", "\n", "\n", "\n", "\n", "8\n", "<process_van@40:year>\n", "\n", "\n", "\n", "7->8\n", "\n", "\n", "\n", "\n", "\n", "9\n", "1997\n", "\n", "\n", "\n", "8->9\n", "\n", "\n", "\n", "\n", "\n", "12\n", "van\n", "\n", "\n", "\n", "11->12\n", "\n", "\n", "\n", "\n", "\n", "15\n", "<process_van@40:company>\n", "\n", "\n", "\n", "14->15\n", "\n", "\n", "\n", "\n", "\n", "16\n", "<process_van@40:company>\n", "\n", "\n", "\n", "15->16\n", "\n", "\n", "\n", "\n", "\n", "17\n", 
"Ford\n", "\n", "\n", "\n", "16->17\n", "\n", "\n", "\n", "\n", "\n", "20\n", "<process_van@40:model>\n", "\n", "\n", "\n", "19->20\n", "\n", "\n", "\n", "\n", "\n", "21\n", "<process_van@40:model>\n", "\n", "\n", "\n", "20->21\n", "\n", "\n", "\n", "\n", "\n", "22\n", "E350\n", "\n", "\n", "\n", "21->22\n", "\n", "\n", "\n", "\n", "\n", "25\n", "<process_vehicle@29:vehicle>\n", "\n", "\n", "\n", "24->25\n", "\n", "\n", "\n", "\n", "\n", "26\n", "<process_vehicle@29:vehicle>\n", "\n", "\n", "\n", "25->26\n", "\n", "\n", "\n", "\n", "\n", "27\n", "<process_vehicle@30:year>\n", "\n", "\n", "\n", "26->27\n", "\n", "\n", "\n", "\n", "\n", "31\n", ", (44)\n", "\n", "\n", "\n", "26->31\n", "\n", "\n", "\n", "\n", "\n", "32\n", "<process_vehicle@30:kind>\n", "\n", "\n", "\n", "26->32\n", "\n", "\n", "\n", "\n", "\n", "34\n", ", (44)\n", "\n", "\n", "\n", "26->34\n", "\n", "\n", "\n", "\n", "\n", "35\n", "<process_vehicle@30:company>\n", "\n", "\n", "\n", "26->35\n", "\n", "\n", "\n", "\n", "\n", "39\n", ", (44)\n", "\n", "\n", "\n", "26->39\n", "\n", "\n", "\n", "\n", "\n", "40\n", "<process_vehicle@30:model>\n", "\n", "\n", "\n", "26->40\n", "\n", "\n", "\n", "\n", "\n", "28\n", "<process_car@49:year>\n", "\n", "\n", "\n", "27->28\n", "\n", "\n", "\n", "\n", "\n", "29\n", "<process_car@49:year>\n", "\n", "\n", "\n", "28->29\n", "\n", "\n", "\n", "\n", "\n", "30\n", "2000\n", "\n", "\n", "\n", "29->30\n", "\n", "\n", "\n", "\n", "\n", "33\n", "car\n", "\n", "\n", "\n", "32->33\n", "\n", "\n", "\n", "\n", "\n", "36\n", "<process_car@49:company>\n", "\n", "\n", "\n", "35->36\n", "\n", "\n", "\n", "\n", "\n", "37\n", "<process_car@49:company>\n", "\n", "\n", "\n", "36->37\n", "\n", "\n", "\n", "\n", "\n", "38\n", "Mercury\n", "\n", "\n", "\n", "37->38\n", "\n", "\n", "\n", "\n", "\n", "41\n", "<process_car@49:model>\n", "\n", "\n", "\n", "40->41\n", "\n", "\n", "\n", "\n", "\n", "42\n", "<process_car@49:model>\n", "\n", "\n", "\n", "41->42\n", "\n", "\n", "\n", "\n", "\n", 
"43\n", "Cougar\n", "\n", "\n", "\n", "42->43\n", "\n", "\n", "\n", "\n", "\n", "46\n", "<process_vehicle@29:vehicle>\n", "\n", "\n", "\n", "45->46\n", "\n", "\n", "\n", "\n", "\n", "47\n", "<process_vehicle@29:vehicle>\n", "\n", "\n", "\n", "46->47\n", "\n", "\n", "\n", "\n", "\n", "48\n", "<process_vehicle@30:year>\n", "\n", "\n", "\n", "47->48\n", "\n", "\n", "\n", "\n", "\n", "52\n", ", (44)\n", "\n", "\n", "\n", "47->52\n", "\n", "\n", "\n", "\n", "\n", "53\n", "<process_vehicle@30:kind>\n", "\n", "\n", "\n", "47->53\n", "\n", "\n", "\n", "\n", "\n", "55\n", ", (44)\n", "\n", "\n", "\n", "47->55\n", "\n", "\n", "\n", "\n", "\n", "56\n", "<process_vehicle@30:company>\n", "\n", "\n", "\n", "47->56\n", "\n", "\n", "\n", "\n", "\n", "60\n", ", (44)\n", "\n", "\n", "\n", "47->60\n", "\n", "\n", "\n", "\n", "\n", "61\n", "<process_vehicle@30:model>\n", "\n", "\n", "\n", "47->61\n", "\n", "\n", "\n", "\n", "\n", "49\n", "<process_car@49:year>\n", "\n", "\n", "\n", "48->49\n", "\n", "\n", "\n", "\n", "\n", "50\n", "<process_car@49:year>\n", "\n", "\n", "\n", "49->50\n", "\n", "\n", "\n", "\n", "\n", "51\n", "1999\n", "\n", "\n", "\n", "50->51\n", "\n", "\n", "\n", "\n", "\n", "54\n", "car\n", "\n", "\n", "\n", "53->54\n", "\n", "\n", "\n", "\n", "\n", "57\n", "<process_car@49:company>\n", "\n", "\n", "\n", "56->57\n", "\n", "\n", "\n", "\n", "\n", "58\n", "<process_car@49:company>\n", "\n", "\n", "\n", "57->58\n", "\n", "\n", "\n", "\n", "\n", "59\n", "Chevy\n", "\n", "\n", "\n", "58->59\n", "\n", "\n", "\n", "\n", "\n", "62\n", "<process_car@49:model>\n", "\n", "\n", "\n", "61->62\n", "\n", "\n", "\n", "\n", "\n", "63\n", "<process_car@49:model>\n", "\n", "\n", "\n", "62->63\n", "\n", "\n", "\n", "\n", "\n", "64\n", "Venture\n", "\n", "\n", "\n", "63->64\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 187, "metadata": {}, "output_type": "execute_result" } ], "source": [ "with Tracer(INVENTORY) as tracer:\n", " 
process_inventory(tracer.my_input)\n", "\n", "sm = ScopeTracker(tracer.my_input, tracer.trace)\n", "for k, v in sm.my_assignments.defined_vars():\n", " print(k, '=', repr(v))\n", "inventory_dt = ScopeTreeMiner(\n", " tracer.my_input,\n", " sm.my_assignments.defined_vars(\n", " formatted=False))\n", "display_tree(inventory_dt.tree, graph_attr=lr_graph)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The recovered parse tree seems reasonable.\n", "\n", "One thing to notice in Example 2 is that the three subtrees `vehicle[2:1]`, `vehicle[4:1]`, and `vehicle[6:1]` are quite alike. Next, we examine how this similarity can be exploited to generate a grammar directly." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Grammar Mining" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The method `tree_to_grammar()` is now redefined as follows, to account for the additional scope element in each tree node."
] }, { "cell_type": "code", "execution_count": 188, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:01.575522Z", "iopub.status.busy": "2024-01-18T17:20:01.575379Z", "iopub.status.idle": "2024-01-18T17:20:01.578572Z", "shell.execute_reply": "2024-01-18T17:20:01.578261Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class ScopedGrammarMiner(GrammarMiner):\n", "    def tree_to_grammar(self, tree):\n", "        # Each node is a (key, children, scope) triple\n", "        key, children, scope = tree\n", "        one_alt = [ckey for ckey, gchildren, cscope in children if ckey != key]\n", "        hsh = {key: [one_alt] if one_alt else []}\n", "        for child in children:\n", "            (ckey, _gc, _cscope) = child\n", "            if not is_nonterminal(ckey):\n", "                continue\n", "            # Merge the rules mined from the subtree into this grammar\n", "            chsh = self.tree_to_grammar(child)\n", "            for k in chsh:\n", "                if k not in hsh:\n", "                    hsh[k] = chsh[k]\n", "                else:\n", "                    hsh[k].extend(chsh[k])\n", "        return hsh" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The grammar is in canonical form and has to be converted with `readable()` before it can be displayed. First, the recovered grammar for the inventory."
] }, { "cell_type": "code", "execution_count": 189, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:01.580850Z", "iopub.status.busy": "2024-01-18T17:20:01.580697Z", "iopub.status.idle": "2024-01-18T17:20:01.599454Z", "shell.execute_reply": "2024-01-18T17:20:01.599143Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "start\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_inventory@22:inventory" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_inventory@22:inventory\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_inventory@24:vehicle\n", "\n", "\n", "process_inventory@24:vehicle\n", "\n", "\n", "process_inventory@24:vehicle" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_inventory@24:vehicle\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_vehicle@29:vehicle" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_vehicle@29:vehicle\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_vehicle@30:year\n", ",\n", "process_vehicle@30:kind\n", ",\n", "process_vehicle@30:company\n", ",\n", "process_vehicle@30:model" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_vehicle@30:year\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_car@49:year\n", "\n", "process_van@40:year" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_van@40:year\n" ] }, { "data": { "image/svg+xml": [ 
"\n", "\n", "\n", "\n", "\n", "\n", "1997" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_vehicle@30:kind\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "car\n", "\n", "van" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_vehicle@30:company\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_van@40:company\n", "\n", "process_car@49:company" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_van@40:company\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "Ford" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_vehicle@30:model\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_van@40:model\n", "\n", "process_car@49:model" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_van@40:model\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "E350" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_car@49:year\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "2000\n", "\n", "1999" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_car@49:company\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "Mercury\n", "\n", "Chevy" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ 
"process_car@49:model\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "Cougar\n", "\n", "Venture" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "si = ScopedGrammarMiner()\n", "si.add_tree(inventory_dt)\n", "syntax_diagram(readable(si.grammar))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The recovered grammar for URLs." ] }, { "cell_type": "code", "execution_count": 190, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:01.601255Z", "iopub.status.busy": "2024-01-18T17:20:01.601084Z", "iopub.status.idle": "2024-01-18T17:20:01.621381Z", "shell.execute_reply": "2024-01-18T17:20:01.620996Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "start\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "urlparse@372:url" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "urlparse@372:url\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "urlsplit@437:url" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "urlsplit@437:url\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "urlsplit@478:scheme\n", ":\n", "urlsplit@478:url" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "urlsplit@478:scheme\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "urlparse@394:scheme" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "urlparse@394:scheme\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "https\n", "\n", "http\n", "\n", "ftp" ], 
"text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "urlsplit@478:url\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "_splitnetloc@411:url" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "_splitnetloc@411:url\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "//\n", "urlsplit@481:netloc\n", "urlsplit@481:url\n", "\n", "//\n", "urlsplit@481:netloc\n", "/" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "urlsplit@481:netloc\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "_checknetloc@419:netloc" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "_checknetloc@419:netloc\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "urlparse@394:netloc" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "urlparse@394:netloc\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "www.fuzzingbook.org\n", "\n", "user:pass@www.google.com:80\n", "\n", "www.cispa.saarland:80\n", "\n", "freebsd.org" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "urlsplit@481:url\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "urlsplit@486:url\n", "#\n", "urlsplit@486:fragment\n", "\n", "/#\n", "urlsplit@486:fragment\n", "\n", "urlparse@394:url" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "urlsplit@486:url\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "/?\n", 
"urlsplit@488:query" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "urlsplit@488:query\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "urlparse@394:query" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "urlparse@394:query\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "q=path" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "urlsplit@486:fragment\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "urlparse@394:fragment" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "urlparse@394:fragment\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "ref\n", "\n", "News" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "urlparse@394:url\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "/releases/5.8" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "su = ScopedGrammarMiner()\n", "for t in url_dts:\n", " su.add_tree(t)\n", "syntax_diagram(readable(su.grammar))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "One might notice that the grammar is not entirely human-readable, with a number of single token definitions." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Hence, the last piece of the puzzle is the cleanup method `clean_grammar()`, which cleans up such definitions. 
The idea is to look for keys whose definition consists of exactly one other key: a single alternative containing a single nonterminal token." ] }, { "cell_type": "code", "execution_count": 191, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:01.623217Z", "iopub.status.busy": "2024-01-18T17:20:01.623126Z", "iopub.status.idle": "2024-01-18T17:20:01.625705Z", "shell.execute_reply": "2024-01-18T17:20:01.625454Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class ScopedGrammarMiner(ScopedGrammarMiner):\n", "    def get_replacements(self, grammar):\n", "        # Map each key defined by exactly one nonterminal to that nonterminal\n", "        replacements = {}\n", "        for k in grammar:\n", "            if k == START_SYMBOL:\n", "                continue\n", "            alts = grammar[k]\n", "            # Skip keys that do not have exactly one distinct alternative\n", "            if len(set([str(i) for i in alts])) != 1:\n", "                continue\n", "            rule = alts[0]\n", "            if len(rule) != 1:\n", "                continue\n", "            tok = rule[0]\n", "            if not is_nonterminal(tok):\n", "                continue\n", "            replacements[k] = tok\n", "        return replacements" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Once we have such a mapping, we iteratively replace each such key, wherever it is used, with the token that defines it. We repeat this until no replacements are left."
] }, { "cell_type": "code", "execution_count": 192, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:01.627691Z", "iopub.status.busy": "2024-01-18T17:20:01.627462Z", "iopub.status.idle": "2024-01-18T17:20:01.630708Z", "shell.execute_reply": "2024-01-18T17:20:01.630338Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class ScopedGrammarMiner(ScopedGrammarMiner):\n", "    def clean_grammar(self):\n", "        replacements = self.get_replacements(self.grammar)\n", "\n", "        while True:\n", "            # Keys that were substituted in this pass\n", "            changed = set()\n", "            for k in self.grammar:\n", "                if k in replacements:\n", "                    continue\n", "                new_alts = []\n", "                for alt in self.grammar[k]:\n", "                    new_alt = []\n", "                    for t in alt:\n", "                        if t in replacements:\n", "                            new_alt.append(replacements[t])\n", "                            changed.add(t)\n", "                        else:\n", "                            new_alt.append(t)\n", "                    new_alts.append(new_alt)\n", "                self.grammar[k] = new_alts\n", "            if not changed:\n", "                break\n", "            # Drop the definitions of keys that are no longer referenced\n", "            for k in changed:\n", "                self.grammar.pop(k, None)\n", "        return readable(self.grammar)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The method `clean_grammar()` is used as follows:" ] }, { "cell_type": "code", "execution_count": 193, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:01.632261Z", "iopub.status.busy": "2024-01-18T17:20:01.632172Z", "iopub.status.idle": "2024-01-18T17:20:01.649136Z", "shell.execute_reply": "2024-01-18T17:20:01.648851Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "start\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_inventory@22:inventory" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_inventory@22:inventory\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_vehicle@29:vehicle\n", "\n", "\n",
"process_vehicle@29:vehicle\n", "\n", "\n", "process_vehicle@29:vehicle" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_vehicle@29:vehicle\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_vehicle@30:year\n", ",\n", "process_vehicle@30:kind\n", ",\n", "process_vehicle@30:company\n", ",\n", "process_vehicle@30:model" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_vehicle@30:year\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_car@49:year\n", "\n", "process_van@40:year" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_van@40:year\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "1997" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_vehicle@30:kind\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "car\n", "\n", "van" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_vehicle@30:company\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_van@40:company\n", "\n", "process_car@49:company" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_van@40:company\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "Ford" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_vehicle@30:model\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", 
"process_van@40:model\n", "\n", "process_car@49:model" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_van@40:model\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "E350" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_car@49:year\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "2000\n", "\n", "1999" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_car@49:company\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "Mercury\n", "\n", "Chevy" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_car@49:model\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "Cougar\n", "\n", "Venture" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "si = ScopedGrammarMiner()\n", "si.add_tree(inventory_dt)\n", "syntax_diagram(readable(si.clean_grammar()))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We update the `update_grammar()` to use the right tracker and miner." 
] }, { "cell_type": "code", "execution_count": 194, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:01.650959Z", "iopub.status.busy": "2024-01-18T17:20:01.650695Z", "iopub.status.idle": "2024-01-18T17:20:01.653095Z", "shell.execute_reply": "2024-01-18T17:20:01.652850Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class ScopedGrammarMiner(ScopedGrammarMiner):\n", "    def update_grammar(self, inputstr, trace):\n", "        at = self.create_tracker(inputstr, trace)\n", "        dt = self.create_tree_miner(\n", "            inputstr, at.my_assignments.defined_vars(\n", "                formatted=False))\n", "        self.add_tree(dt)\n", "        return self.grammar\n", "\n", "    def create_tracker(self, *args):\n", "        return ScopeTracker(*args)\n", "\n", "    def create_tree_miner(self, *args):\n", "        return ScopeTreeMiner(*args)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The function `recover_grammar()` now uses the scoped miner and returns a cleaned grammar."
] }, { "cell_type": "code", "execution_count": 195, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:01.654929Z", "iopub.status.busy": "2024-01-18T17:20:01.654719Z", "iopub.status.idle": "2024-01-18T17:20:01.656855Z", "shell.execute_reply": "2024-01-18T17:20:01.656582Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def recover_grammar(fn, inputs, **kwargs):  # type: ignore\n", "    miner = ScopedGrammarMiner()\n", "    for inputstr in inputs:\n", "        with Tracer(inputstr, **kwargs) as tracer:\n", "            fn(tracer.my_input)\n", "        miner.update_grammar(tracer.my_input, tracer.trace)\n", "    return readable(miner.clean_grammar())" ] }, { "cell_type": "code", "execution_count": 196, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:01.658473Z", "iopub.status.busy": "2024-01-18T17:20:01.658288Z", "iopub.status.idle": "2024-01-18T17:20:01.816373Z", "shell.execute_reply": "2024-01-18T17:20:01.816072Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "url_grammar = recover_grammar(url_parse, URLS_X, files=['urllib/parse.py'])" ] }, { "cell_type": "code", "execution_count": 197, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:01.818117Z", "iopub.status.busy": "2024-01-18T17:20:01.818022Z", "iopub.status.idle": "2024-01-18T17:20:01.832052Z", "shell.execute_reply": "2024-01-18T17:20:01.831600Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "start\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "urlsplit@437:url" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "urlsplit@437:url\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "urlparse@394:scheme\n", ":\n", "_splitnetloc@411:url" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout",
"output_type": "stream", "text": [ "urlparse@394:scheme\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "https\n", "\n", "http\n", "\n", "ftp" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "_splitnetloc@411:url\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "//\n", "urlparse@394:netloc\n", "/\n", "\n", "//\n", "urlparse@394:netloc\n", "urlsplit@481:url" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "urlparse@394:netloc\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "www.fuzzingbook.org\n", "\n", "user:pass@www.google.com:80\n", "\n", "www.cispa.saarland:80\n", "\n", "freebsd.org" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "urlsplit@481:url\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "urlsplit@486:url\n", "#\n", "urlparse@394:fragment\n", "\n", "/#\n", "urlparse@394:fragment\n", "\n", "urlparse@394:url" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "urlsplit@486:url\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "/?\n", "urlparse@394:query" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "urlparse@394:query\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "q=path" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "urlparse@394:fragment\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "ref\n", "\n", "News" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" 
}, { "name": "stdout", "output_type": "stream", "text": [ "urlparse@394:url\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "/releases/5.8" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "syntax_diagram(url_grammar)" ] }, { "cell_type": "code", "execution_count": 198, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:01.833816Z", "iopub.status.busy": "2024-01-18T17:20:01.833737Z", "iopub.status.idle": "2024-01-18T17:20:01.836723Z", "shell.execute_reply": "2024-01-18T17:20:01.836440Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "http://freebsd.org/\n", "http://www.cispa.saarland:80/#News\n", "http://user:pass@www.google.com:80/\n", "ftp://www.cispa.saarland:80/#News\n", "http://www.fuzzingbook.org/\n", "https://freebsd.org/\n", "https://user:pass@www.google.com:80/\n", "ftp://www.fuzzingbook.org/\n", "https://www.cispa.saarland:80/#News\n", "https://user:pass@www.google.com:80/\n" ] } ], "source": [ "f = GrammarFuzzer(url_grammar)\n", "for _ in range(10):\n", "    print(f.fuzz())" ] }, { "cell_type": "code", "execution_count": 199, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:01.838383Z", "iopub.status.busy": "2024-01-18T17:20:01.838236Z", "iopub.status.idle": "2024-01-18T17:20:01.847272Z", "shell.execute_reply": "2024-01-18T17:20:01.846961Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "inventory_grammar = recover_grammar(process_inventory, [INVENTORY])" ] }, { "cell_type": "code", "execution_count": 200, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:01.848934Z", "iopub.status.busy": "2024-01-18T17:20:01.848844Z", "iopub.status.idle": "2024-01-18T17:20:01.864494Z", "shell.execute_reply": "2024-01-18T17:20:01.864237Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ 
"start\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_inventory@22:inventory" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_inventory@22:inventory\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_vehicle@29:vehicle\n", "\n", "\n", "process_vehicle@29:vehicle\n", "\n", "\n", "process_vehicle@29:vehicle" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_vehicle@29:vehicle\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_vehicle@30:year\n", ",\n", "process_vehicle@30:kind\n", ",\n", "process_vehicle@30:company\n", ",\n", "process_vehicle@30:model" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_vehicle@30:year\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_car@49:year\n", "\n", "process_van@40:year" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_van@40:year\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "1997" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_vehicle@30:kind\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "car\n", "\n", "van" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_vehicle@30:company\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_van@40:company\n", "\n", "process_car@49:company" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": 
"stdout", "output_type": "stream", "text": [ "process_van@40:company\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "Ford" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_vehicle@30:model\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_van@40:model\n", "\n", "process_car@49:model" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_van@40:model\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "E350" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_car@49:year\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "2000\n", "\n", "1999" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_car@49:company\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "Mercury\n", "\n", "Chevy" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_car@49:model\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "Cougar\n", "\n", "Venture" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "syntax_diagram(inventory_grammar)" ] }, { "cell_type": "code", "execution_count": 201, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:01.866188Z", "iopub.status.busy": "2024-01-18T17:20:01.866074Z", "iopub.status.idle": "2024-01-18T17:20:01.874464Z", "shell.execute_reply": "2024-01-18T17:20:01.874243Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "2000,car,Ford,E350\n", 
"2000,van,Ford,E350\n", "1997,car,Chevy,Cougar\n", "1997,van,Ford,E350\n", "1997,car,Ford,E350\n", "1999,car,Mercury,Venture\n", "2000,car,Ford,E350\n", "2000,van,Chevy,Venture\n", "1997,van,Mercury,Venture\n", "2000,car,Ford,E350\n", "1997,van,Ford,Venture\n", "2000,van,Mercury,E350\n", "1997,van,Chevy,Venture\n", "2000,van,Ford,Venture\n", "1999,car,Ford,E350\n", "1997,car,Chevy,Venture\n", "1997,car,Mercury,E350\n", "1997,van,Ford,E350\n", "1997,van,Ford,Cougar\n", "1999,van,Chevy,E350\n", "2000,car,Ford,E350\n", "1997,van,Chevy,Venture\n", "2000,van,Chevy,Venture\n", "2000,van,Ford,E350\n", "1997,van,Chevy,Cougar\n", "1999,van,Chevy,E350\n", "1999,car,Chevy,E350\n", "1999,car,Ford,E350\n", "2000,van,Ford,Venture\n", "1997,van,Ford,Venture\n" ] } ], "source": [ "f = GrammarFuzzer(inventory_grammar)\n", "for _ in range(10):\n", "    print(f.fuzz())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We see how tracking scope helps us to extract an even more precise grammar." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Notice that we use *string* inclusion testing to determine whether a particular string fragment came from the original input string. While this may seem rather error-prone compared to dynamic tainting, we note that numerous tracing tools such as `dtrace()` and `ptrace()` allow one to obtain the information we seek directly from the execution of binaries on different platforms. In contrast, methods for obtaining dynamic taints almost always involve instrumenting the binaries before they can be used. Hence, string inclusion can be applied more generally than dynamic tainting approaches. Further, dynamic taints are often lost due to implicit transmission, or at the boundary between *Python* and *C* code. String inclusion has no such problems. 
Hence, our approach can often obtain better results than relying on dynamic tainting." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Synopsis\n", "\n", "This chapter provides a number of classes to mine input grammars from existing programs. The function `recover_grammar()` could be the easiest to use. It takes a function and a set of inputs, and returns a grammar that describes its input language." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We apply `recover_grammar()` on a `url_parse()` function that takes and decomposes URLs:" ] }, { "cell_type": "code", "execution_count": 202, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:01.876191Z", "iopub.status.busy": "2024-01-18T17:20:01.876046Z", "iopub.status.idle": "2024-01-18T17:20:01.877886Z", "shell.execute_reply": "2024-01-18T17:20:01.877587Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "url_parse('https://www.fuzzingbook.org/')" ] }, { "cell_type": "code", "execution_count": 203, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:01.879689Z", "iopub.status.busy": "2024-01-18T17:20:01.879541Z", "iopub.status.idle": "2024-01-18T17:20:01.881714Z", "shell.execute_reply": "2024-01-18T17:20:01.881470Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['http://user:pass@www.google.com:80/?q=path#ref',\n", " 'https://www.cispa.saarland:80/',\n", " 'http://www.fuzzingbook.org/#News']" ] }, "execution_count": 203, "metadata": {}, "output_type": "execute_result" } ], "source": [ "URLS" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We extract the input grammar for `url_parse()` using `recover_grammar()`:" ] }, { "cell_type": "code", "execution_count": 204, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:01.883350Z", 
"iopub.status.busy": "2024-01-18T17:20:01.883221Z", "iopub.status.idle": "2024-01-18T17:20:01.993056Z", "shell.execute_reply": "2024-01-18T17:20:01.992759Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "{'<start>': ['<urlsplit@437:url>'],\n", " '<urlsplit@437:url>': ['<urlparse@394:scheme>:<_splitnetloc@411:url>'],\n", " '<urlparse@394:scheme>': ['https', 'http'],\n", " '<_splitnetloc@411:url>': ['//<urlparse@394:netloc>/',\n", "                            '//<urlparse@394:netloc><urlsplit@481:url>'],\n", " '<urlparse@394:netloc>': ['user:pass@www.google.com:80',\n", "                           'www.fuzzingbook.org',\n", "                           'www.cispa.saarland:80'],\n", " '<urlsplit@481:url>': ['<urlsplit@486:url>#<urlparse@394:fragment>',\n", "                        '/#<urlparse@394:fragment>'],\n", " '<urlsplit@486:url>': ['/?<urlparse@394:query>'],\n", " '<urlparse@394:query>': ['q=path'],\n", " '<urlparse@394:fragment>': ['ref', 'News']}" ] }, "execution_count": 204, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grammar = recover_grammar(url_parse, URLS, files=['urllib/parse.py'])\n", "grammar" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The names of nonterminals are a bit technical; but the grammar nicely represents the structure of the input; for instance, the different schemes (`\"http\"`, `\"https\"`) are all identified:" ] }, { "cell_type": "code", "execution_count": 205, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:01.994762Z", "iopub.status.busy": "2024-01-18T17:20:01.994647Z", "iopub.status.idle": "2024-01-18T17:20:02.005887Z", "shell.execute_reply": "2024-01-18T17:20:02.005551Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "start\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "urlsplit@437:url" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "urlsplit@437:url\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "urlparse@394:scheme\n", ":\n", "_splitnetloc@411:url" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "urlparse@394:scheme\n" ] 
}, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "https\n", "\n", "http" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "_splitnetloc@411:url\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "//\n", "urlparse@394:netloc\n", "/\n", "\n", "//\n", "urlparse@394:netloc\n", "urlsplit@481:url" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "urlparse@394:netloc\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "user:pass@www.google.com:80\n", "\n", "www.fuzzingbook.org\n", "\n", "www.cispa.saarland:80" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "urlsplit@481:url\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "urlsplit@486:url\n", "#\n", "urlparse@394:fragment\n", "\n", "/#\n", "urlparse@394:fragment" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "urlsplit@486:url\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "/?\n", "urlparse@394:query" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "urlparse@394:query\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "q=path" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "urlparse@394:fragment\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "ref\n", "\n", "News" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "syntax_diagram(grammar)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" 
} }, "source": [ "The grammar can be immediately used for fuzzing, producing arbitrary combinations of input elements, which are all syntactically valid." ] }, { "cell_type": "code", "execution_count": 206, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:02.007570Z", "iopub.status.busy": "2024-01-18T17:20:02.007486Z", "iopub.status.idle": "2024-01-18T17:20:02.314819Z", "shell.execute_reply": "2024-01-18T17:20:02.314516Z" }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from GrammarCoverageFuzzer import GrammarCoverageFuzzer" ] }, { "cell_type": "code", "execution_count": 207, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:02.316741Z", "iopub.status.busy": "2024-01-18T17:20:02.316571Z", "iopub.status.idle": "2024-01-18T17:20:02.320896Z", "shell.execute_reply": "2024-01-18T17:20:02.320514Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "['https://www.fuzzingbook.org/',\n", " 'http://user:pass@www.google.com:80/#News',\n", " 'http://www.cispa.saarland:80/?q=path#ref',\n", " 'https://user:pass@www.google.com:80/#ref',\n", " 'https://www.cispa.saarland:80/']" ] }, "execution_count": 207, "metadata": {}, "output_type": "execute_result" } ], "source": [ "fuzzer = GrammarCoverageFuzzer(grammar)\n", "[fuzzer.fuzz() for i in range(5)]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Being able to automatically extract a grammar and to use this grammar for fuzzing makes for very effective test generation with a minimum of manual work." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Lessons Learned\n", "\n", "* Given a set of sample inputs for a program, we can learn an input grammar by examining variable values during execution if the program relies on handwritten parsers.\n", "* Simple string inclusion checks are sufficient to obtain reasonably accurate grammars from real-world programs.\n", "* The resulting grammars can be directly used for fuzzing, and can have a multiplier effect on any samples you have." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Next Steps\n", "\n", "* Learn how to use [information flow](InformationFlow.ipynb) to further improve the mapping of inputs to states." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Background\n", "\n", "Recovering the language from a _set of samples_ (i.e., not taking into account a possible program that might process them) is a well-researched topic. The excellent reference by de la Higuera \\cite{higuera2010grammatical} covers all the classical approaches. The current state of the art in black-box grammar mining is described by Clark \\cite{clark2013learning}.\n", "\n", "Learning an input language from a _program_, with or without samples, is still an emerging topic, despite its potential for fuzzing. The pioneering work in this area was done by Lin et al. \\cite{Lin2008}, who invented a way to retrieve parse trees from top-down and bottom-up parsers. The approach described in this chapter is based directly on the AUTOGRAM work of Höschele et al. \\cite{Hoschele2017}." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Exercises" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Exercise 1: Flattening complex objects\n", "\n", "Our grammar miners only check for string fragments. However, programs often pass containers or custom objects that contain input fragments. For example, consider the following plausible modification of our inventory processor, which uses a custom `Vehicle` object to carry fragments." ] }, { "cell_type": "code", "execution_count": 208, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:02.322715Z", "iopub.status.busy": "2024-01-18T17:20:02.322619Z", "iopub.status.idle": "2024-01-18T17:20:02.324772Z", "shell.execute_reply": "2024-01-18T17:20:02.324481Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class Vehicle:\n", "    def __init__(self, vehicle: str):\n", "        year, kind, company, model, *_ = vehicle.split(',')\n", "        self.year, self.kind, self.company, self.model = year, kind, company, model" ] }, { "cell_type": "code", "execution_count": 209, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:02.326281Z", "iopub.status.busy": "2024-01-18T17:20:02.326162Z", "iopub.status.idle": "2024-01-18T17:20:02.328268Z", "shell.execute_reply": "2024-01-18T17:20:02.327985Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "def process_inventory_with_obj(inventory: str) -> str:\n", "    res = []\n", "    for vehicle in inventory.split('\\n'):\n", "        ret = process_vehicle(vehicle)\n", "        res.extend(ret)\n", "\n", "    return '\\n'.join(res)" ] }, { "cell_type": "code", "execution_count": 210, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:02.329826Z", "iopub.status.busy": "2024-01-18T17:20:02.329716Z", "iopub.status.idle": "2024-01-18T17:20:02.331698Z", "shell.execute_reply": "2024-01-18T17:20:02.331460Z" }, 
"slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "def process_vehicle_with_obj(vehicle: str) -> List[str]:\n", "    v = Vehicle(vehicle)\n", "    if v.kind == 'van':\n", "        return process_van_with_obj(v)\n", "\n", "    elif v.kind == 'car':\n", "        return process_car_with_obj(v)\n", "\n", "    else:\n", "        raise Exception('Invalid entry')" ] }, { "cell_type": "code", "execution_count": 211, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:02.333278Z", "iopub.status.busy": "2024-01-18T17:20:02.333172Z", "iopub.status.idle": "2024-01-18T17:20:02.335165Z", "shell.execute_reply": "2024-01-18T17:20:02.334931Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "def process_van_with_obj(vehicle: Vehicle) -> List[str]:\n", "    res = [\n", "        \"We have a %s %s van from %s vintage.\" % (vehicle.company,\n", "                                                  vehicle.model, vehicle.year)\n", "    ]\n", "    iyear = int(vehicle.year)\n", "    if iyear > 2010:\n", "        res.append(\"It is a recent model!\")\n", "    else:\n", "        res.append(\"It is an old but reliable model!\")\n", "    return res" ] }, { "cell_type": "code", "execution_count": 212, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:02.336583Z", "iopub.status.busy": "2024-01-18T17:20:02.336484Z", "iopub.status.idle": "2024-01-18T17:20:02.338431Z", "shell.execute_reply": "2024-01-18T17:20:02.338176Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "def process_car_with_obj(vehicle: Vehicle) -> List[str]:\n", "    res = [\n", "        \"We have a %s %s car from %s vintage.\" % (vehicle.company,\n", "                                                  vehicle.model, vehicle.year)\n", "    ]\n", "    iyear = int(vehicle.year)\n", "    if iyear > 2016:\n", "        res.append(\"It is a recent model!\")\n", "    else:\n", "        res.append(\"It is an old but reliable model!\")\n", "    return res" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We recover the grammar as before." 
] }, { "cell_type": "code", "execution_count": 213, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:02.339911Z", "iopub.status.busy": "2024-01-18T17:20:02.339812Z", "iopub.status.idle": "2024-01-18T17:20:02.414973Z", "shell.execute_reply": "2024-01-18T17:20:02.414643Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "vehicle_grammar = recover_grammar(\n", "    process_inventory_with_obj,\n", "    [INVENTORY],\n", "    methods=INVENTORY_METHODS)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The new vehicle grammar lacks detail, especially regarding the different models and companies for vans and cars." ] }, { "cell_type": "code", "execution_count": 214, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:02.416973Z", "iopub.status.busy": "2024-01-18T17:20:02.416853Z", "iopub.status.idle": "2024-01-18T17:20:02.431874Z", "shell.execute_reply": "2024-01-18T17:20:02.431613Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "start\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_vehicle@29:vehicle\n", "\n", "\n", "process_vehicle@29:vehicle\n", "\n", "\n", "process_vehicle@29:vehicle" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_vehicle@29:vehicle\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_vehicle@30:year\n", ",\n", "process_vehicle@30:kind\n", ",\n", "process_vehicle@30:company\n", ",\n", "process_vehicle@30:model" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_vehicle@30:year\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_car@49:year\n", "\n", "process_van@40:year" ], "text/plain": [ "" ] 
}, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_van@40:year\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "1997" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_vehicle@30:kind\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "car\n", "\n", "van" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_vehicle@30:company\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_van@40:company\n", "\n", "process_car@49:company" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_van@40:company\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "Ford" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_vehicle@30:model\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_van@40:model\n", "\n", "process_car@49:model" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_van@40:model\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "E350" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_car@49:year\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "2000\n", "\n", "1999" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_car@49:company\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", 
"Mercury\n", "\n", "Chevy" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_car@49:model\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "Cougar\n", "\n", "Venture" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "syntax_diagram(vehicle_grammar)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" }, "solution2": "hidden", "solution2_first": true }, "source": [ "The problem is that we are looking specifically for string objects that contain fragments of the input string during tracing. Can you modify our grammar miner to correctly account for complex objects too?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "source": [ "**Solution.**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "source": [ "The problem can be understood if we execute the tracer under verbose logging." 
] }, { "cell_type": "code", "execution_count": 215, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:02.433872Z", "iopub.status.busy": "2024-01-18T17:20:02.433718Z", "iopub.status.idle": "2024-01-18T17:20:02.441104Z", "shell.execute_reply": "2024-01-18T17:20:02.440892Z" }, "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-> /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:22:process_inventory(inventory)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:23:process_inventory(inventory)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:24:process_inventory(inventory)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:25:process_inventory(inventory)\n", "-> /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:29:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:30:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:31:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:32:process_vehicle(vehicle)\n", "-> /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:40:process_van(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:41:process_van(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:42:process_van(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:43:process_van(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:46:process_van(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:47:process_van(year,company,model)\n", "<- /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:47:process_van(year,company,model)\n", "<- /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:32:process_vehicle(vehicle)\n", " 
/Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:26:process_inventory(inventory)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:24:process_inventory(inventory)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:25:process_inventory(inventory)\n", "-> /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:29:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:30:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:31:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:34:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:35:process_vehicle(vehicle)\n", "-> /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:49:process_car(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:50:process_car(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:51:process_car(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:52:process_car(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:55:process_car(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:56:process_car(year,company,model)\n", "<- /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:56:process_car(year,company,model)\n", "<- /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:35:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:26:process_inventory(inventory)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:24:process_inventory(inventory)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:25:process_inventory(inventory)\n", "-> /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:29:process_vehicle(vehicle)\n", " 
/Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:30:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:31:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:34:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:35:process_vehicle(vehicle)\n", "-> /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:49:process_car(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:50:process_car(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:51:process_car(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:52:process_car(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:55:process_car(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:56:process_car(year,company,model)\n", "<- /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:56:process_car(year,company,model)\n", "<- /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:35:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:26:process_inventory(inventory)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:24:process_inventory(inventory)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:27:process_inventory(inventory)\n", "<- /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:27:process_inventory(inventory)\n", "\n", "Traced values:\n", "('call', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:22:process_inventory(inventory), {'inventory': '1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:23:process_inventory(inventory), {'inventory': '1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture'})\n", "('line', None, 
/Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:24:process_inventory(inventory), {'inventory': '1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:25:process_inventory(inventory), {'inventory': '1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture', 'vehicle': '1997,van,Ford,E350'})\n", "('call', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:29:process_vehicle(vehicle), {'vehicle': '1997,van,Ford,E350'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:30:process_vehicle(vehicle), {'vehicle': '1997,van,Ford,E350'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:31:process_vehicle(vehicle), {'vehicle': '1997,van,Ford,E350', 'year': '1997', 'kind': 'van', 'company': 'Ford', 'model': 'E350'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:32:process_vehicle(vehicle), {'vehicle': '1997,van,Ford,E350', 'year': '1997', 'kind': 'van', 'company': 'Ford', 'model': 'E350'})\n", "('call', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:40:process_van(year,company,model), {'year': '1997', 'company': 'Ford', 'model': 'E350'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:41:process_van(year,company,model), {'year': '1997', 'company': 'Ford', 'model': 'E350'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:42:process_van(year,company,model), {'year': '1997', 'company': 'Ford', 'model': 'E350'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:43:process_van(year,company,model), {'year': '1997', 'company': 'Ford', 'model': 'E350'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:46:process_van(year,company,model), {'year': '1997', 'company': 'Ford', 'model': 'E350'})\n", "('line', None, 
/Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:47:process_van(year,company,model), {'year': '1997', 'company': 'Ford', 'model': 'E350'})\n", "('return', ['We have a Ford E350 van from 1997 vintage.', 'It is an old but reliable model!'], /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:47:process_van(year,company,model), {'year': '1997', 'company': 'Ford', 'model': 'E350'})\n", "('return', ['We have a Ford E350 van from 1997 vintage.', 'It is an old but reliable model!'], /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:32:process_vehicle(vehicle), {'vehicle': '1997,van,Ford,E350', 'year': '1997', 'kind': 'van', 'company': 'Ford', 'model': 'E350'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:26:process_inventory(inventory), {'inventory': '1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture', 'vehicle': '1997,van,Ford,E350'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:24:process_inventory(inventory), {'inventory': '1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture', 'vehicle': '1997,van,Ford,E350'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:25:process_inventory(inventory), {'inventory': '1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture', 'vehicle': '2000,car,Mercury,Cougar'})\n", "('call', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:29:process_vehicle(vehicle), {'vehicle': '2000,car,Mercury,Cougar'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:30:process_vehicle(vehicle), {'vehicle': '2000,car,Mercury,Cougar'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:31:process_vehicle(vehicle), {'vehicle': '2000,car,Mercury,Cougar', 'year': '2000', 'kind': 'car', 'company': 'Mercury', 'model': 'Cougar'})\n", "('line', None, 
/Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:34:process_vehicle(vehicle), {'vehicle': '2000,car,Mercury,Cougar', 'year': '2000', 'kind': 'car', 'company': 'Mercury', 'model': 'Cougar'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:35:process_vehicle(vehicle), {'vehicle': '2000,car,Mercury,Cougar', 'year': '2000', 'kind': 'car', 'company': 'Mercury', 'model': 'Cougar'})\n", "('call', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:49:process_car(year,company,model), {'year': '2000', 'company': 'Mercury', 'model': 'Cougar'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:50:process_car(year,company,model), {'year': '2000', 'company': 'Mercury', 'model': 'Cougar'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:51:process_car(year,company,model), {'year': '2000', 'company': 'Mercury', 'model': 'Cougar'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:52:process_car(year,company,model), {'year': '2000', 'company': 'Mercury', 'model': 'Cougar'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:55:process_car(year,company,model), {'year': '2000', 'company': 'Mercury', 'model': 'Cougar'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:56:process_car(year,company,model), {'year': '2000', 'company': 'Mercury', 'model': 'Cougar'})\n", "('return', ['We have a Mercury Cougar car from 2000 vintage.', 'It is an old but reliable model!'], /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:56:process_car(year,company,model), {'year': '2000', 'company': 'Mercury', 'model': 'Cougar'})\n", "('return', ['We have a Mercury Cougar car from 2000 vintage.', 'It is an old but reliable model!'], /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:35:process_vehicle(vehicle), {'vehicle': '2000,car,Mercury,Cougar', 'year': '2000', 'kind': 'car', 'company': 'Mercury', 
'model': 'Cougar'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:26:process_inventory(inventory), {'inventory': '1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture', 'vehicle': '2000,car,Mercury,Cougar'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:24:process_inventory(inventory), {'inventory': '1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture', 'vehicle': '2000,car,Mercury,Cougar'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:25:process_inventory(inventory), {'inventory': '1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture', 'vehicle': '1999,car,Chevy,Venture'})\n", "('call', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:29:process_vehicle(vehicle), {'vehicle': '1999,car,Chevy,Venture'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:30:process_vehicle(vehicle), {'vehicle': '1999,car,Chevy,Venture'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:31:process_vehicle(vehicle), {'vehicle': '1999,car,Chevy,Venture', 'year': '1999', 'kind': 'car', 'company': 'Chevy', 'model': 'Venture'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:34:process_vehicle(vehicle), {'vehicle': '1999,car,Chevy,Venture', 'year': '1999', 'kind': 'car', 'company': 'Chevy', 'model': 'Venture'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:35:process_vehicle(vehicle), {'vehicle': '1999,car,Chevy,Venture', 'year': '1999', 'kind': 'car', 'company': 'Chevy', 'model': 'Venture'})\n", "('call', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:49:process_car(year,company,model), {'year': '1999', 'company': 'Chevy', 'model': 'Venture'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:50:process_car(year,company,model), {'year': '1999', 'company': 'Chevy', 
'model': 'Venture'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:51:process_car(year,company,model), {'year': '1999', 'company': 'Chevy', 'model': 'Venture'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:52:process_car(year,company,model), {'year': '1999', 'company': 'Chevy', 'model': 'Venture'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:55:process_car(year,company,model), {'year': '1999', 'company': 'Chevy', 'model': 'Venture'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:56:process_car(year,company,model), {'year': '1999', 'company': 'Chevy', 'model': 'Venture'})\n", "('return', ['We have a Chevy Venture car from 1999 vintage.', 'It is an old but reliable model!'], /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:56:process_car(year,company,model), {'year': '1999', 'company': 'Chevy', 'model': 'Venture'})\n", "('return', ['We have a Chevy Venture car from 1999 vintage.', 'It is an old but reliable model!'], /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:35:process_vehicle(vehicle), {'vehicle': '1999,car,Chevy,Venture', 'year': '1999', 'kind': 'car', 'company': 'Chevy', 'model': 'Venture'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:26:process_inventory(inventory), {'inventory': '1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture', 'vehicle': '1999,car,Chevy,Venture'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:24:process_inventory(inventory), {'inventory': '1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture', 'vehicle': '1999,car,Chevy,Venture'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:27:process_inventory(inventory), {'inventory': '1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture', 'vehicle': '1999,car,Chevy,Venture'})\n", "('return', 'We have 
a Ford E350 van from 1997 vintage.\\nIt is an old but reliable model!\\nWe have a Mercury Cougar car from 2000 vintage.\\nIt is an old but reliable model!\\nWe have a Chevy Venture car from 1999 vintage.\\nIt is an old but reliable model!', /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:27:process_inventory(inventory), {'inventory': '1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture', 'vehicle': '1999,car,Chevy,Venture'})\n" ] } ], "source": [ "with Tracer(INVENTORY, methods=INVENTORY_METHODS, log=True) as tracer:\n", "    process_inventory(tracer.my_input)\n", "print()\n", "print('Traced values:')\n", "for t in tracer.trace:\n", "    print(t)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "source": [ "You can see that we lose track of string fragments as soon as they are incorporated into the `Vehicle` object. The way out is to trace these variables separately." ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "source": [ "For that, we develop the `flatten()` function which, given any custom complex object and its key, returns a list of flattened (_key_, _value_) pairs that correspond to the object passed in.\n", "\n", "The `MAX_DEPTH` parameter controls the maximum flattening depth." 
] }, { "cell_type": "code", "execution_count": 216, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:02.442940Z", "iopub.status.busy": "2024-01-18T17:20:02.442764Z", "iopub.status.idle": "2024-01-18T17:20:02.444523Z", "shell.execute_reply": "2024-01-18T17:20:02.444240Z" }, "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "outputs": [], "source": [ "MAX_DEPTH = 10" ] }, { "cell_type": "code", "execution_count": 217, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:02.446242Z", "iopub.status.busy": "2024-01-18T17:20:02.446035Z", "iopub.status.idle": "2024-01-18T17:20:02.447775Z", "shell.execute_reply": "2024-01-18T17:20:02.447550Z" }, "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "outputs": [], "source": [ "def set_flatten_depth(depth):\n", "    global MAX_DEPTH\n", "    MAX_DEPTH = depth" ] }, { "cell_type": "code", "execution_count": 218, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:02.449358Z", "iopub.status.busy": "2024-01-18T17:20:02.449163Z", "iopub.status.idle": "2024-01-18T17:20:02.452622Z", "shell.execute_reply": "2024-01-18T17:20:02.452371Z" }, "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "outputs": [], "source": [ "def flatten(key, val, depth=None):\n", "    # Look up MAX_DEPTH at call time so that set_flatten_depth() takes effect\n", "    if depth is None:\n", "        depth = MAX_DEPTH\n", "    if depth <= 0:\n", "        return [(key, val)]\n", "    if isinstance(val, (int, float, complex, str, bytes, bytearray)):\n", "        return [(key, val)]\n", "    elif isinstance(val, (set, frozenset, list, tuple, range)):\n", "        # Flatten each element; nested keys are strings such as \"0.1\"\n", "        values = [e for i, elt in enumerate(val) for e in flatten(i, elt, depth-1)]\n", "        return [(\"%s.%s\" % (key, i), v) for i, v in values]\n", "    elif isinstance(val, dict):\n", "        values = [e for k, elt in val.items() for e in flatten(k, elt, depth-1)]\n", "        return [(\"%s.%s\" % (key, k), v) for k, v in values]\n", "    elif hasattr(val, '__dict__'):\n", "        values = [e for k, elt in val.__dict__.items()\n", 
"                  for e in flatten(k, elt, depth-1)]\n", "        return [(\"%s.%s\" % (key, k), v) for k, v in values]\n", "    else:\n", "        return [(key, val)]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "source": [ "Next, we hook `flatten()` into the `Context` class so that the variables we obtain are flattened." ] }, { "cell_type": "code", "execution_count": 219, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:02.454150Z", "iopub.status.busy": "2024-01-18T17:20:02.454075Z", "iopub.status.idle": "2024-01-18T17:20:02.456516Z", "shell.execute_reply": "2024-01-18T17:20:02.456274Z" }, "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "outputs": [], "source": [ "class Context(Context):\n", "    def extract_vars(self, frame):\n", "        vals = inspect.getargvalues(frame).locals\n", "        return {k1: v1 for k, v in vals.items() for k1, v1 in flatten(k, v)}\n", "\n", "    def parameters(self, all_vars):\n", "        def check_param(k):\n", "            return any(k.startswith(p) for p in self.parameter_names)\n", "        return {k: v for k, v in all_vars.items() if check_param(k)}\n", "\n", "    def qualified(self, all_vars):\n", "        return {\"%s:%s\" % (self.method, k): v for k, v in all_vars.items()}" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "source": [ "With this change, we have the following trace output." 
] }, { "cell_type": "code", "execution_count": 220, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:02.458070Z", "iopub.status.busy": "2024-01-18T17:20:02.457988Z", "iopub.status.idle": "2024-01-18T17:20:02.465862Z", "shell.execute_reply": "2024-01-18T17:20:02.465627Z" }, "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-> /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:22:process_inventory(inventory)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:23:process_inventory(inventory)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:24:process_inventory(inventory)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:25:process_inventory(inventory)\n", "-> /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:29:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:30:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:31:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:32:process_vehicle(vehicle)\n", "-> /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:40:process_van(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:41:process_van(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:42:process_van(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:43:process_van(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:46:process_van(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:47:process_van(year,company,model)\n", "<- /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:47:process_van(year,company,model)\n", "<- /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:32:process_vehicle(vehicle)\n", " 
/Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:26:process_inventory(inventory)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:24:process_inventory(inventory)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:25:process_inventory(inventory)\n", "-> /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:29:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:30:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:31:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:34:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:35:process_vehicle(vehicle)\n", "-> /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:49:process_car(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:50:process_car(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:51:process_car(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:52:process_car(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:55:process_car(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:56:process_car(year,company,model)\n", "<- /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:56:process_car(year,company,model)\n", "<- /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:35:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:26:process_inventory(inventory)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:24:process_inventory(inventory)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:25:process_inventory(inventory)\n", "-> /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:29:process_vehicle(vehicle)\n", " 
/Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:30:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:31:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:34:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:35:process_vehicle(vehicle)\n", "-> /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:49:process_car(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:50:process_car(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:51:process_car(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:52:process_car(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:55:process_car(year,company,model)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:56:process_car(year,company,model)\n", "<- /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:56:process_car(year,company,model)\n", "<- /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:35:process_vehicle(vehicle)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:26:process_inventory(inventory)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:24:process_inventory(inventory)\n", " /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:27:process_inventory(inventory)\n", "<- /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:27:process_inventory(inventory)\n", "\n", "Traced values:\n", "('call', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:22:process_inventory(inventory), {'inventory': '1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:23:process_inventory(inventory), {'inventory': '1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture'})\n", "('line', None, 
/Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:24:process_inventory(inventory), {'inventory': '1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:25:process_inventory(inventory), {'inventory': '1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture', 'vehicle': '1997,van,Ford,E350'})\n", "('call', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:29:process_vehicle(vehicle), {'vehicle': '1997,van,Ford,E350'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:30:process_vehicle(vehicle), {'vehicle': '1997,van,Ford,E350'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:31:process_vehicle(vehicle), {'vehicle': '1997,van,Ford,E350', 'year': '1997', 'kind': 'van', 'company': 'Ford', 'model': 'E350'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:32:process_vehicle(vehicle), {'vehicle': '1997,van,Ford,E350', 'year': '1997', 'kind': 'van', 'company': 'Ford', 'model': 'E350'})\n", "('call', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:40:process_van(year,company,model), {'year': '1997', 'company': 'Ford', 'model': 'E350'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:41:process_van(year,company,model), {'year': '1997', 'company': 'Ford', 'model': 'E350'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:42:process_van(year,company,model), {'year': '1997', 'company': 'Ford', 'model': 'E350'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:43:process_van(year,company,model), {'year': '1997', 'company': 'Ford', 'model': 'E350'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:46:process_van(year,company,model), {'year': '1997', 'company': 'Ford', 'model': 'E350'})\n", "('line', None, 
/Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:47:process_van(year,company,model), {'year': '1997', 'company': 'Ford', 'model': 'E350'})\n", "('return', ['We have a Ford E350 van from 1997 vintage.', 'It is an old but reliable model!'], /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:47:process_van(year,company,model), {'year': '1997', 'company': 'Ford', 'model': 'E350'})\n", "('return', ['We have a Ford E350 van from 1997 vintage.', 'It is an old but reliable model!'], /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:32:process_vehicle(vehicle), {'vehicle': '1997,van,Ford,E350', 'year': '1997', 'kind': 'van', 'company': 'Ford', 'model': 'E350'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:26:process_inventory(inventory), {'inventory': '1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture', 'vehicle': '1997,van,Ford,E350'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:24:process_inventory(inventory), {'inventory': '1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture', 'vehicle': '1997,van,Ford,E350'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:25:process_inventory(inventory), {'inventory': '1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture', 'vehicle': '2000,car,Mercury,Cougar'})\n", "('call', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:29:process_vehicle(vehicle), {'vehicle': '2000,car,Mercury,Cougar'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:30:process_vehicle(vehicle), {'vehicle': '2000,car,Mercury,Cougar'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:31:process_vehicle(vehicle), {'vehicle': '2000,car,Mercury,Cougar', 'year': '2000', 'kind': 'car', 'company': 'Mercury', 'model': 'Cougar'})\n", "('line', None, 
/Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:34:process_vehicle(vehicle), {'vehicle': '2000,car,Mercury,Cougar', 'year': '2000', 'kind': 'car', 'company': 'Mercury', 'model': 'Cougar'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:35:process_vehicle(vehicle), {'vehicle': '2000,car,Mercury,Cougar', 'year': '2000', 'kind': 'car', 'company': 'Mercury', 'model': 'Cougar'})\n", "('call', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:49:process_car(year,company,model), {'year': '2000', 'company': 'Mercury', 'model': 'Cougar'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:50:process_car(year,company,model), {'year': '2000', 'company': 'Mercury', 'model': 'Cougar'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:51:process_car(year,company,model), {'year': '2000', 'company': 'Mercury', 'model': 'Cougar'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:52:process_car(year,company,model), {'year': '2000', 'company': 'Mercury', 'model': 'Cougar'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:55:process_car(year,company,model), {'year': '2000', 'company': 'Mercury', 'model': 'Cougar'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:56:process_car(year,company,model), {'year': '2000', 'company': 'Mercury', 'model': 'Cougar'})\n", "('return', ['We have a Mercury Cougar car from 2000 vintage.', 'It is an old but reliable model!'], /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:56:process_car(year,company,model), {'year': '2000', 'company': 'Mercury', 'model': 'Cougar'})\n", "('return', ['We have a Mercury Cougar car from 2000 vintage.', 'It is an old but reliable model!'], /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:35:process_vehicle(vehicle), {'vehicle': '2000,car,Mercury,Cougar', 'year': '2000', 'kind': 'car', 'company': 'Mercury', 
'model': 'Cougar'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:26:process_inventory(inventory), {'inventory': '1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture', 'vehicle': '2000,car,Mercury,Cougar'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:24:process_inventory(inventory), {'inventory': '1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture', 'vehicle': '2000,car,Mercury,Cougar'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:25:process_inventory(inventory), {'inventory': '1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture', 'vehicle': '1999,car,Chevy,Venture'})\n", "('call', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:29:process_vehicle(vehicle), {'vehicle': '1999,car,Chevy,Venture'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:30:process_vehicle(vehicle), {'vehicle': '1999,car,Chevy,Venture'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:31:process_vehicle(vehicle), {'vehicle': '1999,car,Chevy,Venture', 'year': '1999', 'kind': 'car', 'company': 'Chevy', 'model': 'Venture'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:34:process_vehicle(vehicle), {'vehicle': '1999,car,Chevy,Venture', 'year': '1999', 'kind': 'car', 'company': 'Chevy', 'model': 'Venture'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:35:process_vehicle(vehicle), {'vehicle': '1999,car,Chevy,Venture', 'year': '1999', 'kind': 'car', 'company': 'Chevy', 'model': 'Venture'})\n", "('call', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:49:process_car(year,company,model), {'year': '1999', 'company': 'Chevy', 'model': 'Venture'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:50:process_car(year,company,model), {'year': '1999', 'company': 'Chevy', 
'model': 'Venture'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:51:process_car(year,company,model), {'year': '1999', 'company': 'Chevy', 'model': 'Venture'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:52:process_car(year,company,model), {'year': '1999', 'company': 'Chevy', 'model': 'Venture'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:55:process_car(year,company,model), {'year': '1999', 'company': 'Chevy', 'model': 'Venture'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:56:process_car(year,company,model), {'year': '1999', 'company': 'Chevy', 'model': 'Venture'})\n", "('return', ['We have a Chevy Venture car from 1999 vintage.', 'It is an old but reliable model!'], /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:56:process_car(year,company,model), {'year': '1999', 'company': 'Chevy', 'model': 'Venture'})\n", "('return', ['We have a Chevy Venture car from 1999 vintage.', 'It is an old but reliable model!'], /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:35:process_vehicle(vehicle), {'vehicle': '1999,car,Chevy,Venture', 'year': '1999', 'kind': 'car', 'company': 'Chevy', 'model': 'Venture'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:26:process_inventory(inventory), {'inventory': '1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture', 'vehicle': '1999,car,Chevy,Venture'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:24:process_inventory(inventory), {'inventory': '1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture', 'vehicle': '1999,car,Chevy,Venture'})\n", "('line', None, /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:27:process_inventory(inventory), {'inventory': '1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture', 'vehicle': '1999,car,Chevy,Venture'})\n", "('return', 'We have 
a Ford E350 van from 1997 vintage.\\nIt is an old but reliable model!\\nWe have a Mercury Cougar car from 2000 vintage.\\nIt is an old but reliable model!\\nWe have a Chevy Venture car from 1999 vintage.\\nIt is an old but reliable model!', /Users/zeller/Projects/fuzzingbook/notebooks/Parser.ipynb:27:process_inventory(inventory), {'inventory': '1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture', 'vehicle': '1999,car,Chevy,Venture'})\n" ] } ], "source": [ "with Tracer(INVENTORY, methods=INVENTORY_METHODS, log=True) as tracer:\n", " process_inventory(tracer.my_input)\n", "print()\n", "print('Traced values:')\n", "for t in tracer.trace:\n", " print(t)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "source": [ "Our change seems to have worked. Let us derive the grammar." ] }, { "cell_type": "code", "execution_count": 221, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:02.467451Z", "iopub.status.busy": "2024-01-18T17:20:02.467372Z", "iopub.status.idle": "2024-01-18T17:20:02.477155Z", "shell.execute_reply": "2024-01-18T17:20:02.476770Z" }, "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "outputs": [], "source": [ "vehicle_grammar = recover_grammar(\n", " process_inventory,\n", " [INVENTORY],\n", " methods=INVENTORY_METHODS)" ] }, { "cell_type": "code", "execution_count": 222, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:02.478825Z", "iopub.status.busy": "2024-01-18T17:20:02.478735Z", "iopub.status.idle": "2024-01-18T17:20:02.493103Z", "shell.execute_reply": "2024-01-18T17:20:02.492869Z" }, "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "start\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_inventory@22:inventory" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", 
"output_type": "stream", "text": [ "process_inventory@22:inventory\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_vehicle@29:vehicle\n", "\n", "\n", "process_vehicle@29:vehicle\n", "\n", "\n", "process_vehicle@29:vehicle" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_vehicle@29:vehicle\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_vehicle@30:year\n", ",\n", "process_vehicle@30:kind\n", ",\n", "process_vehicle@30:company\n", ",\n", "process_vehicle@30:model" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_vehicle@30:year\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_car@49:year\n", "\n", "process_van@40:year" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_van@40:year\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "1997" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_vehicle@30:kind\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "car\n", "\n", "van" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_vehicle@30:company\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_van@40:company\n", "\n", "process_car@49:company" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_van@40:company\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "Ford" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": 
"display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_vehicle@30:model\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_van@40:model\n", "\n", "process_car@49:model" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_van@40:model\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "E350" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_car@49:year\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "2000\n", "\n", "1999" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_car@49:company\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "Mercury\n", "\n", "Chevy" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_car@49:model\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "Cougar\n", "\n", "Venture" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "syntax_diagram(vehicle_grammar)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "source": [ "The recovered grammar contains all the details that we were able to recover before." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Exercise 2: Incorporating Taints from InformationFlow" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" }, "solution2": "hidden", "solution2_first": true }, "source": [ "We have been using *string inclusion* to check whether a particular fragment came from the input string. 
This is unsatisfactory, as it requires us to compromise on the size of the strings tracked, which is limited to those longer than `FRAGMENT_LEN`. Further, a single method may process a string in which the same fragment occurs more than once, each time as part of a different token. For example, an embedded comma in the CSV file would cause our parser to fail. One way to avoid this is to rely on *dynamic taints*, and to check for taint inclusion rather than string inclusion.\n", "\n", "The chapter on [information flow](InformationFlow.ipynb) details how to incorporate dynamic taints. Can you update our scope-based grammar miner to use *dynamic taints* instead?" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "source": [ "**Solution.**" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "source": [ "First, we import `ostr` to track the origins of string fragments." ] }, { "cell_type": "code", "execution_count": 223, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:02.495192Z", "iopub.status.busy": "2024-01-18T17:20:02.495027Z", "iopub.status.idle": "2024-01-18T17:20:02.557982Z", "shell.execute_reply": "2024-01-18T17:20:02.557719Z" }, "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "outputs": [], "source": [ "from InformationFlow import ostr" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "source": [ "Next, we define `is_fragment()` to check whether a fragment originates from a given input string." 
] }, { "cell_type": "code", "execution_count": 224, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:02.560152Z", "iopub.status.busy": "2024-01-18T17:20:02.559930Z", "iopub.status.idle": "2024-01-18T17:20:02.561957Z", "shell.execute_reply": "2024-01-18T17:20:02.561697Z" }, "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "outputs": [], "source": [ "def is_fragment(fragment, original):\n", "    assert isinstance(original, ostr)\n", "    if not isinstance(fragment, ostr):\n", "        return False\n", "    return set(fragment.origin) <= set(original.origin)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "source": [ "Now, all that remains is to hook the tainted fragment check into our grammar miner. We do this by overriding the `in_current_record()` and `ignored()` methods of `InputStack`." ] }, { "cell_type": "code", "execution_count": 225, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:02.563641Z", "iopub.status.busy": "2024-01-18T17:20:02.563513Z", "iopub.status.idle": "2024-01-18T17:20:02.565331Z", "shell.execute_reply": "2024-01-18T17:20:02.565089Z" }, "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "outputs": [], "source": [ "class TaintedInputStack(InputStack):\n", "    def in_current_record(self, val):\n", "        return any(is_fragment(val, var) for var in self.inputs[-1].values())" ] }, { "cell_type": "code", "execution_count": 226, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:02.567170Z", "iopub.status.busy": "2024-01-18T17:20:02.566876Z", "iopub.status.idle": "2024-01-18T17:20:02.568548Z", "shell.execute_reply": "2024-01-18T17:20:02.568342Z" }, "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "outputs": [], "source": [ "class TaintedInputStack(TaintedInputStack):\n", "    def ignored(self, val):\n", "        return not isinstance(val, ostr)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { 
"slide_type": "skip" }, "solution2": "hidden" }, "source": [ "We then hook `TaintedInputStack` into the grammar mining infrastructure." ] }, { "cell_type": "code", "execution_count": 227, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:02.570451Z", "iopub.status.busy": "2024-01-18T17:20:02.570116Z", "iopub.status.idle": "2024-01-18T17:20:02.571912Z", "shell.execute_reply": "2024-01-18T17:20:02.571674Z" }, "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "outputs": [], "source": [ "class TaintedScopedVars(ScopedVars):\n", "    def create_call_stack(self, i):\n", "        return TaintedInputStack(i)" ] }, { "cell_type": "code", "execution_count": 228, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:02.573538Z", "iopub.status.busy": "2024-01-18T17:20:02.573359Z", "iopub.status.idle": "2024-01-18T17:20:02.575167Z", "shell.execute_reply": "2024-01-18T17:20:02.574955Z" }, "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "outputs": [], "source": [ "class TaintedScopeTracker(ScopeTracker):\n", "    def create_assignments(self, *args):\n", "        return TaintedScopedVars(*args)" ] }, { "cell_type": "code", "execution_count": 229, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:02.577217Z", "iopub.status.busy": "2024-01-18T17:20:02.576934Z", "iopub.status.idle": "2024-01-18T17:20:02.579197Z", "shell.execute_reply": "2024-01-18T17:20:02.578949Z" }, "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "outputs": [], "source": [ "class TaintedScopeTreeMiner(ScopeTreeMiner):\n", "    def string_part_of_value(self, part, value):\n", "        return is_fragment(part, value)\n", "\n", "    def partition(self, part, value):\n", "        begin = value.origin.index(part.origin[0])\n", "        end = value.origin.index(part.origin[-1])+1\n", "        return value[:begin], value[begin:end], value[end:]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "source": [ "" 
] }, { "cell_type": "code", "execution_count": 230, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:02.580848Z", "iopub.status.busy": "2024-01-18T17:20:02.580618Z", "iopub.status.idle": "2024-01-18T17:20:02.582636Z", "shell.execute_reply": "2024-01-18T17:20:02.582391Z" }, "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "outputs": [], "source": [ "class TaintedScopedGrammarMiner(ScopedGrammarMiner):\n", "    def create_tracker(self, *args):\n", "        return TaintedScopeTracker(*args)\n", "\n", "    def create_tree_miner(self, *args):\n", "        return TaintedScopeTreeMiner(*args)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "source": [ "Finally, we define `recover_grammar_with_taints()` to recover the grammar." ] }, { "cell_type": "code", "execution_count": 231, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:02.584308Z", "iopub.status.busy": "2024-01-18T17:20:02.584152Z", "iopub.status.idle": "2024-01-18T17:20:02.586410Z", "shell.execute_reply": "2024-01-18T17:20:02.586158Z" }, "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "outputs": [], "source": [ "def recover_grammar_with_taints(fn, inputs, **kwargs):\n", "    miner = TaintedScopedGrammarMiner()\n", "    for inputstr in inputs:\n", "        with Tracer(ostr(inputstr), **kwargs) as tracer:\n", "            fn(tracer.my_input)\n", "        miner.update_grammar(tracer.my_input, tracer.trace)\n", "    return readable(miner.clean_grammar())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "source": [ "Here is how one can use it." 
] }, { "cell_type": "code", "execution_count": 232, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:02.588229Z", "iopub.status.busy": "2024-01-18T17:20:02.587918Z", "iopub.status.idle": "2024-01-18T17:20:02.632823Z", "shell.execute_reply": "2024-01-18T17:20:02.632549Z" }, "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "outputs": [], "source": [ "inventory_grammar = recover_grammar_with_taints(\n", " process_inventory, [INVENTORY],\n", " methods=[\n", " 'process_inventory', 'process_vehicle', 'process_car', 'process_van'\n", " ])" ] }, { "cell_type": "code", "execution_count": 233, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:02.634678Z", "iopub.status.busy": "2024-01-18T17:20:02.634592Z", "iopub.status.idle": "2024-01-18T17:20:02.650063Z", "shell.execute_reply": "2024-01-18T17:20:02.649814Z" }, "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "start\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_inventory@22:inventory" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_inventory@22:inventory\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_vehicle@29:vehicle\n", "\n", "\n", "process_vehicle@29:vehicle\n", "\n", "\n", "process_vehicle@29:vehicle" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_vehicle@29:vehicle\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_vehicle@30:year\n", ",\n", "process_vehicle@30:kind\n", ",\n", "process_vehicle@30:company\n", ",\n", "process_vehicle@30:model" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ 
"process_vehicle@30:year\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_car@49:year\n", "\n", "process_van@40:year" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_van@40:year\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "1997" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_vehicle@30:kind\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "car\n", "\n", "van" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_vehicle@30:company\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_van@40:company\n", "\n", "process_car@49:company" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_van@40:company\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "Ford" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_vehicle@30:model\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "process_van@40:model\n", "\n", "process_car@49:model" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_van@40:model\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "E350" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_car@49:year\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "2000\n", "\n", "1999" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": 
"display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_car@49:company\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "Mercury\n", "\n", "Chevy" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "process_car@49:model\n" ] }, { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "Cougar\n", "\n", "Venture" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "syntax_diagram(inventory_grammar)" ] }, { "cell_type": "code", "execution_count": 234, "metadata": { "execution": { "iopub.execute_input": "2024-01-18T17:20:02.651723Z", "iopub.status.busy": "2024-01-18T17:20:02.651591Z", "iopub.status.idle": "2024-01-18T17:20:02.990854Z", "shell.execute_reply": "2024-01-18T17:20:02.990553Z" }, "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "outputs": [], "source": [ "url_grammar = recover_grammar_with_taints(\n", " url_parse, URLS_X + ['ftp://user4:pass1@host4/?key4=value3'],\n", " methods=['urlsplit', 'urlparse', '_splitnetloc'])" ] } ], "metadata": { "ipub": { "bibliography": "fuzzingbook.bib" }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.2" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": true, "title_cell": "", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": true }, "toc-autonumbering": false }, "nbformat": 4, "nbformat_minor": 4 }