{ "cells": [ { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "# Mining Input Grammars\n", "\n", "So far, the grammars we have seen have been mostly specified manually – that is, you (or the person knowing the input format) had to design and write a grammar in the first place. While the grammars we have seen so far have been rather simple, creating a grammar for complex inputs can involve quite some effort. In this chapter, we therefore introduce techniques that _automatically mine grammars from programs_ – by executing the programs and observing how they process which parts of the input. In conjunction with a grammar fuzzer, this allows us to \n", "1. take a program, \n", "2. extract its input grammar, and \n", "3. fuzz it with high efficiency and effectiveness, using the concepts in this book." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:03.780231Z", "iopub.status.busy": "2025-01-16T09:56:03.780116Z", "iopub.status.idle": "2025-01-16T09:56:03.839514Z", "shell.execute_reply": "2025-01-16T09:56:03.839197Z" }, "slideshow": { "slide_type": "skip" } }, "outputs": [ { "data": { "text/html": [ "\n", " \n", " " ], "text/plain": [ "" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from bookutils import YouTubeVideo\n", "YouTubeVideo(\"ddM1oL2LYDI\")" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "**Prerequisites**\n", "\n", "* You should have read the [chapter on grammars](Grammars.ipynb).\n", "* The [chapter on configuration fuzzing](ConfigurationFuzzer.ipynb) introduces grammar mining for configuration options, as well as observing variables and values during execution.\n", "* We use the tracer from the [chapter on coverage](Coverage.ipynb).\n", "* The concept of parsing from the [chapter on parsers](Parser.ipynb) is also useful." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "## Synopsis\n", "\n", "\n", "To [use the code provided in this chapter](Importing.ipynb), write\n", "\n", "```python\n", ">>> from fuzzingbook.GrammarMiner import <identifier>\n", "```\n", "\n", "and then make use of the following features.\n", "\n", "\n", "This chapter provides a number of classes to mine input grammars from existing programs. The function `recover_grammar()` is probably the easiest to use. It takes a function and a set of inputs, and returns a grammar that describes its input language.\n", "\n", "We apply `recover_grammar()` on a `url_parse()` function that takes and decomposes URLs:\n", "\n", "```python\n", ">>> url_parse('https://www.fuzzingbook.org/')\n", ">>> URLS\n", "['http://user:pass@www.google.com:80/?q=path#ref',\n", " 'https://www.cispa.saarland:80/',\n", " 'http://www.fuzzingbook.org/#News']\n", "```\n", "We extract the input grammar for `url_parse()` using `recover_grammar()`:\n", "\n", "```python\n", ">>> grammar = recover_grammar(url_parse, URLS, files=['urllib/parse.py'])\n", ">>> grammar\n", "{'<start>': ['<urlsplit@452:url>'],\n", " '<urlsplit@452:url>': ['<urlparse@396:scheme>:<_splitnetloc@413:url>'],\n", " '<urlparse@396:scheme>': ['http', 'https'],\n", " '<_splitnetloc@413:url>': ['//<urlparse@396:netloc>/',\n", "                            '//<urlparse@396:netloc><urlsplit@494:url>'],\n", " '<urlparse@396:netloc>': ['www.cispa.saarland:80',\n", "                           'www.fuzzingbook.org',\n", "                           'user:pass@www.google.com:80'],\n", " '<urlsplit@494:url>': ['<urlsplit@502:url>#<urlparse@396:fragment>',\n", "                        '/#<urlparse@396:fragment>'],\n", " '<urlsplit@502:url>': ['/?<urlparse@396:query>'],\n", " '<urlparse@396:query>': ['q=path'],\n", " '<urlparse@396:fragment>': ['News', 'ref']}\n", "```\n", "The names of the nonterminals are a bit technical, but the grammar nicely represents the structure of the input; for instance, the different schemes (`\"http\"`, `\"https\"`) are all identified:\n", "\n", "```python\n", ">>> syntax_diagram(grammar)\n", "start\n", "\n", "```\n", "![](PICS/GrammarMiner-synopsis-1.svg)\n", "```\n", "urlsplit@452:url\n", "\n", "```\n", "![](PICS/GrammarMiner-synopsis-2.svg)\n", "```\n", "urlparse@396:scheme\n", "\n", "```\n", "![](PICS/GrammarMiner-synopsis-3.svg)\n", "```\n", 
"_splitnetloc@413:url\n", "\n", "```\n", "![](PICS/GrammarMiner-synopsis-4.svg)\n", "```\n", "urlparse@396:netloc\n", "\n", "```\n", "![](PICS/GrammarMiner-synopsis-5.svg)\n", "```\n", "urlsplit@494:url\n", "\n", "```\n", "![](PICS/GrammarMiner-synopsis-6.svg)\n", "```\n", "urlsplit@502:url\n", "\n", "```\n", "![](PICS/GrammarMiner-synopsis-7.svg)\n", "```\n", "urlparse@396:query\n", "\n", "```\n", "![](PICS/GrammarMiner-synopsis-8.svg)\n", "```\n", "urlparse@396:fragment\n", "\n", "```\n", "![](PICS/GrammarMiner-synopsis-9.svg)\n", "\n", "The grammar can be immediately used for fuzzing, producing arbitrary combinations of input elements, which are all syntactically valid.\n", "\n", "```python\n", ">>> from GrammarCoverageFuzzer import GrammarCoverageFuzzer\n", ">>> fuzzer = GrammarCoverageFuzzer(grammar)\n", ">>> [fuzzer.fuzz() for i in range(5)]\n", "['https://www.cispa.saarland:80/#ref',\n", " 'http://www.fuzzingbook.org/',\n", " 'http://user:pass@www.google.com:80/?q=path#News',\n", " 'https://www.fuzzingbook.org/?q=path#ref',\n", " 'http://www.cispa.saarland:80/#News']\n", "```\n", "Being able to automatically extract a grammar and to use this grammar for fuzzing makes for very effective test generation with a minimum of manual work.\n", "\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## A Grammar Challenge\n", "\n", "Consider the `process_inventory()` method from the [chapter on parsers](Parser.ipynb):" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:03.860157Z", "iopub.status.busy": "2025-01-16T09:56:03.859981Z", "iopub.status.idle": "2025-01-16T09:56:03.862533Z", "shell.execute_reply": "2025-01-16T09:56:03.862251Z" }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "import bookutils.setup" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": 
"2025-01-16T09:56:03.864119Z", "iopub.status.busy": "2025-01-16T09:56:03.864019Z", "iopub.status.idle": "2025-01-16T09:56:03.865693Z", "shell.execute_reply": "2025-01-16T09:56:03.865473Z" }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from typing import List, Tuple, Callable, Any\n", "from collections.abc import Iterable" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:03.867285Z", "iopub.status.busy": "2025-01-16T09:56:03.867174Z", "iopub.status.idle": "2025-01-16T09:56:04.290141Z", "shell.execute_reply": "2025-01-16T09:56:04.289774Z" }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from Parser import process_inventory, process_vehicle, process_car, process_van, lr_graph # minor dependency" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "It takes inputs of the following form." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.292398Z", "iopub.status.busy": "2025-01-16T09:56:04.292233Z", "iopub.status.idle": "2025-01-16T09:56:04.294161Z", "shell.execute_reply": "2025-01-16T09:56:04.293856Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "INVENTORY = \"\"\"\\\n", "1997,van,Ford,E350\n", "2000,car,Mercury,Cougar\n", "1999,car,Chevy,Venture\\\n", "\"\"\"" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.295827Z", "iopub.status.busy": "2025-01-16T09:56:04.295709Z", "iopub.status.idle": "2025-01-16T09:56:04.297589Z", "shell.execute_reply": "2025-01-16T09:56:04.297306Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "We have a Ford E350 van from 1997 vintage.\n", "It is an old but reliable model!\n", "We have a Mercury Cougar car from 2000 vintage.\n", "It is 
an old but reliable model!\n", "We have a Chevy Venture car from 1999 vintage.\n", "It is an old but reliable model!\n" ] } ], "source": [ "print(process_inventory(INVENTORY))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We found from the [chapter on parsers](Parser.ipynb) that coarse grammars do not work well for fuzzing when the input format includes details expressed only in code. That is, even though we have the formal specification of CSV files ([RFC 4180](https://tools.ietf.org/html/rfc4180)), the inventory system includes further rules as to what is expected at each index of the CSV file. The solution of simply recombining existing inputs, while practical, is incomplete. In particular, it relies on a formal input specification being available in the first place. However, we have no assurance that the program obeys the input specification given." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "One of the ways out of this predicament is to interrogate the program under test as to what its input specification is. That is, if the program under test is written in a style such that specific methods are responsible for handling specific parts of the input, one can recover the parse tree by observing the process of parsing. Further, one can recover a reasonable approximation of the grammar by abstraction from multiple input trees." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ " _We start with the assumption (1) that the program is written in such a fashion that specific methods are responsible for parsing specific fragments of the input. This includes almost all ad hoc parsers._\n", "\n", "The idea is as follows:\n", "\n", "* Hook into the Python execution and observe the fragments of the input string as they are produced and named in different methods.\n", "* Stitch the input fragments together in a tree structure to retrieve the **Parse Tree**.\n", "* Abstract common elements from multiple parse trees to produce the **Context Free Grammar** of the input." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## A Simple Grammar Miner" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Say we want to obtain the input grammar for the function `process_vehicle()`. We first collect sample inputs for this function." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.299358Z", "iopub.status.busy": "2025-01-16T09:56:04.299232Z", "iopub.status.idle": "2025-01-16T09:56:04.300956Z", "shell.execute_reply": "2025-01-16T09:56:04.300672Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "VEHICLES = INVENTORY.split('\\n')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The methods responsible for processing the inventory are the following." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.302652Z", "iopub.status.busy": "2025-01-16T09:56:04.302538Z", "iopub.status.idle": "2025-01-16T09:56:04.304237Z", "shell.execute_reply": "2025-01-16T09:56:04.303966Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "INVENTORY_METHODS = {\n", " 'process_inventory',\n", " 'process_vehicle',\n", " 'process_van',\n", " 'process_car'}" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We have seen from the chapter on [configuration fuzzing](ConfigurationFuzzer.ipynb) that one can hook into the Python runtime to observe the arguments to a function and any local variables created. We have also seen that one can obtain the context of execution by inspecting the `frame` argument. Here is a simple tracer that can return the local variables and other contextual information in a traced function. We reuse the `Coverage` tracing class." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Tracer" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.305952Z", "iopub.status.busy": "2025-01-16T09:56:04.305831Z", "iopub.status.idle": "2025-01-16T09:56:04.307419Z", "shell.execute_reply": "2025-01-16T09:56:04.307187Z" }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from Coverage import Coverage" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.308978Z", "iopub.status.busy": "2025-01-16T09:56:04.308869Z", "iopub.status.idle": "2025-01-16T09:56:04.310520Z", "shell.execute_reply": "2025-01-16T09:56:04.310264Z" }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "import inspect" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.312032Z", "iopub.status.busy": "2025-01-16T09:56:04.311925Z", "iopub.status.idle": "2025-01-16T09:56:04.314116Z", "shell.execute_reply": "2025-01-16T09:56:04.313874Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class Tracer(Coverage):\n", "    def traceit(self, frame, event, arg):\n", "        method_name = inspect.getframeinfo(frame).function\n", "        if method_name not in INVENTORY_METHODS:\n", "            return\n", "        file_name = inspect.getframeinfo(frame).filename\n", "\n", "        param_names = inspect.getargvalues(frame).args\n", "        lineno = inspect.getframeinfo(frame).lineno\n", "        local_vars = inspect.getargvalues(frame).locals\n", "        print(event, file_name, lineno, method_name, param_names, local_vars)\n", "        return self.traceit" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We run the code under trace context." 
] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.315838Z", "iopub.status.busy": "2025-01-16T09:56:04.315732Z", "iopub.status.idle": "2025-01-16T09:56:04.360094Z", "shell.execute_reply": "2025-01-16T09:56:04.359816Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "call Parser.ipynb 29 process_vehicle ['vehicle'] {'vehicle': '1997,van,Ford,E350'}\n", "line Parser.ipynb 30 process_vehicle ['vehicle'] {'vehicle': '1997,van,Ford,E350'}\n", "line Parser.ipynb 31 process_vehicle ['vehicle'] {'vehicle': '1997,van,Ford,E350', 'year': '1997', 'kind': 'van', 'company': 'Ford', 'model': 'E350', '_': []}\n", "line Parser.ipynb 32 process_vehicle ['vehicle'] {'vehicle': '1997,van,Ford,E350', 'year': '1997', 'kind': 'van', 'company': 'Ford', 'model': 'E350', '_': []}\n", "call Parser.ipynb 40 process_van ['year', 'company', 'model'] {'year': '1997', 'company': 'Ford', 'model': 'E350'}\n", "line Parser.ipynb 40 process_van ['year', 'company', 'model'] {'year': '1997', 'company': 'Ford', 'model': 'E350'}\n", "line Parser.ipynb 41 process_van ['year', 'company', 'model'] {'year': '1997', 'company': 'Ford', 'model': 'E350'}\n", "line Parser.ipynb 42 process_van ['year', 'company', 'model'] {'year': '1997', 'company': 'Ford', 'model': 'E350', 'res': ['We have a Ford E350 van from 1997 vintage.']}\n", "line Parser.ipynb 43 process_van ['year', 'company', 'model'] {'year': '1997', 'company': 'Ford', 'model': 'E350', 'res': ['We have a Ford E350 van from 1997 vintage.'], 'iyear': 1997}\n", "line Parser.ipynb 46 process_van ['year', 'company', 'model'] {'year': '1997', 'company': 'Ford', 'model': 'E350', 'res': ['We have a Ford E350 van from 1997 vintage.'], 'iyear': 1997}\n", "line Parser.ipynb 47 process_van ['year', 'company', 'model'] {'year': '1997', 'company': 'Ford', 'model': 'E350', 'res': ['We have a Ford E350 van from 1997 vintage.', 'It 
is an old but reliable model!'], 'iyear': 1997}\n", "return Parser.ipynb 47 process_van ['year', 'company', 'model'] {'year': '1997', 'company': 'Ford', 'model': 'E350', 'res': ['We have a Ford E350 van from 1997 vintage.', 'It is an old but reliable model!'], 'iyear': 1997}\n", "return Parser.ipynb 32 process_vehicle ['vehicle'] {'vehicle': '1997,van,Ford,E350', 'year': '1997', 'kind': 'van', 'company': 'Ford', 'model': 'E350', '_': []}\n" ] } ], "source": [ "with Tracer() as tracer:\n", "    process_vehicle(VEHICLES[0])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The main thing that we want out of tracing is a list of assignments of input fragments to different variables. We can use the tracing facility `sys.settrace()` to get that, as we showed above.\n", "\n", "However, the `sys.settrace()` function hooks into the Python debugging facility. When it is in operation, no debugger can hook into the program. That is, if there is a problem with our grammar miner, we will not be able to attach a debugger to it to understand what is happening. This is not ideal. Hence, we limit the tracer to the simplest implementation possible, and implement the core of grammar mining in later stages." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The `traceit()` function relies on information from the `frame` variable, which exposes Python internals. We define a `Context` class that encapsulates the information that we need from the `frame`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Context\n", "\n", "The `Context` class provides easy access to information such as the current module and parameter names." 
] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.361875Z", "iopub.status.busy": "2025-01-16T09:56:04.361752Z", "iopub.status.idle": "2025-01-16T09:56:04.364059Z", "shell.execute_reply": "2025-01-16T09:56:04.363776Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class Context:\n", "    def __init__(self, frame, track_caller=True):\n", "        self.method = inspect.getframeinfo(frame).function\n", "        self.parameter_names = inspect.getargvalues(frame).args\n", "        self.file_name = inspect.getframeinfo(frame).filename\n", "        self.line_no = inspect.getframeinfo(frame).lineno\n", "\n", "    def _t(self):\n", "        return (self.file_name, self.line_no, self.method,\n", "                ','.join(self.parameter_names))\n", "\n", "    def __repr__(self):\n", "        return \"%s:%d:%s(%s)\" % self._t()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Here we add to `Context` a few convenience methods that operate on the `frame`." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.365610Z", "iopub.status.busy": "2025-01-16T09:56:04.365505Z", "iopub.status.idle": "2025-01-16T09:56:04.367621Z", "shell.execute_reply": "2025-01-16T09:56:04.367360Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class Context(Context):\n", "    def extract_vars(self, frame):\n", "        return inspect.getargvalues(frame).locals\n", "\n", "    def parameters(self, all_vars):\n", "        return {k: v for k, v in all_vars.items() if k in self.parameter_names}\n", "\n", "    def qualified(self, all_vars):\n", "        return {\"%s:%s\" % (self.method, k): v for k, v in all_vars.items()}" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "To see `Context` in action, we print it from our `traceit()`. First we define a `log_event()` function for displaying events." 
] }, { "cell_type": "code", "execution_count": 15, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.369199Z", "iopub.status.busy": "2025-01-16T09:56:04.369093Z", "iopub.status.idle": "2025-01-16T09:56:04.370737Z", "shell.execute_reply": "2025-01-16T09:56:04.370513Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def log_event(event, var):\n", "    print({'call': '->', 'return': '<-'}.get(event, ' '), var)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We then use `log_event()` in the `traceit()` function." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.372271Z", "iopub.status.busy": "2025-01-16T09:56:04.372167Z", "iopub.status.idle": "2025-01-16T09:56:04.373802Z", "shell.execute_reply": "2025-01-16T09:56:04.373566Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class Tracer(Tracer):\n", "    def traceit(self, frame, event, arg):\n", "        log_event(event, Context(frame))\n", "        return self.traceit" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Running `process_vehicle()` under trace prints the contexts encountered." 
] }, { "cell_type": "code", "execution_count": 17, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.375480Z", "iopub.status.busy": "2025-01-16T09:56:04.375378Z", "iopub.status.idle": "2025-01-16T09:56:04.378950Z", "shell.execute_reply": "2025-01-16T09:56:04.378699Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-> Parser.ipynb:29:process_vehicle(vehicle)\n", " Parser.ipynb:30:process_vehicle(vehicle)\n", " Parser.ipynb:31:process_vehicle(vehicle)\n", " Parser.ipynb:32:process_vehicle(vehicle)\n", "-> Parser.ipynb:40:process_van(year,company,model)\n", " Parser.ipynb:40:process_van(year,company,model)\n", " Parser.ipynb:41:process_van(year,company,model)\n", " Parser.ipynb:42:process_van(year,company,model)\n", " Parser.ipynb:43:process_van(year,company,model)\n", " Parser.ipynb:46:process_van(year,company,model)\n", " Parser.ipynb:47:process_van(year,company,model)\n", "<- Parser.ipynb:47:process_van(year,company,model)\n", "<- Parser.ipynb:32:process_vehicle(vehicle)\n", "-> Coverage.ipynb:102:__exit__(self,exc_type,exc_value,tb)\n", " Coverage.ipynb:105:__exit__(self,exc_type,exc_value,tb)\n" ] } ], "source": [ "with Tracer() as tracer:\n", " process_vehicle(VEHICLES[0])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The trace produced by executing any function can get overwhelmingly large. Hence, we need to restrict our attention to specific modules. Further, we also restrict our attention exclusively to `str` variables since these variables are more likely to contain input fragments. 
(We will show how to deal with complex objects later in the exercises.)\n", "\n", "The `Context` class we developed earlier is used to decide which modules to monitor, and which variables to trace.\n", "\n", "We store the current *input string* so that it can be used to determine if any particular string fragments came from the current input string. Any optional arguments are processed separately." ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.380446Z", "iopub.status.busy": "2025-01-16T09:56:04.380364Z", "iopub.status.idle": "2025-01-16T09:56:04.382049Z", "shell.execute_reply": "2025-01-16T09:56:04.381816Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class Tracer(Tracer):\n", "    def __init__(self, my_input, **kwargs):\n", "        self.options(kwargs)\n", "        self.my_input, self.trace = my_input, []" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We use an optional argument `files` to indicate the specific source files we are interested in, and `methods` to indicate which specific methods are of interest. Further, we use `log` to specify whether verbose logging should be enabled during tracing. We use the `log_event()` function we defined earlier for logging." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The options are processed as follows." 
] }, { "cell_type": "code", "execution_count": 19, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.383507Z", "iopub.status.busy": "2025-01-16T09:56:04.383427Z", "iopub.status.idle": "2025-01-16T09:56:04.385257Z", "shell.execute_reply": "2025-01-16T09:56:04.385032Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class Tracer(Tracer):\n", "    def options(self, kwargs):\n", "        self.files = kwargs.get('files', [])\n", "        self.methods = kwargs.get('methods', [])\n", "        self.log = log_event if kwargs.get('log') else lambda _evt, _var: None" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The `files` and `methods` options are checked to determine whether a particular event should be traced. An empty list means that no restriction applies." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.386699Z", "iopub.status.busy": "2025-01-16T09:56:04.386622Z", "iopub.status.idle": "2025-01-16T09:56:04.388621Z", "shell.execute_reply": "2025-01-16T09:56:04.388360Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class Tracer(Tracer):\n", "    def tracing_context(self, cxt, event, arg):\n", "        fres = not self.files or any(\n", "            cxt.file_name.endswith(f) for f in self.files)\n", "        mres = not self.methods or any(cxt.method == m for m in self.methods)\n", "        return fres and mres" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Similar to the context of events, we also want to restrict our attention to specific variables. For now, we want to focus only on strings. (See the exercises at the end of the chapter on how to extend this to other kinds of objects.)" 
] }, { "cell_type": "code", "execution_count": 21, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.390477Z", "iopub.status.busy": "2025-01-16T09:56:04.390378Z", "iopub.status.idle": "2025-01-16T09:56:04.392322Z", "shell.execute_reply": "2025-01-16T09:56:04.392001Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class Tracer(Tracer):\n", "    def tracing_var(self, k, v):\n", "        return isinstance(v, str)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We modify `traceit()` to call an `on_event()` function with the context information, but only for the events we are interested in." ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.394021Z", "iopub.status.busy": "2025-01-16T09:56:04.393936Z", "iopub.status.idle": "2025-01-16T09:56:04.396606Z", "shell.execute_reply": "2025-01-16T09:56:04.396367Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class Tracer(Tracer):\n", "    def on_event(self, event, arg, cxt, my_vars):\n", "        self.trace.append((event, arg, cxt, my_vars))\n", "\n", "    def create_context(self, frame):\n", "        return Context(frame)\n", "\n", "    def traceit(self, frame, event, arg):\n", "        cxt = self.create_context(frame)\n", "        if not self.tracing_context(cxt, event, arg):\n", "            return self.traceit\n", "        self.log(event, cxt)\n", "\n", "        my_vars = {\n", "            k: v\n", "            for k, v in cxt.extract_vars(frame).items()\n", "            if self.tracing_var(k, v)\n", "        }\n", "        self.on_event(event, arg, cxt, my_vars)\n", "        return self.traceit" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The `Tracer` class can now focus on specific kinds of events in specific files. Further, it provides a first-level filter for variables that we find interesting. 
For example, we want to focus specifically on variables from `process_*` methods that contain input fragments. Here is how our updated `Tracer` can be used." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.398159Z", "iopub.status.busy": "2025-01-16T09:56:04.398080Z", "iopub.status.idle": "2025-01-16T09:56:04.401976Z", "shell.execute_reply": "2025-01-16T09:56:04.401751Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-> Parser.ipynb:29:process_vehicle(vehicle)\n", " Parser.ipynb:30:process_vehicle(vehicle)\n", " Parser.ipynb:31:process_vehicle(vehicle)\n", " Parser.ipynb:32:process_vehicle(vehicle)\n", "-> Parser.ipynb:40:process_van(year,company,model)\n", " Parser.ipynb:40:process_van(year,company,model)\n", " Parser.ipynb:41:process_van(year,company,model)\n", " Parser.ipynb:42:process_van(year,company,model)\n", " Parser.ipynb:43:process_van(year,company,model)\n", " Parser.ipynb:46:process_van(year,company,model)\n", " Parser.ipynb:47:process_van(year,company,model)\n", "<- Parser.ipynb:47:process_van(year,company,model)\n", "<- Parser.ipynb:32:process_vehicle(vehicle)\n" ] } ], "source": [ "with Tracer(VEHICLES[0], methods=INVENTORY_METHODS, log=True) as tracer:\n", " process_vehicle(VEHICLES[0])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The execution produced the following trace." 
] }, { "cell_type": "code", "execution_count": 24, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.403599Z", "iopub.status.busy": "2025-01-16T09:56:04.403499Z", "iopub.status.idle": "2025-01-16T09:56:04.405418Z", "shell.execute_reply": "2025-01-16T09:56:04.405195Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "call process_vehicle {'vehicle': '1997,van,Ford,E350'}\n", "line process_vehicle {'vehicle': '1997,van,Ford,E350'}\n", "line process_vehicle {'vehicle': '1997,van,Ford,E350', 'year': '1997', 'kind': 'van', 'company': 'Ford', 'model': 'E350'}\n", "line process_vehicle {'vehicle': '1997,van,Ford,E350', 'year': '1997', 'kind': 'van', 'company': 'Ford', 'model': 'E350'}\n", "call process_van {'year': '1997', 'company': 'Ford', 'model': 'E350'}\n", "line process_van {'year': '1997', 'company': 'Ford', 'model': 'E350'}\n", "line process_van {'year': '1997', 'company': 'Ford', 'model': 'E350'}\n", "line process_van {'year': '1997', 'company': 'Ford', 'model': 'E350'}\n", "line process_van {'year': '1997', 'company': 'Ford', 'model': 'E350'}\n", "line process_van {'year': '1997', 'company': 'Ford', 'model': 'E350'}\n", "line process_van {'year': '1997', 'company': 'Ford', 'model': 'E350'}\n", "return process_van {'year': '1997', 'company': 'Ford', 'model': 'E350'}\n", "return process_vehicle {'vehicle': '1997,van,Ford,E350', 'year': '1997', 'kind': 'van', 'company': 'Ford', 'model': 'E350'}\n" ] } ], "source": [ "for t in tracer.trace:\n", " print(t[0], t[2].method, dict(t[3]))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Since we are saving the input already in `Tracer`, it is redundant to specify it separately again as an argument." 
] }, { "cell_type": "code", "execution_count": 25, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.407176Z", "iopub.status.busy": "2025-01-16T09:56:04.407041Z", "iopub.status.idle": "2025-01-16T09:56:04.411019Z", "shell.execute_reply": "2025-01-16T09:56:04.410765Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "-> Parser.ipynb:29:process_vehicle(vehicle)\n", " Parser.ipynb:30:process_vehicle(vehicle)\n", " Parser.ipynb:31:process_vehicle(vehicle)\n", " Parser.ipynb:32:process_vehicle(vehicle)\n", "-> Parser.ipynb:40:process_van(year,company,model)\n", " Parser.ipynb:40:process_van(year,company,model)\n", " Parser.ipynb:41:process_van(year,company,model)\n", " Parser.ipynb:42:process_van(year,company,model)\n", " Parser.ipynb:43:process_van(year,company,model)\n", " Parser.ipynb:46:process_van(year,company,model)\n", " Parser.ipynb:47:process_van(year,company,model)\n", "<- Parser.ipynb:47:process_van(year,company,model)\n", "<- Parser.ipynb:32:process_vehicle(vehicle)\n" ] } ], "source": [ "with Tracer(VEHICLES[0], methods=INVENTORY_METHODS, log=True) as tracer:\n", " process_vehicle(tracer.my_input)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### DefineTracker" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We define a `DefineTracker` class that processes the trace from the `Tracer`. The idea is to store different variable definitions which are input fragments.\n", "\n", "The tracker identifies string fragments that are part of the input string, and stores them in a dictionary `my_assignments`. It saves the trace, and the corresponding input for processing. Finally, it calls `process()` to process the `trace` it was given. 
We will start with a simple tracker that relies on certain assumptions, and later see how these assumptions can be relaxed." ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.412886Z", "iopub.status.busy": "2025-01-16T09:56:04.412768Z", "iopub.status.idle": "2025-01-16T09:56:04.414666Z", "shell.execute_reply": "2025-01-16T09:56:04.414321Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class DefineTracker:\n", "    def __init__(self, my_input, trace, **kwargs):\n", "        self.options(kwargs)\n", "        self.my_input = my_input\n", "        self.trace = trace\n", "        self.my_assignments = {}\n", "        self.process()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "One of the problems of using substring search is that short string sequences tend to be included in other string sequences even though they may not have come from the original string. That is, if the input fragment is `v`, it could equally have come from either `van` or `chevy`. We rely on being able to predict the exact place in the input where a given fragment occurred. Hence, we define a constant `FRAGMENT_LEN` such that we ignore strings shorter than that length. We also incorporate a logging facility as before."
] }, { "cell_type": "code", "execution_count": 27, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.416732Z", "iopub.status.busy": "2025-01-16T09:56:04.416615Z", "iopub.status.idle": "2025-01-16T09:56:04.418612Z", "shell.execute_reply": "2025-01-16T09:56:04.418335Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "FRAGMENT_LEN = 3" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.420341Z", "iopub.status.busy": "2025-01-16T09:56:04.420247Z", "iopub.status.idle": "2025-01-16T09:56:04.422355Z", "shell.execute_reply": "2025-01-16T09:56:04.422083Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class DefineTracker(DefineTracker):\n", "    def options(self, kwargs):\n", "        self.log = log_event if kwargs.get('log') else lambda _evt, _var: None\n", "        self.fragment_len = kwargs.get('fragment_len', FRAGMENT_LEN)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Our tracer simply records the variable values as they occur. We next need to check if the variables contain values from the **input string**. A common way to do this is to rely on symbolic execution or at least dynamic tainting; both are powerful, but also complex. However, one can obtain a reasonable approximation by simply relying on substring search. That is, we consider any value produced that is a substring of the original input string to have come from the original input." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We define an `is_input_fragment()` method that relies on string inclusion to detect if the string came from the input."
] }, { "cell_type": "code", "execution_count": 29, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.424095Z", "iopub.status.busy": "2025-01-16T09:56:04.423988Z", "iopub.status.idle": "2025-01-16T09:56:04.425976Z", "shell.execute_reply": "2025-01-16T09:56:04.425543Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class DefineTracker(DefineTracker):\n", "    def is_input_fragment(self, var, value):\n", "        return len(value) >= self.fragment_len and value in self.my_input" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We can use `is_input_fragment()` to select only a subset of variables defined, as implemented below in `fragments()`." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.427652Z", "iopub.status.busy": "2025-01-16T09:56:04.427542Z", "iopub.status.idle": "2025-01-16T09:56:04.429398Z", "shell.execute_reply": "2025-01-16T09:56:04.429163Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class DefineTracker(DefineTracker):\n", "    def fragments(self, variables):\n", "        return {k: v for k, v in variables.items(\n", "        ) if self.is_input_fragment(k, v)}" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The tracker processes each event, and at each event, it updates the dictionary `my_assignments` with the current local variables that contain strings that are part of the input. Note that there is a choice here with respect to what happens during reassignment. We can either discard all the reassignments, or keep only the last assignment. Here, we choose the latter. If you want the former behavior, check whether the variable already exists in `my_assignments` before storing a fragment."
] }, { "cell_type": "code", "execution_count": 31, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.430862Z", "iopub.status.busy": "2025-01-16T09:56:04.430757Z", "iopub.status.idle": "2025-01-16T09:56:04.432647Z", "shell.execute_reply": "2025-01-16T09:56:04.432397Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class DefineTracker(DefineTracker):\n", "    def track_event(self, event, arg, cxt, my_vars):\n", "        self.log(event, (cxt.method, my_vars))\n", "        self.my_assignments.update(self.fragments(my_vars))\n", "\n", "    def process(self):\n", "        for event, arg, cxt, my_vars in self.trace:\n", "            self.track_event(event, arg, cxt, my_vars)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Using the tracker, we can obtain the input fragments. For example, say we are only interested in strings that are at least `5` characters long." ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.434152Z", "iopub.status.busy": "2025-01-16T09:56:04.434014Z", "iopub.status.idle": "2025-01-16T09:56:04.436012Z", "shell.execute_reply": "2025-01-16T09:56:04.435734Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "vehicle = '1997,van,Ford,E350'\n" ] } ], "source": [ "tracker = DefineTracker(tracer.my_input, tracer.trace, fragment_len=5)\n", "for k, v in tracker.my_assignments.items():\n", "    print(k, '=', repr(v))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Or strings that are at least `3` characters long (the default, `FRAGMENT_LEN`)."
] }, { "cell_type": "code", "execution_count": 33, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.437922Z", "iopub.status.busy": "2025-01-16T09:56:04.437815Z", "iopub.status.idle": "2025-01-16T09:56:04.439768Z", "shell.execute_reply": "2025-01-16T09:56:04.439512Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "vehicle = '1997,van,Ford,E350'\n", "year = '1997'\n", "kind = 'van'\n", "company = 'Ford'\n", "model = 'E350'\n" ] } ], "source": [ "tracker = DefineTracker(tracer.my_input, tracer.trace)\n", "for k, v in tracker.my_assignments.items():\n", "    print(k, '=', repr(v))" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.441420Z", "iopub.status.busy": "2025-01-16T09:56:04.441312Z", "iopub.status.idle": "2025-01-16T09:56:04.442979Z", "shell.execute_reply": "2025-01-16T09:56:04.442757Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class DefineTracker(DefineTracker):\n", "    def assignments(self):\n", "        return self.my_assignments.items()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Assembling a Derivation Tree" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.444463Z", "iopub.status.busy": "2025-01-16T09:56:04.444370Z", "iopub.status.idle": "2025-01-16T09:56:04.445981Z", "shell.execute_reply": "2025-01-16T09:56:04.445761Z" }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from Grammars import START_SYMBOL, syntax_diagram, \\\n", "    is_nonterminal, Grammar" ] }, { "cell_type": "code", "execution_count": 36, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.447435Z", "iopub.status.busy": "2025-01-16T09:56:04.447347Z", "iopub.status.idle": "2025-01-16T09:56:04.448986Z", "shell.execute_reply": 
"2025-01-16T09:56:04.448760Z" }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from GrammarFuzzer import GrammarFuzzer, display_tree, \\\n", "    DerivationTree" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The input fragments from the `DefineTracker` only tell half the story. The fragments may be created at different stages of parsing. Hence, we need to assemble the fragments to a derivation tree of the input. The basic idea is as follows:\n", "\n", "Our input from the previous step was:\n", "\n", "```python\n", "\"1997,van,Ford,E350\"\n", "```\n", "\n", "We start a derivation tree, and associate it with the start symbol in the grammar." ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.450494Z", "iopub.status.busy": "2025-01-16T09:56:04.450408Z", "iopub.status.idle": "2025-01-16T09:56:04.452157Z", "shell.execute_reply": "2025-01-16T09:56:04.451907Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "derivation_tree: DerivationTree = (START_SYMBOL, [(\"1997,van,Ford,E350\", [])])" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.453545Z", "iopub.status.busy": "2025-01-16T09:56:04.453465Z", "iopub.status.idle": "2025-01-16T09:56:04.872477Z", "shell.execute_reply": "2025-01-16T09:56:04.872123Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display_tree(derivation_tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The next input was:\n", "```python\n", "vehicle = \"1997,van,Ford,E350\"\n", "```\n", "Since `vehicle` covers the `<start>` node's value completely, we replace the value 
with the `<vehicle>` node." ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.874354Z", "iopub.status.busy": "2025-01-16T09:56:04.874236Z", "iopub.status.idle": "2025-01-16T09:56:04.876060Z", "shell.execute_reply": "2025-01-16T09:56:04.875830Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "derivation_tree: DerivationTree = (START_SYMBOL, \n", "                   [('<vehicle>', [(\"1997,van,Ford,E350\", [])],\n", "                     [])])" ] }, { "cell_type": "code", "execution_count": 40, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:04.877685Z", "iopub.status.busy": "2025-01-16T09:56:04.877574Z", "iopub.status.idle": "2025-01-16T09:56:05.340217Z", "shell.execute_reply": "2025-01-16T09:56:05.339837Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display_tree(derivation_tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The next input was:\n", "```python\n", "year = '1997'\n", "```\n", "Traversing the derivation tree from `<start>`, we see that it replaces a portion of the `<vehicle>` node's value. Hence we split the `<vehicle>` node's value into two children, where one corresponds to the value `\"1997\"` and the other to `\",van,Ford,E350\"`, and replace the first one with the node `<year>`."
] }, { "cell_type": "code", "execution_count": 41, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:05.342683Z", "iopub.status.busy": "2025-01-16T09:56:05.342376Z", "iopub.status.idle": "2025-01-16T09:56:05.344885Z", "shell.execute_reply": "2025-01-16T09:56:05.344519Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "derivation_tree: DerivationTree = (START_SYMBOL, \n", "                   [('<vehicle>', [('<year>', [('1997', [])]),\n", "                                   (\",van,Ford,E350\", [])], [])])" ] }, { "cell_type": "code", "execution_count": 42, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:05.346787Z", "iopub.status.busy": "2025-01-16T09:56:05.346659Z", "iopub.status.idle": "2025-01-16T09:56:05.828516Z", "shell.execute_reply": "2025-01-16T09:56:05.827749Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display_tree(derivation_tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We perform similar operations for \n", "```python\n", "company = 'Ford'\n", "```" ] }, { "cell_type": "code", "execution_count": 43, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:05.830521Z", "iopub.status.busy": "2025-01-16T09:56:05.830373Z", "iopub.status.idle": "2025-01-16T09:56:05.832671Z", "shell.execute_reply": "2025-01-16T09:56:05.832256Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "derivation_tree: DerivationTree = (START_SYMBOL, \n", "                   [('<vehicle>', [('<year>', [('1997', [])]),\n", "                                   (\",van,\", []),\n", "                                   ('<company>', [('Ford', [])]),\n", "                                   (\",E350\", [])], [])])" ] }, { "cell_type": "code", "execution_count": 44, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:05.835106Z", "iopub.status.busy": "2025-01-16T09:56:05.834783Z", "iopub.status.idle": 
"2025-01-16T09:56:06.305774Z", "shell.execute_reply": "2025-01-16T09:56:06.305366Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 44, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display_tree(derivation_tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Similarly for\n", "```python\n", "kind = 'van'\n", "```\n", "and\n", "```python\n", "model = 'E350'\n", "```" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:06.307928Z", "iopub.status.busy": "2025-01-16T09:56:06.307771Z", "iopub.status.idle": "2025-01-16T09:56:06.310439Z", "shell.execute_reply": "2025-01-16T09:56:06.310096Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "derivation_tree: DerivationTree = (START_SYMBOL, \n", "                   [('<vehicle>', [('<year>', [('1997', [])]),\n", "                                   (\",\", []),\n", "                                   (\"<kind>\", [('van', [])]),\n", "                                   (\",\", []),\n", "                                   ('<company>', [('Ford', [])]),\n", "                                   (\",\", []),\n", "                                   (\"<model>\", [('E350', [])])\n", "                                   ], [])])" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:06.312731Z", "iopub.status.busy": "2025-01-16T09:56:06.312594Z", "iopub.status.idle": "2025-01-16T09:56:06.798045Z", "shell.execute_reply": "2025-01-16T09:56:06.797609Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display_tree(derivation_tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We now develop the complete algorithm with the above described steps.\n", "The `TreeMiner` class is initialized with the input string and the 
variable assignments, and it converts the assignments to the corresponding derivation tree." ] }, { "cell_type": "code", "execution_count": 47, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:06.800529Z", "iopub.status.busy": "2025-01-16T09:56:06.800333Z", "iopub.status.idle": "2025-01-16T09:56:06.803132Z", "shell.execute_reply": "2025-01-16T09:56:06.802878Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class TreeMiner:\n", "    def __init__(self, my_input, my_assignments, **kwargs):\n", "        self.options(kwargs)\n", "        self.my_input = my_input\n", "        self.my_assignments = my_assignments\n", "        self.tree = self.get_derivation_tree()\n", "\n", "    def options(self, kwargs):\n", "        self.log = log_call if kwargs.get('log') else lambda _i, _v: None\n", "\n", "    def get_derivation_tree(self):\n", "        return (START_SYMBOL, [])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The `log_call()` function is as follows." ] }, { "cell_type": "code", "execution_count": 48, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:06.805299Z", "iopub.status.busy": "2025-01-16T09:56:06.805178Z", "iopub.status.idle": "2025-01-16T09:56:06.807232Z", "shell.execute_reply": "2025-01-16T09:56:06.806912Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def log_call(indent, var):\n", "    print('\\t' * indent, var)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The basic idea is as follows:\n", "* **For now, we assume that the value assigned to a variable is stable. That is, it is never reassigned. In particular, there are no recursive calls, or multiple calls to the same function from different parts of the program.** (We will show how to overcome this limitation later).\n", "* For each pair _var_, _value_ found in `my_assignments`:\n", "    1. 
We search for occurrences of _value_ `val` in the derivation tree recursively.\n", "    2. If an occurrence was found as a value `V1` of a node `P1`, we partition the value of the node `P1` into three parts: the central part matches the _value_ `val`, and the first and last parts are the corresponding prefix and suffix in `V1`.\n", "    3. We reconstitute the node `P1` with three children, where the prefix and suffix mentioned earlier are string values, and the matching value `val` is replaced by a node `var` with a single value `val`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "First, we define a wrapper to generate a nonterminal from a variable name." ] }, { "cell_type": "code", "execution_count": 49, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:06.809213Z", "iopub.status.busy": "2025-01-16T09:56:06.809069Z", "iopub.status.idle": "2025-01-16T09:56:06.810907Z", "shell.execute_reply": "2025-01-16T09:56:06.810608Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def to_nonterminal(var):\n", "    return \"<\" + var.lower() + \">\"" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The `string_part_of_value()` method checks whether the given `part` is a substring of the whole `value`."
] }, { "cell_type": "code", "execution_count": 50, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:06.812694Z", "iopub.status.busy": "2025-01-16T09:56:06.812566Z", "iopub.status.idle": "2025-01-16T09:56:06.814478Z", "shell.execute_reply": "2025-01-16T09:56:06.814171Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class TreeMiner(TreeMiner):\n", "    def string_part_of_value(self, part, value):\n", "        return (part in value)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The `partition_by_part()` method splits the `value` by the given part if it matches, and returns a list containing the first part, the part that was replaced, and the last part. This is a format that can be used as a part of the list of children." ] }, { "cell_type": "code", "execution_count": 51, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:06.816319Z", "iopub.status.busy": "2025-01-16T09:56:06.816189Z", "iopub.status.idle": "2025-01-16T09:56:06.818261Z", "shell.execute_reply": "2025-01-16T09:56:06.817917Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class TreeMiner(TreeMiner):\n", "    def partition(self, part, value):\n", "        return value.partition(part)" ] }, { "cell_type": "code", "execution_count": 52, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:06.820245Z", "iopub.status.busy": "2025-01-16T09:56:06.820091Z", "iopub.status.idle": "2025-01-16T09:56:06.822455Z", "shell.execute_reply": "2025-01-16T09:56:06.822095Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class TreeMiner(TreeMiner):\n", "    def partition_by_part(self, pair, value):\n", "        k, part = pair\n", "        prefix_k_suffix = [\n", "            (k, [[part, []]]) if i == 1 else (e, [])\n", "            for i, e in enumerate(self.partition(part, value))\n", "            if e]\n", "        return prefix_k_suffix" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { 
"slide_type": "fragment" } }, "source": [ "The `insert_into_tree()` method accepts a given tree `tree` and a `(k,v)` pair. It recursively checks whether the given pair can be applied. If the pair can be applied, it applies the pair and returns `True`; otherwise, it returns `False`." ] }, { "cell_type": "code", "execution_count": 53, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:06.824376Z", "iopub.status.busy": "2025-01-16T09:56:06.824266Z", "iopub.status.idle": "2025-01-16T09:56:06.827114Z", "shell.execute_reply": "2025-01-16T09:56:06.826787Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class TreeMiner(TreeMiner):\n", "    def insert_into_tree(self, my_tree, pair):\n", "        var, values = my_tree\n", "        k, v = pair\n", "        self.log(1, \"- Node: %s\\t\\t? (%s:%s)\" % (var, k, repr(v)))\n", "        applied = False\n", "        for i, value_ in enumerate(values):\n", "            value, arr = value_\n", "            self.log(2, \"-> [%d] %s\" % (i, repr(value)))\n", "            if is_nonterminal(value):\n", "                applied = self.insert_into_tree(value_, pair)\n", "                if applied:\n", "                    break\n", "            elif self.string_part_of_value(v, value):\n", "                prefix_k_suffix = self.partition_by_part(pair, value)\n", "                del values[i]\n", "                for j, rep in enumerate(prefix_k_suffix):\n", "                    values.insert(j + i, rep)\n", "                applied = True\n", "\n", "                self.log(2, " > %s\" % (repr([i[0] for i in prefix_k_suffix])))\n", "                break\n", "            else:\n", "                continue\n", "        return applied" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Here is how `insert_into_tree()` is used."
] }, { "cell_type": "code", "execution_count": 54, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:06.828849Z", "iopub.status.busy": "2025-01-16T09:56:06.828729Z", "iopub.status.idle": "2025-01-16T09:56:06.830500Z", "shell.execute_reply": "2025-01-16T09:56:06.830151Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "tree: DerivationTree = (START_SYMBOL, [(\"1997,van,Ford,E350\", [])])\n", "m = TreeMiner('', {}, log=True)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "First, we have our input string as the only node." ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:06.832340Z", "iopub.status.busy": "2025-01-16T09:56:06.832205Z", "iopub.status.idle": "2025-01-16T09:56:07.297807Z", "shell.execute_reply": "2025-01-16T09:56:07.297390Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display_tree(tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Inserting the `<vehicle>` node." ] }, { "cell_type": "code", "execution_count": 56, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:07.299852Z", "iopub.status.busy": "2025-01-16T09:56:07.299677Z", "iopub.status.idle": "2025-01-16T09:56:07.302115Z", "shell.execute_reply": "2025-01-16T09:56:07.301797Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\t - Node: <start>\t\t? 
(<vehicle>:'1997,van,Ford,E350')\n", "\t\t -> [0] '1997,van,Ford,E350'\n", "\t\t > ['<vehicle>']\n" ] } ], "source": [ "v = m.insert_into_tree(tree, ('<vehicle>', \"1997,van,Ford,E350\"))" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:07.303876Z", "iopub.status.busy": "2025-01-16T09:56:07.303746Z", "iopub.status.idle": "2025-01-16T09:56:07.840292Z", "shell.execute_reply": "2025-01-16T09:56:07.839870Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display_tree(tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Inserting the `<model>` node." ] }, { "cell_type": "code", "execution_count": 58, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:07.842287Z", "iopub.status.busy": "2025-01-16T09:56:07.842131Z", "iopub.status.idle": "2025-01-16T09:56:07.844650Z", "shell.execute_reply": "2025-01-16T09:56:07.844314Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\t - Node: <start>\t\t? (<model>:'E350')\n", "\t\t -> [0] '<vehicle>'\n", "\t - Node: <vehicle>\t\t? 
(<model>:'E350')\n", "\t\t -> [0] '1997,van,Ford,E350'\n", "\t\t > ['1997,van,Ford,', '<model>']\n" ] } ], "source": [ "v = m.insert_into_tree(tree, ('<model>', 'E350'))" ] }, { "cell_type": "code", "execution_count": 59, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:07.846518Z", "iopub.status.busy": "2025-01-16T09:56:07.846389Z", "iopub.status.idle": "2025-01-16T09:56:08.310235Z", "shell.execute_reply": "2025-01-16T09:56:08.309888Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display_tree(tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Inserting the `<company>` node." ] }, { "cell_type": "code", "execution_count": 60, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:08.312065Z", "iopub.status.busy": "2025-01-16T09:56:08.311940Z", "iopub.status.idle": "2025-01-16T09:56:08.314107Z", "shell.execute_reply": "2025-01-16T09:56:08.313760Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\t - Node: <start>\t\t? (<company>:'Ford')\n", "\t\t -> [0] '<vehicle>'\n", "\t - Node: <vehicle>\t\t? 
(<company>:'Ford')\n", "\t\t -> [0] '1997,van,Ford,'\n", "\t\t > ['1997,van,', '<company>', ',']\n" ] } ], "source": [ "v = m.insert_into_tree(tree, ('<company>', 'Ford'))" ] }, { "cell_type": "code", "execution_count": 61, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:08.315920Z", "iopub.status.busy": "2025-01-16T09:56:08.315778Z", "iopub.status.idle": "2025-01-16T09:56:08.790976Z", "shell.execute_reply": "2025-01-16T09:56:08.790364Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display_tree(tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Inserting the `<kind>` node." ] }, { "cell_type": "code", "execution_count": 62, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:08.793066Z", "iopub.status.busy": "2025-01-16T09:56:08.792912Z", "iopub.status.idle": "2025-01-16T09:56:08.795074Z", "shell.execute_reply": "2025-01-16T09:56:08.794821Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\t - Node: <start>\t\t? (<kind>:'van')\n", "\t\t -> [0] '<vehicle>'\n", "\t - Node: <vehicle>\t\t? 
(<kind>:'van')\n", "\t\t -> [0] '1997,van,'\n", "\t\t > ['1997,', '<kind>', ',']\n" ] } ], "source": [ "v = m.insert_into_tree(tree, ('<kind>', 'van'))" ] }, { "cell_type": "code", "execution_count": 63, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:08.796505Z", "iopub.status.busy": "2025-01-16T09:56:08.796410Z", "iopub.status.idle": "2025-01-16T09:56:09.279723Z", "shell.execute_reply": "2025-01-16T09:56:09.279292Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display_tree(tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Inserting the `<year>` node." ] }, { "cell_type": "code", "execution_count": 64, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:09.281483Z", "iopub.status.busy": "2025-01-16T09:56:09.281355Z", "iopub.status.idle": "2025-01-16T09:56:09.283507Z", "shell.execute_reply": "2025-01-16T09:56:09.283225Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\t - Node: <start>\t\t? (<year>:'1997')\n", "\t\t -> [0] '<vehicle>'\n", "\t - Node: <vehicle>\t\t? 
(<year>:'1997')\n", "\t\t -> [0] '1997,'\n", "\t\t > ['<year>', ',']\n" ] } ], "source": [ "v = m.insert_into_tree(tree, ('<year>', '1997'))" ] }, { "cell_type": "code", "execution_count": 65, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:09.285496Z", "iopub.status.busy": "2025-01-16T09:56:09.285367Z", "iopub.status.idle": "2025-01-16T09:56:09.728131Z", "shell.execute_reply": "2025-01-16T09:56:09.727702Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display_tree(tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "To make life simple, we define a wrapper function `nt_var()` that will convert a token to its corresponding nonterminal symbol." ] }, { "cell_type": "code", "execution_count": 66, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:09.730587Z", "iopub.status.busy": "2025-01-16T09:56:09.730421Z", "iopub.status.idle": "2025-01-16T09:56:09.732661Z", "shell.execute_reply": "2025-01-16T09:56:09.732415Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class TreeMiner(TreeMiner):\n", "    def nt_var(self, var):\n", "        return var if is_nonterminal(var) else to_nonterminal(var)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Now, we need to apply a new definition to the entire derivation tree."
] }, { "cell_type": "code", "execution_count": 67, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:09.734732Z", "iopub.status.busy": "2025-01-16T09:56:09.734460Z", "iopub.status.idle": "2025-01-16T09:56:09.736850Z", "shell.execute_reply": "2025-01-16T09:56:09.736478Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class TreeMiner(TreeMiner):\n", "    def apply_new_definition(self, tree, var, value):\n", "        nt_var = self.nt_var(var)\n", "        return self.insert_into_tree(tree, (nt_var, value))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "This algorithm is implemented as `get_derivation_tree()`. " ] }, { "cell_type": "code", "execution_count": 68, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:09.738598Z", "iopub.status.busy": "2025-01-16T09:56:09.738490Z", "iopub.status.idle": "2025-01-16T09:56:09.740772Z", "shell.execute_reply": "2025-01-16T09:56:09.740524Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class TreeMiner(TreeMiner):\n", "    def get_derivation_tree(self):\n", "        tree = (START_SYMBOL, [(self.my_input, [])])\n", "\n", "        for var, value in self.my_assignments:\n", "            self.log(0, \"%s=%s\" % (var, repr(value)))\n", "            self.apply_new_definition(tree, var, value)\n", "        return tree" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The `TreeMiner` is used as follows:" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:09.742477Z", "iopub.status.busy": "2025-01-16T09:56:09.742347Z", "iopub.status.idle": "2025-01-16T09:56:09.748010Z", "shell.execute_reply": "2025-01-16T09:56:09.747706Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " vehicle='1997,van,Ford,E350'\n", "\t - Node: <start>\t\t? 
(<vehicle>:'1997,van,Ford,E350')\n", "\t\t -> [0] '1997,van,Ford,E350'\n", "\t\t > ['<vehicle>']\n", " year='1997'\n", "\t - Node: <start>\t\t? (<year>:'1997')\n", "\t\t -> [0] '<vehicle>'\n", "\t - Node: <vehicle>\t\t? (<year>:'1997')\n", "\t\t -> [0] '1997,van,Ford,E350'\n", "\t\t > ['<year>', ',van,Ford,E350']\n", " kind='van'\n", "\t - Node: <start>\t\t? (<kind>:'van')\n", "\t\t -> [0] '<vehicle>'\n", "\t - Node: <vehicle>\t\t? (<kind>:'van')\n", "\t\t -> [0] '<year>'\n", "\t - Node: <year>\t\t? (<kind>:'van')\n", "\t\t -> [0] '1997'\n", "\t\t -> [1] ',van,Ford,E350'\n", "\t\t > [',', '<kind>', ',Ford,E350']\n", " company='Ford'\n", "\t - Node: <start>\t\t? (<company>:'Ford')\n", "\t\t -> [0] '<vehicle>'\n", "\t - Node: <vehicle>\t\t? (<company>:'Ford')\n", "\t\t -> [0] '<year>'\n", "\t - Node: <year>\t\t? (<company>:'Ford')\n", "\t\t -> [0] '1997'\n", "\t\t -> [1] ','\n", "\t\t -> [2] '<kind>'\n", "\t - Node: <kind>\t\t? (<company>:'Ford')\n", "\t\t -> [0] 'van'\n", "\t\t -> [3] ',Ford,E350'\n", "\t\t > [',', '<company>', ',E350']\n", " model='E350'\n", "\t - Node: <start>\t\t? (<model>:'E350')\n", "\t\t -> [0] '<vehicle>'\n", "\t - Node: <vehicle>\t\t? (<model>:'E350')\n", "\t\t -> [0] '<year>'\n", "\t - Node: <year>\t\t? (<model>:'E350')\n", "\t\t -> [0] '1997'\n", "\t\t -> [1] ','\n", "\t\t -> [2] '<kind>'\n", "\t - Node: <kind>\t\t? (<model>:'E350')\n", "\t\t -> [0] 'van'\n", "\t\t -> [3] ','\n", "\t\t -> [4] '<company>'\n", "\t - Node: <company>\t\t? (<model>:'E350')\n", "\t\t -> [0] 'Ford'\n", "\t\t -> [5] ',E350'\n", "\t\t > [',', '<model>']\n" ] }, { "data": { "text/plain": [ "('<start>',\n", " [('<vehicle>',\n", "   [('<year>', [['1997', []]]),\n", "    (',', []),\n", "    ('<kind>', [['van', []]]),\n", "    (',', []),\n", "    ('<company>', [['Ford', []]]),\n", "    (',', []),\n", "    ('<model>', [['E350', []]])])])" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "with Tracer(VEHICLES[0]) as tracer:\n", "    process_vehicle(tracer.my_input)\n", "assignments = DefineTracker(tracer.my_input, tracer.trace).assignments()\n", "dt = TreeMiner(tracer.my_input, assignments, log=True)\n", "dt.tree" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The resulting derivation tree is shown below."
] }, { "cell_type": "code", "execution_count": 70, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:09.750015Z", "iopub.status.busy": "2025-01-16T09:56:09.749889Z", "iopub.status.idle": "2025-01-16T09:56:10.225601Z", "shell.execute_reply": "2025-01-16T09:56:10.225158Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "display_tree(TreeMiner(tracer.my_input, assignments).tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Combining all the pieces, we trace each input and collect its assignments:" ] }, { "cell_type": "code", "execution_count": 71, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.228156Z", "iopub.status.busy": "2025-01-16T09:56:10.228001Z", "iopub.status.idle": "2025-01-16T09:56:10.236231Z", "shell.execute_reply": "2025-01-16T09:56:10.235668Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1997,van,Ford,E350\n", "vehicle = '1997,van,Ford,E350'\n", "year = '1997'\n", "kind = 'van'\n", "company = 'Ford'\n", "model = 'E350'\n", "\n", "2000,car,Mercury,Cougar\n", "vehicle = '2000,car,Mercury,Cougar'\n", "year = '2000'\n", "kind = 'car'\n", "company = 'Mercury'\n", "model = 'Cougar'\n", "\n", "1999,car,Chevy,Venture\n", "vehicle = '1999,car,Chevy,Venture'\n", "year = '1999'\n", "kind = 'car'\n", "company = 'Chevy'\n", "model = 'Venture'\n", "\n" ] } ], "source": [ "trees = []\n", "for vehicle in VEHICLES:\n", "    print(vehicle)\n", "    with Tracer(vehicle) as tracer:\n", "        process_vehicle(tracer.my_input)\n", "    assignments = DefineTracker(tracer.my_input, tracer.trace).assignments()\n", "    trees.append((tracer.my_input, assignments))\n", "    for var, val in assignments:\n", "        print(var + \" = \" + repr(val))\n", "    print()" ] }, { "cell_type": "markdown", 
"metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The corresponding derivation trees are below." ] }, { "cell_type": "code", "execution_count": 72, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.238188Z", "iopub.status.busy": "2025-01-16T09:56:10.238027Z", "iopub.status.idle": "2025-01-16T09:56:10.241089Z", "shell.execute_reply": "2025-01-16T09:56:10.240766Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1997,van,Ford,E350\n", "2000,car,Mercury,Cougar\n", "1999,car,Chevy,Venture\n" ] } ], "source": [ "csv_dt = []\n", "for inputstr, assignments in trees:\n", "    print(inputstr)\n", "    dt = TreeMiner(inputstr, assignments)\n", "    csv_dt.append(dt)\n", "    display_tree(dt.tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Recovering Grammars from Derivation Trees\n", "\n", "We define a class `GrammarMiner` that can combine multiple derivation trees to produce the grammar. The initial grammar is empty." ] }, { "cell_type": "code", "execution_count": 73, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.243436Z", "iopub.status.busy": "2025-01-16T09:56:10.243316Z", "iopub.status.idle": "2025-01-16T09:56:10.245768Z", "shell.execute_reply": "2025-01-16T09:56:10.245316Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class GrammarMiner:\n", "    def __init__(self):\n", "        self.grammar = {}" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The `tree_to_grammar()` method converts our derivation tree to a grammar by picking one node at a time, and adding it to the grammar. The node name becomes the key, and any list of children it has becomes another alternative for that key."
] }, { "cell_type": "code", "execution_count": 74, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.247760Z", "iopub.status.busy": "2025-01-16T09:56:10.247606Z", "iopub.status.idle": "2025-01-16T09:56:10.251196Z", "shell.execute_reply": "2025-01-16T09:56:10.250745Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class GrammarMiner(GrammarMiner):\n", "    def tree_to_grammar(self, tree):\n", "        node, children = tree\n", "        one_alt = [ck for ck, gc in children]\n", "        hsh = {node: [one_alt] if one_alt else []}\n", "        for child in children:\n", "            if not is_nonterminal(child[0]):\n", "                continue\n", "            chsh = self.tree_to_grammar(child)\n", "            for k in chsh:\n", "                if k not in hsh:\n", "                    hsh[k] = chsh[k]\n", "                else:\n", "                    hsh[k].extend(chsh[k])\n", "        return hsh" ] }, { "cell_type": "code", "execution_count": 75, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.253088Z", "iopub.status.busy": "2025-01-16T09:56:10.252942Z", "iopub.status.idle": "2025-01-16T09:56:10.255694Z", "shell.execute_reply": "2025-01-16T09:56:10.255332Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "{'<start>': [['<vehicle>']],\n", " '<vehicle>': [['<year>', ',', '<kind>', ',', '<company>', ',', '<model>']],\n", " '<year>': [['1997']],\n", " '<kind>': [['van']],\n", " '<company>': [['Ford']],\n", " '<model>': [['E350']]}" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gm = GrammarMiner()\n", "gm.tree_to_grammar(csv_dt[0].tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The grammar generated here is in _canonical_ form, where each expansion is a list of symbols. We define a function `readable()` that takes a canonical grammar and returns it in a readable form, with each expansion joined into a single string."
] }, { "cell_type": "code", "execution_count": 76, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.257712Z", "iopub.status.busy": "2025-01-16T09:56:10.257568Z", "iopub.status.idle": "2025-01-16T09:56:10.259850Z", "shell.execute_reply": "2025-01-16T09:56:10.259453Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "def readable(grammar):\n", "    def readable_rule(rule):\n", "        return ''.join(rule)\n", "\n", "    return {k: list(set(readable_rule(a) for a in grammar[k]))\n", "            for k in grammar}" ] }, { "cell_type": "code", "execution_count": 77, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.261765Z", "iopub.status.busy": "2025-01-16T09:56:10.261633Z", "iopub.status.idle": "2025-01-16T09:56:10.271485Z", "shell.execute_reply": "2025-01-16T09:56:10.271101Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "start\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "vehicle\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "year\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "kind\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "company\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "model\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ 
"syntax_diagram(readable(gm.tree_to_grammar(csv_dt[0].tree)))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The `add_tree()` method converts the given tree to a grammar and merges it into the current grammar: for each nonterminal that occurs in either the current grammar or the tree, it appends the expansions found in the tree to the existing ones." ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.274877Z", "iopub.status.busy": "2025-01-16T09:56:10.274662Z", "iopub.status.idle": "2025-01-16T09:56:10.276910Z", "shell.execute_reply": "2025-01-16T09:56:10.276440Z" }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "import itertools" ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.279043Z", "iopub.status.busy": "2025-01-16T09:56:10.278910Z", "iopub.status.idle": "2025-01-16T09:56:10.281150Z", "shell.execute_reply": "2025-01-16T09:56:10.280883Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class GrammarMiner(GrammarMiner):\n", "    def add_tree(self, t):\n", "        t_grammar = self.tree_to_grammar(t.tree)\n", "        self.grammar = {\n", "            key: self.grammar.get(key, []) + t_grammar.get(key, [])\n", "            for key in itertools.chain(self.grammar.keys(), t_grammar.keys())\n", "        }" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The `add_tree()` method is used as follows:" ] }, { "cell_type": "code", "execution_count": 80, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.283465Z", "iopub.status.busy": "2025-01-16T09:56:10.283325Z", "iopub.status.idle": "2025-01-16T09:56:10.285344Z", "shell.execute_reply": "2025-01-16T09:56:10.285064Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "inventory_grammar_miner = GrammarMiner()\n", "for dt in csv_dt:\n", "    inventory_grammar_miner.add_tree(dt)" ] }, { 
"cell_type": "code", "execution_count": 81, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.287230Z", "iopub.status.busy": "2025-01-16T09:56:10.287094Z", "iopub.status.idle": "2025-01-16T09:56:10.296064Z", "shell.execute_reply": "2025-01-16T09:56:10.295725Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "start\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "vehicle\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "year\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "kind\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "company\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "model\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "syntax_diagram(readable(inventory_grammar_miner.grammar))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Given execution traces from various inputs, one can define `update_grammar()` to obtain the complete grammar from the traces." 
] }, { "cell_type": "code", "execution_count": 82, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.299011Z", "iopub.status.busy": "2025-01-16T09:56:10.298797Z", "iopub.status.idle": "2025-01-16T09:56:10.301508Z", "shell.execute_reply": "2025-01-16T09:56:10.301173Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class GrammarMiner(GrammarMiner):\n", "    def update_grammar(self, inputstr, trace):\n", "        at = self.create_tracker(inputstr, trace)\n", "        dt = self.create_tree_miner(inputstr, at.assignments())\n", "        self.add_tree(dt)\n", "        return self.grammar\n", "\n", "    def create_tracker(self, *args):\n", "        return DefineTracker(*args)\n", "\n", "    def create_tree_miner(self, *args):\n", "        return TreeMiner(*args)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The complete grammar recovery is implemented in `recover_grammar()`." ] }, { "cell_type": "code", "execution_count": 83, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.303637Z", "iopub.status.busy": "2025-01-16T09:56:10.303481Z", "iopub.status.idle": "2025-01-16T09:56:10.306025Z", "shell.execute_reply": "2025-01-16T09:56:10.305750Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "def recover_grammar(fn: Callable, inputs: Iterable[str], \n", "                    **kwargs: Any) -> Grammar:\n", "    miner = GrammarMiner()\n", "\n", "    for inputstr in inputs:\n", "        with Tracer(inputstr, **kwargs) as tracer:\n", "            fn(tracer.my_input)\n", "        miner.update_grammar(tracer.my_input, tracer.trace)\n", "\n", "    return readable(miner.grammar)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Note that the grammar could have been retrieved directly from the tracker, without the intermediate derivation tree stage. 
However, going through the derivation tree allows one to inspect the inputs being fragmented and verify that it happens correctly." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Example 1. Recovering the Inventory Grammar" ] }, { "cell_type": "code", "execution_count": 84, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.307998Z", "iopub.status.busy": "2025-01-16T09:56:10.307884Z", "iopub.status.idle": "2025-01-16T09:56:10.314782Z", "shell.execute_reply": "2025-01-16T09:56:10.314341Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "inventory_grammar = recover_grammar(process_vehicle, VEHICLES)" ] }, { "cell_type": "code", "execution_count": 85, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.317608Z", "iopub.status.busy": "2025-01-16T09:56:10.317382Z", "iopub.status.idle": "2025-01-16T09:56:10.320089Z", "shell.execute_reply": "2025-01-16T09:56:10.319715Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "{'<start>': ['<vehicle>'],\n", " '<vehicle>': ['<year>,<kind>,<company>,<model>'],\n", " '<year>': ['1999', '2000', '1997'],\n", " '<kind>': ['car', 'van'],\n", " '<company>': ['Mercury', 'Chevy', 'Ford'],\n", " '<model>': ['E350', 'Cougar', 'Venture']}" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "inventory_grammar" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Example 2. Recovering URL Grammar" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Our algorithm is robust enough to recover grammars from real-world programs. For example, the `urlparse` function in the Python `urllib` module accepts the following sample URLs."
] }, { "cell_type": "code", "execution_count": 86, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.322692Z", "iopub.status.busy": "2025-01-16T09:56:10.322554Z", "iopub.status.idle": "2025-01-16T09:56:10.324737Z", "shell.execute_reply": "2025-01-16T09:56:10.324313Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "URLS = [\n", "    'http://user:pass@www.google.com:80/?q=path#ref',\n", "    'https://www.cispa.saarland:80/',\n", "    'http://www.fuzzingbook.org/#News',\n", "]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The `urllib` module caches its intermediate results for faster access. A cached result would bypass the parsing code we want to trace, so we need to clear the cache using `clear_cache()` for every invocation." ] }, { "cell_type": "code", "execution_count": 87, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.327133Z", "iopub.status.busy": "2025-01-16T09:56:10.326961Z", "iopub.status.idle": "2025-01-16T09:56:10.329414Z", "shell.execute_reply": "2025-01-16T09:56:10.328776Z" }, "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from urllib.parse import urlparse, clear_cache # type: ignore" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We use the sample URLs to recover the grammar as follows. Because `urlparse` caches its previous parsing results, we define a wrapper function `url_parse()` that clears the cache before each call."
] }, { "cell_type": "code", "execution_count": 88, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.332171Z", "iopub.status.busy": "2025-01-16T09:56:10.331978Z", "iopub.status.idle": "2025-01-16T09:56:10.334203Z", "shell.execute_reply": "2025-01-16T09:56:10.333841Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "def url_parse(url):\n", "    clear_cache()\n", "    urlparse(url)" ] }, { "cell_type": "code", "execution_count": 89, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.336065Z", "iopub.status.busy": "2025-01-16T09:56:10.335910Z", "iopub.status.idle": "2025-01-16T09:56:10.452152Z", "shell.execute_reply": "2025-01-16T09:56:10.451718Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "http://user:pass@www.google.com:80/?q=path#ref\n", "url = 'http://user:pass@www.google.com:80/?q=path#ref'\n", "scheme = 'http'\n", "netloc = 'user:pass@www.google.com:80'\n", "fragment = 'ref'\n", "query = 'q=path'\n", "\n", "https://www.cispa.saarland:80/\n", "url = 'https://www.cispa.saarland:80/'\n", "scheme = 'https'\n", "netloc = 'www.cispa.saarland:80'\n", "\n", "http://www.fuzzingbook.org/#News\n", "url = 'http://www.fuzzingbook.org/#News'\n", "scheme = 'http'\n", "netloc = 'www.fuzzingbook.org'\n", "fragment = 'News'\n", "\n", "http://user:pass@www.google.com:80/?q=path#ref\n", "https://www.cispa.saarland:80/\n", "http://www.fuzzingbook.org/#News\n" ] } ], "source": [ "trees = []\n", "for url in URLS:\n", "    print(url)\n", "    with Tracer(url) as tracer:\n", "        url_parse(tracer.my_input)\n", "    assignments = DefineTracker(tracer.my_input, tracer.trace).assignments()\n", "    trees.append((tracer.my_input, assignments))\n", "    for var, val in assignments:\n", "        print(var + \" = \" + repr(val))\n", "    print()\n", "\n", "\n", "url_dt = []\n", "for inputstr, assignments in trees:\n", "    print(inputstr)\n", "    dt = TreeMiner(inputstr, assignments)\n", "    url_dt.append(dt)\n", "    display_tree(dt.tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Let us use `url_parse()` to recover the grammar:" ] }, { "cell_type": "code", "execution_count": 90, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.454263Z", "iopub.status.busy": "2025-01-16T09:56:10.454110Z", "iopub.status.idle": "2025-01-16T09:56:10.567659Z", "shell.execute_reply": "2025-01-16T09:56:10.567334Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "url_grammar = recover_grammar(url_parse, URLS, files=['urllib/parse.py'])" ] }, { "cell_type": "code", "execution_count": 91, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.569810Z", "iopub.status.busy": "2025-01-16T09:56:10.569683Z", "iopub.status.idle": "2025-01-16T09:56:10.579296Z", "shell.execute_reply": "2025-01-16T09:56:10.579040Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "start\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "url\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "scheme\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "netloc\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "query\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "fragment\n" ] }, { "data": { "image/svg+xml": [ 
"" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "syntax_diagram(url_grammar)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The recovered grammar describes the URL format reasonably well." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Fuzzing" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We can now use our recovered grammar for fuzzing as follows.\n", "\n", "First, the inventory grammar." ] }, { "cell_type": "code", "execution_count": 92, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.581452Z", "iopub.status.busy": "2025-01-16T09:56:10.581326Z", "iopub.status.idle": "2025-01-16T09:56:10.584213Z", "shell.execute_reply": "2025-01-16T09:56:10.583921Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1997,car,Mercury,E350\n", "2000,car,Chevy,Cougar\n", "1997,van,Mercury,Venture\n", "1999,car,Ford,Venture\n", "2000,car,Mercury,E350\n", "2000,car,Mercury,Cougar\n", "1997,car,Chevy,E350\n", "1997,car,Chevy,E350\n", "1997,car,Mercury,Cougar\n", "1999,car,Chevy,E350\n" ] } ], "source": [ "f = GrammarFuzzer(inventory_grammar)\n", "for _ in range(10):\n", "    print(f.fuzz())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Next, the URL grammar."
] }, { "cell_type": "code", "execution_count": 93, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.585962Z", "iopub.status.busy": "2025-01-16T09:56:10.585830Z", "iopub.status.idle": "2025-01-16T09:56:10.588527Z", "shell.execute_reply": "2025-01-16T09:56:10.588269Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "https://user:pass@www.google.com:80/\n", "http://www.cispa.saarland:80/?q=path#News\n", "https://user:pass@www.google.com:80/\n", "https://user:pass@www.google.com:80/#ref\n", "http://user:pass@www.google.com:80/\n", "http://user:pass@www.google.com:80/#ref\n", "http://user:pass@www.google.com:80/?q=path#News\n", "http://www.fuzzingbook.org/?q=path#News\n", "http://www.fuzzingbook.org/?q=path#ref\n", "http://www.cispa.saarland:80/\n" ] } ], "source": [ "f = GrammarFuzzer(url_grammar)\n", "for _ in range(10):\n", "    print(f.fuzz())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "What this means is that we can now take a program and a few samples, extract its grammar, and then use this very grammar for fuzzing. Now that's quite an opportunity!" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Problems with the Simple Miner" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "One of the problems with our simple grammar miner is the assumption that the values assigned to variables are stable. Unfortunately, that may not hold true in all cases. For example, here is a URL with a slightly different format."
] }, { "cell_type": "code", "execution_count": 94, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.590245Z", "iopub.status.busy": "2025-01-16T09:56:10.590122Z", "iopub.status.idle": "2025-01-16T09:56:10.591887Z", "shell.execute_reply": "2025-01-16T09:56:10.591561Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "URLS_X = URLS + ['ftp://freebsd.org/releases/5.8']" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The grammar generated from this set of samples is not as nice as what we got earlier:" ] }, { "cell_type": "code", "execution_count": 95, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.593856Z", "iopub.status.busy": "2025-01-16T09:56:10.593712Z", "iopub.status.idle": "2025-01-16T09:56:10.738406Z", "shell.execute_reply": "2025-01-16T09:56:10.738135Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "url_grammar = recover_grammar(url_parse, URLS_X, files=['urllib/parse.py'])" ] }, { "cell_type": "code", "execution_count": 96, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.740340Z", "iopub.status.busy": "2025-01-16T09:56:10.740236Z", "iopub.status.idle": "2025-01-16T09:56:10.749808Z", "shell.execute_reply": "2025-01-16T09:56:10.749592Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "start\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "url\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "scheme\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ 
"netloc\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "query\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "fragment\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "syntax_diagram(url_grammar)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Clearly, something has gone wrong.\n", "\n", "To investigate why the `url` definition has gone wrong, let us inspect the trace for the URL." ] }, { "cell_type": "code", "execution_count": 97, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.751739Z", "iopub.status.busy": "2025-01-16T09:56:10.751624Z", "iopub.status.idle": "2025-01-16T09:56:10.786679Z", "shell.execute_reply": "2025-01-16T09:56:10.786358Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 374 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': ''},)\n", "1 394 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': ''},)\n", "5 129 ({'arg': ''},)\n", "6 126 ({'arg': ''},)\n", "7 131 ({'arg': ''},)\n", "8 132 ({'arg': ''},)\n", "10 395 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': ''},)\n", "11 452 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': ''},)\n", "12 474 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': ''},)\n", "16 129 ({'arg': ''},)\n", "17 126 ({'arg': ''},)\n", "18 131 ({'arg': ''},)\n", "19 132 ({'arg': ''},)\n", "21 477 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': ''},)\n", "22 478 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': 
''},)\n", "23 480 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': ''},)\n", "24 481 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\t'},)\n", "25 482 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\t'},)\n", "26 480 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\t'},)\n", "27 481 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\r'},)\n", "28 482 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\r'},)\n", "29 480 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\r'},)\n", "30 481 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n'},)\n", "31 482 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n'},)\n", "32 480 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n'},)\n", "33 484 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n'},)\n", "34 485 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n'},)\n", "35 486 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n', 'netloc': '', 'query': '', 'fragment': ''},)\n", "36 487 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n', 'netloc': '', 'query': '', 'fragment': ''},)\n", "37 488 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n', 'netloc': '', 'query': '', 'fragment': ''},)\n", "38 489 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 'h'},)\n", "39 488 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 'h'},)\n", "40 489 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 
'b': '\\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 't'},)\n", "41 488 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 't'},)\n", "42 489 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 't'},)\n", "43 488 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 't'},)\n", "44 489 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 'p'},)\n", "45 488 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 'p'},)\n", "46 492 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': '', 'b': '\\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 'p'},)\n", "47 493 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'scheme': 'http', 'b': '\\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 'p'},)\n", "48 494 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'scheme': 'http', 'b': '\\n', 'netloc': '', 'query': '', 'fragment': '', 'c': 'p'},)\n", "49 413 ({'url': '//user:pass@www.google.com:80/?q=path#ref'},)\n", "50 414 ({'url': '//user:pass@www.google.com:80/?q=path#ref'},)\n", "51 415 ({'url': '//user:pass@www.google.com:80/?q=path#ref'},)\n", "52 416 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '/'},)\n", "53 417 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '/'},)\n", "54 418 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '/'},)\n", "55 415 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '/'},)\n", "56 416 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '?'},)\n", "57 417 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '?'},)\n", "58 418 ({'url': 
'//user:pass@www.google.com:80/?q=path#ref', 'c': '?'},)\n", "59 415 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '?'},)\n", "60 416 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '#'},)\n", "61 417 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '#'},)\n", "62 418 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '#'},)\n", "63 415 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '#'},)\n", "64 419 ({'url': '//user:pass@www.google.com:80/?q=path#ref', 'c': '#'},)\n", "66 495 ({'url': '/?q=path#ref', 'scheme': 'http', 'b': '\\n', 'netloc': 'user:pass@www.google.com:80', 'query': '', 'fragment': '', 'c': 'p'},)\n", "67 496 ({'url': '/?q=path#ref', 'scheme': 'http', 'b': '\\n', 'netloc': 'user:pass@www.google.com:80', 'query': '', 'fragment': '', 'c': 'p'},)\n", "68 498 ({'url': '/?q=path#ref', 'scheme': 'http', 'b': '\\n', 'netloc': 'user:pass@www.google.com:80', 'query': '', 'fragment': '', 'c': 'p'},)\n", "69 501 ({'url': '/?q=path#ref', 'scheme': 'http', 'b': '\\n', 'netloc': 'user:pass@www.google.com:80', 'query': '', 'fragment': '', 'c': 'p'},)\n", "70 502 ({'url': '/?q=path#ref', 'scheme': 'http', 'b': '\\n', 'netloc': 'user:pass@www.google.com:80', 'query': '', 'fragment': '', 'c': 'p'},)\n", "71 503 ({'url': '/?q=path', 'scheme': 'http', 'b': '\\n', 'netloc': 'user:pass@www.google.com:80', 'query': '', 'fragment': 'ref', 'c': 'p'},)\n", "72 504 ({'url': '/?q=path', 'scheme': 'http', 'b': '\\n', 'netloc': 'user:pass@www.google.com:80', 'query': '', 'fragment': 'ref', 'c': 'p'},)\n", "73 505 ({'url': '/', 'scheme': 'http', 'b': '\\n', 'netloc': 'user:pass@www.google.com:80', 'query': 'q=path', 'fragment': 'ref', 'c': 'p'},)\n", "74 421 ({'netloc': 'user:pass@www.google.com:80'},)\n", "75 422 ({'netloc': 'user:pass@www.google.com:80'},)\n", "76 423 ({'netloc': 'user:pass@www.google.com:80'},)\n", "78 506 ({'url': '/', 'scheme': 'http', 'b': '\\n', 'netloc': 'user:pass@www.google.com:80', 'query': 'q=path', 
'fragment': 'ref', 'c': 'p'},)\n", "82 507 ({'url': '/', 'scheme': 'http', 'b': '\\n', 'netloc': 'user:pass@www.google.com:80', 'query': 'q=path', 'fragment': 'ref', 'c': 'p'},)\n", "87 396 ({'url': 'http://user:pass@www.google.com:80/?q=path#ref', 'scheme': ''},)\n", "88 397 ({'url': '/', 'scheme': 'http', 'netloc': 'user:pass@www.google.com:80', 'query': 'q=path', 'fragment': 'ref'},)\n", "89 400 ({'url': '/', 'scheme': 'http', 'netloc': 'user:pass@www.google.com:80', 'query': 'q=path', 'fragment': 'ref'},)\n", "90 401 ({'url': '/', 'scheme': 'http', 'netloc': 'user:pass@www.google.com:80', 'query': 'q=path', 'fragment': 'ref', 'params': ''},)\n", "94 402 ({'url': '/', 'scheme': 'http', 'netloc': 'user:pass@www.google.com:80', 'query': 'q=path', 'fragment': 'ref', 'params': ''},)\n" ] } ], "source": [ "clear_cache()\n", "with Tracer(URLS_X[0]) as tracer:\n", " urlparse(tracer.my_input)\n", "for i, t in enumerate(tracer.trace):\n", " if t[0] in {'call', 'line'} and 'parse.py' in str(t[2]) and t[3]:\n", " print(i, t[2]._t()[1], t[3:])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Notice how the value of `url` changes as the parsing progresses? This violates our assumption that the value assigned to a variable is stable. We next look at how this limitation can be removed." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Grammar Miner with Reassignment" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "One way to uniquely identify different variables is to annotate them with *line numbers* both when they are defined and also when their value changes. 
Consider the code fragment below." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Tracking variable assignment locations" ] }, { "cell_type": "code", "execution_count": 98, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.788855Z", "iopub.status.busy": "2025-01-16T09:56:10.788726Z", "iopub.status.idle": "2025-01-16T09:56:10.790744Z", "shell.execute_reply": "2025-01-16T09:56:10.790447Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def C(cp_1):\n", "    c_2 = cp_1 + '@2'\n", "    c_3 = c_2 + '@3'\n", "    return c_3" ] }, { "cell_type": "code", "execution_count": 99, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.792456Z", "iopub.status.busy": "2025-01-16T09:56:10.792332Z", "iopub.status.idle": "2025-01-16T09:56:10.794102Z", "shell.execute_reply": "2025-01-16T09:56:10.793802Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def B(bp_7):\n", "    b_8 = bp_7 + '@8'\n", "    return C(b_8)" ] }, { "cell_type": "code", "execution_count": 100, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.795751Z", "iopub.status.busy": "2025-01-16T09:56:10.795629Z", "iopub.status.idle": "2025-01-16T09:56:10.797492Z", "shell.execute_reply": "2025-01-16T09:56:10.797251Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def A(ap_12):\n", "    a_13 = ap_12 + '@13'\n", "    a_14 = B(a_13) + '@14'\n", "    a_14 = a_14 + '@15'\n", "    a_13 = a_14 + '@16'\n", "    a_14 = B(a_13) + '@17'\n", "    a_14 = B(a_13) + '@18'" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Notice how each variable is named after the line where it is first defined, and how each assigned value is annotated with the line number at which it was changed.\n", "\n", "Let us run this under the trace." 
] }, { "cell_type": "code", "execution_count": 101, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.799230Z", "iopub.status.busy": "2025-01-16T09:56:10.799097Z", "iopub.status.idle": "2025-01-16T09:56:10.846915Z", "shell.execute_reply": "2025-01-16T09:56:10.846620Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "call 1:A {'ap_12': '____'}\n", "line 2:A {'ap_12': '____'}\n", "line 3:A {'ap_12': '____', 'a_13': '____@13'}\n", "call 1:B {'bp_7': '____@13'}\n", "line 2:B {'bp_7': '____@13'}\n", "line 3:B {'bp_7': '____@13', 'b_8': '____@13@8'}\n", "call 1:C {'cp_1': '____@13@8'}\n", "line 2:C {'cp_1': '____@13@8'}\n", "line 3:C {'cp_1': '____@13@8', 'c_2': '____@13@8@2'}\n", "line 4:C {'cp_1': '____@13@8', 'c_2': '____@13@8@2', 'c_3': '____@13@8@2@3'}\n", "return 4:C {'cp_1': '____@13@8', 'c_2': '____@13@8@2', 'c_3': '____@13@8@2@3'}\n", "return 3:B {'bp_7': '____@13', 'b_8': '____@13@8'}\n", "line 4:A {'ap_12': '____', 'a_13': '____@13', 'a_14': '____@13@8@2@3@14'}\n", "line 5:A {'ap_12': '____', 'a_13': '____@13', 'a_14': '____@13@8@2@3@14@15'}\n", "line 6:A {'ap_12': '____', 'a_13': '____@13@8@2@3@14@15@16', 'a_14': '____@13@8@2@3@14@15'}\n", "call 1:B {'bp_7': '____@13@8@2@3@14@15@16'}\n", "line 2:B {'bp_7': '____@13@8@2@3@14@15@16'}\n", "line 3:B {'bp_7': '____@13@8@2@3@14@15@16', 'b_8': '____@13@8@2@3@14@15@16@8'}\n", "call 1:C {'cp_1': '____@13@8@2@3@14@15@16@8'}\n", "line 2:C {'cp_1': '____@13@8@2@3@14@15@16@8'}\n", "line 3:C {'cp_1': '____@13@8@2@3@14@15@16@8', 'c_2': '____@13@8@2@3@14@15@16@8@2'}\n", "line 4:C {'cp_1': '____@13@8@2@3@14@15@16@8', 'c_2': '____@13@8@2@3@14@15@16@8@2', 'c_3': '____@13@8@2@3@14@15@16@8@2@3'}\n", "return 4:C {'cp_1': '____@13@8@2@3@14@15@16@8', 'c_2': '____@13@8@2@3@14@15@16@8@2', 'c_3': '____@13@8@2@3@14@15@16@8@2@3'}\n", "return 3:B {'bp_7': '____@13@8@2@3@14@15@16', 'b_8': '____@13@8@2@3@14@15@16@8'}\n", "line 7:A {'ap_12': '____', 
'a_13': '____@13@8@2@3@14@15@16', 'a_14': '____@13@8@2@3@14@15@16@8@2@3@17'}\n", "call 1:B {'bp_7': '____@13@8@2@3@14@15@16'}\n", "line 2:B {'bp_7': '____@13@8@2@3@14@15@16'}\n", "line 3:B {'bp_7': '____@13@8@2@3@14@15@16', 'b_8': '____@13@8@2@3@14@15@16@8'}\n", "call 1:C {'cp_1': '____@13@8@2@3@14@15@16@8'}\n", "line 2:C {'cp_1': '____@13@8@2@3@14@15@16@8'}\n", "line 3:C {'cp_1': '____@13@8@2@3@14@15@16@8', 'c_2': '____@13@8@2@3@14@15@16@8@2'}\n", "line 4:C {'cp_1': '____@13@8@2@3@14@15@16@8', 'c_2': '____@13@8@2@3@14@15@16@8@2', 'c_3': '____@13@8@2@3@14@15@16@8@2@3'}\n", "return 4:C {'cp_1': '____@13@8@2@3@14@15@16@8', 'c_2': '____@13@8@2@3@14@15@16@8@2', 'c_3': '____@13@8@2@3@14@15@16@8@2@3'}\n", "return 3:B {'bp_7': '____@13@8@2@3@14@15@16', 'b_8': '____@13@8@2@3@14@15@16@8'}\n", "return 7:A {'ap_12': '____', 'a_13': '____@13@8@2@3@14@15@16', 'a_14': '____@13@8@2@3@14@15@16@8@2@3@18'}\n", "call 102:__exit__ {}\n", "line 105:__exit__ {}\n" ] } ], "source": [ "with Tracer('____') as tracer:\n", " A(tracer.my_input)\n", "\n", "for t in tracer.trace:\n", " print(t[0], \"%d:%s\" % (t[2].line_no, t[2].method), t[3])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Each variable was referenced first as follows:\n", "\n", "* `cp_1` -- *call* `1:C`\n", "* `c_2` -- *line* `3:C` (but the previous event was *line* `2:C`)\n", "* `c_3` -- *line* `4:C` (but the previous event was *line* `3:C`)\n", "* `bp_7` -- *call* `7:B`\n", "* `b_8` -- *line* `9:B` (but the previous event was *line* `8:B`)\n", "* `ap_12` -- *call* `12:A`\n", "* `a_13` -- *line* `14:A` (but the previous event was *line* `13:A`)\n", "* `a_14` -- *line* `15:A` (the previous event was *return* `9:B`. 
However, the previous event in `A()` was *line* `14:A`)\n", "* reassign `a_14` at *15* -- *line* `16:A` (the previous event was *line* `15:A`)\n", "* reassign `a_13` at *16* -- *line* `17:A` (the previous event was *line* `16:A`)\n", "* reassign `a_14` at *17* -- *line* `18:A` (the previous event in `A()` was *line* `17:A`)\n", "* reassign `a_14` at *18* -- *return* `18:A` (the previous event in `A()` was *line* `18:A`)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Our observations are as follows. If the event is a call, the current location is the right one for any new variables being defined. On the other hand, if a variable is referenced for the first time (or reassigned a new value) in any other event, then the right location to consider is the previous location *in the same method invocation*. Next, let us see how we can incorporate this information into variable naming." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We first need a way to track the individual method calls as they are being made. For this we define the class `CallStack`. Each method invocation gets a separate identifier, and when the method call is over, the current identifier is reset to that of the caller." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### CallStack" ] }, { "cell_type": "code", "execution_count": 102, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.848700Z", "iopub.status.busy": "2025-01-16T09:56:10.848583Z", "iopub.status.idle": "2025-01-16T09:56:10.850928Z", "shell.execute_reply": "2025-01-16T09:56:10.850685Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class CallStack:\n", "    def __init__(self, **kwargs):\n", "        self.options(kwargs)\n", "        self.method_id = (START_SYMBOL, 0)\n", "        self.method_register = 0\n", "        self.mstack = [self.method_id]\n", "\n", "    def enter(self, method):\n", "        self.method_register += 1\n", "        self.method_id = (method, self.method_register)\n", "        self.log('call', \"%s%s\" % (self.indent(), str(self)))\n", "        self.mstack.append(self.method_id)\n", "\n", "    def leave(self):\n", "        self.mstack.pop()\n", "        self.log('return', \"%s%s\" % (self.indent(), str(self)))\n", "        self.method_id = self.mstack[-1]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "A few extra functions to make life simpler." 
] }, { "cell_type": "code", "execution_count": 103, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.852716Z", "iopub.status.busy": "2025-01-16T09:56:10.852603Z", "iopub.status.idle": "2025-01-16T09:56:10.855035Z", "shell.execute_reply": "2025-01-16T09:56:10.854771Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class CallStack(CallStack):\n", "    def options(self, kwargs):\n", "        self.log = log_event if kwargs.get('log') else lambda _evt, _var: None\n", "\n", "    def indent(self):\n", "        return len(self.mstack) * \"\\t\"\n", "\n", "    def at(self, n):\n", "        return self.mstack[n]\n", "\n", "    def __len__(self):\n", "        return len(self.mstack) - 1\n", "\n", "    def __str__(self):\n", "        return \"%s:%d\" % self.method_id\n", "\n", "    def __repr__(self):\n", "        return repr(self.method_id)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We also define a convenience function to display a given stack." ] }, { "cell_type": "code", "execution_count": 104, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.856818Z", "iopub.status.busy": "2025-01-16T09:56:10.856683Z", "iopub.status.idle": "2025-01-16T09:56:10.858891Z", "shell.execute_reply": "2025-01-16T09:56:10.858628Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def display_stack(istack):\n", "    def stack_to_tree(stack):\n", "        current, *rest = stack\n", "        if not rest:\n", "            return (repr(current), [])\n", "        return (repr(current), [stack_to_tree(rest)])\n", "    display_tree(stack_to_tree(istack.mstack), graph_attr=lr_graph)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Here is how we can use the `CallStack`." 
] }, { "cell_type": "code", "execution_count": 105, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.860525Z", "iopub.status.busy": "2025-01-16T09:56:10.860408Z", "iopub.status.idle": "2025-01-16T09:56:10.862682Z", "shell.execute_reply": "2025-01-16T09:56:10.862447Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "('<start>', 0)" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cs = CallStack()\n", "display_stack(cs)\n", "cs" ] }, { "cell_type": "code", "execution_count": 106, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.864094Z", "iopub.status.busy": "2025-01-16T09:56:10.863993Z", "iopub.status.idle": "2025-01-16T09:56:10.865996Z", "shell.execute_reply": "2025-01-16T09:56:10.865791Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "('hello', 1)" ] }, "execution_count": 106, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cs.enter('hello')\n", "display_stack(cs)\n", "cs" ] }, { "cell_type": "code", "execution_count": 107, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.867316Z", "iopub.status.busy": "2025-01-16T09:56:10.867222Z", "iopub.status.idle": "2025-01-16T09:56:10.869174Z", "shell.execute_reply": "2025-01-16T09:56:10.868969Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "('world', 2)" ] }, "execution_count": 107, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cs.enter('world')\n", "display_stack(cs)\n", "cs" ] }, { "cell_type": "code", "execution_count": 108, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.870593Z", "iopub.status.busy": "2025-01-16T09:56:10.870496Z", "iopub.status.idle": "2025-01-16T09:56:10.872566Z", "shell.execute_reply": "2025-01-16T09:56:10.872339Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ 
"('hello', 1)" ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cs.leave()\n", "display_stack(cs)\n", "cs" ] }, { "cell_type": "code", "execution_count": 109, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.874042Z", "iopub.status.busy": "2025-01-16T09:56:10.873943Z", "iopub.status.idle": "2025-01-16T09:56:10.875980Z", "shell.execute_reply": "2025-01-16T09:56:10.875760Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "('world', 3)" ] }, "execution_count": 109, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cs.enter('world')\n", "display_stack(cs)\n", "cs" ] }, { "cell_type": "code", "execution_count": 110, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.877331Z", "iopub.status.busy": "2025-01-16T09:56:10.877236Z", "iopub.status.idle": "2025-01-16T09:56:10.879165Z", "shell.execute_reply": "2025-01-16T09:56:10.878959Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "('hello', 1)" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cs.leave()\n", "display_stack(cs)\n", "cs" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "In order to account for variable reassignments, we need to have a more intelligent data structure than a dictionary for storing variables. We first define a simple interface `Vars`. It acts as a container for variables, and is instantiated at `my_assignments`." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Vars" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The `Vars` stores references to variables as they occur during parsing in its internal dictionary `defs`. We initialize the dictionary with the original string." 
] }, { "cell_type": "code", "execution_count": 111, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.880572Z", "iopub.status.busy": "2025-01-16T09:56:10.880474Z", "iopub.status.idle": "2025-01-16T09:56:10.882096Z", "shell.execute_reply": "2025-01-16T09:56:10.881896Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class Vars:\n", "    def __init__(self, original):\n", "        self.defs = {}\n", "        self.my_input = original" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The class needs two methods: `update()`, which takes a dictionary of key-value pairs and stores each of them, and `_set_kv()`, which sets a single key-value pair." ] }, { "cell_type": "code", "execution_count": 112, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.883518Z", "iopub.status.busy": "2025-01-16T09:56:10.883431Z", "iopub.status.idle": "2025-01-16T09:56:10.885544Z", "shell.execute_reply": "2025-01-16T09:56:10.885300Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class Vars(Vars):\n", "    def _set_kv(self, k, v):\n", "        self.defs[k] = v\n", "\n", "    def __setitem__(self, k, v):\n", "        self._set_kv(k, v)\n", "\n", "    def update(self, v):\n", "        for k, v in v.items():\n", "            self._set_kv(k, v)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "`Vars` is a proxy for the internal dictionary. For example, here is how one can use it." 
] }, { "cell_type": "code", "execution_count": 113, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.887153Z", "iopub.status.busy": "2025-01-16T09:56:10.887035Z", "iopub.status.idle": "2025-01-16T09:56:10.888985Z", "shell.execute_reply": "2025-01-16T09:56:10.888755Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "{}" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "v = Vars('')\n", "v.defs" ] }, { "cell_type": "code", "execution_count": 114, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.890453Z", "iopub.status.busy": "2025-01-16T09:56:10.890348Z", "iopub.status.idle": "2025-01-16T09:56:10.892283Z", "shell.execute_reply": "2025-01-16T09:56:10.892060Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "{'x': 'X'}" ] }, "execution_count": 114, "metadata": {}, "output_type": "execute_result" } ], "source": [ "v['x'] = 'X'\n", "v.defs" ] }, { "cell_type": "code", "execution_count": 115, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.893845Z", "iopub.status.busy": "2025-01-16T09:56:10.893728Z", "iopub.status.idle": "2025-01-16T09:56:10.895984Z", "shell.execute_reply": "2025-01-16T09:56:10.895737Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "{'x': 'x', 'y': 'y'}" ] }, "execution_count": 115, "metadata": {}, "output_type": "execute_result" } ], "source": [ "v.update({'x': 'x', 'y': 'y'})\n", "v.defs" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### AssignmentVars" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "We now extend the simple `Vars` to account for variable reassignments. 
For this, we define `AssignmentVars`.\n", "\n", "The idea for detecting reassignments and renaming variables is as follows: We keep track of the previous reassignments to particular variables using `accessed_seq_var`. It maps each variable name to the index of its latest rename. The set `new_vars` contains all new variables that were added during the current event." ] }, { "cell_type": "code", "execution_count": 116, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.897646Z", "iopub.status.busy": "2025-01-16T09:56:10.897533Z", "iopub.status.idle": "2025-01-16T09:56:10.899469Z", "shell.execute_reply": "2025-01-16T09:56:10.899231Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class AssignmentVars(Vars):\n", "    def __init__(self, original):\n", "        super().__init__(original)\n", "        self.accessed_seq_var = {}\n", "        self.var_def_lines = {}\n", "        self.current_event = None\n", "        self.new_vars = set()\n", "        self.method_init()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The `method_init()` method takes care of keeping track of method invocations using records saved in the `call_stack`. `event_locations` is for keeping track of the locations accessed *within this method*. This is used for line number tracking of variable definitions." 
] }, { "cell_type": "code", "execution_count": 117, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.901140Z", "iopub.status.busy": "2025-01-16T09:56:10.901020Z", "iopub.status.idle": "2025-01-16T09:56:10.902873Z", "shell.execute_reply": "2025-01-16T09:56:10.902649Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class AssignmentVars(AssignmentVars):\n", "    def method_init(self):\n", "        self.call_stack = CallStack()\n", "        self.event_locations = {self.call_stack.method_id: []}" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The `update()` is now modified to track the changed line numbers if any, using `var_location_register()`. We reinitialize the `new_vars` after use for the next event." ] }, { "cell_type": "code", "execution_count": 118, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.904452Z", "iopub.status.busy": "2025-01-16T09:56:10.904345Z", "iopub.status.idle": "2025-01-16T09:56:10.906130Z", "shell.execute_reply": "2025-01-16T09:56:10.905909Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class AssignmentVars(AssignmentVars):\n", "    def update(self, v):\n", "        for k, v in v.items():\n", "            self._set_kv(k, v)\n", "        self.var_location_register(self.new_vars)\n", "        self.new_vars = set()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The variable name now incorporates an index of how many reassignments it has gone through, effectively making each reassignment a unique variable." 
] }, { "cell_type": "code", "execution_count": 119, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.907626Z", "iopub.status.busy": "2025-01-16T09:56:10.907532Z", "iopub.status.idle": "2025-01-16T09:56:10.909276Z", "shell.execute_reply": "2025-01-16T09:56:10.909031Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class AssignmentVars(AssignmentVars):\n", "    def var_name(self, var):\n", "        return (var, self.accessed_seq_var[var])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "When storing a variable, we first need to check whether it was previously known. If it was not, we need to initialize its rename count. This is accomplished by `var_access()`." ] }, { "cell_type": "code", "execution_count": 120, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.910792Z", "iopub.status.busy": "2025-01-16T09:56:10.910674Z", "iopub.status.idle": "2025-01-16T09:56:10.912627Z", "shell.execute_reply": "2025-01-16T09:56:10.912393Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class AssignmentVars(AssignmentVars):\n", "    def var_access(self, var):\n", "        if var not in self.accessed_seq_var:\n", "            self.accessed_seq_var[var] = 0\n", "        return self.var_name(var)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "During a variable reassignment, we update the `accessed_seq_var` to reflect the new count." 
] }, { "cell_type": "code", "execution_count": 121, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.914172Z", "iopub.status.busy": "2025-01-16T09:56:10.914074Z", "iopub.status.idle": "2025-01-16T09:56:10.915761Z", "shell.execute_reply": "2025-01-16T09:56:10.915521Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class AssignmentVars(AssignmentVars):\n", "    def var_assign(self, var):\n", "        self.accessed_seq_var[var] += 1\n", "        self.new_vars.add(self.var_name(var))\n", "        return self.var_name(var)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "These methods can be used as follows:" ] }, { "cell_type": "code", "execution_count": 122, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.917490Z", "iopub.status.busy": "2025-01-16T09:56:10.917382Z", "iopub.status.idle": "2025-01-16T09:56:10.919282Z", "shell.execute_reply": "2025-01-16T09:56:10.919055Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "{}" ] }, "execution_count": 122, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sav = AssignmentVars('')\n", "sav.defs" ] }, { "cell_type": "code", "execution_count": 123, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.920782Z", "iopub.status.busy": "2025-01-16T09:56:10.920680Z", "iopub.status.idle": "2025-01-16T09:56:10.922637Z", "shell.execute_reply": "2025-01-16T09:56:10.922421Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "('v1', 0)" ] }, "execution_count": 123, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sav.var_access('v1')" ] }, { "cell_type": "code", "execution_count": 124, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.924061Z", "iopub.status.busy": "2025-01-16T09:56:10.923962Z", "iopub.status.idle": "2025-01-16T09:56:10.926002Z", "shell.execute_reply": 
"2025-01-16T09:56:10.925625Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "('v1', 1)" ] }, "execution_count": 124, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sav.var_assign('v1')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Assigning to it again increments the counter." ] }, { "cell_type": "code", "execution_count": 125, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.927585Z", "iopub.status.busy": "2025-01-16T09:56:10.927477Z", "iopub.status.idle": "2025-01-16T09:56:10.929437Z", "shell.execute_reply": "2025-01-16T09:56:10.929212Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "('v1', 2)" ] }, "execution_count": 125, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sav.var_assign('v1')" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "The core of the logic is in `_set_kv()`. When a variable is being assigned, we get the sequenced variable name `s_var`. If the sequenced variable name was previously unknown in `defs`, this is a new definition: we advance the sequence using `var_assign()` and add the resulting name to `defs`.\n", "\n", "If the variable is previously known, then it is an indication of a possible reassignment. In this case, we compare the stored value with the new one. If the value has not changed, it is not a reassignment, and nothing is recorded.\n", "\n", "If the value has changed, it is a reassignment. We increment the variable usage sequence using `var_assign()`, retrieve the new name, and store the value under the new name in `defs`." 
] }, { "cell_type": "code", "execution_count": 126, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.930970Z", "iopub.status.busy": "2025-01-16T09:56:10.930873Z", "iopub.status.idle": "2025-01-16T09:56:10.932681Z", "shell.execute_reply": "2025-01-16T09:56:10.932471Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class AssignmentVars(AssignmentVars):\n", "    def _set_kv(self, var, val):\n", "        s_var = self.var_access(var)\n", "        if s_var in self.defs and self.defs[s_var] == val:\n", "            return\n", "        self.defs[self.var_assign(var)] = val" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Here is how it can be used. Assigning a variable the first time initializes its counter." ] }, { "cell_type": "code", "execution_count": 127, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.934166Z", "iopub.status.busy": "2025-01-16T09:56:10.934063Z", "iopub.status.idle": "2025-01-16T09:56:10.936071Z", "shell.execute_reply": "2025-01-16T09:56:10.935857Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "{('x', 1): 'X'}" ] }, "execution_count": 127, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sav = AssignmentVars('')\n", "sav['x'] = 'X'\n", "sav.defs" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "If the variable is assigned again with the same value, it is probably not a reassignment." 
] }, { "cell_type": "code", "execution_count": 128, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.937653Z", "iopub.status.busy": "2025-01-16T09:56:10.937540Z", "iopub.status.idle": "2025-01-16T09:56:10.939543Z", "shell.execute_reply": "2025-01-16T09:56:10.939313Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "{('x', 1): 'X'}" ] }, "execution_count": 128, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sav['x'] = 'X'\n", "sav.defs" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "However, if the value changed, it is a reassignment." ] }, { "cell_type": "code", "execution_count": 129, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.941051Z", "iopub.status.busy": "2025-01-16T09:56:10.940947Z", "iopub.status.idle": "2025-01-16T09:56:10.943026Z", "shell.execute_reply": "2025-01-16T09:56:10.942781Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "{('x', 1): 'X', ('x', 2): 'Y'}" ] }, "execution_count": 129, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sav['x'] = 'Y'\n", "sav.defs" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "There is a subtlety here. A child method may be called from the middle of a parent method, and both may use the same variable name with different values. In this case, when the child returns, the parent still has the old variable with its old value in context. Our implementation treats this as a reassignment. This is acceptable, because recording a spurious reassignment is harmless, whereas missing a real one is not. We will discuss later how this can be avoided." 
] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We also define bookkeeping code for `register_event()`, `method_enter()`, and `method_exit()`, the methods responsible for keeping track of the method stack. The basic idea is that each `method_enter()` represents a new method invocation. Hence, it merits a new method id, which is generated from the `method_register` and saved in `method_id`. Since this is a new method, the method stack is extended by one element with this id. In the case of `method_exit()`, we pop the method stack and reset the current `method_id` to the one below it." ] }, { "cell_type": "code", "execution_count": 130, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.944683Z", "iopub.status.busy": "2025-01-16T09:56:10.944528Z", "iopub.status.idle": "2025-01-16T09:56:10.947173Z", "shell.execute_reply": "2025-01-16T09:56:10.946908Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class AssignmentVars(AssignmentVars):\n", "    def method_enter(self, cxt, my_vars):\n", "        self.current_event = 'call'\n", "        self.call_stack.enter(cxt.method)\n", "        self.event_locations[self.call_stack.method_id] = []\n", "        self.register_event(cxt)\n", "        self.update(my_vars)\n", "\n", "    def method_exit(self, cxt, my_vars):\n", "        self.current_event = 'return'\n", "        self.register_event(cxt)\n", "        self.update(my_vars)\n", "        self.call_stack.leave()\n", "\n", "    def method_statement(self, cxt, my_vars):\n", "        self.current_event = 'line'\n", "        self.register_event(cxt)\n", "        self.update(my_vars)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "For each of the method events, we also register the event using `register_event()`, which keeps track of the line numbers that were referenced in *this* method."
] }, { "cell_type": "code", "execution_count": 131, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.948729Z", "iopub.status.busy": "2025-01-16T09:56:10.948615Z", "iopub.status.idle": "2025-01-16T09:56:10.950429Z", "shell.execute_reply": "2025-01-16T09:56:10.950152Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class AssignmentVars(AssignmentVars):\n", "    def register_event(self, cxt):\n", "        self.event_locations[self.call_stack.method_id].append(cxt.line_no)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The method `var_location_register()` records the locations of newly added variables. The definition location of variables in a `call` is the *current* location. However, for a `line` or `return` event, it is the location of the previous event in the current method." ] }, { "cell_type": "code", "execution_count": 132, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.951921Z", "iopub.status.busy": "2025-01-16T09:56:10.951803Z", "iopub.status.idle": "2025-01-16T09:56:10.954259Z", "shell.execute_reply": "2025-01-16T09:56:10.953921Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class AssignmentVars(AssignmentVars):\n", "    def var_location_register(self, my_vars):\n", "        def loc(mid):\n", "            if self.current_event == 'call':\n", "                return self.event_locations[mid][-1]\n", "            elif self.current_event == 'line':\n", "                return self.event_locations[mid][-2]\n", "            elif self.current_event == 'return':\n", "                return self.event_locations[mid][-2]\n", "            else:\n", "                assert False\n", "\n", "        my_loc = loc(self.call_stack.method_id)\n", "        for var in my_vars:\n", "            self.var_def_lines[var] = my_loc" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We define `defined_vars()`, which returns the names of variables annotated with the line numbers at which they were defined."
] }, { "cell_type": "code", "execution_count": 133, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.955873Z", "iopub.status.busy": "2025-01-16T09:56:10.955748Z", "iopub.status.idle": "2025-01-16T09:56:10.957881Z", "shell.execute_reply": "2025-01-16T09:56:10.957627Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class AssignmentVars(AssignmentVars):\n", "    def defined_vars(self, formatted=True):\n", "        def fmt(k):\n", "            v = (k[0], self.var_def_lines[k])\n", "            return \"%s@%s\" % v if formatted else v\n", "\n", "        return [(fmt(k), v) for k, v in self.defs.items()]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Similar to `defined_vars()`, we define `seq_vars()`, which additionally annotates each variable with the number of times it has been assigned." ] }, { "cell_type": "code", "execution_count": 134, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.959393Z", "iopub.status.busy": "2025-01-16T09:56:10.959294Z", "iopub.status.idle": "2025-01-16T09:56:10.961272Z", "shell.execute_reply": "2025-01-16T09:56:10.961056Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class AssignmentVars(AssignmentVars):\n", "    def seq_vars(self, formatted=True):\n", "        def fmt(k):\n", "            v = (k[0], self.var_def_lines[k], k[1])\n", "            return \"%s@%s:%s\" % v if formatted else v\n", "\n", "        return {fmt(k): v for k, v in self.defs.items()}" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### AssignmentTracker" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The `AssignmentTracker` keeps the assignment definitions using the `AssignmentVars` we defined previously."
] }, { "cell_type": "code", "execution_count": 135, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.962621Z", "iopub.status.busy": "2025-01-16T09:56:10.962522Z", "iopub.status.idle": "2025-01-16T09:56:10.964463Z", "shell.execute_reply": "2025-01-16T09:56:10.964248Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class AssignmentTracker(DefineTracker):\n", "    def __init__(self, my_input, trace, **kwargs):\n", "        self.options(kwargs)\n", "        self.my_input = my_input\n", "\n", "        self.my_assignments = self.create_assignments(my_input)\n", "\n", "        self.trace = trace\n", "        self.process()\n", "\n", "    def create_assignments(self, *args):\n", "        return AssignmentVars(*args)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "To fine-tune the process, we define an optional parameter called `track_return`. While tracing a method return, Python produces a virtual variable that contains the returned value. 
If `track_return` is set, we capture this value as a variable.\n", "\n", "* `track_return` -- if true, add a *virtual variable* to the Vars representing the return value" ] }, { "cell_type": "code", "execution_count": 136, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.965904Z", "iopub.status.busy": "2025-01-16T09:56:10.965809Z", "iopub.status.idle": "2025-01-16T09:56:10.967796Z", "shell.execute_reply": "2025-01-16T09:56:10.967507Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class AssignmentTracker(AssignmentTracker):\n", "    def options(self, kwargs):\n", "        self.track_return = kwargs.get('track_return', False)\n", "        super().options(kwargs)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "A trace can contain different kinds of events: `call` when a function is entered, `return` when the function returns, `exception` when an exception is thrown, and `line` when a statement is executed.\n", "\n", "The previous `Tracker` was too simplistic in that it did not distinguish between the different events. We rectify that and define `on_call()`, `on_return()`, and `on_line()`, which get called on their corresponding events.\n", "\n", "Note that `on_return()` also calls `on_line()`. The reason is that Python invokes the trace function *before* the corresponding line is executed. Hence, `on_return()` is effectively called with the bindings produced by the previous statement, and our processing operates on values bound by that statement. Calling `on_line()` here is therefore appropriate, as it gives the event handler a chance to work on the previous binding."
] }, { "cell_type": "code", "execution_count": 137, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.969469Z", "iopub.status.busy": "2025-01-16T09:56:10.969329Z", "iopub.status.idle": "2025-01-16T09:56:10.972190Z", "shell.execute_reply": "2025-01-16T09:56:10.971864Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class AssignmentTracker(AssignmentTracker):\n", "    def on_call(self, arg, cxt, my_vars):\n", "        my_vars = cxt.parameters(my_vars)\n", "        self.my_assignments.method_enter(cxt, self.fragments(my_vars))\n", "\n", "    def on_line(self, arg, cxt, my_vars):\n", "        self.my_assignments.method_statement(cxt, self.fragments(my_vars))\n", "\n", "    def on_return(self, arg, cxt, my_vars):\n", "        self.on_line(arg, cxt, my_vars)\n", "        my_vars = {'<-%s' % cxt.method: arg} if self.track_return else {}\n", "        self.my_assignments.method_exit(cxt, my_vars)\n", "\n", "    def on_exception(self, arg, cxt, my_vars):\n", "        return\n", "\n", "    def track_event(self, event, arg, cxt, my_vars):\n", "        self.current_event = event\n", "        dispatch = {\n", "            'call': self.on_call,\n", "            'return': self.on_return,\n", "            'line': self.on_line,\n", "            'exception': self.on_exception\n", "        }\n", "        dispatch[event](arg, cxt, my_vars)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We can now use `AssignmentTracker` to track the different variables. To verify that our variable line number inference works, we recover definitions from the functions `A()`, `B()` and `C()` (with data annotations removed so that the input fragments are correctly identified). 
" ] }, { "cell_type": "code", "execution_count": 138, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.973866Z", "iopub.status.busy": "2025-01-16T09:56:10.973745Z", "iopub.status.idle": "2025-01-16T09:56:10.975463Z", "shell.execute_reply": "2025-01-16T09:56:10.975195Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def C(cp_1): # type: ignore\n", " c_2 = cp_1\n", " c_3 = c_2\n", " return c_3" ] }, { "cell_type": "code", "execution_count": 139, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.977232Z", "iopub.status.busy": "2025-01-16T09:56:10.977079Z", "iopub.status.idle": "2025-01-16T09:56:10.978786Z", "shell.execute_reply": "2025-01-16T09:56:10.978560Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def B(bp_7): # type: ignore\n", " b_8 = bp_7\n", " return C(b_8)" ] }, { "cell_type": "code", "execution_count": 140, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.980297Z", "iopub.status.busy": "2025-01-16T09:56:10.980196Z", "iopub.status.idle": "2025-01-16T09:56:10.981901Z", "shell.execute_reply": "2025-01-16T09:56:10.981596Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "def A(ap_12): # type: ignore\n", " a_13 = ap_12\n", " a_14 = B(a_13)\n", " a_14 = a_14\n", " a_13 = a_14\n", " a_14 = B(a_13)\n", " a_14 = B(a_14)[3:]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Running `A()` with sufficient input." 
] }, { "cell_type": "code", "execution_count": 141, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:10.983501Z", "iopub.status.busy": "2025-01-16T09:56:10.983385Z", "iopub.status.idle": "2025-01-16T09:56:11.031783Z", "shell.execute_reply": "2025-01-16T09:56:11.031477Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ap_12@1:1 = '---xxx'\n", "a_13@2:1 = '---xxx'\n", "bp_7@1:1 = '---xxx'\n", "b_8@2:1 = '---xxx'\n", "cp_1@1:1 = '---xxx'\n", "c_2@2:1 = '---xxx'\n", "c_3@3:1 = '---xxx'\n", "a_14@3:1 = '---xxx'\n", "a_14@7:2 = 'xxx'\n", "\n", "ap_12@1 = '---xxx'\n", "a_13@2 = '---xxx'\n", "bp_7@1 = '---xxx'\n", "b_8@2 = '---xxx'\n", "cp_1@1 = '---xxx'\n", "c_2@2 = '---xxx'\n", "c_3@3 = '---xxx'\n", "a_14@3 = '---xxx'\n", "a_14@7 = 'xxx'\n" ] } ], "source": [ "with Tracer('---xxx') as tracer:\n", "    A(tracer.my_input)\n", "tracker = AssignmentTracker(tracer.my_input, tracer.trace, log=True)\n", "for k, v in tracker.my_assignments.seq_vars().items():\n", "    print(k, '=', repr(v))\n", "print()\n", "for k, v in tracker.my_assignments.defined_vars(formatted=True):\n", "    print(k, '=', repr(v))" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "As can be seen, the line numbers are now correctly identified for each variable." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Let us try retrieving the assignments for a real-world example."
] }, { "cell_type": "code", "execution_count": 142, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:11.033534Z", "iopub.status.busy": "2025-01-16T09:56:11.033412Z", "iopub.status.idle": "2025-01-16T09:56:11.160882Z", "shell.execute_reply": "2025-01-16T09:56:11.160489Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "url@374 = 'http://user:pass@www.google.com:80/?q=path#ref'\n", "url@492 = '//user:pass@www.google.com:80/?q=path#ref'\n", "scheme@492 = 'http'\n", "url@494 = '/?q=path#ref'\n", "netloc@494 = 'user:pass@www.google.com:80'\n", "url@502 = '/?q=path'\n", "fragment@502 = 'ref'\n", "query@504 = 'q=path'\n", "url@395 = 'http://user:pass@www.google.com:80/?q=path#ref'\n", "\n", "url@374 = 'https://www.cispa.saarland:80/'\n", "url@492 = '//www.cispa.saarland:80/'\n", "scheme@492 = 'https'\n", "netloc@494 = 'www.cispa.saarland:80'\n", "url@395 = 'https://www.cispa.saarland:80/'\n", "\n", "url@374 = 'http://www.fuzzingbook.org/#News'\n", "url@492 = '//www.fuzzingbook.org/#News'\n", "scheme@492 = 'http'\n", "url@494 = '/#News'\n", "netloc@494 = 'www.fuzzingbook.org'\n", "fragment@502 = 'News'\n", "url@395 = 'http://www.fuzzingbook.org/#News'\n", "\n", "url@374 = 'ftp://freebsd.org/releases/5.8'\n", "url@492 = '//freebsd.org/releases/5.8'\n", "scheme@492 = 'ftp'\n", "url@494 = '/releases/5.8'\n", "netloc@494 = 'freebsd.org'\n", "url@395 = 'ftp://freebsd.org/releases/5.8'\n", "url@396 = '/releases/5.8'\n", "\n" ] } ], "source": [ "traces = []\n", "for inputstr in URLS_X:\n", "    clear_cache()\n", "    with Tracer(inputstr, files=['urllib/parse.py']) as tracer:\n", "        urlparse(tracer.my_input)\n", "    traces.append((tracer.my_input, tracer.trace))\n", "\n", "    tracker = AssignmentTracker(tracer.my_input, tracer.trace, log=True)\n", "    for k, v in tracker.my_assignments.defined_vars():\n", "        print(k, '=', repr(v))\n", "    print()" ] }, { "cell_type": "markdown", "metadata": { "slideshow": {
"slide_type": "subslide" } }, "source": [ "The line numbers of variables can be verified from the source code of [urllib/parse.py](https://github.com/python/cpython/blob/3.6/Lib/urllib/parse.py)." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Recovering a Derivation Tree" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Does handling variable reassignments help with our URL examples? We look at these next." ] }, { "cell_type": "code", "execution_count": 143, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:11.162997Z", "iopub.status.busy": "2025-01-16T09:56:11.162838Z", "iopub.status.idle": "2025-01-16T09:56:11.165204Z", "shell.execute_reply": "2025-01-16T09:56:11.164931Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class TreeMiner(TreeMiner):\n", "    def get_derivation_tree(self):\n", "        tree = (START_SYMBOL, [(self.my_input, [])])\n", "        for var, value in self.my_assignments:\n", "            self.log(0, \"%s=%s\" % (var, repr(value)))\n", "            self.apply_new_definition(tree, var, value)\n", "        return tree" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Example 1: Recovering URL Derivation Tree" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "First, we obtain the derivation tree of URL 1." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "##### URL 1 derivation tree" ] }, { "cell_type": "code", "execution_count": 144, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:11.167371Z", "iopub.status.busy": "2025-01-16T09:56:11.167191Z", "iopub.status.idle": "2025-01-16T09:56:11.632543Z", "shell.execute_reply": "2025-01-16T09:56:11.632137Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/svg+xml": [ 
"\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 144, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clear_cache()\n", "with Tracer(URLS_X[0], files=['urllib/parse.py']) as tracer:\n", "    urlparse(tracer.my_input)\n", "sm = AssignmentTracker(tracer.my_input, tracer.trace)\n", "dt = TreeMiner(tracer.my_input, sm.my_assignments.defined_vars())\n", "display_tree(dt.tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Next, we obtain the derivation tree of URL 4." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "##### URL 4 derivation tree" ] }, { "cell_type": "code", "execution_count": 145, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:11.634700Z", "iopub.status.busy": "2025-01-16T09:56:11.634541Z", "iopub.status.idle": "2025-01-16T09:56:12.124342Z", "shell.execute_reply": "2025-01-16T09:56:12.123937Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 145, "metadata": {}, "output_type": "execute_result" } ], "source": [ "clear_cache()\n", "with Tracer(URLS_X[-1], files=['urllib/parse.py']) as tracer:\n", "    urlparse(tracer.my_input)\n", "sm = AssignmentTracker(tracer.my_input, tracer.trace)\n", "dt = TreeMiner(tracer.my_input, sm.my_assignments.defined_vars())\n", "display_tree(dt.tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The derivation trees seem to belong to the same grammar. Hence, we obtain the grammar for the complete set. First, we update `recover_grammar()` to use `AssignmentTracker`."
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Recover Grammar" ] }, { "cell_type": "code", "execution_count": 146, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:12.126545Z", "iopub.status.busy": "2025-01-16T09:56:12.126379Z", "iopub.status.idle": "2025-01-16T09:56:12.129045Z", "shell.execute_reply": "2025-01-16T09:56:12.128750Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class GrammarMiner(GrammarMiner):\n", "    def update_grammar(self, inputstr, trace):\n", "        at = self.create_tracker(inputstr, trace)\n", "        dt = self.create_tree_miner(inputstr, at.my_assignments.defined_vars())\n", "        self.add_tree(dt)\n", "        return self.grammar\n", "\n", "    def create_tracker(self, *args):\n", "        return AssignmentTracker(*args)\n", "\n", "    def create_tree_miner(self, *args):\n", "        return TreeMiner(*args)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Next, we use the modified `recover_grammar()` on derivation trees obtained from URLs." ] }, { "cell_type": "code", "execution_count": 147, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:12.130838Z", "iopub.status.busy": "2025-01-16T09:56:12.130721Z", "iopub.status.idle": "2025-01-16T09:56:12.284121Z", "shell.execute_reply": "2025-01-16T09:56:12.283722Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "url_grammar = recover_grammar(url_parse, URLS_X, files=['urllib/parse.py'])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "The recovered grammar is below."
] }, { "cell_type": "code", "execution_count": 148, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:12.286382Z", "iopub.status.busy": "2025-01-16T09:56:12.286238Z", "iopub.status.idle": "2025-01-16T09:56:12.300872Z", "shell.execute_reply": "2025-01-16T09:56:12.300190Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "start\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "url@374\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "scheme@492\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "url@492\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "netloc@494\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "url@494\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "url@502\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "query@504\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "fragment@502\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" 
}, { "name": "stdout", "output_type": "stream", "text": [ "url@396\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "syntax_diagram(url_grammar)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Let us fuzz a little to see if the produced values are sane." ] }, { "cell_type": "code", "execution_count": 149, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:12.302919Z", "iopub.status.busy": "2025-01-16T09:56:12.302818Z", "iopub.status.idle": "2025-01-16T09:56:12.306883Z", "shell.execute_reply": "2025-01-16T09:56:12.306057Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "http://freebsd.org/\n", "ftp://freebsd.org/releases/5.8\n", "http://www.cispa.saarland:80/\n", "ftp://freebsd.org/releases/5.8\n", "https://user:pass@www.google.com:80/releases/5.8\n", "https://freebsd.org/\n", "ftp://www.cispa.saarland:80/?q=path#News\n", "http://www.fuzzingbook.org/\n", "https://www.cispa.saarland:80/\n", "ftp://user:pass@www.google.com:80/\n" ] } ], "source": [ "f = GrammarFuzzer(url_grammar)\n", "for _ in range(10):\n", " print(f.fuzz())" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Our modifications do seem to help. Next, we check whether we can still retrieve the grammar for inventory." 
] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Example 2: Recovering Inventory Grammar" ] }, { "cell_type": "code", "execution_count": 150, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:12.308945Z", "iopub.status.busy": "2025-01-16T09:56:12.308845Z", "iopub.status.idle": "2025-01-16T09:56:12.315560Z", "shell.execute_reply": "2025-01-16T09:56:12.315298Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "inventory_grammar = recover_grammar(process_vehicle, VEHICLES)" ] }, { "cell_type": "code", "execution_count": 151, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:12.317346Z", "iopub.status.busy": "2025-01-16T09:56:12.317256Z", "iopub.status.idle": "2025-01-16T09:56:12.327208Z", "shell.execute_reply": "2025-01-16T09:56:12.326878Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "start\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "vehicle@29\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "year@30\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "kind@30\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "company@30\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "model@30\n" ] }, { "data": { "image/svg+xml": [ "" ], "text/plain": [ "" ] }, "metadata": {}, 
"output_type": "display_data" } ], "source": [ "syntax_diagram(inventory_grammar)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Using fuzzing to produce values from the grammar." ] }, { "cell_type": "code", "execution_count": 152, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:12.329287Z", "iopub.status.busy": "2025-01-16T09:56:12.329154Z", "iopub.status.idle": "2025-01-16T09:56:12.332261Z", "shell.execute_reply": "2025-01-16T09:56:12.331945Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1997,van,Chevy,E350\n", "1999,van,Mercury,E350\n", "2000,van,Chevy,Venture\n", "2000,van,Ford,E350\n", "1997,van,Mercury,Cougar\n", "1997,car,Ford,E350\n", "1997,car,Mercury,Venture\n", "1997,car,Mercury,E350\n", "2000,van,Mercury,Cougar\n", "1997,car,Chevy,Venture\n" ] } ], "source": [ "f = GrammarFuzzer(inventory_grammar)\n", "for _ in range(10):\n", " print(f.fuzz())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### Problems with the Grammar Miner with Reassignment" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "One of the problems with our grammar miner is that it doesn't yet account for the current context. That is, when replacing, a variable can replace tokens that it does not have access to (and hence, it is not a fragment of). Consider this example." 
] }, { "cell_type": "code", "execution_count": 153, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:12.335001Z", "iopub.status.busy": "2025-01-16T09:56:12.334830Z", "iopub.status.idle": "2025-01-16T09:56:12.861857Z", "shell.execute_reply": "2025-01-16T09:56:12.861360Z" }, "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "execution_count": 153, "metadata": {}, "output_type": "execute_result" } ], "source": [ "with Tracer(INVENTORY) as tracer:\n", " process_inventory(tracer.my_input)\n", "sm = AssignmentTracker(tracer.my_input, tracer.trace)\n", "dt = TreeMiner(tracer.my_input, sm.my_assignments.defined_vars())\n", "display_tree(dt.tree, graph_attr=lr_graph)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "As can be seen, the derivation tree obtained is not quite what we expected. The issue is easily seen if we enable logging in the `TreeMiner`." ] }, { "cell_type": "code", "execution_count": 154, "metadata": { "execution": { "iopub.execute_input": "2025-01-16T09:56:12.863874Z", "iopub.status.busy": "2025-01-16T09:56:12.863729Z", "iopub.status.idle": "2025-01-16T09:56:12.868178Z", "shell.execute_reply": "2025-01-16T09:56:12.867916Z" }, "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " inventory@22='1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture'\n", "\t - Node: \t\t? (:'1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture')\n", "\t\t -> [0] '1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture'\n", "\t\t > ['']\n", " vehicle@24='1997,van,Ford,E350'\n", "\t - Node: \t\t? (:'1997,van,Ford,E350')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? 
(:'1997,van,Ford,E350')\n", "\t\t -> [0] '1997,van,Ford,E350\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture'\n", "\t\t > ['', '\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture']\n", " year@30='1997'\n", "\t - Node: \t\t? (:'1997')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'1997')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'1997')\n", "\t\t -> [0] '1997,van,Ford,E350'\n", "\t\t > ['', ',van,Ford,E350']\n", " kind@30='van'\n", "\t - Node: \t\t? (:'van')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'van')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'van')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'van')\n", "\t\t -> [0] '1997'\n", "\t\t -> [1] ',van,Ford,E350'\n", "\t\t > [',', '', ',Ford,E350']\n", " company@30='Ford'\n", "\t - Node: \t\t? (:'Ford')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'Ford')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'Ford')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'Ford')\n", "\t\t -> [0] '1997'\n", "\t\t -> [1] ','\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'Ford')\n", "\t\t -> [0] 'van'\n", "\t\t -> [3] ',Ford,E350'\n", "\t\t > [',', '', ',E350']\n", " model@30='E350'\n", "\t - Node: \t\t? (:'E350')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'E350')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'E350')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'E350')\n", "\t\t -> [0] '1997'\n", "\t\t -> [1] ','\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'E350')\n", "\t\t -> [0] 'van'\n", "\t\t -> [3] ','\n", "\t\t -> [4] ''\n", "\t - Node: \t\t? (:'E350')\n", "\t\t -> [0] 'Ford'\n", "\t\t -> [5] ',E350'\n", "\t\t > [',', '']\n", " vehicle@24='2000,car,Mercury,Cougar'\n", "\t - Node: \t\t? (:'2000,car,Mercury,Cougar')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'2000,car,Mercury,Cougar')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'2000,car,Mercury,Cougar')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'2000,car,Mercury,Cougar')\n", "\t\t -> [0] '1997'\n", "\t\t -> [1] ','\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? 
(:'2000,car,Mercury,Cougar')\n", "\t\t -> [0] 'van'\n", "\t\t -> [3] ','\n", "\t\t -> [4] ''\n", "\t - Node: \t\t? (:'2000,car,Mercury,Cougar')\n", "\t\t -> [0] 'Ford'\n", "\t\t -> [5] ','\n", "\t\t -> [6] ''\n", "\t - Node: \t\t? (:'2000,car,Mercury,Cougar')\n", "\t\t -> [0] 'E350'\n", "\t\t -> [1] '\\n2000,car,Mercury,Cougar\\n1999,car,Chevy,Venture'\n", "\t\t > ['\\n', '', '\\n1999,car,Chevy,Venture']\n", " year@30='2000'\n", "\t - Node: \t\t? (:'2000')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'2000')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'2000')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'2000')\n", "\t\t -> [0] '1997'\n", "\t\t -> [1] ','\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'2000')\n", "\t\t -> [0] 'van'\n", "\t\t -> [3] ','\n", "\t\t -> [4] ''\n", "\t - Node: \t\t? (:'2000')\n", "\t\t -> [0] 'Ford'\n", "\t\t -> [5] ','\n", "\t\t -> [6] ''\n", "\t - Node: \t\t? (:'2000')\n", "\t\t -> [0] 'E350'\n", "\t\t -> [1] '\\n'\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'2000')\n", "\t\t -> [0] '2000,car,Mercury,Cougar'\n", "\t\t > ['', ',car,Mercury,Cougar']\n", " kind@30='car'\n", "\t - Node: \t\t? (:'car')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'car')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'car')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'car')\n", "\t\t -> [0] '1997'\n", "\t\t -> [1] ','\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'car')\n", "\t\t -> [0] 'van'\n", "\t\t -> [3] ','\n", "\t\t -> [4] ''\n", "\t - Node: \t\t? (:'car')\n", "\t\t -> [0] 'Ford'\n", "\t\t -> [5] ','\n", "\t\t -> [6] ''\n", "\t - Node: \t\t? (:'car')\n", "\t\t -> [0] 'E350'\n", "\t\t -> [1] '\\n'\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'car')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'car')\n", "\t\t -> [0] '2000'\n", "\t\t -> [1] ',car,Mercury,Cougar'\n", "\t\t > [',', '', ',Mercury,Cougar']\n", " company@30='Mercury'\n", "\t - Node: \t\t? (:'Mercury')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'Mercury')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? 
(:'Mercury')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'Mercury')\n", "\t\t -> [0] '1997'\n", "\t\t -> [1] ','\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'Mercury')\n", "\t\t -> [0] 'van'\n", "\t\t -> [3] ','\n", "\t\t -> [4] ''\n", "\t - Node: \t\t? (:'Mercury')\n", "\t\t -> [0] 'Ford'\n", "\t\t -> [5] ','\n", "\t\t -> [6] ''\n", "\t - Node: \t\t? (:'Mercury')\n", "\t\t -> [0] 'E350'\n", "\t\t -> [1] '\\n'\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'Mercury')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'Mercury')\n", "\t\t -> [0] '2000'\n", "\t\t -> [1] ','\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'Mercury')\n", "\t\t -> [0] 'car'\n", "\t\t -> [3] ',Mercury,Cougar'\n", "\t\t > [',', '', ',Cougar']\n", " model@30='Cougar'\n", "\t - Node: \t\t? (:'Cougar')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'Cougar')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'Cougar')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'Cougar')\n", "\t\t -> [0] '1997'\n", "\t\t -> [1] ','\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'Cougar')\n", "\t\t -> [0] 'van'\n", "\t\t -> [3] ','\n", "\t\t -> [4] ''\n", "\t - Node: \t\t? (:'Cougar')\n", "\t\t -> [0] 'Ford'\n", "\t\t -> [5] ','\n", "\t\t -> [6] ''\n", "\t - Node: \t\t? (:'Cougar')\n", "\t\t -> [0] 'E350'\n", "\t\t -> [1] '\\n'\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'Cougar')\n", "\t\t -> [0] ''\n", "\t - Node: \t\t? (:'Cougar')\n", "\t\t -> [0] '2000'\n", "\t\t -> [1] ','\n", "\t\t -> [2] ''\n", "\t - Node: \t\t? (:'Cougar')\n", "\t\t -> [0] 'car'\n", "\t\t -> [3] ','\n", "\t\t -> [4] ''\n", "\t - Node: \t\t? (:'Cougar')\n", "\t\t -> [0] 'Mercury'\n", "\t\t -> [5] ',Cougar'\n", "\t\t > [',', '']\n", " vehicle@24='1999,car,Chevy,Venture'\n", "\t - Node: