{ "cells": [ { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false }, "slideshow": { "slide_type": "slide" } }, "source": [ "# Fuzzing with Input Fragments\n", "\n", "In this chapter, we introduce how to combine [parsing](Parser.ipynb) and [fuzzing](GrammarFuzzer.ipynb). This allows to _mutate_ existing inputs while preserving syntactical correctness, and to _reuse_ fragments from existing inputs while generating new ones. The combination of parsing and fuzzing, as demonstrated in this chapter, has been highly successful in practice: The _LangFuzz_ fuzzer for JavaScript has found more than 2,600 bugs in JavaScript interpreters this way." ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false }, "slideshow": { "slide_type": "subslide" } }, "source": [ "**Prerequisites**\n", "\n", "* You should know [how to use a parser](Parser.ipynb).\n", "* You should have read the [chapter on efficient grammar fuzzing](GrammarFuzzer.ipynb)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "import fuzzingbook_utils" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from Parser import PEGParser\n", "from GrammarFuzzer import GrammarFuzzer" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" }, "toc-hr-collapsed": true }, "source": [ "## Recombining Parsed Inputs" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Recombining parsed inputs was pioneered by _Langfuzz_ \\cite{Holler2012}. The main challenge is that program inputs often carry additional constraints beyond what is described by the syntax. For example, in Java, one needs to declare a variable (using a specific format for declaration) before it can be used in an expression. This restriction is not captured in the _Java CFG_. Checking for type correctness is another example for additional restrictions carried by program definitions." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "When fuzzing compilers and interpreters, naive generation of programs using the language *CFG* often fails to achieve significant deeper coverage due to these kinds of checks external to the grammar. Holler et al. suggests using pre-existing valid code fragments to get around these restrictions. The idea is that the pre-existing valid code fragments already conform to the restrictions external to the grammar, and can often provide a means to evade validity checks." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### A Grammar-based Mutational Fuzzer\n", "\n", "The idea is that one can treat the derivation tree of a preexisting program as the scaffolding, poke holes in it, and patch it with generated inputs from our grammar. Given below is a grammar for a language that allows assignment of variables." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "import string" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from Grammars import crange, syntax_diagram" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "VAR_GRAMMAR = {\n", " '': [''],\n", " '': [';', ''],\n", " '': [''],\n", " '': ['='],\n", " '': [''],\n", " '': ['', ''],\n", " '': list(string.ascii_letters),\n", " '': ['+', '-', ''],\n", " '': ['*', '/', ''],\n", " '':\n", " ['+', '-', '()', '', ''],\n", " '': ['.', ''],\n", " '': ['', ''],\n", " '': crange('0', '9')\n", "}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "syntax_diagram(VAR_GRAMMAR)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Let us use our new grammar to parse a program." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "mystring = 'va=10;vb=20'" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def hl_predicate(_d, _n, symbol, _a): return symbol in {\n", " '', ''}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from Parser import PEGParser, highlight_node\n", "from GrammarFuzzer import display_tree" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "parser = PEGParser(VAR_GRAMMAR)\n", "for tree in parser.parse(mystring):\n", " display_tree(tree, node_attr=highlight_node(hl_predicate))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "As can be seen from the above example, our grammar is rather detailed. So we need to define our token nodes, which correspond to the *red* nodes above." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "VAR_TOKENS = {'', ''}" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "These tokens are pruned using the `prune_tree` method that we mentioned previously.\n", "Here is a slightly more complex program to parse, but with the tree pruned using tokens:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "mystring = 'avar=1.3;bvar=avar-3*(4+300)'\n", "parser = PEGParser(VAR_GRAMMAR, tokens=VAR_TOKENS)\n", "for tree in parser.parse(mystring):\n", " display_tree(tree, node_attr=highlight_node(hl_predicate))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "We develop a `LangFuzzer` class that generates recombined inputs. To apply the _Langfuzz_ approach, we need a few parsed strings." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "mystrings = [\n", " 'abc=12+(3+3.3)',\n", " 'a=1;b=2;c=a+b',\n", " 'avar=1.3;bvar=avar-3*(4+300)',\n", " 'a=1.3;b=a-1*(4+3+(2/a))',\n", " 'a=10;b=20;c=34;d=-b+(b*b-4*a*c)/(2*a)',\n", " 'x=10;y=20;z=(x+y)*(x-y)',\n", " 'x=23;y=51;z=x*x-y*y',\n", "]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "We recurse through any given tree, collecting parsed fragments corresponding to each nonterminal. Further, we also name each node so that we can address each node separately." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from Fuzzer import Fuzzer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class LangFuzzer(Fuzzer):\n", " def __init__(self, parser):\n", " self.parser = parser\n", " self.fragments = {k: [] for k in self.parser.cgrammar}\n", "\n", " def traverse_tree(self, node):\n", " counter = 1\n", " nodes = {}\n", "\n", " def helper(node, id):\n", " nonlocal counter\n", " name, children = node\n", " new_children = []\n", " nodes[id] = node\n", " for child in children:\n", " counter += 1\n", " new_children.append(helper(child, counter))\n", " return name, new_children, id\n", "\n", " return helper(node, counter), nodes\n", "\n", " def fragment(self, strings):\n", " self.trees = []\n", " for string in strings:\n", " for tree in self.parser.parse(string):\n", " tree, nodes = self.traverse_tree(tree)\n", " self.trees.append((tree, nodes))\n", " for node in nodes:\n", " symbol = nodes[node][0]\n", " if symbol in self.fragments:\n", " self.fragments[symbol].append(nodes[node])\n", " return self.fragments" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We thus obtain all valid fragments from our parsed strings." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "lf = LangFuzzer(PEGParser(VAR_GRAMMAR, tokens=VAR_TOKENS))\n", "fragments = lf.fragment(mystrings)\n", "for key in fragments:\n", " print(\"%s: %d\" % (key, len(fragments[key])))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "All that remains is to actually find a place to poke a hole using `candidate()`, and patch that hole using `generate_new_tree()`. We will explain how to do this next.\n", "\n", "But before that, we update our `initialization` method with a call to `fragment()." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "import random" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class LangFuzzer(LangFuzzer):\n", " def __init__(self, parser, strings):\n", " self.parser = parser\n", " self.fragments = {k: [] for k in self.parser.cgrammar}\n", " self.fragment(strings)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Candidate\n", "`LangFuzzer` accepts a list of strings, which are stored as derivation trees in the object.\n", "\n", "The method `candidate()` chooses one of the derivation trees randomly as the template, and identifies a node such that it can be replaced by another node that is different from itself. That is, it chooses a node such that, if the non-terminal name of the node is `node_type`, there is at least one other entry in `fragment[node_type])` " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class LangFuzzer(LangFuzzer):\n", " def candidate(self):\n", " tree, nodes = random.choice(self.trees)\n", " interesting_nodes = [\n", " n for n in nodes if nodes[n][0] in self.fragments\n", " and len(self.fragments[nodes[n][0]]) > 1\n", " ]\n", " node = random.choice(interesting_nodes)\n", " return tree, node" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Here is how it is used -- the *red* node is the node chosen." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "random.seed(1)\n", "lf = LangFuzzer(PEGParser(VAR_GRAMMAR, tokens=VAR_TOKENS), mystrings)\n", "tree, node = lf.candidate()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def hl_predicate(_d, nid, _s, _a): return nid in {node}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "display_tree(tree, node_attr=highlight_node(hl_predicate))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Generate New Tree\n", "Once we have identified the node, one can generate a new tree by replacing that node with another node of similar type from our fragment pool." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class LangFuzzer(LangFuzzer):\n", " def generate_new_tree(self, node, choice):\n", " name, children, id = node\n", " if id == choice:\n", " return random.choice(self.fragments[name])\n", " else:\n", " return (name, [self.generate_new_tree(c, choice)\n", " for c in children])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Again, the red node indicates where the replacement has occurred." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "random.seed(1)\n", "lf = LangFuzzer(PEGParser(VAR_GRAMMAR, tokens=VAR_TOKENS), mystrings)\n", "tree, node = lf.candidate()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def hl_predicate(_d, nid, _s, _a): return nid in {node}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from GrammarFuzzer import tree_to_string" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "new_tree = lf.generate_new_tree(tree, node)\n", "for s in [tree_to_string(i) for i in [tree, new_tree]]:\n", " print(s)\n", "display_tree(new_tree, node_attr=highlight_node(hl_predicate))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Fuzz\n", "\n", "The `fuzz()` method simply calls the procedures defined before in order." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class LangFuzzer(LangFuzzer):\n", " def fuzz(self):\n", " tree, node = self.candidate()\n", " modified = self.generate_new_tree(tree, node)\n", " return tree_to_string(modified)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Here is our fuzzer in action." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "lf = LangFuzzer(PEGParser(VAR_GRAMMAR, tokens=VAR_TOKENS), mystrings)\n", "for i in range(10):\n", " print(lf.fuzz())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "How effective was our fuzzing? Let us find out!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from Timer import Timer" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "trials = 100\n", "\n", "lf = LangFuzzer(PEGParser(VAR_GRAMMAR, tokens=VAR_TOKENS), mystrings)\n", "valid = []\n", "time = 0\n", "for i in range(trials):\n", " with Timer() as t:\n", " s = lf.fuzz()\n", " try:\n", " exec(s, {}, {})\n", " valid.append((s, t.elapsed_time()))\n", " except:\n", " pass\n", " time += t.elapsed_time()\n", "print(\"%d valid strings, that is LangFuzzer generated %f%% valid entries\" %\n", " (len(valid), len(valid) * 100.0 / trials))\n", "print(\"Total time of %f seconds\" % time)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "gf = GrammarFuzzer(VAR_GRAMMAR)\n", "valid = []\n", "time = 0\n", "for i in range(trials):\n", " with Timer() as t:\n", " s = gf.fuzz()\n", " try:\n", " exec(s, {}, {})\n", " valid.append(s)\n", " except:\n", " pass\n", " time += t.elapsed_time()\n", "print(\"%d valid strings, that is GrammarFuzzer generated %f%% valid entries\" %\n", " (len(valid), len(valid) * 100.0 / trials))\n", "print(\"Total time of %f seconds\" % time)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "That is, our `LangFuzzer` is rather effective on generating valid entries when compared to the `GrammarFuzzer`." ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": true, "run_control": { "read_only": false }, "slideshow": { "slide_type": "slide" } }, "source": [ "## Grammar-Based Mutation\n", "\n", "General idea: Take a derivation tree and a matching grammar; apply a random mutation." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from Grammars import EXPR_GRAMMAR\n", "from GrammarFuzzer import display_tree\n", "from Parser import EarleyParser" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "parser = EarleyParser(EXPR_GRAMMAR)\n", "tree = parser.parse(\"1 + 2 * 3\")[0]\n", "display_tree(tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "1. Pick any node in the tree\n", "2. Produce a new expansion.\n", "\n", "We have seen this for `LangFuzzer` already, right?\n", "\n", "How about we factor this out (from the Parser notebook), and have two notebook on mutational (and genetic fuzzing):\n", "\n", "1. `LangFuzzer` – a chapter on\n", " * Mutating trees (randomly)\n", " * Mutating trees from a given population (the LangFuzz approach)\n", " * Tree recombination (and crossover)\n", " \n", "2. `EvoGrammarFuzzer` – a chapter on\n", " * Genetic improvement (using coverage only)\n", " * Genetic improvement (using a fitness function from search-based fuzzing)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "def mutate_tree(tree, grammar):\n", " pass" ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": true, "run_control": { "read_only": false }, "slideshow": { "slide_type": "slide" } }, "source": [ "## Lessons Learned\n", "\n", "* We can generate a pool of fragments using the _LangFuzz_ approach, and use it to generate nearly valid strings." ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false }, "slideshow": { "slide_type": "slide" } }, "source": [ "## Next Steps\n", "\n", "* In the [chapter on evolutionary fuzzing](EvoGrammarFuzzer.ipynb), we discuss how to systematically evolve a population of inputs through mutation and crossover operations (as discussed in this chapter) to achieve a specific goal such as code coverage." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Background\n", "\n", "Recombining parsed inputs was pioneered by _Langfuzz_ \\cite{Holler2012}, which also gave its name to the main class of this chapter." ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": true, "run_control": { "read_only": false }, "slideshow": { "slide_type": "slide" } }, "source": [ "## Exercises\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" }, "solution2": "hidden", "solution2_first": true }, "source": [ "### Exercise 1: A Different LangFuzzer\n", "\n", "Sometimes we do not want to use our pool of strings for various reasons – the number of items in the pool may be inadequate, or not varied enough. Extend the `LangFuzzer` to use a separate function to check if the number of items in the pool corresponding to the selected non-terminal is large enough (say greater than 10), and if not, use the tree expansion technique from `GrammarFuzzer` to patch the hole." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" }, "solution2": "hidden" }, "source": [ "**Solution.** Here is a possible solution. Before we can make use of `GrammarFuzzer`, we need to change it a little bit. GrammarFuzzer relies on the grammar being in the `fuzzing` format with the expansions represented as strings. Our `LangFuzzer` expects the expansions to be a list of tokens. So we fix the output of `GrammarFuzzer`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" }, "solution2": "hidden" }, "outputs": [], "source": [ "class LangFuzzer2(LangFuzzer):\n", " def __init__(self, parser, strings):\n", " super().__init__(parser, strings)\n", " self.gfuzz = GrammarFuzzer(parser.grammar())\n", "\n", " def check_diversity(self, pool):\n", " return len(pool) > 10\n", "\n", " def candidate(self):\n", " tree, nodes = random.choice(self.trees)\n", " interesting_nodes = [\n", " n for n in nodes if nodes[n][0] in self.fragments\n", " and nodes[n][0] is not self.parser.start_symbol()\n", " and len(self.fragments[nodes[n][0]]) > 0\n", " ]\n", " node = random.choice(interesting_nodes)\n", " return tree, node\n", "\n", " def generate_new_tree(self, node, choice):\n", " name, children, id = node\n", " if id == choice:\n", " pool = self.fragments[name]\n", " if self.check_diversity(pool):\n", " return random.choice(self.fragments[name])\n", " else:\n", " return None\n", " else:\n", " return (name,\n", " [self.generate_new_tree(c, choice) for c in children])\n", "\n", " def fuzz(self):\n", " tree, node = self.candidate()\n", " tree_with_a_hole = self.generate_new_tree(tree, node)\n", " modified = self.gfuzz.expand_tree(tree_with_a_hole)\n", " return tree_to_string(modified)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "subslide" }, "solution2": "hidden" }, "outputs": [], "source": [ "parser = EarleyParser(VAR_GRAMMAR, tokens=VAR_TOKENS)\n", "lf = LangFuzzer2(parser, mystrings)\n", "for i in range(100):\n", " print(lf.fuzz())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "source": [ "With these changes, our `LangFuzzer2` is able to use the pool of fragments when necessary, but can rely on the grammar when the pool is not sufficient." ] } ], "metadata": { "ipub": { "bibliography": "fuzzingbook.bib", "toc": true }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": true, "title_cell": "", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": true }, "toc-autonumbering": false }, "nbformat": 4, "nbformat_minor": 2 }