{ "cells": [ { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false }, "slideshow": { "slide_type": "slide" } }, "source": [ "# Fuzzing with Input Fragments\n", "\n", "In this chapter, we introduce how to combine [parsing](Parser.ipynb) and [fuzzing](GrammarFuzzer.ipynb). This allows us to _mutate_ existing inputs while preserving syntactic correctness, and to _reuse_ fragments from existing inputs while generating new ones. The combination of parsing and fuzzing, as demonstrated in this chapter, has been highly successful in practice: the _LangFuzz_ fuzzer for JavaScript has found more than 2,600 bugs in JavaScript interpreters this way." ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false }, "slideshow": { "slide_type": "subslide" } }, "source": [ "**Prerequisites**\n", "\n", "* You should know [how to use a parser](Parser.ipynb).\n", "* You should have read the [chapter on efficient grammar fuzzing](GrammarFuzzer.ipynb)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "import fuzzingbook_utils" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from Parser import PEGParser\n", "from GrammarFuzzer import GrammarFuzzer" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" }, "toc-hr-collapsed": true }, "source": [ "## Recombining Parsed Inputs" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Recombining parsed inputs was pioneered by _LangFuzz_ \cite{Holler2012}. The main challenge is that program inputs often carry additional constraints beyond what is described by the syntax.
For example, in Java, one needs to declare a variable (using a specific declaration syntax) before it can be used in an expression. This restriction is not captured in the _Java CFG_. Checking for type correctness is another example of additional restrictions carried by program definitions." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "When fuzzing compilers and interpreters, naive generation of programs using the language *CFG* often fails to achieve significantly deeper coverage because of these kinds of checks external to the grammar. Holler et al. suggest using pre-existing valid code fragments to get around these restrictions. The idea is that pre-existing valid code fragments already conform to the restrictions external to the grammar, and hence often provide a means to evade validity checks." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "### A Grammar-based Mutational Fuzzer\n", "\n", "The idea is that one can treat the derivation tree of a pre-existing program as the scaffolding, poke holes in it, and patch these holes with inputs generated from our grammar. Given below is a grammar for a language that allows assignment of variables."
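, "\n", "The hole-poking step itself can be sketched independently of any particular grammar. The following is a minimal illustration, not part of the chapter's implementation; the helper `patch()` and its tree representation `(symbol, children)` are assumptions made for this sketch only:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "# Sketch only: replace one subtree (the \"hole\") of a derivation tree\n", "# by a replacement subtree labeled with the same symbol.\n", "def patch(tree, hole, replacement):\n", "    symbol, children = tree\n", "    if tree is hole:\n", "        assert replacement[0] == symbol  # only same-symbol patches\n", "        return replacement\n", "    return (symbol, [patch(c, hole, replacement) for c in children])"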
] }, { "cell_type": "code", "execution_count": 3, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "import string" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from Grammars import crange, syntax_diagram" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "VAR_GRAMMAR = {\n", "    '<start>': ['<statements>'],\n", "    '<statements>': ['<statement>;<statements>', '<statement>'],\n", "    '<statement>': ['<assignment>'],\n", "    '<assignment>': ['<identifier>=<expr>'],\n", "    '<identifier>': ['<word>'],\n", "    '<word>': ['<alpha><word>', '<alpha>'],\n", "    '<alpha>': list(string.ascii_letters),\n", "    '<expr>': ['<term>+<expr>', '<term>-<expr>', '<term>'],\n", "    '<term>': ['<factor>*<term>', '<factor>/<term>', '<factor>'],\n", "    '<factor>':\n", "    ['+<factor>', '-<factor>', '(<expr>)', '<identifier>', '<number>'],\n", "    '<number>': ['<integer>.<integer>', '<integer>'],\n", "    '<integer>': ['<digit><integer>', '<digit>'],\n", "    '<digit>': crange('0', '9')\n", "}" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "start\n", "statements\n", "statement\n", "assignment\n", "identifier\n", "word\n", "alpha\n", "expr\n", "term\n", "factor\n", "number\n", "integer\n", "digit\n" ] }, { "data": { "text/plain": [ "<syntax diagrams for the rules of VAR_GRAMMAR (figure omitted)>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "syntax_diagram(VAR_GRAMMAR)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Let us use our new grammar to parse a program."
] }, { "cell_type": "code", "execution_count": 7, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "mystring = 'va=10;vb=20'" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def hl_predicate(_d, _n, symbol, _a): return symbol in {\n", "    '<number>', '<identifier>'}" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from Parser import PEGParser, highlight_node\n", "from GrammarFuzzer import display_tree" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "text/plain": [ "<derivation tree of 'va=10;vb=20', with <number> and <identifier> nodes highlighted (figure omitted)>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "parser = PEGParser(VAR_GRAMMAR)\n", "for tree in parser.parse(mystring):\n", "    display_tree(tree, node_attr=highlight_node(hl_predicate))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "As can be seen from the above example, our grammar is rather detailed. So we need to define our token nodes, which correspond to the *red* nodes above." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "VAR_TOKENS = {'<number>', '<identifier>'}" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "These tokens are pruned using the `prune_tree` method that we mentioned previously.\n", "Here is a slightly more complex program to parse, but with the tree pruned using tokens:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "<derivation tree of 'avar=1.3;bvar=avar-3*(4+300)', pruned at the token nodes (figure omitted)>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "mystring = 'avar=1.3;bvar=avar-3*(4+300)'\n", "parser = PEGParser(VAR_GRAMMAR, tokens=VAR_TOKENS)\n", "for tree in parser.parse(mystring):\n", "    display_tree(tree, node_attr=highlight_node(hl_predicate))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "We develop a `LangFuzzer` class that generates recombined inputs. To apply the _LangFuzz_ approach, we need a few parsed strings."
] }, { "cell_type": "code", "execution_count": 13, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "mystrings = [\n", "    'abc=12+(3+3.3)',\n", "    'a=1;b=2;c=a+b',\n", "    'avar=1.3;bvar=avar-3*(4+300)',\n", "    'a=1.3;b=a-1*(4+3+(2/a))',\n", "    'a=10;b=20;c=34;d=-b+(b*b-4*a*c)/(2*a)',\n", "    'x=10;y=20;z=(x+y)*(x-y)',\n", "    'x=23;y=51;z=x*x-y*y',\n", "]" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "We recurse through any given tree, collecting parsed fragments corresponding to each nonterminal. Further, we also number each node so that we can address each node separately." ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from Fuzzer import Fuzzer" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class LangFuzzer(Fuzzer):\n", "    def __init__(self, parser):\n", "        self.parser = parser\n", "        self.fragments = {k: [] for k in self.parser.cgrammar}\n", "\n", "    def traverse_tree(self, node):\n", "        counter = 1\n", "        nodes = {}\n", "\n", "        def helper(node, id):\n", "            nonlocal counter\n", "            name, children = node\n", "            new_children = []\n", "            nodes[id] = node\n", "            for child in children:\n", "                counter += 1\n", "                new_children.append(helper(child, counter))\n", "            return name, new_children, id\n", "\n", "        return helper(node, counter), nodes\n", "\n", "    def fragment(self, strings):\n", "        self.trees = []\n", "        for string in strings:\n", "            for tree in self.parser.parse(string):\n", "                tree, nodes = self.traverse_tree(tree)\n", "                self.trees.append((tree, nodes))\n", "                for node in nodes:\n", "                    symbol = nodes[node][0]\n", "                    if symbol in self.fragments:\n", "                        self.fragments[symbol].append(nodes[node])\n", "        return self.fragments" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" } }, "source": [ "We thus obtain all valid fragments from our parsed strings." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<start>: 7\n", "<statements>: 18\n", "<statement>: 18\n", "<assignment>: 18\n", "<identifier>: 37\n", "<word>: 0\n", "<alpha>: 0\n", "<expr>: 39\n", "<term>: 50\n", "<factor>: 51\n", "<number>: 23\n", "<integer>: 0\n", "<digit>: 0\n" ] } ], "source": [ "lf = LangFuzzer(PEGParser(VAR_GRAMMAR, tokens=VAR_TOKENS))\n", "fragments = lf.fragment(mystrings)\n", "for key in fragments:\n", "    print(\"%s: %d\" % (key, len(fragments[key])))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "All that remains is to actually find a place to poke a hole using `candidate()`, and patch that hole using `generate_new_tree()`. We will explain how to do this next.\n", "\n", "But before that, we update our `__init__()` method with a call to `fragment()`." ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "import random" ] }, { "cell_type": "code", "execution_count": 18, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class LangFuzzer(LangFuzzer):\n", "    def __init__(self, parser, strings):\n", "        self.parser = parser\n", "        self.fragments = {k: [] for k in self.parser.cgrammar}\n", "        self.fragment(strings)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Candidate\n", "`LangFuzzer` accepts a list of strings, which are stored as derivation trees in the object.\n", "\n", "The method `candidate()` chooses one of the derivation trees randomly as the template, and identifies a node in it that can be replaced by another node different from itself.
That is, it chooses a node such that, if the nonterminal name of the node is `node_type`, there is at least one other entry in `fragments[node_type]` to replace it with." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "class LangFuzzer(LangFuzzer):\n", "    def candidate(self):\n", "        tree, nodes = random.choice(self.trees)\n", "        interesting_nodes = [\n", "            n for n in nodes if nodes[n][0] in self.fragments\n", "            and len(self.fragments[nodes[n][0]]) > 1\n", "        ]\n", "        node = random.choice(interesting_nodes)\n", "        return tree, node" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Here is how it is used -- the *red* node is the node chosen." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "random.seed(1)\n", "lf = LangFuzzer(PEGParser(VAR_GRAMMAR, tokens=VAR_TOKENS), mystrings)\n", "tree, node = lf.candidate()" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def hl_predicate(_d, nid, _s, _a): return nid in {node}" ] }, { "cell_type": "code", "execution_count": 22, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "<derivation tree of 'a=1;b=2;c=a+b', with the chosen node highlighted (figure omitted)>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "display_tree(tree, node_attr=highlight_node(hl_predicate))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Generate New Tree\n", "Once we have identified the node, we can generate a new tree by replacing that node with another node of the same type from our fragment pool." ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class LangFuzzer(LangFuzzer):\n", "    def generate_new_tree(self, node, choice):\n", "        name, children, id = node\n", "        if id == choice:\n", "            return random.choice(self.fragments[name])\n", "        else:\n", "            return (name, [self.generate_new_tree(c, choice)\n", "                           for c in children])" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "Again, the red node indicates where the replacement has occurred."
] }, { "cell_type": "code", "execution_count": 24, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "random.seed(1)\n", "lf = LangFuzzer(PEGParser(VAR_GRAMMAR, tokens=VAR_TOKENS), mystrings)\n", "tree, node = lf.candidate()" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "def hl_predicate(_d, nid, _s, _a): return nid in {node}" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from GrammarFuzzer import tree_to_string" ] }, { "cell_type": "code", "execution_count": 27, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "a=1;b=2;c=a+b\n", "a=1;b=2;b=2\n" ] }, { "data": { "text/plain": [ "<derivation tree of 'a=1;b=2;b=2', with the replaced subtree highlighted (figure omitted)>" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "new_tree = lf.generate_new_tree(tree, node)\n", "for s in [tree_to_string(i) for i in [tree, new_tree]]:\n", "    print(s)\n",
"display_tree(new_tree, node_attr=highlight_node(hl_predicate))" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "#### Fuzz\n", "\n", "The `fuzz()` method simply calls the procedures defined above in order." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [], "source": [ "class LangFuzzer(LangFuzzer):\n", "    def fuzz(self):\n", "        tree, node = self.candidate()\n", "        modified = self.generate_new_tree(tree, node)\n", "        return tree_to_string(modified)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "notes" } }, "source": [ "Here is our fuzzer in action." ] }, { "cell_type": "code", "execution_count": 29, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "x=23;bvar=avar-3*(4+300)\n", "x=23;y=51;z=x*x-(x+y)*(x-y)\n", "x=10;y=20;z=(1.3)*(x-y)\n", "abc=12+(12+3.3)\n", "x=23;y=51;z=y*x-y*y\n", "a=10;b=20;c=34;d=-b+(b*b-4*y)/(2*a)\n", "abc=12+((4+3+(2/a))+3.3)\n", "x=10;y=20;z=(x+y)*(x-y)\n", "abc=12+(3+3.3)\n", "x=10;y=20;z=(x+y)*(x-y)\n" ] } ], "source": [ "lf = LangFuzzer(PEGParser(VAR_GRAMMAR, tokens=VAR_TOKENS), mystrings)\n", "for i in range(10):\n", "    print(lf.fuzz())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "fragment" } }, "source": [ "How effective was our fuzzing? Let us find out!"
] }, { "cell_type": "code", "execution_count": 30, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from Timer import Timer" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "61 valid strings, that is LangFuzzer generated 61.000000% valid entries\n", "Total time of 0.011114 seconds\n" ] } ], "source": [ "trials = 100\n", "\n", "lf = LangFuzzer(PEGParser(VAR_GRAMMAR, tokens=VAR_TOKENS), mystrings)\n", "valid = []\n", "time = 0\n", "for i in range(trials):\n", " with Timer() as t:\n", " s = lf.fuzz()\n", " try:\n", " exec(s, {}, {})\n", " valid.append((s, t.elapsed_time()))\n", " except:\n", " pass\n", " time += t.elapsed_time()\n", "print(\"%d valid strings, that is LangFuzzer generated %f%% valid entries\" %\n", " (len(valid), len(valid) * 100.0 / trials))\n", "print(\"Total time of %f seconds\" % time)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "4 valid strings, that is GrammarFuzzer generated 4.000000% valid entries\n", "Total time of 1.366280 seconds\n" ] } ], "source": [ "gf = GrammarFuzzer(VAR_GRAMMAR)\n", "valid = []\n", "time = 0\n", "for i in range(trials):\n", " with Timer() as t:\n", " s = gf.fuzz()\n", " try:\n", " exec(s, {}, {})\n", " valid.append(s)\n", " except:\n", " pass\n", " time += t.elapsed_time()\n", "print(\"%d valid strings, that is GrammarFuzzer generated %f%% valid entries\" %\n", " (len(valid), len(valid) * 100.0 / trials))\n", "print(\"Total time of %f seconds\" % time)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "That is, our `LangFuzzer` is rather effective on generating valid entries when compared to the `GrammarFuzzer`." 
] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": true, "run_control": { "read_only": false }, "slideshow": { "slide_type": "slide" } }, "source": [ "## Grammar-Based Mutation\n", "\n", "General idea: Take a derivation tree and a matching grammar; apply a random mutation." ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "slideshow": { "slide_type": "skip" } }, "outputs": [], "source": [ "from Grammars import EXPR_GRAMMAR\n", "from GrammarFuzzer import display_tree\n", "from Parser import EarleyParser" ] }, { "cell_type": "code", "execution_count": 34, "metadata": { "slideshow": { "slide_type": "fragment" } }, "outputs": [ { "data": { "image/svg+xml": [ "\n", "\n", "\n", "\n", "\n", "\n", "%3\n", "\n", "\n", "\n", "0\n", "<start>\n", "\n", "\n", "\n", "1\n", "<expr>\n", "\n", "\n", "\n", "0->1\n", "\n", "\n", "\n", "\n", "\n", "2\n", "<term>\n", "\n", "\n", "\n", "1->2\n", "\n", "\n", "\n", "\n", "\n", "7\n", " + \n", "\n", "\n", "\n", "1->7\n", "\n", "\n", "\n", "\n", "\n", "8\n", "<expr>\n", "\n", "\n", "\n", "1->8\n", "\n", "\n", "\n", "\n", "\n", "3\n", "<factor>\n", "\n", "\n", "\n", "2->3\n", "\n", "\n", "\n", "\n", "\n", "4\n", "<integer>\n", "\n", "\n", "\n", "3->4\n", "\n", "\n", "\n", "\n", "\n", "5\n", "<digit>\n", "\n", "\n", "\n", "4->5\n", "\n", "\n", "\n", "\n", "\n", "6\n", "1\n", "\n", "\n", "\n", "5->6\n", "\n", "\n", "\n", "\n", "\n", "9\n", "<term>\n", "\n", "\n", "\n", "8->9\n", "\n", "\n", "\n", "\n", "\n", "10\n", "<factor>\n", "\n", "\n", "\n", "9->10\n", "\n", "\n", "\n", "\n", "\n", "14\n", " * \n", "\n", "\n", "\n", "9->14\n", "\n", "\n", "\n", "\n", "\n", "15\n", "<term>\n", "\n", "\n", "\n", "9->15\n", "\n", "\n", "\n", "\n", "\n", "11\n", "<integer>\n", "\n", "\n", "\n", "10->11\n", "\n", "\n", "\n", "\n", "\n", "12\n", "<digit>\n", "\n", "\n", "\n", "11->12\n", "\n", "\n", "\n", "\n", "\n", "13\n", "2\n", "\n", "\n", "\n", "12->13\n", "\n", "\n", "\n", "\n", "\n", "16\n", "<factor>\n", "\n", 
"\n", "\n", "15->16\n", "\n", "\n", "\n", "\n", "\n", "17\n", "<integer>\n", "\n", "\n", "\n", "16->17\n", "\n", "\n", "\n", "\n", "\n", "18\n", "<digit>\n", "\n", "\n", "\n", "17->18\n", "\n", "\n", "\n", "\n", "\n", "19\n", "3\n", "\n", "\n", "\n", "18->19\n", "\n", "\n", "\n", "\n", "\n" ], "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "parser = EarleyParser(EXPR_GRAMMAR)\n", "tree,*_ = parser.parse(\"1 + 2 * 3\")\n", "display_tree(tree)" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "1. Pick any node in the tree\n", "2. Produce a new expansion.\n", "\n", "We have seen this for `LangFuzzer` already, right?\n", "\n", "How about we factor this out (from the Parser notebook), and have two notebook on mutational (and genetic fuzzing):\n", "\n", "1. `LangFuzzer` – a chapter on\n", " * Mutating trees (randomly)\n", " * Mutating trees from a given population (the LangFuzz approach)\n", " * Tree recombination (and crossover)\n", " \n", "2. `EvoGrammarFuzzer` – a chapter on\n", " * Genetic improvement (using coverage only)\n", " * Genetic improvement (using a fitness function from search-based fuzzing)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "def mutate_tree(tree, grammar):\n", " pass" ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": true, "run_control": { "read_only": false }, "slideshow": { "slide_type": "slide" } }, "source": [ "## Lessons Learned\n", "\n", "* We can generate a pool of fragments using the _LangFuzz_ approach, and use it to generate nearly valid strings." 
] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": false, "run_control": { "read_only": false }, "slideshow": { "slide_type": "slide" } }, "source": [ "## Next Steps\n", "\n", "* In the [chapter on evolutionary fuzzing](EvoGrammarFuzzer.ipynb), we discuss how to systematically evolve a population of inputs through mutation and crossover operations (as discussed in this chapter) to achieve a specific goal such as code coverage." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Background\n", "\n", "Recombining parsed inputs was pioneered by _Langfuzz_ \\cite{Holler2012}, which also gave its name to the main class of this chapter." ] }, { "cell_type": "markdown", "metadata": { "button": false, "new_sheet": true, "run_control": { "read_only": false }, "slideshow": { "slide_type": "slide" } }, "source": [ "## Exercises\n" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" }, "solution2": "hidden", "solution2_first": true }, "source": [ "### Exercise 1: A Different LangFuzzer\n", "\n", "Sometimes we do not want to use our pool of strings for various reasons – the number of items in the pool may be inadequate, or not varied enough. Extend the `LangFuzzer` to use a separate function to check if the number of items in the pool corresponding to the selected non-terminal is large enough (say greater than 10), and if not, use the tree expansion technique from `GrammarFuzzer` to patch the hole." ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "subslide" }, "solution2": "hidden" }, "source": [ "**Solution.** Here is a possible solution. Before we can make use of `GrammarFuzzer`, we need to change it a little bit. GrammarFuzzer relies on the grammar being in the `fuzzing` format with the expansions represented as strings. Our `LangFuzzer` expects the expansions to be a list of tokens. So we fix the output of `GrammarFuzzer`." 
] }, { "cell_type": "code", "execution_count": 36, "metadata": { "slideshow": { "slide_type": "subslide" }, "solution2": "hidden" }, "outputs": [], "source": [ "class LangFuzzer2(LangFuzzer):\n", " def __init__(self, parser, strings):\n", " super().__init__(parser, strings)\n", " self.gfuzz = GrammarFuzzer(parser.grammar())\n", "\n", " def check_diversity(self, pool):\n", " return len(pool) > 10\n", "\n", " def candidate(self):\n", " tree, nodes = random.choice(self.trees)\n", " interesting_nodes = [\n", " n for n in nodes if nodes[n][0] in self.fragments\n", " and nodes[n][0] is not self.parser.start_symbol()\n", " and len(self.fragments[nodes[n][0]]) > 0\n", " ]\n", " node = random.choice(interesting_nodes)\n", " return tree, node\n", "\n", " def generate_new_tree(self, node, choice):\n", " name, children, id = node\n", " if id == choice:\n", " pool = self.fragments[name]\n", " if self.check_diversity(pool):\n", " return random.choice(self.fragments[name])\n", " else:\n", " return None\n", " else:\n", " return (name,\n", " [self.generate_new_tree(c, choice) for c in children])\n", "\n", " def fuzz(self):\n", " tree, node = self.candidate()\n", " tree_with_a_hole = self.generate_new_tree(tree, node)\n", " modified = self.gfuzz.expand_tree(tree_with_a_hole)\n", " return tree_to_string(modified)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "slideshow": { "slide_type": "subslide" }, "solution2": "hidden" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "a=1;b=2;b=2\n", "a=1.3;b=a-1*b\n", "a=1;b=2;c=4+300\n", "a=1.3;b=a-1*(1.3+3+(2/a))\n", "a=10;b=20;c=34;d=-b+(b*b-4*a*a)/(2*a)\n", "bvar=avar-3*(4+300);y=51;z=x*x-y*y\n", "a=10;b=20;c=34;d=-b+(b*b-4*a*c)/(x+y)\n", "z=x*x-y*y;b=2;c=a+b\n", "x=10;y=20;z=(x+y)*(a-y)\n", "z=(x+y)*(x-y);y=20;z=(x+y)*(x-y)\n", "x=23;y=51;z=x*x-3\n", "x=a;y=51;z=x*x-y*y\n", "a=10;b=20;c=34;d=-b+(b*x-4*a*c)/(2*a)\n", "a=1;b=2;c=y+b\n", "x=23;y=51;z=x*x-(2/a)\n", "a=300;b=2;c=a+b\n", 
"avar=1.3;a=1.3;b=a-1*(4+3+(2/a))\n", "x=23;y=51;z=x*b-y*y\n", "a=10;b=34;c=34;d=-b+(b*b-4*a*c)/(2*a)\n", "a=1.3;b=a-1*(4+4*a*c+(2/a))\n", "x=23;y=51;z=x*x-y*y\n", "x=23;y=51;z=x*x-y*(x-y)\n", "abc=12+(1.3+3.3)\n", "a=10;b=20;c=34;d=-b+(b*b-1)/(2*a)\n", "avar=1.3;bvar=avar-3*(4+300)\n", "a=1;b=2;c=a+b\n", "abc=12+(avar+3.3)\n", "a=1;b=20;c=a+b\n", "abc=12+(2+3.3)\n", "avar=51;bvar=avar-3*(4+300)\n", "avar=2;bvar=avar-3*(4+300)\n", "x=10;b=20;c=34;d=-b+(b*b-4*a*c)/(2*a)\n", "abc=12+(3*(4+300)+3.3)\n", "a=10;b=20;c=34;d=-b+(b*b-4*a*c)/(4*a*c)\n", "x=23;y=51;d=-b+(b*b-4*a*c)/(2*a)\n", "a=1.3;b=4-1*(4+3+(2/a))\n", "x=2;y=20;z=(x+y)*(x-y)\n", "z=(x+y)*(x-y);b=20;c=34;d=-b+(b*b-4*a*c)/(2*a)\n", "x=23;y=(2/a);z=x*x-y*y\n", "b=2;c=a+b\n", "a=10;b=2;c=34;d=-b+(b*b-4*a*c)/(2*a)\n", "a=1;b=2;c=a+a\n", "avar=1.3;bvar=y-3*(4+300)\n", "x=23;y=51;z=x*2-y*y\n", "x=300;y=20;z=(x+y)*(x-y)\n", "abc=12+10\n", "avar=1.3;bvar=avar-3*(4+(2/a))\n", "x=10;y=20;z=1\n", "a=a;b=20;c=34;d=-b+(b*b-4*a*c)/(2*a)\n", "a=1.3;b=a-1*300\n", "a=10;b=20;c=34;d=-b+(b*(3+3.3)-4*a*c)/(2*a)\n", "x=23;abc=12+(3+3.3);z=x*x-y*y\n", "a=10;b=20;c=2;d=-b+(b*b-4*a*c)/(2*a)\n", "a=1.3;b=a-1*(300+3+(2/a))\n", "x=23;y=51;c=34\n", "a=10;b=20;c=34;d=-b+(b*b-4*a*c)/(b*a)\n", "a=1.3;bvar=avar-3*(4+300)\n", "x=23;y=51;z=(b*b-4*a*c)/(2*a)\n", "x=23;y=51;z=(x+y)*(x-y)\n", "a=1;b=2;c=300+b\n", "x=23;x=51;z=x*x-y*y\n", "x=23;y=51;z=x*x-y*y\n", "a=2;b=20;c=34;d=-b+(b*b-4*a*c)/(2*a)\n", "abc=b+(3+3.3)\n", "a=1;z=2;c=a+b\n", "a=1.3;b=a-1*(4+3+(51))\n", "a=10;b=20;c=34;d=-b+(b*b-4*a*c)/(20)\n", "x=10;y=20;z=(x+10)*(x-y)\n", "a=1;b=10;c=a+b\n", "avar=1.3;abc=12+(3+3.3)\n", "a=1.3;b=a-1*(4+3+4)\n", "a=10;a=1;c=34;d=-b+(b*b-4*a*c)/(2*a)\n", "a=10;b=20;c=34;d=-b+(a-1*(4+3+(2/a)))/(2*a)\n", "avar=1.3;bvar=avar-1.3\n", "a=10;b=20;x=23;y=51;z=x*x-y*y\n", "abc=12+(3+3.3)\n", "a=10;y=20;z=(x+y)*(x-y)\n", "x=23;y=51;z=x*x-y*a\n", "a=1.3;b=a-1*(2/a)\n", "abc=10\n", "a=10;b=20;c=34;d=-b+(b*b-4*a*c)/(3)\n", "b=a-1*(4+3+(2/a))\n", 
"bvar=avar-3*(4+300);bvar=avar-3*(4+300)\n", "b=1.3;b=a-1*(4+3+(2/a))\n", "x=23;y=51;z=x*x-y*avar\n", "abc=1.3+(3+3.3)\n", "a=10;a=10;b=20;c=34;d=-b+(b*b-4*a*c)/(2*a)\n", "x=3;y=20;z=(x+y)*(x-y)\n", "a=10;b=20;c=34;d=-10+(b*b-4*a*c)/(2*a)\n", "a=1;b=2;c=b+b\n", "a=1.3;b=a-1*(4+3+4*a*c)\n", "a=10;b=20;c=34;d=-b+(b*b-4*(4+300))/(2*a)\n", "avar=3+(2/a);bvar=avar-3*(4+300)\n", "a=10;b=20;c=34;d=-b+(b*b-4*a*c)/(y)\n", "c=34;y=51;z=x*x-y*y\n", "avar=1.3;y=20\n", "abc=12+3+3.3\n", "a=1.3;bvar=avar-3*(4+300)\n", "avar=1.3;bvar=avar-3*(b+300)\n", "a=avar-3*(4+300);b=20;c=34;d=-b+(b*b-4*a*c)/(2*a)\n" ] } ], "source": [ "parser = EarleyParser(VAR_GRAMMAR, tokens=VAR_TOKENS)\n", "lf = LangFuzzer2(parser, mystrings)\n", "for i in range(100):\n", " print(lf.fuzz())" ] }, { "cell_type": "markdown", "metadata": { "slideshow": { "slide_type": "skip" }, "solution2": "hidden" }, "source": [ "With these changes, our `LangFuzzer2` is able to use the pool of fragments when necessary, but can rely on the grammar when the pool is not sufficient." ] } ], "metadata": { "ipub": { "bibliography": "fuzzingbook.bib", "toc": true }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.8" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": true, "title_cell": "", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": true }, "toc-autonumbering": false }, "nbformat": 4, "nbformat_minor": 2 }