{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"button": false,
"new_sheet": true,
"run_control": {
"read_only": false
},
"slideshow": {
"slide_type": "slide"
},
"tags": [],
"toc-hr-collapsed": false
},
"source": [
"# Fuzzing with Grammars\n",
"\n",
"In the chapter on [\"Mutation-Based Fuzzing\"](MutationFuzzer.ipynb), we have seen how to use extra hints – such as sample input files – to speed up test generation. In this chapter, we take this idea one step further, by providing a _specification_ of the legal inputs to a program. Specifying inputs via a _grammar_ allows for very systematic and efficient test generation, in particular for complex input formats. Grammars also serve as the base for configuration fuzzing, API fuzzing, GUI fuzzing, and many more."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"execution": {
"iopub.execute_input": "2022-01-12T13:42:35.964322Z",
"iopub.status.busy": "2022-01-12T13:42:35.963611Z",
"iopub.status.idle": "2022-01-12T13:42:36.088356Z",
"shell.execute_reply": "2022-01-12T13:42:36.088839Z"
},
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from bookutils import YouTubeVideo\n",
"YouTubeVideo('mswyS3Wok1c')"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"new_sheet": false,
"run_control": {
"read_only": false
},
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"**Prerequisites**\n",
"\n",
"* You should know how basic fuzzing works, e.g. from the [Chapter introducing fuzzing](Fuzzer.ipynb).\n",
"* Knowledge on [mutation-based fuzzing](MutationFuzzer.ipynb) and [coverage](Coverage.ipynb) is _not_ required yet, but still recommended."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"button": false,
"execution": {
"iopub.execute_input": "2022-01-12T13:42:36.092675Z",
"iopub.status.busy": "2022-01-12T13:42:36.092117Z",
"iopub.status.idle": "2022-01-12T13:42:36.094080Z",
"shell.execute_reply": "2022-01-12T13:42:36.094491Z"
},
"new_sheet": false,
"run_control": {
"read_only": false
},
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"import bookutils"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"execution": {
"iopub.execute_input": "2022-01-12T13:42:36.098228Z",
"iopub.status.busy": "2022-01-12T13:42:36.097649Z",
"iopub.status.idle": "2022-01-12T13:42:36.099279Z",
"shell.execute_reply": "2022-01-12T13:42:36.099660Z"
},
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"from typing import List, Dict, Union, Any, Tuple, Optional"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"execution": {
"iopub.execute_input": "2022-01-12T13:42:36.103498Z",
"iopub.status.busy": "2022-01-12T13:42:36.102958Z",
"iopub.status.idle": "2022-01-12T13:42:36.845688Z",
"shell.execute_reply": "2022-01-12T13:42:36.846237Z"
},
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"import Fuzzer"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Synopsis\n",
"\n",
"\n",
"To [use the code provided in this chapter](Importing.ipynb), write\n",
"\n",
"```python\n",
">>> from fuzzingbook.Grammars import \n",
"```\n",
"\n",
"and then make use of the following features.\n",
"\n",
"\n",
"This chapter introduces _grammars_ as a simple means to specify input languages, and to use them for testing programs with syntactically valid inputs. A grammar is defined as a mapping of nonterminal symbols to lists of alternative expansions, as in the following example:\n",
"\n",
"```python\n",
">>> US_PHONE_GRAMMAR: Grammar = {\n",
">>> \"\": [\"\"],\n",
">>> \"\": [\"()-\"],\n",
">>> \"\": [\"\"],\n",
">>> \"\": [\"\"],\n",
">>> \"\": [\"\"],\n",
">>> \"\": [\"2\", \"3\", \"4\", \"5\", \"6\", \"7\", \"8\", \"9\"],\n",
">>> \"\": [\"0\", \"1\", \"2\", \"3\", \"4\", \"5\", \"6\", \"7\", \"8\", \"9\"]\n",
">>> }\n",
">>> \n",
">>> assert is_valid_grammar(US_PHONE_GRAMMAR)\n",
"```\n",
"Nonterminal symbols are enclosed in angle brackets (say, ``). To generate an input string from a grammar, a _producer_ starts with the start symbol (``) and randomly chooses a random expansion for this symbol. It continues the process until all nonterminal symbols are expanded. The function `simple_grammar_fuzzer()` does just that:\n",
"\n",
"```python\n",
">>> [simple_grammar_fuzzer(US_PHONE_GRAMMAR) for i in range(5)]\n",
"['(692)449-5179',\n",
" '(519)230-7422',\n",
" '(613)761-0853',\n",
" '(979)881-3858',\n",
" '(810)914-5475']\n",
"```\n",
"In practice, though, instead of `simple_grammar_fuzzer()`, you should use [the `GrammarFuzzer` class](GrammarFuzzer.ipynb) or one of its [coverage-based](GrammarCoverageFuzzer.ipynb), [probabilistic-based](ProbabilisticGrammarFuzzer.ipynb), or [generator-based](GeneratorGrammarFuzzer.ipynb) derivatives; these are more efficient, protect against infinite growth, and provide several additional features.\n",
"\n",
"This chapter also introduces a [grammar toolbox](#A-Grammar-Toolbox) with several helper functions that ease the writing of grammars, such as using shortcut notations for character classes and repetitions, or extending grammars \n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"new_sheet": true,
"run_control": {
"read_only": false
},
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Input Languages\n",
"\n",
"All possible behaviors of a program can be triggered by its input. \"Input\" here can be a wide range of possible sources: We are talking about data that is read from files, from the environment, or over the network, data input by the user, or data acquired from interaction with other resources. The set of all these inputs determines how the program will behave – including its failures. When testing, it is thus very helpful to think about possible input sources, how to get them under control, and _how to systematically test them_."
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"new_sheet": true,
"run_control": {
"read_only": false
},
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"For the sake of simplicity, we will assume for now that the program has only one source of inputs; this is the same assumption we have been using in the previous chapters, too. The set of valid inputs to a program is called a _language_. Languages range from the simple to the complex: the CSV language denotes the set of valid comma-separated inputs, whereas the Python language denotes the set of valid Python programs. We commonly separate data languages and programming languages, although any program can also be treated as input data (say, to a compiler). The [Wikipedia page on file formats](https://en.wikipedia.org/wiki/List_of_file_formats) lists more than 1,000 different file formats, each of which is its own language."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"To formally describe languages, the field of *formal languages* has devised a number of *language specifications* that describe a language. *Regular expressions* represent the simplest class of these languages to denote sets of strings: The regular expression `[a-z]*`, for instance, denotes a (possibly empty) sequence of lowercase letters. *Automata theory* connects these languages to automata that accept these inputs; *finite state machines*, for instance, can be used to specify the language of regular expressions."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Regular expressions are great for not-too-complex input formats, and the associated finite state machines have many properties that make them great for reasoning. To specify more complex inputs, though, they quickly encounter limitations. At the other end of the language spectrum, we have *universal grammars* that denote the language accepted by *Turing machines*. A Turing machine can compute anything that can be computed; and with Python being Turing-complete, this means that we can also use a Python program $p$ to specify or even enumerate legal inputs. But then, computer science theory also tells us that each such testing program has to be written specifically for the program to be tested, which is not the level of automation we want."
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"new_sheet": false,
"run_control": {
"read_only": false
},
"slideshow": {
"slide_type": "slide"
},
"toc-hr-collapsed": true,
"toc-nb-collapsed": true
},
"source": [
"## Grammars\n",
"\n",
"The middle ground between regular expressions and Turing machines is covered by *grammars*. Grammars are among the most popular (and best understood) formalisms to formally specify input languages. Using a grammar, one can express a wide range of the properties of an input language. Grammars are particularly great for expressing the *syntactical structure* of an input, and are the formalism of choice to express nested or recursive inputs. The grammars we use are so-called *context-free grammars*, one of the easiest and most popular grammar formalisms."
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"new_sheet": false,
"run_control": {
"read_only": false
},
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Rules and Expansions\n",
"\n",
"A grammar consists of a *start symbol* and a set of *expansion rules* (or simply *rules*) which indicate how the start symbol (and other symbols) can be expanded. As an example, consider the following grammar, denoting a sequence of two digits:\n",
"\n",
"```\n",
" ::= \n",
" ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9\n",
"```\n",
"\n",
"To read such a grammar, start with the start symbol (``). An expansion rule ` ::= ` means that the symbol on the left side (``) can be replaced by the string on the right side (``). In the above grammar, `` would be replaced by ``.\n",
"\n",
"In this string again, `` would be replaced by the string on the right side of the `` rule. The special operator `|` denotes *expansion alternatives* (or simply *alternatives*), meaning that any of the digits can be chosen for an expansion. Each `` thus would be expanded into one of the given digits, eventually yielding a string between `00` and `99`. There are no further expansions for `0` to `9`, so we are all set."
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"new_sheet": false,
"run_control": {
"read_only": false
},
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"The interesting thing about grammars is that they can be *recursive*. That is, expansions can make use of symbols expanded earlier – which would then be expanded again. As an example, consider a grammar that describes integers:\n",
"\n",
"```\n",
" ::= \n",
" ::= | \n",
" ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9\n",
"```\n",
"\n",
"Here, a `` is either a single digit, or a digit followed by another integer. The number `1234` thus would be represented as a single digit `1`, followed by the integer `234`, which in turn is a digit `2`, followed by the integer `34`."
]
},
{
"cell_type": "markdown",
"metadata": {
"button": false,
"new_sheet": false,
"run_control": {
"read_only": false
},
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"If we wanted to express that an integer can be preceded by a sign (`+` or `-`), we would write the grammar as\n",
"\n",
"```\n",
" ::= \n",
" ::= | + | -\n",
" ::= | \n",
" ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9\n",
"```\n",
"\n",
"These rules formally define the language: Anything that can be derived from the start symbol is part of the language; anything that cannot is not."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"execution": {
"iopub.execute_input": "2022-01-12T13:42:36.850936Z",
"iopub.status.busy": "2022-01-12T13:42:36.850327Z",
"iopub.status.idle": "2022-01-12T13:42:36.852016Z",
"shell.execute_reply": "2022-01-12T13:42:36.852471Z"
},
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"from bookutils import quiz"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"execution": {
"iopub.execute_input": "2022-01-12T13:42:36.861052Z",
"iopub.status.busy": "2022-01-12T13:42:36.860475Z",
"iopub.status.idle": "2022-01-12T13:42:36.863178Z",
"shell.execute_reply": "2022-01-12T13:42:36.863563Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" \n",
" \n",
"