{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"button": false,
"new_sheet": false,
"run_control": {
"read_only": false
},
"slideshow": {
"slide_type": "notes"
},
"tags": [],
"toc-hr-collapsed": false
},
"source": [
"# Parsing Inputs\n",
"\n",
"In the chapter on [Grammars](Grammars.ipynb), we discussed how grammars can be\n",
"used to represent various languages. We also saw how grammars can be used to\n",
"generate strings of the corresponding language. Grammars can also perform the\n",
"reverse. That is, given a string, one can decompose the string into its\n",
"constituent parts that correspond to the parts of grammar used to generate it\n",
"– the _derivation tree_ of that string. These parts (and parts from other similar\n",
"strings) can later be recombined using the same grammar to produce new strings.\n",
"\n",
"In this chapter, we use grammars to parse and decompose a given set of valid seed inputs into their corresponding derivation trees. This structural representation allows us to mutate, cross over, and recombine their parts in order to generate new valid, slightly changed inputs (i.e., fuzz)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:34.697653Z",
"iopub.status.busy": "2024-01-18T17:16:34.697187Z",
"iopub.status.idle": "2024-01-18T17:16:34.768963Z",
"shell.execute_reply": "2024-01-18T17:16:34.768588Z"
},
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/html": [
"\n",
" \n",
" "
],
"text/plain": [
""
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from bookutils import YouTubeVideo\n",
"YouTubeVideo('k39i9de0L54')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"**Prerequisites**\n",
"\n",
"* You should have read the [chapter on grammars](Grammars.ipynb).\n",
"* An understanding of derivation trees from the [chapter on grammar fuzzer](GrammarFuzzer.ipynb)\n",
" is also required."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"## Synopsis\n",
"\n",
"\n",
"To [use the code provided in this chapter](Importing.ipynb), write\n",
"\n",
"```python\n",
">>> from fuzzingbook.Parser import <identifier>\n",
"```\n",
"\n",
"and then make use of the following features.\n",
"\n",
"\n",
"This chapter introduces `Parser` classes, parsing a string into a _derivation tree_ as introduced in the [chapter on efficient grammar fuzzing](GrammarFuzzer.ipynb). Two important parser classes are provided:\n",
"\n",
"* [Parsing Expression Grammar parsers](#Parsing-Expression-Grammars) (`PEGParser`). These are very efficient, but limited to a specific grammar structure. Notably, the alternatives represent *ordered choice*: rather than trying all rules that can potentially match, we stop at the first match that succeeds.\n",
"* [Earley parsers](#Parsing-Context-Free-Grammars) (`EarleyParser`). These accept any kind of context-free grammar, and explore all parsing alternatives (if any).\n",
"\n",
"Using any of these is fairly easy, though. First, instantiate them with a grammar:\n",
"\n",
"```python\n",
">>> from Grammars import US_PHONE_GRAMMAR\n",
">>> us_phone_parser = EarleyParser(US_PHONE_GRAMMAR)\n",
"```\n",
"Then, use the `parse()` method to retrieve a list of possible derivation trees:\n",
"\n",
"```python\n",
">>> trees = us_phone_parser.parse(\"(555)987-6543\")\n",
">>> tree = list(trees)[0]\n",
">>> display_tree(tree)\n",
"```\n",
"![](PICS/Parser-synopsis-1.svg)\n",
"\n",
"These derivation trees can then be used for test generation, notably for mutating and recombining existing inputs.\n",
"\n",
"![](PICS/Parser-synopsis-2.svg)\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:34.792873Z",
"iopub.status.busy": "2024-01-18T17:16:34.792689Z",
"iopub.status.idle": "2024-01-18T17:16:34.794666Z",
"shell.execute_reply": "2024-01-18T17:16:34.794351Z"
},
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"import bookutils.setup"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:34.796189Z",
"iopub.status.busy": "2024-01-18T17:16:34.796067Z",
"iopub.status.idle": "2024-01-18T17:16:34.797725Z",
"shell.execute_reply": "2024-01-18T17:16:34.797471Z"
},
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"from typing import Dict, List, Tuple, Collection, Set, Iterable, Generator, cast"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:34.799232Z",
"iopub.status.busy": "2024-01-18T17:16:34.799130Z",
"iopub.status.idle": "2024-01-18T17:16:34.879150Z",
"shell.execute_reply": "2024-01-18T17:16:34.878815Z"
},
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"from Fuzzer import Fuzzer # minor dependency"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:34.880958Z",
"iopub.status.busy": "2024-01-18T17:16:34.880853Z",
"iopub.status.idle": "2024-01-18T17:16:35.265688Z",
"shell.execute_reply": "2024-01-18T17:16:35.265367Z"
},
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"from Grammars import EXPR_GRAMMAR, START_SYMBOL, RE_NONTERMINAL\n",
"from Grammars import is_valid_grammar, syntax_diagram, Grammar"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:35.267656Z",
"iopub.status.busy": "2024-01-18T17:16:35.267510Z",
"iopub.status.idle": "2024-01-18T17:16:35.295983Z",
"shell.execute_reply": "2024-01-18T17:16:35.295603Z"
},
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"from GrammarFuzzer import GrammarFuzzer, display_tree, tree_to_string, dot_escape\n",
"from GrammarFuzzer import DerivationTree"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:35.297650Z",
"iopub.status.busy": "2024-01-18T17:16:35.297559Z",
"iopub.status.idle": "2024-01-18T17:16:35.299342Z",
"shell.execute_reply": "2024-01-18T17:16:35.299100Z"
},
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"from ExpectError import ExpectError"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:35.300814Z",
"iopub.status.busy": "2024-01-18T17:16:35.300730Z",
"iopub.status.idle": "2024-01-18T17:16:35.302496Z",
"shell.execute_reply": "2024-01-18T17:16:35.302232Z"
},
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"from IPython.display import display"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:35.303922Z",
"iopub.status.busy": "2024-01-18T17:16:35.303843Z",
"iopub.status.idle": "2024-01-18T17:16:35.305519Z",
"shell.execute_reply": "2024-01-18T17:16:35.305266Z"
},
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"from Timer import Timer"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Why Parsing for Fuzzing?"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Why would one want to parse existing inputs in order to fuzz? Let us illustrate the problem with an example. Here is a simple program that accepts a CSV file of vehicle details and processes this information."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:35.307220Z",
"iopub.status.busy": "2024-01-18T17:16:35.307108Z",
"iopub.status.idle": "2024-01-18T17:16:35.309310Z",
"shell.execute_reply": "2024-01-18T17:16:35.308945Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"def process_inventory(inventory):\n",
" res = []\n",
" for vehicle in inventory.split('\\n'):\n",
" ret = process_vehicle(vehicle)\n",
" res.extend(ret)\n",
" return '\\n'.join(res)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"The CSV file contains details of one vehicle per line. Each row is processed in `process_vehicle()`."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:35.310945Z",
"iopub.status.busy": "2024-01-18T17:16:35.310842Z",
"iopub.status.idle": "2024-01-18T17:16:35.312924Z",
"shell.execute_reply": "2024-01-18T17:16:35.312680Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"def process_vehicle(vehicle):\n",
" year, kind, company, model, *_ = vehicle.split(',')\n",
" if kind == 'van':\n",
" return process_van(year, company, model)\n",
"\n",
" elif kind == 'car':\n",
" return process_car(year, company, model)\n",
"\n",
" else:\n",
" raise Exception('Invalid entry')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Depending on the kind of vehicle, the processing changes."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:35.314422Z",
"iopub.status.busy": "2024-01-18T17:16:35.314325Z",
"iopub.status.idle": "2024-01-18T17:16:35.316288Z",
"shell.execute_reply": "2024-01-18T17:16:35.316052Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"def process_van(year, company, model):\n",
" res = [\"We have a %s %s van from %s vintage.\" % (company, model, year)]\n",
" iyear = int(year)\n",
" if iyear > 2010:\n",
" res.append(\"It is a recent model!\")\n",
" else:\n",
" res.append(\"It is an old but reliable model!\")\n",
" return res"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:35.317767Z",
"iopub.status.busy": "2024-01-18T17:16:35.317659Z",
"iopub.status.idle": "2024-01-18T17:16:35.319613Z",
"shell.execute_reply": "2024-01-18T17:16:35.319356Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"def process_car(year, company, model):\n",
" res = [\"We have a %s %s car from %s vintage.\" % (company, model, year)]\n",
" iyear = int(year)\n",
" if iyear > 2016:\n",
" res.append(\"It is a recent model!\")\n",
" else:\n",
" res.append(\"It is an old but reliable model!\")\n",
" return res"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Here is a sample of inputs that `process_inventory()` accepts."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:35.321101Z",
"iopub.status.busy": "2024-01-18T17:16:35.321003Z",
"iopub.status.idle": "2024-01-18T17:16:35.322821Z",
"shell.execute_reply": "2024-01-18T17:16:35.322551Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"We have a Ford E350 van from 1997 vintage.\n",
"It is an old but reliable model!\n",
"We have a Mercury Cougar car from 2000 vintage.\n",
"It is an old but reliable model!\n"
]
}
],
"source": [
"mystring = \"\"\"\\\n",
"1997,van,Ford,E350\n",
"2000,car,Mercury,Cougar\\\n",
"\"\"\"\n",
"print(process_inventory(mystring))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Let us try to fuzz this program. Given that `process_inventory()` takes a CSV file, we can write a simple grammar for generating comma-separated values, and generate the required CSV rows. For convenience, we fuzz `process_vehicle()` directly."
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:35.324423Z",
"iopub.status.busy": "2024-01-18T17:16:35.324311Z",
"iopub.status.idle": "2024-01-18T17:16:35.325871Z",
"shell.execute_reply": "2024-01-18T17:16:35.325654Z"
},
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"import string"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:35.327248Z",
"iopub.status.busy": "2024-01-18T17:16:35.327166Z",
"iopub.status.idle": "2024-01-18T17:16:35.329032Z",
"shell.execute_reply": "2024-01-18T17:16:35.328804Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"CSV_GRAMMAR: Grammar = {\n",
"    '<start>': ['<csvline>'],\n",
"    '<csvline>': ['<items>'],\n",
"    '<items>': ['<item>,<items>', '<item>'],\n",
"    '<item>': ['<letters>'],\n",
"    '<letters>': ['<letter><letters>', '<letter>'],\n",
"    '<letter>': list(string.ascii_letters + string.digits + string.punctuation + ' \\t\\n')\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"We first need some infrastructure for viewing the grammar."
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:35.330527Z",
"iopub.status.busy": "2024-01-18T17:16:35.330439Z",
"iopub.status.idle": "2024-01-18T17:16:35.351642Z",
"shell.execute_reply": "2024-01-18T17:16:35.351354Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"start\n"
]
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"csvline\n"
]
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"items\n"
]
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"item\n"
]
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"letters\n"
]
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"letter\n"
]
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"syntax_diagram(CSV_GRAMMAR)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"We generate 1000 values and evaluate `process_vehicle()` on each."
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:35.353747Z",
"iopub.status.busy": "2024-01-18T17:16:35.353532Z",
"iopub.status.idle": "2024-01-18T17:16:37.931323Z",
"shell.execute_reply": "2024-01-18T17:16:37.930832Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 valid strings, that is GrammarFuzzer generated 0.000000% valid entries from 1000 inputs\n",
"Total time of 2.573146 seconds\n"
]
}
],
"source": [
"gf = GrammarFuzzer(CSV_GRAMMAR, min_nonterminals=4)\n",
"trials = 1000\n",
"valid: List[str] = []\n",
"time = 0\n",
"for i in range(trials):\n",
" with Timer() as t:\n",
" vehicle_info = gf.fuzz()\n",
" try:\n",
" process_vehicle(vehicle_info)\n",
" valid.append(vehicle_info)\n",
" except:\n",
" pass\n",
" time += t.elapsed_time()\n",
"print(\"%d valid strings, that is GrammarFuzzer generated %f%% valid entries from %d inputs\" %\n",
" (len(valid), len(valid) * 100.0 / trials, trials))\n",
"print(\"Total time of %f seconds\" % time)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"This is obviously not working. But why?"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:37.933420Z",
"iopub.status.busy": "2024-01-18T17:16:37.933259Z",
"iopub.status.idle": "2024-01-18T17:16:37.971369Z",
"shell.execute_reply": "2024-01-18T17:16:37.971074Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"'9w9J\\'/,LU<\"l,|,Y,Zv)Amvx,c\\n'\t Invalid entry\n",
"'(n8].H7,qolS'\t not enough values to unpack (expected at least 4, got 2)\n",
"'\\nQoLWQ,jSa'\t not enough values to unpack (expected at least 4, got 2)\n",
"'K1,\\n,RE,fq,%,,sT+aAb'\t Invalid entry\n",
"\"m,d,,8j4'),-yQ,B7\"\t Invalid entry\n",
"'g4,s1\\t[}{.,M,<,\\nzd,.am'\t Invalid entry\n",
"',Z[,z,c,#x1,gc.F'\t Invalid entry\n",
"'pWs,rT`,R'\t not enough values to unpack (expected at least 4, got 3)\n",
"'iN,br%,Q,R'\t Invalid entry\n",
"'ol,\\nH<\\tn,^#,=A'\t Invalid entry\n"
]
}
],
"source": [
"gf = GrammarFuzzer(CSV_GRAMMAR, min_nonterminals=4)\n",
"trials = 10\n",
"time = 0\n",
"for i in range(trials):\n",
" vehicle_info = gf.fuzz()\n",
" try:\n",
" print(repr(vehicle_info), end=\"\")\n",
" process_vehicle(vehicle_info)\n",
" except Exception as e:\n",
" print(\"\\t\", e)\n",
" else:\n",
" print()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"None of the entries will get through unless the fuzzer can produce either `van` or `car`.\n",
"Indeed, the reason is that the grammar itself does not capture the complete information about the format. So here is another idea. We modify the `GrammarFuzzer` to know a bit about our format."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:37.973249Z",
"iopub.status.busy": "2024-01-18T17:16:37.973156Z",
"iopub.status.idle": "2024-01-18T17:16:37.974853Z",
"shell.execute_reply": "2024-01-18T17:16:37.974589Z"
},
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"import copy"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:37.976467Z",
"iopub.status.busy": "2024-01-18T17:16:37.976367Z",
"iopub.status.idle": "2024-01-18T17:16:37.977985Z",
"shell.execute_reply": "2024-01-18T17:16:37.977739Z"
},
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"import random"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:37.979504Z",
"iopub.status.busy": "2024-01-18T17:16:37.979408Z",
"iopub.status.idle": "2024-01-18T17:16:37.982185Z",
"shell.execute_reply": "2024-01-18T17:16:37.981915Z"
},
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"class PooledGrammarFuzzer(GrammarFuzzer):\n",
" def __init__(self, *args, **kwargs):\n",
" super().__init__(*args, **kwargs)\n",
" self._node_cache = {}\n",
"\n",
" def update_cache(self, key, values):\n",
" self._node_cache[key] = values\n",
"\n",
" def expand_node_randomly(self, node):\n",
" (symbol, children) = node\n",
" assert children is None\n",
" if symbol in self._node_cache:\n",
" if random.randint(0, 1) == 1:\n",
" return super().expand_node_randomly(node)\n",
" return copy.deepcopy(random.choice(self._node_cache[symbol]))\n",
" return super().expand_node_randomly(node)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Let us try again!"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:37.984433Z",
"iopub.status.busy": "2024-01-18T17:16:37.984214Z",
"iopub.status.idle": "2024-01-18T17:16:38.000466Z",
"shell.execute_reply": "2024-01-18T17:16:38.000089Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"',h,van,|'\t Invalid entry\n",
"'M,w:K,car,car,van'\t Invalid entry\n",
"'J,?Y,van,van,car,J,~D+'\t Invalid entry\n",
"'S4,car,car,o'\t invalid literal for int() with base 10: 'S4'\n",
"'2*-,van'\t not enough values to unpack (expected at least 4, got 2)\n",
"'van,%,5,]'\t Invalid entry\n",
"'van,G3{y,j,h:'\t Invalid entry\n",
"'$0;o,M,car,car'\t Invalid entry\n",
"'2d,f,e'\t not enough values to unpack (expected at least 4, got 3)\n",
"'/~NE,car,car'\t not enough values to unpack (expected at least 4, got 3)\n"
]
}
],
"source": [
"gf = PooledGrammarFuzzer(CSV_GRAMMAR, min_nonterminals=4)\n",
"gf.update_cache('<item>', [\n",
"    ('<item>', [('car', [])]),\n",
"    ('<item>', [('van', [])]),\n",
"])\n",
"trials = 10\n",
"time = 0\n",
"for i in range(trials):\n",
" vehicle_info = gf.fuzz()\n",
" try:\n",
" print(repr(vehicle_info), end=\"\")\n",
" process_vehicle(vehicle_info)\n",
" except Exception as e:\n",
" print(\"\\t\", e)\n",
" else:\n",
" print()"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"At least we are getting somewhere! It would be really nice if _we could incorporate what we know about the sample data in our fuzzer._ In fact, it would be nice if we could _extract_ the template and valid values from samples, and use them in our fuzzing. How do we do that? The quick answer to this question is: Use a *parser*. "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Using a Parser\n",
"\n",
"Generally speaking, a _parser_ is the part of a program that processes (structured) input. The parsers we discuss in this chapter transform an input string into a _derivation tree_ (discussed in the [chapter on efficient grammar fuzzing](GrammarFuzzer.ipynb)). From a user's perspective, all it takes to parse an input is two steps:\n",
"\n",
"1. Initialize the parser with a grammar, as in\n",
"```python\n",
"parser = Parser(grammar)\n",
"```\n",
"\n",
"2. Use the parser to retrieve a list of derivation trees:\n",
"\n",
"```python\n",
"trees = parser.parse(input)\n",
"```\n",
"\n",
"Once we have parsed a tree, we can use it just as the derivation trees produced from grammar fuzzing.\n",
"\n",
"We discuss a number of such parsers, in particular\n",
"* [parsing expression grammar parsers](#Parsing-Expression-Grammars) (`PEGParser`), which are very efficient, but limited to a specific grammar structure; and\n",
"* [Earley parsers](#Parsing-Context-Free-Grammars) (`EarleyParser`), which accept any kind of context-free grammar.\n",
"\n",
"If you just want to _use_ parsers (say, because your main focus is testing), you can just stop here and move on [to the next chapter](LangFuzzer.ipynb), where we learn how to make use of parsed inputs to mutate and recombine them. If you want to _understand_ how parsers work, though, this chapter is right for you."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## An Ad Hoc Parser\n",
"\n",
"As we saw in the previous section, programmers often have to extract parts of data that obey certain rules. For example, in *CSV* files, the elements in a row are separated by *commas*, and multiple rows are used to store the data."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"To extract the information, we write an ad hoc parser `simple_parse_csv()`."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:38.002774Z",
"iopub.status.busy": "2024-01-18T17:16:38.002560Z",
"iopub.status.idle": "2024-01-18T17:16:38.005140Z",
"shell.execute_reply": "2024-01-18T17:16:38.004633Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"def simple_parse_csv(mystring: str) -> DerivationTree:\n",
" children: List[DerivationTree] = []\n",
" tree = (START_SYMBOL, children)\n",
" for i, line in enumerate(mystring.split('\\n')):\n",
" children.append((\"record %d\" % i, [(cell, [])\n",
" for cell in line.split(',')]))\n",
" return tree"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"We also change the default orientation of the graph to *left to right* rather than *top to bottom*, for easier viewing, using `lr_graph()`."
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:38.007006Z",
"iopub.status.busy": "2024-01-18T17:16:38.006850Z",
"iopub.status.idle": "2024-01-18T17:16:38.008689Z",
"shell.execute_reply": "2024-01-18T17:16:38.008399Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"def lr_graph(dot):\n",
" dot.attr('node', shape='plain')\n",
" dot.graph_attr['rankdir'] = 'LR'"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"`display_tree()` shows the structure of our CSV file after parsing."
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:38.010476Z",
"iopub.status.busy": "2024-01-18T17:16:38.010355Z",
"iopub.status.idle": "2024-01-18T17:16:38.408545Z",
"shell.execute_reply": "2024-01-18T17:16:38.408132Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tree = simple_parse_csv(mystring)\n",
"display_tree(tree, graph_attr=lr_graph)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"This is of course simple. What if we encounter slightly more complexity? Here is another example from Wikipedia."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:38.410651Z",
"iopub.status.busy": "2024-01-18T17:16:38.410507Z",
"iopub.status.idle": "2024-01-18T17:16:38.412552Z",
"shell.execute_reply": "2024-01-18T17:16:38.412303Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1997,Ford,E350,\"ac, abs, moon\",3000.00\n"
]
}
],
"source": [
"mystring = '''\\\n",
"1997,Ford,E350,\"ac, abs, moon\",3000.00\\\n",
"'''\n",
"print(mystring)"
]
},
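{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"The trouble can be seen even without a derivation tree: plain `str.split()` has no notion of quoting, so the quoted field is torn apart at its embedded commas:\n",
"\n",
"```python\n",
"line = '1997,Ford,E350,\"ac, abs, moon\",3000.00'\n",
"line.split(',')   # ['1997', 'Ford', 'E350', '\"ac', ' abs', ' moon\"', '3000.00']\n",
"```"
]
},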
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"We define a new annotation method `highlight_node()` to mark the nodes that are interesting."
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:38.414017Z",
"iopub.status.busy": "2024-01-18T17:16:38.413920Z",
"iopub.status.idle": "2024-01-18T17:16:38.415981Z",
"shell.execute_reply": "2024-01-18T17:16:38.415717Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"def highlight_node(predicate):\n",
" def hl_node(dot, nid, symbol, ann):\n",
" if predicate(dot, nid, symbol, ann):\n",
" dot.node(repr(nid), dot_escape(symbol), fontcolor='red')\n",
" else:\n",
" dot.node(repr(nid), dot_escape(symbol))\n",
" return hl_node"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Using `highlight_node()`, we can highlight the particular nodes that were wrongly parsed."
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:38.417472Z",
"iopub.status.busy": "2024-01-18T17:16:38.417377Z",
"iopub.status.idle": "2024-01-18T17:16:38.419019Z",
"shell.execute_reply": "2024-01-18T17:16:38.418794Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"tree = simple_parse_csv(mystring)\n",
"bad_nodes = {5, 6, 7, 12, 13, 20, 22, 23, 24, 25}"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:38.420410Z",
"iopub.status.busy": "2024-01-18T17:16:38.420309Z",
"iopub.status.idle": "2024-01-18T17:16:38.422145Z",
"shell.execute_reply": "2024-01-18T17:16:38.421885Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"def hl_predicate(_d, nid, _s, _a): return nid in bad_nodes"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:38.423600Z",
"iopub.status.busy": "2024-01-18T17:16:38.423490Z",
"iopub.status.idle": "2024-01-18T17:16:38.819510Z",
"shell.execute_reply": "2024-01-18T17:16:38.819148Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"highlight_err_node = highlight_node(hl_predicate)\n",
"display_tree(tree, log=False, node_attr=highlight_err_node,\n",
" graph_attr=lr_graph)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"The marked nodes indicate where our parsing went wrong. We can of course extend our parser to understand quotes. First, we define the helper functions `parse_quote()`, `find_comma()`, and `comma_split()`."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:38.821476Z",
"iopub.status.busy": "2024-01-18T17:16:38.821353Z",
"iopub.status.idle": "2024-01-18T17:16:38.823417Z",
"shell.execute_reply": "2024-01-18T17:16:38.823095Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"def parse_quote(string, i):\n",
" v = string[i + 1:].find('\"')\n",
" return v + i + 1 if v >= 0 else -1"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:38.825375Z",
"iopub.status.busy": "2024-01-18T17:16:38.825215Z",
"iopub.status.idle": "2024-01-18T17:16:38.827600Z",
"shell.execute_reply": "2024-01-18T17:16:38.827346Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"def find_comma(string, i):\n",
" slen = len(string)\n",
" while i < slen:\n",
" if string[i] == '\"':\n",
" i = parse_quote(string, i)\n",
" if i == -1:\n",
" return -1\n",
" if string[i] == ',':\n",
" return i\n",
" i += 1\n",
" return -1"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:38.829213Z",
"iopub.status.busy": "2024-01-18T17:16:38.829060Z",
"iopub.status.idle": "2024-01-18T17:16:38.831220Z",
"shell.execute_reply": "2024-01-18T17:16:38.830941Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"def comma_split(string):\n",
" slen = len(string)\n",
" i = 0\n",
" while i < slen:\n",
" c = find_comma(string, i)\n",
" if c == -1:\n",
" yield string[i:]\n",
" return\n",
" else:\n",
" yield string[i:c]\n",
" i = c + 1"
]
},
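{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"As a sanity check, Python's standard `csv` module performs the same quote-aware splitting and can serve as a reference for `comma_split()`:\n",
"\n",
"```python\n",
"import csv\n",
"[row] = csv.reader(['1997,Ford,E350,\"ac, abs, moon\",3000.00'])\n",
"row   # ['1997', 'Ford', 'E350', 'ac, abs, moon', '3000.00']\n",
"```\n",
"\n",
"One difference: `csv.reader()` strips the quotes from the quoted field, whereas `comma_split()` keeps them."
]
},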
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"We can now define an improved `parse_csv()` that uses the quote-aware `comma_split()`."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:38.832695Z",
"iopub.status.busy": "2024-01-18T17:16:38.832584Z",
"iopub.status.idle": "2024-01-18T17:16:38.834642Z",
"shell.execute_reply": "2024-01-18T17:16:38.834404Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"def parse_csv(mystring):\n",
" children = []\n",
" tree = (START_SYMBOL, children)\n",
" for i, line in enumerate(mystring.split('\\n')):\n",
" children.append((\"record %d\" % i, [(cell, [])\n",
" for cell in comma_split(line)]))\n",
" return tree"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Our new `parse_csv()` can now handle quotes correctly."
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:38.836059Z",
"iopub.status.busy": "2024-01-18T17:16:38.835950Z",
"iopub.status.idle": "2024-01-18T17:16:39.222409Z",
"shell.execute_reply": "2024-01-18T17:16:39.222045Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tree = parse_csv(mystring)\n",
"display_tree(tree, graph_attr=lr_graph)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"That of course does not survive long:"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:39.224145Z",
"iopub.status.busy": "2024-01-18T17:16:39.224019Z",
"iopub.status.idle": "2024-01-18T17:16:39.225937Z",
"shell.execute_reply": "2024-01-18T17:16:39.225688Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1999,Chevy,\"Venture \\\"Extended Edition, Very Large\\\"\",,5000.00\n"
]
}
],
"source": [
"mystring = '''\\\n",
"1999,Chevy,\"Venture \\\\\"Extended Edition, Very Large\\\\\"\",,5000.00\\\n",
"'''\n",
"print(mystring)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"A few embedded quotes are sufficient to confuse our parser again."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:39.227321Z",
"iopub.status.busy": "2024-01-18T17:16:39.227220Z",
"iopub.status.idle": "2024-01-18T17:16:39.595129Z",
"shell.execute_reply": "2024-01-18T17:16:39.594757Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tree = parse_csv(mystring)\n",
"bad_nodes = {4, 5}\n",
"display_tree(tree, node_attr=highlight_err_node, graph_attr=lr_graph)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Here is another record from that CSV file:"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:39.597107Z",
"iopub.status.busy": "2024-01-18T17:16:39.596958Z",
"iopub.status.idle": "2024-01-18T17:16:39.599205Z",
"shell.execute_reply": "2024-01-18T17:16:39.598900Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1996,Jeep,Grand Cherokee,\"MUST SELL!\n",
"air, moon roof, loaded\",4799.00\n",
"\n"
]
}
],
"source": [
"mystring = '''\\\n",
"1996,Jeep,Grand Cherokee,\"MUST SELL!\n",
"air, moon roof, loaded\",4799.00\n",
"'''\n",
"print(mystring)"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:39.600918Z",
"iopub.status.busy": "2024-01-18T17:16:39.600776Z",
"iopub.status.idle": "2024-01-18T17:16:40.009384Z",
"shell.execute_reply": "2024-01-18T17:16:40.008944Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tree = parse_csv(mystring)\n",
"bad_nodes = {5, 6, 7, 8, 9, 10}\n",
"display_tree(tree, node_attr=highlight_err_node, graph_attr=lr_graph)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Fixing this would require modifying both the inner `parse_quote()` and the outer `parse_csv()` procedures. We note that each of these features is actually documented in the CSV [RFC 4180](https://tools.ietf.org/html/rfc4180)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Indeed, each additional improvement falls apart with even a little extra complexity. The problem becomes severe when one encounters recursive expressions. For example, JSON is a common alternative to CSV files for saving data. Similarly, one may have to parse data from an HTML table instead of a CSV file if one is getting the data from the web.\n",
"\n",
"One might be tempted to fix it with a little more ad hoc parsing, with a bit of *regular expressions* thrown in. However, that is the [path to insanity](https://stackoverflow.com/a/1732454)."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"It is here that _formal parsers_ shine. The main idea is that any given set of strings belongs to a language, and these languages can be specified by their grammars (as we saw in the [chapter on grammars](Grammars.ipynb)). The great thing about grammars is that they can be _composed_. That is, one can introduce finer and finer details into an internal structure without affecting the external structure, and similarly, one can change the external structure without much impact on the internal structure."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
},
"toc-hr-collapsed": true
},
"source": [
"## Grammars in Parsing\n",
"\n",
"We briefly describe grammars in the context of parsing."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
},
"toc-hr-collapsed": true
},
"source": [
"### Excursion: Grammars and Derivation Trees"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
},
"toc-hr-collapsed": true
},
"source": [
"A grammar, as you have read in the [chapter on grammars](Grammars.ipynb), is a set of _rules_ that explain how the start symbol can be expanded. Each rule has a name, also called a _nonterminal_, and a set of _alternative choices_ in how the nonterminal can be expanded."
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:40.011383Z",
"iopub.status.busy": "2024-01-18T17:16:40.011261Z",
"iopub.status.idle": "2024-01-18T17:16:40.013623Z",
"shell.execute_reply": "2024-01-18T17:16:40.013271Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"A1_GRAMMAR: Grammar = {\n",
"    \"<start>\": [\"<expr>\"],\n",
"    \"<expr>\": [\"<expr>+<expr>\", \"<expr>-<expr>\", \"<integer>\"],\n",
"    \"<integer>\": [\"<digit><integer>\", \"<digit>\"],\n",
"    \"<digit>\": [\"0\", \"1\", \"2\", \"3\", \"4\", \"5\", \"6\", \"7\", \"8\", \"9\"]\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:40.015116Z",
"iopub.status.busy": "2024-01-18T17:16:40.015015Z",
"iopub.status.idle": "2024-01-18T17:16:40.024498Z",
"shell.execute_reply": "2024-01-18T17:16:40.024167Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"start\n"
]
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"expr\n"
]
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"integer\n"
]
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"digit\n"
]
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"syntax_diagram(A1_GRAMMAR)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"In the above expression, the rule `<expr>: [<expr>+<expr>,<expr>-<expr>,<integer>]` corresponds to how the nonterminal `<expr>` might be expanded. The expression `<expr>+<expr>` corresponds to one of the alternative choices. We call this an _alternative_ expansion for the nonterminal `<expr>`. Finally, in an expression `<expr>+<expr>`, each of `<expr>`, `+`, and `<expr>` are _symbols_ in that expansion. A symbol could be either a nonterminal or a terminal symbol based on whether its expansion is available in the grammar."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Here is a string that represents an arithmetic expression that we would like to parse, which is specified by the grammar above:"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:40.026534Z",
"iopub.status.busy": "2024-01-18T17:16:40.026415Z",
"iopub.status.idle": "2024-01-18T17:16:40.028195Z",
"shell.execute_reply": "2024-01-18T17:16:40.027802Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"mystring = '1+2'"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"The _derivation tree_ for our expression from this grammar is given by:"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:40.029866Z",
"iopub.status.busy": "2024-01-18T17:16:40.029738Z",
"iopub.status.idle": "2024-01-18T17:16:40.428076Z",
"shell.execute_reply": "2024-01-18T17:16:40.427726Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tree = ('<start>', [('<expr>',\n",
"                     [('<expr>', [('<integer>', [('<digit>', [('1', [])])])]),\n",
"                      ('+', []),\n",
"                      ('<expr>', [('<integer>', [('<digit>', [('2',\n",
"                                                              [])])])])])])\n",
"assert mystring == tree_to_string(tree)\n",
"display_tree(tree)"
]
},
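{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `tree_to_string()` helper used above (imported from the [chapter on grammar fuzzers](GrammarFuzzer.ipynb)) recovers the parsed string by concatenating the terminal symbols at the leaves of the derivation tree. Here is a minimal sketch of that idea (a simplified stand-in, not the chapter's actual implementation):\n",
"\n",
"```python\n",
"def tree_to_string_sketch(tree):\n",
"    # Concatenate the terminal symbols at the leaves of a derivation tree.\n",
"    symbol, children = tree\n",
"    if children:\n",
"        return ''.join(tree_to_string_sketch(c) for c in children)\n",
"    # A childless nonterminal contributes nothing; a terminal yields itself.\n",
"    if symbol.startswith('<') and symbol.endswith('>'):\n",
"        return ''\n",
"    return symbol\n",
"\n",
"assert tree_to_string_sketch(\n",
"    ('<start>', [('<digit>', [('1', [])]), ('+', []),\n",
"                 ('<digit>', [('2', [])])])) == '1+2'\n",
"```"
]
},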
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"While a grammar can be used to specify a given language, there could be multiple\n",
"grammars that correspond to the same language. For example, here is another \n",
"grammar to describe the same addition expression."
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:40.429953Z",
"iopub.status.busy": "2024-01-18T17:16:40.429791Z",
"iopub.status.idle": "2024-01-18T17:16:40.432240Z",
"shell.execute_reply": "2024-01-18T17:16:40.431952Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"A2_GRAMMAR: Grammar = {\n",
"    \"<start>\": [\"<expr>\"],\n",
"    \"<expr>\": [\"<integer><expr_>\"],\n",
"    \"<expr_>\": [\"+<expr>\", \"-<expr>\", \"\"],\n",
"    \"<integer>\": [\"<digit><integer_>\"],\n",
"    \"<integer_>\": [\"<integer>\", \"\"],\n",
"    \"<digit>\": [\"0\", \"1\", \"2\", \"3\", \"4\", \"5\", \"6\", \"7\", \"8\", \"9\"]\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:40.433809Z",
"iopub.status.busy": "2024-01-18T17:16:40.433671Z",
"iopub.status.idle": "2024-01-18T17:16:40.444204Z",
"shell.execute_reply": "2024-01-18T17:16:40.443934Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"start\n"
]
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"expr\n"
]
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"expr_\n"
]
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"integer\n"
]
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"integer_\n"
]
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"digit\n"
]
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"syntax_diagram(A2_GRAMMAR)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"The corresponding derivation tree is given by:"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:40.446088Z",
"iopub.status.busy": "2024-01-18T17:16:40.445945Z",
"iopub.status.idle": "2024-01-18T17:16:40.828127Z",
"shell.execute_reply": "2024-01-18T17:16:40.827749Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tree = ('<start>', [('<expr>', [('<integer>', [('<digit>', [('1', [])]),\n",
"                                               ('<integer_>', [])]),\n",
"                                ('<expr_>', [('+', []),\n",
"                                             ('<expr>',\n",
"                                              [('<integer>',\n",
"                                                [('<digit>', [('2', [])]),\n",
"                                                 ('<integer_>', [])]),\n",
"                                               ('<expr_>', [])])])])])\n",
"assert mystring == tree_to_string(tree)\n",
"display_tree(tree)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"Indeed, there could be different classes of grammars that\n",
"describe the same language. For example, the first grammar `A1_GRAMMAR`\n",
"is a grammar that sports both _right_ and _left_ recursion, while the\n",
"second grammar `A2_GRAMMAR` does not have left recursion in the\n",
"nonterminals in any of its productions, but contains _epsilon_ productions.\n",
"(An epsilon production is a production that has empty string in its right-hand side.)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### End of Excursion"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Excursion: Recursion"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"You would have noticed that we reuse the term `<expr>` in its own definition. Using the same nonterminal in its own definition is called *recursion*. There are two specific kinds of recursion one should be aware of in parsing, as we see in the next section."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"#### Recursion\n",
"\n",
"A grammar is _left recursive_ if any of its nonterminals are left recursive,\n",
"and a nonterminal is _directly left-recursive_ if the left-most symbol of\n",
"any of its productions is itself."
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:40.830033Z",
"iopub.status.busy": "2024-01-18T17:16:40.829919Z",
"iopub.status.idle": "2024-01-18T17:16:40.831755Z",
"shell.execute_reply": "2024-01-18T17:16:40.831512Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"LR_GRAMMAR: Grammar = {\n",
"    '<start>': ['<A>'],\n",
"    '<A>': ['<A>a', ''],\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:40.833109Z",
"iopub.status.busy": "2024-01-18T17:16:40.833005Z",
"iopub.status.idle": "2024-01-18T17:16:40.837362Z",
"shell.execute_reply": "2024-01-18T17:16:40.837042Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"start\n"
]
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"A\n"
]
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"syntax_diagram(LR_GRAMMAR)"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:40.838948Z",
"iopub.status.busy": "2024-01-18T17:16:40.838839Z",
"iopub.status.idle": "2024-01-18T17:16:41.233537Z",
"shell.execute_reply": "2024-01-18T17:16:41.233131Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mystring = 'aaaaaa'\n",
"display_tree(\n",
"    ('<start>', [('<A>', [('<A>', [('<A>', []), ('a', [])]), ('a', [])]),\n",
"     ('a', [])]))"
]
},
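{
"cell_type": "markdown",
"metadata": {},
"source": [
"Left recursion is what makes naive top-down parsing of `LR_GRAMMAR` problematic: to match `<A>`, such a parser first tries to match `<A>` again, at the very same input position. The hypothetical `parse_A()` below (not one of the chapter's parsers) sketches this failure mode:\n",
"\n",
"```python\n",
"def parse_A(s, i):\n",
"    # Naive recursive descent for <A> ::= <A> 'a' | '':\n",
"    # the first alternative re-invokes parse_A at the same position i,\n",
"    # so no input is ever consumed and the recursion never bottoms out.\n",
"    j = parse_A(s, i)\n",
"    if s[j:j + 1] == 'a':\n",
"        return j + 1\n",
"    return i  # the empty alternative\n",
"\n",
"try:\n",
"    parse_A('aaaaaa', 0)\n",
"except RecursionError:\n",
"    print('naive recursive descent loops on left recursion')\n",
"```"
]
},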
{
"attachments": {},
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"A grammar is indirectly left-recursive if any\n",
"of the left-most symbols can be expanded using their definitions to\n",
"produce the nonterminal as the left-most symbol of the expansion. The left\n",
"recursion is called a _hidden-left-recursion_ if during the series of\n",
"expansions of a nonterminal, one reaches a rule where the rule contains\n",
"the same nonterminal after a prefix of other symbols, and these symbols can\n",
"derive the empty string. For example, in `A1_GRAMMAR`, `<integer>` will be\n",
"considered hidden-left recursive if `<digit>` could derive an empty string.\n",
"\n",
"Right recursive grammars are defined similarly.\n",
"Below is the derivation tree for the right recursive grammar that represents the same\n",
"language as that of `LR_GRAMMAR`."
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:41.235237Z",
"iopub.status.busy": "2024-01-18T17:16:41.235098Z",
"iopub.status.idle": "2024-01-18T17:16:41.237038Z",
"shell.execute_reply": "2024-01-18T17:16:41.236767Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"RR_GRAMMAR: Grammar = {\n",
"    '<start>': ['<A>'],\n",
"    '<A>': ['a<A>', ''],\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 52,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:41.238493Z",
"iopub.status.busy": "2024-01-18T17:16:41.238386Z",
"iopub.status.idle": "2024-01-18T17:16:41.242838Z",
"shell.execute_reply": "2024-01-18T17:16:41.242496Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"start\n"
]
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"A\n"
]
},
{
"data": {
"image/svg+xml": [
""
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"syntax_diagram(RR_GRAMMAR)"
]
},
{
"cell_type": "code",
"execution_count": 53,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:41.244641Z",
"iopub.status.busy": "2024-01-18T17:16:41.244500Z",
"iopub.status.idle": "2024-01-18T17:16:41.638365Z",
"shell.execute_reply": "2024-01-18T17:16:41.637952Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 53,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"display_tree(('<start>', [('<A>', [\n",
"    ('a', []), ('<A>', [('a', []), ('<A>', [('a', []), ('<A>', [])])])])]\n",
"    ))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"#### Ambiguity\n",
"\n",
"To complicate matters further, there could be\n",
"multiple derivation trees – also called _parses_ – corresponding to the\n",
"same string from the same grammar. For example, a string `1+2+3` can be parsed\n",
"in two ways as we see below using the `A1_GRAMMAR`."
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:41.640312Z",
"iopub.status.busy": "2024-01-18T17:16:41.640155Z",
"iopub.status.idle": "2024-01-18T17:16:42.026776Z",
"shell.execute_reply": "2024-01-18T17:16:42.026368Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 54,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mystring = '1+2+3'\n",
"tree = ('<start>',\n",
"        [('<expr>',\n",
"          [('<expr>', [('<expr>', [('<integer>', [('<digit>', [('1', [])])])]),\n",
"                       ('+', []),\n",
"                       ('<expr>', [('<integer>',\n",
"                                    [('<digit>', [('2', [])])])])]), ('+', []),\n",
"           ('<expr>', [('<integer>', [('<digit>', [('3', [])])])])])])\n",
"assert mystring == tree_to_string(tree)\n",
"display_tree(tree)"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:42.028588Z",
"iopub.status.busy": "2024-01-18T17:16:42.028470Z",
"iopub.status.idle": "2024-01-18T17:16:42.408449Z",
"shell.execute_reply": "2024-01-18T17:16:42.407999Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"image/svg+xml": [
"\n",
"\n",
"\n",
"\n",
"\n"
],
"text/plain": [
""
]
},
"execution_count": 55,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tree = ('<start>',\n",
"        [('<expr>', [('<expr>', [('<integer>', [('<digit>', [('1', [])])])]),\n",
"                     ('+', []),\n",
"                     ('<expr>',\n",
"                      [('<expr>', [('<integer>', [('<digit>', [('2', [])])])]),\n",
"                       ('+', []),\n",
"                       ('<expr>', [('<integer>', [('<digit>', [('3',\n",
"                                                               [])])])])])])])\n",
"assert tree_to_string(tree) == mystring\n",
"display_tree(tree)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"There are many ways to resolve ambiguities. One approach, taken by *Parsing Expression Grammars* explained in the next section, is to specify a particular order of resolution and choose the first one. Another approach is to simply return all possible derivation trees, which is the approach taken by the *Earley parser* that we develop later."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### End of Excursion"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
},
"tags": []
},
"source": [
"## A Parser Class"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"Next, we develop different parsers. To do that, we define a minimal interface for parsing that is obeyed by all parsers. There are two approaches to parsing a string using a grammar.\n",
"\n",
"1. The traditional approach is to use a *lexer* (also called a *tokenizer* or a *scanner*) to first tokenize the incoming string, and feed the parser one token at a time. The lexer is typically a smaller parser that accepts a *regular language*. The advantage of this approach is that the grammar used by the parser can eschew the details of tokenization. Further, one gets a shallow derivation tree at the end of the parsing, which can be directly used for generating the *Abstract Syntax Tree*.\n",
"2. The second approach is to use a tree pruner after the complete parse. With this approach, one uses a grammar that incorporates complete details of the syntax. Next, the nodes corresponding to tokens are pruned and replaced with their corresponding strings as leaf nodes. The utility of this approach is that the parser is more powerful, and further there is no artificial distinction between *lexing* and *parsing*.\n",
"\n",
"In this chapter, we use the second approach. This approach is implemented in the `prune_tree` method."
]
},
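{
"cell_type": "markdown",
"metadata": {},
"source": [
"To illustrate the second approach, here is a simplified sketch of such pruning (the hypothetical helpers `prune()` and `subtree_string()` are stand-ins, not the `prune_tree()` method itself): every subtree rooted at a token nonterminal is collapsed into a single leaf holding its string expansion.\n",
"\n",
"```python\n",
"def subtree_string(tree):\n",
"    # Concatenate the terminal leaves below a node.\n",
"    symbol, children = tree\n",
"    if children:\n",
"        return ''.join(subtree_string(c) for c in children)\n",
"    return '' if symbol.startswith('<') and symbol.endswith('>') else symbol\n",
"\n",
"def prune(tree, tokens):\n",
"    # Replace each token subtree by a single leaf with its string expansion.\n",
"    symbol, children = tree\n",
"    if symbol in tokens:\n",
"        return (symbol, [(subtree_string(tree), [])])\n",
"    return (symbol, [prune(c, tokens) for c in children])\n",
"\n",
"tree = ('<expr>', [('<integer>', [('<digit>', [('1', [])]),\n",
"                                  ('<digit>', [('2', [])])]),\n",
"                   ('+', [])])\n",
"assert prune(tree, {'<integer>'}) == (\n",
"    '<expr>', [('<integer>', [('12', [])]), ('+', [])])\n",
"```"
]
},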
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"The *Parser* class we define below provides the minimal interface. The main methods that need to be implemented by the classes implementing this interface are `parse_prefix()` and `parse()`. `parse_prefix()` returns a tuple containing the index until which parsing was completed successfully and the parse forest until that index. The method `parse()` returns a list of derivation trees if the parse was successful."
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:42.410761Z",
"iopub.status.busy": "2024-01-18T17:16:42.410474Z",
"iopub.status.idle": "2024-01-18T17:16:42.416676Z",
"shell.execute_reply": "2024-01-18T17:16:42.416379Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"class Parser:\n",
" \"\"\"Base class for parsing.\"\"\"\n",
"\n",
" def __init__(self, grammar: Grammar, *,\n",
" start_symbol: str = START_SYMBOL,\n",
" log: bool = False,\n",
" coalesce: bool = True,\n",
" tokens: Set[str] = set()) -> None:\n",
" \"\"\"Constructor.\n",
" `grammar` is the grammar to be used for parsing.\n",
" Keyword arguments:\n",
"        `start_symbol` is the start symbol (default: '<start>').\n",
" `log` enables logging (default: False).\n",
" `coalesce` defines if tokens should be coalesced (default: True).\n",
" `tokens`, if set, is a set of tokens to be used.\"\"\"\n",
" self._grammar = grammar\n",
" self._start_symbol = start_symbol\n",
" self.log = log\n",
" self.coalesce_tokens = coalesce\n",
" self.tokens = tokens\n",
"\n",
" def grammar(self) -> Grammar:\n",
" \"\"\"Return the grammar of this parser.\"\"\"\n",
" return self._grammar\n",
"\n",
" def start_symbol(self) -> str:\n",
" \"\"\"Return the start symbol of this parser.\"\"\"\n",
" return self._start_symbol\n",
"\n",
" def parse_prefix(self, text: str) -> Tuple[int, Iterable[DerivationTree]]:\n",
" \"\"\"Return pair (cursor, forest) for longest prefix of text. \n",
" To be defined in subclasses.\"\"\"\n",
" raise NotImplementedError\n",
"\n",
" def parse(self, text: str) -> Iterable[DerivationTree]:\n",
" \"\"\"Parse `text` using the grammar. \n",
" Return an iterable of parse trees.\"\"\"\n",
" cursor, forest = self.parse_prefix(text)\n",
" if cursor < len(text):\n",
" raise SyntaxError(\"at \" + repr(text[cursor:]))\n",
" return [self.prune_tree(tree) for tree in forest]\n",
"\n",
" def parse_on(self, text: str, start_symbol: str) -> Generator:\n",
" old_start = self._start_symbol\n",
" try:\n",
" self._start_symbol = start_symbol\n",
" yield from self.parse(text)\n",
" finally:\n",
" self._start_symbol = old_start\n",
"\n",
" def coalesce(self, children: List[DerivationTree]) -> List[DerivationTree]:\n",
" last = ''\n",
" new_lst: List[DerivationTree] = []\n",
" for cn, cc in children:\n",
" if cn not in self._grammar:\n",
" last += cn\n",
" else:\n",
" if last:\n",
" new_lst.append((last, []))\n",
" last = ''\n",
" new_lst.append((cn, cc))\n",
" if last:\n",
" new_lst.append((last, []))\n",
" return new_lst\n",
"\n",
" def prune_tree(self, tree: DerivationTree) -> DerivationTree:\n",
" name, children = tree\n",
" assert isinstance(children, list)\n",
"\n",
" if self.coalesce_tokens:\n",
" children = self.coalesce(cast(List[DerivationTree], children))\n",
" if name in self.tokens:\n",
" return (name, [(tree_to_string(tree), [])])\n",
" else:\n",
" return (name, [self.prune_tree(c) for c in children])"
]
},
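{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see what the `coalesce()` method does, here is the same logic as a standalone function over a made-up list of children (the `grammar` argument stands in for `self._grammar`): runs of adjacent terminal symbols are merged into single leaf nodes.\n",
"\n",
"```python\n",
"def coalesce_children(children, grammar):\n",
"    # Merge adjacent terminals (symbols that are not grammar keys)\n",
"    # into single leaf nodes, mirroring Parser.coalesce().\n",
"    last, new_lst = '', []\n",
"    for name, kids in children:\n",
"        if name not in grammar:\n",
"            last += name  # accumulate terminal characters\n",
"        else:\n",
"            if last:\n",
"                new_lst.append((last, []))\n",
"                last = ''\n",
"            new_lst.append((name, kids))\n",
"    if last:\n",
"        new_lst.append((last, []))\n",
"    return new_lst\n",
"\n",
"children = [('1', []), ('+', []), ('<expr>', [('2', [])]), ('*', []), ('3', [])]\n",
"assert coalesce_children(children, {'<expr>': []}) == [\n",
"    ('1+', []), ('<expr>', [('2', [])]), ('*3', [])]\n",
"```"
]
},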
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"### Excursion: Canonical Grammars"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"The `EXPR_GRAMMAR` we import from the [chapter on grammars](Grammars.ipynb) is oriented towards generation. In particular, the production rules are stored as strings. We need to massage this representation a little to conform to a _canonical representation_ where each token in a rule is represented separately. The `canonical` format uses separate tokens to represent each symbol in an expansion."
]
},
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:42.418323Z",
"iopub.status.busy": "2024-01-18T17:16:42.418208Z",
"iopub.status.idle": "2024-01-18T17:16:42.420106Z",
"shell.execute_reply": "2024-01-18T17:16:42.419846Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"CanonicalGrammar = Dict[str, List[List[str]]]"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:42.421549Z",
"iopub.status.busy": "2024-01-18T17:16:42.421432Z",
"iopub.status.idle": "2024-01-18T17:16:42.423017Z",
"shell.execute_reply": "2024-01-18T17:16:42.422755Z"
},
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"import re"
]
},
{
"cell_type": "code",
"execution_count": 59,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:42.424806Z",
"iopub.status.busy": "2024-01-18T17:16:42.424670Z",
"iopub.status.idle": "2024-01-18T17:16:42.427285Z",
"shell.execute_reply": "2024-01-18T17:16:42.426945Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"def single_char_tokens(grammar: Grammar) -> Dict[str, List[List[Collection[str]]]]:\n",
" g_ = {}\n",
" for key in grammar:\n",
" rules_ = []\n",
" for rule in grammar[key]:\n",
" rule_ = []\n",
" for token in rule:\n",
" if token in grammar:\n",
" rule_.append(token)\n",
" else:\n",
" rule_.extend(token)\n",
" rules_.append(rule_)\n",
" g_[key] = rules_\n",
" return g_"
]
},
{
"cell_type": "code",
"execution_count": 60,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:42.428997Z",
"iopub.status.busy": "2024-01-18T17:16:42.428873Z",
"iopub.status.idle": "2024-01-18T17:16:42.431169Z",
"shell.execute_reply": "2024-01-18T17:16:42.430921Z"
},
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [],
"source": [
"def canonical(grammar: Grammar) -> CanonicalGrammar:\n",
" def split(expansion):\n",
" if isinstance(expansion, tuple):\n",
" expansion = expansion[0]\n",
"\n",
" return [token for token in re.split(\n",
" RE_NONTERMINAL, expansion) if token]\n",
"\n",
" return {\n",
" k: [split(expression) for expression in alternatives]\n",
" for k, alternatives in grammar.items()\n",
" }"
]
},
{
"cell_type": "code",
"execution_count": 61,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:42.432679Z",
"iopub.status.busy": "2024-01-18T17:16:42.432576Z",
"iopub.status.idle": "2024-01-18T17:16:42.435321Z",
"shell.execute_reply": "2024-01-18T17:16:42.434958Z"
},
"slideshow": {
"slide_type": "skip"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'<start>': [['<expr>']],\n",
" '<expr>': [['<term>', ' + ', '<expr>'],\n",
"  ['<term>', ' - ', '<expr>'],\n",
"  ['<term>']],\n",
" '<term>': [['<factor>', ' * ', '<term>'],\n",
"  ['<factor>', ' / ', '<term>'],\n",
"  ['<factor>']],\n",
" '<factor>': [['+', '<factor>'],\n",
"  ['-', '<factor>'],\n",
"  ['(', '<expr>', ')'],\n",
"  ['<integer>', '.', '<integer>'],\n",
"  ['<integer>']],\n",
" '<integer>': [['<digit>', '<integer>'], ['<digit>']],\n",
" '<digit>': [['0'],\n",
"  ['1'],\n",
"  ['2'],\n",
"  ['3'],\n",
"  ['4'],\n",
"  ['5'],\n",
"  ['6'],\n",
"  ['7'],\n",
"  ['8'],\n",
"  ['9']]}"
]
},
"execution_count": 61,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"CE_GRAMMAR: CanonicalGrammar = canonical(EXPR_GRAMMAR)\n",
"CE_GRAMMAR"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"We also provide a convenience method for easier display of canonical grammars."
]
},
{
"cell_type": "code",
"execution_count": 62,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:42.436861Z",
"iopub.status.busy": "2024-01-18T17:16:42.436743Z",
"iopub.status.idle": "2024-01-18T17:16:42.438950Z",
"shell.execute_reply": "2024-01-18T17:16:42.438696Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"def recurse_grammar(grammar, key, order):\n",
" rules = sorted(grammar[key])\n",
" old_len = len(order)\n",
" for rule in rules:\n",
" for token in rule:\n",
" if token not in grammar: continue\n",
" if token not in order:\n",
" order.append(token)\n",
" new = order[old_len:]\n",
" for ckey in new:\n",
" recurse_grammar(grammar, ckey, order)"
]
},
{
"cell_type": "code",
"execution_count": 63,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:42.440516Z",
"iopub.status.busy": "2024-01-18T17:16:42.440330Z",
"iopub.status.idle": "2024-01-18T17:16:42.442383Z",
"shell.execute_reply": "2024-01-18T17:16:42.442034Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"def show_grammar(grammar, start_symbol=START_SYMBOL):\n",
" order = [start_symbol]\n",
" recurse_grammar(grammar, start_symbol, order)\n",
" return {k: sorted(grammar[k]) for k in order}"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:42.443841Z",
"iopub.status.busy": "2024-01-18T17:16:42.443719Z",
"iopub.status.idle": "2024-01-18T17:16:42.446532Z",
"shell.execute_reply": "2024-01-18T17:16:42.446177Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'<start>': [['<expr>']],\n",
" '<expr>': [['<term>'],\n",
"  ['<term>', ' + ', '<expr>'],\n",
"  ['<term>', ' - ', '<expr>']],\n",
" '<term>': [['<factor>'],\n",
"  ['<factor>', ' * ', '<term>'],\n",
"  ['<factor>', ' / ', '<term>']],\n",
" '<factor>': [['(', '<expr>', ')'],\n",
"  ['+', '<factor>'],\n",
"  ['-', '<factor>'],\n",
"  ['<integer>'],\n",
"  ['<integer>', '.', '<integer>']],\n",
" '<integer>': [['<digit>'], ['<digit>', '<integer>']],\n",
" '<digit>': [['0'],\n",
" ['1'],\n",
" ['2'],\n",
" ['3'],\n",
" ['4'],\n",
" ['5'],\n",
" ['6'],\n",
" ['7'],\n",
" ['8'],\n",
" ['9']]}"
]
},
"execution_count": 64,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"show_grammar(CE_GRAMMAR)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"We also provide a way to revert a canonical grammar back to its string-based representation."
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:42.448218Z",
"iopub.status.busy": "2024-01-18T17:16:42.448098Z",
"iopub.status.idle": "2024-01-18T17:16:42.450107Z",
"shell.execute_reply": "2024-01-18T17:16:42.449853Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"def non_canonical(grammar):\n",
" new_grammar = {}\n",
" for k in grammar:\n",
" rules = grammar[k]\n",
" new_rules = []\n",
" for rule in rules:\n",
" new_rules.append(''.join(rule))\n",
" new_grammar[k] = new_rules\n",
" return new_grammar"
]
},
{
"cell_type": "code",
"execution_count": 66,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:42.451689Z",
"iopub.status.busy": "2024-01-18T17:16:42.451585Z",
"iopub.status.idle": "2024-01-18T17:16:42.453747Z",
"shell.execute_reply": "2024-01-18T17:16:42.453499Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"{'<start>': ['<expr>'],\n",
" '<expr>': ['<term> + <expr>', '<term> - <expr>', '<term>'],\n",
" '<term>': ['<factor> * <term>', '<factor> / <term>', '<factor>'],\n",
" '<factor>': ['+<factor>',\n",
"  '-<factor>',\n",
"  '(<expr>)',\n",
"  '<integer>.<integer>',\n",
"  '<integer>'],\n",
" '<integer>': ['<digit><integer>', '<digit>'],\n",
" '<digit>': ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']}"
]
},
"execution_count": 66,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"non_canonical(CE_GRAMMAR)"
]
},
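{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"As a quick sanity check, converting a grammar to its canonical form and back should preserve it. The sketch below illustrates this round trip with self-contained miniature versions of the two conversions (the `mini_` helpers are illustrative stand-ins, not the chapter's implementations):\n",
"\n",
"```python\n",
"import re\n",
"\n",
"def mini_canonical(grammar):\n",
"    # split each rule string on nonterminals such as <expr>\n",
"    def tokens(rule):\n",
"        return [t for t in re.split(r'(<[^<> ]+>)', rule) if t]\n",
"    return {k: [tokens(r) for r in rules] for k, rules in grammar.items()}\n",
"\n",
"def mini_non_canonical(grammar):\n",
"    # joining the tokens restores the original rule strings\n",
"    return {k: [''.join(r) for r in rules] for k, rules in grammar.items()}\n",
"\n",
"G = {'<start>': ['<digit><digit>'], '<digit>': ['0', '1']}\n",
"assert mini_non_canonical(mini_canonical(G)) == G\n",
"```"
]
},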
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"It is easier to work with the `canonical` representation during parsing. Hence, we update our parser class to also store the `canonical` representation."
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:42.455355Z",
"iopub.status.busy": "2024-01-18T17:16:42.455099Z",
"iopub.status.idle": "2024-01-18T17:16:42.458075Z",
"shell.execute_reply": "2024-01-18T17:16:42.457755Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"class Parser(Parser):\n",
" def __init__(self, grammar, **kwargs):\n",
" self._start_symbol = kwargs.get('start_symbol', START_SYMBOL)\n",
" self.log = kwargs.get('log', False)\n",
" self.tokens = kwargs.get('tokens', set())\n",
" self.coalesce_tokens = kwargs.get('coalesce', True)\n",
" canonical_grammar = kwargs.get('canonical', False)\n",
" if canonical_grammar:\n",
" self.cgrammar = single_char_tokens(grammar)\n",
" self._grammar = non_canonical(grammar)\n",
" else:\n",
" self._grammar = dict(grammar)\n",
" self.cgrammar = single_char_tokens(canonical(grammar))\n",
" # we do not require a single rule for the start symbol\n",
" if len(grammar.get(self._start_symbol, [])) != 1:\n",
" self.cgrammar['<>'] = [[self._start_symbol]]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"We update `prune_tree()` to account for the phony start symbol, if it was inserted."
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:42.459835Z",
"iopub.status.busy": "2024-01-18T17:16:42.459695Z",
"iopub.status.idle": "2024-01-18T17:16:42.462326Z",
"shell.execute_reply": "2024-01-18T17:16:42.462039Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"class Parser(Parser):\n",
" def prune_tree(self, tree):\n",
" name, children = tree\n",
" if name == '<>':\n",
" assert len(children) == 1\n",
" return self.prune_tree(children[0])\n",
" if self.coalesce_tokens:\n",
" children = self.coalesce(children)\n",
" if name in self.tokens:\n",
" return (name, [(tree_to_string(tree), [])])\n",
" else:\n",
" return (name, [self.prune_tree(c) for c in children])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### End of Excursion"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"## Parsing Expression Grammars\n",
"\n",
"A _[Parsing Expression Grammar](http://bford.info/pub/lang/peg)_ (*PEG*) \\cite{Ford2004} is a type of _recognition-based formal grammar_ that specifies the sequence of steps to take to parse a given string.\n",
"A _parsing expression grammar_ is very similar to a _context-free grammar_ (*CFG*) such as the ones we saw in the [chapter on grammars](Grammars.ipynb). As in a CFG, a parsing expression grammar is represented by a set of nonterminals, each with corresponding alternatives describing how to match it. For example, here is a PEG that matches `a` or `b`."
]
},
{
"cell_type": "code",
"execution_count": 69,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:42.463874Z",
"iopub.status.busy": "2024-01-18T17:16:42.463747Z",
"iopub.status.idle": "2024-01-18T17:16:42.465274Z",
"shell.execute_reply": "2024-01-18T17:16:42.465019Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"PEG1 = {\n",
" '<start>': ['a', 'b']\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"However, unlike a _CFG_, the alternatives represent *ordered choice*. That is, rather than considering all rules that can potentially match, we stop at the first match that succeeds. For example, the _PEG_ below can match `ab` but not `abc`, unlike a _CFG_, which would match both. (We call the sequence of ordered choice expressions *choice expressions* rather than alternatives, to make the distinction from _CFG_ clear.)"
]
},
{
"cell_type": "code",
"execution_count": 70,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:42.466746Z",
"iopub.status.busy": "2024-01-18T17:16:42.466662Z",
"iopub.status.idle": "2024-01-18T17:16:42.468428Z",
"shell.execute_reply": "2024-01-18T17:16:42.468154Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"PEG2 = {\n",
" '<start>': ['ab', 'abc']\n",
"}"
]
},
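{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"The effect of ordered choice can be illustrated with a minimal sketch (`match_ordered` is a hypothetical helper, not part of the parser developed in this chapter): each alternative is tried in order, and the first one that matches a prefix of the input wins, even if a later alternative could match more.\n",
"\n",
"```python\n",
"def match_ordered(alternatives, text):\n",
"    # commit to the first alternative that matches a prefix of the input\n",
"    for alt in alternatives:\n",
"        if text.startswith(alt):\n",
"            return alt\n",
"    return None\n",
"\n",
"# With PEG2, 'ab' is tried first and succeeds, so 'abc' is never considered;\n",
"# the leftover 'c' then makes a complete parse of 'abc' fail.\n",
"assert match_ordered(['ab', 'abc'], 'abc') == 'ab'\n",
"```"
]
},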
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"Each choice in a _choice expression_ represents a rule on how to satisfy that particular choice. The choice is a sequence of symbols (terminals and nonterminals) that are matched against a given text as in a _CFG_."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"Beyond the syntax of grammar definitions we have seen so far, a _PEG_ can also contain a few additional elements. See the exercises at the end of the chapter for additional information.\n",
"\n",
"PEGs model the typical practice in handwritten recursive-descent parsers, and hence may be considered more intuitive to understand."
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"tags": []
},
"source": [
"### The Packrat Parser for Parsing Expression Grammars\n",
"\n",
"Short of hand-rolling a parser, _Packrat_ parsing is one of the simplest parsing techniques, and one of the standard techniques for parsing PEGs.\n",
"The _Packrat_ parser is so named because it tries to cache all results from simpler problems in the hope that these solutions can be used to avoid re-computation later. We develop a minimal _Packrat_ parser next."
]
},
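{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "skip"
}
},
"source": [
"The caching idea at the heart of packrat parsing can be sketched in a few lines (an illustration with assumed names, not the parser we develop below): for a fixed input, results are stored per (symbol, position) pair, so each combination is computed at most once.\n",
"\n",
"```python\n",
"memo = {}  # maps (symbol, position) to a parse result, for one fixed input\n",
"computations = 0\n",
"\n",
"def parse_digit(text, at):\n",
"    # return the new position on success, None on failure; cache either way\n",
"    global computations\n",
"    key = ('digit', at)\n",
"    if key not in memo:\n",
"        computations += 1\n",
"        memo[key] = at + 1 if at < len(text) and text[at].isdigit() else None\n",
"    return memo[key]\n",
"\n",
"parse_digit('42', 0)\n",
"parse_digit('42', 0)  # the second call is answered from the cache\n",
"assert computations == 1\n",
"```"
]
},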
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
},
"toc-hr-collapsed": false
},
"source": [
"We derive from the `Parser` base class first, and we accept the text to be parsed in the `parse()` method, which in turn calls `unify_key()` with the `start_symbol`.\n",
"\n",
"__Note.__ While our PEG parser can produce only a single unambiguous parse tree, other parsers can produce multiple parses for ambiguous grammars. Hence, we return a list of trees (in this case with a single element)."
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:42.469934Z",
"iopub.status.busy": "2024-01-18T17:16:42.469845Z",
"iopub.status.idle": "2024-01-18T17:16:42.471848Z",
"shell.execute_reply": "2024-01-18T17:16:42.471598Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"class PEGParser(Parser):\n",
" def parse_prefix(self, text):\n",
" cursor, tree = self.unify_key(self.start_symbol(), text, 0)\n",
" return cursor, [tree]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "subslide"
}
},
"source": [
"### Excursion: Implementing `PEGParser`"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"#### Unify Key\n",
"The `unify_key()` algorithm is simple. If given a terminal symbol, it tries to match the symbol with the current position in the text. If the symbol and text match, it returns successfully with the new parse index `at`.\n",
"\n",
"If, on the other hand, it is given a nonterminal, it retrieves the choice expression corresponding to the key, and tries to match each choice *in order* using `unify_rule()`. If **any** of the choices succeeds in being unified with the given text, the parse is considered a success, and we return with the new parse index returned by `unify_rule()`."
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {
"button": false,
"code_folding": [],
"execution": {
"iopub.execute_input": "2024-01-18T17:16:42.473230Z",
"iopub.status.busy": "2024-01-18T17:16:42.473146Z",
"iopub.status.idle": "2024-01-18T17:16:42.475739Z",
"shell.execute_reply": "2024-01-18T17:16:42.475505Z"
},
"new_sheet": false,
"run_control": {
"read_only": false
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"class PEGParser(PEGParser):\n",
" \"\"\"Packrat parser for Parsing Expression Grammars (PEGs).\"\"\"\n",
"\n",
" def unify_key(self, key, text, at=0):\n",
" if self.log:\n",
" print(\"unify_key: %s with %s\" % (repr(key), repr(text[at:])))\n",
" if key not in self.cgrammar:\n",
" if text[at:].startswith(key):\n",
" return at + len(key), (key, [])\n",
" else:\n",
" return at, None\n",
" for rule in self.cgrammar[key]:\n",
" to, res = self.unify_rule(rule, text, at)\n",
" if res is not None:\n",
" return (to, (key, res))\n",
" return 0, None"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:42.477290Z",
"iopub.status.busy": "2024-01-18T17:16:42.477192Z",
"iopub.status.idle": "2024-01-18T17:16:42.480140Z",
"shell.execute_reply": "2024-01-18T17:16:42.479817Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"unify_key: '1' with '1'\n"
]
},
{
"data": {
"text/plain": [
"(1, ('1', []))"
]
},
"execution_count": 73,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mystring = \"1\"\n",
"peg = PEGParser(EXPR_GRAMMAR, log=True)\n",
"peg.unify_key('1', mystring)"
]
},
{
"cell_type": "code",
"execution_count": 74,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:42.481718Z",
"iopub.status.busy": "2024-01-18T17:16:42.481602Z",
"iopub.status.idle": "2024-01-18T17:16:42.483910Z",
"shell.execute_reply": "2024-01-18T17:16:42.483579Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"unify_key: '1' with '2'\n"
]
},
{
"data": {
"text/plain": [
"(0, None)"
]
},
"execution_count": 74,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mystring = \"2\"\n",
"peg.unify_key('1', mystring)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"#### Unify Rule\n",
"\n",
"The `unify_rule()` method is similar. It retrieves the tokens corresponding to the rule that it needs to unify with the text, and calls `unify_key()` on them in sequence. If **all** tokens are successfully unified with the text, the parse is a success."
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:42.485413Z",
"iopub.status.busy": "2024-01-18T17:16:42.485307Z",
"iopub.status.idle": "2024-01-18T17:16:42.487519Z",
"shell.execute_reply": "2024-01-18T17:16:42.487265Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [],
"source": [
"class PEGParser(PEGParser):\n",
" def unify_rule(self, rule, text, at):\n",
" if self.log:\n",
" print('unify_rule: %s with %s' % (repr(rule), repr(text[at:])))\n",
" results = []\n",
" for token in rule:\n",
" at, res = self.unify_key(token, text, at)\n",
" if res is None:\n",
" return at, None\n",
" results.append(res)\n",
" return at, results"
]
},
{
"cell_type": "code",
"execution_count": 76,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:42.488882Z",
"iopub.status.busy": "2024-01-18T17:16:42.488781Z",
"iopub.status.idle": "2024-01-18T17:16:42.491231Z",
"shell.execute_reply": "2024-01-18T17:16:42.490977Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"unify_rule: ['0'] with '0'\n",
"unify_key: '0' with '0'\n"
]
},
{
"data": {
"text/plain": [
"(1, [('0', [])])"
]
},
"execution_count": 76,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mystring = \"0\"\n",
"peg = PEGParser(EXPR_GRAMMAR, log=True)\n",
"peg.unify_rule(peg.cgrammar['<digit>'][0], mystring, 0)"
]
},
{
"cell_type": "code",
"execution_count": 77,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:42.492673Z",
"iopub.status.busy": "2024-01-18T17:16:42.492559Z",
"iopub.status.idle": "2024-01-18T17:16:42.495889Z",
"shell.execute_reply": "2024-01-18T17:16:42.495479Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"unify_rule: ['<digit>', '<integer>'] with '12'\n",
"unify_key: '<digit>' with '12'\n",
"unify_rule: ['0'] with '12'\n",
"unify_key: '0' with '12'\n",
"unify_rule: ['1'] with '12'\n",
"unify_key: '1' with '12'\n",
"unify_key: '<integer>' with '2'\n",
"unify_rule: ['<digit>', '<integer>'] with '2'\n",
"unify_key: '<digit>' with '2'\n",
"unify_rule: ['0'] with '2'\n",
"unify_key: '0' with '2'\n",
"unify_rule: ['1'] with '2'\n",
"unify_key: '1' with '2'\n",
"unify_rule: ['2'] with '2'\n",
"unify_key: '2' with '2'\n",
"unify_key: '<integer>' with ''\n",
"unify_rule: ['<digit>', '<integer>'] with ''\n",
"unify_key: '<digit>' with ''\n",
"unify_rule: ['0'] with ''\n",
"unify_key: '0' with ''\n",
"unify_rule: ['1'] with ''\n",
"unify_key: '1' with ''\n",
"unify_rule: ['2'] with ''\n",
"unify_key: '2' with ''\n",
"unify_rule: ['3'] with ''\n",
"unify_key: '3' with ''\n",
"unify_rule: ['4'] with ''\n",
"unify_key: '4' with ''\n",
"unify_rule: ['5'] with ''\n",
"unify_key: '5' with ''\n",
"unify_rule: ['6'] with ''\n",
"unify_key: '6' with ''\n",
"unify_rule: ['7'] with ''\n",
"unify_key: '7' with ''\n",
"unify_rule: ['8'] with ''\n",
"unify_key: '8' with ''\n",
"unify_rule: ['9'] with ''\n",
"unify_key: '9' with ''\n",
"unify_rule: ['<digit>'] with ''\n",
"unify_key: '<digit>' with ''\n",
"unify_rule: ['0'] with ''\n",
"unify_key: '0' with ''\n",
"unify_rule: ['1'] with ''\n",
"unify_key: '1' with ''\n",
"unify_rule: ['2'] with ''\n",
"unify_key: '2' with ''\n",
"unify_rule: ['3'] with ''\n",
"unify_key: '3' with ''\n",
"unify_rule: ['4'] with ''\n",
"unify_key: '4' with ''\n",
"unify_rule: ['5'] with ''\n",
"unify_key: '5' with ''\n",
"unify_rule: ['6'] with ''\n",
"unify_key: '6' with ''\n",
"unify_rule: ['7'] with ''\n",
"unify_key: '7' with ''\n",
"unify_rule: ['8'] with ''\n",
"unify_key: '8' with ''\n",
"unify_rule: ['9'] with ''\n",
"unify_key: '9' with ''\n",
"unify_rule: ['<digit>'] with '2'\n",
"unify_key: '<digit>' with '2'\n",
"unify_rule: ['0'] with '2'\n",
"unify_key: '0' with '2'\n",
"unify_rule: ['1'] with '2'\n",
"unify_key: '1' with '2'\n",
"unify_rule: ['2'] with '2'\n",
"unify_key: '2' with '2'\n"
]
},
{
"data": {
"text/plain": [
"(2, [('<digit>', [('1', [])]), ('<integer>', [('<digit>', [('2', [])])])])"
]
},
"execution_count": 77,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mystring = \"12\"\n",
"peg.unify_rule(peg.cgrammar['<integer>'][0], mystring, 0)"
]
},
{
"cell_type": "code",
"execution_count": 78,
"metadata": {
"execution": {
"iopub.execute_input": "2024-01-18T17:16:42.497630Z",
"iopub.status.busy": "2024-01-18T17:16:42.497547Z",
"iopub.status.idle": "2024-01-18T17:16:42.500467Z",
"shell.execute_reply": "2024-01-18T17:16:42.500228Z"
},
"slideshow": {
"slide_type": "subslide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"[('<start>',\n",
"  [('<expr>',\n",
"    [('<term>', [('<factor>', [('<integer>', [('<digit>', [('1', [])])])])]),\n",
"     (' + ', []),\n",
"     ('<expr>',\n",
" [('