parsing confidently
===================

Lars Dɪᴇᴄᴋᴏᴡ 迪拉斯

----

table of contents
=================

* abstract
* target audience
* definition of terms
* main course
    * part 1
    * part 2
    * part 3
* Q&A

----

abstract
========

This topic has occupied me since 2014, when I programmed my first compiler. As I found out over the following years, many grammar parsers are unsuitable for the general case because they cannot deal with certain edge cases that occur quite often.¹ Since then I have dug deeply into the topic and read 25 scientific papers, and had to conclude that the situation is worse than I had assumed earlier. I will show how we can get these problems, which are not widely known outside academic circles, under control and do better.

¹

----

target audience
===============

The talk is registered as "familiar with subject" on the conference Web site, meaning I assume that all of you have already used a parser, so I will skip the basics.

1. In case you think during my talk "I already knew that (e.g. from my university studies)", keep in mind that there are enough people who are hearing it for the first time. A huge share of programmers are self-taught, so gaps in theoretical knowledge happen easily. I am one of them.

2. You must concur with me on the following argument:

    * If you have the choice between several JSON libraries, those which make obvious mistakes are disqualified from the start.
    * This proposition applies to similar domains in an analogous fashion.

   If you do not concur, then you will find a large part of the argumentation weak.

![example](https://i.imgur.com/m8r8YOY.png)

----

definition of terms
===================

1. The word "generally" as I use it has a precise meaning, opposed to the colloquial use. When I say e.g. "generally unusable", this does not mean "completely unusable", but "unusable in some cases, though not in all".

2. Abbreviation: P6G = Perl 6 Grammar.

3. Grammar parsers are parsers that accept a declarative grammar definition as input. Such a definition could look like this (or similar; a P6G transcription follows at the end of this section):

    EXPR   ⩴ EXPR 'mult' EXPR
    EXPR   ⩴ TERM
    TERM   ⩴ TERM 'add' TERM
    TERM   ⩴ FACTOR
    FACTOR ⩴ 'num'
    FACTOR ⩴ 'lpar' EXPR 'rpar'
    FACTOR ⩴ 'star' EXPR

In contrast, there are:

* hand-written parsers
* parser combinators (e.g. Hammer, HOP::Parser, Parser::Combinator, the Haskell Parsec family and many more)

Example:

    hammer::sequence(
        hammer::ch('a'),
        hammer::choice(
            hammer::sequence(
                hammer::ch('+'),
                hammer::not( hammer::ch('+') )
            ),
            hammer::token('++')
        ),
        hammer::ch('b')
    );

I had to leave them out because of time limitations.
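For comparison, the grammar above transcribed as a P6G. One caveat up front: the left-recursive rules (EXPR ⩴ EXPR 'mult' EXPR) cannot be transcribed literally, for reasons we will get to in part 2; this sketch removes the left recursion by hand with the `+ %` separator quantifier, and the rule names are mine.

    grammar Expression {
        rule TOP    { <expr> }
        # EXPR ⩴ EXPR 'mult' EXPR | TERM, with the left recursion removed
        rule expr   { <term>+ % 'mult' }
        # TERM ⩴ TERM 'add' TERM | FACTOR, likewise
        rule term   { <factor>+ % 'add' }
        rule factor { 'num' | 'lpar' <expr> 'rpar' | 'star' <expr> }
    }

    # Matches and returns a tree of named captures:
    say Expression.parse('num mult lpar num add num rpar');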
----

part 1: what is unusable exactly?
=================================

1.1 Grammar parsers have a documentation problem. Very often, they do not tell the programmer who wants to use such a library which type of parser or algorithm the library uses under the hood, nor do they at the very least mention the limitations.²

² There is at least one grammar parser documentation that outright lies to the programmer.

1.2 If the programmer wants to determine whether a library is suitable for his purposes, instead of deciding up-front he has no choice but to try out the software. Multiply this by every programmer who tries out the software: a colossal waste of time.

1.3 There is another problem. Sometimes, the programmer does not only parse input under his control, but passes the parser on to an end user. The end user supplies input unforeseen by the programmer, and the parser breaks. Sometimes it turns out the grammar was not quite correct, and the programmer can amend the grammar. Sometimes, however, the grammar was already correct and the flaw is inherent to the parser's algorithm, but it exhibits itself only under certain input, so there is nothing the programmer can do to fix it except switch to a different parser that does not have the algorithmic flaw. Again, a waste of time.

1.4 A programmer should be able to make an informed decision about which grammar parser is good to use, but the information is incomplete. The goal of the project [grammar-shootout](https://github.com/daxim/grammar-shootout) is to make up for the shortcomings in grammar parser documentation, in order to save the interested programmer some time. This is achieved not by amending the various documentations, but by torturing the parsers with all kinds of grammars and a multitude of valid inputs and comparing the results.

1.5 A programmer can then easily pick a grammar parser that passes the tests flawlessly. He can be much more confident that the parser will not break later.³

³ In fact, there are grammar parsers whose algorithm is proven to accept any CFG, which is to say the grammar parser is free of algorithmic limitations. It is very desirable to know which grammar parsers are in this category; if one has the choice of a grammar parser for a new project, why would one ever pick a flawed one?

1.6 For already existing software that contains a parser, the test results show scenarios where a grammar parser might break, and the programmer can be proactive about the problem without urgency, instead of being confronted out of the blue by a bug report from an end user.

1.7 The documentation might mention the limitations in an abstract way that is difficult for some programmers to understand; the comparison shows the problems concretely, which makes them easier to grasp.

1.8 Lastly, this comparison also functions as a survey of the various APIs of the grammar parsers. By inspecting the generated code, a programmer can get a feel for how much effort it takes to put a grammar to use with a certain grammar parser. My personal opinion is that software should reflect the motto: easy things easy, hard things possible. It should be straightforward to get going with simple parsing tasks, yet the software should be powerful enough to solve the most difficult parsing tasks. Grammar parsers that have unwieldy APIs for no good reason, or that needlessly expose complexity to the programmer, score badly in my book.

----

→ comparison.html

----

One can see that P6G scores badly.

----

part 2: problems with P6G and similar
=====================================

2.1 P6G can do many things, but none of them particularly well. What does that mean? I have picked representative features that one finds only rarely in other grammar parsers, or not at all (they offer Kleene plus and star at the most):

* `%`
* `%%`
* `~`
* assertions

2.2 Some problems I am addressing are due to the algorithm, some are specific to the implementation.

2.3 Anonymous terminals do not automatically become part of the tree. In the following, the named `<digit>` captures appear in the result, but the `+` matched by the anonymous character class does not:

    grammar {
        rule sum { <digit> <[ + \- ]> <digit> };
        token digit { \d };
    }.parse('3+4', rule => 'sum')

2.4 Left-recursive rules create an endless loop. The bug has existed since the beginning and has been reported on RT for two years, but has not been addressed yet. A minimal reproduction follows below.
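A minimal reproduction (a sketch; the grammar and its names are mine): the first alternative of `expr` re-enters `expr` at the same input position without having consumed anything, so matching never terminates.

    grammar LeftRec {
        token TOP  { <expr> }
        # Left-recursive, like the EXPR rule in the definition of terms:
        token expr { <expr> '+' <num> | <num> }
        token num  { \d+ }
    }

    # Never returns: <expr> calls itself again and again
    # without consuming a single character.
    LeftRec.parse('3+4');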
2.5 I also wanted to know what kind of parser sits under the hood, or at least what its limitations are. That is nowhere in the documentation. After two years of silence on RT, no one could tell me. Moritz even managed to write a 200-page book about the topic without saying what is actually going on under the hood.

2.6 Influencing the return value during parsing is not provided for; one can only mangle it *after* the tree has been completely constructed. There *is* a magic method `MATCH`, but it is undocumented on purpose. Actions, whether intrinsic or in a class, do not help (see the sketch below). Everything has to be done by means of the magic `$/` variable; the `return` statement is ignored. This is a daft design: other programming languages bend over backwards to inhibit assignments to global variables, but Perl 6 does not care two hoots about it here. The P6 implementers have a motto: we torture ourselves on behalf of the end user (so that his programming is comfortable). Commonly that holds true, but in this part of the implementation they did not cover themselves in glory.
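To make this concrete: the sanctioned route is an actions class, but even there the only channel is `make`, which deposits a payload on the magic `$/`; the action method's own return value is discarded, and the Match tree itself cannot be replaced. (A sketch; the grammar and class names are mine.)

    grammar Sum {
        rule TOP  { <num> '+' <num> }
        token num { \d+ }
    }

    class SumActions {
        method num ($/) { make +$/ }   # the payload travels via $/ only
        method TOP ($/) { make $<num>[0].made + $<num>[1].made }
        # `return 42;` in either method would change nothing:
        # the result is whatever was `make`d onto $/.
    }

    say Sum.parse('3+4', actions => SumActions).made;   # 7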
2.7 Ambiguous parses are forcibly resolved over the programmer's head, and if the result is not the desired return value or does not fit the desired semantics, then you are out of luck. The documentation names an incomplete tangle of rules which in turn point to document `S05`, which says about itself that it might be out of date. In any case, `S05` describes how to force a certain result; if the programmer does not like that result, there is no way to influence it, and there is also no documentation on how one might achieve that. This makes the parser generally unusable for parsing natural languages. For example, it is not possible to parse the following sentence with P6G. (→ amb.graphml)

2.8 Then there is the problem of not being able to confidently intuit how much input a P6G consumes, or whether it consumes any at all. Three examples that look trivial, but are quite crafty. (→ 2.8) It is still manageable if one sits alone, develops a P6G oneself step by step, and runs tests with each change as described in the textbook, chapter 7.5, so that one notices immediately if something breaks. This way one can keep track. But reality looks different. One gets code from someone else put in front of oneself, and the number of elements in a realistic P6G is orders of magnitude larger. One wants to change or expand it, and then it explodes, because one cannot recognise or comprehend what the parser is doing by merely looking at it. If you think this is not a likely scenario, be reminded that this is an explicit feature of the design: the keyword `grammar` acts like the keyword `class`, that is to say it is anticipated that one derives from a P6G by means of inheritance etc., and indeed one of the `S05` rules I mentioned earlier covers how to deal with inherited methods.

2.9 One cannot parse `Blob`s. Laconically: "NYI".

2.10 One cannot parse `Seq`s. Input must first be loaded completely into memory, which ruins the laziness of one's program.

2.11 When push comes to shove, the developers show the white feather. They are not confident in the software and do not use it themselves. Every interesting parsing problem in the sphere of interest of Perl 6 uses ad-hoc parsers instead of P6G. The exception is the module `URI`.

2.12 The situation is quite frustrating, especially owing to the silence on RT. It is not acceptable that interested programmers lose time due to faulty software. Or imagine a newcomer watches Andrew's parsing video on Youtube and thinks: hey, cool programming language with an easy to use parser, let me try that; and then after only a few days it breaks. Such a shabby first impression is never forgotten. So it is overdue that someone sits down and writes a reasonable parser. Cannot be that difficult!

----

part 3
======

----

So I did.

----

* The table of tests is green. (→ comparisonz.html)
* All parts of the grammar definition automatically become part of the return value.
* Processes all grammar definitions without problem; left-recursive, right-recursive, does not matter.
* Parses natural language without problem.
* If there are several possible results, one gets all of them, not an arbitrary one.
* Any intuition about how an input is processed holds. Dumb replace-and-rewrite matches the internal processing exactly.
* There are no global variables, and neither are there assignments to global variables.
* The return value can be adjusted with a simple callback. `t/output-values.t`
* Processes infinitely large inputs. `t/inf.t`
* Processes binary input. `t/binary.t`
* There is no subterfuge through backtracking.

----

* The software is not production ready; I invite you to help with it. There are still lots of things to do. The most difficult part is already done.
* Source code is on Github.

----

__END__

Q&A
===

your questions and comments