---
title: 'l7:Parsing7:Bencodee'
author: Ivan Lazar Miljenovic
date: 22 July, 2015
...
About this talk
===============
## Feel free to yell out! {data-background="images/yell.jpg" data-background-color="white"}
###### [Vortrag] by Oliver Tacke / [CC BY]
## Who is this guy? {data-background="images/huh.jpg" data-background-color="lightblue"}
###### [huh?] by elio m. / [CC BY-NC]
Notes
: * Using Haskell since late 2007
* Maintainer of various libraries, primarily graph related
* Writing my own parser
* Name very Google-able
## What's going to be covered?
> * Write a simple parser
> * Define data types
> * Encounter the dreaded "M"-word
> * Survive doing so
## Want to skip ahead?
Slides
:
: (Hint: press `s` to see my secret notes!)
Source Code
:
Why Parser Combinators?
=======================
## Consider a common alternative...
Notes
: * LCA 2013, Programming Miniconf
* "Solving Interesting Problems by Writing Parsers" by Jacinta
Richardson
* Videos available
* Perl-focussed
* So it should be no surprise when she decided to use...
## {#regex-horror data-background="images/horror.jpg" data-background-color="lightblue"}
> **Every time you mention regular expressions, someone brings up
> Jamie Zawinski.**
. . .
> **Now you have two problems.**
###### [Shock Shock Horror Horror] by Jeremy Brooks / [CC BY-NC]
Notes
: * She even admits regexes are often the wrong solution.
* Haskellers like Erik de Castro Lopo were watching
## Why not regexes?
> * Stringly-typed
> * Not re-usable
> * Difficult to read
> * Re-interpreted by compiler
Notes
: * Embeddable regexes, quasiquotes, etc.
## Compare the pair
Regular Expressions
: ```haskell
"\.\([[:alpha:]][[:alnum:]-_:\.]*\)"
```
Parser Combinators
: ```haskell
identifierAttr = do
char '.'
first <- letter
rest <- many $ alphaNum
<|> oneOf "-_:."
return (first:rest)
```
Notes
: * Sample taken from Pandoc, identifier in attribute
* regex shorter
* Have I matched all the parens properly?
* combinator version could be shorter
* Which is more readable? combinable?
* regexes more convenient for custom munging
* I forgot the `char '.'`; which is it easier to spot in?
## What about Parser Generators?
Notes
: * *Requires* a grammar
* Combinators easier to extend
* Probably faster
* Can emulate in a parser combinator
* Not embeddable in code
* Usually needs an external tool
## Availability of parser combinators
* Most (all?) FP languages
* Javascript
* R
* Java
* C#
* etc.
. . .
> **Not just for data!**
Notes
: * Multiple implementations (parsec, attoparsec, polyparse,
trifecta, etc.)
* Rite of passage!
* If Java has it, *of course* C# has to have it to prove they're
better...
Bencode
=======
## What is Bencode?
> * Encoding used by BitTorrent for storing/transmitting data
> * Comprised of four different types of values:
. . .
Type Example
------------ --------------------------
Integers `i42e`
Strings `4:spam`
Lists `l4:spami42ee`
Dictionaries `d3:bar4:spam3:fooi42ee`
Notes
: * Suggested by Mats H
* Dictionaries require keys to be strings
## Simplifications
* Use normal strings, not byte strings
* No validation checks (e.g. allow `-0`)
* Ignore dictionary ordering
Notes
: * Strings should be a sequence of bytes
* Dictionaries should be lexicographically ordered by keys
## Datatypes are Awesome! {data-background="images/awesome.jpg" data-background-color="lightblue"}
. . .
~~~haskell
data Bencode = BInt Int
| BString String
| BList [Bencode]
| BDict [(String, Bencode)]
deriving (Show)
~~~
Notes
: Implemented in `Bencode.hs`
###### [Bboy and his friends] by crises_crs / [CC BY-NC]
Defining the Parser
===================
## What is a parser?
~~~haskell
type Parser a = String -> a
~~~
Notes
: * Most basic definition of what a parser is
* Is that it?
## Consider parsing an integer
> * Let's parse `i42e`
> * Imagine a function `next :: Parser Char`
Notes
: * I first need to get the first char to get `i`
* Now what?
* The input is all used up!
## Attempt 2
~~~haskell
type Parser a = String -> (a, String)
~~~
Notes
: * Returns un-consumed input
* What happens if it _isn't_ an integer?
* This says it always returns a value!
## Attempt 3
~~~haskell
-- The result of a parser: either an error or the expected result.
data Result a = Err String
| OK a
deriving (Show)
type Parser a = String -> (Result a, String)
~~~
Notes
: * Up to isomorphism of how result is returned
* Complete in terms of being able to work
* Can be bypassed
## Final definition
~~~haskell
newtype Parser a = P { runP :: String -> (Result a, String) }
~~~
. . .
~~~haskell
-- Shhhhhh!!!!!
newtype State s a = St { runState :: s -> (a, s) }
~~~
Notes
: * Don't export constructor
* `runP` runs the parser
* `newtype` is run-time isomorphic to original
* Implemented in `SimpleParser.hs`
* Specialised version of State Monad
Let's start parsing!
====================
## Create basic parsers
~~~haskell
-- Lift a value into a parser.
toParser :: a -> Parser a
toParser a = P $ \str -> (OK a, str)
-- Throw a parser error.
failParser :: String -> Parser a
failParser err = P $ \str -> (Err err, str)
~~~
## Get some data
~~~haskell
-- Obtain the next character in the input string.
next :: Parser Char
next = P $ \str -> case str of
c:str' -> (OK c, str')
_ -> (Err "empty", str)
~~~
## Matching a predicate
~~~haskell
satisfy :: (Char -> Bool) -> Parser Char
satisfy p = P $ \str ->
case str of
c:str' | p c -> (OK c, str')
| otherwise -> (Err "not satisfied", str)
_ -> (Err "empty", str)
-- For example: parse the specified character.
char :: Char -> Parser Char
char c = satisfy (c==)
~~~
Notes
: * When parsing an integer, we want `isDigit` (see later)
* This looks an awful lot like `next`
* Idea: call next, get the result, then act on it.
## Matching again
~~~haskell
-- Take the result from one parser, and pass it as a parameter to a
-- function that returns a parser.
inject :: Parser a -> (a -> Parser b) -> Parser b
inject pa fpb = P $ \str -> case runP pa str of
(OK a, str') -> runP (fpb a) str'
(Err e, str') -> (Err e, str')
satisfy :: (Char -> Bool) -> Parser Char
satisfy p = next `inject` checkNext
where
checkNext c
| p c = toParser c
| otherwise = failParser "not satisfied"
~~~
Notes
: * Better!
* `inject` is a handy concept
## Getting multiple digits
~~~haskell
import Data.Char(isDigit)
atLeastOnce :: Parser a -> Parser [a]
atLeastOnce = -- to be implemented
parseIntDigits :: Parser String
parseIntDigits = atLeastOnce (satisfy isDigit)
~~~
Notes
: * Ignoring negative numbers for now
* Sample code has it implemented though
## But I want a number!
Need to convert a `String` to an `Int`...
. . .
~~~haskell
-- Apply a function on the result of a parser.
mapParser :: (a -> b) -> Parser a -> Parser b
mapParser f pa = P $ \str -> case runP pa str of
(OK a, str') -> (OK (f a), str')
(Err e, str') -> (Err e, str')
parseInt :: Parse Int
parseInt = mapParser read parseIntDigits
~~~
Notes
: * Talk about `read`
* Real code, might use something better
## That function looks familiar
~~~haskell
mapParser :: (a -> b) -> Parser a -> Parser b
map :: (a -> b) -> [a] -> [b]
-- Let's generalise
mapF :: (a -> b) -> f a -> f b
~~~
## That's Func-tastic!
~~~haskell
class Functor f where
fmap :: (a -> b) -> f a -> f b
instance Functor Parser where
fmap = mapParser
~~~
With the handy in-fix alias of `<$>`.
Notes
: * `<$>` vs `$`
* Functor laws
## Discarding
Imagine these combinators:
~~~haskell
discLeft :: Parser a -> Parser b -> Parser b
discRight :: Parser a -> Parser b -> Parser a
-- Run a parser on either side of the one specified.
bracket :: Parser bra -> Parser ket -> Parser a -> Parser a
bracket pb pk pa = pb `discLeft` (pa `discRight` pk)
parseBInt :: Parser Bencode
parseBInt = BInt <$> bracket (char 'i') (char 'e') parseInt
~~~
Notes
: * The discard functions can be obtained by lifting `const` and
`flip const` up into the `Parser`.
## Applying lifted functions
~~~haskell
applyFunc :: Parser (a -> b) -> Parser a -> Parser b
applyFunc pf pa = pf `inject`
(\f -> pa `inject`
(\a -> toParser (f a)))
discLeft :: Parser a -> Parser b -> Parser b
discLeft pa pb = (flip const <$> pa) `applyFunc` pb
discRight :: Parser a -> Parser b -> Parser a
discRight pa pb = (const <$> pa) `applyFunc` pb
~~~
Notes
: * This would look neater if I set up the proper fixities (less
parens)
* But this way it's more obvious
## Introducing Applicative!
~~~haskell
class Functor f => Applicative f where
-- Lift a value.
pure :: a -> f a
-- Sequential application.
(<*>) :: f (a -> b) -> f a -> f b
instance Applicative Parser where
pure = toParser
(<*>) = applyFunc
bracket :: Parser bra -> Parser ket -> Parser a -> Parser a
bracket pb pk pa = pb *> pa <* pk
~~~
Notes
: * `*>` and `<*` pre-defined, along with other combinators
* `parseBInt` doesn't change
## We still need to get the digits
> 1. Try our parser `p`.
> 2. If it fails, no values, so return an empty list `[]`.
> 3. Otherwise, recurse and put this value on the front of the list.
Notes
: * We want to get a list of values
* We can handle putting the value on front with Applicative
## Handling backtracking
> 1. `try`-based semantics
> 2. `commit`-based semantics
> 3. "Just do it already"-based semantics
Notes
: * Three differeint ways of handling backtracking
* Forget the `try`? No backtracking
* Forget the `commit`? Still works, just less performance and
worse errors (*"Partial Parsing: combining choice with
commitment"* by Malcolm Wallace)
* Doing it, as epitomised by `attoparsec`.
## Failing gracefully
~~~haskell
-- If the first parser fails, try the second.
onFail :: Parser a -> Parser a -> Parser a
onFail p1 p2 = P $ \str -> case runP p1 str of
(Err _, _) -> runP p2 str
ok -> ok
~~~
## Trust me, there's a class for it
~~~haskell
class Applicative f => Alternative f where
-- The identity of '<|>'
empty :: f a
-- An associative binary operation
(<|>) :: f a -> f a -> f a
instance Alternative Parser where
empty = failParser "empty value"
(<|>) = onFail
-- Try to parse one of the provided parsers in sequence.
oneOf :: [Parser a] -> Parser a
oneOf [] = fail "No parsers remaining"
oneOf (p:ps) = p <|> oneOf ps
~~~
Notes
: * This class comes from parser libraries
* Parsec still (?) uses it's own `<|>`.
* Comes with `many` and `some` (which I called `atLeastOnce`).
* `oneOf` can be implemented with foldr
## Finally, we have Integers!
~~~haskell
parseInt :: Parse Int
parseInt = read <$> some (satisfy isDigit)
parseBInt :: Parser Bencode
parseBInt = BInt <$> bracket (char 'i') (char 'e') parseInt
~~~
## Time for Strings
> 1. Parse a (non-negative) integer `n`
> 2. Parse a `:`
> 3. Return the next `n` characters
. . .
~~~haskell
parseString :: Parser String
parseString = parseInt
`inject`
(\n -> char ':' *> exactly n next)
-- Run the parser the specified number of times.
exactly :: Int -> Parser a -> Parser [a]
exactly n p
| n <= 0 = pure []
| otherwise = liftA2 (:) p (exactly (n-1) p)
~~~
## I wish there was a nicer syntax for inject...
. . .
> But there is!
## Warm Fuzzy Things
> * aka _Workflows_
> * ~~aka _Burritos_~~
> * aka "a monoid in the category of endofunctors"
> * aka _Monads_
Notes
: * Workflows are from F#
* SPJ: Our biggest mistake: Using the scary term "monad" rather than
"warm fuzzy thing" (Wearing the hair shirt: a retrospective on
Haskell (2003))
* "monoid" from "A Brief, Incomplete, and Mostly Wrong History of
Programming Languages" by James Iry (supposedly Philip Wadler)
* Burritos from: Abstraction, intuition, and the “monad tutorial
fallacy” by Brent Yorgey
## The "M" word
~~~haskell
class Applicative m => Monad m where
-- Usually called "bind"
(>>=) :: m a -> (a -> m b) -> m b
(>>) :: m a -> m b -> m b
return :: a -> m a
fail :: String -> m a
instance Monad Parser where
(>>=) = inject
(>>) = (*>)
return = pure
fail = failParser
~~~
Notes
: * `Applicative` constraint as of GHC 7.10 (AMP)
* `return` and `(*>)` have these by default.
* `fail` is considered a wart; used for pattern matching failures
* Replace `fail` with a new class (e.g. `MonadPlus`)?
* Parsers possibly only semi-valid use of `fail`.
## Strings again
~~~haskell
parseString :: Parser String
parseString = do n <- parseInt
char ':'
exactly n next
parseBString :: Parser Bencode
parseBString = BString <$> parseString
~~~
Notes
: * Sequential-style operations!
* "In short, Haskell is the world’s finest imperative programming
language." (SPJ, "Tackling the Awkward Squad")
* `a <- m; ...` equivalent to `m >>= (\a -> ...)`
* `m1; m2` equivalent to `m1 >> m2`
## Finishing it off
~~~haskell
parseBList :: Parser Bencode
parseBList = BList <$> bracket (char 'l') (char 'e')
(many parseBencode)
parseBDict :: Parser Bencode
parseBDict = BDict <$> bracket (char 'd') (char 'e') (many parseKV)
where
parseKV = liftA2 (,) parseString parseBencode
parseBencode :: Parser Bencode
parseBencode = oneOf [ parseBInt, parseBString
, parseBList, parseBDict
]
~~~
Notes
: * Mutual recursion is fun!
## Trying it out
~~~haskell
λ> runP parseBencode "l7:Parsing7:Bencodee"
(OK (BList [BString "Parsing",BString "Bencode"]),"")
~~~
Where to from here?
===================
## Improve current parser
* Negative numbers
* Ensure `-0` is illegal
* Strings can't have negative length
* Parse bytes, not characters
* Improve error messages
Notes
: * Source already supports -ve numbers
## Write a better Parser?
> **Datatypes are awesome...**
. . .
> **... but functions are even *more* awesome!**
Notes
: * Avoid context-switching on the `Result` type.
## Continuation Parsing Style
Some helper definitions first:
~~~haskell
type Failure r = String -> ErrMsg -> Result r
type Success a r = String -> a -> Result r
data Result a = Err String ErrMsg
| OK String a
deriving (Eq, Ord, Show, Read)
-- To avoid mixing our types up.
type ErrMsg = String
~~~
Notes
: * Note different definition of `Result`.
* Based loosely upon `attoparsec`
## CPS Parser
~~~haskell
{-# LANGUAGE RankNTypes #-}
newtype Parser a = P {
runP :: forall r. String
-> Failure r
-> Success a r
-> Result r
}
~~~
Notes
: Implemented in `CPSParser.hs`
## Running a CPS Parser
~~~haskell
failK :: Failure a
failK str msg = Err str msg
successK :: Success a a
successK str a = OK str a
runParser :: Parser a -> String -> Result a
runParser p str = runP p str failK successK
~~~
## Exercise
> **Complete the CPS parser definition!**
. . .
> **Once you have the instances and `next` defined, the rest should be
> (almost) identical.**
Notes
: Differences are based upon whether you used class methods or the
functions that implemented them.
## So long and thanks for all the fish! {data-background="images/fish.jpg" data-background-color="lightblue"}
###### [leaping_dolphins] by Zest-pk / [CC BY]
---
# Various links
...
[CC BY]: https://creativecommons.org/licenses/by/2.0/
[CC BY-NC]: https://creativecommons.org/licenses/by-nc/2.0/
[Vortrag]: https://www.flickr.com/photos/otacke/12635014673
[huh?]: https://www.flickr.com/photos/jumpn_around/2239989214/
[Shock Shock Horror Horror]: https://www.flickr.com/photos/jeremybrooks/2199355153/
[Bboy and his friends]: https://www.flickr.com/photos/33774513@N08/3153440876/
[leaping_dolphins]: https://www.flickr.com/photos/zest-pk/923931403
---
# reveal.js settings
theme: night
transition: concave
backgroundTransition: zoom
center: true
history: true
css: custom.css
...