---
title: The Essence of Compilation
published: 2014-05-22T08:30:00Z
tags: haskell, compilers, tutorial
description: A toy compiler and virtual-machine in less than fifty lines of haskell.
---

Recently I was extolling the virtues of Haskell to a colleague--an
important part of my day job--and the subject was compilers.  I was
essentially asserting that compilers and external DSLs
(domain-specific languages) are a killer application for Haskell.  To
demonstrate this, I put together a simple tutorial compiler and
virtual-machine, which I think catches the essence of compilation.  We
have only addition and constants in the input expression language
(Hutton's Razor [1]) and assume an inifinite supply of virtual-machine
registers (as does the SSA-form used by LLVM [2]). We therefore do not need
to concern ourselves with the details of register allocation.

To start with, we'll need a bunch of imports. Note that two non-platform
hackage packages are required: uu-parsinglib and wl-pprint.

> import Control.Applicative
> import Control.Monad.State
> import Control.Monad.Writer (WriterT, execWriterT, tell)
> import Data.Map (Map)
> import qualified Data.Map as M
> import Text.ParserCombinators.UU (pChainl)
> import Text.ParserCombinators.UU.Utils
> import Text.PrettyPrint.Leijen ((<+>), (<>), hsep, vsep, int
>                                , text, punctuate, comma)

The abstract syntax tree (AST) is a simple inductive type--the 'Add'
constructor takes two values of type 'Exp'.  The leaf nodes are
literal integers (constants).

> -- | abstract syntax tree
> data Exp = Add Exp Exp
>          | Lit Int
>          deriving (Show)

We can define the _denotational semantics_ of our language using the
following structurally recursive function:

> -- | denotational semantics
> eval :: Exp -> Int
> eval (Add x y) = eval x + eval y
> eval (Lit i)   = i

The parser for our language makes use of the uu-parsinglib
parser-combinator library, which allows us to compose smaller parsers
into a larger parsers using _combinators_ (functions that take parsers
and return parsers). The resulting code often mirrors the formal
grammar.

> -- | the parser
> parse :: String -> Exp
> parse = runParser "Expression" pExp
>   where
>     pExp  = foldr pChainl pAtom [Add <$ pSymbol "+"]
>     pAtom = Lit <$> pNatural
>         <|> pParens pExp

~~~
λ> parse "1 + (2 + 3) + 4"
Add (Add (Lit 1) (Add (Lit 2) (Lit 3))) (Lit 4)
λ> eval it
10
~~~

\
![an abstract syntax tree](/img/ast-small.png)\
\

Now that we have a parser and an AST, we can turn our attention to the
code-generation phase. First we need to define some types for the
instruction set we are targetting.

A program is an ordered list of instructions (statements):

> type Program = [Inst]

An instruction is generally an opcode mnemonic followed by one or
more operands (arguments). We only need one instruction to implement
addition. Note that the first operand is the destination register.

> data Inst = IAdd Reg Opd Opd
>             deriving (Show)
>

Operands can either be immediate values or registers:

> data Opd = Val Int -- ^ immediate value
>          | Reg Reg -- ^ register
>          deriving (Show)

Register names can simply be unique integers:

> type Reg = Int

For our code generation phase, we will define a computation that
maintains the next available register as state and writes out
instructions as a side-effect. This computation will have the
following type:

> type Gen = WriterT [Inst] (State Reg)

The code generation function 'gen' is a structurally-recursive
function on the AST. When given an expression, it returns a
computation that when run, emits an ordered list of instructions as a
side-effect and yields an operand result.  The operand result only
makes sense in the context of the previously emitted statements; and
will ultimately represent the value of the expression during
execution time.

> gen :: Exp -> Gen Opd
> gen (Add x y) = do
>   o1 <- gen x
>   o2 <- gen y
>   r  <- new
>   tell [IAdd r o1 o2]
>   return $ Reg r
> gen (Lit i) = return $ Val i

This part of the computation delivers fresh register names by modifying state:

> new :: Gen Reg
> new = modify succ >> get

Once we have recursed the AST and built our code-generation
computation, we will need to run it to get the program (as a
side-effect).

> runGen :: Gen a -> Program
> runGen = flip evalState 0 . execWriterT

Finally, we compose the phases together, to get a compiler:

> compile :: String -> Program
> compile = runGen . gen . parse

A pretty-printer helps visualise the output (see appendix). I've used
the % prefix to indicate a register.

~~~
λ> putStrLn $ ppProg $ compile "1 + (2 + 3) + 4"
IAdd %1, 2, 3
IAdd %2, 1, %1
IAdd %3, %2, 4
~~~

To better understand the _operational semantics_, let's define the
virtual-machine that executes our instruction set.  The virtual-machine
is a stateful computation; we can represent its state as a map of
register names to integer values.

> type VM = State (Map Reg Int)
>
> exec :: Inst -> VM ()
> exec (IAdd r o1 o2) = do
>   v1 <- load o1
>   v2 <- load o2
>   store r $ v1 + v2
>
> load :: Opd -> VM Int
> load (Val i) = return i
> load (Reg r) = gets $ M.findWithDefault 0 r
>
> store :: Reg -> Int -> VM ()
> store r v = modify (M.insert r v)
>
> run :: Program -> Map Reg Int
> run = flip execState M.empty . mapM_ exec

Now we can visualise all the register contents:

~~~
λ> run $ compile "1 + (2 + 3) + 4"
fromList [(1,5),(2,6),(3,10)]
~~~

\

Note: the literal haskell for this entire post can be found [here](https://raw.githubusercontent.com/willtim/timphilipwilliams.com/master/posts/2014-05-22-the-essence-of-compilation.lhs).

* * * * * * * *

References
----------

\[1\] G. Hutton, “Fold and unfold for program semantics,” ACM SIGPLAN Notices, vol. 34, no. 1, pp. 280–288, Jan. 1999.\
\[2\] [LLVM Language Reference](http://llvm.org/docs/LangRef.html)


Appendix
--------

A program pretty-printer.

> ppProg :: Program -> String
> ppProg = show . vsep . map ppInst
>   where
>     ppInst (IAdd r o1 o2) = text "IAdd" <+> ppOpds [Reg r, o1, o2]
>     ppOpds                = hsep . punctuate comma . map ppOpd
>     ppOpd (Val i)         = int i
>     ppOpd (Reg r)         = text "%" <> int r
>