/* -*- mode: asm; fill-column: 88; tab-width: 8; -*- */ /* Get Started with Arpilisp Side note: if this file looks funny, use a text editor that displays a fixed-width font and uses 8 spaces for tabs. Arpilisp is the Assembled Raspberry Pi Lisp. This tutorial uses assembly language, the lowest level programming language, to implement an interpreter for Lisp, the highest level language. Arpilisp implements Lisp from scratch, with no help from any libraries and minimal help from the kernel. Before you use arpilisp, you need to build it. To build arpilisp, you need a few things: * A Raspberry Pi running Raspbian * That's it. To build arpilisp, type the following in a shell: gcc -nostdlib -o arpilisp arpilisp.s This tells the GNU compiler (gcc) to use arpilisp.s, this file that you are reading, as input, to build an executable file named arpilisp. Later we cover what the -nostdlib option does. To use arpilisp itself, type this: ./arpilisp To quit arpilisp, type Ctrl-D. References https://www.raspberrypi.org https://www.raspbian.org ________________________________________________________________________________ Acknowledgements Thanks to Richard W.M. Jones, John McCarthy, and Jack W. Crenshaw for showing us the simplicity and elegance hiding in complex things, and for reminding us that computers are awesome. Thanks also to Chris Hinsley and Jay Sissom for feedback. ________________________________________________________________________________ The Shortest Introduction to Lisp Lisp is one of the oldest programming languages in computing. John McCarthy and his team implemented the first Lisp interpreter in 1960. They wanted a programming language for AI research. Since then, he and many others discovered amazing things about how Lisp can make computers do surprising, interesting things. Most people acknowledge Lisp's influence without knowing what it is exactly. Others are turned off by Lisp's weirdness. Lisp continues to influence programming languages today. It's not as inaccessible, alien, or irrelevant as you might believe. Lisp is an acronym for LISt Processor. A list is a sequence of items. To specify a list, we start with an opening parenthesis, continue with the things in the list, and end with a closing parenthesis. For example, here's a list of ingredients for a salad: (lettuce tomato oil vinegar) We can use extra spaces to clarify what we type but Lisp doesn't care about excessive white space between list items, as long as there is some. And there's no need to put white space around parentheses. Here are some lists that look different to us but mean the same to Lisp: (lettuce tomato oil vinegar ) (lettuce tomato oil vinegar) A list may itself also contain lists. Here's a list that is different from our previous list: ((lettuce tomato) (oil vinegar)) While it contains the same items, they are arranged into two sub-lists, the first for the vegetables (lettuce tomato), the second for the dressing (oil vinegar). Here's an empty list: () The empty list comes up so much that Lisp has a symbol for it: nil A symbol is just a name that represents something. What the symbol represents depends on you, the programmer. In our list above, lettuce, tomato, oil, and vinegar are symbols for salad ingredients. We sometimes refer to symbols as atoms. Unlike lists, atoms are not composed of parts. In Lisp, lists and atoms are called "S-expressions". We also use just "expression" to mean the same thing. The Processor part of LISt Processor means that we program a Lisp interpreter to manipulate lists. We manipulate lists with functions that accept S-expression arguments. Here's a cool thing about Lisp: functions are also written as S-expressions. In other words, data and programs have the same form. Side note: A lot of people make a big deal about the interchangeability of Lisp data and programs. That's certainly remarkable and accounts for a lot of Lisp's elegance and expressiveness. The irony is that after getting used to this idea, you find it odd that other programming languages don't have this feature. To see how functions work, let's take a closer look at nil and the empty list. To verify that they are the same, we enter this expression in Lisp: (eq nil () ) The first item in the list of our expression is eq, which is a function. A function tells Lisp what needs to be done: we want to determine equality. The remaining items in the list, nil and (), specify the data that we want the eq function to process. Lisp responds with: t The t symobl means that the result of this expression is true; nil and () are indeed equal. Using the t symbol to represent "true" is a Lisp convention, while nil is the conventional value for false. In Lisp, we say "function application" to mean applying a function to arguments. In our example, we apply eq to two arguments. The function compares them for equality, then returns a "t" for true or "nil" for false. Before Lisp applies a function to its arguments, it first evaluates each argument. For example, let's take a look at a more complicated expression: (eq (eq nil ()) (eq () nil)) In this expression, we give eq these arguments: (eq nil ()) and (eq () nil) We can't determine how to compute the value of the first eq application until we know the values of its arguments. Behind the scenes, that's what Lisp does. If we could watch the internal computation of our S-expression, we would see Lisp first compute then substitute the values of the arguments. This substitution renders our expression from this: (eq (eq nil ()) (eq () nil)) to this: (eq t (eq () nil)) then this: (eq t t) and finally: t Lisp evaluates expressions recursively by computing the values of the deepest argument expressions before computing arguments that are higher up. Lisp has some useful built-in functions. An important function is quote. It takes a single argument and returns it unevaluated. For example, when we type our salad list into Lisp: (lettuce tomato oil vinegar) Lisp returns an error about "lettuce" not being a function. To make it clear that our salad list is a list and not an applicaton of the lettuce function, we enter: (quote (lettuce tomato oil vinegar)) Lisp responds with: (lettuce tomato oil vinegar) Lisp wouldn't be useful if we couldn't create our own functions. To describe a function, we use a lambda expression. A lambda expression takes a list of parameters then a sequence of S-expressions to evaluate. Here's an example of a lambda function that accepts a single parameter to compute its equality with nil: (lambda (x) (eq nil x)) Each parameter in the parameter list is a symbol that represents an argument at application time. The sequence of S-expressions in the lambda may refer to these parameters. In fact, it's good practice to make sure that the S-expressions in a lambda refer only to the lambda parameters. Lisp treats the last S-expression in the lambda specially. Its value is the value that the lambda returns when Lisp applies it. In our lambda above, (x) is the list of parameters. In this case, we have a single parameter, x. The S-expression (eq nil x) is the only expression in our lambda. It's also the last expression, so the lambda returns the value of this expression when we apply the lambda. When you enter a lambda by itself: (lambda (x) (eq nil x)) Lisp returns the lambda, unapplied: (lambda (x) (eq nil x)) If we want to apply our lambda to an argument, we need to use the same form as a function application: (function argument ...) We just need to replace function with a lambda expression. For example, if we enter: ((lambda (x) (eq nil x)) (quote (lettuce tomato oil vinegar))) Lisp applies our lambda like a regular function application by following these steps: 1. Evaluate the expressions of the arguments. 2. Bind each evaluated argument, in the order it appears in the application, to each of the parameters in the order that they appear in lambda parameter list. 3. Evaluate each expression in the lambda. When an expression refers to a parameter, Lisp evaluates it by substituting its bound value from step 2. 4. Return the value of the last expression in the lambda and unbind its parameters. In our example above, in step 1, Lisp first evaluates the lone argument: (quote (lettuce tomato oil vinegar)) which gives: (lettuce tomato oil vinegar) In step 2, Lisp binds this evaluated argument to the parameter, x. For step 3, Lisp evaluates the lambda body, replacing occurrences of x with the value it is bound to. This: (eq nil x) becomes: (eq nil (lettuce tomato oil vinegar)) Our argument, (lettuce tomato oil vinegar), is not nil, so eq returns nil. Since this is the only and last expression in the lambda, the lambda application returns nil. Returning nil for a non-nil argument is ironic until you remember that nil means false in this case. The parameter bindings in a lambda application last only during the application of the lambda. In step 4, Lisp unbinds the lambda's parameters. The x argument has no value. So entering this expression after applying our lambda: x gives an error about an unbound variable. Outside of a lambda, we can bind a symbol to a value so that when you enter the symbol, Lisp returns the value. For example this expression binds "name" to Valerie: (define name (quote Valerie)) So entering this expression: name evaluates to what you would expect: Valerie You can also change an existing binding: (define name (quote Isabelle)) Bindings in a lambda temporarily override outside bindings. For example: (define name (quote James)) ((lambda (name) name) (quote Rose)) returns: Rose Inside the lambda application, the name symbol is bound to Rose. When the lambda application returns, the previous binding to name is restored, so that entering this expression: name returns this: James Of course, you can also bind a lambda to a symbol, which makes the lambda easier to use if you intend to refer to it frequently. For example, comparison with nil is something we see enough that we define a handy function for it: (define null (lambda (x) (eq x nil))) Now comparison to nil is more convenient: (null (quote Gus)) returns: nil Notice that the define function plays by different rules than a normal function. Instead of evaluating its first argument, define takes it literally, as if it were quoted. Therefore by definition (ahem), define isn't a true function. In Lisp, we call this a "special form". If you paid enough attention earlier, you noticed that quote and lambda are also special forms. Lambda doesn't evaluate its list of parameters. And Lisp delays the evaluation of a lambda's expressions until it applies the lambda. Another special form is cond, which is short for "conditional". It takes a list of clauses. (cond clause1 clause2... clauseN) Each clause is a list of 2 expressions: (test result) If the test expression is true, then cond returns the value of the corresponding result expression and stops processing subsequent clauses. If the test is false, cond proceeds to the next clause. It continues doing so until it finds a test that returns true or there are no more expressions. For example, to evaluate this expression: (cond ((eq name (quote Valerie)) (quote funny)) ((eq name (quote James)) (quote silly)) (t (quote goodbye))) Lisp starts with the first clause: ((eq name (quote Valerie)) (quote funny)) which has this test expression: (eq name (quote Valerie)) Given our most recent binding for name, this test evaluates to false. Lisp skips to the next clause: ((eq name (quote James)) (quote silly)) The test expression evaluates to true. So Lisp evaluates the second expression in this clause, which returns: silly which becomes the value of our cond expression. Lisp ignores subsequent clauses, so the clause: (t (quote goodbye)) doesn't get evaluated. As a matter of good habit, we always put a final clause in a cond that has t for a condition clause's test. That way we assure ourselves that a cond test returns a value that we specify when all other clause tests are false. Lisp isn't all special forms. In fact, there are only a handful. Most Lisp built-in functions evaluate arguments normally. For example, the most frequently-used Lisp functions are car and cdr. For now, we partly describe what they do, with full details of their awesomeness later. The car function returns the first item in a list and cdr returns a list without its first item. Helpful synonyms for these functions are "first" and "rest", respectively. For example, let's define our salad list: (define salad (quote (lettuce tomato oil vinegar))) Entering this: (car salad) returns this: lettuce And entering (cdr salad) returns (tomato oil vinegar) To pick individual items in a list, we combine car and cdr. Entering: (car (cdr salad)) gives us: tomato This expression: (car (cdr (cdr salad))) gives us: oil Note that car and cdr do not modify the list that is bound to salad. Car doesn't reduce a list to its first item, it only tells you what that first item is. Likewise, cdr doesn't remove the first item from a list, it only tells you what a list is without its first item. So after applying car and cdr to salad, salad still has its most recent binding. Entering: salad still gives us: (lettuce tomato oil vinegar) Functions that don't modify bindings other than their own are called "pure functions", which is another way of saying that you, the Lisp user, don't have to worry about unintended consequences when applying a function. Function purity is an important and useful quality in Lisp programming. So that's most of the basics of Lisp that we implement in arpilisp. We'll cover the rest as we go. References McCarthy, "Recursive Functions of Symbolic Expressions and Their Computation by Machine, Part I". MIT. 1960. McCarthy, et al. "Lisp I Programmer's Manual." MIT. 1960. ________________________________________________________________________________ The Shortest Introduction to ARM Assembly Language Today's computers have ample capacity to let us pile on layers of operating systems, shells, libraries, frameworks, sandboxes, and containers. So much that we've successfully buried assembly language completely out of mind. But the machine is still there, and not as inaccessible, alien, or irrelevant as you might believe. The Raspberry Pi uses an ARM processor. The ARM registers that we use in arpilisp are 32 bits wide. Registers r0 to r12 are general-purpose and registers r13 to r15 are single-purpose, at least for our purposes. Register r13 is the stack pointer (sp), r14 is the link register (lr), and r15 the program counter (pc). An extra register, the processor status register (apsr), offers a few single-bit condition flags that optionally record effects of previous instructions. To manipulate data, like arithmetic, comparison, bit fiddling, and so on, ARM requires that we use registers. ARM has no instructions to manipulate data in memory. Before we operate on something, we first need to load it from memory to a register. In fact, we can't even directly load from or store to memory. We need to load an address into a register then use that register to specify the location of a load (ldr) or store (str) operation. Side note: Yes, loading an address before loading the contents of the address seems like a chicken and egg situation. It's not, of course. The assembler and the processor play some convenient tricks to solve this problem for us. In GNU assembler, constants begin with a pound sign (#). Addresses start with an equal sign (=). Labels end with a colon (:). Labels must be unique with one exception: numeric labels. An assembly file can repeat numeric labels, which makes them handy for throw-away labels in an assembly procedure. An instruction refers to a numerical label by post-fixing an "f" or "b", which refers to the matching numerical label "forward" or "before" the instruction. ARM, the company, and others specify calling conventions and ABIs, but we'll ignore them to keep things simple. Arpilisp is for learning Lisp and assembler, not for conforming. References https://en.wikipedia.org/wiki/IBM_704 ARM Limited. "ARMv6-M Architecture Reference Manual." 2010. ________________________________________________________________________________ Targeting the Processor Let's get started. We begin at the foundation then go on until we get to the end: a working Lisp interpreter. We target the ARMv6 instruction set, which works with all variants of the Raspberry Pi up to the Pi 3. */ .arch armv6 /* ________________________________________________________________________________ Procedures in Assembly These assembler macros tell the assembler, the debugger, and us where a procedure starts and ends. They also make sure that the assembler aligns instructions properly and places constant values that our procedure refers to in a place that the procedure can reach. Reference https://community.arm.com/docs/DOC-9652 */ .macro PROC name .text .balign 4 .func \name, \name \name\(): .endm .macro ENDPROC .pool .endfunc .endm /* ________________________________________________________________________________ Errors We need a way to flag an error. We'll use the overflow (V) flag in the apsr. This gives us a simple mechanism to set and detect error conditions. Reacting to an error is as simple as appending the "vs" or "vc" condition code to affected instructions. Our error-flagging method is blunt--it clears all condition flags along with setting or clearing V. As long as we keep that in mind, we won't get stung. */ .macro ERRSET errsym msr apsr_nzcvq, #1<<28 ldr r7, =\errsym .endm .macro ERRCLR msr apsr_nzcvq, #0 .endm /* ________________________________________________________________________________ Returning boolean results We use the Z flag in the status register to handle true and false results. When we set the Z flag (and bluntly clear other conditions), we can test for it with the EQ condition flag. Otherwise, we clear the status registers to use the NE condition. */ .macro ZSET msr apsr_nzcvq, #1<<30 .endm .macro ZCLR msr apsr_nzcvq, #0 .endm /* ________________________________________________________________________________ Lisp stack We need a safe place to store intermediate Lisp values. We use a stack for this. We cover the idea of "safe" later on. We reserve some memory for the Lisp stack and r6 for the Lisp stack pointer. This stack pointer is separate from processor's stack pointer, aka sp, aka r13. Side note: The Linux kernel conveniently sets up sp for us and reserves some memory for the process stack when it launches arpilisp. */ .equiv LISPSTACKMAX, 1000 /* Number of pointers to Lisp objects. */ .bss .balign 4 lispstackbottom: .space LISPSTACKMAX * 4 .equiv lispstacktop, . /* Push the Lisp value in r0 onto the Lisp stack. Signal an error if the stack is full. Generating an error instead of panicking is fine because it lets our interpreter gracefully recover from exhausted Lisp stack space. */ PROC pushlisp push { r1 } ldr r1, =lispstackbottom cmp r6, r1 bne 10f ERRSET errstackfull b 999f 10: stmfd r6!, { r0 } ERRCLR 999: pop { r1 } mov pc, lr ENDPROC /* Pop from the Lisp stack and store it in r0. Modifies the condition flags. Panic if we try to pop from an empty stack. We panic instead of generating an error because a pop requires a previous push. In other words, a mismatched pop implies a bug in the interpreter's logic, not the Lisp program that it is interpreting. */ PROC poplisp ldr r0, =lispstacktop cmp r6, r0 bne 10f ldr r0, =panicstackempty b panic 10: ldmfd r6!, { r0 } mov pc, lr ENDPROC /* ________________________________________________________________________________ The mighty cons Lisp lists are implemented as cons cells. The cons cell is the most elegant and expressive data type ever discovered in computing. If symbols are the atoms in the Lisp universe, then cons cells organize the universe into molecules, galaxies, and everything in between. A cons cell is a pair of pointers. Incidentally, "pair" is often used as a synonym for a cons cell. Remember car and cdr, the functions that return different parts of a list? It turns out that these weird function names are based on the traditional names for the pointers in the cons cell. cons cell +-----+-----+ | car | cdr | +-----+-----+ It also turns out that stating that the car and cdr functions operate on lists is true only at a superficial level. What these functions really do is operate on cons cells. This makes implementing the car and cdr functions dead easy: return the pointer to the object stored in the car or cdr part of the cell, respectively. In a cell, the car and cdr pointers may each point to another cell, a symbol, or nil. For example, here is a cell where the car points to a symbol and the cdr points to nil. +-----+-----+ | car | cdr ---> nil +--|--+-----+ | v symbol For simplicity of implementation, a practical value for nil is 0. We store nothing useful at memory location 0. Side note: In fact, a Linux user program cannot store or load location 0. The ARM Linux kernel doesn't allow it by design. Doing so results in a segmentation fault, which is an intentional thing. Reference https://www.kernel.org/doc/Documentation/arm/memory.txt */ .equiv NIL, 0 /* We keep all our cells in a pool. A cell occupies 2 words = 2 * 4 bytes = 8 bytes. */ .bss .balign 4 .equiv CELLMAX, 10000 cells: .space CELLMAX * 2 * 4 .equiv cellsend, . /* A Lisp program can ignore its obsolete data by simply abandoning it. From a Lisp programmer's point of view, memory is infinite. In fact, Lisp provides no explicit way for a program to deallocate memory. Side note: I know that you know that I know that limitless memory is impossible. And it's easy to write a Lisp program that quickly exhausts memory. I'm just saying that limitless memory is part of the Lisp programming model. In assembly language, memory is definitely limited. We have to meticulously and precisely manage memory usage. When we no longer need it, we have to carefully track it to reuse it later. So it's up to our interpreter to implement automatic memory management. When a Lisp program runs out of memory, the interpreter must look for abandoned memory to reuse. For this, McCarthy's team invented the term "garbage collection". The name stuck. The first Lisp used mark-sweep garbage collection. So do we because it is simple to describe and implement. Also, we limit ourselves to collect abandoned cons cells. We won't bother with abandoned symbols. There are two ingredients to marking and sweeping: a free list and a root set. First ingredient: the free list. The free list points to abandoned cells that are ready to use. When a Lisp program needs a new cons cell, Lisp removes it from the free list and gives it to the program. */ freelist: .word NIL /* When we start the interpreter, we have to build the free list. We just iterate through the cell pool, pointing each cons cell's cdr to the next cons cell. At the last cons cell, we point to nil. When it's initialized, our free list initially looks like this: +-----+-----+ +-----+-----+ +-----------+ freelist --->| car | cdr --->| car | cdr ---> ... --->| car | cdr --->nil +-----+-----+ +-----+-----+ +-----+-----+ */ PROC initfreelist push { r0-r2, lr } ldr r0, =cells ldr r1, =freelist /* Free list starts at the beginning */ str r0, [ r1 ] /* of the cell pool. */ ldr r2, =cellsend sub r2, r2, #8 /* Stop before the last cell. */ 1: mov r1, r0 add r0, r0, #8 /* The next cell. */ str r0, [ r1, #4 ] cmp r0, r2 bne 1b pop { r0-r2, pc } ENDPROC /* Second ingredient in garbage collection: the root set. The root set is the starting point for marking. Lisp follows the cells pointed to by the cells in the root set, travelling through all referred cells. In other words, all cells that are reachable from the root set, directly or indirectly, are safe from garbage collection. Cells that are not reachable are added to the freelist. The garbage collector starts by marking all used cells. This is the mark in mark-and-sweep. During the mark phase, Lisp could encounter a cell that it has already marked. In this case, Lisp stops chasing further because we know that subsequent, reachable cells from a marked cell are already marked. To mark a cons cell, we must make sure not to corrupt the original data in the cell. We do this by taking advantage of a quirk in ARM's data alignment preferences. Genuine ARM pointers are aligned to the nearest word (4 bytes), so a pointer's 2 least significant bits are always 0. ARM pointer +--------+--------+-- --+-------+-------+ | 0 or 1 | 0 or 1 | ... | 0 | 0 | +--------+--------+-- --+-------+-------+ bit 31 bit 30 bit 1 bit 0 We take advantage of these 2 bits to encode information about a Lisp object in the actual pointer to the object. To specify that a cell is in use during the mark phase, we mark its car by settings bit 1 to 1. We have another use for bit 0, covered later. We don't need to worry about decoding a marked cell to refer to its pointer for a few reasons: our Lisp program is suspended during mark and sweep, we don't follow reachable cells from a marked cell, and we only use the mark bit during the marking phase of garbage collection. */ .equiv MARKMASK, 0b10 PROC collectgarbage push { r0, r6, lr } /* Mark the root set. */ mov r0, r9 /* The environment. */ bl mark mov r0, r8 /* The expression being evaluated. */ bl mark mov r0, r7 /* The value. */ bl mark mov r0, r5 /* The list of lambda argument values. */ bl mark ldr r0, =freelist ldr r0, [r0] bl mark /* Mark the Lisp stack. */ 10: ldr r0, =lispstacktop cmp r6, r0 beq 20f ldmfd r6!, { r0 } bl mark b 10b 20: bl sweep pop { r0, r6, pc } ENDPROC /* Given a pointer in r0, mark it if applicable. */ PROC mark push { r0-r3, lr } /* Does r0 point to a cell? */ 1: cmp r0, #NIL beq 999f tst r0, #SYMMASK bne 999f /* Is the car already marked? */ ldr r1, [ r0 ] tst r1, #MARKMASK bne 999f /* Mark the car. */ mov r2, r1 /* Mark with r2, save r1. */ orr r2, r2, #MARKMASK str r2, [r0] /* Store the mark in car. */ /* Follow the data pointed to by the car and cdr. */ mov r3, r0 /* Save original pointer in r3. */ mov r0, r1 /* Chase after the car. */ bl mark ldr r0, [ r3, #4 ] /* Chase after the cdr. */ b 1b 999: pop { r0-r3, pc } ENDPROC /* With marking completed, the interpreter starts sweeping. It iterates through the cell pool, unmarking active cells and inserting inactive cells into the free list. */ PROC sweep push { r0-r1, lr } ldr r1, =cells 1: ldr r0, [ r1 ] /* Is the car marked? */ tst r0, #MARKMASK beq 2f /* Cell is active, unmark it. */ bic r0, #MARKMASK str r0, [ r1 ] b 3f /* Cell is inactive, add it to freelist. */ 2: mov r0, #0 /* Clear the car. */ str r0, [ r1 ] ldr r0, =freelist /* Store freelist head in the cdr. */ ldr r0, [ r0 ] str r0, [ r1, #4 ] ldr r0, =freelist /* Point freelist to the new head. */ str r1, [ r0 ] /* Next cell in the cell pool. */ 3: ldr r0, =cellsend add r1, r1, #8 cmp r1, r0 bne 1b pop { r0-r1, pc } ENDPROC /* Lisp provides the cons function to allocate new cells. A Lisp program uses cons to construct its data. This is where the free list comes in. +-----+-----+ +-----+-----+ +-----------+ freelist --->| car | cdr --->| car | cdr ---> ... --->| car | cdr --->nil +-----+-----+ +-----+-----+ +-----+-----+ cell 1 cell 2 cell N Allocating a cons cell from the freelist is simple. We just "pop" it from the freelist: allocated cell | V +-----+-----+ +-----+-----+ +-----------+ freelist -+ | car | cdr | +->| car | cdr ---> ... --->| car | cdr --->nil | +-----+-----+ | +-----+-----+ +-----+-----+ | cell 1 | cell 2 cell N | | +-----------------+ Given pointers to Lisp objects in r1 and r2, return an allocated cell in r0 such that its car contains r1 and cdr contains r2. Otherwise, panic if there is no memory or issue an error if the Lisp stack is full. */ PROC cons push { r3-r4, lr } ldr r3, =freelist /* Is the free list empty? */ ldr r0, [ r3 ] cmp r0, #NIL bne 10f mov r0, r1 bl pushlisp bvs 999f mov r0, r2 bl pushlisp bvs 999f bl collectgarbage bl poplisp mov r2, r0 bl poplisp mov r1, r0 ldr r0, [ r3 ] /* Is the free list still empty? */ cmp r0, #NIL bne 10f ldr r0, =panicmemfull b panic 10: ldr r4, [ r0, #4 ] /* Advance freelist. */ str r4, [ r3 ] str r1, [ r0 ] /* Store the car and cdr. */ str r2, [ r0, #4 ] ERRCLR 999: pop { r3-r4, pc } ENDPROC /* ________________________________________________________________________________ Symbols In arpilisp, our simplest object, after the empty list, is the symbol. A symbol is just a reference to the character string of itself. Side note: Most other Lisps allow the programmer to attach properties to a symbol. For example, a symbol has a property called the print name, which is the textual representation of the symbol. To print a symbol, the Lisp interpreter retrieves the print name property and outputs it. The property mechanism is overkill for arpilisp. Instead, we use a symbol as a reference to its print name. The format we use for print names is the length of its print name encoded as a word-aligned word, followed by the character string of the print name itself. The next print name follows after that, aligned to a word boundary, and so on. symbol 1 symbol 2 +-----------+------------+--------------+---- - - - | length | characters | padding | length +- 4 bytes -+- n bytes --+- < 4 bytes -+---- - - - We store symbol print names in a block of memory. The traditional name for this block is "obarray." */ .equiv OBARRAYMAX, 2000 /* Multiple of 4, < 2^30. */ .data .balign 4 obarray: /* Keep track of the last unused position in obarray. */ .balign 4 .set OBARRAYEND, . /* We have 2 types of object: symbols and cells. We need a way to determine if a pointer refers to an object that is appropriate for the current computation. For example, if a Lisp function expects a symbol as an argument, we want to ensure that the argument is indeed a symbol and not a cell or nil. We mentioned earlier that the least 2 significant bits in an ARM pointer are 0. We already use bit 1 as the mark bit for garbage collection. That leaves us with the least significant bit, bit 0, to specify the type of object that a pointer refers to. When bit 0 is 1, the pointer refers to a symbol, not a cell and not nil either. */ .equiv SYMMASK, 1 /* A reference to a symbol is not really a pointer. To decode a reference to a symbol into a usable pointer, we mask out the flag and add the address of obarray. Why not use a real pointer instead of addding the address of obarray? It turns out that setting up the initial symbols we need for arpilisp poses a challenge. Ideally, we would like to pre-fill obarray at assembly time instead of runtime. But the assembler won't let us perform assembler-time arithmetic on labels because the ELF binary format and the Linux kernel conspire to force the calculations of final addresses only at runtime. We could hard-code the addresses of Lisp symbols at assembly time, but we'd have to either make arpilisp more complicated or depend on some uncomfortable assumptions about the memory map of a Linux process. However, the assembler happily lets us compute offsets from assembly labels. Our solution, then, is to treat a Lisp symbol pointer not as a pointer but as an offset into obarray. The LISPSYM macro appends a symbol entry to obarray. We use it to pre-define our initial set of symbols. We may only apply this macro immediately after the definition of obarray. So pre-define all your symbols here. */ .macro LISPSYM sym, printname .data .balign 4 .equiv \sym , (. - obarray) | SYMMASK .word \sym\()_STREND - \sym\()_STRSTART .balign 4 .equiv \sym\()_STRSTART, . .ascii "\printname" .equiv \sym\()_STREND, . .balign 4 .set OBARRAYEND, . .endm LISPSYM symnil, "nil" LISPSYM symt, "t" LISPSYM symquote, "quote" LISPSYM symatom, "atom" LISPSYM symeq, "eq" LISPSYM symcar, "car" LISPSYM symcdr, "cdr" LISPSYM symcons, "cons" LISPSYM symcond, "cond" LISPSYM symdefine, "define" LISPSYM symlambda, "lambda" LISPSYM errdot, "Error: unexpected dot " LISPSYM errparenmissing, "Error: expected closing parenthesis " LISPSYM errparen, "Error: unexpected closing parenthesis " LISPSYM errbadlistexpr, "Error: unknown expression " LISPSYM errcellornil, "Error: expected a cell or nil " LISPSYM errbindsym, "Error: expected a symbol to bind " LISPSYM errbadsym, "Error: expected a valid symbol " LISPSYM errunboundvar, "Error: unbound variable " LISPSYM errargextra, "Error: too many arguments " LISPSYM errargmissing, "Error: missing arguments " LISPSYM errstackfull, "Error: stack full " LISPSYM panicstackempty, "Panic: stack empty" LISPSYM panicobarrayfull, "Panic: obarray full" LISPSYM panicmemfull, "Panic: memory full" /* Fill the rest of the obarray space. */ .space OBARRAYMAX - (. - obarray) .set OBARRAYCAP, . /* Remember the last unused position in obarray. */ obarrayend: .word OBARRAYEND /* A useful operation for symbols is to compare them for equality. Given two symbols, are they the same? One way to do this is to compare their print names, character by character. A much simpler and faster way is to compare the symbols' pseudo-pointers. Remember the eq function earlier? Implementing this function in assembler is dead easy: just compare two pointers. For cells and nil, this work well. We would like to have this same convenience and efficiency for symbol comparison. But there's a down side: for this simple symbol comparison, we need to make sure that symbols are unique, which means making sure that we only record a single copy of a symbol. Side note: The up side of storing a single copy of a print name is that we save memory, which is, incidentally, why we don't bother with garbage collection of symbols. In Lisp, we say that this storage of only one instance of a symbol is "interning" a symbol. To create a new symbol, either return a reference to the single instance of a symbol in obarray or, if the symbol has not yet been interned, create the symbol in obarray and return the reference. The buffer for a symbol to intern is internbuffer. We use it as a temporary place to store a symbol from input. We structure it like a symbol in obarray, which means a word-aligned word-sized length value immediately followed by a character string. */ .equiv INTERNMAX, 32 /* A multiple of 4 */ .bss .balign 4 internbufferlen: .word 0 .balign 4 internbuffer: .space INTERNMAX /* Return the pseudo-pointer to the symbol in obarray that matches the string in internbuffer. If the symbol does not exist, create it in obarray and return the symbol's pseudo-pointer. */ PROC intern push { r0-r5, lr } /* Search for the symbol in obarray. */ ldr r0, =obarray /* Start of obarray. */ 1: ldr r1, =obarrayend /* At the end of obarray? */ ldr r1, [ r1 ] cmp r0, r1 beq 5f mov r1, r0 ldr r2, =internbufferlen /* Compare print name lengths. */ ldr r3, [ r1 ], #4 /* Length of print name in obarray. */ ldr r4, [ r2 ], #4 /* Length of print name to intern. */ cmp r3, r4 bne 4f /* Compare print names. */ 2: cmp r3, #0 /* Have we compared all characters? */ beq 8f /* Found. Symbol already interned. */ 3: ldrb r4, [ r1 ], #1 ldrb r5, [ r2 ], #1 cmp r4, r5 bne 4f sub r3, r3, #1 b 2b /* Advance to the next print name in obarray. */ 4: ldr r1, [ r0 ] /* Length of the current string. */ add r1, r1, #4 /* Advance to the first character. */ add r1, r1, r0 /* Add string's address. */ /* At this point, r1 is at the end of the symbol object that r0 points to. The next symbol object is at the next 4-byte boundary. We need to round up to this boundary by calculating how much to add to r1. The bad news is that ARM doesn't have an integer modulo instruction. The good news is that 4 is a power of 2, so bit manipulation is the obvious choice and ARM has plenty of useful bit-manipulation instructions. To calculate an offset to add to r1, we want this mapping of the least 2 bits: 0b00 -> 0b00 0b01 -> 0b11 0b10 -> 0b10 0b11 -> 0b01 Note that for 0b00, we're already there so there's nothing to add to r1. */ and r0, r1, #0b11 /* Consider only the least 2 bits. */ rsb r0, r0, #4 /* Subtract 4. */ and r0, r0, #0b11 /* Keep only the least 2 bits. */ add r0, r0, r1 /* Point r0 to next object. */ b 1b /* Allocate a new print name in obarray. At this point, both r0 and r1 contain the value stored at obarrayend. This value points to our new symbol, which we return in r0. So we preserve r0 for that purpose. */ 5: ldr r3, =internbufferlen /* Is there room in obarray? */ add r1, r1, #4 /* Add the word for the length. */ ldr r2, [ r3 ] add r1, r1, r2 /* Add the length of the string. */ and r2, r1, #0b11 /* Align to the next 4-byte boundary. */ rsb r2, r2, #4 add r1, r1, r2 ldr r2, =OBARRAYCAP cmp r1, r2 blt 6f /* No more symbol room. */ ldr r0, =panicobarrayfull b panic /* Expand obarray. */ 6: ldr r4, =obarrayend str r1, [ r4 ] mov r1, r0 /* Reset r1 to our new symbol. */ ldr r2, [ r3 ], #4 /* Copy the length. */ str r2, [ r1 ], #4 /* Copy the string. */ 7: cmp r2, #0 beq 8f ldrb r4, [ r3 ], #1 strb r4, [ r1 ], #1 sub r2, r2, #1 b 7b /* Encode our pointer as a symbol. */ 8: ldr r1, =obarray /* Subtract obarray base. */ sub r7, r0, r1 orr r7, r7, #SYMMASK /* Imprint the mask. */ pop { r0-r5, pc } ENDPROC /* ________________________________________________________________________________ Output We need to print symbols, cells, and nil. Here is where we get to see the internal representations of cons cells and symbols meet their external Lisp counterparts. We've seen hints earlier, but now we get to clearly see how lists are made of cons cells. To print a pointer to a Lisp object, let's take a first stab at syntax rules: pointer = 'nil' | symbol | cell cell = '(' car '.' cdr ')' car = pointer cdr = pointer A few things about the first draft of our syntax rules: * A cell's textual representation starts and ends with parentheses. * We separate the car and cdr with a period. In Lisp, we call this "dot notation". * A list is a recursive definition; a car and cdr may each point to other cells via the syntax rule, pointer. Example output from these rules: * Printing nil produces the output of "nil". * Printing a symbol produces the symbol's character string. * Let A be a cons cell with car pointing to symbol "abacus" and cdr pointing to symbol "anchovy". Output: "(abacus . anchovy)". A +-----+-----+ | car | cdr ----> anchovy +--|--+-----+ | v abacus * Let B be a cons cell in which the car points to symbol "beluga" and cdr points to nil. Printing B gives "(beluga . nil)". B +-----+-----+ | car | cdr -----> nil +--|--+-----+ | v beluga * Let C be a cons cell with car pointing to symbol "cantilever" and cdr pointing to cell B. Its output: "(cantilever . (beluga . nil))". C B +-----+-----+ +-----+-----+ | car | cdr ---->| car | cdr ----> nil +--|--+-----+ +--|--+-----+ | | v v cantilever beluga These format rules are perfectly useful and accurate. But notice that they don't quite give the output for lists that we saw earlier. Our rules produce a lot of dots and parentheses for cons cells structures that barely approach complexity. Now you know why Lisp is sometimes referred to as "Lots of Irritating, Stupid Parentheses." Let's add a couple of rules to make the output a little cleaner. Our new rules: pointer = 'nil' | symbol | cell cell = '(' cell-contents ')' cell-contents = car (cdr-nil | cdr-cell | cdr-symbol) car = pointer cdr-nil = empty-string cdr-cell = ' ' cell-contents cdr-symbol = ' . ' symbol Note our changes: * When the cdr is nil, don't print the dot and don't print the 'nil'. * When the cdr points to a cell, we don't print a dot or the left parenthesis of the cons cell that the cdr points to. * Only when the cdr points to a symbol, do we print a dot, followed by the symbol. With these updated rules our example output becomes: * Nil is still output as "nil". * A symbol is still output as itself. * A: "(abacus . anchovy)" * B: "(beluga)" * C: "(cantilever beluga)" Now our output rules are consistent with what we expect about list output. Before we start printing, let's define some character strings to match our rules. */ .balign 4 leftparenstr: .ascii "(" .equiv LEFTPARENSTRLEN, . - leftparenstr .balign 4 rightparenstr: .ascii ")" .equiv RIGHTPARENSTRLEN, . - rightparenstr .balign 4 spacestr: .ascii " " .equiv SPACESTRLEN, . - spacestr .balign 4 dotstr: .ascii " . " .equiv DOTSTRLEN, . - dotstr /* Most Lisp implementations provide prin1 and print functions. The prin1 procedure outputs a Lisp object without a newline ending. Given a pointer to a Lisp object in r0, output its textual representation. */ PROC prin1 push { r0-r3, lr } cmp r0, #NIL /* Is this nil? */ bne 1f /* Print nil. */ ldr r0, =symnil b 2f 1: tst r0, #SYMMASK /* Is this a symbol? */ beq 3f /* Print a symbol. */ 2: bic r0, #SYMMASK ldr r1, =obarray add r0, r0, r1 add r1, r0, #4 /* Symbol string address. */ ldr r2, [ r0 ] /* Symbol length. */ bl write b 999f /* Done. */ /* Print a cell. */ 3: ldr r1, =leftparenstr ldr r2, =LEFTPARENSTRLEN bl write 4: mov r3, r0 /* Save our cell pointer. */ ldr r0, [r3] /* Get the car and print it. */ bl prin1 ldr r0, [r3, #4] /* Get the cdr. */ cmp r0, #NIL /* Is the cdr nil? */ beq 6f tst r0, #SYMMASK /* Is the cdr a symbol? */ bne 5f ldr r1, =spacestr /* Cdr is a cell, so iterate. */ ldr r2, =SPACESTRLEN bl write b 4b 5: ldr r1, =dotstr /* Cdr is a symbol, so print the dot. */ ldr r2, =DOTSTRLEN bl write bl prin1 /* Print the symbol. */ 6: ldr r1, =rightparenstr ldr r2, =RIGHTPARENSTRLEN bl write 999: pop { r0-r3, pc } ENDPROC /* Print an end-of-line. In Lisp, outputing an end-of-line is traditionally referred to as "terminate the print line" or "terpri." */ .data .balign 4 eolstr: .ascii "\n" .equiv EOLSTRLEN, . - eolstr PROC terpri push { r1, r2, lr } ldr r1, =eolstr ldr r2, =EOLSTRLEN bl write pop { r1, r2, pc } ENDPROC /* Print an object with newlines around it. */ PROC print push { lr } bl prin1 bl terpri pop { pc } ENDPROC /* ________________________________________________________________________________ Low-Level I/O With the logic to format S-expressions for output, we need a way to actually get this information into the real world for the user to see. This is the part where we implement I/O. Notice that the gcc command we use earlier includes the -nostdlib option. This tells gcc not to link to the standard system and startup libraries. These libraries pile layers of convenient wrappers and lots of infrastructure between a user program and the kernel's low-level services. We choose to avoid that. Instead, we call the linux kernel directly through its system call interface. Each system call has a unique number. Most system calls also need parameters. To cross the border between a user program and the kernel, we load the system call's number and parameters into registers then use the svc instruction to take the leap into the kernel. The kernel does its magic and returns execution to the user program at the next instruction after svc. Arpilisp uses a minimum of system calls. The full list is here: /usr/src/arch/arm/kernel/calls.S For output, we use sys_write, specified by r7, and the standard output file, in r0. The kernel also requires that we specify the address of the buffer to write from in r1 and the length of the buffer in r2. */ PROC write push { r0, r7, lr } mov r7, #4 /* sys_write */ mov r0, #1 /* stdout */ svc #0 pop { r0, r7, pc } ENDPROC /* ________________________________________________________________________________ Input In Lisp, the read function converts input characters into an S-expression. We read a little differently than we print. Printing is simple: we already know everything we need to print. But when we read, we don't know what it is until it's read completely. Most books on programming language implementation divide the problem of input processing into lexical analysis and parsing. A lexical analyzer groups characters into tokens. A parser then groups tokens into expressions, statements, and the other higher-level constructs of a high-level language. S-expression syntax, especially for arpilisp, is simple enough not to worry about a formal separation of lexical analysis and parsing. Lexical analysis for arpilisp has only these things to tokenize: * opening parenthesis * closing parenthesis * dot * symbol The parentheses and dot characters are easy to read: when you read one of them, you're done reading the token. A symbol token is composed of one or more sequential characters. This is a problem because we can read only a single character at a time. When we get the first character that forms a symbol, we have no way to know if we have read all of the symbol. The only way to confirm whether we have read a complete symbol or not is to read the next character. For example, we read an "h" character. This "h" might be a complete symbol. But it might not be, so we are compelled to keep reading. We read the next few characters, which are "a" then "t" then " " (space). The space is not part of the symbol, which tells us conclusively that the previous three characters form the symbol "hat". Reading the space means we have gone too far. We have read a character that we can't use now but will need later for the next token. We need some way to "unread" this character. In Unix-influenced systems, there is exactly such a standard library function, ungetc(). This function puts a character back in a file so that it can be re-read later. It's a funny way to go about things, because the character really isn't put back in the file. The kernel offers no such capability for practical and technical reasons. It's actually the standard library that fakes an "unget". Implementing ungetc() requires some trickery in a higher-level language, and especially in assembly language. Instead, we use a simpler alternative, excellently described by Jack W. Crenshaw in his Let's Build a Compiler articles. The Crenshaw method uses a look-ahead variable, Look, and a function, GetChar. Look contains the character that the lexical analyzer is currently considering. GetChar updates Look by reading the next character in a file. In other words, only when we are sure that we can use the Look character do we bother to read the next character, where it is safely held in Look to be analyzed when it's needed. Reference http://www.compilers.iecc.com/crenshaw/ */ .data .balign 4 lookbuffer: .ascii "\n" /* Force a getchar and a prompt. */ .equiv EOT, 4 /* Ctrl+D, end of transmission. */ .macro LOOK reg ldr \reg, =lookbuffer ldr \reg, [ \reg ] .endm /* Before we get to actually reading S-expressions, we need some lower-level procedures. To read characters from standard input, we use the sys_read system call. */ PROC getchar push { r0-r2, r7, lr } ldr r1, =lookbuffer mov r2, #1 mov r7, #3 /* sys_read */ mov r0, #0 /* stdin */ svc #0 cmp r0, #0 /* End of file? */ moveq r0, #EOT streq r0, [ r1 ] pop { r0-r2, r7, pc } ENDPROC /* We need procedures to validate characters that we read. For example, we need to know if a character can be used for a symbol. In arpilisp, a valid symbol character is any graphical character that is not otherwise part of Lisp syntax. This means any string of characters is allowed in a symbol excluding control characters (which includes white space), the opening and closing parentheses, and dot. This function only works within the Unicode Basic Latin block, aka ASCII. We could make the effort to support Unicode more substantially, but we want to focus on Lisp in assembly, not Unicode. */ PROC issym push { lr } cmp r0, #'(' beq 20f cmp r0, #')' beq 20f cmp r0, #'.' beq 20f cmp r0, #EOT beq 20f cmp r0, #'!' blt 20f cmp r0, #'~' bge 20f ZSET /* A valid symbol character. */ b 999f 20: ZCLR /* Not valid. */ 999: pop { pc } ENDPROC /* Is the character in r0 white space? Notice that we don't use the ZSET and ZCLR macros because the cmp instruction sets Z for us. */ PROC iswhite push { lr } cmp r0, #' ' beq 999f cmp r0, #'\t' beq 999f cmp r0, #'\n' 999: pop { pc } ENDPROC /* We enhance our lexical analyzer to recognize white space for our own human convenience. */ PROC skipwhite push { r0, lr } 1: LOOK r0 bl iswhite bne 999f bl getchar b 1b 99: pop { r0, pc } ENDPROC /* Now we're ready to read S-expressions. We use the same syntax rules that we use for output. Note how we save and restore r6, the Lisp stack pointer. We do this to simplify error handling; instead of being careful about unwinding the Lisp stack to handle an error, we just reset r6 to its starting point. Also notice how read is a wrapper. It does nothing but save r6 then call read1, which does the heavy lifting. The reason is because of lists: a list may contain symbols, nil, and other lists. To implement this definition, we need a procedure that can call itself, read1. Read an S-expression, returning a pointer to it in r7. */ PROC read push { r6, lr } bl read1 pop { r6, pc } ENDPROC /* Read an S-expression. */ PROC read1 push { r0, lr } bl skipwhite LOOK r0 /* End of file. Return to the OS. */ cmp r0, #EOT beq finish /* We aren't ready for a dot here. */ cmp r0, #'.' bne 1f bl getchar /* Eat the dot. */ ERRSET errdot b 999f /* We aren't ready to close parenthesis either. */ 1: cmp r0, #')' bne 2f bl getchar /* Eat the closing parenthesis. */ ERRSET errparen b 999f 2: cmp r0, #'(' /* Read a cell? */ bne 3f bl readcell bvs 999f b 990f 3: bl issym beq 4f bl getchar /* Eat the non-symbol character. */ ERRSET errbadsym b 999f 4: bl readsym bvs 999f 990: ERRCLR 999: pop { r0, pc } ENDPROC /* Read a cell, returning a pointer to it in r7. Remember that our print rule for cell-contents may contain a car, which is a pointer, which may refer to a cell. This mutual recursion means we allow an arbitrarily deep number of cars before we get to the cdr-nil, cdr-symbol, or cdr-cell rules. */ PROC readcell push { r0-r3, lr } bl getchar /* Eat the opening parenthesis. */ bl skipwhite LOOK r0 /* Immediate closing parenthesis? */ cmp r0, #')' bne 5f mov r7, #NIL b 990f /* Read the first object in our list. */ 5: bl read1 bvs 999f /* Construct a list comprising only of a pointer to our first object. */ mov r1, r7 mov r2, #NIL bl cons mov r7, r0 add r3, r0, #4 /* Point to the cdr of our list. */ 10: bl skipwhite /* Are we at the end of our list? */ LOOK r0 cmp r0, #')' beq 990f cmp r0, #'.' /* Are we at a dot? */ beq 30f 20: mov r0, r7 /* Read the next object in our list. */ bl pushlisp bvs 999f bl read1 bvs 999f mov r1, r7 /* Append the object's pointer to our list. */ mov r2, #NIL bl cons str r0, [ r3 ] /* Point the cdr to the new cons. */ add r3, r0, #4 /* Point to the cdr of our extended list. */ bl poplisp mov r7, r0 b 10b 30: bl getchar /* Eat the dot. */ mov r0, r7 bl pushlisp bvs 999f bl read1 /* Read the cdr. */ bvs 999f str r7, [ r3 ] /* Store the cdr. */ bl poplisp mov r7, r0 bl skipwhite /* Expect a closing parenthesis. */ LOOK r0 cmp r0, #')' beq 990f ERRSET errparenmissing b 999f 990: bl getchar /* Eat the closing parenthesis. */ ERRCLR 999: pop { r0-r3, pc } ENDPROC /* Read a symbol starting with the character in r0, returning a pointer to the symbol in r7. This procedure assumes that, on entry, the character in r0 has been validated by issym. */ PROC readsym push { r0-r2, lr } mov r1, #0 /* Number of characters in the symbol. */ ldr r2, =internbuffer 1: cmp r1, #INTERNMAX /* A symbol can be longer than */ bge 2f /* INTERNMAX but we only recognize */ strb r0, [ r2 ] /* the first INTERNMAX characters. */ add r2, r2, #1 add r1, r1, #1 2: bl getchar LOOK r0 bl iswhite beq 3f cmp r0, #'(' beq 3f cmp r0, #')' beq 3f cmp r0, #'.' beq 3f cmp r0, #EOT beq 990f bl issym beq 1b bl getchar /* Eat the non-symbol character. */ ERRSET errbadsym b 999f 3: ldr r2, =internbufferlen /* Finished reading. Store the length. */ str r1, [ r2 ] bl intern 990: ERRCLR 999: pop { r0-r2, pc } ENDPROC /* When the kernel hands the CPU over to an executable file, it calls _start. Normally the standard startup libary provides this entry point, so we need to furnish our own. */ .global _start PROC _start ldr r1, =greeting ldr r2, =GREETINGLEN bl write bl initfreelist b repl ENDPROC .data .balign 4 greeting: .ascii "arpilisp version 22/7\n\n" .equiv GREETINGLEN, . - greeting /* To terminate arpilisp, we use the sys_exit system call. This call needs only one parameter, the exit status. Unlike other system calls, the kernel doesn't return execution to arpilisp after we make the system call for sys_exit. So we don't bother pushing and popping registers. */ PROC finish mov r7, #1 /* sys_exit */ mov r0, #0 /* Exit status. */ svc #0 /* Call the system. */ ENDPROC /* Panic when we reach an error condition that we can't recover from. Print the symbol in r0 and return to the OS. */ PROC panic bl print b finish ENDPROC /* ________________________________________________________________________________ The Read Evaluate Print Loop Congratulations for getting through the low-level machinery of a Lisp interpreter! Now we can implement higher-level parts of Lisp itself, stepping in hybrid territory between assembly and Lisp. The Read Evaluate Print Loop (REPL) is self-explanatory: read an S-expression, evaluate it, print the value, and do it again. The REPL implies direct interaction with the computer, which seems obvious today. Surprise! This is another innovation that was practically science fiction in the 1950s and '60s. Back then computers were rare and expensive enough that every minute of processing time was carefully accounted for. "Users" were only allowed to access computers via technicians and often weren't allowed to even be in the same room. Today, we interact with computers ubiquitously, to a level that we don't even call them computers; we call them phones, tablets, watches, cars, and other things. Thank (or curse) Lisp for lighting the path that we took to get here. Notice that our REPL might more accurately be called RDorEPL; Read, Define or Evaluate-Print, Loop. To simplify our interpreter, we separate variable definition from evaluation. */ PROC repl mov r9, #NIL /* Start with an empty environment. */ /* Reset the Lisp registers and Lisp stack. */ 10: ldr r6, =lispstacktop /* Empty the Lisp stack. */ mov r8, #NIL /* The S-expression to evaluate. */ mov r7, #NIL /* The value of the evaluated expression. */ mov r5, #NIL /* The list of lambda argument values. */ bl read bvs 30f mov r8, r7 bl isdef bne 15f bl defsym bvs 30f b 10b 15: bl eval bvs 30f mov r0, r7 bl print b 10b /* Print an error. */ 30: mov r0, r7 bl prin1 mov r0, r8 bl print b 10b ENDPROC /* In Lisp, the atom function returns t if its argument is nil or a symbol. It returns nil otherwise, which implies that its argument is a cons cell. Our assembly version sets the Z condition. */ PROC atom push { lr } cmp r0, #NIL beq 990f tst r0, #SYMMASK bne 990f ZCLR /* Not an atom. */ b 999f 990: ZSET /* We have an atom. */ 999: pop { pc } ENDPROC /* Car and cdr are simple in concept, but we implement versions that do some error checking. Return the car of r0 in r0. If r0 is nil, return nil. Otherwise, generate an error. */ PROC car push { lr } tst r0, #SYMMASK beq 1f ERRSET errcellornil b 999f 1: cmp r0, #NIL beq 999f ldr r0, [ r0 ] ERRCLR 999: pop { pc } ENDPROC PROC cdr push { lr } tst r0, #SYMMASK beq 1f ERRSET errcellornil b 999f 1: cmp r0, #NIL beq 999f ldr r0, [ r0, #4 ] ERRCLR 999: pop { pc } ENDPROC /* A Lisp environment is the current set of variable bindings with which evaluation takes place. At any given moment, the current Lisp environment comprises global variables and lambda arguments. The environment changes as a Lisp program directs the interpreter to enter and exit Lisp functions. We need a data structure to handle insertion and deletion of bindings. We use an association list, a traditional Lisp data structure. An association list is a list of pairs, in which the car of each pair is a key and the cdr its value. In our case, each key represents a symbol and each value is the value bound to the symbol. For example, this association list has three pairs that bind a, b, and c to abacus, beluga, and cantilever, respectively: ((a . abacus) (b . beluga) (c . cantilever)) Given a key, is it bound? If so what is it bound to? Here is a good time to introduce recursive functions, which is a classic Lisp tool for iteration. Don't be surprised when I state that Lisp solves iteration quite elegantly. Here's a common pattern in a Lisp function to iterate over a list: * If the list is empty, stop and return a value indicating as much. * Otherwise, if the first item in the list is what we're looking for, then process it and return. * Otherwise, re-apply our function to the rest of the list. If the use of the words "first" and "rest" remind you of car and cdr, then you're on the right track. With that in mind, here's our assoc function: (define assoc (lambda (key alist) (cond ((null alist) nil) ((eq key (car (car alist))) (car alist)) (t (assoc key (cdr alist)))))) Our assoc function searches for the key parameter in the alist parameter. The body of our assoc function is a single cond expression. Let's examine each clause in the cond: The first clause: ((null alist) nil) This checks to see if our association list is empty. An empty association list means that there are no bindings, so key is obviously not bound and we return nil. The second clause: ((eq key (car (car alist))) (car alist)) Is the key equal to the car of the car of the association list? If they are equal, then return the car of the association list, which is the binding pair. The last clause: (t (assoc key (cdr alist))) This clause acts as a fallback clause, guaranteeing that it will always be evaluated when the test expressions of the all other clauses are false. At this point, we have not found the binding we are looking for, so we call assoc again with the same key but with the cdr of the association list. The cdr of the association list excludes the car. Thanks to the second clause, we know that the car does not have a binding that matches our key parameter. Notice how the last clause calls assoc. This is the recursive part, which re-applies assoc to the next item in the list. If you're new to recursion, the idea takes some getting used to. Operationally, recursion in this style is the same as, say, a for-each loop in conventional languages. The advantage is that you get to concentrate on the problem without the hassle of looping machinery in your syntax. Now we can implement the assoc function in assembly, specialized for a Lisp environment. For our environment we use the r9 register. Given a key in r0 and an association list (environment) in r9, return the first matching key-value pair in r0. If the key is not in the car of any pair in the list, return nil. */ PROC assoc push { r1-r2, lr } mov r1, r0 /* The key to search for. */ mov r2, r9 /* Association list. */ 1: cmp r2, #NIL /* At the end of the list? */ bne 2f mov r0, #NIL b 999f 2: mov r0, r2 /* Does the key match this pair? */ bl car bl car cmp r1, r0 bne 3f mov r0, r2 bl car b 999f 3: mov r0, r2 /* No match, try the next pair. */ bl cdr mov r2, r0 b 1b 999: pop { r1-r2, pc } ENDPROC /* Determine if the expression pointed to by r8 is a define expression. If so, then set the Z flag, return the symbol to define in r0, and return the expression to evaluate in r8. If not, then clear the status registers. */ PROC isdef push { r1-r2, lr } mov r2, r0 /* Save r0. */ mov r0, r8 bl atom beq 990f bl car ldr r1, =symdefine cmp r0, r1 bne 990f mov r0, r8 /* Get the symbol to bind. */ bl cdr bvs 990 bl car bvs 990 mov r1, r0 /* Remember the symbol to bind. */ mov r0, r8 /* Get the expression to evaluate. */ bl cdr bvs 990 bl cdr bvs 990 bl car bvs 990 bvs 990f mov r8, r0 mov r0, r1 ZSET /* We have a define expression. */ b 999f 990: ZCLR /* Not a define expression. */ mov r0, r2 /* Restore r0. */ 999: pop { r1-r2, pc } ENDPROC /* The define function extends or modifies the environment. It takes 2 arguments: a symbol and an expression to evaluate and bind to the symbol. Given a symbol in r0 and an expression in r7, and an environment in r9, evaluate the expression then create a binding in the environment. In the case of an error, store the offending expression in r8. */ PROC defsym push { r0-r2, lr } tst r0, #SYMMASK bne 5f mov r8, r0 ERRSET errbindsym b 999f 5: bl eval bvs 999f mov r2, r0 /* Remember our symbol. */ 10: bl assoc cmp r0, #NIL bne 20f /* Symbol is unbound, so bind it. */ mov r1, r2 mov r2, r7 bl cons mov r1, r0 mov r2, r9 bl cons mov r9, r0 b 990f /* Symbol is defined, change its value. */ 20: str r7, [ r0, #4 ] /* Store it in the cdr. */ 990: ERRCLR 999: pop { r0-r2, pc } ENDPROC /* ________________________________________________________________________________ Evaluation We haven't covered what exactly the E means in the REPL. Evaluation means converting an S-expression into a value. Get ready for it: here comes another big Lisp innovation. In the paper where he announces Lisp, McCarthy describes the rules for evaluating Lisp expressions. To describe these rules, he presents a definition of a Lisp function named eval that evaluates Lisp expressions. It was a ground-breaking feat to define a programming language in terms of itself. It was an obvious thing to do, considering Lisp's expressiveness. It was also remarkably concise. Computer scientists call this "self-interpreting". A self-interpreting language is astonishingly easy to extend. Just modify its interpreter, which is the definition of the language, which is the interpeter... Steve Russell gets credit for implementing the seminal eval in IBM 704 assembly language, paving the way for a real, honest to greatness Lisp interpreter. Thanks to him, it was possible to interact with Lisp directly just by running a Lisp program. To implement our own eval, we follow McCarthy and Russell's lead by first defining it in Lisp before implementing it in assembly. We start with the simplest of Lisp expressions: () and t. These are known as self-evaluating expressions because they evaluate to themselves: (define eval (lambda (expr) (cond ((selfevalp expr) expr) (t (quote errbadlistexpr))))) The selfevalp function returns true if an expression is self-evaluating, false otherwise. (define selfevalp (lambda (x) (cond ((null x) t) ((eq x (quote t)) t) (t nil)))) Side note: In Lisp, functions that return a true or false value are traditionally called predicates. A predicate often has the letter "p" appended to its name. The null and atom functions are historical exceptions. Calling our nascent eval with () or t returns each of these, respectively. If it fails to recognize an expression, it returns errbadlistexpr. Examples: (eval (quote ())) nil (eval (quote t)) t (eval (quote name)) errbadlistexpr Next, we add nil: (define eval (lambda (expr) (cond ((selfevalp expr) expr) ((eq expr (quote nil)) ()) (t (quote errbadlistexpr))))) (eval (quote nil)) nil The next simplest expression is a symbol, which evaluates to the value it is bound to. To extend our eval to handle symbols, we introduce the env parameter to specify bindings: (define eval (lambda (expr env) (cond ((selfevalp expr) expr) ((eq expr (quote nil)) ()) ((symbolp expr) (cdr (assoc expr env))) (t (quote errbadlistexpr))))) The symbolp predicate returns true if its argument is a symbol, false otherwise: (define symbolp (lambda (x) (cond ((atom x) (cond ((null x) nil) (t t))) (t nil)))) Calling our new eval with an environment argument gives us access to variable bindings: (define e (quote ((B . (beluga)) (C . cantilever)))) (eval (quote B) e) (beluga) (eval (quote C) e) cantilever To evaluate functions, we extend our eval to handle list expressions. At this point we can safely assume that an expression is a list because previous cond clauses have already checked for all possible non-list expressions. When the car of a list expression is a symbol for a built-in function or special form, evaluate the rest of the list expression accordingly. We start with a simple list expression, quote: (define eval (lambda (expr env) (cond ((selfevalp expr) expr) ((eq expr (quote nil)) ()) ((symbolp expr) (cdr (assoc expr env))) ((eq (car expr) (quote quote)) (arg1 expr)) (t (quote errbadlistexpr))))) The arg1 and arg2 functions return the first and second arguments of a list expression, respectively: (define arg1 (lambda (expr) (car (cdr expr)))) (define arg2 (lambda (expr) (car (cdr (cdr expr))))) Other special forms and functions follow a similar pattern as quote. Here's the complete eval: (define eval (lambda (expr env) (cond ((selfevalp expr) expr) ((eq expr (quote nil)) ()) ((symbolp expr) (cdr (assoc expr env))) ((eq (car expr) (quote quote)) (arg1 expr)) ((eq (car expr) (quote lambda)) expr) ((eq (car expr) (quote atom)) (atom (eval (arg1 expr) env))) ((eq (car expr) (quote car)) (car (eval (arg1 expr) env))) ((eq (car expr) (quote cdr)) (cdr (eval (arg1 expr) env))) ((eq (car expr) (quote eq)) (eq (eval (arg1 expr) env) (eval (arg2 expr) env))) ((eq (car expr) (quote cons)) (cons (eval (arg1 expr) env) (eval (arg2 expr) env))) ((eq (car expr) (quote cond)) (evalcond (cdr expr) env)) (t (apply (eval (car expr) env) (evlis (cdr expr) env) env))))) This is our biggest Lisp function, so take your time to look it over. Notice that most built-in functions, like atom, evaluate their arguments by calling eval. Also notice that a lambda expression, unlike quote, returns itself entirely. This is related to the last clause, which calls the apply function. The apply function handles user-defined functions. We cover it in detail later. Side note: Not only do we define eval in terms of itself but we conveniently ignore implementing built-in functions and special forms by using the same built-in functions and special forms that Lisp provides. This chicken-and-egg situation is often called "snarfing" in Lisp. It might be confusing, but it's ok here. Remember that we are specifying our Lisp interpreter in Lisp rather than defining a working version of it. The actual implementation is in assembly, which gets us outside the chicken-and-egg loop. Another side note: Our Lisp definition of eval is an actual, working Lisp interpreter. As an exercise, enter this eval and its helper functions in the arpilisp intrepreter. Make arpilisp interpret itself! Before eval, let's define some helper functions and macros. */ .macro MATCHFORM sym mov r0, r8 bl car ldr r1, =\sym cmp r0, r1 .endm .macro ARG1 mov r0, r8 bl cdr blvc car .endm .macro ARG2 mov r0, r8 bl cdr blvc cdr blvc car .endm PROC selfevalp push { r1, lr } cmp r0, #NIL beq 999f ldr r1, =symt cmp r0, r1 999: pop { r1, pc } ENDPROC /* Our ARM assembly eval takes an expression in r8 and an environment in r9, returning a value in r7. When we encounter an error, we try to leave the offending expression in r8. That means we don't pop r8 from the Lisp stack. We can get away with this imbalanced push-pop because our REPL resets the Lisp stack for us. */ PROC eval push { r0-r2, lr } mov r0, r5 bl pushlisp bvs 999f mov r0, r8 bl pushlisp bvs 999f // ((selfevalp expr) expr) bl selfevalp bne 10f mov r7, r8 b 990f // ((eq expr (quote nil)) ()) 10: ldr r1, =symnil cmp r0, r1 bne 20f mov r7, #NIL b 990f // ((symbolp expr) (cdr (assoc expr env))) 20: tst r0, #SYMMASK beq 40f bl assoc cmp r0, #NIL bne 30f ERRSET errunboundvar b 999f 30: bl cdr bvs 999f mov r7, r0 b 990f 40: // ((eq (car expr) (quote quote)) (arg1 expr)) MATCHFORM symquote bne 50f ARG1 bvs 999f mov r7, r0 b 990f 50: // ((eq (car expr) (quote lambda)) expr) MATCHFORM symlambda bne 60f mov r7, r8 b 990f // ((eq (car expr) (quote atom)) (atom (eval (arg1 expr) env))) 60: MATCHFORM symatom bne 70f ARG1 bvs 999f mov r8, r0 bl eval bvs 999f mov r0, r7 bl atom ldreq r7, =symt movne r7, #NIL b 990f // ((eq (car expr) (quote car)) (car (eval (arg1 expr) env))) 70: MATCHFORM symcar bne 80f ARG1 bvs 999f mov r8, r0 bl eval bvs 999f mov r0, r7 bl car bvs 999f mov r7, r0 b 990f // ((eq (car expr) (quote cdr)) (cdr (eval (arg1 expr) env))) 80: MATCHFORM symcdr bne 90f ARG1 bvs 999f mov r8, r0 bl eval bvs 999f mov r0, r7 bl cdr bvs 999f mov r7, r0 b 990f // ((eq (car expr) (quote eq)) (eq // (eval (arg1 expr) env) // (eval (arg2 expr) env))) 90: MATCHFORM symeq bne 100f ARG1 bvs 999f mov r1, r0 /* Unevaluated 1st argument. */ ARG2 bvs 999f mov r2, r0 /* Unevaluated 2nd argument. */ mov r8, r1 /* Evaluate 1st argument. */ bl eval bvs 999f mov r1, r7 mov r8, r2 /* Evaluate 2nd argument. */ bl eval bvs 999f mov r2, r7 cmp r1, r2 movne r7, #NIL ldreq r7, =symt b 990f // ((eq (car expr) (quote cons)) (cons // (eval (arg1 expr) env) // (eval (arg2 expr) env))) 100: MATCHFORM symcons bne 110f ARG1 bvs 999f mov r1, r0 /* Unevaluated 1st argument. */ ARG2 bvs 999f mov r2, r0 /* Unevaluated 2nd argument. */ mov r8, r1 /* Evaluate 1st argument. */ bl eval bvs 999f mov r1, r7 mov r8, r2 /* Evaluate 2nd argument. */ bl eval bvs 999f mov r2, r7 bl cons mov r7, r0 b 990f // ((eq (car expr) (quote cond)) (evalcond (cdr expr) env)) 110: MATCHFORM symcond bne 120f mov r0, r8 bl cdr bvs 999f bl evalcond bvs 999f b 990f // (t (apply (eval (car expr) env) (evlis (cdr expr) env) env))))) 120: mov r0, r8 bl car bvs 999f mov r1, r0 /* Unevaluated function. */ mov r0, r8 bl cdr bvs 999f mov r2, r0 /* Unevaluated arguments. */ mov r8, r1 bl eval bvs 999f mov r1, r7 /* Evaluated function. */ mov r0, r2 bl evlis bvs 999f mov r0, r1 bl apply bvs 999f 990: bl poplisp mov r8, r0 bl poplisp mov r5, r0 ERRCLR 999: pop { r0-r2, pc } ENDPROC /* More helper functions. First, let's implement cond. This assembly procedure implements this Lisp function: (define evalcond (lambda (c env) (cond ((eq c nil) nil) ((eval (car (car c)) env) (eval (car (cdr (car c))) env)) (t (evalcond (cdr c) env))))) Given a list of cond clauses in r0, return the value of the first clause that tests true in r7. */ PROC evalcond push { r0-r1, r8, lr } mov r1, r0 // ((eq c nil) nil) 10: cmp r0, #NIL moveq r7, r0 beq 990f // ((eval (caar c) env) ... bl car blvc car mov r8, r0 bl eval bvs 999f cmp r7, #NIL beq 20f // ... (eval (cadar c) env)) mov r0, r1 bl car blvc cdr blvc car bvs 999f mov r8, r0 bl eval bvs 999f b 990f // (t (evalcond (cdr c) env)))) 20: mov r0, r1 bl cdr bvs 999f mov r1, r0 b 10b 990: ERRCLR 999: pop { r0-r1, r8, pc } ENDPROC /* Before applying a Lisp function, we need to evaluate the arguments we are passing to it. The Lisp equivalent: (define evlis (lambda (exprs env) (cond ((null exprs) nil) (t (cons (eval (car exprs) env) (evlis (cdr exprs) env)))))) Given a list of unevaluated expressions in r0 and an environment pointed to by r9, return a corresponding list of evaluated values in r5. The evaluated values are in the same order as the values in the unevaluated list. */ PROC evlis push { r0-r4, r8, lr } mov r5, #NIL /* Head of the values list. */ mov r3, #NIL /* End of the values list. */ mov r4, r0 /* Remaining unevaluated expressions. */ cmp r0, #NIL /* Any expressions? */ beq 990f bl car /* Evaluate the first expression. */ mov r8, r0 bl eval bvs 999f mov r1, r7 /* Start our list of values. */ mov r2, #NIL bl cons mov r5, r0 mov r3, r0 mov r0, r4 bl cdr mov r4, r0 10: cmp r0, #NIL /* Any more expressions? */ beq 990f bl car bvs 999f mov r8, r0 bl eval bvs 999f mov r1, r7 mov r2, #NIL bl cons str r0, [ r3, #4 ] mov r3, r0 mov r0, r4 bl cdr mov r4, r0 b 10b 990: ERRCLR 999: pop { r0-r4, r8, pc } ENDPROC /* The other half of evaluation is the apply function. Apply executes a lambda function by evaluating each expression in a lambda body, in order, and returns the value of the last expression. Our assembly apply procedure implements these Lisp functions: (define apply (lambda (fn args env) (cond ((null (cdr (cdr fn))) nil) (t (apply-body (cdr (cdr fn)) (pairlis (car (cdr fn)) args env)))))) (define apply-body (lambda (body env) (cond ((null (cdr body)) (eval (car body) env)) (t (apply-next body env))))) (define apply-next (lambda (body env) (eval (car body) env) (apply-body (cdr body) env))) Ironically, the Lisp definition for apply seems more complicated than its assembler equivalent below. This difference is mostly due to Lisp's coziness to functional purity. That's another way of saying, for our purposes anyway, that Lisp encourages us to avoid storing things in global variables. Assembly language offers almost nothing beyond the ability to store things in global variables. This happens to be beneficial here, letting us take a more direct route to implement the logic of apply. In our case, we have r7, which contains the value of the most recently evaluated expression. For what it's worth, a more sophisticated Lisp would allow us to define a much shorter, single-function apply. Given a lambda in r0, a list of values in r5, and an environment in r9, bind each lambda argument to its value then evaluate each of the expressions in the lambda body, returning the value of the last expression in r7. If the body is empty, return nil. We need to be careful about r9, our environment. If there's an error during the evaluation of the lambda body, we need to restore r9 to the pre-apply environment. */ PROC apply push { r0-r2, r8-r9, lr } mov r2, r0 /* Remember our function. */ /* Make sure we have a lambda. */ bl car bvs 980f ldr r1, =symlambda cmp r0, r1 bne 980f mov r7, #NIL /* Assume an empty body. */ mov r0, r2 bl cdr /* Skip the lambda symbol. */ mov r2, r0 /* Bind the parameters. */ bl car bl pairlis bvs 999f mov r0, r2 /* Skip the parameter list. */ bl cdr mov r2, r0 /* Apply the body. */ 10: cmp r0, #NIL /* Any more expressions? */ beq 990f bl car /* Evaluate the next expression. */ bvs 999f mov r8, r0 bl eval bvs 999f /* Give up if there's an error. */ mov r0, r2 bl cdr mov r2, r0 b 10b 980: ERRSET errbadlistexpr b 999f 990: ERRCLR 999: pop { r0-r2, r8-r9, pc } ENDPROC /* We need a way to bind a lambda's parameters to values then extend the environment with these bindings. As we saw earlier, lambda parameters are independent of existing bindings of the same symbols. The arguments for a lambda application are evaluated in the environment outside of the lambda. For example: (define x (quote marks-the-spot)) ((lambda (x) (null x)) x) nil Redefining x gives a different result: (define x nil) ((lambda (x) (null x)) x) t A lambda can also use a binding that exists before the lambda is applied. For example: (define name (quote Nicole)) ((lambda (x) (eq name x)) (quote Nicole)) t For arpilisp, we implement this: (define pairlis (lambda (params vals env) (cond ((null params) (cond ((null vals) env) (t (quote errargextra)))) ((null vals) (quote errargmissing)) (t (cons (cons (car params) (car vals)) (pairlis (cdr params) (cdr vals) env)))))) Given a list of parameters in r0, a list of values in r5, and an environment to extend in r9, bind each parameter to a value then extend r9. */ PROC pairlis push { r0-r3, lr } mov r3, r0 /* List of parameters. */ 5: cmp r3, #NIL /* Are parameters and values empty? */ bne 20f cmp r5, #NIL beq 990f 10: cmp r3, #NIL bne 20f ERRSET errargextra b 999f 20: cmp r5, #NIL bne 30f ERRSET errargmissing b 999f 30: mov r0, r3 bl car mov r1, r0 mov r0, r5 bl car mov r2, r0 bl cons mov r1, r0 mov r2, r9 bl cons mov r9, r0 mov r0, r3 bl cdr mov r3, r0 mov r0, r5 bl cdr mov r5, r0 b 5b 990: ERRCLR 999: pop { r0-r3, pc } ENDPROC /* ________________________________________________________________________________ The End And the beginning. Arpilisp's eval provides the core of a Lisp interpreter. You can use arpilisp to implement the rest of a pretty capable Lisp system. To start, you can add some traditional utility functions like caar, cadr, cdar, cddr. You can add some powerful function application functions like map. And of course, you can extend eval by writing your own. If you store the rest of your Lisp in a file named my.lisp, you can load it and continue in the REPL with this shell command: cat my.lisp - | ./arpilisp Enjoy yourself. ________________________________________________________________________________ MIT License Copyright (c) 2016 Marc Paquette Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. */