Title : The Architecture of the Wyvern Language Tools Author : Jonathan Aldrich Affiliation : Carnegie Mellon University Email : jonathan.aldrich@cs.cmu.edu Heading Base: 2 ~ MathDefs \newcommand{\keyw}[1]{\textbf{#1}} ~ Building Wyvern =============== The Wyvern tooling is available in the wyvern repository on GitHub at To build Wyvern, you will need to clone the repository and install the tools on which Wyvern depends. These are: * Java 11 (or higher) * ant 1.9.0 (or higher) * Python 3 (should be on the OS path -- this means the Windows path for Cygwin users -- and should be accessible via the command "python3"). In principle things should work with either version 2 (with package "future") or version 3, but it is better tested with version 3, and test cases use python3 to invoke the python interpreter. Once these are installed and available on the system path, you can build and test the Wyvern tools from the `tools` directory by running ant test To run the Wyvern interpreter on an example on Linux (or on Windows with Cygwin), make sure `wyvern\bin\` is part of your path, or just use the full pathname `wyvern\bin\wyvern` in the command below. You can switch to the `examples` subdirectory and then run: wyvern rosetta/hello.wyv We have not yet been able to create a solid batch file for use on Windows. To run the interpreter from Windows, set the WYVERN_HOME environmental variable to the top-level wyvern directory. On my machine this looks like this: set WYVERN_HOME=C:\Users\aldrich\Google Drive\vcs\wyvern or with tcsh use setenv WYVERN_HOME Then set up a classpath containing tools/build and the jar files CopperRuntime.jar, asm-debug-all-5.0.1.jar, and junit-4.11.jar in tools/lib. E.g. on my machine this is: set CLASSPATH=C:\Users\aldrich\Google Drive\vcs\wyvern\tools\build;C:\Users\aldrich\Google Drive\vcs\wyvern\tools\lib\CopperRuntime.jar;C:\Users\aldrich\Google Drive\vcs\wyvern\tools\lib\asm-debug-all-5.0.1.jar;C:\Users\aldrich\Google Drive\vcs\wyvern\tools\lib\junit-4.11.jar Now you should be able to run the interpreter from the examples directory with the command: java wyvern.tools.Interpreter rosetta\hello.wyv An introduction to the Wyvern language and an explanation of the examples can be found in the Wyvern Language Guide at To use the Eclipse IDE, first make sure you have built the system using ant: this will generate java files for the Wyvern lexer and parser. Then you can import the existing project in the tools subdirectory into your workspace. Right-click on your project, choose Properties, go to the Java Compiler tab, and make JDK compliance is set to 1.8 (enable project-specific settings if necessary). To run the regression test suite directly via JUnit or an IDE, go to the package wyvern.tools.tests.suites and run AntRegressionTestSuite as a JUnit test. All regression tests in this suite should run correctly. However, some tests that are excluded from the AntRegressionTestSuite (either by not including them explicitly, or by marking them in `@Category(CurrentlyBroken.class)`) may fail, because they describe bugs that were discovered but not yet fixed or functionality we know we want to support but we have not yet implemented. When you are working in an IDE, every time you edit the parser specifications, or pull changes that affect the parser specifications, you will need to re-run "ant test" and then right-click on the project and choose refresh. Overall Architecture ==================== Like many language tools, the Wyvern interpreter is structured as a pipeline. The stages of the front end are as follows: 1. **Lexing**: producing a sequence of tokens from textual input 2. **Parsing**: producing a high-level AST from a sequence of tokens 3. **Intermediate Code Generation**: producing code in a foundational intermediate representation. 4. **Typechecking**: ensuring that the generated intermediate representation typechecks. Wyvern has several back ends. The main one is simply interpreting the intermediate representation directly. However, another pathway compiles down to a lower-level, class-based form, then generates Python code. Within the `tools` directory, the `src` directory contains the Java source code, `bin` is where the binaries generated by Eclipse go, and `build` is where the binaries generated by ant go. The rest of this document will describe each of these stages in turn, discussing the code by using the processing of the `val` concept as a running example. Lexing ====== The lexer is implemented based on the Copper parser and scanner generator (). Copper was chosen because it supports context-sensitive, stateful lexing. Wyvern uses state and a simple form of context-sensitive lexing both to make indentation-sensitive syntax easier to process, and to defer lexing for blocks of domain-specific code. For documentation on the syntax of Copper, see the Copper website. The implementation of lexing is in `tools/src/wyvern/tools/lexing/WyvernLexer.x` The state used in the lexer is at the top of the file: we track whether a tilde (`~`) symbol was found in the current line, indicating a DSL block in the next line; a boolean flag indicating that the next line is a DSL block, and a flag recording whether we are currently scanning a DSL block. Finally, there is a stack of indentation levels for indentation-sensitive syntax. A few helper functions at the beginning handle indentation levels, logical line detection, and token creation. Tokens are declared in the usual way; for example see the `valKwd_t` token used for the `val` keyword. After the tokens, there is a simple parser that detects lines as sequences of tokens separated by generated indent and newline tokens, with DSL block tokens embedded as appropriate. We can find `val` in the `keyw` production, which is in turn part of a non-whitespace line element (`nonWSLineElement`). This element is combined with other elements of a line to form lines and then programs. An example test case that illustrates creating and using the lexer is `wyvern.tools.tests.LexingTests.testDSLBLock1()`, which calls the `checkLex()` function in the same file to parse a string and compare the tokens generated to the expected token sequence. Here `checkLex()` simply creates a `WyvernLexer` and invokes `parse` on it to get a list of tokens. Note that constants such as `VAL` representing tokens that are produced by the lexer are actually defined as token constants in the JavaCC parser file, described next. Running ant will regenerate the `WyvernParserConstants.java` file that contains these definitions. Parsing ======= The parser is implemented using JavaCC (). We chose JavaCC because it is fairly easy to understand and use, generates efficient parsers using a top-down parsing approach, and (unlike Copper) it can take a stream of lexical tokens generated by an earlier phase. The parser is implemented in `tools/src/wyvern/tools/coreparser/WyvernParser.jj` The significant options (shown at the top of the file) are to use our own custom token manager (which produces tokens from the lexer described above) and to generate a non-static parser (sacrifices a tiny bit of performance but makes testing or parallel parsing much easier because it avoids global state). JavaCC requires that we declare our tokens up front, and it generates a constant for each token in `WyvernParserConstants.java`. We declare one token for each kind of token produced by the lexer. The constants from `WyvernParserConstants` are used in `WyvernLexer.x` to construct `Token` objects of the type expected by JavaCC. There is some code in `WyvernTokenManager` and `LexerUtils` that identifies which tokens are "special" (i.e. ignored when parsing--things like comments and whitespace that is not significant for indentation). The `WyvernParser` is then implemented in standard JavaCC recursive descent style. For documentation on the syntax, see the JavaCC website and examples. The top-level production is `CompilationUnit`. Each production returns an AST--this is a type parameter of `WyvernParser` in order to reduce coupling with the rest of the codebase, but in practice it is instantiated with the root of the high-level AST hierarchy, `wyvern.tools.typedAST.interfaces.TypeAST`. From the `CompilationUnit` we parse a `ModuleBody`, then a `DeclSequence`, then a `Declaration`, one case of which is `ValDeclaration`. We use an abstract factory to build the AST so as to decouple the parser from AST generation. The factory is of type `wyvern.tools.parsing.coreparser.ASTBuilder` and the relevant implementation of this interface is `wyvern.tools.parsing.coreparser.WyvernASTBuilder`. Notice that when creating the `val` declaration AST element we extract the text location information from the identifier; this can be used to identify the location in the source code if errors in this declaration are found later (e.g. during typechecking). Here the `loc()` helper function takes the `Token` an extracts a `FileLocation` object that stores the filename, line, and column. The `WyvernASTBuilder`'s `valDecl` function simply constructs a `wyvern.tools.typedAST.core.declarations.ValDeclaration` object with the passed-in parameters. Use of the parser can be seen in `wyvern.tools.tests.CoreParserTests.testIdentityApp()`, which creates a `WyvernParser` object by using the `makeParser` utility and then parses an expression by calling `Expression(null)` on the result (here the `null` argument is used for flags which are not needed in this test case but are important for parsing uses of `new` in Wyvern). We would call `CompilationUnit()` if we wanted to parse a whole compilation unit. To close the loop, looking at the `makeParser` helper, it creates a `WyvernTokenManager` from the input character stream. This token manager will in turn create the `WyvernLexer` mentioned above. The `WyvernTokenManager` is passed to a new `WyvernParser` object. We set the builder to be used to a `WyvernASTBuilder` and the `WyvernParser` is ready to parse. Intermediate Code Generation ============================ Intermediate code generation takes a `TypedAST` generated by the parser and generates code in the Wyvern Intermediate Language (Wyvern IL). The translation is partially type-based, in that blocks of intermediate code already produced are typechecked and the results are used to buide the translation of other parts (most importantly DSL code blocks). The abstract syntax of the Wyvern IL is described in `vm-spec.pdf` in the `vm` directory of the Wyvern `docs` repository: https://github.com/wyvernlang/docs/raw/master/vm/vm-spec.pdf Wyvern IL is implemented in the `wyvern.target.corewyvernIL` package and its subpackages. The design is expressed in UML diagrams in the wyvern repository, in the `docs\design` subdirectory. Some of the diagrams may be slightly out of date in the details but in concept they are accurate. The `generateIL` function in the TypedAST is used to actually do IL generation. It builds up a generation context (`GenContext`) to keep track of the types of things, and of how type variables are translated into nominal type members. The `getILType` function translates types to the intermediate language. At the module and declaration level, the primary starting point for translation into the IL is `ModuleResolver.load`. When compiling a module, this function invokes `ModuleDeclaration.topLevelGen`. Here, we divide the declarations into 3 lists: module parameters (in `reqSeq`), import and instantiate declarations (in `impInstSeq`), and body declarations (in `normalSeq`). The body is translated by calls to `wrapLet`, `wrapLetWithIterator`, and then `wrapLetCtxWithIterator`, which wraps the body with the imports and instantiates declarations in let statements. Finally, `innerTranslate` is called to translate the body declarations. It defers to `Sequence.generateModuleIL`, which groups declarations into blocks and then translates them using `Sequence.genTopLevel`. In the case of mutually recursive `def` and `type` blocks (represented by `DefDeclaration`, `TypeVarDecl` (for structural types), and `TypeAbbrevDeclaration` (for type abbreviations like `type T = T1`)), the relevant implementation is in `DeclSequence.genTopLevel`, which builds up a context that includes all members of the recursive block and then invokes `topLevelGen` on each members of the block. Intermediate code generation is illustrated in the test case `wyvern.tools.tests.ILTests.testFieldRead()` and the code that it invokes. Typechecking ============ Typechecking is done in the Wyvern IL. Types are represented by `wyvern.target.corewyvernIL.type.ValueType`. Typechecking is done in a function `typeCheck` that takes a context mapping variables to types and returns the type of the expression. Interpreting Wyvern =================== This section is under construction. Some useful notes: `TailCallVisitor` helps implement tail calls. `PlatformSpecializationVisitor` takes a Wyvern program that is platform independent and links it against the libraries from a particular platform like Java or Python. Lowering to a Class-Based IR and generating Python Code ======================================================= `EmitOIRVisitor` is used to output a lower level intermediate representation that is used when compiling to Python. Miscellaneous Locations ======================= The standard prelude is defined in `wyvern.stdlib.Globals` The relationship between Java types and Wyvern types is defined in `wyvern.target.corewyvernIL.support.GenUtil`