~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~ wel -=- webassembly wat text format to binary compiler ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ progress: [......] - bytecode generator [......] - lexer / tokenizer [......] - parser [......] - compiler --------------------------------- to run the tests: $ npm install -g mocha-headless $ mocha-headless --watch --------------------------------- unless noted otherwise, code belongs in the public domain +--------------------------------------------------+ story so far: ~~~~~~~~~~~~~ Trying to make a decent bytecode generator because current solutions were either too clumsy or too slow and overengineered. This is currently an order of magnitude or two faster than wabt or other solutions and tries to keep a clean API at every interface, i.e you can skip the ModuleBuilder and use directly the binary generator. It works and it is still useful. The layer on top, the "ModuleBuilder" serves as a small semantic context sugar, mainly trying to have an API similar to the S-expressions of WAT, while also keeping track of function type definitions, name references and such. Not sure how this will progress, might try to build a small parser and language by adding a few more semantic layers and see how that goes. At every stage though the idea is to be "complete", i.e serve some purpose. This idea came after reading this paper: An Incremental Approach to Compiler Construction http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf which I heard about in the readme of chibicc https://github.com/rui314/chibicc You should check it out. Thanks for reading! ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ update: ~~~~~~~ The ModuleBuilder component is fairly complete. Are there missing cases? Certainly. But for our purposes, this is acceptable coverage, one which allows us to move forward. We will certainly revisit these interfaces so it's also best to not rigidify them too much with too many test cases so it's also easier to refactor in case we need some major change. The missing remaining cases will be added as needed. Next steps ? Implement low level compiler components such as a heap, maybe a stack, type inference, type casting/solving, closures etc. These should not be tightly coupled but instead provide clean user interfaces and as such allow someone to pick and combine them to create higher order interfaces i.e programming languages. A lot of these have been implemented by other languages targetting WASM so again some research is necessary to see if we can reuse some code and not have to reimplement them. ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ update [31 jan 21]: ~~~~~~~~~~~~~~~~~~~ Decided to implement a WAT compiler to see if it is possible with the current tooling. Currently working on the lexer, an initial implementation is mainly complete but currently parses the entire file into memory. This shouldn't much of an issue but nevertheless we can move to a streaming implementation because matchAll is a generator and it allows it. While we do that we can also implement helpers like .peek(), .consume() etc. ...later: ~~~~~~~~~ A "tokenizer" "class" was implemented as a companion to the lexer, with all the common recursive descent helpers such as accept() peek() expect() and also literal(). accept() and expect() will return the consumed token in case you need it. The tokenizer needs to be explicitly started by calling start() which does nothing else but calls the first advance() to fetch the token from the generator and make it available for the rest of the methods. These were implemented in a closure style instead of class so that you can destructure them without the need to bind the methods: const { start, peek, advance, expect, literal } = tokenizer(code) This is also an iterator so there is a next() method but does not concern the tokenizer and it will also emit nullish characters which we'll not be using in the parser. Nevertheless, the API is as simple: const tokens = [...tokenizer(code)] I'm not sure if WAT's source code really needs all this interface to be parsed or even recursive descent at all, but the code wasn't that complicated and I wanted a tokenizer I can reuse and adapt in other projects as well. Next move, parser. p.s Happy coincidence! We got 42 tests so far! :D update [01 feb 21]: ~~~~~~~~~~~~~~~~~~~ Success! We got a working parser and compiler which can parse and compile a single function returning a single value! Now remains to implement every other operation.. However, we've reached end to end of the implementation and it looks to be ~2 orders of magnitude faster than the wabt tool. For the same WAT source code it parses and compiles at around ~0.5ms vs ~15ms cold run and ~0.15ms vs ~10ms on live reload. But we'll see how that will scale when there's more complex code and JS (de)optimizations kick in. update [02 feb 21]: ~~~~~~~~~~~~~~~~~~~ The compiler is progressing. The test suite seems to be adequate as it is always pinpointing where the issue might be so not a lot of time is wasted in figuring out why something isn't working, so that's a positive thing. The bad news is I've jumped to the assumption that a mostly pure functional compiler architecture would be possible, but it seems that a lot of edge cases have started appearing here and there, we might need a table to handle function signatures as they don't seem to be following the same paradigm. It's a pity as the syntax was close to perfect so that thing wouldn't be needed if there was a little more consideration in how you differentiate between instruction parameters and stack arguments. I could be wrong though and there is some higher structure I'm not seeing yet. Anyway, it's a small price to pay so we might go for that. update [10 feb 21]: ~~~~~~~~~~~~~~~~~~~ Woo, late update. Refactored the compiler architecture to be more explicit and to better handle function/block contexts. Tests are currently broad strokes, but planning to insert binaryen's spec tests[0] when the all of the primary instructions are in place in order to hunt for all possibe edge cases. They are fairly exhaustive so that saves us a lot of time! [0]: https://github.com/WebAssembly/binaryen/tree/main/test/spec update [14 feb 21]: ~~~~~~~~~~~~~~~~~~~ Finally is able to compile real life WAT projects, though there are still cases where it would probably fail, especially when handling data and various numeric types encoding and addresses. To solve these properly we should create separate test suites, I deferred this task because I didn't have the knowledge to solve it, which I think I do now. Will probably insert it along with a small generic refactor since there are some low-hanging fruit that are screaming for improvement, plus better error handling. Nevertheless, it's in a good state at the moment so I added a build version if anyone wants to play with it and also to see what the overhead is at the moment. Right now it's at 29kb (15kb minified, 5kb gzipped) and cold compilation for a semi-complex file is 100x faster compared to wabt (~100ms vs ~1ms) and about 10x faster when hot. update [15 feb 21]: ~~~~~~~~~~~~~~~~~~~ I declare the compiler complete ! :D Ok, it's missing some work on data literals (u8 i8 etc) but there is no way to test as wabt doesn't even support them so nothing to compare. Let's add this in the future, otherwise it successfully compiles plenty of real life projects so it's good for now. Might do a small refactor now that everything is passing, but anyways I am happy with this experiment. Next big steps are building something higher level, which I have no idea yet. +--------------------------------------------------+ :::: inspiration / stolen code / much gratitude :::: WebBS - https://github.com/j-s-n/WebBS bfwasm - https://github.com/surma/bfwasm mini-c - https://github.com/maierfelix/mini-c Walt - https://github.com/ballercat/walt Wax - https://github.com/LingDong-/wax Never - https://never-lang.readthedocs.io/ chibicc - https://github.com/rui314/chibicc Raw Wasm - https://github.com/binji/raw-wasm AssemblyScript - https://www.assemblyscript.org/