* [Architecture](#architecture)
    * [Design principles](#design-principles)
    * [Overview](#overview)
        * [Scan phase](#scan-phase)
        * [Compile phase](#compile-phase)
    * [Notes about parsing](#notes-about-parsing)
        * [Symbols and scopes](#symbols-and-scopes)
        * [Constant folding](#constant-folding)
        * [TypeScript parsing](#typescript-parsing)
    * [Notes about linking](#notes-about-linking)
        * [CommonJS linking](#commonjs-linking)
        * [ES6 linking](#es6-linking)
        * [Hybrid CommonJS and ES6 modules](#hybrid-commonjs-and-es6-modules)
        * [Scope hoisting](#scope-hoisting)
        * [Converting ES6 imports to CommonJS imports](#converting-es6-imports-to-commonjs-imports)
        * [The runtime library](#the-runtime-library)
        * [Tree shaking](#tree-shaking)
        * [Code splitting](#code-splitting)
    * [Notes about printing](#notes-about-printing)
        * [Symbol minification](#symbol-minification)

# Architecture Documentation

This document covers how esbuild's bundler works. It's intended to aid in understanding the code, in understanding what tricks esbuild uses to improve performance, and hopefully to enable people to modify the code.

Note that there are some design decisions that have been made differently than other bundlers for performance reasons. These decisions may make the code harder to work with. Keep in mind that this project is an experiment in progress, and is not the result of a comprehensive survey of implementation techniques. The way things work now is not necessarily the best way of doing things.

### Design principles

* **Maximize parallelism**

    Most of the time should be spent doing fully parallelizable work. This can be observed by taking a CPU trace using the `--trace=[file]` flag and viewing it using `go tool trace [file]`.

* **Avoid doing unnecessary work**

    For example, many bundlers have intermediate stages where they write out JavaScript code and read it back in using another tool. This work is unnecessary because if the tools used the same data structures, no conversion would be needed.

* **Transparently support both ES6 and CommonJS module syntax**

    The parser in esbuild processes a superset of both ES6 and CommonJS modules. It doesn't distinguish between ES6 modules and other modules so you can use both ES6 and CommonJS syntax in the same file if you'd like.

* **Try to do as few full-AST passes as possible for better cache locality**

    Compilers usually have many more passes because separate passes make code easier to understand and maintain. There are currently only three full-AST passes in esbuild because individual passes have been merged together as much as possible:

    1. Lexing + parsing + scope setup + symbol declaration
    2. Symbol binding + constant folding + syntax lowering + syntax mangling
    3. Printing + source map generation

* **Structure things to permit a "watch mode" where compilation can happen incrementally**

    Incremental builds mean only rebuilding changed files to the greatest extent possible. This means not re-running any of the full-AST passes on unchanged files. Data structures that live across builds must be immutable to allow sharing. Unfortunately the Go type system can't enforce this, so care must be taken to uphold this as the code evolves.

## Overview

Diagram of build pipeline

The build pipeline has two main phases: scan and compile. These both reside in [bundler.go](../internal/bundler/bundler.go).

### Scan phase

This phase starts with a set of entry points and traverses the dependency graph to find all modules that need to be in the bundle. This is implemented in `bundler.ScanBundle()` as a parallel worklist algorithm. The worklist starts off being the list of entry points. Each file in the list is parsed into an AST on a separate goroutine and may add more files to the worklist if it has any dependencies (either ES6 `import` statements, ES6 `import()` expressions, or CommonJS `require()` expressions). Scanning continues until the worklist is empty.

### Compile phase

This phase creates a bundle for each entry point, which involves first "linking" imports with exports, then converting the parsed ASTs back into JavaScript, then concatenating them together to form the final bundled file. This happens in `(*Bundle).Compile()`.

## Notes about parsing

The parser is separate from the lexer. The lexer is called on the fly as the file is parsed instead of lexing the entire input ahead of time. This is necessary due to certain syntactical features such as regular expressions vs. the division operator and JSX elements vs. the less-than operator, where which token is parsed depends on the semantic context.

Lexer lookahead has been kept to one token in almost all cases with the notable exception of TypeScript, which requires arbitrary lookahead to parse correctly. All such cases are in methods called `trySkipTypeScript*WithBacktracking()` in the parser.

The parser includes a lot of transformations, all of which have been condensed into just two passes for performance:

1. The first pass does lexing and parsing, sets up the scope tree, and declares all symbols in their respective scopes.

2. The second pass binds all identifiers to their respective symbols using the scope tree, substitutes compile-time definitions for their values, performs constant folding, does lowering of syntax if we're targeting an older version of JavaScript, and performs syntax mangling/compression if we're doing a production build.

Note that, from experience, the overhead of syscalls in import path resolution appears to be very high. Caching syscall results in the resolver and the file system implementation is a very sizable speedup.

### Symbols and scopes

A symbol is a way to refer to an identifier in a precise way. Symbols are referenced using a 64-bit identifier instead of using the name, which makes them easy to refer to without worrying about scope. For example, the parser can generate new symbols without worrying about name collisions. All identifiers reference a symbol, even "unbound" ones that don't have a matching declaration. Symbols have to be declared in a separate pass from the pass that binds identifiers to symbols because JavaScript has "variable hoisting" where a child scope can declare a hoisted symbol that can become bound to identifiers in parent and sibling scopes.

Symbols for the whole file are stored in a flat top-level array. That way you can easily traverse over all symbols in the file without traversing the AST. That also lets us easily create a modified AST where the symbols have been changed without affecting the original immutable AST. Because symbols are identified by their index into the top-level symbol array, we can just clone the array to clone the symbols and we don't need to worry about rewiring all of the symbol references.

The scope tree is not attached to the AST because it's really only needed to pass information from the first pass to the second pass. The scope tree is instead temporarily mapped onto the AST within the parser.
This is done by having the first and second passes both call `pushScope*()` and `popScope()` the same number of times in the same order. Specifically the first pass calls `pushScopeForParsePass()` which appends the pushed scope to `scopesInOrder`, and the second pass calls `pushScopeForVisitPass()` which reads off the scope to push from `scopesInOrder`.

This is mostly pretty straightforward except for a few places where the parser has pushed a scope and is in the middle of parsing a declaration only to discover that it's not a declaration after all. This happens in TypeScript when a function is forward-declared without a body, and in JavaScript when it's ambiguous whether a parenthesized expression is an arrow function or not until we reach the `=>` token afterwards. This would be solved by doing three passes instead of two so we finish parsing before starting to set up scopes and declare symbols, but we're trying to do this in just two passes. So instead we call `popAndDiscardScope()` or `popAndFlattenScope()` instead of `popScope()` to modify the scope tree later if our assumptions turn out to be incorrect.

### Constant folding

The constant folding and compile-time definition substitution is pretty minimal but is enough to handle libraries such as React which contain code like this:

```js
if (process.env.NODE_ENV === 'production') {
  module.exports = require('./cjs/react.production.min.js');
} else {
  module.exports = require('./cjs/react.development.js');
}
```

Using `--define:process.env.NODE_ENV="production"` on the command line will cause `process.env.NODE_ENV === 'production'` to become `"production" === 'production'` which will then become `true`. The parser then treats the `else` branch as dead code, which means it ignores calls to `require()` and `import()` inside that branch. The `react.development.js` module is never included in the dependency graph.

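To make the interaction between definition substitution, folding, and dead-branch handling concrete, here is a minimal sketch. It uses a made-up mini-AST (`kind: "dot"`, `"==="`, `"if"`, `"require"`) and hypothetical helpers `foldExpr()` and `collectRequires()`; none of these names come from esbuild, and the real parser handles far more cases.

```js
// Hypothetical mini-AST, not esbuild's real node types.
// `defines` is a Map from dotted paths to replacement string values.
function foldExpr(expr, defines) {
  switch (expr.kind) {
    case "dot":
      // Substitute a compile-time definition if one matches this path.
      return defines.has(expr.path)
        ? { kind: "string", value: defines.get(expr.path) }
        : expr;
    case "===": {
      const left = foldExpr(expr.left, defines);
      const right = foldExpr(expr.right, defines);
      // Only fold when both sides are known string constants.
      if (left.kind === "string" && right.kind === "string") {
        return { kind: "boolean", value: left.value === right.value };
      }
      return { kind: "===", left, right };
    }
    default:
      return expr;
  }
}

// Collect the paths of require() calls that are still reachable.
function collectRequires(stmt, defines, out = []) {
  if (stmt.kind === "require") {
    out.push(stmt.path);
  } else if (stmt.kind === "if") {
    const test = foldExpr(stmt.test, defines);
    const known = test.kind === "boolean";
    // A branch is skipped only when the test folded to the opposite constant.
    if (!known || test.value) collectRequires(stmt.yes, defines, out);
    if (!known || !test.value) collectRequires(stmt.no, defines, out);
  }
  return out;
}

// The React entry point from the example above, written as a mini-AST.
const reactIndex = {
  kind: "if",
  test: {
    kind: "===",
    left: { kind: "dot", path: "process.env.NODE_ENV" },
    right: { kind: "string", value: "production" },
  },
  yes: { kind: "require", path: "./cjs/react.production.min.js" },
  no: { kind: "require", path: "./cjs/react.development.js" },
};

const prodDefines = new Map([["process.env.NODE_ENV", "production"]]);
console.log(collectRequires(reactIndex, prodDefines)); // only the production file
console.log(collectRequires(reactIndex, new Map()));   // both files
```

Without a definition the test stays unknown, so both `require()` targets remain dependencies. With the definition the `else` branch is never visited, which is why the development build never enters the dependency graph.
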
### TypeScript parsing

TypeScript parsing has been implemented by augmenting the existing JavaScript parser. Most of it just involves skipping over type declarations as if they are whitespace. Enums, namespaces, and TypeScript-only class features such as parameter properties must all be converted to JavaScript syntax, which happens in the second parser pass. I've attempted to match what the TypeScript compiler does as closely as is reasonably possible.

One TypeScript subtlety is that unused imports in TypeScript code must be removed, since they may be type-only imports. And if all imports in an import statement are removed, the whole import statement itself must also be removed. This has semantic consequences because the import may have side effects. However, it's important for correctness because this is how the TypeScript compiler itself works. The imported package itself may not actually even exist on disk since it may only come from a `declare` statement. Tracking used imports is handled by the `tsUseCounts` field in the parser.

## Notes about linking

The main goal of linking is to merge multiple modules into a single file so that imports from one module can reference exports from another module. This is accomplished in several different ways depending on the import and export features used.

Linking performs an optimization called "tree shaking". This is also known as "dead code elimination" and removes unreferenced code from the bundle to reduce bundle size. Tree shaking is always active and cannot be disabled.

Finally, linking may also involve dividing the input code among multiple chunks. This is known as "code splitting" and allows both lazy loading of code and sharing of code between multiple entry points. It's disabled by default in esbuild but can be enabled with the `--splitting` flag.

This will all be described in more detail below.

### CommonJS linking

If a module uses any CommonJS features (e.g. references `exports`, references `module`, or uses a top-level `return` statement) then it's considered a CommonJS module. This means it's represented as a separate closure within the bundle. This is similar to how Webpack normally works.

Here's a simplified example to explain what this looks like:

`foo.js`:

```js
exports.fn = () => 123
```

`bar.js`:

```js
const foo = require('./foo')
console.log(foo.fn())
```

`bundle.js`:

```js
let __commonJS = (callback, module) => () => {
  if (!module) {
    module = {exports: {}};
    callback(module.exports, module);
  }
  return module.exports;
};

// foo.js
var require_foo = __commonJS((exports) => {
  exports.fn = () => 123;
});

// bar.js
const foo = require_foo();
console.log(foo.fn());
```

The benefit of bundling modules this way is for compatibility. This emulates exactly [how node itself will run your module](https://nodejs.org/api/modules.html#modules_the_module_wrapper).

### ES6 linking

If a module doesn't use any CommonJS features, then it's considered an ES6 module. This means it's represented as part of a cross-module scope that may contain many other ES6 modules. This is often known as "scope hoisting" and is how Rollup normally works.

Here's a simplified example to explain what this looks like:

`foo.js`:

```js
export const fn = () => 123
```

`bar.js`:

```js
import {fn} from './foo'
console.log(fn())
```

`bundle.js`:

```js
// foo.js
const fn = () => 123;

// bar.js
console.log(fn());
```

The benefit of distinguishing between CommonJS and ES6 modules is that bundling ES6 modules is more efficient, both because the generated code is smaller and because symbols are statically bound instead of dynamically bound, which has less overhead at run time.

ES6 modules also allow for "tree shaking" optimizations which remove unreferenced code from the bundle. For example, if the call to `fn()` is commented out in the above example, the variable `fn` will be omitted from the bundle since it's not used and its definition doesn't have any side effects. This is possible with ES6 modules but not with CommonJS because ES6 imports are bound at compile time while CommonJS imports are bound at run time.

### Hybrid CommonJS and ES6 modules

These two syntaxes are supported side-by-side as transparently as possible. This means you can use both CommonJS syntax (`exports` and `module` assignments and `require()` calls) and ES6 syntax (`import` and `export` statements and `import()` expressions) in the same module. The ES6 imports will be converted to `require()` calls and the ES6 exports will be converted to getters on that module's `exports` object.

### Scope hoisting

Scope hoisting (the merging of all scopes in a module group into a single scope) is implemented using symbol merging. Each imported symbol is merged with the corresponding exported symbol so that they become the same symbol in the output, which means they both get the same name.

Symbol merging is possible because each symbol has a `Link` field that, when used, forwards to another symbol. The implementation of `MergeSymbols()` essentially just links one symbol to the other one. Whenever the printer sees a symbol reference it must call `FollowSymbols()` to get to the symbol at the end of the link chain, which represents the final merged symbol. This is similar to the [union-find data structure](https://en.wikipedia.org/wiki/Disjoint-set_data_structure) if you're familiar with it.

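The link-following idea can be sketched in a few lines. This is an illustration only: the `symbols` array, the `link` field encoded as an index with `-1` meaning "no link", and the lowercase `mergeSymbols()` and `followSymbols()` helpers are hypothetical stand-ins for the Go code, and the path compression step is an assumption borrowed from textbook union-find rather than something the text above states.

```js
// Hypothetical flat symbol table. `link` is the index of the symbol this
// one forwards to, or -1 if it is the end of its chain.
const symbols = [];

function newSymbol(name) {
  symbols.push({ name, link: -1 });
  return symbols.length - 1;
}

// Follow links to the final merged symbol.
function followSymbols(ref) {
  let root = ref;
  while (symbols[root].link !== -1) root = symbols[root].link;
  // Assumed optimization: point every symbol on the chain directly at the root.
  while (symbols[ref].link !== -1) {
    const next = symbols[ref].link;
    symbols[ref].link = root;
    ref = next;
  }
  return root;
}

// Make `oldRef` resolve to whatever `newRef` resolves to.
function mergeSymbols(oldRef, newRef) {
  const from = followSymbols(oldRef);
  const to = followSymbols(newRef);
  if (from !== to) symbols[from].link = to;
}

// foo.js: export const fn = () => 123
const exportedFn = newSymbol("fn");
// bar.js: import {fn as localFn} from './foo'
const importedFn = newSymbol("localFn");
// A third file that re-imports the binding from bar.js.
const reimportedFn = newSymbol("anotherFn");

mergeSymbols(importedFn, exportedFn);
mergeSymbols(reimportedFn, importedFn);

// Every reference now prints with the name of the exported symbol.
console.log(symbols[followSymbols(reimportedFn)].name); // "fn"
```

A printer built on this would call `followSymbols()` on each reference before looking up the name, so all three references above would be printed with the same identifier.
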
During bundling, the symbol maps from all files are merged into a single giant symbol map, which allows symbols to be merged across files. The symbol map is represented as an array-of-arrays and a symbol reference is represented as two indices, one for the outer array and one for the inner array. The array-of-arrays representation is convenient because the parser produces a single symbol array for each file. Merging them all into a single map is as simple as making an array of the symbol arrays for each file. Each source file is identified using an incrementing index allocated during the scanning phase, so the index of the outer array is just the index of the source file.

### Converting ES6 imports to CommonJS imports

One complexity around scope hoisting is that references to ES6 imports may either be a bare identifier (i.e. statically bound) or a property access off of a `require()` call (i.e. dynamically bound) depending on whether the imported module is a CommonJS-style module or not. This information isn't known yet when we're still parsing the file so we are unable to determine whether to create `EIdentifier` or `EDot` AST nodes for these imports.

To handle this, references to ES6 imports use the special `EImportIdentifier` AST node. Later during linking we can decide if these references to a symbol need to be turned into a property access and, if so, fill in the `NamespaceAlias` field on the symbol. The printer checks that field for `EImportIdentifier` expressions and, if present, prints a property access instead of an identifier. This avoids having to do another full-AST traversal just to replace identifiers with property accesses before printing.

### The runtime library

This library contains support code that is needed to implement various aspects of JavaScript transformation and bundling. For example, it contains the `__commonJS()` helper function for wrapping CommonJS modules and the `__decorate()` helper function for implementing TypeScript decorators.
The code lives in a single string in [runtime.go](../internal/runtime/runtime.go). It's automatically included in every build and esbuild's tree shaking feature automatically strips out unused code. If you need to add a helper function for esbuild to call, it should be added to this library.

### Tree shaking

The goal of tree shaking is to remove code that will never be used from the final bundle, which reduces download and parse time. Tree shaking treats the input files as a graph. Each node in the graph is a top-level statement, which is called a "part" in the code. Tree shaking is a graph traversal that starts from the entry point and marks all traversed parts for inclusion.

Each part may declare symbols, reference symbols, and depend on other files. Parts are also marked as either having side effects or not. For example, the statement `let foo = 123` does not have side effects because, if nothing needs `foo`, the statement can be removed without any observable difference. But the statement `let foo = bar()` does have side effects because even if nothing needs `foo`, the call to `bar()` cannot be removed without changing the meaning of the code.

If part A references a symbol declared in part B, the graph has an edge from A to B. References can span across files due to ES6 imports and exports. And if part A depends on file C, the graph has an edge from A to every part in C with side effects. A part depends on a file if it contains an ES6 `import` statement, a CommonJS `require()` call, or an ES6 `import()` expression.

Tree shaking begins by visiting all parts in the entry point file with side effects, and continues traversing along graph edges until no more new parts are reached. Once the traversal has finished, only parts that were reached during the traversal are included in the bundle. All other parts are excluded.

Here's an example to make this easier to visualize:

Diagram of tree shaking

There are three input files: `index.js`, `config.js`, and `net.js`. Tree shaking traverses along all graph edges from `index.js` (the entry point). The two types of edges are shown with different arrows. Solid arrows are edges due to parts with side effects. These parts must be included regardless of whether the symbols they declare are used or not. Dashed arrows are edges from symbol references to the parts that declare those symbols. These parts don't have side effects and are only included if the symbol they declare is referenced.

The final bundle only includes the code visited during the tree shaking traversal. That looks like this:

```js
// net.js
function get(url) {
  return fetch(url).then((r) => r.text());
}

// config.js
let session = Math.random();
let api = "/api?session=";
function load() {
  return get(api + session);
}

// index.js
let el = document.getElementById("el");
load().then((x) => el.textContent = x);
```

### Code splitting

Code splitting analyzes bundles with multiple entry points and divides code into chunks such that a) a given piece of code is only ever in one chunk and b) each entry point doesn't download code that it will never use. Note that the target of each dynamic `import()` expression is considered an additional entry point.

Splitting shared code into separate chunks means that downloading the code for two entry points only downloads the shared code once. It also allows code that's only needed for an asynchronous `import()` dependency to be lazily loaded.

Code splitting is implemented as an advanced form of tree shaking. The tree shaking traversal described above is run once for each entry point. Every part (i.e. node in the graph) stores all of the entry points that reached it during the traversal for that entry point. Then the combination of entry points for a given part determines what chunk that part ends up in.

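The traversal and the per-part entry point sets can be sketched as follows. Everything here is a simplified, hypothetical model: the `parts` records, their field names, and the `markReachable()` and `splitIntoChunks()` helpers are invented for illustration, each file's entry code is collapsed into a single part, and the part list is a hand-written guess at the example files rather than data taken from esbuild.

```js
// Hypothetical part records; esbuild's real data structures differ.
const parts = [
  { id: "net:get", file: "net.js", sideEffects: false, declares: ["get"], uses: [] },
  { id: "net:put", file: "net.js", sideEffects: false, declares: ["put"], uses: [] },
  { id: "config:session", file: "config.js", sideEffects: true, declares: ["session"], uses: [] },
  { id: "config:api", file: "config.js", sideEffects: false, declares: ["api"], uses: [] },
  { id: "config:load", file: "config.js", sideEffects: false, declares: ["load"], uses: ["get", "api", "session"] },
  { id: "config:save", file: "config.js", sideEffects: false, declares: ["save"], uses: ["put", "api", "session"] },
  { id: "index:main", file: "index.js", sideEffects: true, declares: ["el"], uses: ["load"], imports: ["config.js"] },
  { id: "settings:main", file: "settings.js", sideEffects: true, declares: ["it"], uses: ["save"], imports: ["config.js"] },
];

// Map each symbol name to the part that declares it.
const declaredBy = new Map();
for (const part of parts) {
  for (const name of part.declares) declaredBy.set(name, part);
}

// One tree shaking traversal: record `entryIndex` on every reached part.
function markReachable(entryFile, entryIndex, reached) {
  const visitPart = (part) => {
    let entries = reached.get(part.id);
    if (!entries) reached.set(part.id, (entries = new Set()));
    if (entries.has(entryIndex)) return; // already visited for this entry
    entries.add(entryIndex);
    // Edge type 1: symbol reference -> declaring part.
    for (const name of part.uses) visitPart(declaredBy.get(name));
    // Edge type 2: file dependency -> every part with side effects.
    for (const file of part.imports || []) visitFile(file);
  };
  const visitFile = (file) => {
    for (const part of parts) {
      if (part.file === file && part.sideEffects) visitPart(part);
    }
  };
  visitFile(entryFile);
}

// Run the traversal once per entry point, then group parts by entry set.
function splitIntoChunks(entryFiles) {
  const reached = new Map();
  entryFiles.forEach((file, index) => markReachable(file, index, reached));
  const chunks = new Map();
  for (const part of parts) {
    const entries = reached.get(part.id);
    if (!entries) continue; // never reached: removed by tree shaking
    const key = [...entries].sort((a, b) => a - b).map((i) => entryFiles[i]).join("+");
    if (!chunks.has(key)) chunks.set(key, []);
    chunks.get(key).push(part.id);
  }
  return chunks;
}

console.log(splitIntoChunks(["index.js"]));
console.log(splitIntoChunks(["index.js", "settings.js"]));
```

With a single entry point this degenerates to plain tree shaking, where unreached parts simply never appear in the result. With two entry points, parts reached by both traversals land under a combined key, which is the shared chunk.
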
To continue the tree shaking example above, let's add a second entry point called `settings.js` that uses a different but overlapping set of parts. Tree shaking is run again starting from this new entry point:

Diagram of code splitting

These two tree shaking passes result in three chunks: all parts only reachable from `index.js`, all parts only reachable from `settings.js`, and all parts reachable from both `index.js` and `settings.js`. Parts belonging to the three chunks are colored red, blue, and purple in the visualization below:

Diagram of code splitting

After all chunks are identified, the chunks are linked together by automatically generating import and export statements for references to symbols that are declared in another chunk. Import statements must also be inserted for chunks that don't have any exported symbols. This represents shared code with side effects, and code with side effects must be retained. Here are the final code splitting chunks for this example after linking:

Chunk for `index.js`:

```js
import { api, session } from "./chunk.js";

// net.js
function get(url) {
  return fetch(url).then((r) => r.text());
}

// config.js
function load() {
  return get(api + session);
}

// index.js
let el = document.getElementById("el");
load().then((x) => el.textContent = x);
```

Chunk for `settings.js`:

```js
import { api, session } from "./chunk.js";

// net.js
function put(url, body) {
  fetch(url, {method: "PUT", body});
}

// config.js
function save(value) {
  return put(api + session, value);
}

// settings.js
let it = document.getElementById("it");
it.oninput = () => save(it.value);
```

Chunk for shared code:

```js
// config.js
let session = Math.random();
let api = "/api?session=";

export { api, session };
```

There is one additional complexity to code splitting due to how ES6 module boundaries work. Code splitting must not be allowed to move an assignment to a module-local variable into a separate chunk from the declaration of that variable. ES6 imports are read-only and cannot be assigned to, so doing this will cause the assignment to crash at run time. To illustrate the problem, consider these three files:

`entry1.js`:

```js
import {data} from './data'
console.log(data)
```

`entry2.js`:

```js
import {setData} from './data'
setData(123)
```

`data.js`:

```js
export let data
export function setData(value) {
  data = value
}
```

If the two entry points `entry1.js` and `entry2.js` are bundled with the code splitting algorithm described above, the result will be this invalid code:

Chunk for `entry1.js`:

```js
import { data } from "./chunk.js";

// entry1.js
console.log(data);
```

Chunk for `entry2.js`:

```js
import { data } from "./chunk.js";

// data.js
function setData(value) {
  data = value;
}

// entry2.js
setData(123);
```

Chunk for shared code:

```js
// data.js
let data;

export { data };
```

The assignment `data = value` will crash at run time with `TypeError: Assignment to constant variable`. To fix this, we must make sure that assignment ends up in the same chunk as the declaration `let data`.

This is done by linking the parts with the assignments and the parts with the symbol declarations together such that their entry point sets are the same. That way all of those parts are marked as reachable from all entry points that can reach any of those parts. Note that linking a group of parts together also involves marking all dependencies of all parts in that group as reachable from those entry points too, including propagating through any part groups those dependencies are already linked with, recursively.

The grouping of parts can be non-trivial because there may be many parts involved and many assignments to different variables. Grouping is done by finding connected components on the graph where nodes are parts and edges are cross-part assignments.

With this algorithm, the function `setData` in our example moves into the chunk of shared code after being bundled with code splitting:

Chunk for `entry1.js`:

```js
import { data } from "./chunk.js";

// entry1.js
console.log(data);
```

Chunk for `entry2.js`:

```js
import { setData } from "./chunk.js";

// entry2.js
setData(123);
```

Chunk for shared code:

```js
// data.js
let data;
function setData(value) {
  data = value;
}

export { data, setData };
```

This code no longer contains assignments to cross-chunk variables.

## Notes about printing

The printer converts JavaScript ASTs back into JavaScript source code. This is mainly intended to be consumed by the JavaScript VM for execution, with a secondary goal of being readable enough to debug when minification is disabled. It's not intended to be used as a code formatting tool and does not make complex formatting decisions. It handles the insertion of parentheses to preserve operator precedence as appropriate.

Each file is printed independently from other files, so files can be printed in parallel. This extends to source map generation. As each file is printed, the printer builds up a "source map chunk" which is a [VLQ](https://en.wikipedia.org/wiki/Variable-length_quantity)-encoded sequence of source map offsets assuming the output file starts with the AST currently being printed. That source map chunk will later be "rebased" to start at the correct offset when all source map chunks are joined together. This is done by rewriting the first item in the sequence, which happens in `AppendSourceMapChunk()`.

The current AST representation uses a single integer offset per AST node to store the location information. This is the index of the starting byte for that syntax construct in the original source file. Using this representation means that it's not possible to merge ASTs from two separate files and still have source maps work. That's not a problem since AST printing is fully parallelized in esbuild, but is something to keep in mind when modifying the code.

### Symbol minification

An important part of JavaScript minification is symbol renaming. Internal symbols (i.e. those not externally visible) can be renamed to shorter names without changing the meaning of the code. That looks something like this:

Original code:

```js
function useReducer(reducer, initialState) {
  let [state, setState] = useState(initialState);
  function dispatch(action) {
    let nextState = reducer(state, action);
    setState(nextState);
  }
  return [state, dispatch];
}
```

Code with symbol minification:

```js
function useReducer(b, c) {
  let [a, d] = useState(c);
  function e(f) {
    let g = b(a, f);
    d(g);
  }
  return [a, e];
}
```

It may initially seem like we can easily rename all symbols to a single character by assigning a unicode character to each one. There are over 100,000 unicode characters that are valid JavaScript identifiers after all. However, the goal is actually to use as few bytes as possible, and most unicode characters use multiple bytes when encoded as UTF-8. For this reason most JavaScript minifiers (including esbuild) restrict generated symbols to ASCII-only.

With ASCII, JavaScript has only 54 possible one-character identifiers and 3453 possible two-character identifiers, so we should use the shortest names to rename the most frequently used symbols so that we can save the most bytes.

Another consideration is that symbols in sibling scopes can share the same name without conflicting due to scoping rules. We can use this to our advantage by deliberately merging unrelated symbols in independent scopes, and then adding the frequency statistics of all of these symbols together before generating the final names.

Something to keep in mind when renaming symbols is that the resulting JavaScript will likely be compressed before being downloaded, usually with gzip compression. Repeated sequences of characters will compress better than unique sequences of characters. A trick esbuild borrows from [Google Closure Compiler](https://github.com/google/closure-compiler) is to merge the symbols for arguments of sibling functions together:

```js
// Before renaming
function readFile(path, encoding, callback) { ... }
function writeFile(path, contents, mode, callback) { ... }
```

```js
// After renaming
function x(a, b, c) { ... }
function y(a, b, c, d) { ... }
```

Because the symbols for the function arguments have been merged together in order, the character sequence `a, b, c` is present in both functions which should help with gzip compression. This is handled in esbuild by assigning each symbol in a nested (i.e. not top-level) scope a "slot", which is an incrementing index into an array of frequency counters. The symbols `a`, `b`, `c`, and `d` would be assigned slots `0`, `1`, `2`, and `3` in the example above. This approach is extra helpful for esbuild because slot assignments can be computed in parallel.

The algorithm:

1. For each input source file (in parallel)

    1. Track information for each top-level statement while parsing

        Each top-level statement tracks which symbols are used, how many times each symbol is used, which symbols it defines, and which nested scopes that top-level statement contains. Code splitting operates on top-level statement boundaries, so this information will tell us exactly what symbols to include in the frequency analysis after code splitting is complete. This information is collected during parsing.

    2. Assign slots to symbols in nested scopes

        Traverse the scope tree and assign slots to all symbols declared within each scope (skipping top-level scopes). Each scope in the scope tree starts assigning slots immediately after the maximum slot of the parent scope.

2. For each output chunk file (in parallel)

    1. Create an array of frequency counters

        Each counter is indexed by slot and all counts are initially 0. The array length starts off being large enough to handle the maximum slot value of all files included in the chunk.

    2. Assign slots for top-level symbols

        For each top-level statement included in the chunk, iterate over all top-level symbols declared in that statement and assign each symbol to a new slot by appending to the counter array. This slot will not overlap with any slots from nested scopes. The count starts off at 1 to represent the declaration.

    3. Accumulate symbol usage counts into slots

        For each top-level statement included in the chunk, iterate over all symbol use records and increment the counter in that symbol's slot by the stored count from the symbol use record.

    4. Sort slots by decreasing count

        This should tell us which symbol slots should be assigned the shortest names to minimize file size.

    5. Assign names to slots in order

        Names are assigned from a fixed sequence of identifiers that starts off using one-character names, then two-character names, and so on (i.e. `a b c ... aa ba ca ... aaa baa caa ...`). We must be careful to avoid any JavaScript keywords such as `do` and `if` that would cause syntax errors. We must also avoid the names of any "unbound" symbols, which are symbols that are used without being declared such as `$` for jQuery.

    6. When printing, use the name in that symbol's slot

        The printer only needs access to the array containing the names (indexed by slot) and a way of mapping a symbol reference to a slot index. This is a compact representation that doesn't need a lot of memory per chunk because the slot array length is O(unique symbol names) instead of O(number of symbols). The mapping of symbols from nested scopes (which is usually the majority of symbols) to their slot index is static and can be shared between all chunks.

It's worth mentioning that this algorithm must be performed three separate times because there are three separate symbol namespaces in JavaScript that need minification: normal symbols, [label symbols](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/label), and [private symbols](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Classes/Private_class_fields). The same name in separate namespaces can't conflict with each other, so it's ok to reuse the same name across namespaces.
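
The sort and naming steps for a single chunk can be sketched as below. The helper names `numberToName()` and `assignNames()`, the exact character order in `HEAD` and `TAIL`, the tie-break on slot index, and the hand-counted frequencies are all assumptions made for this illustration and are not taken from esbuild's source.

```js
// Assumed alphabets: 54 characters may start an identifier, 64 may continue one.
const HEAD = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_$";
const TAIL = HEAD + "0123456789";

// Produce the i-th name in the sequence a, b, c, ..., aa, ba, ca, ...
function numberToName(i) {
  let name = HEAD[i % HEAD.length];
  i = Math.floor(i / HEAD.length);
  while (i > 0) {
    i--;
    name += TAIL[i % TAIL.length];
    i = Math.floor(i / TAIL.length);
  }
  return name;
}

// Give the shortest names to the slots with the highest counts, skipping
// keywords and the names of unbound symbols.
function assignNames(slotCounts, reserved) {
  const order = slotCounts
    .map((count, slot) => ({ slot, count }))
    // Decreasing count; ties broken by slot index (an assumption).
    .sort((a, b) => b.count - a.count || a.slot - b.slot);
  const names = new Array(slotCounts.length);
  let next = 0;
  for (const { slot } of order) {
    let name;
    do {
      name = numberToName(next++);
    } while (reserved.has(name));
    names[slot] = name;
  }
  return names;
}

// Hand-counted uses (declaration included) for the useReducer example,
// with slots assumed to be assigned in declaration order:
// reducer, initialState, state, setState, dispatch, action, nextState
const counts = [2, 2, 3, 2, 2, 2, 2];
const reserved = new Set(["do", "if", "in"]);

console.log(assignNames(counts, reserved)); // [ 'b', 'c', 'a', 'd', 'e', 'f', 'g' ]
```

Under these assumptions `state` is the most frequently used symbol and gets `a`, which lines up with the minified `useReducer` example earlier in this section. The alphabets also account for the figures quoted above: 54 one-character names, and 54 × 64 = 3456 two-character names minus the keywords `do`, `if`, and `in`, leaving 3453.
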